Extracting Tweet Information with Tweepy and BeautifulSoup


I love Twitter. Not from a, “I spend all my free time scrolling” type of viewpoint. But more from a, “Wow, Twitter is a data powerhouse” type of viewpoint. It’s like having access to tabloid magazines, the inside of your favorite academics’ brain, or the inside scoop on the (occasionally way too) personal thoughts of both your peers and complete strangers.

And that’s what makes it so, so wonderful.

The best part about Twitter is that we can easily manipulate this information in my favorite coding language, Python, by using Twitter’s API and a simple package called Tweepy.

I’ve had the opportunity to play around a ton with Tweepy, and realized you can do some pretty sweet things with it including sentiment analysis, which I will touch on in the future. For the purpose of this article, though, I’m going to go over some Tweepy basics, how to extract Tweet information, and also how to scrape preview links from tweets for their images using BeautifulSoup and a couple of functions.

First Things First

Before you can use Twitter’s API, you first need to apply for a Twitter Developer account. It’s relatively straightforward, and in the application, you will just need to tell Twitter your purposes for wanting to use their API. The more detailed you can be on your application, the easier it’ll go through. I got verified less than 24 hours after applying, but it varies for everyone.

To begin your application, click here.

Tweepy Overview

There are a couple of different libraries you can use to access the Twitter API, but my personal favorite is Tweepy. Tweepy makes it super easy for you to scan Twitter for data based on a specific hashtag, an individual user, what’s trending in a specific location… plus a ton more. There’s a really good article on all the different things Tweepy is capable of that you can read more on here.

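To use this library, install it from your terminal. The scraping code later in this article also needs pandas, requests, tqdm, BeautifulSoup, and requests-oauthlib. The separate TwitterAPI package is never imported in this article, so it is optional:

pip install tweepy
pip install pandas numpy requests tqdm beautifulsoup4 requests-oauthlib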

Let’s Get Coding

Alright, now for the fun part. Just for some insight, with this specific code, I will be using Tweepy to get Tweet information from a specific third-party news source (ZDNet), and then use the link previews from the Tweets to scrape each article for main images using BeautifulSoup, and then save them to a folder on my computer.

Phew, that was a mouthful.

To get started, I always import all packages I’m going to need first. For this project, these include:

import os
import pandas as pd
import numpy as np
import tweepy
from tweepy import OAuthHandler
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse
from requests_oauthlib import OAuth1

After you’re all set there, we then need to authorize our API. Note that for each token, you will need to insert your own credentials that will be given to you once your API application has been accepted.

#Handle Twitter API authentication
consumer_key = 'inserthere'
consumer_secret = 'inserthere'
access_token = 'inserthere'
access_secret = 'inserthere'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)
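
If you would rather not paste your keys straight into a script or notebook, one option is to read them from environment variables. This is only a minimal sketch and not part of the original walkthrough: the variable names (TWITTER_CONSUMER_KEY and so on) and the load_credentials helper are made up for illustration.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

With something like this in place, you could call creds = load_credentials() and pass creds["TWITTER_CONSUMER_KEY"] and friends to OAuthHandler and set_access_token instead of the 'inserthere' strings.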

Now, we’re going to run our first API query (not sure query is the right term, but we’re rolling with it). As I said previously, this code is focusing on a specific Twitter account, ZDNet.

#pull tweets from specific user. In userID, just put the page's username
userID = "ZDNet"
tweets = api.user_timeline(screen_name=userID,
                           count=200,
                           include_rts = False)
print("Number of Tweets Extracted: {}.\n".format(len(tweets)))

Let’s break this down a bit:

userID: This is just a variable holding the name of the account (username) I want to scrape. Make sure you make this variable a string!

tweets: api.user_timeline is a Tweepy function that we use to scrape the defined account. The parameters that we set inside are screen_name (our defined userID), the number of Tweets we want to scrape (the limit is 200 at a time), and if we want to include retweets or not. We then set all of this to the variable tweets, as that is where all scraped tweets will be stored.

Lastly, we just print out the number of tweets to make sure it worked! If you want to double-check, you can run this line of code as well to print out the first few tweets:

#simple overview to make sure that it pulls correctly...
for tweet in tweets[:5]:
    print(tweet.text)
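
Since a single user_timeline call tops out at 200 Tweets, getting more means asking for several pages. Here is a rough sketch of one way to do that with the max_id parameter that user_timeline accepts, walking backwards from the newest Tweet. The helper names (timeline_page, fetch_timeline) and the FakeAPI stand-in are my own inventions so the example runs without credentials; swap in your real api object to use it.

```python
from types import SimpleNamespace


def timeline_page(api, screen_name, max_id=None):
    """Fetch one page (up to 200 Tweets) of a user's timeline."""
    kwargs = {"screen_name": screen_name, "count": 200, "include_rts": False}
    if max_id is not None:
        kwargs["max_id"] = max_id  # only Tweets with an ID <= max_id
    return api.user_timeline(**kwargs)


def fetch_timeline(api, screen_name, max_pages=5):
    """Walk backwards through a timeline, one page at a time."""
    collected = []
    max_id = None
    for _ in range(max_pages):
        page = timeline_page(api, screen_name, max_id)
        if not page:
            break
        collected.extend(page)
        # Next page: everything strictly older than the oldest Tweet so far.
        max_id = min(tweet.id for tweet in page) - 1
    return collected


class FakeAPI:
    """Stand-in for tweepy.API so the sketch runs without credentials."""

    def __init__(self, ids):
        self.ids = sorted(ids, reverse=True)

    def user_timeline(self, screen_name, count, include_rts, max_id=None):
        ids = [i for i in self.ids if max_id is None or i <= max_id]
        return [SimpleNamespace(id=i) for i in ids[:count]]


demo = fetch_timeline(FakeAPI(range(1, 451)), "ZDNet")
print(len(demo))
```

With a real connection you would call fetch_timeline(api, userID). Tweepy also ships a Cursor helper that handles paging for you, and Twitter only serves roughly the most recent 3,200 Tweets of a timeline, so there is a ceiling either way.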

Better Visualization

Now that we have all of our Tweets housed in one place, let’s see what we’ve got going on by putting it into a dataframe.

With Tweepy, you can call in which information you’re interested in looking at. For the purpose of this project, I wanted to see how many characters are in each Tweet (len(tweet.text)), the content of the Tweets (tweet.text), when each one was tweeted (tweet.created_at), and also the username just to make sure no retweets made their way in (tweet.user.screen_name).

If you’re interested in seeing retweet and like counts you can call: tweet.retweet_count and tweet.favorite_count.

#create dataframe to house data.
ZDNET1 = pd.DataFrame(
    data=[[len(tweet.text), tweet.text, tweet.created_at, tweet.user.screen_name] for tweet in tweets],
    columns = ['Tweet_Length', 'Tweet_Text', 'Tweet_Date', 'UserName'])
ZDNET1.head(10)
[Image: dataframe output]

Because the dataframe display cuts off the Tweet_Text column before the end of each Tweet, what you don’t see here is that each Tweet contains a link, and that link shows up as a link preview on Twitter.

A link preview in Twitter is a bit different than solely attaching an image to a tweet, and then adding a clickable link. A link preview is a link that presents itself as an article within the Tweet. They look like this:

[Image: example of a link preview in a Tweet]

Breaking it Down

Because I’m interested in scraping each link for the main image in the article, I need to separate the content within Tweet_Text and the corresponding link into separate columns.

We can do this with the following code:

#Pull out tweet text and corresponding link to article into two columns
ZDNet = ZDNET1
ZDNet = ZDNET1['Tweet_Text'].str.split("http", n=1, expand=True)
ZDNet[1] = 'http' + ZDNet[1].astype(str)
ZDNet.columns = ["Full_Tweets", "Tweet_Links"]
ZDNet["Tweet_Links"]=ZDNet['Tweet_Links'].str.split().str[0]
ZDNet.head(10)

I split the content within the Tweet_Text column using .str.split on the delimiter ‘http’. However, it’s important to note that doing so removes the delimiter, so I then re-added ‘http’ to the front of each link in the column.

After this, I noticed that some links had content after them, so I had to remove any additional content using the last line of code (before ZDNet.head(10)).

[Image: output]
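
To see what that split is doing on a single Tweet, here is a small standalone sketch that uses Python’s re module instead of pandas. The sample text is made up and split_tweet is a hypothetical helper name. It is not a drop-in replacement for the dataframe code: it only matches http:// or https:// links, it trims whitespace, and it returns None when a Tweet has no link, which is a case worth checking for in the dataframe version too.

```python
import re

# First http(s) link in the text: everything up to the next whitespace.
LINK_PATTERN = re.compile(r"https?://\S+")


def split_tweet(text):
    """Return (text_before_link, first_link); the link is None if absent."""
    match = LINK_PATTERN.search(text)
    if match is None:
        return text.strip(), None
    return text[:match.start()].strip(), match.group(0)


sample = "Made-up headline about laptops https://t.co/8wkQfYzstU via @someone"
print(split_tweet(sample))
print(split_tweet("No link in this one"))
```

The first call gives back the headline and the t.co link as separate pieces, which is the same idea as the Full_Tweets and Tweet_Links columns.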

Now that it’s separated, we can go ahead and add all the links to a list, and set this portion of the project aside:

links = ZDNet["Tweet_Links"].tolist()
print(links)

Scraping Articles From Given Links

If you’ve never used BeautifulSoup before, it’s a powerful Python library for getting data out of HTML, XML, and other markup languages.

It can be a little tough if you’re not super familiar with HTML, but a little practice goes a long way.

For this project, I want to use Beautiful Soup to scrape each of the links, and return to me the main image from each article and put it into a folder on my computer. This involves a few different functions, so we’ll walk through them step-by-step.

Step 1: Make sure each link is valid

A “valid” link is composed of a scheme and a netloc. The scheme is the http or https bit, and the netloc is the domain, in the form content.content. For example, twitter.com or zillow.com.

We want to make sure each link in our list is valid so that when we get to the main function, nothing breaks. This first function is pretty straightforward:

def is_valid(url):
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

If you want to review if a specific link is valid, you can also call this line of code:

import urllib.parse
url="https://t.co/8wkQfYzstU" #insert your link here
o = urllib.parse.urlsplit(url)
print(o.scheme, o.netloc)

If it doesn’t return an error message but does return a scheme and netloc, you’re good to go.
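
To check a whole list in one go, you could run every link through is_valid and keep only the ones that pass. A quick sketch, assuming a made-up sample list (the second and third entries are deliberately broken to show what gets filtered out):

```python
from urllib.parse import urlparse


def is_valid(url):
    """True when the URL has both a scheme (https) and a netloc (t.co)."""
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)


# Made-up sample list: one good link, one broken string, one missing scheme.
sample_links = ["https://t.co/8wkQfYzstU", "httpNone", "t.co/8wkQfYzstU"]
valid_links = [link for link in sample_links if is_valid(link)]
print(valid_links)
```

Running the same list comprehension over your own links list before scraping is one way to keep malformed entries out of the later steps.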

Step 2: Getting the images.

Alright, this one is going to be a bit complex, so bear with me here. I’m going to share the entirety of the function, and then walk through it:

def get_all_images(url):
    consumer_key = 'inserthere'
    consumer_secret = 'inserthere'
    access_token = 'inserthere'
    access_secret = 'inserthere'
    #pass the variables themselves, not their names as strings
    auth = OAuth1(consumer_key, consumer_secret,
                  access_token, access_secret)
    soup = bs(requests.get(url, auth = auth).content, "html.parser")
    urls = []
    for img in tqdm(soup.find_all("img"), "Extracting images"):
        #default to "" so an img tag without src cannot break the next check
        img_url = img.attrs.get("src", "")
        if "/thumbnail" in img_url:
            continue
        if not img_url:
            continue
        img_url = urljoin(url, img_url)
        try:
            pos = img_url.index("?")
            img_url = img_url[:pos]
        except ValueError:
            pass
        if is_valid(img_url):
            urls.append(img_url)
    return urls

Don’t be intimidated! This really isn’t THAT awful. The first bit:

def get_all_images(url):
    consumer_key = 'inserthere'
    consumer_secret = 'inserthere'
    access_token = 'inserthere'
    access_secret = 'inserthere'
    #pass the variables themselves, not their names as strings
    auth = OAuth1(consumer_key, consumer_secret,
                  access_token, access_secret)
    soup = bs(requests.get(url, auth = auth).content, "html.parser")
    urls = []

is just setting the stage. We need to redefine our Twitter API authorization so that when it cycles through our list of links, it doesn’t return an error.

We then call BeautifulSoup and ask it to take each url, give it authorization, and then parse it.

Lastly, we just create an empty url list.

The second bit:

    for img in tqdm(soup.find_all("img"), "Extracting images"):
        #default to "" so an img tag without src cannot break the next check
        img_url = img.attrs.get("src", "")
        if "/thumbnail" in img_url:
            continue
        if not img_url:
            #if img does not contain src attribute, just skip
            continue
        #make the URL absolute by joining domain with the URL that is just extracted
        img_url = urljoin(url, img_url)
        try:
            #cut off the query string, if there is one
            pos = img_url.index("?")
            img_url = img_url[:pos]
        except ValueError:
            pass
        #finally, if the url is valid
        if is_valid(img_url):
            urls.append(img_url)
    return urls

is just creating a for loop to go through each individual parsed HTML bit and find anywhere there is an img tag.

It then tells Python to get the src attribute for the img tag it’s looking at, and assign it to the img_url variable.

The src attribute is an image link. For example, if you use Google Images and search cats, right-click on an image, and select “copy image address”, a link will pop up. This is the src.

For each img_url, it is then saying, hey, if the src contains “/thumbnail”, I don't want it. So if there IS that bit in the src, the function will trash that image, and loop back up to the next one. If it doesn’t contain it, it will double-check and be like “hey, you do have an src attribute, right” and if it does, it’ll move along, if not, it’ll get trashed and loop back to the next.

So once we’ve made sure we’ve got a true src attribute that is not a thumbnail image, it moves on to the next step, which combines the src with the page’s url to make it a real link. Basically, it’ll just, for example, take the domain zillow.com + src and make it whole.

The scary try/except portion just trims any query string off the link. If the link contains a “?”, everything from that point on gets cut. If it doesn’t, index() raises a ValueError, which we catch and ignore.

Finally, we chuck our first function, is_valid() into there and run the img_url, and if it passes, we add it to that empty list of urls we created at the top.
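
If the joining and trimming steps feel abstract, here is a tiny standalone sketch of just that part of the loop. The clean_image_url name and the example.com addresses are made up for illustration, and it uses str.partition as a shorter stand-in for the try/except around index().

```python
from urllib.parse import urljoin


def clean_image_url(page_url, src):
    """Make src absolute relative to page_url, then drop any query string."""
    absolute = urljoin(page_url, src)
    # partition returns the text before the first "?" (or the whole string)
    return absolute.partition("?")[0]


page = "https://www.example.com/article/some-story/"
print(clean_image_url(page, "/img/hero.jpg?width=770&fit=crop"))
print(clean_image_url(page, "https://cdn.example.com/a/b.png"))
```

A src that starts with a slash gets attached to the page’s domain, while a src that is already a full link passes through untouched.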

See? Not that bad.

Step 3: Make the images downloadable

We’ve got our images, but now we need to get them onto our computer. We have one last hairy function to get to our end goal:

def download(url, pathname):
    #create the target folder if it does not exist yet
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    #stream the response so the body is fetched chunk by chunk
    response = requests.get(url, stream=True)
    file_size = int(response.headers.get("Content-Length", 0))
    filename = os.path.join(pathname, url.split("/")[-1])
    progress = tqdm(response.iter_content(1024), f"Downloading {filename}", total=file_size, unit="B", unit_scale=True, unit_divisor=1024)
    with open(filename, "wb") as f:
        #loop over the raw chunks so each byte is counted once, by update()
        for data in progress.iterable:
            f.write(data)
            progress.update(len(data))

This may not be as long as the last one, but it can be a little confusing to interpret. Here we are simply saying, “if the path doesn’t exist right now, make that path.”

We then tell it to download the body by chunk, not in one big swoop. Once it does that, we get the file size and build the filename.

Then we use tqdm, a Python progress bar, to show us how we’re doing on downloading. The default unit is iterations, but here we change it to bytes.

Lastly, we are just opening our file, writing the data to the file, and then updating the progress bar manually.

Boom. Done.

When you run it, you’ll see where the progress bar comes into play. It’s actually pretty cool if I do say so myself.
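
If you want to poke at the chunk-by-chunk idea without hitting the network, here is a small sketch that copies an in-memory stand-in for a response body and counts the bytes as it goes. The helper names (filename_from_url, copy_in_chunks) and the example.com address are made up for illustration; the real download function above uses requests and tqdm instead.

```python
import io
import os


def filename_from_url(url, pathname):
    """Build the local path from the last segment of the URL."""
    return os.path.join(pathname, url.split("/")[-1])


def copy_in_chunks(source, target, chunk_size=1024):
    """Copy source to target chunk by chunk; return the bytes written."""
    written = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        target.write(chunk)
        written += len(chunk)  # the kind of number a progress bar is fed
    return written


fake_body = io.BytesIO(b"x" * 2500)  # stand-in for an image response
out = io.BytesIO()                   # stand-in for the opened file
print(copy_in_chunks(fake_body, out))
print(filename_from_url("https://www.example.com/img/hero.jpg", "Twitter-Images"))
```

The 2,500 bytes arrive as two full 1,024-byte chunks plus a smaller leftover, which is why the progress bar is updated with len(data) rather than a fixed amount.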

The Home Stretch

If you’ve made it this far, you’re a champ. Also congrats, you’re about to create and run your final function. This one shouldn’t require an explanation, as it’s just all of your defined functions put together.

def final(links, path):
    for i in links:
        imgs = get_all_images(i)
        for img in imgs:
            download(img, path)

And when you call and run it:

final(links, "Twitter-Images")

you’ll be able to see that progress bar I was talking about! Cool stuff.

You can then navigate to the folder you called in the final function, and check to see if your images downloaded:

[Image: the downloaded images in the folder]
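
One thing to keep in mind: if a single link times out or points somewhere odd, final will stop at that link. Below is a hedged sketch of a more forgiving variant that records failures and keeps going. The final_safe name is my own, and the two scraping functions are passed in as arguments so the example can run here with made-up stand-ins instead of real requests.

```python
def final_safe(links, path, get_images, download):
    """Like final(), but record links that fail instead of stopping."""
    failed = []
    for link in links:
        try:
            for img in get_images(link):
                download(img, path)
        except Exception as exc:  # broad on purpose: this is a batch job
            failed.append((link, repr(exc)))
    return failed


# Made-up stand-ins so the sketch runs without touching the network.
def fake_get_images(link):
    if "bad" in link:
        raise ValueError("could not fetch " + link)
    return [link + "/a.jpg"]


saved = []
failed = final_safe(
    ["https://good.example", "https://bad.example"],
    "Twitter-Images",
    fake_get_images,
    lambda img, path: saved.append((img, path)),
)
print(saved)
print(failed)
```

With the real functions, the call would look like final_safe(links, "Twitter-Images", get_all_images, download), and the returned list tells you which links to look at by hand.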

Annnnd voilà! There you have it. You can officially call yourself a Tweepy and BeautifulSoup novice.

As always, I highly encourage you to play around with Tweepy and Beautiful Soup’s capabilities, as the best way to learn is to do. :)
