Twitter Data Exploration using R


This is a simple program for extracting data from Twitter using R. Once the data is extracted, many kinds of analysis can be performed. Here, I’ll quickly demonstrate getting the data in place with a few powerful R libraries and then doing some basic exploration.



Register & Create a Twitter Application

To get access to Twitter’s data, we first need to create an application at https://apps.twitter.com/

Once you have registered and created the application, note the values of the API key, API secret, access token and access token secret.

Insert these values in the R environment as follows:
# Paste each value between the quotes
api_key <- ""
api_secret <- ""
access_token <- ""
access_token_secret <- ""
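Hard-coding credentials in a script makes them easy to leak when the script is shared. One common alternative, sketched below, is to read them from environment variables with base R's Sys.getenv. The variable name TWITTER_API_KEY and the helper read_credential are assumptions made up for this illustration; neither Twitter nor twitteR requires them.

```r
# Hypothetical helper: read one credential from an environment variable and
# stop early with a clear message if it is missing.
read_credential <- function(var_name) {
  value <- Sys.getenv(var_name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", var_name, " is not set", call. = FALSE)
  }
  value
}

# A dummy value is set here only so the sketch runs on its own; in practice
# the variable would be defined outside the script, e.g. in ~/.Renviron.
Sys.setenv(TWITTER_API_KEY = "dummy-key-for-illustration")
api_key <- read_credential("TWITTER_API_KEY")
```

The same helper could be called once for each of the four values.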



Initialize the environment

We will be using two main libraries: twitteR and httr.

twitteR will help us extract and manipulate tweets from Twitter, whereas httr handles the OAuth configuration. OAuth is explained briefly below.

In addition, knitr is used for creating decent-quality tables suited to HTML.


library(twitteR)
library(httr)
library(knitr)


Setup Twitter OAuth

We now authenticate with the setup_twitter_oauth function from the twitteR package, which relies on httr to handle the OAuth exchange with the Twitter API. OAuth is an authorization protocol used by Twitter so that external applications can access its data. More information on OAuth can be found at https://oauth.net/

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
## [1] "Using direct authentication"


Collect Tweets

We are now ready to fetch some Tweets. We will collect the last 200 Tweets with the hashtag of ‘#DataScience’.

searchTwitter is a function from the twitteR library that makes it extremely simple to fetch data. It accepts many other parameters, including dates (since, until), location (geocode), language (lang) and so on.

tweets <- searchTwitter('#DataScience', n = 200, lang = "en")
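As a rough sketch of how the extra search parameters might be combined, the wrapper below passes a date range and a geocode filter through to searchTwitter. The wrapper name search_hashtag, the dates and the coordinates are all made up for illustration. Running the commented-out call would need the twitteR package, network access and a prior setup_twitter_oauth() call.

```r
# Sketch only: forwards a few optional filters to twitteR::searchTwitter.
# search_hashtag is a hypothetical helper, not part of twitteR.
search_hashtag <- function(hashtag, n = 200, since = NULL, until = NULL,
                           geocode = NULL, lang = "en") {
  twitteR::searchTwitter(hashtag, n = n, lang = lang,
                         since = since, until = until, geocode = geocode)
}

# Example call (not run here; it needs credentials and network access).
# since/until take "YYYY-MM-DD" strings; geocode is "latitude,longitude,radius".
# tweets_recent <- search_hashtag("#DataScience", n = 50,
#                                 since = "2016-09-01", until = "2016-09-07",
#                                 geocode = "19.0760,72.8777,50km")
```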


Organize Data

The fetched tweets come back as a list. We now need to convert this list into a data frame for analysis.

df <- do.call("rbind", lapply(tweets, as.data.frame))

do.call is a base R function that constructs and executes a function call from a function name and a list of arguments. In our case the function is rbind, and the arguments are the one-row data frames created from each tweet in the list.
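To see the do.call pattern without touching the API, here is a minimal sketch on a made-up list. The toy_tweets object is only a stand-in: real twitteR results are status objects, not plain lists, so this illustrates the idiom rather than the package internals.

```r
# Toy stand-in for the list returned by searchTwitter(): each element is a
# small named list rather than a real status object.
toy_tweets <- list(
  list(text = "first tweet",  favoriteCount = 2, screenName = "user_a"),
  list(text = "second tweet", favoriteCount = 5, screenName = "user_b"),
  list(text = "third tweet",  favoriteCount = 0, screenName = "user_a")
)

# lapply() turns each element into a one-row data frame ...
rows <- lapply(toy_tweets, function(x) as.data.frame(x, stringsAsFactors = FALSE))

# ... and do.call() passes all of them to a single rbind() call,
# equivalent to rbind(rows[[1]], rows[[2]], rows[[3]]).
toy_df <- do.call("rbind", rows)

dim(toy_df)
```

The result has one row per list element and one column per field.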

# Print the table in an HTML-suitable format
# kable(head(df, 2), format = "markdown")

The data frame headings are a bit off when I output to HTML, but the table should look clean in RStudio.



Basic Exploration

The dimensions of the data frame are:

dim(df)
## [1] 200  16

The most favourited tweet is:

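|   |text                                                                                                                      | favoriteCount|screenName   |
|:--|:-------------------------------------------------------------------------------------------------------------------------|-------------:|:------------|
|45 |Here’s how a decision tree splits the data https://t.co/fDQfPPfAi1 #MachineLearning #ArtificialIntelligence #DataScience |             8|fasih_khatib |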

Although the output is not very pretty, we can see that this tweet has a favorite count of 8.

The tweet says: Here’s how a decision tree splits the data https://t.co/fDQfPPfAi1 #MachineLearning #ArtificialIntelligence #DataScience

Let’s fetch some info about this particular user, fasih_khatib:


  
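# Row with the highest favorite count; keep text, favoriteCount and screenName
dset2 <- df[which.max(df$favoriteCount), c("text", "favoriteCount", "screenName")]

# Look up the profile of the user who posted that tweet
userInfo <- lookupUsers(dset2$screenName)

# Convert the user info to a data frame, transposed so that each field is a row
userDF <- t(twListToDF(userInfo))
kable(userDF, format = "markdown")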
|                  |fasih_khatib                                                                 |
|:-----------------|:----------------------------------------------------------------------------|
|description       |Java. Groovy. JVM.                                                           |
|statusesCount     |9                                                                            |
|followersCount    |9                                                                            |
|favoritesCount    |25                                                                           |
|friendsCount      |61                                                                           |
|url               |NA                                                                           |
|name              |Fasih Khatib                                                                 |
|created           |2016-08-26 07:13:08                                                          |
|protected         |FALSE                                                                        |
|verified          |FALSE                                                                        |
|screenName        |fasih_khatib                                                                 |
|location          |Mumbai, India                                                                |
|lang              |en                                                                           |
|id                |769070157268418560                                                           |
|listedCount       |1                                                                            |
|followRequestSent |FALSE                                                                        |
|profileImageUrl   |http://pbs.twimg.com/profile_images/769071861246418944/IqOBa7Ff_normal.jpg |

So a good deal of information about the user can be fetched quite easily. The package and the API make this quite an awesome platform for Twitter analysis. There are a few limits on the data that can be fetched, but it is still not bad.
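For readers who want to try the "most favourited tweet" selection used earlier without API access, here is a minimal sketch on a made-up data frame. The values are invented. Columns are selected by name here, which tends to be more robust than numeric positions if the column order ever changes.

```r
# Made-up data with the same three columns shown in the table above
toy_df <- data.frame(
  text          = c("first tweet", "second tweet", "third tweet"),
  favoriteCount = c(2, 5, 0),
  screenName    = c("user_a", "user_b", "user_a"),
  stringsAsFactors = FALSE
)

# which.max() returns the index of the first maximum value
top_row <- which.max(toy_df$favoriteCount)

# Select that row, keeping only the columns of interest by name
top_tweet <- toy_df[top_row, c("text", "favoriteCount", "screenName")]
top_tweet$screenName
```

Note that which.max returns only the first maximum, so ties are silently resolved in favour of the earliest row.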


We can create a bar plot showing how many tweets with the ‘#DataScience’ hashtag each user posted.

counts = table(df$screenName)
barplot(counts)

 

[Figure: bar plot of tweet counts per user]

 

Next, limit the data set to users who tweeted more than 2 times in the sample, and make the visualization a little prettier.

cc <- subset(counts, counts > 2)
# The colour vector `cols` from the original plot was never posted, so the col argument is omitted
barplot(cc, las = 2, cex.names = 0.8, beside = FALSE, width = c(0.1))

[Figure: bar plot of users with more than 2 tweets]


Now we get an idea of which users tweet most often with the hashtag ‘#DataScience’.
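The counting and filtering steps can also be tried on a small made-up vector of screen names. This is only a sketch with invented data; sorting the counts is an extra step, not something the plots above do.

```r
# Invented screen names standing in for df$screenName
screen_names <- c("user_a", "user_b", "user_a", "user_c", "user_a",
                  "user_b", "user_b", "user_d", "user_a")

# table() counts how often each name occurs
counts <- table(screen_names)

# Keep users with more than 2 tweets, then order from most to least active
frequent <- sort(counts[counts > 2], decreasing = TRUE)
frequent

# The result could be plotted in the same way as before:
# barplot(frequent, las = 2, cex.names = 0.8)
```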



World Map

I will now quickly try to draw a world map with pointers to the approximate locations of the users who tweeted.

We will need the following libraries – ggplot2, maps, dismo

library(ggplot2)
library(maps)
library(dismo)

Fetch the user data again using the lookupUsers function:

# Batch lookup of user info
userInfo <- lookupUsers(df$screenName)
# Convert to a data frame
userFrame <- twListToDF(userInfo)

Now we build the world map!

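# Keep only users whose profile location is present and non-empty
locatedUsers <- !is.na(userFrame$location) & nzchar(userFrame$location)
# Use geocode() from dismo to estimate latitude and longitude from the location text
locations <- geocode(userFrame$location[locatedUsers])
long <- locations$longitude
lat <- locations$latitude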

# Add world map 
worldMap <- map_data("world")  

zp1 <- ggplot(worldMap)
# Draw map
zp2 <- zp1 + geom_path(aes(x = long, y = lat, group = group),  
                       colour = gray(2/3), lwd = 1/3)
# Add points indicating users
zp3 <- zp2 + geom_point(data = locations,  
                        aes(x = longitude, y = latitude),
                        colour = "RED", alpha = 1/2, size = 1)
zp4 <- zp3 + coord_equal() + theme_minimal()  
print(zp4)

[Figure: world map with points marking the approximate user locations]
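Before geocoding, the code filters out users without a location. Below is a minimal sketch of that kind of filter on made-up data. It additionally treats empty or whitespace-only strings as missing, on the assumption that blank profile locations may come back as empty text rather than NA.

```r
# Invented user data with a mix of usable and unusable locations
toy_users <- data.frame(
  screenName = c("user_a", "user_b", "user_c", "user_d"),
  location   = c("Mumbai, India", "", NA, "   "),
  stringsAsFactors = FALSE
)

# TRUE only where the location is present and not just whitespace
has_location <- !is.na(toy_users$location) & nzchar(trimws(toy_users$location))

# Only these values would be worth sending to a geocoder
toy_users$location[has_location]
```

Filtering first keeps the number of geocoding requests down, which matters if the geocoding service limits how many lookups are allowed.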



Summary

We have demonstrated data fetching along with some basic exploration for Twitter. Detailed analysis will be produced in a future post.
