Mining the social web

This week’s lab shows you how to access the Twitter API using the rtweet package and what you can do with that harvested data.

Installing dependencies

This lab requires you to install a few extra packages into your RStudio environment, so let’s take care of that first. First, we need to check if we need to install the devtools package, and install it if we don’t have it:

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

We also need to install the fs package from the extended tidyverse to help with constructing paths to files that are compatible with any operating system:

install.packages("fs")

Once devtools and fs are installed, run the following code to install the latest version of the rtweet package:

devtools::install_github("mkearney/rtweet")

We also run the following code in our Console window in order to create an .Renviron file and the folders where we will be saving our Twitter authentication token.

location_to_save_token <- fs::path(fs::path_home(), ".config", "R", "twitter")
token_file <- fs::path(location_to_save_token, "twitter_token", ext="rds")
fs::dir_create(path = location_to_save_token, recursive = TRUE)

renviron_file <- fs::path(fs::path_home(), ".Renviron")
fs::file_create(path = renviron_file)
fs::file_chmod(path = renviron_file, mode = "600")

if (!stringr::str_detect(readr::read_file(file = renviron_file), "TWITTER_APP=")) {
  readr::write_file(
    x = glue::glue(
      "TWITTER_APP=rtweet_tokens\n",
      "TWITTER_CONSUMER_KEY=\n",
      "TWITTER_CONSUMER_SECRET=\n",
      "TWITTER_ACCESS_TOKEN=\n",
      "TWITTER_ACCESS_TOKEN_SECRET=\n\n"
    ),
    path = fs::path(fs::path_home(), ".Renviron"),
    append = TRUE
  )
}

if (!stringr::str_detect(readr::read_file(file = renviron_file), "TWITTER_PAT=")) {
  readr::write_file(
    x = glue::glue("TWITTER_PAT={token_file}\n\n"),
    path = fs::path(fs::path_home(), ".Renviron"),
    append = TRUE
  )
}

file.edit(renviron_file)

The .Renviron file should have opened up in your editing pane. Leave it open for now, we will be adding information to it shortly.

Next, we turn to getting ourselves set up to access the Twitter API.

Creating an App on the Twitter API

The official method for accessing Twitter data is through their API — the acronym API means application programming interface — that can only be used if you have an authentication token. The procedure for obtaining an authentication token is straightforward, you just need to have a registered Twitter account. If you have not already, create your Twitter account now and log into your account.

Phone verification: The Twitter apps page may require that you have a valid cell phone number associated with your Twitter account that is also verified. To double check whether you’ve associated a phone number with your Twitter account and validated it, navigate to https://twitter.com and click on your profile picture in the top right hand corner, click Settings and privacy, and then click Mobile on the menubar on the left. On this page you can enter your cell phone number and then verify it by entering the code you receive in a text message from Twitter.

The way we will get an authentication token for the Twitter API is by creating a Twitter app. Navigate to https://apps.twitter.com and click Create New App. This will open a page where you will create a new app by providing a Name, Description, and Website of your choosing, similar to the screenshot below:

plot of chunk creating-twitter-app-token

In the input boxes where it asks you to provide a Name, Description, and Website, use the following:

  • Name: cds-102-lab-<extra>

  • Description: CDS 102 lab

  • Website: http://cds101.com

  • Callback URL: http://127.0.0.1:1410

  • Yes, I have read and agree to the Twitter Developer Agreement

  • In Name, replace <extra> with a sequence of letters or numbers of your choosing

  • The setting for Callback URL is recommended by the rtweet developers

  • Check the box next to the Developer Agreement

Once you have filled in all the boxes in the way shown above, click on Create your Twitter Application button.

plot of chunk twitter-app-token-created

After creating the app, click the tab labeled Keys and Access Tokens to retrieve your consumer (api) and secret keys. The top of the page will look like this:

plot of chunk twitter-app-token-keys

You will now need to copy and paste the codes for the Consumer Key, Consumer Secret, Access Token, and Access Token Secret into the .Renviron file that we opened at the very end of the previous section. Copy each of these codes, match it to one of the four keywords in the .Renviron file, and paste your copied code in the space immediately to the right of the equals = sign. Then, in the RStudio menubar, click Session, then click Restart R.

Connecting your R session to Twitter

Now that we have the authentication token codes saved to our account, let’s connect R and rtweet using the following code:

twitter_token <- rtweet::create_token(
  app = Sys.getenv("TWITTER_APP"),
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"),
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
  access_token = Sys.getenv("TWITTER_ACCESS_TOKEN"),
  access_secret = Sys.getenv("TWITTER_ACCESS_TOKEN_SECRET"),
  set_renv = FALSE
)

Running the above command will complete the connection process and now you can start using the Twitter API! Before we start, let’s perform one last configuration that will make it so that you do not need to re-run the above code each time you close and re-open RStudio:

location_to_save_token <- fs::path(fs::path_home(), ".config", "R", "twitter")
token_file <- fs::path(location_to_save_token, "twitter_token", ext="rds")
fs::dir_create(path = location_to_save_token, recursive = TRUE)
readr::write_rds(x = twitter_token, path = token_file)
fs::file_chmod(path = token_file, mode = "600")

This code saves the configuration information about your Twitter authentication tokens and should be automatically loaded for you the next time you restart RStudio.

Keep your access token codes private: It may seem like it would be more convenient to just store the consumer key, consumer secret, access token, and access token secret directly inside your RMarkdown file. Do not do this! These codes will give anyone that has them direct access to and control over your account. Therefore you should store the token codes somewhere safe in your RStudio Server account, which the above instructions helped you accomplish.

Fetching data about users, their followers, and friends

There are many things we can analyze using a social media platform. Like most things, we will start small by stepping through a basic analysis of a single Twitter account. Our example account will be @CSS_GMU, which is the official Twitter account of the Computational Social Science program in George Mason’s Computational & Data Sciences Department.

We begin by fetching the basic account information for @CSS_GMU, which we will then save to disk for offline access, and then explore the variables returned to us. To ask the Twitter API for the account information belonging to the Computational Social Science program, we use:

lookup_users("CSS_GMU") %>%
  write_rds(path = path("user_css_gmu", ext = "rds"))

This will write a file named user_css_gmu.rds into your project directory that contains the information we just collected. To read it into R’s memory, we run:

user_css_gmu <- read_rds("user_css_gmu.rds")

Now let’s take a look at the information we just gathered about @CSS_GMU. Perhaps we can learn something about the account just by looking at the different variables. Run the following to get a full list of the variables you can look at that are related to the account:

user_css_gmu %>%
  users_data() %>%
  names()
  1. Let’s focus on the following variables associated with the @CSS_GMU account:

    user_css_gmu %>%
      users_data() %>%
      select(
        account_created_at,
        description,
        favourites_count,
        followers_count,
        friends_count,
        account_lang
      ) %>%
      glimpse(width = 200)

    What do these variables describe and why might they be interesting to know? What other variables do you see after running the names() function that would be of interest to a data scientist?

On social media platforms such as Twitter, we can learn a lot about an individual user by looking at the people that user follows (we’ll call these friends) as well as the people that follow the user (we’ll call these followers), and by studying the attributes and behaviors that emerge when these users interact with one another. Thus, while knowing an account’s personal attributes can be very useful, we can infer additional information about a user by analyzing the account’s friends and followers. Let’s see how this works in practice.

  1. We need to fetch the friends and followers of the @CSS_GMU account using the get_followers() and get_friends() functions from rtweet as follows:

    get_followers("CSS_GMU") %>%
      pull(user_id) %>%
      lookup_users() %>%
      write_rds(path = path("css_gmu_followers", ext = "rds"))
    
    get_friends("CSS_GMU") %>%
      pull(user_id) %>%
      lookup_users() %>%
      write_rds(path = path("css_gmu_friends", ext = "rds"))

    Like before, we’ve saved the data we collected to disk so that we have an offline copy.

    Do you understand what the above code is actually doing? Figure out what get_followers() and get_friends() are doing specifically, and what the lookup_users() function is doing in the above code, and explain it. Then, write the code that you need to read the rds files into R so that you can work with the data. Assign the followers data to the variable css_gmu_followers and the friends data to the variable css_gmu_friends.

Visualizing Twitter data

One of the conveniences of the rtweet package is that it stores Twitter data in the tibble data frame format, so all of the tidyverse tools we’ve learned to use so far can be used for analysis. Let’s work through a few examples of what you can do to analyze this data.

One of the questions we can ask of our data is if there is a relationship between how often @CSS_GMU’s followers tweet and the number of followers they have. While we’re at it, we can also ask: what language do they write their tweets in?

  1. Use the following command to get an answer to these questions:

    css_gmu_followers %>%
      users_data() %>%
      ggplot() +
      geom_point(
        mapping = aes(
          x = statuses_count,
          y = followers_count,
          color = account_lang
        )
      ) +
      scale_x_log10() +
      scale_y_log10() +
      coord_equal()

    Notice that we have set our plot to have a log-scale on both axes. Why would we want to do that? Also, what does this graph tell us about the accounts that follow @CSS_GMU, and by consequence, what does all this say about @CSS_GMU?

The ages of the accounts for @CSS_GMU’s friends and followers could contain meaningful patterns and be of interest to us. For example, it may be that the patterns in the account ages for @CSS_GMU look different from accounts not associated with a University.

  1. Use the following command to create a visualization that looks at the account creation date for followers of @CSS_GMU:

    css_gmu_followers %>%
      users_data() %>%
      ggplot() +
      geom_histogram(
        mapping = aes(x = account_created_at),
        bins = 20
      )

    Describe the distribution of ages of the accounts that follow @CSS_GMU, and what can we infer about the account @CSS_GMU itself?

Of course, it wouldn’t make sense for us to interact with the Twitter API if we don’t grab some actual tweets! To request tweets for a specific account, we use the get_timeline() function. For example, to request all the tweets by the @CSS_GMU account we use the following command:

get_timeline(user = "CSS_GMU", n = 500) %>%
  write_rds(path = path("css_gmu_all_tweets", ext = "rds"))

As usual, we save the tweets to disk so that we have an offline copy. After saving them, we load the tweets and assign them to the variable css_gmu_tweets:

css_gmu_tweets <- read_rds("css_gmu_all_tweets.rds")
  1. Create a timeline plot of @CSS_GMU’s monthly tweeting frequency since the account was first created using the following code:

    css_gmu_tweets %>%
      tweets_data() %>%
      ts_plot(by = "months") +
      theme(plot.title = element_text(face = "bold")) +
      labs(
        x = NULL,
        y = NULL,
        title = "Frequency of @CSS_GMU Twitter statuses since account creation",
        subtitle = "Twitter status (tweet) counts aggregated using one-month intervals",
        caption = "\nSource: Data collected from Twitter's REST API via rtweet"
      )

    You’ll notice that there’s a cyclical pattern to the tweet frequency for this account. Explain why this might be happening by suggesting a reasonable hypothesis for the mechanism behind this cyclic behavior.

Choose your own adventure

Now it’s your turn to tell a story. Pick a Twitter account, any account, and explore it using the tools that were just introduced to you. Your exploratory analysis should be encapsulated in no less than 4 – 6 separate plots, which you will then interpret and weave into a coherent narrative about your chosen account. This is an open-ended lab report, so go ahead and have some fun with it!

The grading will be heavily weighted towards this section and what you submit for this part of the lab.

  • While your story should start with one Twitter account, you are welcome to branch outward by looking at the attributes of the account’s friends and followers to enrich and extend your analysis.

  • You are welcome to use any of the available functions in rtweet, even if they weren’t shown in the exercises.

How to submit

When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to Github. Then, navigate to your copy of the Github repository you used for this assignment. You should see your repository, along with the updated files that you just synchronized to Github. Confirm that your files are up-to-date, and then do the following steps:

  • Click the Pull Requests tab near the top of the page.

  • Click the green button that says “New pull request”.

  • Click the dropdown menu button labeled “base:”, and select the option starting.

  • Confirm that the dropdown menu button labeled “compare:” is set to master.

  • Click the green button that says “Create pull request”.

  • Give the pull request the following title: Submission: Lab 5, FirstName LastName, replacing FirstName and LastName with your actual first and last name.

  • In the messagebox, write: My lab report is ready for grading @jkglasbrenner.

  • Click “Create pull request” to lock in your submission.

Credits

This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Lab instructions and exercises originally created by Joe Shaheen for CDS-102. Updates to exercises and instructions for compatibility with the rtweet package by James Glasbrenner. Figures in the Creating an App on the Twitter API section are from the Obtaining and using access tokens rtweet vignette.