Tidying your dataset

This week’s lab will introduce you to the concepts of tidy data and how you can reshape your dataset to take advantage of the tidyverse tools. You will then be guided through the process of using these tools to reshape a real-world gene expression dataset that tested the effect of starvation and growth rate on baker’s yeast.1

Tidy data

The principles of Tidy Data are visually represented in the figure below:

plot of chunk tidy-data-schematic

The three panels are an illustration of the following three rules,

  • Each variable must have its own column.

  • Each observation must have its own row.

  • Each value must have its own cell.

It is worth emphasizing that there is a difference between a tidy dataset and a dirty dataset. “Tidying” a dataset means reshaping it by transposing the rows and columns until the format matches the criteria outlined in the above rules, which then allows us to more easily use the ggplot2 and dplyr functions to analyze and visualize a dataset. Cleaning a “dirty” dataset means that you are fixing misspellings, data entry errors, and dealing with other irregularities in the raw data.

About this week’s dataset

The following quote, taken from a discussion about this paper2, describes the meaning of this dataset pretty well:

Through the process of gene regulation, a cell can control which genes are transcribed from DNA to RNA — what we call being “expressed”. (If a gene is never turned into RNA, it may as well not be there at all). This provides a sort of “cellular switchboard” that can activate some systems and deactivate others, which can speed up or slow down growth, switch what nutrients are transported into or out of the cell, and respond to other stimuli. A gene expression microarray lets us measure how much of each gene is expressed in a particular condition. We can use this to figure out the function of a specific gene (based on when it turns on and off), or to get an overall picture of the cell’s activity.

Brauer 2008 used microarrays to test the effect of starvation and growth rate on baker’s yeast S. cerevisiae, a popular model organism for studying molecular genomics because of its simplicity). Basically, if you give yeast plenty of nutrients (a rich media), except that you sharply restrict its supply of one nutrient, you can control the growth rate to whatever level you desire (we do this with a tool called a chemostat). For example, you could limit the yeast’s supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).

“Starving” the yeast of these nutrients lets us find genes that:

  • Raise or lower their activity in response to growth rate. Growth-rate dependent expression patterns can tell us a lot about cell cycle control, and how the cell responds to stress.
  • Respond differently when different nutrients are being limited. These genes may be involved in the transport or metabolism of those nutrients.

Variables

This is a tabular dataset with 5,537 rows and 40 columns:

Variable Description
GID One kind of ID for each gene
YORF Yeast Open Reading Frame (a kind of sub-unit of genetic information)
NAME See below
GWEIGHT The paper doesn’t make this clear, but all entries are 1
[GNP][0.05 ≤ x ≤ 0.30] The letters G, N, and P represent the restricted nutrient. The decimal value is the yeast growth rate.

The NAME column contains the following information separated by the double bar || symbols:

Variable Description
Gene name for example, SFB2. Not all genes have a name here.
Biological process for example, “proteolysis and peptidolysis”
Molecular function for example, “metalloendopeptidase activity”
Systematic ID for example, YNL049C. Every gene has one of these unique IDs.
Unknown ID Number for example, 1082129. The paper doesn’t explain what these mean.

Like in previous labs, it’s recommended that you take a first look at the dataset by viewing it by running View(brauer) in your Console window.

The tidyr package

Reshaping the gene expression dataset will require us to use two functions found in the tidyr package, gather() and separate(). Let’s review how each of these functions works with the extended example from Chapter 12.6 in the R for Data Science textbook.

Running the library(tidyverse) command at the top of our RMarkdown file loads many packages and example datasets for us, which includes a dataset from the World Health Organization that is stored in the variable who. The first few lines of the who dataset are:

country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 new_sp_m4554
Afghanistan AF AFG 1980 NA NA NA NA NA
Afghanistan AF AFG 1981 NA NA NA NA NA
Afghanistan AF AFG 1982 NA NA NA NA NA
Afghanistan AF AFG 1983 NA NA NA NA NA
Afghanistan AF AFG 1984 NA NA NA NA NA
Afghanistan AF AFG 1985 NA NA NA NA NA

Using gather() to transpose columns into rows

In this data frame, the names of the 57 columns starting with new_sp_m014 and ending with newrel_f65 each refer to a set of three categories, violating the first rule for tidy data. This can easily be fixed by transposing these columns into rows using the gather() command:

who1 <- who %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)

After applying the gather() operation, the first few rows in the dataset now look as follows:

country iso2 iso3 year key cases
Afghanistan AF AFG 1997 new_sp_m014 0
Afghanistan AF AFG 1998 new_sp_m014 30
Afghanistan AF AFG 1999 new_sp_m014 8
Afghanistan AF AFG 2000 new_sp_m014 52
Afghanistan AF AFG 2001 new_sp_m014 129
Afghanistan AF AFG 2002 new_sp_m014 90

As you can see, we’ve taken the 57 category columnms and converted them into categories underneath a single column named key with their corresponding values placed underneath the column cases.

To summarize, the syntax for gather() is as follows:

dataset %>%
  gather(
    ...,           #  Columns you want to gather into rows
    key = "...",   #  Variable for storing names of gathered columns
    value = "..."  #  Variable for values stored under gathered columns
  )

Using separate() to split one column into many

The other function you will need to use is separate(). This function takes values in a single column and splits them out into multiple columns. This is used when you have a dataset that doesn’t follow the third rule of tidy data.

Going back to the who example, after you gathered the 57 columns into a single column, we might wonder what the values under key mean. If you run ?who, you’ll see that the underscores _ separate different variable values. This means that the key column contains 3 values per cell, not one, so this is a case for using separate().

To separate the columns, we run the following:

who2 <- who1 %>%
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  separate(
    col = key,
    into = combine("new", "type", "sexage"),
    sep = "_"
  )

You can ignore the second line with a mutate() command for now, as this just fixes the missing underscore that you need to do before running separate.

The first few lines of the separated dataset are:

country iso2 iso3 year new type sexage cases
Afghanistan AF AFG 1997 new sp m014 0
Afghanistan AF AFG 1998 new sp m014 30
Afghanistan AF AFG 1999 new sp m014 8
Afghanistan AF AFG 2000 new sp m014 52
Afghanistan AF AFG 2001 new sp m014 129
Afghanistan AF AFG 2002 new sp m014 90

As you can see, this has successfully split our one column into three.

To summarize, the syntax for separate() is as follows:

dataset %>%
  separate(
    col = ...,            #  Name of column to separate
    into = combine(...),  #  Names for new columns formed after separation
    sep = ...,            #  Specifies the separator symbols
    convert = ...         #  If TRUE, tries to set data type for new columns
  )

Your should note that the into keyword needs to be specified as a vector of strings that you can create using combine().

Prune the dataset

When dealing with larger datasets that have many columns, it is helpful if you can identify any columns that are extraneous and will not be used for analysis. This shrinks the overall size of the dataset so that it takes up less space in the computer’s memory. For this dataset, the GID, YORF, and GWEIGHT columns aren’t particularly important for any kind of analysis, so they can be removed.

  1. Remove the GID, YORF, and GWEIGHT variables from the dataset and assign the result to the variable brauer2.

Enforcing one observation per row

Now we will employ a systematic approach while tidying this dataset by going through the rules one by one and determining what (if anything) needs to be reshaped. We start with the first rule that “each variable must have its own column.” The columns G0.05 through U0.3 each contain the results of different experimental trials, telling us which nutrient was restricted and what growth rate was selected for the yeast. Knowing this, we conclude that this dataset violates the first rule, as each row contains multiple experimental trials.

  1. Use gather() to transpose the columns G0.05 through U0.3 into a series of rows to satisfy the rule that “each variable must have its own column. Use the label sample for the new categorical variable created by gather() that contains the gathered column names and use the label expression for the new numerical variable containing the values originally under the G0.05 through U0.3 columns. Assign your result to the variable brauer3.

After completing this step, the one observation per row rule will be satisfied.

Enforcing one column per variable and one value per cell

The second and third rules are intertwined, and so fixing one will help to satisfy the other. Looking at the version of the dataset in brauer3, we can clearly see that the one value per cell rule is violated by the NAME and sample columns. These are violations that can be fixed using the separate() function. Let’s separate the sample column we created with the gather() function first.

Hint: If you don’t want the separate() function to delete any individual letters or symbols when splitting a column, you can set the sep argument equal to an integer, for example sep = 1. This tells separate() to split the column right after the first letter.

  1. Separate the sample column that you created in Exercise 2 into two variables, nutrient and rate. After separation, the data type for the rate column should be double (<dbl>), not character. Assign your result to the variable brauer4.

Now we turn to the NAME column where each value is separated by two bars || and there are five variables contained in each cell. Using the information from the about this week’s dataset section, we deduce that we should name the new columns gene_name, biological_process, molecular_function, systematic_id, and number.

Hint: If you try to use sep = "||" in separate(), it will not work as expected. By default, the bar symbol | is interpreted as the boolean operator OR, which should be familiar from creating rules for filter(). To get around this, we can preceed | with backslashes to signal to R that we want it to read | as a symbol and not as the boolean operator OR. Thus, we should use sep = "\\|\\|" to define our separator.

  1. Use separate() to split the values in the NAME column into five columns named gene_name, biological_process, molecular_function, systematic_id, and number. Assign your result to the variable brauer5.

We’re nearly done! If you inspect the dataset with View(brauer5), you’ll notice that the text under some of the columns has quotation marks with unnecessary blank space. For example, " molecular function unknown ". Removing this blank space (typically called “white space”) is called trimming, and we can use the str_trim() function for this purpose.

  1. Remove the whitespace from the five new columns you created using mutate_at() in combination with str_trim():

    brauer5 %>%
      mutate_at(vars(gene_name:systematic_id), str_trim)

    Assign your result to the variable brauer_tidy.

Now we’re done!

Visualizations using the tidy dataset

Having a tidy dataset makes it easy to use the rest of the tidyverse packages, such as ggplot2. To demonstrate this, we will now create a plot that a biologist would use to explore gene expression trends that emerge due to restricting different nutrients and enforcing different growth rates.

  1. Filter brauer_tidy so that it only contains entries with gene_name equal to LEU1. Then, create a line plot where you place rate on the horizontal axis, expression on the vertical axis, and assign different colors to nutrient.

Additional questions

  • What are the dataset’s dimensions (the number of rows and columns) after completing the tidying procedure? Do you find it easier to read the data in this format, or did you think the original format was easier to understand? Why or why not?

  • In Exercise 6 we were able to easily create a line plot of expression as a function of rate for the different kinds of nutrients for one of the genes in the microarray. What specifically did the tidying procedure do to make it simple for us to create this plot? Could we still create the same plot with ggplot2 using the untidy version of the dataset? If so, try to write the ggplot2 code for it. If not, identify what it is about the untidy dataset that prevents you from creating the same plot.

How to submit

When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to Github. Then, navigate to your copy of the Github repository you used for this assignment. You should see your repository, along with the updated files that you just synchronized to Github. Confirm that your files are up-to-date, and then do the following steps:

  • Click the Pull Requests tab near the top of the page.

  • Click the green button that says “New pull request”.

  • Click the dropdown menu button labeled “base:”, and select the option starting.

  • Confirm that the dropdown menu button labeled “compare:” is set to master.

  • Click the green button that says “Create pull request”.

  • Give the pull request the following title: Submission: Lab 4, FirstName LastName, replacing FirstName and LastName with your actual first and last name.

  • In the messagebox, write: My lab report is ready for grading @jkglasbrenner.

  • Click “Create pull request” to lock in your submission.

Credits

This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by James Glasbrenner for CDS-102.


  1. Brauer et. al., “Coordination of growth rate, cell cycle, stress response, and metabolic activity in yeast”, Mol. Biol. Cell 19, 352 (2008).

  2. Robinson, David, “Cleaning and Visualizing Genomic Data: A Case Study in Tidy Analysis,” Variance Explained (2015).