Exploration by visualization: the Galton dataset

This week’s lab shows you how to use the ggplot2 package to visualize datasets and how visualization plays a crucial role in data exploration.

Why data visualization?

Why is data visualization an important topic? On the face of it, you might wonder why we need to dedicate any time to this topic. Aren’t plots really easy now that we all have computers? And isn’t making plots and figures one of the last things that we do for a project or lab report, after we’ve figured everything out? Why start with this? Since a picture (or visualization) is worth a thousand words, take a moment to explore the data visualizations linked below.

After a few minutes, be prepared to share with the class one thing you noticed about one of the visualizations that you think made it effective at conveying information.

Visualizations have an important role to play in nearly every stage of a data science project. High-quality visualizations help people to understand your results and can activate their curiosity about your work and ideas. Creating visualizations in R is also easy and fun, and learning how to make them will help you become more comfortable with using R and RStudio. You will quickly see how simple it is to make colorful and eye-catching plots!

We will use the ggplot2 library for all of our visualizations. You are encouraged to download and use the official ggplot2 cheatsheet for this and future labs.

About this week’s dataset

For this week’s lab on data visualization, you will explore a famous dataset collected by Francis Galton. The dataset is automatically loaded into the variable galton for you in the RMarkdown file for your lab report. Francis Galton was developing ways to quantify the heritability of traits in the 1880s, and as part of this work he collected data on the heights of adult children and their parents. To explore the dataset, type the following in your Console window:

View(galton)

Variables

This is a tabular dataset with 898 observations on the following variables:

Variable  Type  Description
family    chr   a category with levels for each family
father    dbl   the father’s height (in inches)
mother    dbl   the mother’s height (in inches)
sex       chr   the child’s sex: F or M
height    dbl   the child’s height as an adult (in inches)
nkids     int   the number of adult children in the family, or, at least, the number whose heights Galton recorded
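Besides View(), a few console commands can summarize the table. The sketch below is only an illustration: galton_toy is a small, made-up stand-in with the same column names and types as the table above (its values are not Galton's measurements), so that the code runs on its own. In your lab report you would call the same functions on galton instead.

```r
# Made-up stand-in with the same columns as the galton table.
# These values are invented for illustration only.
galton_toy <- data.frame(
  family = c("A", "A", "B", "B", "C", "C", "D", "D"),
  father = c(70.0, 70.0, 68.5, 68.5, 72.0, 72.0, 66.0, 66.0),
  mother = c(64.0, 64.0, 63.0, 63.0, 66.5, 66.5, 62.0, 62.0),
  sex    = c("M", "F", "M", "F", "M", "F", "M", "F"),
  height = c(71.0, 65.5, 69.0, 63.5, 73.0, 66.0, 67.5, 62.0),
  nkids  = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  stringsAsFactors = FALSE
)

str(galton_toy)             # column names and types
nrow(galton_toy)            # number of observations (rows)
summary(galton_toy$height)  # min, quartiles, mean, and max of one column
table(galton_toy$sex)       # how many rows fall in each category
```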

Sources

Visualization by example

Before we discuss the general format for creating ggplot2 plots, let’s play around with some examples:

  1. In your lab report, create an R code block that contains the following code:

    ggplot(data = galton) +
      geom_histogram(mapping = aes(x = height), bins = 30)

    To run the code, either click the green “play” button in the upper right corner of the R code block or, while your cursor is inside the code block, press <CTRL>-<SHIFT>-<ENTER>. This should create a plot called a histogram.

    After creating the histogram, look at the height column in the data table you can view with View(galton) and compare it with the histogram. Then, describe what the histogram is doing with the data in this column.

The input parameter bins = 30 controls an important visual element in the histogram plot. Let’s experiment with the parameter in order to figure out what it does.

  2. Using the code you wrote in Exercise 1 as a starting point, try setting the input keyword bins equal to something larger than the number 30, and then equal to something smaller than the number 30. This will create two plots. Then, change the input keyword from bins to binwidth and set its value equal to 1. Compare the plots and write a conclusion about what the bins and binwidth inputs control.
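If you want to check your conclusion against the underlying arithmetic, the sketch below tallies values into intervals by hand using base R. It is a simplified illustration, not ggplot2's own procedure: the heights vector is made up, and the exact interval edges that ggplot2 chooses may differ from the ones computed here.

```r
# Made-up heights in inches, for illustration only
heights <- c(62.0, 63.5, 65.5, 66.0, 67.5, 69.0, 71.0, 73.0)

# Option 1: fix the WIDTH of each interval (here, one inch)
edges_1in  <- seq(floor(min(heights)), ceiling(max(heights)), by = 1)
counts_1in <- table(cut(heights, breaks = edges_1in, include.lowest = TRUE))

# Option 2: fix the NUMBER of intervals (here, four equal-width intervals)
edges_4  <- seq(min(heights), max(heights), length.out = 5)
counts_4 <- table(cut(heights, breaks = edges_4, include.lowest = TRUE))

counts_1in  # one count per one-inch interval
counts_4    # one count per interval when the range is split four ways
```

Either way, every value lands in exactly one interval, so the counts always add up to the number of observations.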

It is simple to add additional arguments to the aesthetic input aes() that change the way data are shown, which can reveal trends that were previously hidden from view. Let’s see what the fill argument does to our histogram:

  3. Write the following code in your lab report:

    ggplot(data = galton) +
      geom_histogram(
        mapping = aes(x = height, fill = sex), binwidth = 1, alpha = 0.5,
        position = "identity"
      )

    Run the block and look at the output. What did adding fill = sex do? Does this change the way you might interpret the visualization? What kinds of differences stand out now that we added this?

As you can now see, changing one of the inputs in your ggplot2 code can have a substantial effect on the way your visualization looks. When a visualization reveals new information, we should describe and interpret it in our lab reports.

  4. Describe the shape of the male and female height distributions and where each one seems to be centered. From your visual inspection, does there appear to be a tangible difference in the average height for these two distributions? Based on what you know about the relative heights of people, is this a result that you would have expected to see? Why or why not?

Based on what you’ve seen so far, do you understand what all of the inputs are doing? The exercise below guides you through the process of exploring how the different inputs affect the plot’s look:

  5. A good way to figure out how R works is to experiment with inputs. What happens if you change the value of alpha = 0.5 (keep it between 0 and 1)? What happens if you remove the input position = "identity"? What happens if you replace it with position = "dodge"? What does it change in your output?

    Hint: When removing the input, be careful with the commas! A comma should separate each input.

The ggplot2 syntax

Let’s take a small break from making plots and review the general syntax for creating a ggplot2 figure. The command ggplot(), as you might have figured out already, creates the plot window. Commands with the prefix geom_, such as geom_histogram(), convert data points into different kinds of visualizations, and the command aes() maps variables in the dataset to the aesthetic properties (such as position and fill color) of each geom_ object. We then specify input parameters inside the parentheses of these commands, with each input separated from neighboring inputs by a comma (,) and each input name separated from its value by an equals sign (=). In general, visualizations in ggplot2 have a predictable structure for the commands you need to enter. The general pattern is:

ggplot(data = <data>) +                     # Required in all ggplot2 visualizations
  geom_<function>(
    mapping = aes(<mapping>),               # Required in all ggplot2 visualizations
    stat = <stat>,                          # Optional, sensible default chosen
    position = <position>                   # Optional, sensible default chosen
  ) +
  coord_<function> +                        # Optional, sensible default chosen
  facet_<function> +                        # Optional, sensible default chosen
  scale_<function> +                        # Optional, sensible default chosen
  theme_<function>                          # Optional, sensible default chosen

In the above, anywhere you see a word in angle brackets, you can replace it with one of several choices. At a bare minimum, you must always specify the first two functions; all the rest are optional and have sensible defaults chosen for you. Also note that each major category (geom_ objects, coord_ objects, facet_ objects, scale_ objects, and theme_ objects) is added to the ggplot2 “sentence” via the plus sign +. Think of it as a series of layers: first you lay down the canvas (ggplot), then you plot the data using a certain type of plot (geom_<function>) as the second layer, and afterward you tweak and polish the plot using additional layers to fine-tune things. This layered approach allows you to create nice figures without much effort or the need to memorize dozens of commands.
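To make the pattern concrete, here is one possible way to fill in every slot of the template. Treat it as a sketch under assumptions: galton_toy is a made-up stand-in for galton (the values are invented), and the particular coord, facet, scale, and theme functions were picked only as examples of each category. The stat and position values shown are the defaults that geom_histogram would use anyway.

```r
library(ggplot2)

# Made-up stand-in for the galton table (values invented for illustration)
galton_toy <- data.frame(
  sex    = c("M", "F", "M", "F", "M", "F", "M", "F"),
  height = c(71.0, 65.5, 69.0, 63.5, 73.0, 66.0, 67.5, 62.0),
  stringsAsFactors = FALSE
)

p <- ggplot(data = galton_toy) +              # canvas
  geom_histogram(
    mapping = aes(x = height),                # which column goes on the x axis
    stat = "bin",                             # default stat, written out
    position = "stack",                       # default position, written out
    binwidth = 1
  ) +
  coord_cartesian() +                         # ordinary x-y coordinates
  facet_grid(. ~ sex) +                       # facets are covered later in this lab
  scale_x_continuous(name = "Height (in)") +  # relabel the x axis
  theme_bw()                                  # black-and-white theme

p  # typing the plot's name at the top level draws it
```

Deleting any of the last four lines (and the + before it) still leaves a valid plot, which is what "optional, sensible default chosen" means in the template.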

Scatterplots

Let’s further explore the data using another type of visualization, the scatter-plot.

  6. Use the following code to create a scatter-plot of each person’s height as a function of their father’s height.

    ggplot(data = galton) +
      geom_point(mapping = aes(x = father, y = height))

    Here, height is the response (dependent) variable and father is the explanatory (independent) variable. Describe any trends that you see using full sentences.

Next, we should try to create a plot that is similar in spirit to what we did in Exercise 3, so that we can see how the height variable depends on the father variable when the sex variable is taken into account. One important difference to know is that we need to use the word color as an input instead of fill. Otherwise, the procedure for grouping over the sex variable is basically the same.

  7. Figure out how to group the data over the sex variable using the color input and create a new plot. What does this plot tell you about the relationship between the height measurements and the sex variable?

Faceting

Let’s introduce another new concept, the facet. Facets create visualizations with multiple panels, splitting things up across a categorical variable. Let’s apply this to our scatter-plot that we created in Exercise 6.

  8. Copy your code from Exercise 6 that created a scatter-plot. Add the following new command to your code snippet using the + sign:

    facet_grid(. ~ sex)

    Describe what you get as output. Then, create a new code block where you reverse the input for facet_grid like so:

    facet_grid(sex ~ .)

    What did adding facet_grid do to your output, and what does the order of . ~ sex versus sex ~ . seem to do? Is the information presented here any different from the information in Exercise 7?

Modeling in ggplot2

You can actually create regression models using geom_smooth, which is another handy way to look for trends. You can choose from one of several kinds of regression methods. Here, we’ll use linear regression for creating our models (perhaps you remember drawing one of these “lines of best fit” by hand in a prior science or math class).

  9. Copy your code from Exercise 6 that created a scatter-plot. Add the following new command to your code snippet using the + sign:

    geom_smooth(mapping = aes(x = father, y = height), method = "lm")

    Describe the line that you get as output. Does it follow the trends (if any) you’ve previously described in the data? What do you think the semi-transparent gray region around the line stands for?
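If you are curious where the line comes from: with method = "lm", geom_smooth fits the line using R's built-in lm() function, and by default it models y as a straight-line function of x. The sketch below runs the equivalent fit directly. It uses galton_toy, a made-up stand-in for galton, so the numbers it prints say nothing about the real dataset; swap in galton to get the intercept and slope of the line on your own plot.

```r
# Made-up stand-in for the galton table (values invented for illustration)
galton_toy <- data.frame(
  father = c(70.0, 70.0, 68.5, 68.5, 72.0, 72.0, 66.0, 66.0),
  height = c(71.0, 65.5, 69.0, 63.5, 73.0, 66.0, 67.5, 62.0)
)

# Fit height as a straight-line function of father's height
fit <- lm(height ~ father, data = galton_toy)

# Two numbers define the line: where it crosses the y axis (the intercept)
# and how many inches height changes per inch of father's height (the slope)
coef(fit)
```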

When creating a graph, you often want to make some touch-ups after you’ve figured out what to plot. Do the following to add some extra polish to the plot you made in Exercise 9.

  10. Copy your code from Exercise 9 and add the following command to your code:

    labs(
      title = "Height of children as adults versus the height of their fathers",
      x = "Father's height (in)",
      y = "Adult child's height (in)"
    )

    You should also alter the size of the scatterplot circles by adding size = 2 as an input in geom_point.

Additional questions

  • Create a similar plot to Exercise 6 and make a scatter-plot of each person’s height as a function of their mother’s height. Describe any trends that you see using full sentences. If there’s a trend both here and in Exercise 6, does it follow the same general direction or does it not? If the trends move in the same direction, then does one trend look stronger than the other? If so, which one? Explain how you know this using full sentences.

  • Start with your code from Exercise 8 and add a geom_smooth command that separately performs a linear regression on the male and female categories. Each subplot (or facet) should have a line in it. Additionally, polish the graph using what you learned in Exercise 10. Interpret your results and explain how well these visualizations explain the trends in the dataset, if any.

  • Compare the geom_smooth trend-lines for when you used the father’s height only and when you used both the father’s height and whether the child was male or female. Do they both show the same trend? Is one trend more “powerful” than the other? Why or why not? Remember to respond by writing full sentences.
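When a question asks which of two trends looks stronger, one optional way to back up a visual impression with a number is the correlation coefficient, computed with base R's cor(). This is a suggestion rather than a required part of the lab. The sketch uses galton_toy, a made-up stand-in for galton, so its output is not an answer to the questions above.

```r
# Made-up stand-in for the galton table (values invented for illustration)
galton_toy <- data.frame(
  father = c(70.0, 70.0, 68.5, 68.5, 72.0, 72.0, 66.0, 66.0),
  mother = c(64.0, 64.0, 63.0, 63.0, 66.5, 66.5, 62.0, 62.0),
  height = c(71.0, 65.5, 69.0, 63.5, 73.0, 66.0, 67.5, 62.0)
)

# cor() returns a value between -1 and 1. The sign gives the direction of
# the linear trend, and values farther from 0 indicate a tighter trend.
r_father <- cor(galton_toy$father, galton_toy$height)
r_mother <- cor(galton_toy$mother, galton_toy$height)

r_father
r_mother
```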

How to submit

When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to GitHub. Then, navigate to your copy of the GitHub repository you used for this assignment. You should see your repository, along with the updated files that you just synchronized to GitHub. Confirm that your files are up-to-date, and then do the following steps:

  • Click the Pull Requests tab near the top of the page.

  • Click the green button that says “New pull request”.

  • Click the dropdown menu button labeled “base:”, and select the option starting.

  • Confirm that the dropdown menu button labeled “compare:” is set to master.

  • Click the green button that says “Create pull request”.

  • Give the pull request the following title: Submission: Lab 2, FirstName LastName, replacing FirstName and LastName with your actual first and last name.

  • In the message box, write: My lab report is ready for grading @jkglasbrenner.

  • Click “Create pull request” to lock in your submission.

Credits

This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The description of the Galton dataset was adapted from the documentation of the same dataset that’s available in the mosaicData R package, and some of the lab exercises were adapted from problem sets found in Modern Data Science with R by Benjamin Baumer, Daniel Kaplan, and Nicholas Horton. All other exercises and instructions were written by James Glasbrenner for CDS-102.