Speed of light

This week’s lab shows you how to apply statistical methods and resampling techniques to a dataset from the natural sciences: Simon Newcomb’s measurements of the speed of light. Along the way, we will see how statistical methods help us put the scientific method into practice, and you will get hands-on experience with the kind of data analysis a scientist performs after completing a series of experimental measurements.

Natural science, data science

Many of the datasets we’ve worked through in our labs this semester have come from fields outside of the natural sciences. That doesn’t mean the skills we’re building lack useful applications in fields such as physics, chemistry, and biology. To show this, we will spend this week applying statistical methods to a dataset from the natural sciences that can be used to calculate the speed of light.

About this week’s dataset

The astronomer and applied mathematician Simon Newcomb collected this dataset over three separate days between July 24, 1882 and September 5, 1882 [1,2] in Washington, DC. He performed the measurements using an apparatus similar in design to Léon Foucault’s system of rotating mirrors [3], which allowed Newcomb to measure the time it took a beam of light to travel from Fort Myer on the west bank of the Potomac to a mirror located at the Washington Monument and back [1,4], a round-trip distance of 7443.73 meters. This dataset contains 66 observations, which have been transformed so that they can be analyzed as a series of integers. To convert a dataset value \(t\) to the actual transit time \(t_{\text{meas}}\) in seconds, use the formula

\[t_{\text{meas}} = \dfrac{\dfrac{t}{1000} + 24.8}{10^{6}}\]
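
As a quick sanity check, the conversion can be written as a small helper function in R (a sketch; the function name to_seconds and the example value are only for illustration):

to_seconds <- function(t) {
  # Convert a coded dataset value t into the transit time in seconds
  (t / 1000 + 24.8) / 1e6
}

to_seconds(28)  # a coded value of 28 corresponds to roughly 2.4828e-05 seconds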

Visualizing and quantifying the distribution

Let’s start by doing the usual practice of getting to know our dataset. There’s only one relevant variable in this dataset, time, so it’s the distribution of the measured times that matters. Let’s appraise the distribution of time measurements by creating some visualizations:

  1. Visualize the dataset distribution as a boxplot — use geom_boxplot(aes(x = "unfiltered", y = time)) + coord_flip() — and as a probability mass function (PMF) — use geom_histogram() with y = ..density.. inside aes() — with a binwidth that allows you to see the full dataset (only identical numbers should have counts larger than 1). Describe the center, shape, and spread of the distribution (don’t forget to mention the outliers). A sketch of this code is shown below.
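
    A minimal sketch of these two plots (the binwidth of 1 is an assumption; adjust it until only identical values stack up):

    ggplot(data = newcomb) +
      geom_boxplot(mapping = aes(x = "unfiltered", y = time)) +
      coord_flip()

    ggplot(data = newcomb) +
      geom_histogram(mapping = aes(x = time, y = ..density..), binwidth = 1) +
      labs(x = "time", y = "PMF")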

One of the things you’ll immediately notice when visualizing this dataset is how pronounced the outliers are. The experimental setup involved a rapidly rotating mirror that had to be precisely tuned. Given that the speed of light is so high, small variations in the rotation speed could significantly impact the measured travel times. As such, it’s quite possible these outliers are due to experimental error. However, without further information we cannot be sure that this is the case. Thus, the best choice is to analyze two versions of the dataset, one with the outliers removed and one where we keep all data points.

  2. Create a second, filtered version of the dataset that removes the outliers that you see in the distribution. One way to do this with dplyr’s filter() is sketched below.
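
    A possible approach, assuming the outliers are the unusually small values flagged by your boxplot (the cutoff below is an assumption; pick one that matches what you observe):

    newcomb_filtered <- newcomb %>%
      filter(time > 0)  # assumed cutoff for removing the low outliers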

Another useful visualization for understanding a dataset is the cumulative distribution function (CDF), which creates a map from the distribution’s values to their respective percentiles. To plot the CDF for a data distribution, we can use the convenient stat_ecdf() function in ggplot2.

  3. Visualize the CDF for both the unfiltered and filtered versions of the dataset. The code for plotting the CDF for the unfiltered dataset would be:

    ggplot(data = newcomb) +
      stat_ecdf(mapping = aes(x = time)) +
      labs(y = "CDF")

    The CDF for the filtered dataset can be visualized by slightly modifying the above code (see the sketch below). Do you notice any changes in the CDF after removing the outliers from the original dataset?
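
    For example, assuming the filtered data frame is named newcomb_filtered, the modification could look like:

    ggplot(data = newcomb_filtered) +
      stat_ecdf(mapping = aes(x = time)) +
      labs(y = "CDF")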

Finally, to wrap up this initial exploration, quantify these distributions by computing their summary statistics. The following functions in R are useful for computing the summary statistics of a dataset:

  • mean(): Computes the average

  • median(): Computes the median

  • min(): Finds the minimum value

  • max(): Finds the maximum value

  • sd(): Computes the standard deviation

  • IQR(): Computes the interquartile range

  4. Calculate the following summary statistics for the filtered and unfiltered versions of the dataset: the mean, median, maximum, minimum, standard deviation, and the interquartile range (IQR). For the unfiltered dataset, this would be:

    newcomb %>%
      summarize(
        mean = mean(time),
        median = median(time),
        sd = sd(time),
        iqr = IQR(time),
        min = min(time),
        max = max(time)
      )

    Which summary statistics are sensitive to removing the outliers? Which ones are not?

Inferring a trend

Because there is a spread in the time measurements in Newcomb’s dataset, the measured time should be reported as a mean value with error bars. The error bars are typically found by calculating a confidence interval. A typical choice is a 95% confidence interval, which can be estimated using computational simulations that resample the dataset. To perform our statistical resampling, we will use the tidyverse-inspired infer package, which will help us to compute confidence intervals and perform hypothesis tests.

If not already installed, you can easily install infer by running the following in your Console window:

install.packages("infer")

To compute the confidence interval, we will need to generate the so-called bootstrap distribution. We obtain the bootstrap simulation using the following code:

newcomb_bootstrap <- newcomb %>%
  specify(formula = time ~ NULL) %>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "mean")

To visualize the bootstrap distribution as a probability mass function, we run:

ggplot(newcomb_bootstrap) +
  geom_histogram(
    mapping = aes(x = stat, y = ..density..),
    binwidth = 0.1
  ) +
  labs(x = "average time")

What the bootstrap has done is sample with replacement from the dataset distribution. The basic idea is that, if the underlying sample is representative, then we can resample directly from it as if it were the true population. The number of values we draw in each resample is equal to the number of observations in the dataset. After resampling, we complete the procedure by calculating the mean of the simulated sample (or the median, standard deviation, or some other statistic), and then we repeat the process many times until we end up with a distribution of means. We can then use this bootstrap distribution to determine the confidence interval for the sample statistic of interest.
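
To make the resampling step concrete, here is what a single bootstrap replicate would look like if we did it “by hand” with dplyr (an illustration only, assuming dplyr 1.0 or later; infer’s generate() handles the repetition for us):

newcomb %>%
  slice_sample(n = nrow(newcomb), replace = TRUE) %>%  # draw 66 values with replacement
  summarize(stat = mean(time))                         # the mean of this single resample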

To construct the confidence interval, we need to “rank” the data in stat from smallest to largest, which we can do with the min_rank() function from dplyr:

newcomb_ci <- newcomb_bootstrap %>%
  mutate(rank = min_rank(stat)) %>%
  filter(between(rank, 0.025 * n(), 0.975 * n())) %>%
  summarize(
    lower_bound = min(stat),
    upper_bound = max(stat)
  )

To break down what is going on with the above set of functions:

  • min_rank(stat) assigns each value in the stat column its rank, ordered from smallest to largest

  • 0.025 * n() is the rank that defines the threshold for the 2.5th percentile

  • 0.975 * n() is the rank that defines the threshold for the 97.5th percentile

  • min(stat) and max(stat) give the smallest and largest values that survive the filter, which serve as the bounds at the 2.5th and 97.5th percentiles

  5. Using the above code, compute the 95% confidence interval for the unfiltered and filtered datasets using the bootstrap method (a sketch for the filtered dataset follows below). How does the confidence interval change when you exclude the outliers (the filtered dataset)?
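
    A sketch of the full pipeline applied to the filtered dataset, assuming it is stored in newcomb_filtered:

    newcomb_filtered_bootstrap <- newcomb_filtered %>%
      specify(formula = time ~ NULL) %>%
      generate(reps = 10000, type = "bootstrap") %>%
      calculate(stat = "mean")

    newcomb_filtered_ci <- newcomb_filtered_bootstrap %>%
      mutate(rank = min_rank(stat)) %>%
      filter(between(rank, 0.025 * n(), 0.975 * n())) %>%
      summarize(
        lower_bound = min(stat),
        upper_bound = max(stat)
      )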

We can also use infer to perform a two-sided hypothesis test. The code for doing this is relatively similar; we just need to add an additional hypothesize() step. Of course, in order to run a hypothesis test we need some sort of hypothesis to test against, which will allow us to define the null distribution. We also need to select a significance level \(\alpha\), which serves as a kind of evidence threshold that we use when determining whether or not we can reject the null hypothesis. A common choice for \(\alpha\) is 0.05, which is the value that we will use.

Subsequent work on the speed of light has determined that, given the conditions of Newcomb’s setup, this experiment should yield a “true” mean value of 33.02. With this value in hand, we can formalize the question of whether the gap between our dataset’s distribution and this value could have been generated by chance alone.

  6. Write down (in words) the null hypothesis and the alternative hypothesis for comparing this dataset against the “true” mean value of 33.02.

We modify our code as follows in order to generate the null distribution needed to perform the hypothesis test:

newcomb_null <- newcomb %>%
  specify(formula = time ~ NULL) %>%
  hypothesize(null = "point", mu = 33.02) %>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "mean")

Now that we have a null distribution, we can use it in combination with the experimental average time to calculate the p-value. The p-value is the probability that, if the null hypothesis were true and we repeated the experiment, we would obtain a result that is the same as or more extreme than the reported experimental measurement. Put another way, we need to count the number of data points in the simulated null distribution that are as extreme as or more extreme than the experimental measurement. Assuming that the average time for the unfiltered dataset is assigned to the variable average_light_speed, we would run the following:

point_estimate_difference <- abs(average_light_speed - 33.02)  # abs() computes absolute value
newcomb_null %>%
  filter(
    stat >= 33.02 + point_estimate_difference
    | stat <= 33.02 - point_estimate_difference
  ) %>%
  summarize(pvalue = n() / 10000)
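
For reference, the value of average_light_speed can be computed from the unfiltered dataset as follows (a sketch; pull() simply extracts the summarized mean as a plain number):

average_light_speed <- newcomb %>%
  summarize(mean_time = mean(time)) %>%
  pull(mean_time)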

If the computed p-value is less than 0.05, our significance level, then we reject the null hypothesis in favor of the alternative hypothesis.

  7. Use the infer package to run the two-sided hypothesis test with \(\alpha = 0.05\) between the ideal value of 33.02 and the unfiltered and filtered datasets (see the sketch below). Can we reject the null hypothesis for either version (filtered or unfiltered) of the dataset?
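
    For the filtered dataset, the null distribution can be generated in the same way (again assuming the filtered data frame is named newcomb_filtered):

    newcomb_filtered_null <- newcomb_filtered %>%
      specify(formula = time ~ NULL) %>%
      hypothesize(null = "point", mu = 33.02) %>%
      generate(reps = 10000, type = "bootstrap") %>%
      calculate(stat = "mean")

    Then repeat the p-value calculation, substituting the filtered dataset’s average for average_light_speed.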

Additional questions

  • From your analysis, does Newcomb’s dataset seem to agree with the “true” mean value of 33.02, or is it inconsistent? When answering, make reference to the confidence intervals you computed for both the unfiltered and filtered datasets, as well as to your two-sided hypothesis tests. Based on all this, how likely is it that a systematic bias exists within Newcomb’s dataset?

How to submit

When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to GitHub. Then, navigate to your copy of the GitHub repository you used for this assignment. You should see the updated files that you just synchronized. Confirm that your files are up-to-date, and then complete the following steps:

  • Click the Pull Requests tab near the top of the page.

  • Click the green button that says “New pull request”.

  • Click the dropdown menu button labeled “base:”, and select the option labeled starting.

  • Confirm that the dropdown menu button labeled “compare:” is set to master.

  • Click the green button that says “Create pull request”.

  • Give the pull request the following title: Submission: Lab 6, FirstName LastName, replacing FirstName and LastName with your actual first and last name.

  • In the message box, write: My lab report is ready for grading @jkglasbrenner.

  • Click “Create pull request” to lock in your submission.

Credits

This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by James Glasbrenner for CDS-102.

References

[1] S. M. Stigler, “Do Robust Estimators Work with REAL Data? (With Discussion),” Annals of Statistics 5, 1055 (1977).

[2] S. Newcomb, “Measures of the Velocity of Light Made Under the Direction of the Secretary of the Navy During the Years 1880-’82,” Astronomical Papers 3, 107 (1882).

[3] B. Jaffe, Michelson and the Speed of Light (Doubleday & Company, Garden City, New York, 1960).

[4] W. E. Carter and M. S. Carter, “The Newcomb-Michelson velocity of light experiments,” Eos, Transactions American Geophysical Union 83, 405 (2002).