Reproducibility
In May 2015, Science retracted a study, published just five months earlier, about how canvassers can sway people's opinions on gay marriage.
Science Editor-in-Chief Marcia McNutt:
Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.
Methods we’ll discuss can’t prevent this, but they can make it easier to discover issues.
The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.
The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.
Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].
Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].
In a follow-up project, the authors realized their results were the opposite of those in the literature.
The first author, Gunzenhauser, told Retraction Watch:
After some research I found there had been a systematic coding error during data entry (the 0/1 coding for “boy” and “girl” that we had on the paper questionnaires was opposite to the one in the SPSS labels).
Scriptability → R [in contrast to pull down menus]
Literate programming → R Markdown [in contrast to multiple files]
Version control → Git / GitHub [in contrast to multiple versions]
Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.
Taken from [ModernDive: An Introduction to Statistical and Data Sciences via R](https://ismayc.github.io/moderndiver-book/), by Ismay and Kim
[Jessica Ward](https://jkrward.github.io/), PhD student at Newcastle University
On GitHub (on the web) edit the README document and Commit it with a message describing what you did.
Then, in RStudio also edit the README document with a different change.
As you work in teams you will run into merge conflicts; learning how to resolve them properly will be very important.
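When both versions change the same lines, Git cannot merge them automatically and instead marks the README with conflict markers. A minimal sketch of what the conflicted file might look like (the text between the markers is just an illustration); keep the version you want, or combine the two, delete the marker lines, then commit:

<<<<<<< HEAD
Change made in RStudio.
=======
Change made on GitHub.
>>>>>>> origin/master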
The Active Duty data are not tidy! What are the cases? How are the data not tidy? What might the data look like in tidy form? Suppose that the case was "an individual in the armed forces." What variables would you use to capture the information in the following table?
image credit: https://www.tidyverse.org/.
Hosted online:
movies <- read_csv("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/movies2.csv")
Hosted locally:
movies <- read_csv("movies2.csv")
Things to note:
- `View()` can be used in RStudio to bring up an Excel-style spreadsheet. Only for viewing, not editing!
- `View()` has a capital letter V.
- `View()` should not be used in an Rmd document.
- `head()` can be used to print the first several lines of the dataset to the console.
- `dim()` can be used to find the dimensions (rows, then columns).
- `names()` can be used to find the names of the variables.

Running into problems? Ask your neighbor or try Google!
dim(movies)
## [1] 134 5
head(movies,3)
## # A tibble: 3 × 5
##   ...1             score2 rating2 genre2     `box office2`
##   <chr>             <dbl> <chr>   <chr>              <dbl>
## 1 2 Fast 2 Furious   48.9 PG-13   action             127. 
## 2 28 Days Later      78.2 R       horror              45.1
## 3 A Guy Thing        39.5 PG-13   rom comedy          15.5
names(movies)
## [1] "...1" "score2" "rating2" "genre2" ## [5] "box office2"
For now, we'll work with all flights out of the three NYC airports in 2013.
Download and install the package from CRAN (done in the Console, only once).
install.packages("nycflights13")
Load the package (in the .Rmd file; it is needed for the .Rmd file to compile appropriately).
library(nycflights13)
Make the data set visible.
data(flights)
Get help.
?flights
Whenever you're learning a new tool, for a long time you're going to suck ... but the good news is that is typical, that's something that happens to everyone, and it's only temporary.
-Hadley Wickham
Data sets are often of high volume (lots of rows) and high variety (lots of columns). This is overwhelming to visualize and analyze, so we find ourselves chopping the data set up into more manageable and meaningful chunks. We also often need to perform operations to organize and clean our data.
This is all possible in base R, but with the tidyverse, it is simple, readable, and fast.
filter()
arrange()
select()
distinct()
mutate()
summarize()
sample_n()
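To make the base R comparison above concrete, here is one small sketch using the flights data loaded earlier (the object names are just for illustration); the dplyr verbs themselves are introduced one by one below.

# Base R: flights longer than 2000 miles
long_base  <- flights[flights$distance > 2000, ]

# dplyr: the same subset, reading almost like a sentence
long_dplyr <- filter(flights, distance > 2000)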
filter()
Allows you to select a subset of the rows of a data frame. The first argument is the name of the data frame; the following arguments are the filters that you'd like to apply.
For all flights on January 1st:
filter(flights, month == 1, day == 1)
## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 832 more rows, and 12 more variables:
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
Filters are constructed of logical operators: `<`, `>`, `<=`, `>=`, `==`, `!=` (and some others). Adding them one by one to `filter()` is akin to saying "this AND that". To say "this OR that OR both", use `|`.
filter(flights, month == 1 | month == 2)
## # A tibble: 51,955 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 51,945 more rows, and 12 more variables:
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
Construct filters to isolate:
filter(flights, month == 3, day == 17)              # flights on March 17
filter(flights, dest == "ORD")                      # flights to Chicago O'Hare
filter(flights, dest == "ORD", carrier == "UA")     # United flights to Chicago O'Hare
filter(flights, distance > 2000 | air_time > 5*60)  # flights over 2000 miles OR over 5 hours in the air
arrange()
`arrange()` reorders the rows: it takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day)
Use `desc()` to sort in descending order.
arrange(flights, desc(arr_delay))
select()
Often you work with large datasets with many columns where only a few are actually of interest to you. `select()` allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:
select(flights, year, month, day)
You can exclude columns using `-` and specify a range of variables using `:`.
select(flights, -(year:day))
distinct()
A common use of `select()` is to find out which values a set of variables takes. This is particularly useful in conjunction with the `distinct()` verb, which only returns the unique values in a table.
What do the following data correspond to?
distinct(select(flights, origin, dest))
## # A tibble: 224 × 2
##    origin dest 
##    <chr>  <chr>
##  1 EWR    IAH  
##  2 LGA    IAH  
##  3 JFK    MIA  
##  4 JFK    BQN  
##  5 LGA    ATL  
##  6 EWR    ORD  
##  7 EWR    FLL  
##  8 LGA    IAD  
##  9 JFK    MCO  
## 10 LGA    ORD  
## # … with 214 more rows
mutate()
As well as selecting from the set of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of `mutate()`:
select(mutate(flights, gain = dep_delay - arr_delay), flight, dep_delay, arr_delay, gain)
## # A tibble: 336,776 × 4
##    flight dep_delay arr_delay  gain
##     <int>     <dbl>     <dbl> <dbl>
##  1   1545         2        11    -9
##  2   1714         4        20   -16
##  3   1141         2        33   -31
##  4    725        -1       -18    17
##  5    461        -6       -25    19
##  6   1696        -4        12   -16
##  7    507        -5        19   -24
##  8   5708        -3       -14    11
##  9     79        -3        -8     5
## 10    301        -2         8   -10
## # … with 336,766 more rows
summarize() and sample_n()
`summarize()` collapses a data frame to a single row. It's not very useful yet.
`sample_n()` provides you with a random sample of the rows.
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6
sample_n(flights, 10)
## # A tibble: 10 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     7    27      612            615        -3      856
##  2  2013     1    13     1405           1130       155     1649
##  3  2013     8    12     1333           1340        -7     1648
##  4  2013     7    25      957           1000        -3     1132
##  5  2013     2     7     2211           2058        73     2314
##  6  2013    11    16     1951           1929        22     2300
##  7  2013     8    13     1828           1828         0     2043
##  8  2013    10    17       NA           1822        NA       NA
##  9  2013    11     4      830            837        -7      932
## 10  2013     3    31     1815           1820        -5     2157
## # … with 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
Mutate the data to create a new column that contains the average speed traveled by the plane for each flight.
Select the new variable and save it, along with tailnum, as a new data frame object.
flights2 <- mutate(flights, speed = distance/(air_time/60))
speed_data <- select(flights2, tailnum, speed)
group_by()
`summarize()` and `sample_n()` are even more powerful when combined with the idea of "group by", repeating the operation separately on groups of observations within the dataset. The `group_by()` function describes how to break a dataset down into groups of rows.
group_by()
Find the fastest airplanes in the bunch, measured as the average speed per airplane.
by_tailnum <- group_by(speed_data, tailnum)
avg_speed <- summarize(by_tailnum, count = n(), avg_speed = mean(speed, na.rm = TRUE))
arrange(avg_speed, desc(avg_speed))
## # A tibble: 4,044 × 3
##    tailnum count avg_speed
##    <chr>   <int>     <dbl>
##  1 N228UA      1      501.
##  2 N315AS      1      499.
##  3 N654UA      1      499.
##  4 N819AW      1      490.
##  5 N382HA     26      486.
##  6 N388HA     36      484.
##  7 N391HA     21      484.
##  8 N777UA      1      483.
##  9 N385HA     28      483.
## 10 N392HA     13      482.
## # … with 4,034 more rows
Instead of applying each verb step-by-step, we can chain them into a single data pipeline, connected with the `%>%` operator. You start the pipeline with a data frame and then pass it to each function in turn.
The pipe syntax (`%>%`) takes a data frame and sends it to the argument of a function. The mapping goes to the first available argument in the function. For example:

- `x %>% f(y)` is the same as `f(x, y)`
- `y %>% f(x, ., z)` is the same as `f(x, y, z)`
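A minimal sketch of the dot placeholder in practice (the model here is just an illustration): `lm()` wants a formula as its first argument, so the `.` routes the piped data frame to the `data` argument instead.

# the piped data frame goes where the dot is, not to the first argument
flights %>% lm(arr_delay ~ dep_delay, data = .)
# equivalent to: lm(arr_delay ~ dep_delay, data = flights)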
step1 <- dress(me, what = sports)
step2 <- exercise(step1, how = running)
step3 <- eat(step2, choice = cereal)
step4 <- dress(step3, what = school)
step5 <- commute(step4, transportation = bike)
commute(dress(eat(exercise(dress(me, what = sports), how = running), choice = cereal), what = school), transportation = bike)
(better??)
commute(
  dress(
    eat(
      exercise(
        dress(me, what = sports),
        how = running),
      choice = cereal),
    what = school),
  transportation = bike)
me %>%
  dress(what = sports) %>%
  exercise(how = running) %>%
  eat(choice = cereal) %>%
  dress(what = school) %>%
  commute(transportation = bike)
From Hadley Wickham, how to think about piping tidy verbs.
Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head
The nursery rhyme could be created by a series of steps where the output from each step is saved as an object along the way.
foo_foo <- little_bunny()
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
Another approach is to concatenate the functions so that there is only one output.
bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice),
  on = head)
Or even worse, as one line:
bop(scoop(hop(foo_foo, through = forest), up = field_mice), on = head)
Instead, the code can be written using the pipe in the order in which the function is evaluated:
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)
flights2 %>%
  select(tailnum, speed) %>%
  group_by(tailnum) %>%
  summarize(number = n(), avg_speed = mean(speed, na.rm = TRUE)) %>%
  arrange(desc(avg_speed))
## # A tibble: 4,044 × 3
##    tailnum number avg_speed
##    <chr>    <int>     <dbl>
##  1 N228UA       1      501.
##  2 N315AS       1      499.
##  3 N654UA       1      499.
##  4 N819AW       1      490.
##  5 N382HA      26      486.
##  6 N388HA      36      484.
##  7 N391HA      21      484.
##  8 N777UA       1      483.
##  9 N385HA      28      483.
## 10 N392HA      13      482.
## # … with 4,034 more rows
Form a chain that creates a data frame containing only carrier and the mean departure delay time. Which carriers have the highest and lowest mean delays?
flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))
## # A tibble: 16 × 2
##    carrier avg_delay
##    <chr>       <dbl>
##  1 F9          20.2 
##  2 EV          20.0 
##  3 YV          19.0 
##  4 FL          18.7 
##  5 WN          17.7 
##  6 9E          16.7 
##  7 B6          13.0 
##  8 VX          12.9 
##  9 OO          12.6 
## 10 UA          12.1 
## 11 MQ          10.6 
## 12 DL           9.26
## 13 AA           8.59
## 14 AS           5.80
## 15 HA           4.90
## 16 US           3.78
Say you're curious about the relationship between the number of flights each plane made in 2013, the mean distance that each of those planes flew, and the mean arrival delay. You also want to exclude the edge cases from your analysis, so focus on the planes that have logged more than 20 flights and flown an average distance of less than 2000 miles. Please form the chain that creates this dataset.
delay_data <- flights %>%
  group_by(tailnum) %>%
  summarize(number = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(number > 20, dist < 2000)