Reproducibility
In May 2015, Science retracted a study, published just five months earlier, about how canvassers can sway people's opinions on gay marriage.
Science Editor-in-Chief Marcia McNutt:
Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.
Methods we’ll discuss can’t prevent this, but they can make it easier to discover issues.
The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.
The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.
Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].
Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].
In a follow-up project, the authors realized their results were the opposite of those in the literature.
The first author, Gunzenhauser, told Retraction Watch:
After some research I found there had been a systematic coding error during data entry (the 0/1 coding for “boy” and “girl” that we had on the paper questionnaires was opposite to the one in the SPSS labels).
Scriptability → R [in contrast to pull down menus]
Literate programming → R Markdown [in contrast to multiple files]
Version control → Git / GitHub [in contrast to multiple versions]
Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.
Taken from [ModernDive: An Introduction to Statistical and Data Sciences via R](https://ismayc.github.io/moderndiver-book/), by Ismay and Kim
[Jessica Ward](https://jkrward.github.io/), PhD student at Newcastle University
On GitHub (on the web) edit the README document and Commit it with a message describing what you did.
Then, in RStudio also edit the README document with a different change.
As you work in teams you will run into merge conflicts; learning how to resolve them properly will be very important.
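When both versions change the same lines, Git cannot merge them automatically and instead marks the README with conflict markers. A minimal sketch of what the conflicted file might look like (the text between the markers is just an illustration); keep the version you want, or combine the two, delete the marker lines, then commit:

<<<<<<< HEAD
Change made in RStudio.
=======
Change made on GitHub.
>>>>>>> origin/master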
The Active Duty data are not tidy! What are the cases? How are the data not tidy? What might the data look like in tidy form? Suppose that the case was "an individual in the armed forces." What variables would you use to capture the information in the following table?
image credit: https://www.tidyverse.org/.
Hosted online:
movies <- read_csv("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/movies2.csv")
Hosted locally:
movies <- read_csv("movies2.csv")
Things to note:
- `View()` can be used in RStudio to bring up an Excel-style spreadsheet. Only for viewing, not editing!
- `View()` has a capital letter V.
- `View()` should not be used in an Rmd document.
- `head()` can be used to print the first several lines of the dataset to the console.
- `dim()` can be used to find the dimensions (rows, then columns).
- `names()` can be used to find the names of the variables.

Running into problems? Ask your neighbor or try Google!
dim(movies)
## [1] 134 5
head(movies,3)
## # A tibble: 3 × 5
##   ...1             score2 rating2 genre2     `box office2`
##   <chr>             <dbl> <chr>   <chr>              <dbl>
## 1 2 Fast 2 Furious   48.9 PG-13   action             127. 
## 2 28 Days Later      78.2 R       horror              45.1
## 3 A Guy Thing        39.5 PG-13   rom comedy          15.5
names(movies)
## [1] "...1" "score2" "rating2" "genre2" ## [5] "box office2"
For now, we'll work with all flights out of the three NYC airports in 2013.
Download and install the package from CRAN (done in the Console, only once).
install.packages("nycflights13")
Load the package (in the .Rmd file; it is needed for the .Rmd file to compile appropriately).
library(nycflights13)
Make the data set visible.
data(flights)
Get help.
?flights
Whenever you're learning a new tool, for a long time you're going to suck ... but the good news is that is typical, that's something that happens to everyone, and it's only temporary.
-Hadley Wickham
Data sets are often of high volume (lots of rows) and high variety (lots of columns). This is overwhelming to visualize and analyze, so we find ourselves chopping the data set up into more manageable and meaningful chunks. We also often need to perform operations to organize and clean our data.
This is all possible in base R, but with the tidyverse, it is simple, readable, and fast.
filter()
arrange()
select()
distinct()
mutate()
summarize()
sample_n()
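To make the base R comparison above concrete, here is one small sketch using the flights data loaded earlier (the object names are just for illustration); the dplyr verbs themselves are introduced one by one below.

# Base R: flights longer than 2000 miles
long_base  <- flights[flights$distance > 2000, ]

# dplyr: the same subset, reading almost like a sentence
long_dplyr <- filter(flights, distance > 2000)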
filter()
Allows you to select a subset of the rows of a data frame. The first argument is the name of the data frame; the following arguments are the filters that you'd like to apply.
For all flights on January 1st:
filter(flights, month == 1, day == 1)
## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 832 more rows, and 12 more variables:
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
Filters are constructed of logical operators: `<`, `>`, `<=`, `>=`, `==`, `!=` (and some others). Adding them one by one to `filter()` is akin to saying "this AND that". To say "this OR that OR both", use `|`.
filter(flights, month == 1 | month == 2)
## # A tibble: 51,955 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 51,945 more rows, and 12 more variables:
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
Construct filters to isolate:
filter(flights, month == 3, day == 17)              # flights on March 17
filter(flights, dest == "ORD")                      # flights to Chicago O'Hare
filter(flights, dest == "ORD", carrier == "UA")     # United flights to Chicago O'Hare
filter(flights, distance > 2000 | air_time > 5*60)  # flights over 2000 miles OR over 5 hours in the air
arrange()
`arrange()` reorders the rows: it takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day)
Use `desc()` to sort in descending order.
arrange(flights, desc(arr_delay))
select()
Often you work with large datasets with many columns where only a few are actually of interest to you. `select()` allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:
select(flights, year, month, day)
You can exclude columns using `-` and specify a range of variables using `:`.
select(flights, -(year:day))
distinct()
A common use of `select()` is to find out which values a set of variables takes. This is particularly useful in conjunction with the `distinct()` verb, which only returns the unique values in a table.
What do the following data correspond to?
distinct(select(flights, origin, dest))
## # A tibble: 224 × 2
##    origin dest 
##    <chr>  <chr>
##  1 EWR    IAH  
##  2 LGA    IAH  
##  3 JFK    MIA  
##  4 JFK    BQN  
##  5 LGA    ATL  
##  6 EWR    ORD  
##  7 EWR    FLL  
##  8 LGA    IAD  
##  9 JFK    MCO  
## 10 LGA    ORD  
## # … with 214 more rows
mutate()
As well as selecting from the set of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of `mutate()`:
select(mutate(flights, gain = dep_delay - arr_delay), flight, dep_delay, arr_delay, gain)
## # A tibble: 336,776 × 4
##    flight dep_delay arr_delay  gain
##     <int>     <dbl>     <dbl> <dbl>
##  1   1545         2        11    -9
##  2   1714         4        20   -16
##  3   1141         2        33   -31
##  4    725        -1       -18    17
##  5    461        -6       -25    19
##  6   1696        -4        12   -16
##  7    507        -5        19   -24
##  8   5708        -3       -14    11
##  9     79        -3        -8     5
## 10    301        -2         8   -10
## # … with 336,766 more rows
summarize() and sample_n()
`summarize()` collapses a data frame to a single row. It's not very useful yet.
`sample_n()` provides you with a random sample of the rows.
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6
sample_n(flights, 10)
## # A tibble: 10 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     7    27      612            615        -3      856
##  2  2013     1    13     1405           1130       155     1649
##  3  2013     8    12     1333           1340        -7     1648
##  4  2013     7    25      957           1000        -3     1132
##  5  2013     2     7     2211           2058        73     2314
##  6  2013    11    16     1951           1929        22     2300
##  7  2013     8    13     1828           1828         0     2043
##  8  2013    10    17       NA           1822        NA       NA
##  9  2013    11     4      830            837        -7      932
## 10  2013     3    31     1815           1820        -5     2157
## # … with 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
Mutate the data to create a new column that contains the average speed traveled by the plane for each flight.
Select the new variable and save it, along with tailnum, as a new data frame object.
flights2 <- mutate(flights, speed = distance/(air_time/60))
speed_data <- select(flights2, tailnum, speed)
group_by()
`summarize()` and `sample_n()` are even more powerful when combined with the idea of "group by", repeating the operation separately on groups of observations within the dataset. The `group_by()` function describes how to break a dataset down into groups of rows.
group_by()
Find the fastest airplanes in the bunch, measured as the average speed per airplane.
by_tailnum <- group_by(speed_data, tailnum)
avg_speed <- summarize(by_tailnum, count = n(), avg_speed = mean(speed, na.rm = TRUE))
arrange(avg_speed, desc(avg_speed))
## # A tibble: 4,044 × 3
##    tailnum count avg_speed
##    <chr>   <int>     <dbl>
##  1 N228UA      1      501.
##  2 N315AS      1      499.
##  3 N654UA      1      499.
##  4 N819AW      1      490.
##  5 N382HA     26      486.
##  6 N388HA     36      484.
##  7 N391HA     21      484.
##  8 N777UA      1      483.
##  9 N385HA     28      483.
## 10 N392HA     13      482.
## # … with 4,034 more rows
Instead of applying each verb step-by-step, we can chain them into a single data pipeline, connected with the `%>%` operator. You start the pipeline with a data frame and then pass it to each function in turn.
The pipe syntax (`%>%`) takes a data frame and sends it to the argument of a function. The mapping goes to the first available argument in the function. For example:

- `x %>% f(y)` is the same as `f(x, y)`
- `y %>% f(x, ., z)` is the same as `f(x, y, z)`
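A minimal sketch of the dot placeholder in practice (the model here is just an illustration): `lm()` wants a formula as its first argument, so the `.` routes the piped data frame to the `data` argument instead.

# the piped data frame goes where the dot is, not to the first argument
flights %>% lm(arr_delay ~ dep_delay, data = .)
# equivalent to: lm(arr_delay ~ dep_delay, data = flights)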
step1 <- dress(me, what = sports)
step2 <- exercise(step1, how = running)
step3 <- eat(step2, choice = cereal)
step4 <- dress(step3, what = school)
step5 <- commute(step4, transportation = bike)
commute(dress(eat(exercise(dress(me, what = sports), how = running), choice = cereal), what = school), transportation = bike)
(better??)
commute(
  dress(
    eat(
      exercise(
        dress(me, what = sports),
        how = running),
      choice = cereal),
    what = school),
  transportation = bike)
me %>%
  dress(what = sports) %>%
  exercise(how = running) %>%
  eat(choice = cereal) %>%
  dress(what = school) %>%
  commute(transportation = bike)
From Hadley Wickham, how to think about piping tidy verbs.
Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head
The nursery rhyme could be created by a series of steps where the output from each step is saved as an object along the way.
foo_foo <- little_bunny()
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
Another approach is to concatenate the functions so that there is only one output.
bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice),
  on = head)
Or even worse, as one line:
bop(scoop(hop(foo_foo, through = forest), up = field_mice), on = head)
Instead, the code can be written using the pipe in the order in which the function is evaluated:
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)
flights2 %>%
  select(tailnum, speed) %>%
  group_by(tailnum) %>%
  summarize(number = n(), avg_speed = mean(speed, na.rm = TRUE)) %>%
  arrange(desc(avg_speed))
## # A tibble: 4,044 × 3
##    tailnum number avg_speed
##    <chr>    <int>     <dbl>
##  1 N228UA       1      501.
##  2 N315AS       1      499.
##  3 N654UA       1      499.
##  4 N819AW       1      490.
##  5 N382HA      26      486.
##  6 N388HA      36      484.
##  7 N391HA      21      484.
##  8 N777UA       1      483.
##  9 N385HA      28      483.
## 10 N392HA      13      482.
## # … with 4,034 more rows
Form a chain that creates a data frame containing only carrier and the mean departure delay time. Which carriers have the highest and lowest mean delays?
flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))
## # A tibble: 16 × 2
##    carrier avg_delay
##    <chr>       <dbl>
##  1 F9          20.2 
##  2 EV          20.0 
##  3 YV          19.0 
##  4 FL          18.7 
##  5 WN          17.7 
##  6 9E          16.7 
##  7 B6          13.0 
##  8 VX          12.9 
##  9 OO          12.6 
## 10 UA          12.1 
## 11 MQ          10.6 
## 12 DL           9.26
## 13 AA           8.59
## 14 AS           5.80
## 15 HA           4.90
## 16 US           3.78
Say you're curious about the relationship between the number of flights each plane made in 2013, the mean distance that each of those planes flew, and the mean arrival delay. You also want to exclude the edge cases from your analysis, so focus on the planes that have logged more than 20 flights and flown an average distance of less than 2000 miles. Please form the chain that creates this dataset.
delay_data <- flights %>%
  group_by(tailnum) %>%
  summarize(number = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(number > 20, dist < 2000)