Math 158 - Spring 2022
Jo Hardin (from Mine Çetinkaya-Rundel)
Goal: predict imdb_rating from the other variables in the dataset.
# A tibble: 188 × 6
season episode title imdb_rating total_votes air_date
<dbl> <dbl> <chr> <dbl> <dbl> <date>
1 1 1 Pilot 7.6 3706 2005-03-24
2 1 2 Diversity Day 8.3 3566 2005-03-29
3 1 3 Health Care 7.9 2983 2005-04-05
4 1 4 The Alliance 8.1 2886 2005-04-12
5 1 5 Basketball 8.4 3179 2005-04-19
6 1 6 Hot Girl 7.8 2852 2005-04-26
7 2 1 The Dundies 8.7 3213 2005-09-20
8 2 2 Sexual Harassment 8.2 2736 2005-09-27
9 2 3 Office Olympics 8.4 2742 2005-10-04
10 2 4 The Fire 8.4 2713 2005-10-11
# … with 178 more rows
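The training set office_train used below is never created in this excerpt; it presumably comes from an initial split of these 188 episodes. A minimal sketch, assuming the full data frame is named office (with the default prop = 3/4, the 188 rows yield 141 training rows, which matches the 94/47 analysis/assessment sizes in the folds shown later; the seed value is an assumption):
library(tidymodels)

set.seed(123)                           # seed value is an assumption
office_split <- initial_split(office)   # default prop = 3/4 -> 141 training rows
office_train <- training(office_split)
office_test  <- testing(office_split)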
We'll compare two recipes. Recipe 1 uses title as an ID variable, doesn't use air_date as a predictor, and treats season as a factor variable. Recipe 2 uses title as an ID variable, doesn't use air_date as a predictor, and keeps season as numeric.
office_rec1 <- recipe(imdb_rating ~ ., data = office_train) %>%
  update_role(title, new_role = "id") %>%                   # title identifies episodes; not a predictor
  step_rm(air_date) %>%                                     # drop air_date
  step_num2factor(season, levels = as.character(1:9)) %>%   # convert season (1-9) to a factor
  step_dummy(all_nominal_predictors()) %>%                  # indicator variables for factor predictors
  step_zv(all_predictors())                                 # drop zero-variance predictors
office_rec1
Data Recipe
Inputs:
role #variables
id 1
outcome 1
predictor 4
Operations:
Delete terms air_date
Factor variables from season
Dummy variables from all_nominal_predictors()
Zero variance filter on all_predictors()
In the second recipe, season is kept as numeric:
office_rec2 <- recipe(imdb_rating ~ ., data = office_train) %>%
  update_role(title, new_role = "id") %>%        # same ID role for title
  step_rm(air_date) %>%                          # drop air_date
  step_dummy(all_nominal_predictors()) %>%       # indicator variables for factor predictors
  step_zv(all_predictors())                      # drop zero-variance predictors
office_rec2
Data Recipe
Inputs:
role #variables
id 1
outcome 1
predictor 4
Operations:
Delete terms air_date
Dummy variables from all_nominal_predictors()
Zero variance filter on all_predictors()
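Each recipe is then bundled with the same linear model specification into a workflow. The printouts below are consistent with a construction like this sketch (the object names office_mod, office_wflow1, and office_wflow2 are assumptions, not shown in the excerpt):
office_mod <- linear_reg() %>%
  set_engine("lm")                # ordinary least squares via lm

office_wflow1 <- workflow() %>%
  add_model(office_mod) %>%
  add_recipe(office_rec1)         # season as factor

office_wflow2 <- workflow() %>%
  add_model(office_mod) %>%
  add_recipe(office_rec2)         # season as numeric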
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_rm()
• step_num2factor()
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_rm()
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Actually, not so fast! We shouldn't use the test set to compare models; instead, we resample the training data.
Resampling is conducted only on the training set; the test set is not involved. For each iteration of resampling, the data are partitioned into two subsamples: the analysis set, used to fit the model, and the assessment set, used to evaluate model performance.
Source: Kuhn and Silge. Tidy modeling with R.
More specifically, v-fold cross-validation is a commonly used resampling technique. Let's walk through an example with v = 3.
Randomly split your training data into 3 partitions:
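A sketch of how the folds and the resampling fits below can be produced with rsample and tune (the seed and the object names office_folds, office_fit_rs1, and office_fit_rs2 are assumptions):
set.seed(345)                                  # seed value is an assumption
office_folds <- vfold_cv(office_train, v = 3)  # 3 folds: 94 analysis / 47 assessment rows each

office_fit_rs1 <- office_wflow1 %>%
  fit_resamples(office_folds)                  # fit and evaluate model 1 on each fold
office_fit_rs2 <- office_wflow2 %>%
  fit_resamples(office_folds)                  # same for model 2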
Model 1 (season as factor):
# Resampling results
# 3-fold cross-validation
# A tibble: 3 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [94/47]> Fold1 <tibble [2 × 4]> <tibble [0 × 1]>
2 <split [94/47]> Fold2 <tibble [2 × 4]> <tibble [0 × 1]>
3 <split [94/47]> Fold3 <tibble [2 × 4]> <tibble [0 × 1]>
Model 2 (season as numeric):
# Resampling results
# 3-fold cross-validation
# A tibble: 3 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [94/47]> Fold1 <tibble [2 × 4]> <tibble [0 × 1]>
2 <split [94/47]> Fold2 <tibble [2 × 4]> <tibble [0 × 1]>
3 <split [94/47]> Fold3 <tibble [2 × 4]> <tibble [0 × 1]>
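The fold-averaged summaries below are what collect_metrics() returns; a sketch, reusing the assumed object names from above:
office_fit_rs1 %>% collect_metrics()   # average RMSE and R-squared across the 3 folds
office_fit_rs2 %>% collect_metrics()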
Model 1:
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.373 3 0.0324 Preprocessor1_Model1
2 rsq standard 0.574 3 0.0614 Preprocessor1_Model1
Model 2:
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.407 3 0.00971 Preprocessor1_Model1
2 rsq standard 0.438 3 0.0857 Preprocessor1_Model1
Choose Model 1: it has both a lower mean RMSE (0.373 vs. 0.407) and a higher mean rsq (0.574 vs. 0.438) across the folds.
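The per-fold table below, and the cv_metrics1 object summarized afterwards, can be produced by turning off the averaging in collect_metrics(); a sketch:
cv_metrics1 <- office_fit_rs1 %>%
  collect_metrics(summarize = FALSE)   # one row per fold per metric
cv_metrics1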
# A tibble: 6 × 5
id .metric .estimator .estimate .config
<chr> <chr> <chr> <dbl> <chr>
1 Fold1 rmse standard 0.320 Preprocessor1_Model1
2 Fold1 rsq standard 0.687 Preprocessor1_Model1
3 Fold2 rmse standard 0.368 Preprocessor1_Model1
4 Fold2 rsq standard 0.476 Preprocessor1_Model1
5 Fold3 rmse standard 0.432 Preprocessor1_Model1
6 Fold3 rsq standard 0.558 Preprocessor1_Model1
Cross-validation RMSE stats:
cv_metrics1 %>%
filter(.metric == "rmse") %>%
summarise(
min = min(.estimate),
max = max(.estimate),
mean = mean(.estimate),
sd = sd(.estimate)
)
# A tibble: 1 × 4
min max mean sd
<dbl> <dbl> <dbl> <dbl>
1 0.320 0.432 0.373 0.0562
Training data IMDB score stats:
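The table of training-set statistics is not reproduced in this excerpt; it could be computed in the same style as the CV summary above, to compare the RMSE to the scale and variability of the outcome:
office_train %>%
  summarise(
    min = min(imdb_rating),
    max = max(imdb_rating),
    mean = mean(imdb_rating),
    sd = sd(imdb_rating)
  )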
To illustrate how CV works, we used v = 3. This was useful for illustrative purposes, but v = 3 is a poor choice in practice. Values of v are most often 5 or 10; we generally prefer 10-fold cross-validation as a default.
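Switching to the preferred 10-fold setup is a one-argument change (the object name is an assumption):
office_folds_10 <- vfold_cv(office_train, v = 10)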
\[ \text{prediction error} = \text{irreducible error} + \text{bias}^2 + \text{variance} \]
irreducible error is the natural variability that comes with observations.
bias of the model represents the difference between the true model and a model that is too simple. The more complicated the model, the closer its predictions are to the observed points, and the lower the bias.
variance represents the variability of the model from sample to sample. A simple model does not change much from sample to sample, so it has low variance; a more flexible model can change a lot.
Credit: An Introduction to Statistical Learning, James et al.