Chapter 6 Statistical models

In this chapter, we will not learn about all the models out there that you may or may not need. Instead, I will show you how you can use what you have learned so far and apply these concepts to modeling. Also, as you read in the beginning of the book, R has many, many packages, so the model you need is most probably already implemented in some package and you will very likely not need to write your own from scratch.

In the first section, I will discuss the terminology used in this book. Then I will discuss linear regression; showing how linear regression works illustrates very well how other models work too, without loss of generality. Then I will introduce the concept of hyper-parameters with ridge regression. This chapter will then finish with an introduction to cross-validation as a way to tune the hyper-parameters of models that feature them.

6.1 Terminology

Before we continue discussing statistical models and model fitting, it is worthwhile to discuss terminology a little bit. Depending on your background, you might call an explanatory variable a feature, or the dependent variable the target. These are the same objects. The matrix of features is usually called a design matrix, and what statisticians call the intercept in a linear regression, machine learning engineers call the bias. Calling the intercept the bias is unfortunate, as bias also has a very different meaning; bias is also what we call the error in a model that may cause biased estimates. To finish up, the estimated parameters of the model may be called coefficients or weights. Here again, I don’t like using weight, as weight has a very different meaning in statistics. So, in the remainder of this chapter, and of the book, I will use the terminology from the statistical literature: dependent and explanatory variables (y and x), and the estimated parameters will be called coefficients and the intercept… well, the intercept (the \(\beta\)s of the model). However, I will talk of training a model, instead of estimating a model.

6.2 Fitting a model to data

Suppose you have a variable y that you wish to explain using a set of other variables x1, x2, x3, etc. Let’s take a look at the Housing dataset from the Ecdat package:

library(Ecdat)

data(Housing)

You can read a description of the dataset by running:

?Housing
Housing                 package:Ecdat                  R Documentation

Sales Prices of Houses in the City of Windsor

Description:

     a cross-section from 1987

     _number of observations_ : 546

     _observation_ : goods

     _country_ : Canada

Usage:

     data(Housing)

Format:

     A dataframe containing :

     price: sale price of a house

     lotsize: the lot size of a property in square feet

     bedrooms: number of bedrooms

     bathrms: number of full bathrooms

     stories: number of stories excluding basement

     driveway: does the house has a driveway ?

     recroom: does the house has a recreational room ?

     fullbase: does the house has a full finished basement ?

     gashw: does the house uses gas for hot water heating ?

     airco: does the house has central air conditioning ?

     garagepl: number of garage places

     prefarea: is the house located in the preferred neighbourhood of the city ?

Source:

     Anglin, P.M.  and R.  Gencay (1996) “Semiparametric estimation of
     a hedonic price function”, _Journal of Applied Econometrics_,
     *11(6)*, 633-648.

References:

     Verbeek, Marno (2004) _A Guide to Modern Econometrics_, John Wiley
     and Sons, chapter 3.

     Journal of Applied Econometrics data archive : <URL:
     http://qed.econ.queensu.ca/jae/>.

See Also:

     ‘Index.Source’, ‘Index.Economics’, ‘Index.Econometrics’,
     ‘Index.Observations’

or by looking for Housing in the help pane of RStudio. Usually, you would take a look at the data before doing any modeling:

glimpse(Housing)
## Observations: 546
## Variables: 12
## $ price    <dbl> 42000, 38500, 49500, 60500, 61000, 66000, 66000, 69000,…
## $ lotsize  <dbl> 5850, 4000, 3060, 6650, 6360, 4160, 3880, 4160, 4800, 5…
## $ bedrooms <dbl> 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 4, 1…
## $ bathrms  <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1…
## $ stories  <dbl> 2, 1, 1, 2, 1, 1, 2, 3, 1, 4, 1, 1, 2, 1, 1, 1, 2, 3, 1…
## $ driveway <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
## $ recroom  <fct> no, no, no, yes, no, yes, no, no, yes, yes, no, no, no,…
## $ fullbase <fct> yes, no, no, no, no, yes, yes, no, yes, no, yes, no, no…
## $ gashw    <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
## $ airco    <fct> no, no, no, no, no, yes, no, no, no, yes, yes, no, no, …
## $ garagepl <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, 0, 0…
## $ prefarea <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no,…

Housing prices depend on a set of variables such as the number of bedrooms, the area in which the house is located, and so on. If you believe that housing prices depend linearly on a set of explanatory variables, you will want to estimate a linear model. To estimate a linear model, you can use the built-in lm() function:

model1 <- lm(price ~ lotsize + bedrooms, data = Housing)

lm() takes a formula as an argument, which defines the model you want to estimate. In this case, I ran the following regression:

\[ \text{price} = \beta_0 + \beta_1 * \text{lotsize} + \beta_2 * \text{bedrooms} + \varepsilon \]

where \(\beta_0, \beta_1\) and \(\beta_2\) are three parameters to estimate. To take a look at the results, you can use the summary() method (not to be confused with dplyr::summarise()):

summary(model1)
## 
## Call:
## lm(formula = price ~ lotsize + bedrooms, data = Housing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65665 -12498  -2075   8970  97205 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.613e+03  4.103e+03   1.368    0.172    
## lotsize     6.053e+00  4.243e-01  14.265  < 2e-16 ***
## bedrooms    1.057e+04  1.248e+03   8.470 2.31e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21230 on 543 degrees of freedom
## Multiple R-squared:  0.3703, Adjusted R-squared:  0.3679 
## F-statistic: 159.6 on 2 and 543 DF,  p-value: < 2.2e-16

If you wish to remove the intercept (the \(\beta_0\) in the equation above) from your model, you can do so with -1:

model2 <- lm(price ~ -1 + lotsize + bedrooms, data = Housing)

summary(model2)
## 
## Call:
## lm(formula = price ~ -1 + lotsize + bedrooms, data = Housing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67229 -12342  -1333   9627  95509 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## lotsize      6.283      0.390   16.11   <2e-16 ***
## bedrooms 11968.362    713.194   16.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21250 on 544 degrees of freedom
## Multiple R-squared:  0.916,  Adjusted R-squared:  0.9157 
## F-statistic:  2965 on 2 and 544 DF,  p-value: < 2.2e-16

or if you want to use all the columns inside Housing:

model3 <- lm(price ~ ., data = Housing)

summary(model3)
## 
## Call:
## lm(formula = price ~ ., data = Housing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41389  -9307   -591   7353  74875 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4038.3504  3409.4713  -1.184 0.236762    
## lotsize         3.5463     0.3503  10.124  < 2e-16 ***
## bedrooms     1832.0035  1047.0002   1.750 0.080733 .  
## bathrms     14335.5585  1489.9209   9.622  < 2e-16 ***
## stories      6556.9457   925.2899   7.086 4.37e-12 ***
## drivewayyes  6687.7789  2045.2458   3.270 0.001145 ** 
## recroomyes   4511.2838  1899.9577   2.374 0.017929 *  
## fullbaseyes  5452.3855  1588.0239   3.433 0.000642 ***
## gashwyes    12831.4063  3217.5971   3.988 7.60e-05 ***
## aircoyes    12632.8904  1555.0211   8.124 3.15e-15 ***
## garagepl     4244.8290   840.5442   5.050 6.07e-07 ***
## prefareayes  9369.5132  1669.0907   5.614 3.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15420 on 534 degrees of freedom
## Multiple R-squared:  0.6731, Adjusted R-squared:  0.6664 
## F-statistic: 99.97 on 11 and 534 DF,  p-value: < 2.2e-16

You can access different elements of model3 (for example) with $, because the result of lm() is a list:

print(model3$coefficients)
##  (Intercept)      lotsize     bedrooms      bathrms      stories 
## -4038.350425     3.546303  1832.003466 14335.558468  6556.945711 
##  drivewayyes   recroomyes  fullbaseyes     gashwyes     aircoyes 
##  6687.778890  4511.283826  5452.385539 12831.406266 12632.890405 
##     garagepl  prefareayes 
##  4244.829004  9369.513239

but I prefer to use the {broom} package, and more specifically the tidy() function, which converts model3 into a neat data.frame:

results3 <- broom::tidy(model3)

glimpse(results3)
## Observations: 12
## Variables: 5
## $ term      <chr> "(Intercept)", "lotsize", "bedrooms", "bathrms", "stor…
## $ estimate  <dbl> -4038.350425, 3.546303, 1832.003466, 14335.558468, 655…
## $ std.error <dbl> 3409.4713, 0.3503, 1047.0002, 1489.9209, 925.2899, 204…
## $ statistic <dbl> -1.184451, 10.123618, 1.749764, 9.621691, 7.086369, 3.…
## $ p.value   <dbl> 2.367616e-01, 3.732442e-22, 8.073341e-02, 2.570369e-20…

I explicitly write broom::tidy() because tidy() is a popular function name. For instance, it is also a function from the {yardstick} package, which does not do the same thing at all. Since I will also be using {yardstick}, I prefer explicitly writing broom::tidy() to avoid conflicts.

Using broom::tidy() is useful, because you can then work on the results easily, for example if you wish to only keep results that are significant at the 5% level:

results3 %>%
  filter(p.value < 0.05)
## # A tibble: 10 x 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 lotsize         3.55     0.350     10.1  3.73e-22
##  2 bathrms     14336.    1490.         9.62 2.57e-20
##  3 stories      6557.     925.         7.09 4.37e-12
##  4 drivewayyes  6688.    2045.         3.27 1.15e- 3
##  5 recroomyes   4511.    1900.         2.37 1.79e- 2
##  6 fullbaseyes  5452.    1588.         3.43 6.42e- 4
##  7 gashwyes    12831.    3218.         3.99 7.60e- 5
##  8 aircoyes    12633.    1555.         8.12 3.15e-15
##  9 garagepl     4245.     841.         5.05 6.07e- 7
## 10 prefareayes  9370.    1669.         5.61 3.19e- 8

You can even add new columns, such as the confidence intervals:

results3 <- broom::tidy(model3, conf.int = TRUE, conf.level = 0.95)

print(results3)
## # A tibble: 12 x 7
##    term         estimate std.error statistic  p.value  conf.low conf.high
##    <chr>           <dbl>     <dbl>     <dbl>    <dbl>     <dbl>     <dbl>
##  1 (Intercept)  -4038.    3409.        -1.18 2.37e- 1 -10736.     2659.  
##  2 lotsize          3.55     0.350     10.1  3.73e-22      2.86      4.23
##  3 bedrooms      1832.    1047.         1.75 8.07e- 2   -225.     3889.  
##  4 bathrms      14336.    1490.         9.62 2.57e-20  11409.    17262.  
##  5 stories       6557.     925.         7.09 4.37e-12   4739.     8375.  
##  6 drivewayyes   6688.    2045.         3.27 1.15e- 3   2670.    10705.  
##  7 recroomyes    4511.    1900.         2.37 1.79e- 2    779.     8244.  
##  8 fullbaseyes   5452.    1588.         3.43 6.42e- 4   2333.     8572.  
##  9 gashwyes     12831.    3218.         3.99 7.60e- 5   6511.    19152.  
## 10 aircoyes     12633.    1555.         8.12 3.15e-15   9578.    15688.  
## 11 garagepl      4245.     841.         5.05 6.07e- 7   2594.     5896.  
## 12 prefareayes   9370.    1669.         5.61 3.19e- 8   6091.    12648.

Going back to model estimation, you can of course use lm() in a pipe workflow:

Housing %>%
  select(-driveway, -stories) %>%
  lm(price ~ ., data = .) %>%
  broom::tidy()
## # A tibble: 10 x 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)  3025.    3263.        0.927 3.54e- 1
##  2 lotsize         3.67     0.363    10.1   4.52e-22
##  3 bedrooms     4140.    1036.        3.99  7.38e- 5
##  4 bathrms     16443.    1546.       10.6   4.29e-24
##  5 recroomyes   5660.    2010.        2.82  5.05e- 3
##  6 fullbaseyes  2241.    1618.        1.38  1.67e- 1
##  7 gashwyes    13568.    3411.        3.98  7.93e- 5
##  8 aircoyes    15578.    1597.        9.75  8.53e-21
##  9 garagepl     4232.     883.        4.79  2.12e- 6
## 10 prefareayes 10729.    1753.        6.12  1.81e- 9

The first . in the lm() function indicates that we wish to use all the data from Housing (minus driveway and stories, which I removed using select() and the - sign), and the second . indicates where the result of the two preceding dplyr instructions should be placed. The picture below should help you understand:

knitr::include_graphics("pics/pipe_to_second_position.png")

You have to specify this, because by default, when using %>% the left hand side argument gets passed as the first argument of the function on the right hand side.
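
To make this clearer, here is a toy illustration of the default behaviour and of the . placeholder (not part of the Housing analysis):

# By default, the left hand side is passed as the first argument:
c(1, 4, 9) %>%
  sqrt()

# which is the same as sqrt(c(1, 4, 9)). With the . placeholder, the left
# hand side can be sent to another argument than the first one:
Housing %>%
  lm(price ~ lotsize, data = .)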

6.3 Diagnostics

Diagnostics are useful metrics to assess model fit. You can read some of these diagnostics, such as the \(R^2\) at the bottom of the summary (when running summary(my_model)), but if you want to do more than simply reading these diagnostics from RStudio, you can put those in a data.frame too, using broom::glance():

glance(model3)
## # A tibble: 1 x 11
##   r.squared adj.r.squared  sigma statistic   p.value    df logLik    AIC
##       <dbl>         <dbl>  <dbl>     <dbl>     <dbl> <int>  <dbl>  <dbl>
## 1     0.673         0.666 15423.     100.0 6.18e-122    12 -6034. 12094.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>

You can also plot the usual diagnostics plots using ggfortify::autoplot() which uses the ggplot2 package under the hood:

library(ggfortify)

autoplot(model3, which = 1:6) + theme_minimal()

which = 1:6 is an additional option that shows you all the diagnostic plots. If you omit this option, you will only get 4 of them.

You can also get the residuals of the regression in two ways; either you grab them directly from the model fit:

resi3 <- residuals(model3)

or you can augment the original data with a residuals column, using broom::augment():

housing_aug <- augment(model3)

Let’s take a look at housing_aug:

glimpse(housing_aug)
## Observations: 546
## Variables: 19
## $ price      <dbl> 42000, 38500, 49500, 60500, 61000, 66000, 66000, 6900…
## $ lotsize    <dbl> 5850, 4000, 3060, 6650, 6360, 4160, 3880, 4160, 4800,…
## $ bedrooms   <dbl> 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 4,…
## $ bathrms    <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1,…
## $ stories    <dbl> 2, 1, 1, 2, 1, 1, 2, 3, 1, 4, 1, 1, 2, 1, 1, 1, 2, 3,…
## $ driveway   <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ recroom    <fct> no, no, no, yes, no, yes, no, no, yes, yes, no, no, n…
## $ fullbase   <fct> yes, no, no, no, no, yes, yes, no, yes, no, yes, no, …
## $ gashw      <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, n…
## $ airco      <fct> no, no, no, no, no, yes, no, no, no, yes, yes, no, no…
## $ garagepl   <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, 0,…
## $ prefarea   <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, n…
## $ .fitted    <dbl> 66037.98, 41391.15, 39889.63, 63689.09, 49760.43, 663…
## $ .se.fit    <dbl> 1790.507, 1406.500, 1534.102, 2262.056, 1567.689, 236…
## $ .resid     <dbl> -24037.9757, -2891.1515, 9610.3699, -3189.0873, 11239…
## $ .hat       <dbl> 0.013477335, 0.008316321, 0.009893730, 0.021510891, 0…
## $ .sigma     <dbl> 15402.01, 15437.14, 15431.98, 15437.02, 15429.89, 154…
## $ .cooksd    <dbl> 2.803214e-03, 2.476265e-05, 3.265481e-04, 8.004787e-0…
## $ .std.resid <dbl> -1.56917096, -0.18823924, 0.62621736, -0.20903274, 0.…

A few columns have been added to the original data, among them .resid which contains the residuals. Let’s plot them:

ggplot(housing_aug) +
  geom_density(aes(.resid))

Fitted values are also added to the original data, under the variable .fitted. It would also have been possible to get the fitted values with:

fit3 <- fitted(model3)

but I prefer using augment(), because the columns get merged to the original data, which then makes it easier to find specific individuals. For example, you might want to know for which housing units the model underestimates the price:

total_pos <- housing_aug %>%
  filter(.resid > 0) %>%
  summarise(total = n()) %>%
  pull(total)

we find 261 individuals where the residuals are positive. It is also easier to extract outliers:

housing_aug %>%
  mutate(prank = cume_dist(.cooksd)) %>%
  filter(prank > 0.99) %>%
  glimpse()
## Observations: 6
## Variables: 20
## $ price      <dbl> 163000, 125000, 132000, 175000, 190000, 174500
## $ lotsize    <dbl> 7420, 4320, 3500, 9960, 7420, 7500
## $ bedrooms   <dbl> 4, 3, 4, 3, 4, 4
## $ bathrms    <dbl> 1, 1, 2, 2, 2, 2
## $ stories    <dbl> 2, 2, 2, 2, 3, 2
## $ driveway   <fct> yes, yes, yes, yes, yes, yes
## $ recroom    <fct> yes, no, no, no, no, no
## $ fullbase   <fct> yes, yes, no, yes, no, yes
## $ gashw      <fct> no, yes, yes, no, no, no
## $ airco      <fct> yes, no, no, no, yes, yes
## $ garagepl   <dbl> 2, 2, 2, 2, 2, 3
## $ prefarea   <fct> no, no, no, yes, yes, yes
## $ .fitted    <dbl> 94826.68, 77688.37, 85495.58, 108563.18, 115125.03, 1…
## $ .se.fit    <dbl> 2520.691, 3551.954, 3544.961, 2589.680, 2185.603, 258…
## $ .resid     <dbl> 68173.32, 47311.63, 46504.42, 66436.82, 74874.97, 559…
## $ .hat       <dbl> 0.02671105, 0.05303793, 0.05282929, 0.02819317, 0.020…
## $ .sigma     <dbl> 15144.70, 15293.34, 15298.27, 15159.14, 15085.99, 152…
## $ .cooksd    <dbl> 0.04590995, 0.04637969, 0.04461464, 0.04616068, 0.041…
## $ .std.resid <dbl> 4.480428, 3.152300, 3.098176, 4.369631, 4.904193, 3.6…
## $ prank      <dbl> 0.9963370, 1.0000000, 0.9945055, 0.9981685, 0.9926740…

prank is a variable I created with cume_dist() which is a dplyr function that returns the proportion of all values less than or equal to the current rank. For example:

example <- c(5, 4.6, 2, 1, 0.8, 0, -1)
cume_dist(example)
## [1] 1.0000000 0.8571429 0.7142857 0.5714286 0.4285714 0.2857143 0.1428571

by filtering prank > 0.99 we get the top 1% of outliers according to Cook’s distance.

6.4 Interpreting models

Model interpretation is essential in the social sciences. If one wants to know the effect of variable x on the dependent variable y, marginal effects have to be computed. This is easily done in R with the {margins} package, which aims to provide the same functionality as the margins command in Stata:

library(margins)

effects_model3 <- margins(model3)

summary(effects_model3)
##       factor        AME        SE       z      p      lower      upper
##     aircoyes 12632.8904 1555.0329  8.1239 0.0000  9585.0819 15680.6989
##      bathrms 14335.5585 1482.9885  9.6667 0.0000 11428.9545 17242.1624
##     bedrooms  1832.0035 1045.6558  1.7520 0.0798  -217.4442  3881.4512
##  drivewayyes  6687.7789 2045.2636  3.2699 0.0011  2679.1359 10696.4219
##  fullbaseyes  5452.3855 1587.9782  3.4335 0.0006  2340.0054  8564.7657
##     garagepl  4244.8290  847.2173  5.0103 0.0000  2584.3136  5905.3444
##     gashwyes 12831.4063 3217.6211  3.9879 0.0001  6524.9848 19137.8277
##      lotsize     3.5463    0.3503 10.1250 0.0000     2.8598     4.2328
##  prefareayes  9369.5132 1669.1034  5.6135 0.0000  6098.1307 12640.8957
##   recroomyes  4511.2838 1899.9255  2.3745 0.0176   787.4982  8235.0694
##      stories  6556.9457  924.4211  7.0930 0.0000  4745.1137  8368.7777

It is also possible to plot the results:

plot(effects_model3)

This uses the base R plotting capabilities, which is useful because it is a simple call to the function plot(). But if you have been using ggplot2 and want this graph to have the same feel as the others made with ggplot2, you first need to save the summary in a variable; summary(effects_model3) is a data.frame with many more details. Let's overwrite effects_model3 with its summary:

effects_model3 <- summary(effects_model3)

And now it is possible to use ggplot2 to have the same plot:

ggplot(data = effects_model3) +
  geom_point(aes(factor, AME)) +
  geom_errorbar(aes(x = factor, ymin = lower, ymax = upper)) +
  geom_hline(yintercept = 0) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45))

Of course for model3, the marginal effects are the same as the coefficients, so let’s estimate a logit model and compute the marginal effects. Logit models can be estimated using the glm() function. As an example, we are going to use the Participation data, also from the Ecdat package:

data(Participation)
?Participation
Participation              package:Ecdat               R Documentation

Labor Force Participation

Description:

     a cross-section

     _number of observations_ : 872

     _observation_ : individuals

     _country_ : Switzerland

Usage:

     data(Participation)

Format:

     A dataframe containing :

     lfp labour force participation ?

     lnnlinc the log of nonlabour income

     age age in years divided by 10

     educ years of formal education

     nyc the number of young children (younger than 7)

     noc number of older children

     foreign foreigner ?

Source:

     Gerfin, Michael (1996) “Parametric and semiparametric estimation
     of the binary response”, _Journal of Applied Econometrics_,
     *11(3)*, 321-340.

References:

     Davidson, R.  and James G.  MacKinnon (2004) _Econometric Theory
     and Methods_, New York, Oxford University Press, <URL:
     http://www.econ.queensu.ca/ETM/>, chapter 11.

     Journal of Applied Econometrics data archive : <URL:
     http://qed.econ.queensu.ca/jae/>.

See Also:

     ‘Index.Source’, ‘Index.Economics’, ‘Index.Econometrics’,
     ‘Index.Observations’

The variable of interest is lfp: whether the individual participates in the labour force. To know which variables are relevant in the decision to participate in the labour force, one could estimate a logit model, using glm().

logit_participation <- glm(lfp ~ ., data = Participation, family = "binomial")

broom::tidy(logit_participation)
## # A tibble: 7 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  10.4       2.17       4.79  1.69e- 6
## 2 lnnlinc      -0.815     0.206     -3.97  7.31e- 5
## 3 age          -0.510     0.0905    -5.64  1.72e- 8
## 4 educ          0.0317    0.0290     1.09  2.75e- 1
## 5 nyc          -1.33      0.180     -7.39  1.51e-13
## 6 noc          -0.0220    0.0738    -0.298 7.66e- 1
## 7 foreignyes    1.31      0.200      6.56  5.38e-11

From the results above, one can only interpret the sign of the coefficients. To know how much a variable influences the labour force participation, one has to use margins():

effects_logit_participation <- margins(logit_participation) %>%
  summary()

print(effects_logit_participation)
##      factor     AME     SE       z      p   lower   upper
##         age -0.1064 0.0176 -6.0494 0.0000 -0.1409 -0.0719
##        educ  0.0066 0.0060  1.0955 0.2733 -0.0052  0.0185
##  foreignyes  0.2834 0.0399  7.1102 0.0000  0.2053  0.3615
##     lnnlinc -0.1699 0.0415 -4.0994 0.0000 -0.2512 -0.0887
##         noc -0.0046 0.0154 -0.2981 0.7656 -0.0347  0.0256
##         nyc -0.2775 0.0333 -8.3433 0.0000 -0.3426 -0.2123

We can use the previous code to plot the marginal effects:

ggplot(data = effects_logit_participation) +
  geom_point(aes(factor, AME)) +
  geom_errorbar(aes(x = factor, ymin = lower, ymax = upper)) +
  geom_hline(yintercept = 0) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45))

So a small increase of, say, 0.001 in non-labour income (lnnlinc) is associated with a decrease of the probability of labour force participation of about 0.001*17 = 0.017 percentage points.
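
To make the arithmetic explicit, using the AME of lnnlinc from the table above:

# AME of lnnlinc times the increase in lnnlinc:
-0.1699 * 0.001
# = -0.0001699, i.e. a decrease of about 0.017 percentage points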

You can also extract the marginal effects of a single variable:

head(dydx(Participation, logit_participation, "lnnlinc"))
##   dydx_lnnlinc
## 1  -0.15667764
## 2  -0.20014487
## 3  -0.18495109
## 4  -0.05377262
## 5  -0.18710476
## 6  -0.19586986

This makes it possible to extract the effect for a list of individuals that you can create yourself:

my_subjects <- tribble(
    ~lfp,  ~lnnlinc, ~age, ~educ, ~nyc, ~noc, ~foreign,
    "yes",   10.780,  7.0,     4,    1,   1,     "yes",
     "no",     1.30,  9.0,     1,    4,   1,     "yes"
)

dydx(my_subjects, logit_participation, "lnnlinc")
##   dydx_lnnlinc
## 1  -0.09228119
## 2  -0.17953451

I used the tribble() function from the tibble package to create this test data set, row by row. Then, using dydx(), I get the marginal effect of variable lnnlinc for these two individuals.

6.5 Comparing models

Let’s estimate another model on the same data; prices are only positive, so a linear regression might not be the best model, because the model could predict negative prices. Let’s look at the distribution of prices:

ggplot(Housing) +
  geom_density(aes(price))

It looks like modeling the log of price might provide a better fit:

model_log <- lm(log(price) ~ ., data = Housing)

result_log <- broom::tidy(model_log)

print(result_log)
## # A tibble: 12 x 5
##    term          estimate  std.error statistic  p.value
##    <chr>            <dbl>      <dbl>     <dbl>    <dbl>
##  1 (Intercept) 10.0       0.0472        212.   0.      
##  2 lotsize      0.0000506 0.00000485     10.4  2.91e-23
##  3 bedrooms     0.0340    0.0145          2.34 1.94e- 2
##  4 bathrms      0.168     0.0206          8.13 3.10e-15
##  5 stories      0.0923    0.0128          7.20 2.10e-12
##  6 drivewayyes  0.131     0.0283          4.61 5.04e- 6
##  7 recroomyes   0.0735    0.0263          2.79 5.42e- 3
##  8 fullbaseyes  0.0994    0.0220          4.52 7.72e- 6
##  9 gashwyes     0.178     0.0446          4.00 7.22e- 5
## 10 aircoyes     0.178     0.0215          8.26 1.14e-15
## 11 garagepl     0.0508    0.0116          4.36 1.58e- 5
## 12 prefareayes  0.127     0.0231          5.50 6.02e- 8

Let’s take a look at the diagnostics:

glance(model_log)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <int>  <dbl> <dbl>
## 1     0.677         0.670 0.214      102. 3.67e-123    12   73.9 -122.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>

Let’s compare these to the ones from the previous model:

diag_lm <- glance(model3)

diag_lm <- diag_lm %>%
  mutate(model = "lin-lin model")

diag_log <- glance(model_log)

diag_log  <- diag_log %>%
  mutate(model = "log-lin model")

diagnostics_models <- full_join(diag_lm, diag_log)
## Joining, by = c("r.squared", "adj.r.squared", "sigma", "statistic", "p.value", "df", "logLik", "AIC", "BIC", "deviance", "df.residual", "model")
print(diagnostics_models)
## # A tibble: 2 x 12
##   r.squared adj.r.squared   sigma statistic   p.value    df  logLik     AIC
##       <dbl>         <dbl>   <dbl>     <dbl>     <dbl> <int>   <dbl>   <dbl>
## 1     0.673         0.666 1.54e+4     100.0 6.18e-122    12 -6034.   12094.
## 2     0.677         0.670 2.14e-1     102.  3.67e-123    12    73.9   -122.
## # … with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   model <chr>

I saved the diagnostics in two different data.frame objects using the glance() function and added a model column to indicate which model the diagnostics come from. Then I merged both datasets using full_join(), a {dplyr} function.
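
Since both data frames have exactly the same columns, stacking them with bind_rows(), another {dplyr} function, would work just as well and avoids the message about the joining columns; a one-line alternative:

diagnostics_models <- bind_rows(diag_lm, diag_log)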

As you can see, the model with the logarithm of the prices as the dependent variable has a higher likelihood (and thus lower AIC and BIC) than the simple linear model. Let’s take a look at the diagnostics plots:

autoplot(model_log, which = 1:6) + theme_minimal()

6.6 Using a model for prediction

Once you have estimated a model, you might want to use it for prediction. This is easily done using the predict() function, which works with most models. Prediction is also useful as a way to test the accuracy of your model: split your data into a training set (used for estimation) and a testing set (used for the pseudo-prediction), and see whether your model overfits the data. We are going to see how to do that in a later section; for now, let's just get acquainted with predict() and other functions. I insist: keep in mind that this section is only meant to get you acquainted with these functions. We are going to explore prediction, overfitting and tuning of models in a later section.

Let’s go back to the models we estimated in the previous section, model3 and model_log. Let’s also take a subsample of data, which we will be using for prediction:

set.seed(1234)

pred_set <- Housing %>%
  sample_n(20)

To always obtain the same pred_set, I set the random seed first. Let's take a look at the data:

print(pred_set)
##     price lotsize bedrooms bathrms stories driveway recroom fullbase gashw
## 1   52000    4280        2       1       1      yes      no       no    no
## 2   62500    3900        3       1       2      yes      no       no    no
## 3  175000    8960        4       4       4      yes      no       no    no
## 4  141000    8100        4       1       2      yes     yes      yes    no
## 5   54000    2856        3       1       3      yes      no       no    no
## 6   52000    4130        3       2       2      yes      no       no    no
## 7   66000    4160        3       1       1      yes     yes      yes    no
## 8   95000    4260        4       2       2      yes      no       no   yes
## 9   97000   12090        4       2       2      yes      no       no    no
## 10  70000    6300        3       1       1      yes      no       no    no
## 11  85000    6420        3       1       1      yes      no      yes    no
## 12  39000    4000        3       1       2      yes      no       no    no
## 13  59900    3450        3       1       2      yes      no       no    no
## 14  53000    4050        2       1       1      yes      no       no    no
## 15  60000    2610        4       3       2       no      no       no    no
## 16  72500    5720        2       1       2      yes      no       no    no
## 17  35500    3000        3       1       2       no      no       no    no
## 18  40000    2650        3       1       2      yes      no      yes    no
## 19  35000    3500        2       1       1      yes     yes       no    no
## 20  37000    4400        2       1       1      yes      no       no    no
##    airco garagepl prefarea
## 1    yes        2       no
## 2     no        0       no
## 3    yes        3       no
## 4    yes        2      yes
## 5     no        0      yes
## 6     no        2       no
## 7    yes        0       no
## 8     no        0       no
## 9     no        2      yes
## 10   yes        2       no
## 11   yes        0      yes
## 12    no        1       no
## 13    no        1       no
## 14    no        0       no
## 15    no        0       no
## 16   yes        0      yes
## 17    no        0       no
## 18    no        1       no
## 19    no        0       no
## 20    no        0       no

If we wish to use it for prediction, this is easily done with predict():

predict(model3, pred_set)
##         1         2         3         4         5         6         7 
##  63506.66  49425.47 150689.71 106607.68  61649.59  73066.34  66387.12 
##         8         9        10        11        12        13        14 
##  79701.11 112496.42  72502.20  79260.00  54024.93  52074.46  41568.47 
##        15        16        17        18        19        20 
##  68666.08  76050.14  39546.02  54689.81  44129.28  42809.67

This returns a vector of predicted prices. This can then be used to compute the Root Mean Squared Error for instance. Let’s do it within a tidyverse pipeline:

rmse <- pred_set %>%
  mutate(predictions = predict(model3, .)) %>%
  summarise(sqrt(sum((predictions - price)^2)/n()))

The root mean square error of model3 is 1666.1312666. I also used the n() function, which returns the number of observations in a group (or all the observations, if the data is not grouped).
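
As a quick, standalone illustration of n(), here is a toy example on the built-in mtcars data:

mtcars %>%
  summarise(n_obs = n())   # all 32 observations

mtcars %>%
  group_by(cyl) %>%
  summarise(n_obs = n())   # one count per value of cyl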

Let's compare model3's RMSE with the one from model_log:

rmse2 <- pred_set %>%
  mutate(predictions = exp(predict(model_log, .))) %>%
  summarise(sqrt(sum((predictions - price)^2)/n()))

Don’t forget to exponentiate the predictions; remember, you’re dealing with a log-linear model! model_log’s RMSE is 1359.0392252, which is lower than model3’s. However, keep in mind that the model was estimated on the whole data, and then the prediction quality was assessed on a subsample of the very data the model was estimated on… so we can’t really say whether model_log’s predictions are truly useful. Of course, the same goes for model3. In a later section we are going to learn how to do cross-validation to avoid this issue.

Another problem with what I did above, unrelated to statistics per se, is that I wanted to compute the same quantity for two different models, and did so by copying and pasting 3 lines of code. That's not much, but if I wanted to compare 10 models, copy and paste mistakes could sneak in. Instead, it would be nice to have a function that computes the RMSE and then use it on my models. We are going to learn how to write our own functions and use them just as if they were built-in R functions.
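
To give you a taste of where we are headed, such a helper could look like the sketch below; compute_rmse() is a name I made up, and it assumes the data contains a column called price, so treat it as an illustration only:

# Compute the RMSE of a model on a given dataset
compute_rmse <- function(model, data, log_model = FALSE){
  predictions <- predict(model, data)
  # undo the log transformation if the model was estimated on log(price)
  if (log_model) predictions <- exp(predictions)
  sqrt(mean((predictions - data$price)^2))
}

compute_rmse(model3, pred_set)
compute_rmse(model_log, pred_set, log_model = TRUE)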

6.7 Beyond linear regression

R has a lot of other built-in functions for regression, such as glm() (for Generalized Linear Models) and nls() (for Nonlinear Least Squares). There are also functions and additional packages for time series, panel data, machine learning, Bayesian and nonparametric methods. Presenting everything here would take too much space, and would be pretty useless as you can find whatever you need using an internet search engine. What you have learned until now is quite general and should work on many types of models. To help you out, here is a list of methods and the recommended packages that you can use:

Model | Package | Quick example
Robust Linear Regression | MASS | rlm(y ~ x, data = mydata)
Nonlinear Least Squares | stats (note 1) | nls(y ~ x1 / (1 + x2), data = mydata) (note 2)
Logit | stats | glm(y ~ x, data = mydata, family = "binomial")
Probit | stats | glm(y ~ x, data = mydata, family = binomial(link = "probit"))
K-Means | stats | kmeans(data, n) (note 3)
PCA | stats | prcomp(data, scale = TRUE, center = TRUE) (note 4)
Multinomial Logit | mlogit | Requires several steps of data pre-processing and formula definition; refer to the package vignette for more details.
Cox PH | survival | coxph(Surv(y_time, y_status) ~ x, data = mydata) (note 5)
Time series | Several, depending on your needs | Time series in R is a vast subject that would require a very thick book to cover. You can get started with the following series of blog articles: Tidy time-series, part 1, Tidy time-series, part 2, Tidy time-series, part 3 and Tidy time-series, part 4.
Panel data | plm | plm(y ~ x, data = mydata, model = "within") (or model = "random")
Neural Networks | Several, depending on your needs | R is a very popular programming language for machine learning. This blog post lists and compares some of the most useful packages for neural nets and deep learning.
Nonparametric regression | np | Several functions and options available; refer to the package vignette for more details.

I put neural networks in the table, but you can also find packages for regression trees, naive Bayes, and pretty much any machine learning method out there! The same goes for Bayesian methods. Popular packages include {rstan} and {rjags}, which link R to Stan and JAGS (two other pieces of software that do the sampling for you) and which allow you to fit very general models. It is also possible to estimate models using Bayesian inference without the need for external tools, with the {bayesm} package, which estimates the usual micro-econometric models. There really are a lot of packages available for Bayesian inference, and you can find them all in the related CRAN Task View.

6.8 Hyper-parameters

Hyper-parameters are parameters of the model that cannot be directly learned from the data. A linear regression does not have any hyper-parameters, but a random forest, for instance, has several. You might have heard of ridge regression, lasso and elasticnet. These are extensions of linear models that avoid over-fitting by penalizing large models. These extensions of the linear regression have hyper-parameters that the practitioner has to tune. There are several ways one can tune these parameters, for example by doing a grid search, a random search over the grid, or by using more elaborate methods. To introduce hyper-parameters, let's get to know ridge regression, also called Tikhonov regularization.

6.8.1 Ridge regression

Ridge regression is used when the data you are working with has a lot of explanatory variables, or when there is a risk that a simple linear regression might overfit the training data, because, for example, your explanatory variables are collinear. If you are training a linear model and notice that it generalizes very badly to new, unseen data, it is very likely that the linear model overfit the data. In this case, ridge regression might prove useful. The way ridge regression works might seem counter-intuitive: it boils down to fitting a worse model to the training data, but in return, this worse model will generalize better to new data.

The closed form solution of the ordinary least squares estimator is defined as:

\[ \widehat{\beta} = (X'X)^{-1}X'Y \]

where \(X\) is the design matrix (the matrix made up of the explanatory variables) and \(Y\) is the dependent variable. For ridge regression, this closed form solution changes a little bit:

\[ \widehat{\beta} = (X'X + \lambda I_p)^{-1}X'Y \]

where \(\lambda \in \mathbb{R}\) is a hyper-parameter and \(I_p\) is the identity matrix of dimension \(p\) (\(p\) is the number of explanatory variables). The formula above is the closed form solution to the following optimisation program:

\[ \min_{\beta} \sum_{i=1}^n \left(y_i - \sum_{j=1}^px_{ij}\beta_j\right)^2 \]

such that:

\[ \sum_{j=1}^p(\beta_j)^2 < c \]

for any strictly positive \(c\).
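
To make the closed form solution concrete, here is a small sketch that computes both estimators by hand on simulated data (purely illustrative; below I will use the {glmnet} package instead):

set.seed(42)
X <- cbind(1, matrix(rnorm(100 * 3), ncol = 3)) # design matrix with an intercept
true_beta <- c(1, 2, -1, 0.5)
y <- X %*% true_beta + rnorm(100)

p <- ncol(X)
lambda <- 10

# OLS: (X'X)^(-1) X'Y
beta_ols <- solve(t(X) %*% X, t(X) %*% y)

# Ridge: (X'X + lambda * I_p)^(-1) X'Y; the estimates are shrunk towards 0
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)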

The glmnet() function from the {glmnet} package can be used for ridge regression, by setting the alpha argument to 0 (setting it to 1 would do LASSO, and setting it to a number between 0 and 1 would do elasticnet). But in order to compare linear regression and ridge regression, let me first divide the data into a training set and a testing set:

index <- 1:nrow(Housing)

set.seed(12345)
train_index <- sample(index, round(0.90*nrow(Housing)), replace = FALSE)

test_index <- setdiff(index, train_index)

train_x <- Housing[train_index, ] %>% 
    select(-price)

train_y <- Housing[train_index, ] %>% 
    pull(price)

test_x <- Housing[test_index, ] %>% 
    select(-price)

test_y <- Housing[test_index, ] %>% 
    pull(price)

I do the train/test split this way, because glmnet() requires a design matrix as input, and not a formula. Design matrices can be created using the model.matrix() function:

library("glmnet")
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loaded glmnet 2.0-16
train_matrix <- model.matrix(train_y ~ ., data = train_x)

test_matrix <- model.matrix(test_y ~ ., data = test_x)

Let’s now run a linear regression, by setting the penalty to 0:

model_lm_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 0)

The model above provides the same result as a linear regression, because I set lambda to 0. Let’s compare the coefficients between the two:

coef(model_lm_ridge)
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) -3247.030393
## (Intercept)     .       
## lotsize         3.520283
## bedrooms     1745.211187
## bathrms     14337.551325
## stories      6736.679470
## drivewayyes  5687.132236
## recroomyes   5701.831289
## fullbaseyes  5708.978557
## gashwyes    12508.524241
## aircoyes    12592.435621
## garagepl     4438.918373
## prefareayes  9085.172469

and now the coefficients of the linear regression (because I provide a design matrix, I have to use lm.fit() instead of lm(), which requires a formula, not a matrix):

coef(lm.fit(x = train_matrix, y = train_y))
##  (Intercept)      lotsize     bedrooms      bathrms      stories 
## -3245.146665     3.520357  1744.983863 14336.336858  6737.000410 
##  drivewayyes   recroomyes  fullbaseyes     gashwyes     aircoyes 
##  5686.394123  5700.210775  5709.493884 12509.005265 12592.367268 
##     garagepl  prefareayes 
##  4439.029607  9085.409155

as you can see, the coefficients are the same. Let’s compute the RMSE for the unpenalized linear regression:

preds_lm <- predict(model_lm_ridge, test_matrix)

rmse_lm <- sqrt(mean((preds_lm - test_y)^2))

The RMSE for the unpenalized linear regression is equal to 2077.4197343.

Let’s now run a ridge regression, with lambda equal to 100, and see if the RMSE is smaller:

model_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 100)

and let’s compute the RMSE again:

preds <- predict(model_ridge, test_matrix)

rmse <- sqrt(mean((preds - test_y)^2))

The RMSE for the penalized linear regression is equal to 2072.6117757, which is smaller than before. But which value of lambda gives the smallest RMSE? To find out, one must run the model over a grid of lambda values and pick the model with the lowest RMSE. This procedure is available in the cv.glmnet() function, which picks the best value for lambda:

best_model <- cv.glmnet(train_matrix, train_y)
# lambda that minimises the MSE
best_model$lambda.min
## [1] 66.07936

According to cv.glmnet(), the best value for lambda is 66.0793576. In the next section, we will implement cross-validation ourselves, in order to find the hyper-parameters of a random forest.
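
Before moving on, note that the object returned by cv.glmnet() can be used directly for prediction at the selected value of lambda; a minimal sketch, reusing the objects defined above:

preds_cv <- predict(best_model, test_matrix, s = "lambda.min")

rmse_cv <- sqrt(mean((preds_cv - test_y)^2))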

6.9 Training, validating, and testing models

Cross-validation is an important procedure which is used to compare models but also to tune the hyper-parameters of a model.

In this section, we are going to use several packages from the {tidymodels} collection of packages, namely {recipes}, {rsample} and {parsnip} to train a random forest the tidy way. I will also use {mlrMBO} to tune the hyper-parameters of the random forest.

6.9.1 Set up

Let’s load the needed packages:

library("tidyverse")
library("tidymodels")
library("parsnip")
library("brotools")
library("mlbench")

Load the data, included in the {mlbench} package:

data("BostonHousing2")

I will train a random forest to predict the housing price, which is the cmedv column:

head(BostonHousing2)
##         town tract      lon     lat medv cmedv    crim zn indus chas   nox
## 1     Nahant  2011 -70.9550 42.2550 24.0  24.0 0.00632 18  2.31    0 0.538
## 2 Swampscott  2021 -70.9500 42.2875 21.6  21.6 0.02731  0  7.07    0 0.469
## 3 Swampscott  2022 -70.9360 42.2830 34.7  34.7 0.02729  0  7.07    0 0.469
## 4 Marblehead  2031 -70.9280 42.2930 33.4  33.4 0.03237  0  2.18    0 0.458
## 5 Marblehead  2032 -70.9220 42.2980 36.2  36.2 0.06905  0  2.18    0 0.458
## 6 Marblehead  2033 -70.9165 42.3040 28.7  28.7 0.02985  0  2.18    0 0.458
##      rm  age    dis rad tax ptratio      b lstat
## 1 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 6.430 58.7 6.0622   3 222    18.7 394.12  5.21

Only keep relevant columns:

boston <- BostonHousing2 %>% 
    select(-medv, -tract, -lon, -lat) %>% 
    rename(price = cmedv)

I remove tract, lat and lon because the information contained in the column town is enough.

To train and evaluate the model’s performance, I split the data in two. One data set, called the training set, will be further split into two down below. I won’t touch the second data set, the test set, until the very end, to finally assess the model’s performance.

train_test_split <- initial_split(boston, prop = 0.9)

housing_train <- training(train_test_split)

housing_test <- testing(train_test_split)

initial_split(), training() and testing() are functions from the {rsample} package.

I will train a random forest on the training data, but the question is, which random forest? Random forests have several hyper-parameters, and, as explained in the introduction, these hyper-parameters cannot be directly learned from the data. So which values should we choose? We could train 6 random forests, for instance, and compare their performance, but why only 6? Why not 16?

In order to find the right hyper-parameters, the practitioner can use values from the literature that seem to have worked well (as is done in macro-econometrics), or further split the training set into two, create a grid of hyper-parameters, train the model on one part of the data for every combination in the grid, and compare the predictions of the models on the second part of the data. You then stick with the model that performed best, for example the model with the lowest RMSE. The thing is, you can't estimate the true value of the RMSE with only one value. It's as if you wanted to estimate the average height of the population by drawing a single observation from it. You need a few more observations. To approach the true value of the RMSE for a given set of hyper-parameters, instead of doing one split, let's do 30. We then compute the average RMSE, which implies training 30 models for each combination of hyper-parameter values.
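
The grid of combinations itself is easy to build, for instance with expand.grid() from base R (the candidate values below are only an illustration):

hyper_grid <- expand.grid(mtry = 3:8, trees = c(100, 300, 500))

nrow(hyper_grid) # 18 combinations, each of which would be evaluated on 30 splits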

First, let’s split the training data again, using the mc_cv() function from {rsample} package. This function implements Monte Carlo cross-validation:

validation_data <- mc_cv(housing_train, prop = 0.9, times = 30)

What does validation_data look like?

validation_data
## # # Monte Carlo cross-validation (0.9/0.1) with 30 resamples  
## # A tibble: 30 x 2
##    splits           id        
##    <list>           <chr>     
##  1 <split [411/45]> Resample01
##  2 <split [411/45]> Resample02
##  3 <split [411/45]> Resample03
##  4 <split [411/45]> Resample04
##  5 <split [411/45]> Resample05
##  6 <split [411/45]> Resample06
##  7 <split [411/45]> Resample07
##  8 <split [411/45]> Resample08
##  9 <split [411/45]> Resample09
## 10 <split [411/45]> Resample10
## # … with 20 more rows

Let’s look further down:

validation_data$splits[[1]]
## <411/45/456>

The first value is the number of rows of the first set, the second the number of rows of the second set, and the third is the original number of observations in the training data, before splitting again.

How should we call these two new data sets? The author of {rsample}, Max Kuhn, calls them the analysis set and the assessment set.
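
These two sets can be extracted from a split object with the analysis() and assessment() functions from {rsample}; a quick check on the first split:

dim(analysis(validation_data$splits[[1]]))

dim(assessment(validation_data$splits[[1]]))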


Now, in order to continue, I need to pre-process the data. I will do this in three steps. The first and second steps are used to center and scale the numeric variables, and the third step converts character and factor variables to dummy variables. This is needed because I will train a random forest, which cannot handle factor variables directly. Let's define a recipe to do this, and start by pre-processing the testing set. I write a wrapper function around the recipe, because I will need to apply this recipe to various data sets:

simple_recipe <- function(dataset){
    recipe(price ~ ., data = dataset) %>%
        step_center(all_numeric()) %>%
        step_scale(all_numeric()) %>%
        step_dummy(all_nominal())
}

Once the recipe is defined, I can use the prep() function, which estimates the parameters from the data which are needed to process the data. For example, for centering, prep() estimates the mean which will then be subtracted from the variables. With bake() the estimates are then applied on the data:

testing_rec <- prep(simple_recipe(housing_test), training = housing_test)

test_data <- bake(testing_rec, new_data = housing_test)

It is important to split the data before using prep() and bake(), because if you don't, you will use observations from the test set in the prep() step, and thus introduce knowledge from the test set into the training data. This is called data leakage, and it must be avoided. This is why it is necessary to first split the training data into an analysis and an assessment set, and then also pre-process these sets separately. However, the validation_data object cannot be used with recipe() directly, because it is not a data frame. No worries, I simply need to write a function that extracts the analysis and assessment sets from the validation_data object, applies the pre-processing, trains the model, and returns the RMSE. This will be a big function, at the center of the analysis.

But before that, let’s run a simple linear regression, as a benchmark. For the linear regression, I will not use any CV, so let’s pre-process the training set:

trainlm_rec <- prep(simple_recipe(housing_train), training = housing_train)

trainlm_data <- bake(trainlm_rec, new_data = housing_train)

linreg_model <- lm(price ~ ., data = trainlm_data)

broom::augment(linreg_model, newdata = test_data) %>% 
    rmse(price, .fitted)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.345

broom::augment() adds the predictions to test_data in a new column, .fitted. I won't use this trick with the random forest, because there is no augment() method for random forests from the {ranger} package, which I'll use. I'll add the predictions to the data myself.

Ok, now let’s go back to the random forest and write the big function:

my_rf <- function(mtry, trees, split, id){
    
    analysis_set <- analysis(split)
    
    analysis_prep <- prep(simple_recipe(analysis_set), training = analysis_set)
    
    analysis_processed <- bake(analysis_prep, new_data = analysis_set)
    
    model <- rand_forest(mtry = mtry, trees = trees) %>%
        set_engine("ranger", importance = 'impurity') %>%
        fit(price ~ ., data = analysis_processed)

    assessment_set <- assessment(split)
    
    assessment_prep <- prep(simple_recipe(assessment_set), training = assessment_set)
    
    assessment_processed <- bake(assessment_prep, new_data = assessment_set)

    tibble::tibble("id" = id,
        "truth" = assessment_processed$price,
        "prediction" = unlist(predict(model, new_data = assessment_processed)))
}

The rand_forest() function is available from the {parsnip} package. This package provides a unified interface to a lot of other machine learning packages. This means that instead of having to learn the syntax of ranger() and randomForest() and so on, you can simply use the rand_forest() function and change the engine argument to the one you want (ranger, randomForest, etc.).
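
For instance, switching from {ranger} to {randomForest} only requires changing the engine; the model specification stays the same (a sketch, assuming the {randomForest} package is installed):

rf_spec_ranger <- rand_forest(mtry = 3, trees = 200) %>%
    set_engine("ranger")

rf_spec_randomForest <- rand_forest(mtry = 3, trees = 200) %>%
    set_engine("randomForest")

# either specification can then be passed to fit() with the same formula and data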

Let’s now try the my_rf() function defined above:

results_example <- map2_df(.x = validation_data$splits,
                           .y = validation_data$id,
                           ~my_rf(mtry = 3, trees = 200, split = .x, id = .y))
head(results_example)
## # A tibble: 6 x 3
##   id          truth prediction
##   <chr>       <dbl>      <dbl>
## 1 Resample01 -0.518     -0.259
## 2 Resample01 -1.03      -0.354
## 3 Resample01 -0.913     -0.345
## 4 Resample01 -0.874     -0.407
## 5 Resample01  0.430      0.241
## 6 Resample01  0.259      0.301

I can now compute the RMSE when mtry = 3 and trees = 200:

results_example %>%
    group_by(id) %>%
    rmse(truth, prediction) %>%
    summarise(mean_rmse = mean(.estimate)) %>%
    pull
## [1] 0.6096097

This gives the average RMSE of the random forest across the 30 resamples. The goal now is to lower this RMSE by tuning the mtry and trees hyper-parameters. For this, I will use the Bayesian optimization methods implemented in the {mlrMBO} package.

6.9.2 Bayesian hyperparameter optimization

I will re-use the code from above, and define a function that does everything from pre-processing to returning the metric I want to minimize by tuning the hyperparameters, the RMSE:

tuning <- function(param, validation_data){

    mtry <- param[1]
    trees <- param[2]

    results <- purrr::map2_df(.x = validation_data$splits,
                       .y = validation_data$id,
                       ~my_rf(mtry = mtry, trees = trees, split = .x, id = .y))

    results %>%
        group_by(id) %>%
        rmse(truth, prediction) %>%
        summarise(mean_rmse = mean(.estimate)) %>%
        pull
}

This is exactly the code from before, but it now returns the RMSE. Let’s try the function with the values from before:

tuning(c(3, 200), validation_data)
## [1] 0.6085931

I now follow the code that can be found in the {mlrMBO} arXiv paper to run the optimization. I think I got the gist of the paper, but I did not understand everything yet, and I am still experimenting with the library. From what I understand, a simpler model, called the surrogate model, is used to look for promising points and to evaluate the value of the function at these points. This seems somewhat similar (in spirit) to the Indirect Inference method as described in Gourieroux, Monfort, Renault.

Let’s first load the package and create the function to optimize:

library("mlrMBO")
fn <- makeSingleObjectiveFunction(name = "tuning",
                                 fn = tuning,
                                 par.set = makeParamSet(makeIntegerParam("x1", lower = 3, upper = 8),
                                                        makeIntegerParam("x2", lower = 100, upper = 500)))

This function is based on the tuning() function I defined before. The parameters to optimize are also defined, as are their bounds. I will look for mtry between the values of 3 and 8, and trees between 100 and 500.

Now comes the part I didn’t quite get.

# Create initial random Latin Hypercube Design of 10 points
library(lhs) # for randomLHS
des <- generateDesign(n = 5L * 2L, getParamSet(fn), fun = randomLHS)

I think this means that these 10 points are the points used to start the whole process. I did not understand why they have to be sampled from a Latin hypercube, but ok. Then I choose the surrogate model, a random forest too, and ask it to also predict the standard error. Here also, I did not quite get why the standard error can be an option.

# Specify the surrogate model (a random forest) with standard error estimation
surrogate <- makeLearner("regr.ranger", predict.type = "se", keep.inbag = TRUE)

Here I define some options:

# Set general controls
ctrl <- makeMBOControl()
ctrl <- setMBOControlTermination(ctrl, iters = 10L)
ctrl <- setMBOControlInfill(ctrl, crit = makeMBOInfillCritEI())

And this is the optimization part:

# Start optimization
result <- mbo(fn, des, surrogate, ctrl, more.args = list("validation_data" = validation_data))
result
## Recommended parameters:
## x1=8; x2=314
## Objective: y = 0.484
## 
## Optimization path
## 10 + 10 entries in total, displaying last 10 (or less):
##    x1  x2         y dob eol error.message exec.time            ei
## 11  8 283 0.4855415   1  NA          <NA>     7.353 -3.276847e-04
## 12  8 284 0.4852047   2  NA          <NA>     7.321 -3.283713e-04
## 13  8 314 0.4839817   3  NA          <NA>     7.703 -3.828517e-04
## 14  8 312 0.4841398   4  NA          <NA>     7.633 -2.829713e-04
## 15  8 318 0.4841066   5  NA          <NA>     7.692 -2.668354e-04
## 16  8 314 0.4845221   6  NA          <NA>     7.574 -1.382333e-04
## 17  8 321 0.4843018   7  NA          <NA>     7.693 -3.828924e-05
## 18  8 318 0.4868457   8  NA          <NA>     7.696 -8.692828e-07
## 19  8 310 0.4862687   9  NA          <NA>     7.594 -1.061185e-07
## 20  8 313 0.4878694  10  NA          <NA>     7.628 -5.153015e-07
##    error.model train.time prop.type propose.time           se      mean
## 11        <NA>      0.011 infill_ei        0.450 0.0143886864 0.5075765
## 12        <NA>      0.011 infill_ei        0.427 0.0090265872 0.4971003
## 13        <NA>      0.012 infill_ei        0.443 0.0062693960 0.4916927
## 14        <NA>      0.012 infill_ei        0.435 0.0037308971 0.4878950
## 15        <NA>      0.012 infill_ei        0.737 0.0024446891 0.4860699
## 16        <NA>      0.013 infill_ei        0.442 0.0012713838 0.4850705
## 17        <NA>      0.012 infill_ei        0.444 0.0006371109 0.4847248
## 18        <NA>      0.013 infill_ei        0.467 0.0002106381 0.4844576
## 19        <NA>      0.014 infill_ei        0.435 0.0002182254 0.4846214
## 20        <NA>      0.013 infill_ei        0.748 0.0002971160 0.4847383

So the recommended parameters are 8 for mtry and 314 for trees. You can access these recommended parameters with result$x$x1 and result$x$x2. The value of the RMSE is lower than before, and equals 0.4839817. It can be accessed with result$y. Let's now train the random forest on the training data with these values. First, I pre-process the training data:

training_rec <- prep(simple_recipe(housing_train), training = housing_train)

train_data <- bake(training_rec, new_data = housing_train)

Let’s now train our final model and predict the prices:

final_model <- rand_forest(mtry = result$x$x1, trees = result$x$x2) %>%
        set_engine("ranger", importance = 'impurity') %>%
        fit(price ~ ., data = train_data)

price_predict <- predict(final_model, new_data = select(test_data, -price))

Let’s transform the data back and compare the predicted prices to the true ones visually:

cbind(price_predict * sd(housing_train$price) + mean(housing_train$price), 
      housing_test$price)
##       .pred housing_test$price
## 1  30.59078               33.4
## 2  20.45857               22.1
## 3  20.65227               15.0
## 4  21.00041               18.9
## 5  21.85504               21.0
## 6  22.22871               19.7
## 7  25.50695               23.5
## 8  23.37194               22.8
## 9  23.10205               22.9
## 10 35.32846               43.8
## 11 19.75544               19.4
## 12 20.62800               18.5
## 13 19.32962               20.5
## 14 17.47344               19.2
## 15 17.85629               23.0
## 16 37.12904               50.0
## 17 19.92393               17.4
## 18 25.12932               29.9
## 19 28.58319               29.6
## 20 27.52733               30.5
## 21 36.23437               48.5
## 22 37.18469               50.0
## 23 22.24399               24.4
## 24 25.13665               27.5
## 25 36.78475               44.8
## 26 31.46111               31.5
## 27 32.45295               44.0
## 28 29.72116               36.5
## 29 29.56234               35.1
## 30 23.19542               23.9
## 31 23.80056               26.4
## 32 19.83584               17.8
## 33 22.48759               19.4
## 34 20.94058               18.7
## 35 24.33038               23.1
## 36 22.50212               20.6
## 37 15.94037               17.8
## 38 21.41992               50.0
## 39 12.89569               12.3
## 40 12.40893               12.7
## 41 15.62015               11.9
## 42 19.64325               15.0
## 43 14.30708               11.7
## 44 12.09624                8.3
## 45 13.70672               14.4
## 46 17.14228               19.1
## 47 15.79025               13.3
## 48 16.86834               21.4
## 49 20.49379               16.8
## 50 21.41435               22.4

Let’s now compute the RMSE:

tibble::tibble("truth" = test_data$price,
        "prediction" = unlist(price_predict)) %>% 
    rmse(truth, prediction)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.481

  1. This package gets installed with R, no need to add it.

  2. The formula in the example is shown for illustration purposes.

  3. data must only contain numeric values, and n is the number of clusters.

  4. data must only contain numeric values, or a formula can be provided.

  5. Surv(y_time, y_status) creates a survival object, where y_time is the time to event y_status. It is possible to create more complex survival objects depending on exactly which data you are dealing with.