Bootstrap Resampling Essentials in R

kassambara | 11/03/2018 | Regression Model Validation

Similarly to cross-validation techniques (Chapter @ref(cross-validation)), the bootstrap resampling method can be used to measure the accuracy of a predictive model. Additionally, it can be used to measure the uncertainty associated with any statistical estimator.

Bootstrap resampling consists of repeatedly drawing, with replacement, a sample of n observations from the original data set, and evaluating the model on each of these resampled copies. The performance estimates are then averaged across the resamples, and their standard error provides an indication of the overall variability of the model performance.

This chapter describes the basics of bootstrapping and provides practical examples in R for computing a model prediction error. Additionally, we’ll show you how to compute an estimator uncertainty using bootstrap techniques.

Contents:

Loading required R packages
Example of data
Bootstrap procedure
Evaluating a predictive model performance
Quantifying an estimator uncertainty and confidence intervals
Discussion

The Book:

Machine Learning Essentials: Practical Guide in R

Loading required R packages

tidyverse for easy data manipulation and visualization
caret for easily computing cross-validation methods

library(tidyverse)
library(caret)

Example of data

We’ll use the built-in R swiss data, introduced in Chapter @ref(regression-analysis), for predicting the fertility score on the basis of socio-economic indicators.

# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)

Bootstrap procedure

The bootstrap method is used to quantify the uncertainty associated with a given statistical estimator or with a predictive model.

It consists of randomly selecting a sample of n observations from the original data set. This sample, called the bootstrap data set, is then used to evaluate the model.

This procedure is repeated a large number of times, and the standard error of the bootstrap estimate is then calculated. The results provide an indication of the variance of the model’s performance.

Note that the sampling is performed with replacement, which means that the same observation can occur more than once in the bootstrap data set.
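The procedure can be sketched by hand in base R. The example below is only an illustration, not code from this chapter: it bootstraps the mean of the Fertility variable, and the number of resamples (B = 1000) and the seed are arbitrary assumptions.

```r
# Manual bootstrap of the mean Fertility score (illustrative sketch)
data("swiss")
set.seed(123)        # arbitrary seed, for reproducibility only
n <- nrow(swiss)     # 47 observations
B <- 1000            # number of bootstrap resamples (assumed value)

boot_means <- replicate(B, {
  # Draw n row indices with replacement: some rows repeat, others are left out
  idx <- sample(n, size = n, replace = TRUE)
  mean(swiss$Fertility[idx])
})

# The standard deviation of the bootstrap estimates is the bootstrap standard error
boot_se <- sd(boot_means)
boot_se
```

Any other statistic (a median, a correlation, a model coefficient) can be substituted for mean() in the same loop.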

Evaluating a predictive model performance

The following example uses a bootstrap with 100 resamples to test a linear regression model:

# Define training control
train.control <- trainControl(method = "boot", number = 100)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)

## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (100 reps) 
## Summary of sample sizes: 47, 47, 47, 47, 47, 47, ... 
## Resampling results:
## 
##   RMSE  Rsquared  MAE 
##   8.4   0.597     6.76
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

The output shows the average model performance across the 100 resamples.

RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) are two different measures of the model prediction error. The lower the RMSE and the MAE, the better the model. The R-squared represents the proportion of variation in the outcome explained by the predictor variables included in the model. The higher the R-squared, the better the model. Read more on these metrics in Chapter @ref(regression-model-accuracy-metrics).
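To make the resampling behind these numbers more concrete, here is a hand-written sketch in base R. It assumes, as we understand caret's "boot" method, that each model is fitted on a bootstrap sample and evaluated on the observations left out of that sample (the out-of-bag rows), and that R-squared is taken as the squared correlation between observed and predicted values. The seed and object names are illustrative, and the values will not exactly match the caret output above.

```r
# Hand-rolled bootstrap evaluation of the linear model (illustrative sketch)
data("swiss")
set.seed(123)        # arbitrary seed
n <- nrow(swiss)
B <- 100             # same number of resamples as in the caret example

metrics <- t(replicate(B, {
  in_bag <- sample(n, size = n, replace = TRUE)  # bootstrap sample
  oob    <- setdiff(seq_len(n), in_bag)          # rows never drawn: held-out set
  fit    <- lm(Fertility ~ ., data = swiss[in_bag, ])
  pred   <- predict(fit, newdata = swiss[oob, ])
  err    <- swiss$Fertility[oob] - pred
  c(RMSE     = sqrt(mean(err^2)),
    Rsquared = cor(pred, swiss$Fertility[oob])^2,
    MAE      = mean(abs(err)))
}))

# Average performance across the B resamples
colMeans(metrics)
```

Each row of metrics holds the three measures for one resample, so apply(metrics, 2, sd) gives an idea of how much they vary from one resample to the next.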

Quantifying an estimator uncertainty and confidence intervals

The bootstrap approach can be used to quantify the uncertainty (or standard error) associated with any given statistical estimator.

For example, you might want to estimate the accuracy of the linear regression beta coefficients using the bootstrap method.

The different steps are as follows:

Create a simple function, model_coef(), that takes the swiss data set as well as the indices for the observations, and returns the regression coefficients.
Apply the function model_coef() to the full data set of 47 observations in order to compute the coefficients.

We start by creating a function that returns the regression model coefficients:

model_coef <- function(data, index){
  coef(lm(Fertility ~., data = data, subset = index))
}
model_coef(swiss, 1:47)

##      (Intercept)      Agriculture      Examination        Education 
##           66.915           -0.172           -0.258           -0.871 
##         Catholic Infant.Mortality 
##            0.104            1.077

Next, we use the boot() function [boot package] to compute the standard errors of 500 bootstrap estimates for the coefficients:

library(boot)
boot(swiss, model_coef, 500)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = swiss, statistic = model_coef, R = 500)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1*   66.915 -2.04e-01     10.9174
## t2*   -0.172 -5.62e-03      0.0639
## t3*   -0.258 -2.27e-02      0.2524
## t4*   -0.871  3.89e-05      0.2203
## t5*    0.104 -7.77e-04      0.0319
## t6*    1.077  4.45e-02      0.4478

In the output above,

The original column corresponds to the regression coefficients. The associated standard errors are given in the std. error column.
t1 corresponds to the intercept, t2 corresponds to Agriculture, and so on…

For example, it can be seen that the standard error (SE) of the regression coefficient associated with Agriculture is 0.06.

Note that the standard errors measure the variability/accuracy of the beta coefficients. They can be used to compute the confidence intervals of the coefficients.

For example, the 95% confidence interval for a given coefficient b is defined as b +/- 2*SE(b), where:

the lower limit of b = b - 2*SE(b) = -0.172 - (2*0.0639) = -0.300 (for the Agriculture variable)
the upper limit of b = b + 2*SE(b) = -0.172 + (2*0.0639) = -0.044 (for the Agriculture variable)

That is, there is approximately a 95% chance that the interval [-0.300, -0.044] will contain the true value of the coefficient.
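The same calculation can be done in R for all coefficients at once. The sketch below is illustrative: the object names boot_out, b and se and the seed are assumptions, and because the bootstrap is random, the limits change slightly from one run to another. The boot.ci() function from the boot package is used as an alternative that returns a percentile interval.

```r
library(boot)
data("swiss")

# Same statistic as in the text: regression coefficients on the resampled rows
model_coef <- function(data, index) {
  coef(lm(Fertility ~ ., data = data, subset = index))
}

set.seed(123)                        # arbitrary seed; results vary with it
boot_out <- boot(swiss, model_coef, R = 500)

# Normal-approximation intervals: estimate +/- 2 * bootstrap SE
b  <- boot_out$t0                    # coefficients on the original data
se <- apply(boot_out$t, 2, sd)       # bootstrap standard errors
cbind(estimate = b, lower = b - 2 * se, upper = b + 2 * se)

# Percentile interval for Agriculture (the 2nd coefficient)
boot.ci(boot_out, type = "perc", index = 2)
```

The percentile interval does not assume that the bootstrap estimates are normally distributed, so it can differ somewhat from the b +/- 2*SE(b) interval.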

Using the standard lm() function gives slightly different standard errors, because the linear model makes some assumptions about the data:

summary(lm(Fertility ~., data = swiss))$coef

##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        66.915    10.7060    6.25 1.91e-07
## Agriculture        -0.172     0.0703   -2.45 1.87e-02
## Examination        -0.258     0.2539   -1.02 3.15e-01
## Education          -0.871     0.1830   -4.76 2.43e-05
## Catholic            0.104     0.0353    2.95 5.19e-03
## Infant.Mortality    1.077     0.3817    2.82 7.34e-03

The bootstrap approach does not rely on the assumptions made by the linear model, so it is likely to give a more accurate estimate of the coefficients’ standard errors than the summary() function does.
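To see the two sets of standard errors next to each other, a short comparison can be sketched as follows. The names boot_se and lm_se and the seed are illustrative assumptions, and the bootstrap column will vary slightly between runs.

```r
library(boot)
data("swiss")

model_coef <- function(data, index) {
  coef(lm(Fertility ~ ., data = data, subset = index))
}

set.seed(123)   # arbitrary seed
# Bootstrap standard errors: SD of the 500 resampled coefficient vectors
boot_se <- apply(boot(swiss, model_coef, R = 500)$t, 2, sd)
# Model-based standard errors reported by summary()
lm_se <- summary(lm(Fertility ~ ., data = swiss))$coef[, "Std. Error"]

round(cbind(lm_se, boot_se, ratio = boot_se / lm_se), 3)
```

A ratio far from 1 for a given coefficient suggests that the model-based standard error and the bootstrap disagree for that term, which may be worth investigating.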

Discussion

This chapter describes the bootstrap resampling method for evaluating the accuracy of a predictive model, as well as for measuring the uncertainty associated with a given statistical estimator.

An alternative to bootstrapping for evaluating the performance of a predictive model is cross-validation (Chapter @ref(cross-validation)).
