Cross-Validation Essentials in R

Cross-validation refers to a set of methods for measuring the performance of a given predictive model on new test data sets.

The basic idea behind cross-validation techniques is to divide the data into two sets:

  1. The training set, used to train (i.e. build) the model;
  2. The testing set (or validation set), used to test (i.e. validate) the model by estimating the prediction error.

Cross-validation is also known as a resampling method because it involves fitting the same statistical method multiple times using different subsets of the data.

In this chapter, you’ll learn:

  1. The most commonly used statistical metrics (Chapter @ref(regression-model-accuracy-metrics)) for measuring the performance of a regression model in predicting the outcome of new test data.

  2. The different cross-validation methods for assessing model performance. We cover the following approaches:
    • Validation set approach (or data split)
    • Leave One Out Cross Validation
    • k-fold Cross Validation
    • Repeated k-fold Cross Validation

Each of these methods has its advantages and drawbacks. Use the method that best suits your problem. Generally, (repeated) k-fold cross-validation is recommended.

  3. Practical examples of R code for computing cross-validation methods.


Loading required R packages

  • tidyverse for easy data manipulation and visualization
  • caret for easily computing cross-validation methods

library(tidyverse)
library(caret)

Example of data

We’ll use the built-in R swiss data set, introduced in Chapter @ref(regression-analysis), for predicting the fertility score on the basis of socio-economic indicators.

# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)
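
For orientation, the swiss data set contains 47 observations of 6 numeric variables (Fertility, Agriculture, Examination, Education, Catholic and Infant.Mortality). A quick way to check this is:

# Overview of the data: number of rows/columns and variable types
str(swiss)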

Model performance metrics

After building a model, we want to determine the accuracy of this model in predicting the outcome for new, unseen observations that were not used to build the model. In other words, we want to estimate the prediction error.

To do so, the basic strategy is to:

  1. Build the model on a training data set
  2. Apply the model on a new test data set to make predictions
  3. Compute the prediction errors

In Chapter @ref(regression-model-accuracy-metrics), we described several statistical metrics for quantifying the overall quality of regression models. These include:

  • R-squared (R2), representing the squared correlation between the observed outcome values and the values predicted by the model. The higher the R2, the better the model.
  • Root Mean Squared Error (RMSE), which measures the average prediction error made by the model in predicting the outcome for an observation. That is, the average difference between the observed known outcome values and the values predicted by the model. The lower the RMSE, the better the model.
  • Mean Absolute Error (MAE), an alternative to the RMSE that is less sensitive to outliers. It corresponds to the average absolute difference between observed and predicted outcomes. The lower the MAE, the better the model.

In a classification setting, the prediction error rate is estimated as the proportion of misclassified observations.

R2, RMSE and MAE are used to measure the regression model performance during cross-validation.
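
To make these definitions concrete, the metrics can also be computed by hand from a vector of observed values and a vector of predictions. The short sketch below uses made-up numbers; in practice, the caret functions R2(), RMSE() and MAE() used later in this chapter do the work for you.

# Hypothetical observed and predicted outcome values (illustration only)
obs  <- c(80, 83, 92, 64, 66)
pred <- c(78, 85, 89, 70, 63)
# R2: squared correlation between observed and predicted values
cor(obs, pred)^2
# RMSE: square root of the mean squared difference
sqrt(mean((obs - pred)^2))
# MAE: mean absolute difference
mean(abs(obs - pred))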

In the following sections, we’ll explain the basics of cross-validation and provide practical examples, using mainly the caret R package.

Cross-validation methods

Briefly, cross-validation algorithms can be summarized as follows:

  1. Reserve a small sample of the data set
  2. Build (or train) the model using the remaining part of the data set
  3. Test the effectiveness of the model on the reserved sample of the data set. If the model works well on the test data set, then it’s good.

The following sections describe the different cross-validation techniques.

The Validation set Approach

The validation set approach consists of randomly splitting the data into two sets: one set is used to train the model and the other set is used to test the model.

The process works as follows:

  1. Build (train) the model on the training data set
  2. Apply the model to the test data set to predict the outcome of new unseen observations
  3. Quantify the prediction error as the mean squared difference between the observed and the predicted outcome values.

The example below splits the swiss data set so that 80% is used for training a linear regression model and 20% is used to evaluate the model performance.

# Split the data into training and test set
set.seed(123)
training.samples <- swiss$Fertility %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- swiss[training.samples, ]
test.data <- swiss[-training.samples, ]
# Build the model
model <- lm(Fertility ~., data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame( R2 = R2(predictions, test.data$Fertility),
            RMSE = RMSE(predictions, test.data$Fertility),
            MAE = MAE(predictions, test.data$Fertility))
##     R2 RMSE  MAE
## 1 0.39 9.11 7.48

When comparing two models, the one that produces the lowest test sample RMSE is the preferred model.

The RMSE and the MAE are measured on the same scale as the outcome variable. Dividing the RMSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible:

RMSE(predictions, test.data$Fertility)/mean(test.data$Fertility)
## [1] 0.128

Note that the validation set method is only useful when you have a large data set that can be partitioned. A disadvantage is that we build the model on only a fraction of the data set, possibly leaving out some interesting information about the data and leading to higher bias. Consequently, the test error rate can be highly variable, depending on which observations are included in the training set and which are included in the validation set.

Leave one out cross validation - LOOCV

This method works as follows (an explicit sketch of these steps is shown after the list):

  1. Leave out one data point and build the model on the rest of the data set
  2. Test the model against the data point that is left out at step 1 and record the test error associated with the prediction
  3. Repeat the process for all data points
  4. Compute the overall prediction error by taking the average of all these test error estimates recorded at step 2.
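
For intuition, these steps can be written out as an explicit loop over the observations. The sketch below is a bare-bones illustration of the idea, not a claim about how caret implements it internally.

# Manual LOOCV sketch: leave each observation out in turn
n <- nrow(swiss)
squared.errors <- numeric(n)
for (i in 1:n) {
  fit <- lm(Fertility ~., data = swiss[-i, ])               # train without row i
  pred <- predict(fit, newdata = swiss[i, , drop = FALSE])  # predict the left-out row
  squared.errors[i] <- (swiss$Fertility[i] - pred)^2        # record the test error
}
sqrt(mean(squared.errors))  # LOOCV estimate of the RMSE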

Practical example in R using the caret package:

# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ... 
## Resampling results:
## 
##   RMSE  Rsquared  MAE 
##   7.74  0.613     6.12
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

The advantage of the LOOCV method is that we make use of all data points, reducing potential bias.

However, the process is repeated as many times as there are data points, resulting in a long execution time when n is extremely large.

Additionally, we test the model performance against a single data point at each iteration. This might result in higher variation in the prediction error if some data points are outliers. So, we need a good ratio of test data points, a solution provided by the k-fold cross-validation method.

K-fold cross-validation

The k-fold cross-validation method evaluates the model performance on different subsets of the training data and then calculates the average prediction error rate. The algorithm is as follows (see the sketch after this list for an explicit version):

  1. Randomly split the data set into k subsets (also called folds), for example 5 subsets
  2. Reserve one subset and train the model on all other subsets
  3. Test the model on the reserved subset and record the prediction error
  4. Repeat this process until each of the k subsets has served as the test set.
  5. Compute the average of the k recorded errors. This is called the cross-validation error, and it serves as the performance metric for the model.
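
To see what happens at each of these steps, the folds can also be built and looped over explicitly, for example with the createFolds() function from caret. This is only a sketch; the train() examples below handle the whole procedure automatically.

# Manual 5-fold CV sketch using createFolds() from caret
set.seed(123)
folds <- createFolds(swiss$Fertility, k = 5)  # list of 5 held-out index sets
fold.rmse <- sapply(folds, function(test.idx) {
  fit <- lm(Fertility ~., data = swiss[-test.idx, ])    # train on the other folds
  pred <- predict(fit, newdata = swiss[test.idx, ])     # predict the held-out fold
  sqrt(mean((swiss$Fertility[test.idx] - pred)^2))      # RMSE for this fold
})
mean(fold.rmse)  # cross-validation error: average RMSE over the 5 folds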

K-fold cross-validation (CV) is a robust method for estimating the accuracy of a model.

The most obvious advantage of k-fold CV compared to LOOCV is computational. A less obvious but potentially more important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than does LOOCV (James et al. 2014).

A typical question is how to choose the right value of k.

A lower value of k gives a more biased estimate of the test error and is hence undesirable. On the other hand, a higher value of k is less biased but can suffer from large variability. It is not hard to see that a smaller value of k (say k = 2) takes us towards the validation set approach, whereas a larger value of k (say k = the number of data points) leads us to the LOOCV approach.

In practice, one typically performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

The following example uses 10-fold cross-validation to estimate the prediction error. Make sure to set the seed for reproducibility.

# Define training control
set.seed(123) 
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 43, 42, 42, 41, 43, 41, ... 
## Resampling results:
## 
##   RMSE  Rsquared  MAE 
##   7.38  0.751     6.03
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Repeated K-fold cross-validation

The process of splitting the data into k folds can be repeated a number of times; this is called repeated k-fold cross-validation.

The final model error is taken as the mean error from the number of repeats.

The following example uses 10-fold cross validation with 3 repeats:

# Define training control
set.seed(123)
train.control <- trainControl(method = "repeatedcv", 
                              number = 10, repeats = 3)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)

Discussion

In this chapter, we described 4 different methods for assessing the performance of a model on unseen test data.

These methods include: validation set approach, leave-one-out cross-validation, k-fold cross-validation and repeated k-fold cross-validation.

We generally recommend (repeated) k-fold cross-validation to estimate the prediction error rate. It can be used in both regression and classification settings.

An alternative to cross-validation is the bootstrap resampling method (Chapter @ref(bootstrap-resampling)), which consists of repeatedly and randomly selecting a sample of n observations from the original data set (with replacement) and evaluating the model performance on each copy.
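
If you want to try the bootstrap with the same caret workflow before reading that chapter, trainControl() also accepts a bootstrap resampling scheme; the sketch below assumes 100 bootstrap resamples.

# Bootstrap resampling (100 resamples) with caret
set.seed(123)
train.control <- trainControl(method = "boot", number = 100)
model <- train(Fertility ~., data = swiss, method = "lm",
               trControl = train.control)
print(model)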

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.