Nonlinear Regression Essentials in R: Polynomial and Spline Regression Models

kassambara | 11/03/2018 | 137981 | Comments (9) | Regression Analysis

In some cases, the true relationship between the outcome and a predictor variable might not be linear.

There are different solutions extending the linear regression model (Chapter @ref(linear-regression)) for capturing these nonlinear effects, including:

Polynomial regression. This is a simple approach to modeling non-linear relationships. It adds polynomial terms (squares, cubes, etc.) to a regression.
Spline regression. Fits a smooth curve with a series of polynomial segments. The values delimiting the spline segments are called knots.
Generalized additive models (GAM). Fits spline models with automated selection of knots.

In this chapter, you’ll learn how to compute non-linear regression models and how to compare the different models in order to choose the one that best fits your data.

The RMSE and the R2 metrics will be used to compare the different models (see Chapter @ref(linear-regression)).

Recall that the RMSE represents the model prediction error, that is, the typical difference between the observed outcome values and the predicted outcome values (more precisely, the square root of the mean squared difference). The R2 represents the squared correlation between the observed and predicted outcome values. The best model is the one with the lowest RMSE and the highest R2.
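
To make the two metrics concrete, here is a minimal base-R sketch that computes them by hand. The observed and predicted values are made up for illustration, and the variable names (observed, predicted, rmse, r2) are hypothetical. The chapter itself relies on the RMSE() and R2() helpers from caret.

```r
# Toy observed and predicted values (made up for illustration only)
observed  <- c(24.0, 21.6, 34.7, 33.4, 36.2)
predicted <- c(25.1, 22.0, 31.9, 30.2, 33.8)

# RMSE: square root of the mean squared difference
rmse <- sqrt(mean((observed - predicted)^2))

# R2 as described in this chapter: squared correlation
# between the observed and the predicted values
r2 <- cor(observed, predicted)^2

data.frame(RMSE = rmse, R2 = r2)
```

A lower RMSE means predictions are closer to the observed values on average, while an R2 closer to 1 means predictions track the observed values more tightly.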

Contents:

Loading Required R packages
Preparing the data
Linear regression
Polynomial regression
Log transformation
Spline regression
Generalized additive models
Comparing the models
Discussion
References

Loading Required R packages

tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow

library(tidyverse)
library(caret)
theme_set(theme_classic())

Preparing the data

We’ll use the Boston data set [in MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, based on the predictor variable lstat (percentage of lower status of the population).

We’ll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.

# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
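
For readers who want to see the idea behind the split without caret, the following is a rough base-R sketch. It uses simple random sampling, which is not the same procedure as createDataPartition(), and it runs on the built-in cars data set as a stand-in. The names train.toy and test.toy are hypothetical.

```r
# Base-R sketch of an 80/20 split (simple random sampling, not stratified)
set.seed(123)
n <- nrow(cars)  # `cars` is a built-in data set, used here only as a stand-in

# Draw 80% of the row indices for training; the rest go to the test set
train.idx <- sample(seq_len(n), size = floor(0.8 * n))
train.toy <- cars[train.idx, ]
test.toy  <- cars[-train.idx, ]

c(train = nrow(train.toy), test = nrow(test.toy))
```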

First, visualize the scatter plot of medv versus lstat as follows:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth()

The above scatter plot suggests a non-linear relationship between the two variables.

In the following sections, we start by computing linear and non-linear regression models. Next, we’ll compare the different models in order to choose the best one for our data.

Linear regression

The standard linear regression model equation can be written as medv = b0 + b1*lstat.

Compute linear regression model:

# Build the model
model <- lm(medv ~ lstat, data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)

##   RMSE    R2
## 1 6.07 0.535

Visualize the data:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ x)

Polynomial regression

Polynomial regression adds polynomial terms, such as a quadratic term, to the regression equation as follows:

\[medv = b0 + b1*lstat + b2*lstat^2\]

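In R, to create a predictor x^2 you should use the function I(), as follows: I(x^2). This raises x to the power 2. The wrapper is needed because, inside a model formula, the ^ operator is interpreted as a formula operator (crossing of terms) rather than as arithmetic.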
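
The effect of I() can be checked by looking at the design matrix R builds from a formula. This is a small sketch on made-up data; the data frame toy and the names cols.plain and cols.I are hypothetical.

```r
# Made-up data, for illustration only
toy <- data.frame(x = 1:5, y = c(2.1, 3.9, 9.2, 15.8, 25.3))

# Without I(), x^2 is treated as a formula operation and reduces to x
cols.plain <- colnames(model.matrix(y ~ x + x^2, data = toy))

# With I(), the squared values are added as a separate predictor column
cols.I <- colnames(model.matrix(y ~ x + I(x^2), data = toy))

cols.plain
cols.I
```

Only the second formula yields a column for the squared term, which is why the quadratic model has to be written with I(x^2) or poly().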

The polynomial regression can be computed in R as follows:

lm(medv ~ lstat + I(lstat^2), data = train.data)

A simple alternative is to use poly():

lm(medv ~ poly(lstat, 2, raw = TRUE), data = train.data)

## 
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = train.data)
## 
## Coefficients:
##                 (Intercept)  poly(lstat, 2, raw = TRUE)1  
##                      43.351                       -2.340  
## poly(lstat, 2, raw = TRUE)2  
##                       0.043

The output contains two coefficients associated with lstat : one for the linear term (lstat^1) and one for the quadratic term (lstat^2).
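
The role of raw = TRUE can be illustrated with a short sketch on simulated data (the data frame toy and the model names are hypothetical). With raw = TRUE, poly() gives the same coefficients as the I(x^2) formulation. Without it, poly() uses orthogonal polynomials, so the coefficients differ, but the fitted curve is the same.

```r
# Simulated data with a quadratic trend (for illustration only)
set.seed(1)
toy <- data.frame(x = runif(50, 1, 10))
toy$y <- 3 + 2 * toy$x - 0.3 * toy$x^2 + rnorm(50, sd = 0.5)

fit.I    <- lm(y ~ x + I(x^2), data = toy)
fit.raw  <- lm(y ~ poly(x, 2, raw = TRUE), data = toy)
fit.orth <- lm(y ~ poly(x, 2), data = toy)  # orthogonal polynomials (the default)

# Raw polynomial coefficients match the I(x^2) formulation
all.equal(unname(coef(fit.raw)), unname(coef(fit.I)))

# Orthogonal coefficients are on a different scale...
unname(coef(fit.orth))

# ...but both parameterizations give the same fitted values
all.equal(fitted(fit.raw), fitted(fit.orth))
```

In other words, the choice between raw and orthogonal polynomials changes how the coefficients read, not the predictions.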

The following example computes a sixth-order polynomial fit:

lm(medv ~ poly(lstat, 6, raw = TRUE), data = train.data) %>%
  summary()

## 
## Call:
## lm(formula = medv ~ poly(lstat, 6, raw = TRUE), data = train.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14.23  -3.24  -0.74   2.02  26.50 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  7.14e+01   6.00e+00   11.90  < 2e-16 ***
## poly(lstat, 6, raw = TRUE)1 -1.45e+01   3.22e+00   -4.48  9.6e-06 ***
## poly(lstat, 6, raw = TRUE)2  1.87e+00   6.26e-01    2.98    0.003 ** 
## poly(lstat, 6, raw = TRUE)3 -1.32e-01   5.73e-02   -2.30    0.022 *  
## poly(lstat, 6, raw = TRUE)4  4.98e-03   2.66e-03    1.87    0.062 .  
## poly(lstat, 6, raw = TRUE)5 -9.56e-05   6.03e-05   -1.58    0.114    
## poly(lstat, 6, raw = TRUE)6  7.29e-07   5.30e-07    1.38    0.170    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.28 on 400 degrees of freedom
## Multiple R-squared:  0.684,  Adjusted R-squared:  0.679 
## F-statistic:  144 on 6 and 400 DF,  p-value: <2e-16

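From the output above, it can be seen that the sixth-order term is not significant (p = 0.170). The fourth- and fifth-order terms are not significant at the 5% level in this fit either, so a lower order could also be considered. Here, we drop only the sixth-order term and create a fifth-order polynomial regression model as follows: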

# Build the model
model <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)

##   RMSE    R2
## 1 4.96 0.689

Visualize the fifth-order polynomial regression line as follows:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ poly(x, 5, raw = TRUE))
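
In the spirit of the test-set metrics used in this chapter, another way to sanity-check the polynomial degree is to fit several degrees and compare their test RMSE. The sketch below does this on simulated data, not on the Boston data; the objects toy, train.toy, test.toy and test.rmse are hypothetical names.

```r
# Simulated data with a quadratic trend (for illustration only)
set.seed(42)
n <- 200
toy <- data.frame(x = runif(n, 1, 30))
toy$y <- 45 - 3 * toy$x + 0.08 * toy$x^2 + rnorm(n, sd = 3)

# 80/20 split by simple random sampling
train.idx <- sample(seq_len(n), size = round(0.8 * n))
train.toy <- toy[train.idx, ]
test.toy  <- toy[-train.idx, ]

# Fit polynomial degrees 1 to 6 on the training set
# and compute the RMSE of each fit on the test set
degrees <- 1:6
test.rmse <- sapply(degrees, function(d) {
  fit  <- lm(y ~ poly(x, d, raw = TRUE), data = train.toy)
  pred <- predict(fit, newdata = test.toy)
  sqrt(mean((test.toy$y - pred)^2))
})

data.frame(degree = degrees, RMSE = test.rmse)
```

A degree beyond which the test RMSE stops improving is a reasonable candidate; adding further terms mostly adds variance.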

Log transformation

When you have a non-linear relationship, you can also try a logarithmic transformation of the predictor variable:

# Build the model
model <- lm(medv ~ log(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)

##   RMSE    R2
## 1 5.24 0.657

Visualize the data:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ log(x))

Spline regression

Polynomial regression only captures a certain amount of curvature in a nonlinear relationship. An alternative, and often superior, approach to modeling nonlinear relationships is to use splines (P. Bruce and Bruce 2017).

Splines provide a way to smoothly interpolate between fixed points, called knots. Polynomial regression is computed between knots. In other words, splines are series of polynomial segments strung together, joining at knots (P. Bruce and Bruce 2017).

The R package splines includes the function bs() for creating a B-spline term in a regression model.

You need to specify two parameters: the degree of the polynomial and the location of the knots. In our example, we’ll place the knots at the lower quartile, the median, and the upper quartile:

knots <- quantile(train.data$lstat, probs = c(0.25, 0.5, 0.75))

We’ll create a model using a cubic spline (degree = 3):

library(splines)
# Build the model
knots <- quantile(train.data$lstat, probs = c(0.25, 0.5, 0.75))
model <- lm(medv ~ bs(lstat, knots = knots), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)

##   RMSE    R2
## 1 4.97 0.688

Note that the coefficients for a spline term are not directly interpretable.
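
To see what bs() actually adds to the model, it can help to inspect the basis matrix it builds. The following sketch uses simulated x values (hypothetical, not the Boston data). A cubic B-spline with three interior knots produces six basis columns, and asking for df = 6 leads bs() to place its three knots at the same quartiles.

```r
library(splines)

# Simulated predictor values (for illustration only)
set.seed(7)
x <- runif(100, 1, 30)
knots <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Cubic B-spline basis with explicit knots (degree = 3 is the default)
basis.knots <- bs(x, knots = knots)

# Same basis requested through the degrees of freedom instead
basis.df <- bs(x, df = 6)

dim(basis.knots)  # 100 rows, 6 columns: 3 knots + degree 3

# The knots chosen by bs(x, df = 6) are the quartiles of x
all.equal(unname(attr(basis.df, "knots")), unname(knots))
```

Each of the six columns gets its own regression coefficient, which is why the individual coefficients say little on their own.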

Visualize the cubic spline as follows:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  # df = 6 gives a cubic spline with three interior knots at the quartiles of x
  stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 6))

Generalized additive models

Once you have detected a non-linear relationship in your data, the polynomial terms may not be flexible enough to capture the relationship, and spline terms require specifying the knots.

Generalized additive models, or GAM, are a technique to automatically fit a spline regression. This can be done using the mgcv R package:

library(mgcv)
# Build the model
model <- gam(medv ~ s(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)

##   RMSE    R2
## 1 5.02 0.684

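The term s(lstat) tells the gam() function to fit a smooth spline term for lstat and to choose how flexible it should be automatically, so you do not have to pick the “best” knots yourself.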
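
How flexible the fitted smooth turned out to be can be read from its effective degrees of freedom (edf). The sketch below uses simulated data (hypothetical names toy, fit, fit.small) and compares the default smooth with one whose basis size is deliberately restricted through the k argument of s().

```r
library(mgcv)

# Simulated data with a wavy trend (for illustration only)
set.seed(3)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

# Default smooth: gam() chooses the amount of smoothing itself
fit <- gam(y ~ s(x), data = toy)

# Same model with a deliberately small basis (k = 4), which caps the flexibility
fit.small <- gam(y ~ s(x, k = 4), data = toy)

# Effective degrees of freedom of the smooth term in each model
edf.default <- summary(fit)$s.table[1, "edf"]
edf.small   <- summary(fit.small)$s.table[1, "edf"]
c(default = edf.default, small = edf.small)
```

An edf close to 1 corresponds to a nearly straight line, while larger values indicate a wigglier curve.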

Visualize the data:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = gam, formula = y ~ s(x))

Comparing the models

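From the RMSE and R2 metrics of the different models, it can be seen that the polynomial regression, the spline regression and the generalized additive model (RMSE between 4.96 and 5.02, R2 between 0.684 and 0.689) outperform the linear regression model (RMSE = 6.07, R2 = 0.535) and the log transformation approach (RMSE = 5.24, R2 = 0.657). The differences among the three best models are small.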
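
For a side-by-side view, the test-set metrics reported in the sections above can be gathered in one table. The values below are copied from the outputs shown in this chapter; the data frame name results is hypothetical.

```r
# Test-set metrics as reported in the sections above
results <- data.frame(
  model = c("Linear", "Polynomial (degree 5)", "Log transformation",
            "Cubic spline", "GAM"),
  RMSE  = c(6.07, 4.96, 5.24, 4.97, 5.02),
  R2    = c(0.535, 0.689, 0.657, 0.688, 0.684)
)

# Sort from the lowest to the highest prediction error
results[order(results$RMSE), ]
```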

Discussion

This chapter describes how to compute non-linear regression models using R.

References

Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.

Last update : 19/05/2018

Comments

Comment

Tarun pant

Member

#862 05/31/2020 at 09h39

How can we predict the future with this polynomial regression? Can you please explain how to predict future values for the above data (medv or lstat)?
Thank you

Comment

Sophie Gooch

Member

#825 07/01/2019 at 09h50

Review that, the RMSE speaks to the model expectation blunder, that is the normal distinction the watched result esteems and the anticipated result esteems.

Comment

Visitor

#763 04/27/2019 at 11h44

Thank you for this article. I would like to ask:
- Are polynomial regression, additive models and spline regression also interpreted like a linear regression? If not, how are the coefficients interpreted for these different analyses?
- For the example given, if we have to choose one final model type to carry out the analysis, which one should I pick as the best?

Comment

Visitor

#762 04/27/2019 at 11h42

Thank you for this article. I would like to ask:
- Are polynomial regression, additive models and spline regression also interpreted like a linear regression? If not, how are the coefficients interpreted for these different analyses?
- For the example given, if we were to choose one final model type to carry out the analysis, which one would it be?

Comment

Adrien

Visitor

#651 11/23/2018 at 14h11

Good article thank you!

But there is a major confusion here, polynomial models are linear models. They just use polynomial features. Same thing for log transformed models which use features on the log space but are still linear.

See for example "http://blog.minitab.com/blog/adventures-in-statistics-2/what-is-the-difference-between-linear-and-nonlinear-equations-in-regression-analysis" to get an explanation of the difference.

Comment

kassambara

Administrator

#488 05/19/2018 at 15h13

Thank you for your comment.

1)

The article has now been updated to take your comment into account. You should use raw = TRUE; otherwise, orthogonal polynomials will be used instead of the standard (raw) polynomial terms. See the discussion on Stack Overflow.

2)

When comparing models, the best model is defined as the one with the lowest prediction error on a test set that has not been used to train the model.

Comment

tomer mann

Member

#464 05/12/2018 at 17h41

thank you for another informative tutorial.
I have 2 questions:
1. Regarding the question posted by Visitor: when you calculate the ^2 polynomial, you use raw = TRUE, but for higher-degree polynomials, such as ^5, this argument is not used. Why? And what is the meaning of raw = TRUE?
2. We keep comparing the performance of models by predicting on the test set, but to my understanding, we are not allowed to select models based on test set performance, because then the test set becomes part of the training set in a way. Is this true? And if so, how do you pick the best model? Performance on the training set? Something else?
thank you!

Comment

kassambara

Administrator

#402 03/25/2018 at 23h45

Hi,

Thanks for your comment. The article has now been updated. Please use the argument raw = TRUE.

Code R :

lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = train.data)

Comment

Visitor

#400 03/23/2018 at 14h33

lm(medv ~ lstat + I(lstat^2), data = train.data) and lm(medv ~ poly(lstat, 2), data = train.data) , as it is said that can be used anyways, but the output is different. Why is it so?