# Articles - Classification Methods Essentials

Previously, we have described the regression model (Chapter @ref(regression-analysis)), which is used to predict a quantitative or continuous outcome variable based on one or multiple predictor variables.

In classification, the outcome variable is qualitative (or categorical). Classification refers to a set of machine learning methods for predicting the class (or category) of individuals on the basis of one or multiple predictor variables.

In this part, we’ll cover the following topics:

• Logistic regression, for binary classification tasks (Chapter @ref(logistic-regression))
• Stepwise and penalized logistic regression for variable selections (Chapter @ref(stepwise-logistic-regression) and @ref(penalized-logistic-regression))
• Logistic regression assumptions and diagnostics (Chapter @ref(logistic-regression-assumptions-and-diagnostics))
• Multinomial logistic regression, an extension of the logistic regression for multiclass classification tasks (Chapter @ref(multinomial-logistic-regression)).
• Discriminant analysis, for binary and multiclass classification problems (Chapter @ref(discriminant-analysis))
• Naive bayes classifier (Chapter @ref(naive-bayes-classifier))
• Support vector machines (Chapter @ref(support-vector-machine))
• Classification model evaluation (Chapter @ref(classification-model-evaluation))

Most classification algorithms compute the probability that an observation belongs to each class. Each observation is then assigned to the class with the highest probability score.

Generally, you need to choose a probability cutoff above which an observation is considered to belong to a given class.
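As a minimal sketch in base R, the cutoff rule can be written with `ifelse()`. The probability scores below are made up for illustration; in practice they would come from a fitted model:

```r
# Hypothetical predicted probabilities for five observations
prob <- c(0.10, 0.45, 0.51, 0.80, 0.30)

# Assign the "pos" class when the probability exceeds the 0.5 cutoff
pred_class <- ifelse(prob > 0.5, "pos", "neg")
pred_class
## [1] "neg" "neg" "pos" "pos" "neg"
```

Lowering the cutoff (e.g. to 0.3) makes the classifier more likely to predict the positive class, a common adjustment when the classes are imbalanced.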


## Examples of data set

### PimaIndiansDiabetes2 data set

The Pima Indian Diabetes data set is available in the `mlbench` package. It will be used for binary classification.

```r
# Load the data set
data("PimaIndiansDiabetes2", package = "mlbench")
# Inspect the first rows
head(PimaIndiansDiabetes2, 4)
```

```
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35      NA 33.6    0.627  50      pos
## 2        1      85       66      29      NA 26.6    0.351  31      neg
## 3        8     183       64      NA      NA 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
```

The data contains 768 individuals (all female) and 9 clinical variables for predicting the probability of being diabetes-positive or negative:

• pregnant: number of times pregnant
• glucose: plasma glucose concentration
• pressure: diastolic blood pressure (mm Hg)
• triceps: triceps skin fold thickness (mm)
• insulin: 2-Hour serum insulin (mu U/ml)
• mass: body mass index (weight in kg/(height in m)^2)
• pedigree: diabetes pedigree function
• age: age (years)
• diabetes: class variable
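Note that `PimaIndiansDiabetes2` contains missing values (the `NA`s visible in the output above), which many classifiers cannot handle directly. One simple option is to drop incomplete rows with `na.omit()`. The sketch below uses a tiny stand-in data frame mimicking the first rows, rather than the real data, so it is self-contained:

```r
# Tiny stand-in for the first rows shown above (NA = missing measurement)
df <- data.frame(
  glucose  = c(148, 85, 183, 89),
  insulin  = c(NA, NA, NA, 94),
  diabetes = c("pos", "neg", "pos", "neg")
)

# Keep only complete cases before modeling
complete <- na.omit(df)
nrow(complete)
## [1] 1
```

Dropping rows is the simplest approach but discards information; imputation is often preferable when many values are missing.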

### Iris data set

The `iris` data set will be used for multiclass classification tasks. It contains the length and width of sepals and petals for three iris species. We want to predict the species based on the sepal and petal parameters.

```r
# Load the data
data("iris")
# Inspect the first rows
head(iris, 4)
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
```

## Logistic Regression Essentials in R

Logistic regression is used to predict the class (or category) of individuals based on one or multiple predictor variables (x). It is used to model a binary outcome, that is a variable,... [Read more]
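As a quick illustration of the idea (using the built-in `mtcars` data rather than this book's examples), a logistic regression model is fitted in base R with `glm()` and `family = binomial`:

```r
# Binary outcome: engine shape (vs = 0 or 1) predicted from fuel efficiency
model <- glm(vs ~ mpg, data = mtcars, family = binomial)

# Predicted probability that vs = 1 for a car with mpg = 25
predict(model, newdata = data.frame(mpg = 25), type = "response")
```

With `type = "response"`, `predict()` returns probabilities on the 0-1 scale rather than log-odds.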

## Stepwise Logistic Regression Essentials in R

Stepwise logistic regression consists of automatically selecting a reduced number of predictor variables for building the best performing logistic regression model. Read more at Chapter... [Read more]

## Penalized Logistic Regression Essentials in R: Ridge, Lasso and Elastic Net

When you have multiple variables in your logistic regression model, it might be useful to find a reduced set of variables resulting in an optimally performing model (see Chapter... [Read more]

## Logistic Regression Assumptions and Diagnostics in R

The logistic regression model makes several assumptions about the data. This chapter describes the major assumptions and provides a practical guide, in R, to check whether these assumptions... [Read more]

## Multinomial Logistic Regression Essentials in R

The multinomial logistic regression is an extension of the logistic regression (Chapter @ref(logistic-regression)) for multiclass classification tasks. It is used when the outcome involves... [Read more]

## Discriminant Analysis Essentials in R

Discriminant analysis is used to predict the probability of belonging to a given class (or category) based on one or multiple predictor variables. It works with continuous and/or categorical... [Read more]
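A minimal sketch using linear discriminant analysis from the MASS package (a recommended package that ships with standard R distributions), applied to the `iris` data described above:

```r
library(MASS)  # provides lda()

# Linear discriminant analysis: predict Species from all four measurements
fit <- lda(Species ~ ., data = iris)

# Class predicted for the first observation
predict(fit, iris[1, ])$class
## [1] setosa
## Levels: setosa versicolor virginica
```

The `predict()` output also includes `$posterior`, the per-class probability scores used to make the assignment.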

## Naive Bayes Classifier Essentials

The Naive Bayes classifier is a simple and powerful method that can be used for binary and multiclass classification problems. Naive Bayes classifier predicts the class membership... [Read more]

## SVM Model: Support Vector Machine Essentials

Support Vector Machine (or SVM) is a machine learning technique used for classification tasks. Briefly, SVM works by identifying the optimal decision boundary that separates data points from... [Read more]

## Evaluation of Classification Model Accuracy: Essentials

After building a predictive classification model, you need to evaluate the performance of the model, that is, how well the model predicts the outcome on new test data... [Read more]
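The core of such an evaluation can be sketched in base R with a confusion matrix and overall accuracy. The observed and predicted class vectors below are made up for illustration:

```r
# Hypothetical observed and predicted classes for six test observations
observed  <- c("pos", "neg", "pos", "neg", "pos", "neg")
predicted <- c("pos", "neg", "neg", "neg", "pos", "pos")

# Confusion matrix: rows = predicted class, columns = observed class
table(predicted, observed)

# Overall accuracy: proportion of correct predictions
mean(predicted == observed)
## [1] 0.6666667
```

The off-diagonal cells of the confusion matrix separate the two kinds of error (false positives and false negatives), which overall accuracy alone hides.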