What is it?

Linear regression is a widely used statistical model for describing the relationship between a dependent variable, which you’re trying to predict, and an independent variable, which you’re basing the prediction on.

In statistics and machine learning, a linear regression model tries to capture the linear relationship between two or more variables, producing an equation that approximates the dependent variable for any point within the known data range.


The statistical model

The statistical model for linear regression is a linear function fitted to pairs of data $(x_i, y_i)$. The dependent variable is $y$, and the independent variable is $x$.

Given $\beta_1$ as the angular coefficient (slope), $\beta_0$ as the linear coefficient (intercept), and $\varepsilon$ as the random error, one can write a simple linear regression model as:

$$y = \beta_0 + \beta_1 x + \varepsilon$$
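
To make the roles of the two coefficients and the error term concrete, here is a minimal sketch that draws data from a simple linear model. The function name `simulate`, the coefficient values, and the choice of a normally distributed error with standard deviation `sigma` are assumptions made for illustration only.

```python
import random

def simulate(x_values, beta0, beta1, sigma, seed=0):
    """Draw y = beta0 + beta1 * x + error, with error ~ Normal(0, sigma).

    The normal error is an assumption of this sketch.
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]

x = [0.0, 1.0, 2.0, 3.0, 4.0]

# With sigma = 0 there is no random error: the points lie exactly on the line.
y_noiseless = simulate(x, beta0=1.0, beta1=2.0, sigma=0.0)

# With sigma > 0 the points scatter around the same line.
y_noisy = simulate(x, beta0=1.0, beta1=2.0, sigma=0.5)

print(y_noiseless)
print(y_noisy)
```

Setting `sigma` to zero recovers the deterministic line, which shows that the error term is the only source of scatter in this model.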


The assumptions of linear regression

A linear regression model relies on four main assumptions:

  • Linearity

    The relationship between $x$ and $y$ must be linear.

  • Homoscedasticity

    The variance of the residual error must be the same for any value of $x$.

  • Independence and Normality

    For any value of $x$, the errors must be i.i.d. (independent and identically distributed) and follow a normal distribution.
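
As a rough illustration of the homoscedasticity assumption, the sketch below compares the spread of the residuals for small and large values of the independent variable. The helper names `mean_square` and `spread_ratio`, the split into two halves, and the sample residuals are all hypothetical. This is a crude diagnostic and not a formal statistical test.

```python
def mean_square(values):
    """Average of the squared values."""
    return sum(v * v for v in values) / len(values)

def spread_ratio(x, residuals):
    """Mean squared residual for the upper half of x divided by the lower half."""
    # Order the residuals by their x value, then split them into two halves.
    ordered = [r for _, r in sorted(zip(x, residuals))]
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[-half:]
    return mean_square(high) / mean_square(low)

# Hypothetical residuals whose magnitude grows with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
residuals = [0.1, -0.1, 0.1, -0.3, 0.3, -0.3]

ratio = spread_ratio(x, residuals)
print(ratio)
```

A ratio close to 1 is consistent with a constant variance, while a ratio far from 1, as in this example, suggests that the spread of the errors changes with the independent variable.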

Making data linear

Although it may not be the best choice, data that violates these assumptions can sometimes be transformed so that it satisfies them. This process is called data linearization.
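
One common case, shown here as a minimal sketch, is an exponential relationship $y = a e^{bx}$: taking the logarithm of both sides gives $\ln y = \ln a + b x$, which is linear in $x$. The data, the constants `a = 2` and `b = 0.5`, and the helper `fit_line` are assumptions chosen for illustration.

```python
import math

def fit_line(x, y):
    """Least-squares slope and intercept for paired samples."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data that follows y = 2 * exp(0.5 * x) exactly.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# Taking logs turns the curve into a line: ln(y) = ln(a) + b * x.
log_y = [math.log(yi) for yi in y]
b, log_a = fit_line(x, log_y)
a = math.exp(log_a)  # transform the intercept back to the original scale

print(a, b)
```

The fit is done on the transformed data, so the error assumptions also apply on the log scale and not on the original one.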


Estimating the model parameters

Several methods can be used to determine the best-fitting line for the data. The most widely used is the least squares method, also called OLS (ordinary least squares).

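The idea is to estimate the parameters that minimize the errors $\varepsilon_i$, so that the fitted line is as close as possible to the observed values. For $n$ observations, with linear coefficient $\beta_0$ and angular coefficient $\beta_1$, the sum of squared errors is given by:

$$S = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$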
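
The sum of squared errors can be computed directly for any candidate line, which makes it easy to compare two lines on the same data. The sketch below assumes a hypothetical helper `sum_squared_errors` and made-up sample points.

```python
def sum_squared_errors(x, y, beta0, beta1):
    """Sum of squared vertical distances between the data and the line."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical sample data.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 10.0]

# Two candidate lines with the same slope but different intercepts.
sse_a = sum_squared_errors(x, y, beta0=1.0, beta1=2.0)
sse_b = sum_squared_errors(x, y, beta0=0.0, beta1=2.0)

print(sse_a, sse_b)  # prints: 1.0 7.0
```

The first line has the smaller sum of squared errors, so by the least squares criterion it fits these points better than the second.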

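To find the values of $\beta_0$ and $\beta_1$ that minimize the sum of squared errors $S$, we can use derivatives to find the critical points, setting both partial derivatives to zero:

$$\frac{\partial S}{\partial \beta_0} = 0, \qquad \frac{\partial S}{\partial \beta_1} = 0$$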

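Applying the chain rule to each squared term $(y_i - \beta_0 - \beta_1 x_i)^2$ gives:

$$\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right) = 0$$

$$\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i \left(y_i - \beta_0 - \beta_1 x_i\right) = 0$$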

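Solving for the angular coefficient $\beta_1$, with $\bar{x}$ and $\bar{y}$ denoting the sample means, this simplifies to:

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$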

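The linear coefficient $\beta_0$ is easier to calculate, using the linear regression model equation itself evaluated at the sample means $\bar{x}$ and $\bar{y}$:

$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$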
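
The closed-form estimates can be sketched in a few lines of plain Python: the slope is the sum of the products of the deviations from the means divided by the sum of the squared deviations of the independent variable, and the intercept follows from the sample means. The function name `fit_simple_ols` and the sample data are assumptions for illustration.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line for paired samples."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of products of deviations over sum of squared x deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    # Intercept: the fitted line passes through the point of sample means.
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical sample data.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 10.0]

intercept, slope = fit_simple_ols(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(intercept, slope)
```

For this data the fitted line is approximately $y = 0.5 + 2.3x$, and the residuals sum to zero up to rounding, as the derivative condition for the intercept requires.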


Evaluating a linear regression model

To evaluate a linear regression model, the most used metric is the coefficient of determination, referred to as $R^2$, which measures the proportion of the variance in the dependent variable that is explained by the model.

Other metrics, such as mean squared error (MSE) or mean absolute error (MAE), can also be used to assess goodness of fit.
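
These metrics can be computed from the observed values and the model's predictions alone. The sketch below uses the usual definition $R^2 = 1 - SS_{res} / SS_{tot}$, a hypothetical helper `regression_metrics`, and made-up values.

```python
def regression_metrics(y_true, y_pred):
    """Return R^2, mean squared error and mean absolute error."""
    n = len(y_true)
    y_bar = sum(y_true) / n
    # Residual sum of squares: variation the model leaves unexplained.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total sum of squares: variation of the data around its mean.
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": ss_res / n,
        "mae": sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / n,
    }

# Hypothetical observed values and predictions from a fitted line.
y_true = [3.0, 5.0, 7.0, 10.0]
y_pred = [2.8, 5.1, 7.4, 9.7]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

An $R^2$ close to 1 means the line accounts for most of the variation in the data. MSE and MAE report the typical size of the errors, in squared and original units of the dependent variable respectively.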


Other reference