Linear Regression Basics for Machine Learning

Linear regression is a must-know algorithm for any aspiring data scientist and machine learning practitioner. It’s a simple yet useful algorithm in both statistics and machine learning.

If you wanted to predict the weight of a fish based on its width (for whatever reason), you could probably use linear regression.

While linear regression is a fairly well-known and important algorithm, the unfortunate fact is that most learners jump straight into neural networks to build their machine learning models, when in fact something as simple as linear regression (or a bunch of if statements) would probably do.

Use a simple tool for a simple problem; don’t go over-engineering and over-complicating things.

In this article, we’ll go through the basics of linear regression, introduce ordinary least squares, and implement simple and multiple linear regression using Sklearn.

What is Linear Regression?

Linear regression is a method to model the relationship between two variables by fitting a linear equation over the data (as we can see above).

Its job is to predict the output variable, $$y$$, based on the input variable, $$x$$. Graphically, that means drawing the straight line through a bunch of dots that best fits the observations.

Okay, but how can we draw this straight line and find the best fit?

Introducing Ordinary Least Squares

While there are various ways to calculate the best fitting line in linear regression, the most popular way is ordinary least squares (OLS).

But let’s understand something first, and you’ve probably seen this before,

$y = mx + c$

$$y = mx + c$$ is the equation of a straight line, where $$m$$ is the slope of the line and $$c$$ is the intercept. This is the straight line we want to fit to the observations we have.

In ordinary least squares, we find the best fit of $$y = mx + c$$ by choosing the $$m$$ and $$c$$ that minimise the sum of the squared differences between the actual $$y$$ values and the predicted $$y$$ values.
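Written out, with $$(x_i, y_i)$$ as the $$i$$-th observation and $$N$$ as the number of observations (the symbol $$S$$ is introduced here purely for illustration), the quantity being minimised is:

$S(m, c) = \sum_{i=1}^{N} \left( y_i - (m x_i + c) \right)^2$

Each term is the squared vertical distance between an observed point and the line, so large misses are penalised more heavily than small ones.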

Clear as mud?

Okay, so let’s just say we have a bunch of fishies. We want to start predicting their weight because our weighing machine is broken. Fortunately, we have a ruler and 5 examples of fish weight (g) and width (cm). We can write this example dataset as:

$\{(\text{weight}_i, \text{width}_i)\}_{i=1}^{5} = \{(120, 3.5), (390, 5.4), (340, 4.7), (363, 4.5), (420, 5.1)\}$

Because we have these examples as our data points, we can take width as $$x$$ and weight as $$y$$, with $$N = 5$$, and calculate the slope, $$m$$, of our linear equation using the following formula:

$\begin{aligned} m &= \frac{N \sum(xy) \,-\, \sum x \sum y}{N \sum(x^2) \,-\, (\sum x)^2} \\ &= \frac{5 \cdot 7899.5 \,-\, 23.2 \cdot 1633}{5 \cdot 109.76 \,-\, 538.24} \\ &= 152.64205 \end{aligned}$

And with the slope, we can calculate the intercept, $$c$$ by using:

$c = \frac{\sum y \,-\, m\sum x}{N} = \frac{1633 \,-\, 152.64205 \cdot 23.2}{5} = -381.6591$

Now that we have all our variables, we can assemble them into our linear equation, $$y = mx + c$$ and plot our line of best fit.

$y = 152.64205x \,-\, 381.6591$
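As a sanity check, the arithmetic above can be reproduced in a few lines of plain Python. This is only a sketch of the closed-form calculation on the five example fish; the variable names and the `predict_weight` helper are made up for illustration.

```python
# Toy dataset from the text: (weight in g, width in cm) for five fish.
fish = [(120, 3.5), (390, 5.4), (340, 4.7), (363, 4.5), (420, 5.1)]

# Width is the input x, weight is the output y.
x = [width for _, width in fish]
y = [weight for weight, _ in fish]
n = len(fish)

# The four sums that appear in the slope and intercept formulas.
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Closed-form ordinary least squares solution for a single input variable.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
c = (sum_y - m * sum_x) / n


def predict_weight(width_cm):
    """Predict weight (g) from width (cm) using the fitted line (hypothetical helper)."""
    return m * width_cm + c


print(f"slope m = {m:.5f}")
print(f"intercept c = {c:.4f}")
```

Calling `predict_weight(5.0)` then gives the weight the fitted line predicts for a fish that is 5 cm wide.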

Now that we understand how to mathematically calculate the line of best fit using ordinary least squares, let’s move on to some practical stuff and start modelling using Sklearn.

Simple Linear Regression using Sklearn

Exploratory Data Analysis

To get started with linear regression for machine learning purposes, we’ll use the Fish Market dataset found on Kaggle.

The dataset contains several fish species found in fish market sales. It includes features such as the height, lengths, and width of each fish, which we can use to estimate its weight.

We can start off by importing the modules we need and reading our data:
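A minimal sketch of that loading step with pandas, assuming the Kaggle file has been saved locally as `Fish.csv` (a hypothetical path) and uses the column names listed in `EXPECTED_COLUMNS` (also an assumption, so adjust them to match your copy). `load_fish_data` is a helper name made up for illustration.

```python
import pandas as pd

# Column names assumed from the Kaggle Fish Market CSV; adjust if your copy differs.
EXPECTED_COLUMNS = ["Species", "Weight", "Length1", "Length2", "Length3", "Height", "Width"]


def load_fish_data(csv_source="Fish.csv"):
    """Read the fish market data and check that the expected columns are present."""
    df = pd.read_csv(csv_source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df
```

With the file in place, `df = load_fish_data("Fish.csv")` followed by `df.head()` shows the first few rows.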

As we can see, our data contains 7 variables. They are: species, weight (g), length1 (vertical length in cm), length2 (diagonal length in cm), length3 (cross length in cm), height (cm), and width (diagonal width in cm).

One way to find out if there’s any correlation in our data and whether they have a linear relationship is by creating a pair plot. It’s basically a plot that visually compares variables against one another.

As we can see from our plot, it does seem like there’s somewhat of a linear relationship between weight and the other variables.

We could also use Seaborn’s heatmap to get a numerical representation of the correlation between the variables.
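A minimal sketch of that step, assuming the data is in a pandas DataFrame. The helper names and the styling arguments (`fmt`, `cmap`, the colour range) are illustrative choices, not settings taken from the article.

```python
import pandas as pd
import seaborn as sns


def correlation_matrix(df):
    """Pearson correlation between the numeric columns of df."""
    return df.select_dtypes(include="number").corr()


def plot_correlation_heatmap(df):
    """Draw the correlation matrix as an annotated heatmap and return the Axes."""
    corr = correlation_matrix(df)
    # annot=True writes each correlation value inside its cell.
    return sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm")
```

Values close to 1 or -1 indicate a strong linear relationship between two variables, while values near 0 indicate a weak one.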

From our heatmap, we can see that Length 1, 2, and 3 have a high correlation of 0.92 with Weight. As for Width and Height, not as much, but still a decent correlation with 0.89 and 0.72 respectively.

Now that we’ve done some basic exploration around our data, we can start building our linear regression model.

Linear Regression Modelling

For our model, we’ll use Width as our input variable to estimate Weight. We’ll also use the coefficient of determination, $$R^2$$, to tell us the goodness of fit of our model.
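A minimal sketch of such a model with Sklearn’s `LinearRegression`. The train/test split, its `test_size` and `random_state` values, and the `fit_simple_regression` helper are all assumptions made for illustration, so the score it returns need not match the figure reported in this article.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def fit_simple_regression(df, feature="Width", target="Weight",
                          test_size=0.2, random_state=42):
    """Fit y = m*x + c on a training split and return (model, R^2 on the test split)."""
    X = df[[feature]]  # double brackets keep X two-dimensional, as Sklearn expects
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    r2 = model.score(X_test, y_test)  # score() returns R^2 for regressors
    return model, r2
```

After fitting, `model.coef_[0]` holds the slope $$m$$ and `model.intercept_` holds the intercept $$c$$.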

Interestingly, we managed to achieve an $$R^2$$ of 0.726, which tells us that 72.6% of the variation in the $$y$$ values is accounted for by the $$x$$ values.

However, $$R^2$$ alone does not tell us whether the coefficients and predictions are biased.

So, we can check for bias by plotting the residuals against the fitted values.

A residual plot helps us determine whether there is bias in our model. If our model is unbiased, the residuals will be scattered around 0 with no obvious pattern.
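A minimal sketch of such a plot using Matplotlib and NumPy (an assumed choice of libraries; `plot_residuals` is a hypothetical helper). It takes the actual values and the model’s predictions, and defines each residual as actual minus predicted.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_residuals(y_true, y_pred):
    """Scatter residuals against fitted values and return (Axes, residuals)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    fig, ax = plt.subplots()
    ax.scatter(y_pred, residuals, alpha=0.7)
    ax.axhline(0, color="black", linewidth=1)  # reference line at zero residual
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("Residuals")
    ax.set_title("Residuals vs fitted values")
    return ax, residuals
```

For a fitted Sklearn model, the predictions would come from `model.predict(X)`; calling `plt.show()` afterwards displays the figure.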

As we can see from our residual plot, most of our residuals are situated around 0. Moreover, there’s no obvious pattern in the residuals, which means that our model is quite unlikely to be biased.

We can also see two outliers in the plot, but let’s just leave them for now.

Note: We have to be careful when dealing with outliers, because each one can have a strong influence on our linear regression line. Whether we drop the outliers here or not is an article for another day, but do check out the recommended further reading section for an article that explains when to drop an outlier.

Multiple Linear Regression

We did achieve a rather high $$R^2$$, but can we do better?

We could try to achieve a better fit by creating a multiple linear regression model. This works similarly to a simple linear regression model, but instead of using just one input variable, we’ll fit multiple variables to the linear regression.

So, let’s try fitting all the continuous variables into our model.
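A minimal sketch of that model, again with Sklearn’s `LinearRegression`. The column names in `CONTINUOUS_FEATURES`, the train/test split settings, and the `fit_multiple_regression` helper are assumptions made for illustration, so the resulting score need not match the one reported in this article.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumed names of the continuous columns; adjust to match your copy of the data.
CONTINUOUS_FEATURES = ["Length1", "Length2", "Length3", "Height", "Width"]


def fit_multiple_regression(df, features=CONTINUOUS_FEATURES, target="Weight",
                            test_size=0.2, random_state=42):
    """Fit a linear model on several inputs and return (model, R^2 on the test split)."""
    X = df[list(features)]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    r2 = model.score(X_test, y_test)  # R^2 on data the model has not seen
    return model, r2
```

The fitted `model.coef_` then holds one coefficient per input variable, in the same order as `features`.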

By fitting more of our variables into the linear regression, we managed to achieve a higher $$R^2$$ score of 0.863!

While we managed to achieve a better score by fitting more variables, we do have to remember that fitting unnecessary variables adds noise. On the data used for fitting, $$R^2$$ never goes down when a variable is added, but on unseen data the extra noise can lead to overfitting and a lower $$R^2$$ score.

Summary

In summary, we learned what linear regression is, introduced ordinary least squares to find the line of best fit, and implemented simple and multiple linear regression.

While implementing a linear regression model using Sklearn was fairly straightforward, the mathematics behind it might be slightly difficult for anyone new to it.

Another important point is that we should use linear regression only when the relationship between variables in the data is linear. For other types of relationships, we should be mindful of which model should be used so that we don’t break any hearts.