Building Linear Regression (Least Squares) with Linear Algebra (2024)

A purely linear-algebra-based approach to solving the linear regression problem using Excel or numpy.

With so many sophisticated packages in Python and R at our disposal, we rarely work through the math behind an algorithm each time we fit a set of data points. But it is sometimes useful to learn the math and solve an algorithm from scratch manually, so that we build intuition for what happens in the background. During my coursework for ISB-CBA, one of the statistics lectures involved solving for the intercept, coefficients, and R-square values of multiple linear regression with just matrix multiplication in Excel, using linear algebra. Before that, I had always used statsmodels OLS in Python or the lm() command in R to get the intercept and coefficients, and a glance at the R-square value would tell me how good a fit it was.

In the time since completing my course, I had forgotten how to solve it in Excel, so I wanted to brush up on the concepts and write this post so that it might be useful to others as well.

I have worked through this entire post using numpy in my Kaggle notebook here. Please review and upvote the notebook if you find this post useful!

Let us take a simple linear regression to begin with. We want to find the best fit line through a set of data points: (x1, y1), (x2, y2), … (xn, yn). But what does the best fit mean?

If we can find a slope and an intercept for a single line that passes through all of the data points, then that is the best fit line. But in most cases such a line does not exist! So we settle for finding a line such that, when a segment is drawn parallel to the y-axis from each data point to the regression line (each segment measuring that point's error), the sum of the squares of all such errors is minimized. Simple, eh?

[Figure: data points with vertical error segments drawn to the candidate regression line]

In the diagram, the errors are represented by the red, blue, green, yellow, and purple vertical segments respectively. To formulate this as a matrix problem, consider the linear equation below, where Beta 0 is the intercept and Beta 1 is the slope.

$$ y_i = \beta_0 + \beta_1 x_i, \qquad i = 1, \dots, n $$

To simplify this notation, we absorb Beta 0 into the Beta vector. This is done by adding an extra column of 1's to the X matrix and a corresponding extra entry to the Beta vector. Consequently, the matrix form is:

$$ \mathbf{y} = X\boldsymbol{\beta}, \qquad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} $$

Then the least square matrix problem is:

$$ \min_{\boldsymbol{\beta}} \, \lVert \mathbf{y} - X\boldsymbol{\beta} \rVert^2 $$

Let us consider our initial equation:

$$ X\boldsymbol{\beta} = \mathbf{y} $$

Multiplying both sides by X_transpose matrix:

$$ X^{\mathsf{T}} X \boldsymbol{\beta} = X^{\mathsf{T}} \mathbf{y} $$

Where:

$$ X^{\mathsf{T}} X = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}, \qquad X^{\mathsf{T}} \mathbf{y} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix} $$

so that, as long as $X^{\mathsf{T}} X$ is invertible, the solution is:

$$ \hat{\boldsymbol{\beta}} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} \mathbf{y} $$

Ufff, that is a lot of equations. But it becomes simple enough to follow when we work through a small case below.

For simplicity, we will start with a simple linear regression problem with 4 data points: (1, 1), (2, 3), (3, 3) and (4, 5); that is, X = [1, 2, 3, 4] and y = [1, 3, 3, 5]. When we convert this into the matrix form described above, we get:

$$ X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} 1 \\ 3 \\ 3 \\ 5 \end{bmatrix} $$

$$ X^{\mathsf{T}} X = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix}, \qquad X^{\mathsf{T}} \mathbf{y} = \begin{bmatrix} 12 \\ 36 \end{bmatrix} $$

$$ \hat{\boldsymbol{\beta}} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} \mathbf{y} = \frac{1}{20} \begin{bmatrix} 30 & -10 \\ -10 & 4 \end{bmatrix} \begin{bmatrix} 12 \\ 36 \end{bmatrix} = \begin{bmatrix} 0 \\ 1.2 \end{bmatrix} $$

Here is the numpy code to implement this simple solution:
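A minimal sketch of that solution (the full version lives in the Kaggle notebook): build X with a column of 1's, then solve the normal equations derived above.

```python
import numpy as np

# Data points (1, 1), (2, 3), (3, 3), (4, 5)
x = np.array([1, 2, 3, 4])
y = np.array([1, 3, 3, 5])

# Add a column of 1's for the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal equations: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ (X.T @ y)

print(beta)  # approximately [0, 1.2] -> intercept 0, slope 1.2
```

This matches the hand calculation above: an intercept of 0 and a slope of 1.2.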

Solving multiple linear regression is quite similar to simple linear regression; we follow these 6 steps:

  1. Add a new column at the beginning of the X matrix with all 1’s for the intercept
  2. Take the transpose of the X matrix
  3. Multiply the X transpose and X matrices
  4. Find the inverse of this matrix
  5. Multiply X transpose with the y matrix
  6. Multiply the two results to find the intercept and the coefficients
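The six steps above translate directly into numpy. The feature matrix and prices below are made-up stand-ins for the Kaggle used-car data (hypothetical age, mileage, and engine-size columns), just to show the mechanics:

```python
import numpy as np

# Hypothetical numeric features (age in years, mileage in thousands, engine size in litres)
# and prices in thousands -- stand-ins for the real used-car dataset.
features = np.array([[3.0, 15.0, 1.4],
                     [4.0, 29.0, 2.0],
                     [1.0,  9.0, 1.0],
                     [5.0, 41.0, 1.6],
                     [2.0, 12.0, 1.4]])
prices = np.array([12.5, 14.5, 16.0, 9.5, 14.0])

# Step 1: add a column of 1's at the beginning for the intercept
X = np.column_stack([np.ones(len(prices)), features])

# Steps 2-3: multiply X transpose and X
XtX = X.T @ X

# Step 4: invert that matrix
XtX_inv = np.linalg.inv(XtX)

# Step 5: multiply X transpose with y
Xty = X.T @ prices

# Step 6: multiply the two results; beta[0] is the intercept,
# beta[1:] are the coefficients
beta = XtX_inv @ Xty
```

In practice `np.linalg.lstsq` (or `np.linalg.solve` on the normal equations) is preferred over an explicit inverse for numerical stability, but the explicit steps mirror the Excel calculation.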

For solving multiple linear regression, I have taken a dataset from Kaggle which has prices of used car sales in the UK.

I have manually computed all the calculations in Excel. I took the first 300 rows of the Volkswagen dataset and kept only the numerical variables. The regression gives an R-square score of 0.77.

I urge you to download the Excel workbook and follow the calculations (the formatting for the math font on Google Sheets is not good; you can download and view it in MS Excel for better readability). In the sheet “Explanation” I have matrix-multiplied X transpose and X. This has all the information we need to calculate model parameters like the R-square value.
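For completeness, here is how the R-square value itself falls out once the coefficients are known: it is one minus the ratio of the residual sum of squares to the total sum of squares. Illustrated on the small four-point example from earlier (not the car data):

```python
import numpy as np

# Fit the four-point example via the normal equations
x = np.array([1, 2, 3, 4])
y = np.array([1, 3, 3, 5])
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.inv(X.T @ X) @ (X.T @ y)

# R-square = 1 - SS_res / SS_tot
y_hat = X @ beta
ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_square = 1 - ss_res / ss_tot

print(r_square)  # 0.9 for this tiny example
```

The same formula, applied to the car data, is what produces the 0.77 score mentioned above.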


Please refer to section 3 of the Kaggle notebook here: https://www.kaggle.com/gireeshs/diy-build-linear-regression-with-linear-algebra#Part-3:-Multiple-linear-regression where I have solved this problem using matrix multiplication.
