Building Linear Regression (Least Squares) with Linear Algebra (2024)

A purely linear-algebra-based approach to solving the linear regression problem using Excel or numpy.

With so many sophisticated packages in Python and R at our disposal, we rarely work through the math behind an algorithm each time we fit a set of data points. But it is sometimes useful to learn the math and solve an algorithm from scratch manually, so that we build intuition for what happens in the background. During my coursework for ISB-CBA, one of the statistics lectures involved solving for the intercept, coefficients, and R-square values of multiple linear regression with just matrix multiplication in Excel, using linear algebra. Before that, I had always used statsmodels OLS in Python or the lm() command in R to get the intercept and coefficients, and a glance at the R-square value would tell me how good a fit it was.

In the time since completing my course, I had forgotten how to solve it in Excel, so I wanted to brush up on the concepts and write this post so that it might be useful to others as well.

I have worked through this entire post using numpy in my Kaggle notebook here. Please review and upvote the notebook if you find this post useful!

Let us take a simple linear regression to begin with. We want to find the best fit line through a set of data points: (x1, y1), (x2, y2), … (xn, yn). But what does the best fit mean?

If we can find a slope and an intercept for a single line that passes through all of the data points, then that is the best fit line. But in most cases such a line does not exist! So we settle for finding a line such that, when a segment is drawn parallel to the y-axis from each data point to the regression line (each segment measuring that point's error), the sum of the squares of all such errors is minimized. Simple, eh?

[Figure: data points with vertical error segments drawn to the candidate regression line]

In the diagram, the errors are represented by the red, blue, green, yellow, and purple vertical segments respectively. To formulate this as a matrix problem, consider the linear equation below, where Beta 0 is the intercept and Beta 1 is the slope.

$$ y_i = \beta_0 + \beta_1 x_i, \qquad i = 1, \dots, n $$

To simplify this notation, we absorb Beta 0 into the Beta vector. This is done by adding an extra column of 1's to the X matrix and a corresponding extra entry to the Beta vector. Consequently, the matrix form is:

$$ \mathbf{y} = X\boldsymbol{\beta}, \qquad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} $$

Then the least square matrix problem is:

$$ \min_{\boldsymbol{\beta}} \, \lVert \mathbf{y} - X\boldsymbol{\beta} \rVert^2 $$

Let us consider our initial equation:

$$ X\boldsymbol{\beta} = \mathbf{y} $$

Multiplying both sides by X_transpose matrix:

$$ X^{\mathsf{T}} X \boldsymbol{\beta} = X^{\mathsf{T}} \mathbf{y} $$

Where:

$$ X^{\mathsf{T}} X = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}, \qquad X^{\mathsf{T}} \mathbf{y} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix} $$

so that, as long as $X^{\mathsf{T}} X$ is invertible, the solution is:

$$ \hat{\boldsymbol{\beta}} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} \mathbf{y} $$

Ufff, that is a lot of equations. But it becomes simple enough to follow when we work through a small case below.

For simplicity, we will start with a simple linear regression problem with 4 data points: (1, 1), (2, 3), (3, 3) and (4, 5); that is, X = [1, 2, 3, 4] and y = [1, 3, 3, 5]. When we convert this into the matrix form described above, we get:

$$ X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} 1 \\ 3 \\ 3 \\ 5 \end{bmatrix} $$

$$ X^{\mathsf{T}} X = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix}, \qquad X^{\mathsf{T}} \mathbf{y} = \begin{bmatrix} 12 \\ 36 \end{bmatrix} $$

$$ \hat{\boldsymbol{\beta}} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} \mathbf{y} = \frac{1}{20} \begin{bmatrix} 30 & -10 \\ -10 & 4 \end{bmatrix} \begin{bmatrix} 12 \\ 36 \end{bmatrix} = \begin{bmatrix} 0 \\ 1.2 \end{bmatrix} $$

Here is the numpy code to implement this simple solution:
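A minimal sketch of that solution (the full version lives in the Kaggle notebook): build X with a column of 1's, then solve the normal equations derived above.

```python
import numpy as np

# Data points (1, 1), (2, 3), (3, 3), (4, 5)
x = np.array([1, 2, 3, 4])
y = np.array([1, 3, 3, 5])

# Add a column of 1's for the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal equations: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ (X.T @ y)

print(beta)  # approximately [0, 1.2] -> intercept 0, slope 1.2
```

This matches the hand calculation above: an intercept of 0 and a slope of 1.2.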

Solving multiple linear regression is quite similar to simple linear regression; we follow these 6 steps:

  1. Add a new column at the beginning of the X matrix with all 1’s for the intercept
  2. Take the transpose of the X matrix
  3. Multiply the X transpose and X matrices
  4. Find the inverse of this matrix
  5. Multiply X transpose with the y matrix
  6. Multiply the two results to find the intercept and the coefficients
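The six steps above translate directly into numpy. The feature matrix and prices below are made-up stand-ins for the Kaggle used-car data (hypothetical age, mileage, and engine-size columns), just to show the mechanics:

```python
import numpy as np

# Hypothetical numeric features (age in years, mileage in thousands, engine size in litres)
# and prices in thousands -- stand-ins for the real used-car dataset.
features = np.array([[3.0, 15.0, 1.4],
                     [4.0, 29.0, 2.0],
                     [1.0,  9.0, 1.0],
                     [5.0, 41.0, 1.6],
                     [2.0, 12.0, 1.4]])
prices = np.array([12.5, 14.5, 16.0, 9.5, 14.0])

# Step 1: add a column of 1's at the beginning for the intercept
X = np.column_stack([np.ones(len(prices)), features])

# Steps 2-3: multiply X transpose and X
XtX = X.T @ X

# Step 4: invert that matrix
XtX_inv = np.linalg.inv(XtX)

# Step 5: multiply X transpose with y
Xty = X.T @ prices

# Step 6: multiply the two results; beta[0] is the intercept,
# beta[1:] are the coefficients
beta = XtX_inv @ Xty
```

In practice `np.linalg.lstsq` (or `np.linalg.solve` on the normal equations) is preferred over an explicit inverse for numerical stability, but the explicit steps mirror the Excel calculation.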

For solving multiple linear regression, I have taken a dataset from Kaggle which has prices of used car sales in the UK.

I have manually computed all the calculations in Excel. I took the first 300 rows of the Volkswagen dataset and kept only the numerical variables. The regression gives an R-square score of 0.77.

I urge you to download the Excel workbook and follow the calculations (the formatting for the math font on Google Sheets is not good; you can download and view it in MS Excel for better readability). In the sheet “Explanation” I have matrix-multiplied X transpose and X. This has all the information we need to calculate model parameters like the R-square value.
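For completeness, here is how the R-square value itself falls out once the coefficients are known: it is one minus the ratio of the residual sum of squares to the total sum of squares. Illustrated on the small four-point example from earlier (not the car data):

```python
import numpy as np

# Fit the four-point example via the normal equations
x = np.array([1, 2, 3, 4])
y = np.array([1, 3, 3, 5])
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.inv(X.T @ X) @ (X.T @ y)

# R-square = 1 - SS_res / SS_tot
y_hat = X @ beta
ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_square = 1 - ss_res / ss_tot

print(r_square)  # 0.9 for this tiny example
```

The same formula, applied to the car data, is what produces the 0.77 score mentioned above.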


Please refer to section 3 of the Kaggle notebook here: https://www.kaggle.com/gireeshs/diy-build-linear-regression-with-linear-algebra#Part-3:-Multiple-linear-regression where I have solved this problem using matrix multiplication.
