The Linear Algebra View of Least-Squares Regression (2024)

Linear regression is the most important statistical tool most people ever learn. However, the way it’s usually taught makes it hard to see the essence of what regression is really doing.

Most courses focus on the “calculus” view. In this view, regression starts with a large algebraic expression for the sum of the squared distances between each observed point and a hypothetical line. The expression is then minimized by taking partial derivatives with respect to each coefficient, setting them equal to zero, and doing a ton of algebra until we arrive at our regression coefficients.

Most textbooks walk students through one painful calculation of this, and thereafter rely on statistical packages like R or Stata — practically inviting students to become dependent on software and never develop deep intuition about what’s going on. That’s the way people who don’t really understand math teach regression.

In this post I’ll illustrate a more elegant view of least-squares regression — the so-called “linear algebra” view.

The Problem

The goal of regression is to fit a mathematical model to a set of observed points. Say we’re collecting data on the number of machine failures per day in some factory. Imagine we’ve got three data points, written as (day, number of failures): (1, 1), (2, 2), (3, 2).

The goal is to find a linear equation that fits these points. We believe there’s an underlying mathematical relationship that maps “days” uniquely to “number of machine failures,” a model of the form

    b = C + Dx

where b is the number of failures per day, x is the day, and C and D are the regression coefficients we’re looking for.

We can write these three data points as a simple linear system like this:

    C + D·1 = 1
    C + D·2 = 2
    C + D·3 = 2

For the first two points a perfect linear fit looks possible: when x = 1, b = 1; and when x = 2, b = 2. But things go wrong when we reach the third point: when x = 3, b is only 2, so we already know the three points don’t sit on a single line and our model will be an approximation at best.

Now that we have a linear system we’re in the world of linear algebra. That’s good news, since it helps us step back and see the big picture. Rather than hundreds of numbers and algebraic terms, we only have to deal with a few vectors and matrices.

Here’s our linear system in the matrix form Ax = b:

    [ 1  1 ]   [ C ]   [ 1 ]
    [ 1  2 ]   [ D ] = [ 2 ]
    [ 1  3 ]           [ 2 ]

What this is saying is that we hope the vector b lies in the column space of A, C(A). That is, we’re hoping there’s some linear combination of the columns of A that gives us our vector of observed b values.

Unfortunately, we already know b doesn’t fit our model perfectly. That means it’s outside the column space of A. So we can’t simply solve that equation for the vector x.
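Before we get to the picture, here’s a quick numerical check of that claim. This is a minimal NumPy sketch (an illustration, not part of the derivation): it builds A and b for our three points and confirms that appending b as an extra column raises the rank, so b can’t be written as a combination of A’s columns.

    import numpy as np

    # Design matrix: a column of ones (for C) and the day numbers (for D).
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])

    # Observed failures per day.
    b = np.array([1.0, 2.0, 2.0])

    # b lies in C(A) only if appending it as a column doesn't increase the rank.
    print(np.linalg.matrix_rank(A))                        # 2
    print(np.linalg.matrix_rank(np.column_stack([A, b])))  # 3, so no exact solution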

Let’s look at a picture of what’s going on.

In the drawing below the column space of A is marked C(A). It forms a flat plane in three-space. If we think of the columns of A as vectors a1 and a2, the plane is all possible linear combinations of a1 and a2. These are marked in the picture.

[Figure: the column space C(A) drawn as a plane in three-space, spanned by the column vectors a1 and a2 of A.]

By contrast, the vector of observed values b doesn’t lie in the plane. It sticks up in some direction, marked “b” in the drawing.

The plane C(A) is really just our hoped-for mathematical model. And the errant vector b is our observed data that unfortunately doesn’t fit the model. So what should we do?

The linear regression answer is that we should forget about finding a model that perfectly fits b, and instead swap out b for another vector that’s pretty close to it but that fits our model. Specifically, we want to pick a vector p that’s in the column space of A, but is also as close as possible to b.

The picture below illustrates the process. Think of shining a flashlight down onto b from above. This casts a shadow onto C(A). This is the projection of the vector b onto the column space of A. This projection is labeled p in the drawing.

[Figure: the observed vector b above the plane C(A), its projection p onto the plane, and the error e = b - p meeting the plane at a right angle.]

The line marked e is the “error” between our observed vector b and the projected vector p that we’re planning to use instead. The goal is to choose the vector p to make e as small as possible. That is, we want to minimize the error between the vector p used in the model and the observed vector b.

In the drawing, e is just the observed vector b minus the projection p, or b - p. And the projection itself is just a combination of the columns of A — that’s why it’s in the column space after all — so it’s equal to A times some vector x-hat.

The error is smallest when p is the orthogonal projection of b onto the plane, which means the error vector e is perpendicular to the plane (and therefore to p). In the figure, the intersection between e and p is marked with a 90-degree angle.

The geometry makes it pretty obvious what’s going on. We started with b, which doesn’t fit the model, and then switched to p, which is a pretty good approximation and has the virtue of sitting in the column space of A.
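To make the picture concrete, here’s a small numerical sketch. It reuses the A and b from above and computes the projection with an orthonormal basis for the plane (via NumPy’s QR factorization) rather than with the formula we’re about to derive, so treat it as an illustration rather than the derivation itself.

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
    b = np.array([1.0, 2.0, 2.0])

    # The columns of Q form an orthonormal basis for the plane C(A).
    Q, _ = np.linalg.qr(A)

    p = Q @ (Q.T @ b)   # projection of b onto the plane (the "shadow")
    e = b - p           # the error vector

    print(p)            # the vector in C(A) closest to b
    print(A.T @ e)      # ~[0, 0]: e is perpendicular to both columns of A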

Solving for Regression Coefficients

Since the vector e is perpendicular to the plane of A’s column space, its dot product with each column of A must be zero. Stacking those dot products into matrix form gives

    A^T e = 0

But since e = b - p, and p = A times x-hat, we get,

    A^T (b - Ax̂) = 0

Solving for x-hat, we get

    x̂ = (A^T A)^(-1) A^T b

The elements of the vector x-hat are the estimated regression coefficients C and D we’re looking for. They minimize the length of the error vector e between the model and the observed data, in an elegant way that uses no calculus or explicit algebraic sums.
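As a sanity check, here’s the same computation in NumPy for our three data points (a minimal sketch, assuming the A and b defined earlier). It solves the normal equations A^T A x̂ = A^T b directly and compares the answer with NumPy’s built-in least-squares routine.

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
    b = np.array([1.0, 2.0, 2.0])

    # Normal equations: (A^T A) x_hat = A^T b.
    x_hat = np.linalg.solve(A.T @ A, A.T @ b)
    C, D = x_hat
    print(C, D)   # C = 2/3, D = 1/2, so the fitted line is b = 2/3 + x/2

    # NumPy's least-squares solver gives the same answer.
    print(np.linalg.lstsq(A, b, rcond=None)[0])

Solving the system with np.linalg.solve avoids forming the explicit inverse of A^T A, which is the numerically safer route in practice.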

Here’s an easy way to remember how this works: doing linear regression is just trying to solve Ax = b. But A has more rows (observations) than columns (coefficients), so it isn’t invertible, and once the observed points in b deviate from the model the system has no exact solution at all. So instead we multiply both sides by the transpose of A. The matrix A^T A is square and symmetric, and as long as the columns of A are independent it’s invertible. Then we just solve for x-hat.

There are other good things about this view as well. For one, it’s a lot easier to interpret the correlation coefficient r. If our x and y data points are normalized about their means — that is, if we subtract their mean from each observed value — r is just the cosine of the angle between b and the flat plane in the drawing.

Cosine ranges from -1 to 1, just like r. If the regression is perfect, |r| = 1 and b lies in the plane: the angle between b and the plane is zero, which makes sense since cos 0 = 1. If the regression is terrible, r = 0 and b points perpendicular to the plane: the angle between them is 90 degrees, or pi/2 radians, which also makes sense since cos(pi/2) = 0.
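Here’s the same idea in numbers for our three data points (a small sketch, using the standard centered-vector form of r rather than the drawing itself): center the day and failure vectors about their means, take the cosine of the angle between them, and compare with NumPy’s correlation function.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])   # days
    y = np.array([1.0, 2.0, 2.0])   # failures

    # Center both vectors about their means.
    xc = x - x.mean()
    yc = y - y.mean()

    # r is the cosine of the angle between the centered vectors.
    r = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
    print(r)                        # ~0.866
    print(np.corrcoef(x, y)[0, 1])  # same value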


FAQs

How do you solve least squares linear regression? ›

Steps
  1. Step 1: For each (x, y) point calculate x² and xy.
  2. Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy (Σ means "sum up").
  3. Step 3: Calculate the slope m: m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²).
  4. Step 4: Calculate the intercept b: b = (Σy − m Σx) / N.
  5. Step 5: Assemble the equation of the line: y = mx + b (see the sketch below).
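Those steps translate almost line for line into code. Here’s a minimal sketch using the three data points from the article above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 2.0])
    N = len(x)

    # Steps 1-2: the sums.
    Sx, Sy, Sxx, Sxy = x.sum(), y.sum(), (x * x).sum(), (x * y).sum()

    # Step 3: slope m.
    m = (N * Sxy - Sx * Sy) / (N * Sxx - Sx ** 2)

    # Step 4: intercept b.
    b = (Sy - m * Sx) / N

    # Step 5: the line y = mx + b.
    print(m, b)   # 0.5 and 0.666..., matching D = 1/2 and C = 2/3 from earlier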

What is the least squares method used for in linear regression? ›

What is the Least Squares Regression method and why use it? Least squares is a method for fitting a linear regression. It helps us predict results based on an existing set of data, and it also helps us spot clear anomalies in that data. Anomalies are values that are too good, or bad, to be true, or that represent rare cases.

Is linear regression linear algebra? ›

Linear algebra is a branch in mathematics that deals with matrices and vectors. From linear regression to the latest-and-greatest in deep learning: they all rely on linear algebra “under the hood”. In this blog post, I explain how linear regression can be interpreted geometrically through linear algebra.

What is the system of equations in linear regression? ›

A system of linear regression equations is a model with the following characteristics: (i) The model has at least two (linear) equations. (ii) Each equation has one and only one endogenous variable, which is the equation's LHS (Left Hand Side) variable. (iii) Each equation has one or more exogenous variables.

What is the formula for the least squares regression line? ›

What is the least-squares regression line equation? The least-squares regression line equation is y = mx + b, where m is the slope, which is equal to (N Σxy − Σx Σy) / (N Σx² − (Σx)²), and b is the y-intercept, which is equal to (Σy − m Σx) / N.

What is the least squares regression line for dummies? ›

The equation for calculating the least-squares regression line is y = mx + b, and the slope of the equation is m. If two variables have a negative relationship, the slope m is guaranteed to be negative.

What is a least square solution in linear algebra? ›

A least squares solution of Ax = b is a vector x̂ that satisfies Ax̂ = b̂, where b̂ is the orthogonal projection of b onto Col(A). Equivalently, a least squares solution to Ax = b is a vector x̂ that makes Ax̂ as close to b as possible.

What is the formula for the least-squares method? ›

In one worked example, the required equation of least squares comes out to y = mx + b = (13/10)x + 5.5/5. The least-squares method is used to predict the behavior of the dependent variable with respect to the independent variable. The sum of the squared errors is called the residual sum of squares.

How to interpret a least squares regression line? ›

How to Interpret the Coefficients of the Least-Squares Regression Line Model. Step 1: Identify the independent variable and the dependent variable. Step 2: For the least-squares regression line ŷ(x) = ax + b, the value b is the y-intercept of the regression line.

Is linear algebra above calculus? ›

If you are a math major:

As an entering student, you will probably go into Calculus II, then Linear Algebra, followed by Calculus III.

What grade math is linear algebra? ›

Linear algebra is usually taken by sophomore math majors after they finish their calculus classes, but you don't need a lot of calculus in order to do it.

Is linear algebra real math? ›

Linear algebra is the branch of mathematics concerning linear equations such as a1x1 + a2x2 + ... + anxn = b and their representations through matrices and vector spaces. In three-dimensional Euclidean space, for example, three such equations describe three planes, and their intersection represents the set of common solutions: in this case, a unique point.

What does linear regression tell you? ›

Linear regression is a data analysis technique that predicts the value of unknown data by using another related and known data value. It mathematically models the unknown or dependent variable and the known or independent variable as a linear equation.

What is the basic equation for linear regression? ›

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0).

What does the regression equation tell you? ›

The regression equation representing how much y changes with any given change of x can be used to construct a regression line on a scatter diagram, and in the simplest case this is assumed to be a straight line. The direction in which the line slopes depends on whether the correlation is positive or negative.

How do you solve ordinary least squares regression? ›

The ordinary least squares formula: what is the equation of the model? The model is Y = β0 + β1X1 + ... + βpXp + e, where Y is the dependent variable, β0 is the intercept of the model, Xj corresponds to the jth explanatory variable of the model (j = 1 to p), and e is the random error with expectation 0 and variance σ².
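The same normal-equations recipe from the article above handles this multi-variable case unchanged. Below is a small sketch with made-up data (the coefficients 3.0, 1.5 and -2.0 are invented purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up data: two explanatory variables plus a little noise.
    X = rng.normal(size=(50, 2))
    y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

    # Prepend the column of ones that carries the intercept beta_0.
    A = np.column_stack([np.ones(len(X)), X])

    # Same normal equations as before: beta_hat = (A^T A)^(-1) A^T y.
    beta_hat = np.linalg.solve(A.T @ A, A.T @ y)
    print(beta_hat)   # approximately [3.0, 1.5, -2.0]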

What is the least squares estimated regression equation? ›

The least squares method is the most widely used procedure for developing estimates of the model parameters. For simple linear regression, the least squares estimates of the model parameters β0 and β1 are denoted b0 and b1. Using these estimates, an estimated regression equation is constructed: ŷ = b0 + b1x .

What is least squares simple linear regression line? ›

Least Squares Regression Line, LSRL. The calculation is based on the method of least squares. The idea behind it is to minimise the sum of the squared vertical distances between all of the data points and the line of best fit.
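To see that in numbers, here’s a tiny sketch using the article’s three data points: the least-squares line found earlier has a smaller sum of squared vertical distances than a nearby alternative line.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 2.0])

    def sse(intercept, slope):
        """Sum of squared vertical distances from the points to the line."""
        residuals = y - (intercept + slope * x)
        return float(residuals @ residuals)

    print(sse(2/3, 1/2))   # the least-squares line: SSE = 1/6
    print(sse(0.5, 0.6))   # any other line does worse (here 0.19)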
