6.5: Orthogonal Least Squares

    Suppose we collect some data when performing an experiment and plot it as shown on the left of Figure 6.5.1. Notice that there is no line on which all the points lie; in fact, it would be surprising if there were since we can expect some uncertainty in the measurements recorded. There does, however, appear to be a line, as shown on the right, on which the points almost lie.

    Figure 6.5.1. Some experimental data (left) and a line on which the data points almost lie (right).

    In this section, we'll explore how the techniques developed in this chapter enable us to find the line that best approximates the data. More specifically, we'll see how the search for a line passing through the data points leads to an inconsistent system \(A\mathbf x=\mathbf b\text{.}\) Since we are unable to find a solution \(\mathbf x\text{,}\) we instead seek the vector \(\mathbf x\) where \(A\mathbf x\) is as close as possible to \(\mathbf b\text{.}\) Orthogonal projection gives us just the right tool for doing this.

    Preview Activity 6.5.1.

    1. Is there a solution to the equation \(A\mathbf x=\mathbf b\) where \(A\) and \(\mathbf b\) are such that

      \begin{equation*} \begin{bmatrix} 1 & 2 \\ 2 & 5 \\ -1 & 0 \\ \end{bmatrix} \mathbf x = \threevec5{-3}{-1}\text{.} \end{equation*}

    2. We know that \(\threevec12{-1}\) and \(\threevec250\) form a basis for \(\col(A)\text{.}\) Find an orthogonal basis for \(\col(A)\text{.}\)
    3. Find the orthogonal projection \(\widehat\mathbf b\) of \(\mathbf b\) onto \(\col(A)\text{.}\)
    4. Explain why the equation \(A\mathbf x=\widehat\mathbf b\) must be consistent and then find its solution.
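
    If you would like to check your answers numerically, here is a minimal NumPy sketch of the same computation; it is not the Sage environment used elsewhere in this section, and the variable names are our own. It builds an orthogonal basis for \(\col(A)\) with one Gram-Schmidt step, projects \(\mathbf b\) onto \(\col(A)\text{,}\) and then solves the resulting consistent system.

      import numpy as np

      A = np.array([[1, 2], [2, 5], [-1, 0]], dtype=float)
      b = np.array([5, -3, -1], dtype=float)

      # one Gram-Schmidt step gives an orthogonal basis w1, w2 for col(A)
      w1 = A[:, 0]
      w2 = A[:, 1] - (A[:, 1] @ w1) / (w1 @ w1) * w1

      # orthogonal projection of b onto col(A)
      bhat = (b @ w1) / (w1 @ w1) * w1 + (b @ w2) / (w2 @ w2) * w2

      # A x = bhat is consistent, so lstsq finds its solution exactly
      xhat, *_ = np.linalg.lstsq(A, bhat, rcond=None)
      print(bhat, xhat)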

    A first example

    When we've encountered inconsistent systems in the past, we've simply said there is no solution and moved on. The preview activity, however, shows how we can find approximate solutions to an inconsistent system: if there are no solutions to \(A\mathbf x = \mathbf b\text{,}\) we instead solve the consistent system \(A\mathbf x = \bhat\text{,}\) the orthogonal projection of \(\mathbf b\) onto \(\col(A)\text{.}\) As we'll see, this solution is, in a specific sense, the best possible.

    Activity 6.5.2.

    Suppose we have three data points \((1,1)\text{,}\) \((2,1)\text{,}\) and \((3,3)\) and that we would like to find a line passing through them.

    1. Plot these three points in Figure 6.5.2. Are you able to draw a line that passes through all three points?

      Figure 6.5.2. A grid for plotting the three data points.

    2. Let's write the conditions that would describe a line passing through the points. Remember that the equation of a line can be written as \(b + mx=y\) where \(m\) is the slope and \(b\) is the \(y\)-intercept. We will try to find \(b\) and \(m\) so that the three points lie on the line.

      The first data point \((1,1)\) gives an equation for \(b\) and \(m\text{.}\) In particular, we know that when \(x=1\text{,}\) then \(y=1\) so we have \(b + m(1) = 1\) or \(b + m = 1\text{.}\) Use the other two data points to create a linear system describing \(m\) and \(b\text{.}\)

    3. We have obtained a linear system having three equations, one from each data point, for the two unknowns \(b\) and \(m\text{.}\) Identify a matrix \(A\) and vector \(\mathbf b\) so that the system has the form \(A\mathbf x=\mathbf b\text{,}\) where \(\mathbf x=\twovec bm\text{.}\)

      Notice that the unknown vector \(\mathbf x=\twovec bm\) describes the line that we seek.

    4. Is there a solution to this linear system? How does this question relate to your attempt to draw a line through the three points above?
    5. Since this system is inconsistent, we know that \(\mathbf b\) is not in the column space \(\col(A)\text{.}\) Find an orthogonal basis for \(\col(A)\) and use it to find the orthogonal projection \(\widehat\mathbf b\) of \(\mathbf b\) onto \(\col(A)\text{.}\)
    6. Since \(\widehat\mathbf b\) is in \(\col(A)\text{,}\) the equation \(A\mathbf x = \widehat\mathbf b\) is consistent. Find its solution \(\mathbf x = \twovec{b}{m}\) and sketch the line \(y=b + mx\) in Figure 6.5.2. We say that this is the line of best fit.

    This activity illustrates the idea behind a technique known as orthogonal least squares, which we have been working toward throughout this chapter. If the data points are denoted as \((x_i, y_i)\text{,}\) we construct the matrix \(A\) and vector \(\mathbf b\) as

    \begin{equation*} A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \end{bmatrix},\hspace{24pt} \mathbf b = \threevec{y_1}{y_2}{y_3}\text{.} \end{equation*}

    With the vector \(\mathbf x=\twovec bm\) representing the line \(b+mx = y\text{,}\) we see that the equation \(A\mathbf x=\mathbf b\) describes a line passing through all the data points. In our example, it is visually apparent that there is no such line, a fact confirmed by the inconsistency of the equation \(A\mathbf x=\mathbf b\text{.}\)

    Remember that \(\bhat\text{,}\) the orthogonal projection of \(\mathbf b\) onto \(\col(A)\text{,}\) is the closest vector in \(\col(A)\) to \(\mathbf b\text{.}\) Therefore, when we solve the equation \(A\mathbf x=\bhat\text{,}\) we are finding the vector \(\mathbf x\) so that \(A\mathbf x = \threevec{b+mx_1}{b+mx_2}{b+mx_3}\) is as close to \(\mathbf b=\threevec{y_1}{y_2}{y_3}\) as possible. Let's think about what this means within the context of this problem.

    The difference is \(\mathbf b-A\mathbf x = \threevec{y_1-(b+mx_1)}{y_2-(b+mx_2)}{y_3-(b+mx_3)}\text{,}\) so the square of the distance between \(A\mathbf x\) and \(\mathbf b\) is

    \begin{align*} \len{\mathbf b - A\mathbf x}^2 & =\\ & \left(y_1-(b+mx_1)\right)^2 + \left(y_2-(b+mx_2)\right)^2 + \left(y_3-(b+mx_3)\right)^2\text{.} \end{align*}

    Our approach finds the values for \(b\) and \(m\) that make this sum of squares as small as possible, which explains why we call this a least squares problem.

    Drawing the line defined by the vector \(\mathbf x=\twovec bm\text{,}\) the quantity \(y_i - (b + mx_i)\) reflects the vertical distance between the line and the data point \((x_i, y_i)\text{,}\) as shown in Figure 6.5.5. Seen in this way, the square of the distance \(\len{\mathbf b-A\mathbf x}^2\) is a measure of how much the line defined by the vector \(\mathbf x\) misses the data points. The solution to the least squares problem is the line that misses the data points by the smallest amount possible.

    Figure 6.5.5. The vertical distances \(y_i-(b+mx_i)\) between the line and the data points.
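
    Here is a minimal NumPy sketch of these steps for the three data points \((1,1)\text{,}\) \((2,1)\text{,}\) and \((3,3)\) from the activity; it is our own illustration rather than one of the book's Sage cells. The call to np.linalg.lstsq minimizes \(\len{\mathbf b - A\mathbf x}^2\text{,}\) which is exactly the least squares problem described above.

      import numpy as np

      xdata = np.array([1.0, 2.0, 3.0])
      ydata = np.array([1.0, 1.0, 3.0])

      A = np.column_stack([np.ones_like(xdata), xdata])   # columns: 1 and x_i
      b = ydata

      # lstsq minimizes |b - Ax|^2, i.e., it solves A x = bhat
      xhat, *_ = np.linalg.lstsq(A, b, rcond=None)
      intercept, slope = xhat

      residual = b - A @ xhat
      print(intercept, slope, residual @ residual)   # b, m, and the minimized sum of squares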

    Solving least squares problems

    Now that we've seen an example of what we're trying to accomplish, let's put this technique into a more general framework.

    Given an inconsistent system \(A\mathbf x = \mathbf b\text{,}\) we seek to find \(\mathbf x\) that minimizes the distance from \(A\mathbf x\) to \(\mathbf b\text{.}\) We find \(\mathbf x\) by forming \(\widehat\mathbf b\text{,}\) the orthogonal projection of \(\mathbf b\) onto the column space \(\col(A)\) and then solving \(A\mathbf x = \widehat\mathbf b\text{.}\) Moving forward, we will denote the solution of \(A\mathbf x = \bhat\) by \(\xhat\) and call this vector the least squares approximate solution of \(A\mathbf x=\mathbf b\) to distinguish it from a (non-existent) solution of \(A\mathbf x=\mathbf b\text{.}\)

    Remember that the orthogonal projection \(\widehat\mathbf b\) of \(\mathbf b\) onto the column space \(\col(A)\) is defined by the fact that \(\widehat\mathbf b - \mathbf b\) is orthogonal to \(\col(A)\text{.}\) In other words, \(\bhat-\mathbf b\) is in the orthogonal complement \(\col(A)^\perp\text{,}\) which Proposition 6.2.10 tells us is the same as \(\nul(A^T)\text{.}\) Since \(\bhat-\mathbf b\) is in \(\nul(A^T)\text{,}\) it follows that

    \[ A^T(\widehat\mathbf b-\mathbf b) = \zerovec\text{.} \]

    Finally, the least squares approximate solution is the vector \(\xhat\) such that \(A\xhat = \widehat\mathbf b\text{,}\) which gives

    \[\begin{align*} A^T(A\xhat - \mathbf b) & = \zerovec\\ A^TA\xhat - A^T\mathbf b & = \zerovec\\ A^TA\xhat & = A^T\mathbf b\text{.} \end{align*}\]

    Let's record our work in the following proposition.

    Proposition 6.5.6.

    The least squares approximate solution \(\widehat\mathbf x\) to the equation \(A\mathbf x = \mathbf b\) is given by the normal equations

    \[ A^TA\widehat\mathbf x = A^T\mathbf b\text{.} \]

    The linear system represented by the normal equations is consistent since \(\xhat\text{,}\) the least squares approximate solution to \(A\mathbf x=\mathbf b\text{,}\) is a solution. If we further assume that the columns of \(A\) are linearly independent, we can see that there is a unique solution. Imagine, for the moment, that \(\mathbf x\) is a solution to the homogeneous equation \(A^TA\mathbf x = \zerovec\text{.}\) We then have

    \[\begin{align*} \mathbf x\cdot(A^TA\mathbf x) & = \mathbf x\cdot\zerovec = 0\\ \mathbf x^TA^TA\mathbf x & = 0\\ (A\mathbf x)^T(A\mathbf x) & = 0\\ (A\mathbf x)\cdot(A\mathbf x) & = 0\\ \len{A\mathbf x}^2 & = 0\\ A\mathbf x & = \zerovec \text{.} \end{align*}\]

    In other words, if \(\mathbf x\) is a solution to the homogeneous equation \(A^TA\mathbf x = \zerovec\text{,}\) then we know that \(A\mathbf x = \zerovec\text{.}\) Since we are assuming that the columns of \(A\) are linearly independent, we know that the homogeneous equation \(A\mathbf x=\zerovec\) has only the zero solution \(\mathbf x = \zerovec\text{.}\) Therefore, the homogeneous equation \(A^TA\mathbf x=\zerovec\) has only the zero solution, which means that \(A^TA\) has a pivot position in every column. Hence, the normal equations \(A^TA\xhat = A^T\mathbf b\) must have a unique solution.

    Proposition 6.5.7.

    If the columns of \(A\) are linearly independent, then there is a unique least squares approximate solution \(\xhat\) to the equation \(A\mathbf x=\mathbf b\) given by the normal equations

    \[ A^TA\xhat = A^T\mathbf b\text{.} \]
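
    As a quick numerical illustration of the normal equations, here is a short NumPy sketch, again separate from the book's Sage cells, using the matrix and vector from the preview activity.

      import numpy as np

      A = np.array([[1, 2], [2, 5], [-1, 0]], dtype=float)
      b = np.array([5, -3, -1], dtype=float)

      ATA = A.T @ A      # the matrix A^T A
      ATb = A.T @ b      # the vector A^T b

      # the columns of A are linearly independent, so the solution is unique
      xhat = np.linalg.solve(ATA, ATb)
      print(xhat)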

    Let's put this proposition to use in the next activity.

    Activity 6.5.3.

    The rate at which a cricket chirps is related to the outdoor temperature, as reflected in some experimental data that we'll study in this activity. The chirp rate \(C\) is expressed in chirps per second while the temperature \(T\) is in degrees Fahrenheit. Evaluate the following cell to load in the data:

    Evaluating this cell also provides:

    • the vectors chirps and temps formed from the columns of the dataset.
    • the command onesvec(n), which creates an \(n\)-dimensional vector whose entries are all one.
    • Remember that you can form a matrix whose columns are the vectors v1 and v2 with matrix([v1, v2]).T.

    We would like to represent this relationship by a linear function

    \begin{equation*} \beta_0 + \beta_1 C = T\text{.} \end{equation*}

    1. Use the first data point \((C_1,T_1)=(20.0,88.6)\) to write an equation involving \(\beta_0\) and \(\beta_1\text{.}\)
    2. Suppose that we represent the unknowns using a vector \(\mathbf x = \twovec{\beta_0}{\beta_1}\text{.}\) Use the 15 data points to create the matrix \(A\) and vector \(\mathbf b\) so that the linear system \(A\mathbf x= \mathbf b\) describes the unknown vector \(\mathbf x\text{.}\)
    3. Write the normal equations \(A^TA\xhat = A^T\mathbf b\text{;}\) that is, find the matrix \(A^TA\) and the vector \(A^T\mathbf b\text{.}\)
    4. Solve the normal equations to find \(\xhat\text{,}\) the least squares approximate solution to the equation \(A\mathbf x=\mathbf b\text{.}\) Call your solution xhat since x has another meaning in Sage.

      What are the values of \(\beta_0\) and \(\beta_1\) that you found?

    5. If the chirp rate is 22 chirps per second, what is your prediction for the temperature?

      You can plot the data and your line, assuming you called the solution xhat, using the cell below.
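
    If you are working outside the provided Sage cells, the steps in this activity can be sketched in NumPy as a small helper like the one below; it assumes the data have been loaded into arrays, and the helper name is our own.

      import numpy as np

      def fit_line(chirps, temps):
          """Least squares fit of temps = beta0 + beta1 * chirps via the normal equations."""
          A = np.column_stack([np.ones(len(chirps)), np.asarray(chirps, dtype=float)])
          b = np.asarray(temps, dtype=float)
          return np.linalg.solve(A.T @ A, A.T @ b)   # returns [beta0, beta1]

    Calling fit_line(chirps, temps) with the loaded data returns \(\beta_0\) and \(\beta_1\text{,}\) and \(\beta_0 + \beta_1\cdot 22\) then gives the predicted temperature at 22 chirps per second.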

    This example demonstrates an approach called linear regression, in which a collection of data is modeled using a linear function found by solving a least squares problem. Once we have the linear function that best fits the data, we can make predictions about situations that we haven't encountered in the data.

    If we're going to use our function to make predictions, it's natural to ask how much confidence we have in these predictions. This is a statistical question that leads to a rich and well-developed theory, which we won't explore in much detail here. However, there is one simple measure of how well our linear function fits the data that is known as the coefficient of determination and denoted by \(R^2\text{.}\)

    We have seen that the square of the distance \(\len{\mathbf b-A\mathbf x}^2\) measures the amount by which the line fails to pass through the data points. When the line is close to the data points, we expect this number to be small. However, the size of this measure depends on the scale of the data. For instance, the two lines shown in Figure 6.5.8 seem to fit the data equally well, but \(|\mathbf b-A\xhat|^2\) is 100 times larger on the right.

    Figure 6.5.8. Two lines that appear to fit their data equally well, though \(|\mathbf b-A\xhat|^2\) is 100 times larger on the right.

    The coefficient of determination \(R^2\) is defined by normalizing \(|\mathbf b-A\xhat|^2\) so that it is independent of the scale. Recall that we described how to demean a vector in Section 6.1: given a vector \(\mathbf v\text{,}\) we obtain \(\widetilde{\mathbf v}\) by subtracting the average of the components from each component.

    Definition 6.5.9. Coefficient of determination

    The coefficient of determination is

    \begin{equation*} R^2 = 1 - \frac{|\mathbf b - A\xhat|^2} {|\widetilde{\mathbf b}|^2}, \end{equation*}

    where \(\widetilde{\mathbf b}\) is the vector obtained by demeaning \(\mathbf b\text{.}\)

    A more complete explanation of this definition relies on the concept of variance, which we explore in Exercise 6.5.6.11 and the next chapter. For the time being, it's enough to know that \(0\leq R^2 \leq 1\) and that the closer \(R^2\) is to 1, the better the line fits the data. In our original example, illustrated in Figure 6.5.8, we find that \(R^2 = 0.75\text{,}\) and in our study of cricket chirp rates, we have \(R^2=0.69\text{.}\) However, assessing the confidence we have in predictions made by solving a least squares problem can require considerable thought, and it would be naive to rely only on the value of \(R^2\text{.}\)
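
    In code, \(R^2\) is a short computation once \(\mathbf b\) and \(A\xhat\) are available. The sketch below is a NumPy analogue of the demeaning step described above; the helper name is our own.

      import numpy as np

      def r_squared(b, Axhat):
          """Coefficient of determination R^2 = 1 - |b - A xhat|^2 / |btilde|^2."""
          b = np.asarray(b, dtype=float)
          residual = b - np.asarray(Axhat, dtype=float)
          btilde = b - b.mean()                     # the demeaned vector
          return 1 - (residual @ residual) / (btilde @ btilde)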

    Using \(QR\) factorizations

    As we've seen, the least squares approximate solution \(\xhat\) to \(A\mathbf x=\mathbf b\) may be found by solving the normal equations \(A^TA\xhat = A^T\mathbf b\text{,}\) and this can be a practical strategy for some problems. However, this approach is not always sound numerically, since small rounding errors can accumulate and lead to inaccurate final results.

    As the next activity demonstrates, there is an alternate method for finding the least squares approximate solution \(\xhat\) using a \(QR\) factorization of the matrix \(A\text{,}\) and this method is preferable as it is numerically more reliable.

    Activity 6.5.4.

    1. Suppose we are interested in finding the least squares approximate solution to the equation \(A\mathbf x = \mathbf b\) and that we have the \(QR\) factorization \(A=QR\text{.}\) Explain why the least squares approximate solution is given by solving

      \begin{align*} A\xhat & = QQ^T\mathbf b \\ QR\xhat & = QQ^T\mathbf b\text{.} \end{align*}

    2. Multiply both sides of the second expression by \(Q^T\) and explain why

      \begin{equation*} R\xhat = Q^T\mathbf b. \end{equation*}

      Since \(R\) is upper triangular, this is a relatively simple equation to solve using back substitution, as we saw in Section 5.1. We will therefore write the least squares approximate solution as

      \begin{equation*} \xhat = R^{-1}Q^T\mathbf b, \end{equation*}

      and put this to use in the following context.

    3. Brozek’s formula, which is used to calculate a person's body fat index \(BFI\text{,}\) is

      \begin{equation*} BFI = 100 \left(\frac{4.57}{\rho} - 4.142\right) \end{equation*}

      where \(\rho\) denotes a person's body density in grams per cubic centimeter. Obtaining an accurate measure of \(\rho\) is difficult, however, because it requires submerging the person in water and measuring the volume of water displaced. Instead, we will gather several other body measurements, which are more easily obtained, and use them to predict \(BFI\text{.}\)

      For instance, suppose we take 10 patients and measure their weight \(w\) in pounds, height \(h\) in inches, abdomen \(a\) in centimeters, wrist circumference \(r\) in centimeters, neck circumference \(n\) in centimeters, and \(BFI\text{.}\) Evaluating the following cell loads and displays the data.

      In addition, that cell provides:
      1. vectors weight, height, abdomen, wrist, neck, and BFI formed from the columns of the dataset.
      2. the command onesvec(n), which returns an \(n\)-dimensional vector whose entries are all one.
      3. the command QR(A) that returns the \(QR\) factorization of \(A\) as Q, R = QR(A).
      4. the command demean(v), which returns the demeaned vector \(\widetilde{\mathbf v}\text{.}\)

      We would like to find the linear function

      \begin{equation*} \beta_0 + \beta_1w + \beta_2h + \beta_3a + \beta_4r + \beta_5n = BFI \end{equation*}

      that best fits the data.

      Use the first data point to write an equation for the parameters \(\beta_0,\beta_1,\ldots,\beta_5\text{.}\)

    4. Describe the linear system \(A\mathbf x = \mathbf b\) for these parameters. More specifically, describe how the matrix \(A\) and the vector \(\mathbf b\) are formed.
    5. Construct the matrix \(A\) and find its \(QR\) factorization in the cell below.
    6. Find the least squares approximate solution \(\xhat\) by solving the equation \(R\xhat = Q^T\mathbf b\text{.}\) You may want to use N(xhat) to display a decimal approximation of the vector. What are the parameters \(\beta_0,\beta_1,\ldots,\beta_5\) that best fit the data?
    7. Find the coefficient of determination \(R^2\) for your parameters. What does this imply about the quality of the fit?
    8. Suppose a person's measurements are: weight 190, height 70, abdomen 90, wrist 18, and neck 35. Estimate this person's \(BFI\text{.}\)

    To summarize, we have seen the following.

    Proposition 6.5.10.

    If the columns of \(A\) are linearly independent and we have the \(QR\) factorization \(A=QR\text{,}\) then the least squares approximate solution \(\xhat\) to the equation \(A\mathbf x=\mathbf b\) is given by

    \begin{equation*} \xhat = R^{-1}Q^T\mathbf b\text{.} \end{equation*}
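
    The proposition translates directly into code. The sketch below uses NumPy's reduced \(QR\) factorization and SciPy's triangular solver for the back substitution step; it is analogous to, but not the same as, the QR command provided in the book's cells.

      import numpy as np
      from scipy.linalg import solve_triangular

      def lstsq_qr(A, b):
          """Least squares approximate solution via A = QR: solve R xhat = Q^T b."""
          Q, R = np.linalg.qr(A)                  # reduced QR factorization
          return solve_triangular(R, Q.T @ b)     # back substitution; R is upper triangular

      # for example, the line of best fit from Activity 6.5.2
      A = np.array([[1, 1], [1, 2], [1, 3]], dtype=float)
      b = np.array([1, 1, 3], dtype=float)
      print(lstsq_qr(A, b))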

    Polynomial Regression

    In the examples we've seen so far, we have fit a linear function to a dataset. Sometimes, however, a polynomial, such as a quadratic function, may be more appropriate. It turns out that the techniques we've developed in this section are still useful as the next activity demonstrates.

    Activity 6.5.5.

    1. Suppose that we have a small dataset containing the points \((0,2)\text{,}\) \((1,1)\text{,}\) \((2,3)\text{,}\) and \((3,3)\text{,}\) such as appear when the following cell is evaluated. In addition to loading and plotting the data, evaluating that cell provides the following commands:
      • Q, R = QR(A) returns the \(QR\) factorization of \(A\text{.}\)
      • demean(v) returns the demeaned vector \(\widetilde{\mathbf v}\text{.}\)

      Let's fit a quadratic function of the form

      \begin{equation*} \beta_0 + \beta_1 x + \beta_2 x^2 = y \end{equation*}

      to this dataset.

      Write four equations, one for each data point, that describe the coefficients \(\beta_0\text{,}\) \(\beta_1\text{,}\) and \(\beta_2\text{.}\)

    2. Express these four equations as a linear system \(A\mathbf x = \mathbf b\) where \(\mathbf x = \threevec{\beta_0}{\beta_1}{\beta_2}\text{.}\)

      Find the \(QR\) factorization of \(A\) and use it to find the least squares approximate solution \(\xhat\text{.}\)

    3. Use the parameters \(\beta_0\text{,}\) \(\beta_1\text{,}\) and \(\beta_2\) that you found to write the quadratic function that fits the data. You can plot this function, along with the data, by entering your function in the appropriate place below.
    4. What is your predicted \(y\)-value when \(x=1.5\text{?}\)
    5. Find the coefficient of determination \(R^2\) for the quadratic function. What does this say about the quality of the fit?
    6. Now fit a cubic polynomial of the form

      \begin{equation*} \beta_0 + \beta_1x + \beta_2 x^2 + \beta_3x^3 = y \end{equation*}

      to this dataset.

    7. Find the coefficient of determination \(R^2\) for the cubic function. What does this say about the quality of the fit?
    8. What do you notice when you plot the cubic function along with the data? How does this reflect the value of \(R^2\) that you found?

    The matrices \(A\) that you created in the last activity when fitting a quadratic and cubic function to a dataset have a special form. In particular, if the data points are labeled \((x_i, y_i)\) and we seek a degree \(k\) polynomial, then

    \begin{equation*} A = \begin{bmatrix} 1 & x_1 & x_1^2 & \ldots & x_1^k \\ 1 & x_2 & x_2^2 & \ldots & x_2^k \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \ldots & x_m^k \\ \end{bmatrix}. \end{equation*}

    This is called a Vandermonde matrix of degree \(k\text{.}\)
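
    NumPy can build this matrix directly with np.vander(x, k + 1, increasing=True), which makes fitting a degree \(k\) polynomial a few lines of code. The helper below is our own sketch, analogous to, but not the same as, the vandermonde and QR commands in the book's cells.

      import numpy as np

      def fit_polynomial(x, y, k):
          """Fit a degree-k polynomial to the points (x_i, y_i) by least squares."""
          A = np.vander(np.asarray(x, dtype=float), k + 1, increasing=True)  # Vandermonde matrix
          Q, R = np.linalg.qr(A)
          return np.linalg.solve(R, Q.T @ np.asarray(y, dtype=float))  # beta_0, ..., beta_k

      # for instance, the quadratic fit from Activity 6.5.5
      print(fit_polynomial([0, 1, 2, 3], [2, 1, 3, 3], 2))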

    Activity 6.5.6.

    This activity explores a dataset describing Arctic sea ice that comes from Sustainability Math.

    Evaluating the cell below will plot the extent of Arctic sea ice, in millions of square kilometers, during the twelve months of 2012.

    In addition, you have access to a few special variables and commands:

    • month is the vector of month values and ice is the vector of sea ice values from the table above.
    • vandermonde(x, k) constructs the Vandermonde matrix of degree \(k\) using the points in the vector x.
    • Q, R = QR(A) provides the \(QR\) factorization of \(A\text{.}\)
    • demean(v) returns the demeaned vector \(\widetilde{\mathbf v}\text{.}\)
    1. Find the vector \(\xhat\text{,}\) the least squares approximate solution to the linear system that results from fitting a degree 5 polynomial to the data.
    2. If your result is stored in the variable xhat, you may plot the polynomial and the data together using the following cell.
    3. Find the coefficient of determination \(R^2\) for this polynomial fit.
    4. Repeat these steps to fit a degree 8 polynomial to the data, plot the polynomial with the data, and find \(R^2\text{.}\)
    5. Repeat one more time by fitting a degree 11 polynomial to the data, plotting it, and finding \(R^2\text{.}\)

      It's certainly true that higher degree polynomials fit the data better, as seen by the increasing values of \(R^2\text{,}\) but that's not always a good thing. For instance, when \(k=11\text{,}\) you may notice that the graph of the polynomial wiggles a little more than we would expect. In this case, the polynomial is trying too hard to fit the data, which usually contains some uncertainty, especially if it's obtained from measurements. The error built into the data is called noise, and its presence means that we shouldn't expect our polynomial to fit the data perfectly. When we choose a polynomial whose degree is too high, we give the noise too much weight in the model, which leads to some undesirable behavior, like the wiggles in the graph.

      Fitting the data with a polynomial whose degree is too high is called overfitting, a phenomenon that can appear in many machine learning applications. Generally speaking, we would like to choose \(k\) large enough to capture the essential features of the data but not so large that we overfit and build the noise into the model. There are ways to determine the optimal value of \(k\text{,}\) but we won't pursue that here.

    6. Choosing a reasonable value of \(k\text{,}\) estimate the extent of Arctic sea ice at month 6.5, roughly at the Summer Solstice.
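
    To see the overfitting effect numerically, a helper like the one below reports \(R^2\) for several degrees at once; it is a self-contained NumPy sketch of our own, and calling compare_degrees(month, ice, [5, 8, 11]) assumes those vectors are available as arrays.

      import numpy as np

      def compare_degrees(x, y, degrees):
          """Print R^2 for least squares polynomial fits of several degrees."""
          x = np.asarray(x, dtype=float)
          y = np.asarray(y, dtype=float)
          ytilde = y - y.mean()                       # demeaned y
          for k in degrees:
              A = np.vander(x, k + 1, increasing=True)
              Q, R = np.linalg.qr(A)
              beta = np.linalg.solve(R, Q.T @ y)      # least squares coefficients
              residual = y - A @ beta
              print(k, 1 - (residual @ residual) / (ytilde @ ytilde))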

    Summary

    This section introduced some types of least squares problems and a framework for working with them.

    • Given an inconsistent system \(A\mathbf x=\mathbf b\text{,}\) we find \(\xhat\text{,}\) the least squares approximate solution, by requiring that \(A\xhat\) be as close to \(\mathbf b\) as possible. In other words, \(A\xhat = \bhat\) where \(\bhat\) is the orthogonal projection of \(\mathbf b\) onto \(\col(A)\text{.}\)
    • One way to find \(\xhat\) is by solving the normal equations \(A^TA\xhat = A^T\mathbf b.\) This is not our preferred method since numerical problems can arise.
    • A second way to find \(\xhat\) uses a \(QR\) factorization of \(A\text{.}\) If \(A=QR\text{,}\) then \(\xhat = R^{-1}Q^T\mathbf b\) and finding \(R^{-1}\) is computationally feasible since \(R\) is upper triangular.
    • This technique may be applied widely and is useful for modeling data. We saw examples in this section where linear functions of several input variables and polynomials provided effective models for different datasets.
    • A simple measure of the quality of the fit is the coefficient of determination \(R^2\text{,}\) though some additional thought should be given to this measure in real applications.

    Exercises 6.5.6

    Evaluating the following cell loads in some commands that will be helpful in the following exercises. In particular, there are commands

    • QR(A) that returns the \(QR\) factorization of A as Q, R = QR(A),
    • onesvec(n) that returns the \(n\)-dimensional vector whose entries are all 1,
    • demean(v) that demeans the vector v,
    • vandermonde(x, k) that returns the Vandermonde matrix of degree \(k\) formed from the components of the vector x, and
    • plot_model(xhat, data) that plots the data and the model xhat.
    1

    Suppose we write the linear system

    \begin{equation*} \begin{bmatrix} 1 & -1 \\ 2 & -1 \\ -1 & 3 \end{bmatrix} \mathbf x = \threevec{-8}5{-10} \end{equation*}

    as \(A\mathbf x=\mathbf b\text{.}\)

    1. Find an orthogonal basis for \(\col(A)\text{.}\)
    2. Find \(\bhat\text{,}\) the orthogonal projection of \(\mathbf b\) onto \(\col(A)\text{.}\)
    3. Find a solution to the linear system \(A\mathbf x = \bhat\text{.}\)
    2

    Consider the data in Table 6.5.11.

    \(x\) \(y\)
    1 1
    2 1
    3 1
    4 2
    1. Set up the linear system \(A\mathbf x=\mathbf b\) that describes the line \(b + mx = y\) passing through these points.
    2. Write the normal equations that describe the least squares approximate solution to \(A\mathbf x=\mathbf b\text{.}\)
    3. Find the least squares approximate solution \(\xhat\) and plot the data and the resulting line.
    4. What is your predicted \(y\)-value when \(x=3.5\text{?}\)
    5. Find the coefficient of determination \(R^2\text{.}\)
    3

    Consider the four points in Table 6.5.11.

    1. Set up a linear system \(A\mathbf x = \mathbf b\) that describes a quadratic function

      \begin{equation*} \beta_0+\beta_1x+\beta_2x^2 = y \end{equation*}

      passing through the points.

    2. Use a \(QR\) factorization to find the least squares approximate solution \(\xhat\) and plot the data and the graph of the resulting quadratic function.
    3. What is your predicted \(y\)-value when \(x=3.5\text{?}\)
    4. Find the coefficient of determination \(R^2\text{.}\)
    4

    Consider the data in Table 6.5.12.

    \(x_1\) \(x_2\) \(y\)
    1 1 4.2
    1 2 3.3
    2 1 5.9
    2 2 5.1
    3 2 7.5
    3 3 6.3
    1. Set up a linear system \(A\mathbf x = \mathbf b\) that describes the relationship

      \begin{equation*} \beta_0 + \beta_1 x_1 + \beta_2 x_2 = y. \end{equation*}

    2. Find the least squares approximate solution \(\xhat\text{.}\)
    3. What is your predicted \(y\)-value when \(x_1 = 2.4\) and \(x_2=2.9\text{?}\)
    4. Find the coefficient of determination \(R^2\text{.}\)
    5

    Determine whether the following statements are true or false and explain your thinking.

    1. If \(A\mathbf x=\mathbf b\) is consistent, then \(\xhat\) is a solution to \(A\mathbf x=\mathbf b\text{.}\)
    2. If \(R^2=1\text{,}\) then the least squares approximate solution \(\xhat\) is also a solution to the original equation \(A\mathbf x=\mathbf b\text{.}\)
    3. Given the \(QR\) factorization \(A=QR\text{,}\) we have \(A\xhat=Q^TQ\mathbf b\text{.}\)
    4. A \(QR\) factorization provides a method for finding the least squares approximate solution to \(A\mathbf x=\mathbf b\) that is more reliable than solving the normal equations.
    5. A solution to \(AA^T\mathbf x = A\mathbf b\) is the least squares approximate solution to \(A\mathbf x = \mathbf b\text{.}\)
    6

    Explain your response to the following questions.

    1. If \(\xhat=\zerovec\text{,}\) what does this say about the vector \(\mathbf b\text{?}\)
    2. If the columns of \(A\) are orthonormal, how can you easily find the least squares approximate solution to \(A\mathbf x=\mathbf b\text{?}\)
    7

    The following cell loads in some data showing the number of people in Bangladesh living without electricity over 27 years. It also defines vectors year, which records the years in the data set, and people, which records the number of people.

    1. Suppose we want to write

      \begin{equation*} N = \beta_0 + \beta_1 t \end{equation*}

      where \(t\) is the year and \(N\) is the number of people. Construct the matrix \(A\) and vector \(\mathbf b\) so that the linear system \(A\mathbf x=\mathbf b\) describes the vector \(\mathbf x=\twovec{\beta_0}{\beta_1}\text{.}\)

    2. Using a \(QR\) factorization of \(A\text{,}\) find the values of \(\beta_0\) and \(\beta_1\) in the least squares approximate solution \(\xhat\text{.}\)
    3. What is the coefficient of determination \(R^2\) and what does this tell us about the quality of the approximation?
    4. What is your prediction for the number of people living without electricity in 1985?
    5. Estimate the year in which there will be no people living without electricity.
    8

    This problem concerns a data set describing planets in our Solar system. For each planet, we have the length \(L\) of the semi-major axis, essentially the distance from the planet to the Sun in AU (astronomical units), and the period \(P\text{,}\) the length of time in years required to complete one orbit around the Sun.

    We would like to model this data using the function \(P = CL^r\) where \(C\) and \(r\) are parameters we need to determine. Since this isn't a linear function, we will transform this relationship by taking the natural logarithm of both sides to obtain

    \begin{equation*} \ln(P) = \ln(C) + r\ln(L). \end{equation*}

    Evaluating the following cell loads the data set and defines two vectors logaxis, whose components are \(\ln(L)\text{,}\) and logperiod, whose components are \(\ln(P)\text{.}\)

    1. Construct the matrix \(A\) and vector \(\mathbf b\) so that the solution to \(A\mathbf x=\mathbf b\) is the vector \(\mathbf x=\twovec{\ln(C)}r\text{.}\)
    2. Find the least squares approximate solution \(\xhat\text{.}\) What does this give for the values of \(C\) and \(r\text{?}\)
    3. Find the coefficient of determination \(R^2\text{.}\) What does this tell us about the quality of the approximation?
    4. Suppose that the orbit of an asteroid has a semi-major axis whose length is \(L=4.0\) AU. Estimate the period \(P\) of the asteroid's orbit.
    5. Halley's Comet has a period of \(P=75\) years. Estimate the length of its semi-major axis.
    9

    Evaluating the following cell loads a data set describing the temperature in the Earth's atmosphere at various altitudes. There are also two vectors altitude, expressed in kilometers, and temperature, in degrees Celsius.

    1. Describe how to form the matrix \(A\) and vector \(\mathbf b\) so that the linear system \(A\mathbf x=\mathbf b\) describes a degree \(k\) polynomial fitting the data.
    2. Choose a value of \(k\text{,}\) construct the matrix \(A\) and vector \(\mathbf b\text{,}\) and find the least squares approximate solution \(\xhat\text{.}\)
    3. Plot the polynomial and data using plot_model(xhat, data).
    4. Now examine what happens as you vary the degree of the polynomial \(k\text{.}\) Choose an appropriate value of \(k\) that seems to capture the most important features of the data while avoiding overfitting, and explain your choice.
    5. Use your value of \(k\) to estimate the temperature at an altitude of 55 kilometers.
    10

    The following cell loads some data describing 1057 houses in a particular real estate market. For each house, we record the living area in square feet, the lot size in acres, the age in years, and the price in dollars. The cell also defines variables area, size, age, and price.

    We will use linear regression to predict the price of a house given its living area, lot size, and age:

    \begin{equation*} \beta_0 + \beta_1~\text{Living Area} + \beta_2~\text{Lot Size} + \beta_3~\text{Age} = \text{Price}. \end{equation*}

    1. Use a \(QR\) factorization to find the least squares approximate solution \(\xhat\text{.}\)
    2. Discuss the significance of the signs of \(\beta_1\text{,}\) \(\beta_2\text{,}\) and \(\beta_3\text{.}\)
    3. If two houses are identical except for differing in age by one year, how would you predict their prices compare to one another?
    4. Find the coefficient of determination \(R^2\text{.}\) What does this say about the quality of the fit?
    5. Predict the price of a house whose living area is 2000 square feet, lot size is 1.5 acres, and age is 50 years.
    11

    This problem is about the meaning of the coefficient of determination \(R^2\) and its connection to variance, a topic that appears in the next section. Throughout this problem, we consider the linear system \(A\mathbf x=\mathbf b\) and the least squares approximate solution \(\xhat\text{,}\) where \(A\xhat=\bhat\text{.}\) We suppose that \(A\) is an \(m\times n\) matrix and write \(\onevec = \fourvec11{\vdots}1\) for the \(m\)-dimensional vector whose entries are all 1.

    1. Explain why \(\bbar\text{,}\) the mean of the components of \(\mathbf b\text{,}\) can be found as the dot product

      \begin{equation*} \bbar = \frac 1m \mathbf b\cdot\onevec. \end{equation*}

    2. In the examples we have seen in this section, explain why \(\onevec\) is in \(\col(A)\text{.}\)
    3. If we write \(\mathbf b = \bhat + \mathbf b^\perp\text{,}\) explain why

      \begin{equation*} \mathbf b^\perp\cdot\onevec = 0 \end{equation*}

      and hence why the mean of the components of \(\mathbf b^\perp\) is zero.

    4. The variance of an \(m\)-dimensional vector \(\mathbf v\) is \(\var(\mathbf v) = \frac1m \len{\widetilde{\mathbf v}}^2\text{,}\) where \(\widetilde{\mathbf v}\) is the vector obtained by demeaning \(\mathbf v\text{.}\)

      Explain why

      \begin{equation*} \var(\mathbf b) = \var(\bhat) + \var(\mathbf b^\perp). \end{equation*}

    5. Explain why

      \begin{equation*} \frac{\len{\mathbf b - A\xhat}^2}{\len{\widetilde{\mathbf b}}^2} = \frac{\var(\mathbf b^\perp)}{\var(\mathbf b)} \end{equation*}

      and hence

      \begin{equation*} R^2 = \frac{\var(\bhat)}{\var(\mathbf b)} = \frac{\var(A\xhat)}{\var(\mathbf b)}. \end{equation*}

      These expressions indicate why it is sometimes said that \(R^2\) measures the “fraction of variance explained” by the function we are using to fit the data. As seen in the previous exercise, there may be other features that are not recorded in the dataset that influence the quantity we wish to predict.

    6. Explain why \(0\leq R^2 \leq 1\text{.}\)
