- Last updated

- Save as PDF

- Page ID
- 82506

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}}}\)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{#1}}} \)

\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)

( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\id}{\mathrm{id}}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\kernel}{\mathrm{null}\,}\)

\( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\)

\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\)

\( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)

\( \newcommand{\vectorA}[1]{\vec{#1}} % arrow\)

\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow\)

\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}}}\)

\( \newcommand{\vectorC}[1]{\textbf{#1}}\)

\( \newcommand{\vectorD}[1]{\overrightarrow{#1}}\)

\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}}\)

\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}}}\)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{#1}}} \)

Suppose we collect some data when performing an experiment and plot it as shown on the left of Figure 6.5.1. Notice that there is no line on which all the points lie; in fact, it would be surprising if there were since we can expect some uncertainty in the measurements recorded. There does, however, appear to a line, as shown on the right, on which the points *almost* lie.

In this section, we'll explore how the techniques developed in this chapter enable us to find the line that best approximates the data. More specifically, we'll see how the search for a line passing through the data points leads to an inconsistent system \(A\mathbf x=\mathbf b\text{.}\) Since we are unable to find a solution \(\mathbf x\text{,}\) we instead seek the vector \(\mathbf x\) where \(A\mathbf x\) is as close as possible to \(\mathbf b\text{.}\) Orthogonal projection give us just the right tool for doing this.

### Preview Activity 6.5.1.

- Is there a solution to the equation \(A\mathbf x=\mathbf b\) where \(A\) and \(\mathbf b\) are such that
\begin{equation*} \begin{bmatrix} 1 & 2 \\ 2 & 5 \\ -1 & 0 \\ \end{bmatrix} \mathbf x = \threevec5{-3}{-1}\text{.} \end{equation*}

- We know that \(\threevec12{-1}\) and \(\threevec250\) form a basis for \(\col(A)\text{.}\) Find an orthogonal basis for \(\col(A)\text{.}\)
- Find the orthogonal projection \(\widehat\mathbf b\) of \(\mathbf b\) onto \(\col(A)\text{.}\)
- Explain why the equation \(A\mathbf x=\widehat\mathbf b\) must be consistent and then find its solution.

## A first example

When we've encountered inconsistent systems in the past, we've simply said there is no solution and moved on. The preview activity, however, shows how we can find approximate solutions to an inconsistent system: if there are no solutions to \(A\mathbf x = \mathbf b\text{,}\) we instead solve the consistent system \(A\mathbf x = \bhat\text{,}\) the orthogonal projection of \(\mathbf b\) onto \(\col(A)\text{.}\) As we'll see, this solution is, in a specific sense, the best possible.

### Activity 6.5.2.

Suppose we have three data points \((1,1)\text{,}\) \((2,1)\text{,}\) and \((3,3)\) and that we would like to find a line passing through them.

- Plot these three points in Figure 6.5.2. Are you able to draw a line that passes through all three points?

- Let's write the conditions that would describe a line passing through the points. Remember that the equation of a line can be written as \(b + mx=y\) where \(m\) is the slope and \(b\) is the \(y\)-intercept. We will try to find \(b\) and \(m\) so that the three points lie on the line.
The first data point \((1,1)\) gives an equation for \(b\) and \(m\text{.}\) In particular, we know that when \(x=1\text{,}\) then \(y=1\) so we have \(b + m(1) = 1\) or \(b + m = 1\text{.}\) Use the other two data points to create a linear system describing \(m\) and \(b\text{.}\)

- We have obtained a linear system having three equations, one from each data point, for the two unknowns \(b\) and \(m\text{.}\) Identify a matrix \(A\) and vector \(\mathbf b\) so that the system has the form \(A\mathbf x=\mathbf b\text{,}\) where \(\mathbf x=\twovec bm\text{.}\)
Notice that the unknown vector \(\mathbf x=\twovec bm\) describes the line that we seek.

- Is there a solution to this linear system? How does this question relate to your attempt to draw a line through the three points above?
- Since this system is inconsistent, we know that \(\mathbf b\) is not in the column space \(\col(A)\text{.}\) Find an orthogonal basis for \(\col(A)\) and use it to find the orthogonal projection \(\widehat\mathbf b\) of \(\mathbf b\) onto \(\col(A)\text{.}\)
- Since \(\widehat\mathbf b\) is in \(\col(A)\text{,}\) the equation \(A\mathbf x = \widehat\mathbf b\) is consistent. Find its solution \(\mathbf x = \twovec{b}{m}\) and sketch the line \(y=b + mx\) in Figure 6.5.2. We say that this is the line of best fit.

This activity illustrates the idea behind a technique known as *orthogonal least squares*, which we have been working toward throughout this chapter. If the data points are denoted as \((x_i, y_i)\text{,}\) we construct the matrix \(A\) and vector \(\mathbf b\) as

\begin{equation*} A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \end{bmatrix},\hspace{24pt} \mathbf b = \threevec{y_1}{y_2}{y_3}\text{.} \end{equation*}

With the vector \(\mathbf x=\twovec bm\) representing the line \(b+mx = y\text{,}\) we see that the equation \(A\mathbf x=\mathbf b\) describes a line passing through all the data points. In our example, it is visually apparent that there is no such line, a fact confirmed by the inconsistency of the equation \(A\mathbf x=\mathbf b\text{.}\)

Remember that \(\bhat\text{,}\) the orthogonal projection of \(\mathbf b\) onto \(\col(A)\text{,}\) is the closest vector in \(\col(A)\) to \(\mathbf b\text{.}\) Therefore, when we solve the equation \(A\mathbf x=\bhat\text{,}\) we are finding the vector \(\mathbf x\) so that \(A\mathbf x = \threevec{b+mx_1}{b+mx_2}{b+mx_3}\) is as close to \(\mathbf b=\threevec{y_1}{y_2}{y_3}\) as possible. Let's think about what this means within the context of this problem.

The difference \(\mathbf b-A\mathbf x = \threevec{y_1-(b+mx_1)}{y_2-(b+mx_2)}{y_3-(b+mx_3)}\) so that the square of the distance between \(A\mathbf x\) and \(\mathbf b\) is

\begin{align*} \len{\mathbf b - A\mathbf x}^2 & =\\ & \left(y_1-(b+mx_1)\right)^2 + \left(y_2-(b+mx_2)\right)^2 + \left(y_3-(b+mx_3)\right)^2\text{.} \end{align*}

Our approach finds the values for \(b\) and \(m\) that make this sum of squares as small as possible, which explains why we call this a *least squares* problem.

Drawing the line defined by the vector \(\mathbf x=\twovec bm\text{,}\) the quantity \(y_i - (b + mx_i)\) reflects the vertical distance between the line and the data point \((x_i, y_i)\text{,}\) as shown in Figure 6.5.5. Seen in this way, the square of the distance \(\len{\mathbf b-A\mathbf x}^2\) is a measure of how much the line defined by the vector \(\mathbf x\) misses the data points. The solution to the least squares problem is the line that misses the data points by the smallest amount possible.

## Solving least squares problems

Now that we've seen an example of what we're trying to accomplish, let's put this technique into a more general framework.

Given an inconsistent system \(A\mathbf x = \mathbf b\text{,}\) we seek to find \(\mathbf x\) that minimizes the distance from \(A\mathbf x\) to \(\mathbf b\text{.}\) We find \(\mathbf x\) by forming \(\widehat\mathbf b\text{,}\) the orthogonal projection of \(\mathbf b\) onto the column space \(\col(A)\) and then solving \(A\mathbf x = \widehat\mathbf b\text{.}\) Moving forward, we will denote the solution of \(A\mathbf x = \bhat\) by \(\xhat\) and call this vector the *least squares approximate solution* of \(A\mathbf x=\mathbf b\) to distinguish it from a (non-existent) solution of \(A\mathbf x=\mathbf b\text{.}\)

Remember that the orthogonal projection \(\widehat\mathbf b\) of \(\mathbf b\) onto the column space \(\col(A)\) is defined by the fact that \(\widehat\mathbf b - \mathbf b\) is orthogonal to \(\col(A)\text{.}\) In other words, \(\bhat-\mathbf b\) is in the orthogonal complement \(\col(A)^\perp\text{,}\) which Proposition 6.2.10 tells us is the same as \(\nul(A^T)\text{.}\) Since \(\bhat-\mathbf b\) is in \(\nul(A^T)\text{,}\) it follows that

\[ A^T(\widehat\mathbf b-\mathbf b) = \zerovec\text{.} \]

Finally, the least squares approximate solution is the vector \(\xhat\) such that \(A\xhat = \widehat\mathbf b\text{,}\) which gives

\[\begin{align*} A^T(A\xhat - \mathbf b) & = \zerovec\\ A^TA\xhat - A^T\mathbf b & = \zerovec\\ A^TA\xhat & = A^T\mathbf b\text{.} \end{align*}\]

Let's record our work in the following proposition.

##### Proposition 6.5.6.

The least squares approximate solution \(\widehat\mathbf x\) to the equation \(A\mathbf x = \mathbf b\) is given by the *normal equations*

\[ A^TA\widehat\mathbf x = A^T\mathbf b\text{.} \]

The linear system represented by the normal equations is consistent since \(\xhat\text{,}\) the least squares approximate solution to \(A\mathbf x=\mathbf b\text{,}\) is a solution. If we further assume that the columns of \(A\) are linearly independent, we can see that there is a unique solution. Imagine, for the moment, that \(\mathbf x\) is a solution to the hom*ogeneous equation \(A^TA\mathbf x = \zerovec\text{.}\) We then have

\[\begin{align*} \mathbf x\cdot(A^TA\mathbf x) & = \mathbf x\cdot\zerovec = 0\\ \mathbf x^TA^TA\mathbf x & = 0\\ (A\mathbf x)^T(A\mathbf x) & = 0\\ (A\mathbf x)\cdot(A\mathbf x) & = 0\\ \len{A\mathbf x}^2 & = 0\\ A\mathbf x & = \zerovec \text{.} \end{align*}\]

In other words, if \(\mathbf x\) is a solution to the hom*ogeneous equation \(A^TA\mathbf x = \zerovec\text{,}\) then we know that \(A\mathbf x = \zerovec\text{.}\) Since we are assuming that the columns of \(A\) are linearly independent, we know that the hom*ogeneous equation \(A\mathbf x=\zerovec\) has only the zero solution \(\mathbf x = \zerovec\text{.}\) Therefore, the hom*ogeneous equation \(A^TA\mathbf x=\zerovec\) has only the zero solution, which means that \(A^TA\) has a pivot position in every column. Hence, the normal equations \(A^TA\xhat = A^T\mathbf b\) must have a unique solution.

##### Proposition 6.5.7.

If the columns of \(A\) are linearly independent, then there is a unique least squares approximate solution \(\xhat\) to the equation \(A\mathbf x=\mathbf b\) given by the normal equations

\[ A^TA\xhat = A^T\mathbf b\text{.} \]

Let's put this proposition to use in the next activity.

### Activity 6.5.3.

The rate at which a cricket chirps is related to the outdoor temperature, as reflected in some experimental data that we'll study in this activity. The chirp rate \(C\) is expressed in chirps per second while the temperature \(T\) is in degrees Fahrenheit. Evaluate the following cell to load in the data:

Evaluating this cell also provides:

- the vectors
`chirps`

and`temps`

formed from the columns of the dataset. - the command
`onesvec(n)`

, which creates an \(n\)-dimensional vector whose entries are all one. - Remember that you can form a matrix whose columns are the vectors
`v1`

and`v2`

with`matrix([v1, v2]).T`

.

We would like to represent this relationship by a linear function

\begin{equation*} \beta_0 + \beta_1 C = T\text{.} \end{equation*}

- Use the first data point \((C_1,T_1)=(20.0,88.6)\) to write an equation involving \(\beta_0\) and \(\beta_1\text{.}\)
- Suppose that we represent the unknowns using a vector \(\mathbf x = \twovec{\beta_0}{\beta_1}\text{.}\) Use the 15 data points to create the matrix \(A\) and vector \(\mathbf b\) so that the linear system \(A\mathbf x= \mathbf b\) describes the unknown vector \(\mathbf x\text{.}\)
- Write the normal equations \(A^TA\xhat = A^T\mathbf b\text{;}\) that is, find the matrix \(A^TA\) and the vector \(A^T\mathbf b\text{.}\)
- Solve the normal equations to find \(\xhat\text{,}\) the least squares approximate solution to the equation \(A\mathbf x=\mathbf b\text{.}\) Call your solution
`xhat`

since`x`

has another meaning in Sage.What are the values of \(\beta_0\) and \(\beta_1\) that you found?

- If the chirp rate is 22 chirps per second, what is your prediction for the temperature?
You can plot the data and your line, assuming you called the solution

`xhat`

, using the cell below.

This example demonstrates an approach called *linear regression*, in which a collection of data is modeled using a linear function found by solving a least squares problem. Once we have the linear function that best fits the data, we can make predictions about situations that we haven't encountered in the data.

If we're going to use our function to make predictions, it's natural to ask how much confidence we have in these predictions. This is a statistical question that leads to a rich and well-developed theory, which we won't explore in much detail here. However, there is one simple measure of how well our linear function fits the data that is known as the coefficient of determination and denoted by \(R^2\text{.}\)

We have seen that the square of the distance \(\len{\mathbf b-A\mathbf x}^2\) measures the amount by which the line fails to pass through the data points. When the line is close to the data points, we expect this number to be small. However, the size of this measure depends on the scale of the data. For instance, the two lines shown in Figure 6.5.8 seem to fit the data equally well, but \(|\mathbf b-A\xhat|^2\) is 100 times larger on the right.

The coefficient of determination \(R^2\) is defined by normalizing \(|\mathbf b-A\xhat|^2\) so that it is independent of the scale. Recall that we described how to demean a vector in Section 6.1: given a vector \(\mathbf v\text{,}\) we obtain \(\widetilde{\mathbf v}\) by subtracting the average of the components from each component.

##### Definition 6.5.9. Coefficient of determination

The coefficient of determination is

\begin{equation*} R^2 = 1 - \frac{|\mathbf b - A\xhat|^2} {|\widetilde{\mathbf b}|^2}, \end{equation*}

where \(\widetilde{\mathbf b}\) is the vector obtained by demeaning \(\mathbf b\text{.}\)

A more complete explanation of this definition relies on the concept of variance, which we explore in Exercise 6.5.6.11 and the next chapter. For the time being, it's enough to know that \(0\leq R^2 \leq 1\) and that the closer \(R^2\) is to 1, the better the line fits the data. In our original example, illustrated in Figure 6.5.8, we find that \(R^2 = 0.75\text{,}\) and in our study of cricket chirp rates, we have \(R^2=0.69\text{.}\) However, assessing the confidence we have in predictions made by solving a least squares problem can require considerable thought, and it would be naive to rely only on the value of \(R^2\text{.}\)

## Using \(QR\) factorizations

As we've seen, the least squares approximate solution \(\xhat\) to \(A\mathbf x=\mathbf b\) may be found by solving the normal equations \(A^TA\xhat = A^T\mathbf b\text{,}\) and this can be a practical strategy for some problems. However, this approach is not generally sound as small rounding errors can accumulate and lead to inaccurate final results.

As the next activity demonstrates, there is an alternate method for finding the least squares approximate solution \(\xhat\) using a \(QR\) factorization of the matrix \(A\text{,}\) and this method is preferable as it is numerically more reliable.

### Activity 6.5.4.

- Suppose we are interested in finding the least squares approximate solution to the equation \(A\mathbf x = \mathbf b\) and that we have the \(QR\) factorization \(A=QR\text{.}\) Explain why the least squares approximation solution is given by solving
\begin{align*} A\xhat & = QQ^T\mathbf b \\\\ QR\xhat & = QQ^T\mathbf b \\ \end{align*}

- Multiply both sides of the second expression by \(Q^T\) and explain why
\begin{equation*} R\xhat = Q^T\mathbf b. \end{equation*}

Since \(R\) is upper triangular, this is a relatively simple equation to solve using back substitution, as we saw in Section 5.1. We will therefore write the least squares approximate solution as

\begin{equation*} \xhat = R^{-1}Q^T\mathbf b, \end{equation*}

and put this to use in the following context.

- Brozak’s formula, which is used to calculate a person's body fat index \(BFI\text{,}\) is
\begin{equation*} BFI = 100 \left(\frac{4.57}{\rho} - 4.142\right) \end{equation*}

where \(\rho\) denotes a person's body density in grams per cubic centimeter. Obtaining an accurate measure of \(\rho\) is difficult, however, because it requires submerging the person in water and measuring the volume of water displaced. Instead, we will gather several other body measurements, which are more easily obtained, and use it to predict \(BFI\text{.}\)

For instance, suppose we take 10 patients and measure their weight \(w\) in pounds, height \(h\) in inches, abdomen \(a\) in centimeters, wrist circumference \(r\) in centimeters, neck circumference \(n\) in centimeters, and \(BFI\text{.}\) Evaluating the following cell loads and displays the data.

In addition, that cell provides:- vectors
`weight`

,`height`

,`abdomen`

,`wrist`

,`neck`

, and`BFI`

formed from the columns of the dataset. - the command
`onesvec(n)`

, which returns an \(n\)-dimensional vector whose entries are all one. - the command
`QR(A)`

that returns the \(QR\) factorization of \(A\) as`Q, R = QR(A)`

. - the command
`demean(v)`

, which returns the demeaned vector \(\widetilde{\mathbf v}\text{.}\)

We would like to find the linear function

\begin{equation*} \beta_0 + \beta_1w + \beta_2h + \beta_3a + \beta_4r + \beta_5n = BFI \end{equation*}

that best fits the data.

Use the first data point to write an equation for the parameters \(\beta_0,\beta_1,\ldots,\beta_5\text{.}\)

- vectors
- Describe the linear system \(A\mathbf x = \mathbf b\) for these parameters. More specifically, describe how the matrix \(A\) and the vector \(\mathbf b\) are formed.
- Construct the matrix \(A\) and find its \(QR\) factorization in the cell below.
- Find the least squares approximate solution \(\xhat\) by solving the equation \(R\xhat = Q^T\mathbf b\text{.}\) You may want to use
`N(xhat)`

to display a decimal approximation of the vector. What are the parameters \(\beta_0,\beta_1,\ldots,\beta_5\) that best fit the data? - Find the coefficient of determination \(R^2\) for your parameters. What does this imply about the quality of the fit?
- Suppose a person's measurements are: weight 190, height 70, abdomen 90, wrist 18, and neck 35. Estimate this person's \(BFI\text{.}\)

To summarize, we have seen that

##### Proposition 6.5.10.

If the columns of \(A\) are linearly independent and we have the \(QR\) factorization \(A=QR\text{,}\) then the least squares approximate solution \(\xhat\) to the equation \(A\mathbf x=\mathbf b\) is given by

\begin{equation*} \xhat = R^{-1}Q^T\mathbf b\text{.} \end{equation*}

## Polynomial Regression

In the examples we've seen so far, we have fit a linear function to a dataset. Sometimes, however, a polynomial, such as a quadratic function, may be more appropriate. It turns out that the techniques we've developed in this section are still useful as the next activity demonstrates.

### Activity 6.5.5.

- Suppose that we have a small dataset containing the points \((0,2)\text{,}\) \((1,1)\text{,}\) \((2,3)\text{,}\) and \((3,3)\text{,}\) such as appear when the following cell is evaluated. In addition to loading and plotting the data, evaluating that cell provides the following commands:
`Q, R = QR(A)`

returns the \(QR\) factorization of \(A\text{.}\)`demean(v)`

returns the demeaned vector \(\widetilde{\mathbf v}\text{.}\)

Let's fit a quadratic function of the form

\begin{equation*} \beta_0 + \beta_1 x + \beta_2 x^2 = y \end{equation*}

to this dataset.

Write four equations, one for each data point, that describe the coefficients \(\beta_0\text{,}\) \(\beta_1\text{,}\) and \(\beta_2\text{.}\)

- Express these four equations as a linear system \(A\mathbf x = \mathbf b\) where \(\mathbf x = \threevec{\beta_0}{\beta_1}{\beta_2}\text{.}\)
Find the \(QR\) factorization of \(A\) and use it to find the least squares approximate solution \(\xhat\text{.}\)

- Use the parameters \(\beta_0\text{,}\) \(\beta_1\text{,}\) and \(\beta_2\) that you found to write the quadratic function that fits the data. You can plot this function, along with the data, by entering your function in the appropriate place below.
- What is your predicted \(y\) value when \(x=1.5\text{.}\)
- Find the coefficient of determination \(R^2\) for the quadratic function? What does this say about the quality of the fit?
- Now fit a cubic polynomial of the form
\begin{equation*} \beta_0 + \beta_1x + \beta_2 x^2 + \beta_3x^3 = y \end{equation*}

to this dataset.

- Find the coefficient of determination \(R^2\) for the cubic function. What does this say about the quality of the fit?
- What do you notice when you plot the cubic function along with the data? How does this reflect the value of \(R^2\) that you found?

The matrices \(A\) that you created in the last activity when fitting a quadratic and cubic function to a dataset have a special form. In particular, if the data points are labeled \((x_i, y_i)\) and we seek a degree \(k\) polynomial, then

\begin{equation*} A = \begin{bmatrix} 1 & x_1 & x_1^2 & \ldots & x_1^k \\ 1 & x_2 & x_2^2 & \ldots & x_2^k \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \ldots & x_m^k \\ \end{bmatrix}. \end{equation*}

This is called a *Vandermonde* matrix of degree \(k\text{.}\)

### Activity 6.5.6.

This activity explores a dataset describing Arctic sea ice and that comes from Sustainability Math.

Evaluating the cell below will plot the extent of Arctic sea ice, in millions of square kilometers, during the twelve months of 2012.

In addition, you have access to a few special variables and commands:

`month`

is the vector of month values and`ice`

is the vector of sea ice values from the table above.`vandermonde(x, k)`

constructs the Vandermonde matrix of degree \(k\) using the points in the vector`x`

.`Q, R = QR(A)`

provides the \(QR\) factorization of \(A\text{.}\)`demean(v)`

returns the demeaned vector \(\widetilde{\mathbf v}\text{.}\)

- Find the vector \(\xhat\text{,}\) the least squares approximate solution to the linear system that results from fitting a degree 5 polynomial to the data.
- If your result is stored in the variable
`xhat`

, you may plot the polynomial and the data together using the following cell. - Find the coefficient of determination \(R^2\) for this polynomial fit.
- Repeat these steps to fit a degree 8 polynomial to the data, plot the polynomial with the data, and find \(R^2\text{.}\)
- Repeat one more time by fitting a degree 11 polynomial to the data, plotting it, and finding \(R^2\text{.}\)
It's certainly true that higher degree polynomials fit the data better, as seen by the increasing values of \(R^2\text{,}\) but that's not always a good thing. For instance, when \(k=11\text{,}\) you may notice that the graph of the polynomial wiggles a little more than we would expect. In this case, the polynomial is trying too hard to fit the data, which usually contains some uncertainty, especially if it's obtained from measurements. The error built in to the data is called

*noise,*and its presence means that we shouldn't expect our polynomial to fit the data perfectly. When we choose a polynomial whose degree is too high, we give the noise too much weight in the model, which leads to some undesirable behavior, like the wiggles in the graph.Fitting the data with a polynomial whose degree is too high is called

*overfitting*, a phenomenon that can appear in many machine learning applications. Generally speaking, we would like to choose \(k\) large enough to capture the essential features of the data but not so large that we overfit and build the noise into the model. There are ways to determine the optimal value of \(k\text{,}\) but we won't pursue that here. - Choosing a reasonable value of \(k\text{,}\) estimate the extent of Arctic sea ice at month 6.5, roughly at the Summer Solstice.

## Summary

This section introduced some types of least squares problems and a framework for working with them.

- Given an inconsistent system \(A\mathbf x=\mathbf b\text{,}\) we find \(\xhat\text{,}\) the least squares approximate solution, by requiring that \(A\xhat\) be as possible to \(\mathbf b\) as possible. In other words, \(A\xhat = \bhat\) where \(\bhat\) is the orthogonal projection of \(\mathbf b\) onto \(\col(A)\text{.}\)
- One way to find \(\xhat\) is by solving the normal equations \(A^TA\xhat = A^T\mathbf b.\) This is not our preferred method since numerical problems can arise.
- A second way to find \(\xhat\) uses a \(QR\) factorization of \(A\text{.}\) If \(A=QR\text{,}\) then \(\xhat = R^{-1}Q^T\mathbf b\) and finding \(R^{-1}\) is computationally feasible since \(R\) is upper triangular.
- This technique may be applied widely and is useful for modeling data. We saw examples in this section where linear functions of several input variables and polynomials provided effective models for different datasets.
- A simple measure of the quality of the fit is the coefficient of determination \(R^2\) though some additional thought should be given in real applications.

## Exercises 6.5.6Exercises

Evaluating the following cell loads in some commands that will be helpful in the following exercises. In particular, there are commands

`QR(A)`

that returns the \(QR\) factorization of`A`

as`Q, R = QR(A)`

,`onesvec(n)`

that returns the \(n\)-dimensional vector whose entries are all 1,`demean(v)`

that demeans the vector`v`

,`vandermonde(x, k)`

that returns the Vandermonde matrix of degree \(k\) formed from the components of the vector`x`

, and`plot_model(xhat, data)`

that plots the`data`

and the model`xhat`

.

##### 1

Suppose we write the linear system

\begin{equation*} \begin{bmatrix} 1 & -1 \\ 2 & -1 \\ -1 & 3 \end{bmatrix} \mathbf x = \threevec{-8}5{-10} \end{equation*}

as \(A\mathbf x=\mathbf b\text{.}\)

- Find an orthogonal basis for \(\col(A)\text{.}\)
- Find \(\bhat\text{,}\) the orthogonal projection of \(\mathbf b\) onto \(\col(A)\text{.}\)
- Find a solution to the linear system \(A\mathbf x = \bhat\text{.}\)

##### 2

Consider the data in Table 6.5.11.

\(x\) | \(y\) |

1 | 1 |

2 | 1 |

3 | 1 |

4 | 2 |

- Set up the linear system \(A\mathbf x=\mathbf b\) that describes the line \(b + mx = y\) passing through these points.
- Write the normal equations that describe the least squares approximate solution to \(A\mathbf x=\mathbf b\text{.}\)
- Find the least squares approximate solution \(\xhat\) and plot the data and the resulting line.
- What is your predicted \(y\)-value when \(x=3.5\text{?}\)
- Find the coefficient of determination \(R^2\text{.}\)

##### 3

Consider the four points in Table 6.5.11.

- Set up a linear system \(A\mathbf x = \mathbf b\) that describes a quadratic function
\begin{equation*} \beta_0+\beta_1x+\beta_2x^2 = y \end{equation*}

passing through the points.

- Use a \(QR\) factorization to find the least squares approximate solution \(\xhat\) and plot the data and the graph of the resulting quadratic function.
- What is your predicted \(y\)-value when \(x=3.5\text{?}\)
- Find the coefficient of determination \(R^2\text{.}\)

##### 4

Consider the data in Table 6.5.12.

\(x_1\) | \(x_2\) | \(y\) |

1 | 1 | 4.2 |

1 | 2 | 3.3 |

2 | 1 | 5.9 |

2 | 2 | 5.1 |

3 | 2 | 7.5 |

3 | 3 | 6.3 |

- Set up a linear system \(A\mathbf x = \mathbf b\) that describes the relationship
\begin{equation*} \beta_0 + \beta_1 x_1 + \beta_2 x_2 = y. \end{equation*}

- Find the least squares approximate solution \(\xhat\text{.}\)
- What is your predicted \(y\)-value when \(x_1 = 2.4\) and \(x_2=2.9\text{?}\)
- Find the coefficient of determination \(R^2\text{.}\)

##### 5

Determine whether the following statements are true or false and explain your thinking.

- If \(A\mathbf x=\mathbf b\) is consistent, then \(\xhat\) is a solution to \(A\mathbf x=\mathbf b\text{.}\)
- If \(R^2=1\text{,}\) then the least squares approximate solution \(\xhat\) is also a solution to the original equation \(A\mathbf x=\mathbf b\text{.}\)
- Given the \(QR\) factorization \(A=QR\text{,}\) we have \(A\xhat=Q^TQ\mathbf b\text{.}\)
- A \(QR\) factorization provides a method method for finding the approximate least squares solution to \(A\mathbf x=\mathbf b\) that is more reliable than solving the normal equations.
- A solution to \(AA^T\mathbf x = A\mathbf b\) is the least squares approximate solution to \(A\mathbf x = \mathbf b\text{.}\)

##### 6

Explain your response to the following questions.

- If \(\xhat=\zerovec\text{,}\) what does this say about the vector \(\mathbf b\text{?}\)
- If the columns of \(A\) are orthonormal, how can you easily find the least squares approximate solution to \(A\mathbf x=\mathbf b\text{?}\)

##### 7

The following cell loads in some data showing the number of people in Bangladesh living without electricity over 27 years. It also defines vectors `year`

, which records the years in the data set, and `people`

, which records the number of people.

- Suppose we want to write
\begin{equation*} N = \beta_0 + \beta_1 t \end{equation*}

where \(t\) is the year and \(N\) is the number of people. Construct the matrix \(A\) and vector \(\mathbf b\) so that the linear system \(A\mathbf x=\mathbf b\) describes the vector \(\mathbf x=\twovec{\beta_0}{\beta_1}\text{.}\)

- Using a \(QR\) factorization of \(A\text{,}\) find the values of \(\beta_0\) and \(\beta_1\) in the least squares approximate solution \(\xhat\text{.}\)
- What is the coefficient of determination \(R^2\) and what does this tell us about the quality of the approximation?
- What is your prediction for the number of people living without electricity in 1985?
- Estimate the year in which there will be no people living without electricity.

##### 8

This problem concerns a data set describing planets in our Solar system. For each planet, we have the length \(L\) of the semi-major axis, essentially the distance from the planet to the Sun in AU (astronomical units), and the period \(P\text{,}\) the length of time in years required to completed one orbit around the Sun.

We would like to model this data using the function \(P = CL^r\) where \(C\) and \(r\) are parameters we need to determine. Since this isn't a linear function, we will transform this relationship by taking the natural logarithm of both sides to obtain

\begin{equation*} \ln(P) = \ln(C) + r\ln(L). \end{equation*}

Evaluating the following cell loads the data set and defines two vectors `logaxis`

, whose components are \(\ln(L)\text{,}\) and `logperiod`

, whose components are \(\ln(P)\text{.}\)

- Construct the matrix \(A\) and vector \(\mathbf b\) so that the solution to \(A\mathbf x=\mathbf b\) is the vector \(\mathbf x=\twovec{\ln(C)}r\text{.}\)
- Find the least squares approximate solution \(\xhat\text{.}\) What does this give for the values of \(C\) and \(r\text{?}\)
- Find the coefficient of determination \(R^2\text{.}\) What does this tell us about the quality of the approximation?
- Suppose that the orbit of an asteroid has a semi-major axis whose length is \(L=4.0\) AU. Estimate the period \(P\) of the asteroid's orbit.
- Halley's Comet has a period of \(P=75\) years. Estimate the length of its semi-major axis.

##### 9

Evaluating the following cell loads a data set describing the temperature in the Earth's atmosphere at various altitudes. There are also two vectors `altitude`

, expressed in kilometers, and `temperature`

, in degress Celsius.

- Describe how to form the matrix \(A\) and vector \(\mathbf b\) so that the linear system \(A\mathbf x=\mathbf b\) describes a degree \(k\) polynomial fitting the data.
- Choose a value of \(k\text{,}\) construct the matrix \(A\) and vector \(\mathbf b\text{,}\) and find the least squares approximate solution \(\xhat\text{.}\)
- Plot the polynomial and data using
`plot_model(xhat, data)`

. - Now examine what happens as you vary the degree of the polynomial \(k\text{.}\) Choose an appropriate value of \(k\) that seems to capture the most important features of the data while avoiding overfitting, and explain your choice.
- Use your value of \(k\) to estimate the temperature at an altitude of 55 kilometers.

##### 10

The following cell loads some data describing 1057 houses in a particular real estate market. For each house, we record the living area in square feet, the lot size in acres, the age in years, and the price in dollars. The cell also defines variables `area`

, `size`

, `age`

, and `price`

.

We will use linear regression to predict the price of a house given its living area, lot size, and age:

\begin{equation*} \beta_0 + \beta_1~\text{Living Area} + \beta_2~\text{Lot Size} + \beta_3~\text{Age} = \text{Price}. \end{equation*}

- Use a \(QR\) factorization to find the least squares approximate solution \(\xhat\text{.}\)
- Discuss the significance of the signs of \(\beta_1\text{,}\) \(\beta_2\text{,}\) and \(\beta_3\text{.}\)
- If two houses are identical except for differing in age by one year, how would you predict their prices compare to one another?
- Find the coefficient of determination \(R^2\text{.}\) What does this say about the quality of the fit?
- Predict the price of a house whose living area is 2000 square feet, lot size is 1.5 acres, and age is 50 years.

##### 11

This problem is about the meaning of the coefficient of determination \(R^2\) and its connection to variance, a topic that appears in the next section. Throughout this problem, we consider the linear system \(A\mathbf x=\mathbf b\) and the approximate least squares solution \(\xhat\text{,}\) where \(A\xhat=\bhat\text{.}\) We suppose that \(A\) is an \(m\times n\) matrix. We will denote the \(m\)-dimensional vector \(\onevec = \fourvec11{\vdots}1\text{.}\)

- Explain why \(\bbar\text{,}\) the mean of the components of \(\mathbf b\text{,}\) can be found as the dot product
\begin{equation*} \bbar = \frac 1m \mathbf b\cdot\onevec. \end{equation*}

- In the examples we have seen in this section, explain why \(\onevec\) is in \(\col(A)\text{.}\)
- If we write \(\mathbf b = \bhat + \mathbf b^\perp\text{,}\) we explain why
\begin{equation*} \mathbf b^\perp\cdot\onevec = 0 \end{equation*}

and hence why the mean of the components of \(\mathbf b^\perp\) is zero.

- The variance of an \(m\)-dimensional vector \(\mathbf v\) is \(\var(\mathbf v) = \frac1m \len{\widetilde{\mathbf v}}^2\text{,}\) where \(\widetilde{\mathbf v}\) is the vector obtained by demeaning \(\mathbf v\text{.}\)
Explain why

\begin{equation*} \var(\mathbf b) = \var(\bhat) + \var(\mathbf b^\perp). \end{equation*}

- Explain why
\begin{equation*} \frac{\len{\mathbf b - A\xhat}^2}{\len{\widetilde{\mathbf b}}^2} = \frac{\var(\mathbf b^\perp)}{\var(\mathbf b)} \end{equation*}

and hence

\begin{equation*} R^2 = \frac{\var(\bhat)}{\var(\mathbf b)} = \frac{\var(A\xhat)}{\var(\mathbf b)}. \end{equation*}

These expressions indicate why it is sometimes said that \(R^2\) measures the “fraction of variance explained” by the function we are using to fit the data. As seen in the previous exercise, there may be other features that are not recorded in the dataset that influence the quantity we wish to predict.

- Explain why \(0\leq R^2 \leq 1\text{.}\)