5.6: Best Approximation and Least Squares


    Often an exact solution to a problem in applied mathematics is difficult to obtain. However, it is usually just as useful to find arbitrarily close approximations to a solution. In particular, finding “linear approximations” is a potent technique in applied mathematics. One basic case is the situation where a system of linear equations has no solution, and it is desirable to find a “best approximation” to a solution to the system. In this section best approximations are defined and a method for finding them is described. The result is then applied to “least squares” approximation of data.

    Suppose \(A\) is an \(m \times n\) matrix and \(\mathbf{b}\) is a column in \(\mathbb{R}^m\), and consider the system

    \[A\mathbf{x} = \mathbf{b} \nonumber \]

    of \(m\) linear equations in \(n\) variables. This need not have a solution. However, given any column \(\mathbf{z} \in \mathbb{R}^n\), the distance \(\|\mathbf{b} - A\mathbf{z}\|\) is a measure of how far \(A\mathbf{z}\) is from \(\mathbf{b}\). Hence it is natural to ask whether there is a column \(\mathbf{z}\) in \(\mathbb{R}^n\) that is as close as possible to a solution in the sense that

    \[\| \mathbf{b} - A\mathbf{z} \| \nonumber \]

    is the minimum value of \(\|\mathbf{b} - A\mathbf{x}\|\) as \(\mathbf{x}\) ranges over all columns in \(\mathbb{R}^n\).

    The answer is “yes”, and to describe it define

    \[U = \{A\mathbf{x} \mid \mathbf{x} \mbox{ lies in } \mathbb{R}^n \} \nonumber \]

    This is a subspace of \(\mathbb{R}^m\) (verify) and we want a vector \(A\mathbf{z}\) in \(U\) as close as possible to \(\mathbf{b}\). That there is such a vector is clear geometrically if \(n = 3\) by the diagram. In general such a vector \(A\mathbf{z}\) exists by a general result called the projection theorem that will be proved in Chapter [chap:8] (Theorem [thm:023885]). Moreover, the projection theorem gives a simple way to compute \(\mathbf{z}\) because it also shows that the vector \(\mathbf{b} - A\mathbf{z}\) is orthogonal to every vector \(A\mathbf{x}\) in \(U\). Thus, for all \(\mathbf{x}\) in \(\mathbb{R}^n\),

    \[\begin{aligned} 0 = (A\mathbf{x})\bullet (\mathbf{b} - A\mathbf{z}) = (A\mathbf{x})^T(\mathbf{b} - A\mathbf{z}) &= \mathbf{x}^TA^T(\mathbf{b} - A\mathbf{z}) \\ &= \mathbf{x}\bullet [A^T(\mathbf{b} - A\mathbf{z})]\end{aligned} \nonumber \]

    In other words, the vector \(A^{T}(\mathbf{b} - A\mathbf{z})\) in \(\mathbb{R}^n\) is orthogonal to every vector in \(\mathbb{R}^n\) and so must be zero (being orthogonal to itself). Hence \(\mathbf{z}\) satisfies

    \[(A^TA)\mathbf{z} = A^T\mathbf{b} \nonumber \]

    This is a system of linear equations called the normal equations for \(\mathbf{z}\).

    Note that this system can have more than one solution (see Exercise [ex:5_6_5]). However, the \(n \times n\) matrix \(A^{T}A\) is invertible if (and only if) the columns of \(A\) are linearly independent (Theorem [thm:015672]); so, in this case, \(\mathbf{z}\) is uniquely determined and is given explicitly by \(\mathbf{z} = (A^{T}A)^{-1}A^{T}\mathbf{b}\). In practice, though, the most efficient way to find \(\mathbf{z}\) is to apply gaussian elimination to the normal equations.
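    For readers who want to experiment numerically, here is a minimal sketch (not part of the text) of this computation in Python with NumPy; it assumes the columns of \(A\) are linearly independent, so \(A^{T}A\) is invertible.

```python
import numpy as np

def best_approximation(A, b):
    """Sketch: return a best approximation z to Ax = b by solving the
    normal equations (A^T A) z = A^T b.  Assumes the columns of A are
    linearly independent, so A^T A is invertible."""
    # np.linalg.solve performs Gaussian elimination (via LU factorization),
    # which is preferable to forming (A^T A)^{-1} explicitly.
    return np.linalg.solve(A.T @ A, A.T @ b)
```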

    This discussion is summarized in the following theorem.

    Theorem [thm:016733] (Best Approximation Theorem). Let \(A\) be an \(m \times n\) matrix, let \(\mathbf{b}\) be any column in \(\mathbb{R}^m\), and consider the system

    \[A\mathbf{x} = \mathbf{b} \nonumber \]

    of \(m\) equations in \(n\) variables.

    1. Any solution \(\mathbf{z}\) to the normal equations

      \[(A^TA)\mathbf{z} = A^T\mathbf{b} \nonumber \]

      is a best approximation to a solution to \(A\mathbf{x} = \mathbf{b}\) in the sense that \(\| \mathbf{b} - A\mathbf{z} \|\) is the minimum value of \(\| \mathbf{b} - A\mathbf{x} \|\) as \(\mathbf{x}\) ranges over all columns in \(\mathbb{R}^n\).

    2. If the columns of \(A\) are linearly independent, then \(A^{T}A\) is invertible and \(\mathbf{z}\) is given uniquely by \(\mathbf{z} = (A^{T}A)^{-1}A^{T}\mathbf{b}\).

    We note in passing that if \(A\) is \(n \times n\) and invertible, then

    \[\mathbf{z} = (A^TA)^{-1} A^T\mathbf{b} = A^{-1}\mathbf{b} \nonumber \]

    is the solution to the system of equations, and \(\|\mathbf{b} - A\mathbf{z}\| = 0\). Hence if \(A\) has independent columns, then \((A^{T}A)^{-1}A^{T}\) is playing the role of the inverse of the nonsquare matrix \(A\). The matrix \(A^{T}(AA^{T})^{-1}\) plays a similar role when the rows of \(A\) are linearly independent. These are both special cases of the generalized inverse of a matrix \(A\) (see Exercise [ex:5_6_14]). However, we shall not pursue this topic here.
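    As a quick numerical aside (not part of the text), NumPy's `np.linalg.pinv` computes the Moore–Penrose generalized inverse, and for a matrix with linearly independent columns it agrees with \((A^{T}A)^{-1}A^{T}\). The matrix below is the coefficient matrix of the example that follows.

```python
import numpy as np

# Coefficient matrix of the next example; its columns are linearly independent.
A = np.array([[3.0, -1.0],
              [1.0,  2.0],
              [2.0,  1.0]])

left_inverse = np.linalg.inv(A.T @ A) @ A.T          # (A^T A)^{-1} A^T
print(np.allclose(left_inverse, np.linalg.pinv(A)))  # True: matches the generalized inverse
print(np.allclose(left_inverse @ A, np.eye(2)))      # True: it acts as a left inverse of A
```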

    Example. The system of linear equations

    \[ \begin{array}{rlrcr} 3x & - & y & = & 4 \\ x & + & 2y & = & 0 \\ 2x & + & y & = & 1 \end{array} \nonumber \]

    has no solution. Find the vector \(\mathbf{z} = \left[ \begin{array}{r} x_0 \\ y_0 \end{array} \right]\) that best approximates a solution.

    In this case,

    \[A = \left[ \begin{array}{rr} 3 & -1 \\ 1 & 2 \\ 2 & 1 \end{array} \right] \mbox{, so } A^TA = \left[ \begin{array}{rrr} 3 & 1 & 2 \\ -1 & 2 & 1 \end{array} \right] \left[ \begin{array}{rr} 3 & -1 \\ 1 & 2 \\ 2 & 1 \end{array} \right] = \left[ \begin{array}{rr} 14 & 1 \\ 1 & 6 \end{array} \right] \nonumber \]

    is invertible. The normal equations \((A^{T}A)\mathbf{z} = A^{T}\mathbf{b}\) are

    \[\left[ \begin{array}{rr} 14 & 1 \\ 1 & 6 \end{array} \right] \mathbf{z} = \left[ \begin{array}{r} 14 \\ -3 \end{array} \right] \mbox{, so } \mathbf{z} = \frac{1}{83} \left[ \begin{array}{r} 87 \\ -56 \end{array} \right] \nonumber \]

    Thus \(x_0 = \frac{87}{83}\) and \(y_0 = \frac{-56}{83}\). With these values of \(x\) and \(y\), the left sides of the equations are, approximately,

    \[ \def\arraystretch{1.5} \begin{array}{rlrcrcr} 3x_0 & - & y_0 & = & \frac{317}{83} & = & 3.82 \\ x_0 & + & 2y_0 & = & \frac{-25}{83} & = & -0.30 \\ 2x_0 & + & y_0 & = & \frac{118}{83} & = & 1.42 \end{array} \nonumber \]

    This is as close as possible to a solution.
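    The numbers in this example can be reproduced with a few lines of NumPy (a sketch, using the solver idea described earlier):

```python
import numpy as np

A = np.array([[3.0, -1.0],
              [1.0,  2.0],
              [2.0,  1.0]])
b = np.array([4.0, 0.0, 1.0])

z = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations (A^T A)z = A^T b
print(z)        # [ 1.0482 -0.6747 ]  i.e. 87/83 and -56/83
print(A @ z)    # [ 3.8193 -0.3012  1.4217 ], the left sides computed above
```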

    Example. The average number \(g\) of goals per game scored by a hockey player seems to be related linearly to two factors: the number \(x_{1}\) of years of experience and the number \(x_{2}\) of goals in the preceding 10 games. The data in the table below were collected on four players. Find the linear function \(g = a_{0} + a_{1}x_{1} + a_{2}x_{2}\) that best fits these data.

    \[\begin{array}{|c|c|c|} \hline g & x_1 & x_2 \\ \hline 0.8 & 5 & 3 \\ 0.8 & 3 & 4 \\ 0.6 & 1 & 5 \\ 0.4 & 2 & 1 \\ \hline \end{array} \nonumber \]

    If the relationship is given by \(g = r_{0} + r_{1}x_{1} + r_{2}x_{2}\), then the data can be described as follows:

    \[\left[ \begin{array}{rrr} 1 & 5 & 3 \\ 1 & 3 & 4 \\ 1 & 1 & 5 \\ 1 & 2 & 1 \end{array} \right] \left[ \begin{array}{r} r_0 \\ r_1 \\ r_2 \end{array} \right] = \left[ \begin{array}{r} 0.8 \\ 0.8 \\ 0.6 \\ 0.4 \end{array} \right] \nonumber \]

    Using the notation in Theorem [thm:016733], we get

    \[\begin{aligned} \mathbf{z} &= (A^TA)^{-1}A^T\mathbf{b} \\ &= \frac{1}{42} \left[ \begin{array}{rrr} 119 & -17 & -19 \\ -17 & 5 & 1 \\ -19 & 1 & 5 \end{array} \right] \left[ \begin{array}{rrrr} 1 & 1 & 1 & 1 \\ 5 & 3 & 1 & 2 \\ 3 & 4 & 5 & 1 \end{array} \right] \left[ \begin{array}{r} 0.8 \\ 0.8 \\ 0.6 \\ 0.4 \end{array} \right] = \left[ \begin{array}{r} 0.14 \\ 0.09 \\ 0.08 \end{array} \right]\end{aligned} \nonumber \]

    Hence the best-fitting function is \(g = 0.14 + 0.09x_{1} + 0.08x_{2}\). The amount of computation would have been reduced if the normal equations had been constructed and then solved by gaussian elimination.
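    Assuming NumPy is available, the same answer can be checked directly from the data table (a sketch, not part of the original solution):

```python
import numpy as np

A = np.array([[1, 5, 3],
              [1, 3, 4],
              [1, 1, 5],
              [1, 2, 1]], dtype=float)
b = np.array([0.8, 0.8, 0.6, 0.4])

z = np.linalg.solve(A.T @ A, A.T @ b)   # solve the normal equations
print(np.round(z, 2))                   # [0.14 0.09 0.08]
```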

    Least Squares Approximation

    In many scientific investigations, data are collected that relate two variables. For example, if \(x\) is the number of dollars spent on advertising by a manufacturer and \(y\) is the value of sales in the region in question, the manufacturer could generate data by spending \(x_{1}, x_{2}, \dots, x_{n}\) dollars at different times and measuring the corresponding sales values \(y_{1}, y_{2}, \dots, y_{n}\).

    Suppose it is known that a linear relationship exists between the variables \(x\) and \(y\)—in other words, that \(y = a + bx\) for some constants \(a\) and \(b\). If the data are plotted, the points \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\) may appear to lie on a straight line and estimating \(a\) and \(b\) requires finding the “best-fitting” line through these data points. For example, if five data points occur as shown in the diagram, line 1 is clearly a better fit than line 2. In general, the problem is to find the values of the constants \(a\) and \(b\) such that the line \(y = a + bx\) best approximates the data in question. Note that an exact fit would be obtained if \(a\) and \(b\) were such that \(y_{i} = a + bx_{i}\) were true for each data point \((x_{i}, y_{i})\). But this is too much to expect. Experimental errors in measurement are bound to occur, so the choice of \(a\) and \(b\) should be made in such a way that the errors between the observed values \(y_{i}\) and the corresponding fitted values \(a + bx_{i}\) are in some sense minimized. Least squares approximation is a way to do this.

    The first thing we must do is explain exactly what we mean by the best fit of a line \(y = a + bx\) to an observed set of data points \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\). For convenience, write the linear function \(r_{0} + r_{1}x\) as

    \[f(x) = r_0 +r_1x \nonumber \]

    so that the fitted points (on the line) have coordinates \((x_{1}, f(x_{1})), \dots, (x_{n}, f(x_{n}))\).

    The second diagram is a sketch of what the line \(y = f(x)\) might look like. For each \(i\) the observed data point \((x_{i}, y_{i})\) and the fitted point \((x_{i}, f(x_{i}))\) need not be the same, and the distance \(d_{i}\) between them measures how far the line misses the observed point. For this reason \(d_{i}\) is often called the error at \(x_{i}\), and a natural measure of how close the line \(y = f(x)\) is to the observed data points is the sum \(d_{1} + d_{2} + \dots + d_{n}\) of all these errors. However, it turns out to be better to use the sum of squares

    \[S = d_1^2 + d_2^2 + \dotsb + d_n^2 \nonumber \]

    as the measure of error, and the line \(y = f(x)\) is to be chosen so as to make this sum as small as possible. This line is said to be the least squares approximating line for the data points \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\).

    The square of the error \(d_i\) is given by \(d_i^2 = [y_i - f(x_i)]^2\) for each \(i\), so the quantity \(S\) to be minimized is the sum:

    \[S = [y_1 - f(x_1)]^2 + [y_2 - f(x_2)]^2 + \dotsb + [y_n - f(x_n)]^2 \nonumber \]

    Note that all the numbers \(x_{i}\) and \(y_{i}\) are given here; what is required is that the function \(f\) be chosen in such a way as to minimize \(S\). Because \(f(x) = r_{0} + r_{1}x\), this amounts to choosing \(r_{0}\) and \(r_{1}\) to minimize \(S\). This problem can be solved using Theorem [thm:016733]. The following notation is convenient.

    \[\mathbf{x} = \left[ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right] \ \mathbf{y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] \quad \mbox{ and } \quad f(\mathbf{x}) = \left[ \begin{array}{c} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{array} \right] = \left[ \begin{array}{c} r_0 + r_1x_1 \\ r_0 + r_1x_2 \\ \vdots \\ r_0 + r_1x_n \\ \end{array} \right] \nonumber \]

    Then the problem takes the following form: Choose \(r_{0}\) and \(r_{1}\) such that

    \[S = [y_1 - f(x_1)]^2 + [y_2 - f(x_2)]^2 + \dotsb + [y_n - f(x_n)]^2 = \| \mathbf{y} - f(\mathbf{x}) \| ^2 \nonumber \]

    is as small as possible. Now write

    \[M = \left[ \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array} \right] \quad \mbox{ and } \quad \mathbf{r} = \left[ \begin{array}{r} r_0 \\ r_1 \end{array} \right] \nonumber \]

    Then \(M\mathbf{r} = f(\mathbf{x})\), so we are looking for a column \(\mathbf{r} = \left[ \begin{array}{r} r_0 \\ r_1 \end{array} \right]\) such that \(\|\mathbf{y} - M\mathbf{r}\|^{2}\) is as small as possible. In other words, we are looking for a best approximation \(\mathbf{z}\) to the system \(M\mathbf{r} = \mathbf{y}\). Hence Theorem [thm:016733] applies directly, and we have

    Theorem [thm:016878]. Suppose that \(n\) data points \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\) are given, where at least two of \(x_{1}, x_{2}, \dots, x_{n}\) are distinct. Put

    \[\mathbf{y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] \ M = \left[ \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array} \right] \nonumber \]

    Then the least squares approximating line for these data points has equation

    \[y = z_0 +z_1x \nonumber \]

    where \(\mathbf{z} = \left[ \begin{array}{r} z_0 \\ z_1 \end{array} \right]\) is found by gaussian elimination from the normal equations

    \[(M^TM)\mathbf{z} = M^T\mathbf{y} \nonumber \]

    The condition that at least two of \(x_{1}, x_{2}, \dots, x_{n}\) are distinct ensures that \(M^{T}M\) is an invertible matrix, so \(\mathbf{z}\) is unique:

    \[\mathbf{z} = (M^TM)^{-1}M^T\mathbf{y} \nonumber \]

    Example. Let data points \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{5}, y_{5})\) be given as in the accompanying table. Find the least squares approximating line for these data.

    \[\begin{array}{|c|c|} \hline x & y \\ \hline 1 & 1 \\ 3 & 2 \\ 4 & 3 \\ 6 & 4 \\ 7 & 5 \\ \hline \end{array} \nonumber \]

    In this case we have

    \[\begin{aligned} M^TM &= \left[ \begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_5 \end{array} \right] \left[ \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_5 \end{array} \right] \\ &= \left[ \begin{array}{cc} 5 & x_1 + \dotsb + x_5 \\ x_1 + \dotsb + x_5 & x_1^2 + \dotsb + x_5^2 \\ \end{array} \right] = \left[ \begin{array}{rr} 5 & 21 \\ 21 & 111 \end{array} \right] \\ \mbox{and } M^T\mathbf{y} &= \left[ \begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_5 \end{array} \right] \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_5 \end{array} \right] \\ &= \left[ \begin{array}{c} y_1 + y_2 + \dotsb + y_5 \\ x_1y_1 + x_2y_2 + \dotsb + x_5y_5 \end{array} \right] = \left[ \begin{array}{r} 15 \\ 78 \end{array} \right]\end{aligned} \nonumber \]

    so the normal equations \((M^{T}M)\mathbf{z} = M^{T}\mathbf{y}\) for \(\mathbf{z} = \left[ \begin{array}{r} z_0 \\ z_1 \end{array} \right]\) become

    \[\left[ \begin{array}{rr} 5 & 21 \\ 21 & 111 \end{array} \right] \left[ \begin{array}{r} z_0 \\ z_1 \end{array} \right] = \left[ \begin{array}{r} 15 \\ 78 \end{array} \right] \nonumber \]

    The solution (using gaussian elimination) is \(\mathbf{z} = \left[ \begin{array}{r} z_0 \\ z_1 \end{array} \right] =\left[ \begin{array}{r} 0.24 \\ 0.66 \end{array} \right]\) to two decimal places, so the least squares approximating line for these data is \(y = 0.24 + 0.66x\). Note that \(M^{T}M\) is indeed invertible here (the determinant is \(114\)), and the exact solution is

    \[\mathbf{z} = (M^TM)^{-1}M^T\mathbf{y} = \frac{1}{114} \left[ \begin{array}{rr} 111 & -21 \\ -21 & 5 \end{array} \right] \left[ \begin{array}{r} 15 \\ 78 \end{array} \right] = \frac{1}{114} \left[ \begin{array}{r} 27 \\ 75 \end{array} \right] = \frac{1}{38} \left[ \begin{array}{r} 9 \\ 25 \end{array} \right] \nonumber \]
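    As an aside, the same line can be obtained numerically; the sketch below (assuming NumPy) builds \(M\), solves the normal equations, and compares the result with NumPy's own polynomial fitter.

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 6.0, 7.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

M = np.column_stack([np.ones_like(x), x])   # the matrix M above
z = np.linalg.solve(M.T @ M, M.T @ y)       # normal equations (M^T M)z = M^T y
print(z)                                    # [0.2368 0.6579] = [9/38, 25/38]

# np.polyfit solves the same least squares problem
# (coefficients are returned in decreasing-degree order).
print(np.polyfit(x, y, 1))                  # [0.6579 0.2368]
```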

    Least Squares Approximating Polynomials

    Suppose now that, rather than a straight line, we want to find a polynomial

    \[y = f(x) = r_0 + r_1x + r_2x^2 + \dots + r_mx^m \nonumber \]

    of degree \(m\) that best approximates the data pairs \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\). As before, write

    \[\mathbf{x} = \left[ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right] \ \mathbf{y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] \quad \mbox{ and } \quad f(\mathbf{x}) = \left[ \begin{array}{c} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{array} \right] \nonumber \]

    For each \(x_{i}\) we have two values of the variable \(y\), the observed value \(y_{i}\), and the computed value \(f(x_{i})\). The problem is to choose \(f(x)\)—that is, choose \(r_{0}, r_{1}, \dots, r_{m}\) —such that the \(f(x_{i})\) are as close as possible to the \(y_{i}\). Again we define “as close as possible” by the least squares condition: We choose the \(r_{i}\) such that

    \[\| \mathbf{y} - f(\mathbf{x}) \| ^2 = [y_1 - f(x_1)]^2 + [y_2 - f(x_2)]^2 + \dotsb + [y_n - f(x_n)]^2 \nonumber \]

    is as small as possible.

    Definition. A polynomial \(f(x)\) satisfying this condition is called a least squares approximating polynomial of degree \(m\) for the given data pairs.

    If we write

    \[M = \def\arraystretch{1.3} \left[ \begin{array}{ccccc} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \\ \end{array} \right] \quad \mbox{ and } \quad \mathbf{r} = \left[ \begin{array}{c} r_0 \\ r_1 \\ \vdots \\ r_m \end{array} \right] \nonumber \]

    we see that \(f(\mathbf{x}) = M\mathbf{r}\). Hence we want to find \(\mathbf{r}\) such that \(\|\mathbf{y} - M\mathbf{r}\|^{2}\) is as small as possible; that is, we want a best approximation \(\mathbf{z}\) to the system \(M\mathbf{r} = \mathbf{y}\). Theorem [thm:016733] gives the first part of Theorem [thm:016951].

    Theorem [thm:016951]. Let \(n\) data pairs \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\) be given, and write

    \[\mathbf{y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] \ M = \def\arraystretch{1.3} \left[ \begin{array}{ccccc} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \\ \end{array} \right] \ \mathbf{z} = \left[ \begin{array}{c} z_0 \\ z_1 \\ \vdots \\ z_m \end{array} \right] \nonumber \]

    1. If \(\mathbf{z}\) is any solution to the normal equations

      \[(M^TM)\mathbf{z} = M^T\mathbf{y} \nonumber \]

      then the polynomial

      \[z_0 + z_1x + z_2x^2 + \dots + z_mx^m \nonumber \]

      is a least squares approximating polynomial of degree \(m\) for the given data pairs.

    2. If at least \(m + 1\) of the numbers \(x_{1}, x_{2}, \dots, x_{n}\) are distinct (so \(n \geq m + 1\)), the matrix \(M^{T}M\) is invertible and \(\mathbf{z}\) is uniquely determined by

      \[\mathbf{z} = (M^TM)^{-1}M^T\mathbf{y} \nonumber \]

    Proof. It remains to prove (2), and for that we show that the columns of \(M\) are linearly independent (Theorem [thm:015672]). Suppose a linear combination of the columns vanishes:

    \[r_0 \left[ \begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array} \right] + r_1 \left[ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right] + \dots + r_m \left[ \begin{array}{c} x_1^m \\ x_2^m \\ \vdots \\ x_n^m \end{array} \right] = \left[ \begin{array}{c} 0 \\ 0 \\ \vdots \\ 0 \\ \end{array} \right] \nonumber \]

    If we write \(q(x) = r_{0} + r_{1}x + \dots + r_{m}x^{m}\), comparing entries shows that

    \[q(x_{1}) = q(x_{2}) = \dots = q(x_{n}) = 0 \nonumber \]

    Hence \(q(x)\) is a polynomial of degree at most \(m\) with at least \(m + 1\) distinct roots, so \(q(x)\) must be the zero polynomial (see Appendix [chap:appdpolynomials] or Theorem [thm:020203]). Thus \(r_{0} = r_{1} = \dots = r_{m} = 0\) as required.

    Example. Find the least squares approximating quadratic \(y = z_{0} + z_{1}x + z_{2}x^{2}\) for the following data points.

    \[(-3, 3), (-1, 1), (0, 1), (1, 2), (3, 4) \nonumber \]

    This is an instance of Theorem [thm:016951] with \(m = 2\). Here

    \[\mathbf{y} = \left[ \begin{array}{r} 3\\ 1\\ 1\\ 2\\ 4 \end{array} \right] \ M = \left[ \begin{array}{rrr} 1 & -3 & 9 \\ 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 3 & 9 \end{array} \right] \nonumber \]

    Hence,

    \[M^TM = \left[ \begin{array}{rrrrr} 1 & 1 & 1 & 1 & 1 \\ -3 & -1 & 0 & 1 & 3 \\ 9 & 1 & 0 & 1 & 9 \end{array} \right] \left[ \begin{array}{rrr} 1 & -3 & 9 \\ 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 3 & 9 \end{array} \right] = \left[ \begin{array}{rrr} 5 & 0 & 20 \\ 0 & 20 & 0 \\ 20 & 0 & 164 \end{array} \right] \nonumber \]

    \[M^T\mathbf{y} = \left[ \begin{array}{rrrrr} 1 & 1 & 1 & 1 & 1 \\ -3 & -1 & 0 & 1 & 3 \\ 9 & 1 & 0 & 1 & 9 \end{array} \right] \left[ \begin{array}{r} 3 \\ 1 \\ 1 \\ 2 \\ 4 \end{array} \right] = \left[ \begin{array}{r} 11 \\ 4 \\ 66 \end{array} \right] \nonumber \]

    The normal equations for \(\mathbf{z}\) are

    \[\left[ \begin{array}{rrr} 5 & 0 & 20 \\ 0 & 20 & 0 \\ 20 & 0 & 164 \end{array} \right] \mathbf{z} = \left[ \begin{array}{r} 11 \\ 4 \\ 66 \end{array} \right] \quad \mbox{ whence } \mathbf{z} = \left[ \begin{array}{r} 1.15 \\ 0.20 \\ 0.26 \end{array} \right] \nonumber \]

    This means that the least squares approximating quadratic for these data is \(y = 1.15 + 0.20x + 0.26x^{2}\).
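    A short NumPy check of this example (a sketch, not part of the text):

```python
import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
y = np.array([ 3.0,  1.0, 1.0, 2.0, 4.0])

M = np.column_stack([np.ones_like(x), x, x**2])  # columns 1, x, x^2
z = np.linalg.solve(M.T @ M, M.T @ y)            # normal equations
print(np.round(z, 2))                            # [1.15 0.2  0.26]
```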

    Other Functions

    There is an extension of Theorem [thm:016951] that should be mentioned. Given data pairs \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\), that theorem shows how to find a polynomial

    \[f(x) = r_0 + r_1x + \dots + r_mx^m \nonumber \]

    such that \(\|\mathbf{y} - f(\mathbf{x})\|^{2}\) is as small as possible, where \(\mathbf{x}\) and \(f(\mathbf{x})\) are as before. Choosing the appropriate polynomial \(f(x)\) amounts to choosing the coefficients \(r_{0}, r_{1}, \dots, r_{m}\), and Theorem [thm:016951] gives a formula for the optimal choices. Here \(f(x)\) is a linear combination of the functions \(1, x, x^{2}, \dots, x^{m}\) where the \(r_{i}\) are the coefficients, and this suggests applying the method to other functions. If \(f_{0}(x), f_{1}(x), \dots, f_{m}(x)\) are given functions, write

    \[f(x) = r_0f_0(x) + r_1f_1(x) + \dots + r_mf_m(x) \nonumber \]

    where the \(r_{i}\) are real numbers. Then the more general question is whether \(r_{0}, r_{1}, \dots, r_{m}\) can be found such that \(\|\mathbf{y} - f(\mathbf{x})\|^{2}\) is as small as possible where

    \[f(\mathbf{x}) = \left[ \begin{array}{c} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{array} \right] \nonumber \]

    Such a function \(f(x)\) is called a least squares best approximation for these data pairs of the form \(r_{0}f_{0}(x) + r_{1}f_{1}(x) + \dots + r_{m}f_{m}(x)\), \(r_{i}\) in \(\mathbb{R}\). The proof of Theorem [thm:016951] goes through to prove

    Theorem [thm:017041]. Let \(n\) data pairs \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\) be given, and suppose that \(m + 1\) functions \(f_{0}(x), f_{1}(x), \dots, f_{m}(x)\) are specified. Write

    \[\mathbf{y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] \quad M = \left[ \begin{array}{cccc} f_0(x_1) & f_1(x_1) & \cdots & f_m(x_1) \\ f_0(x_2) & f_1(x_2) & \cdots & f_m(x_2) \\ \vdots & \vdots & & \vdots \\ f_0(x_n) & f_1(x_n) & \cdots & f_m(x_n) \\ \end{array} \right] \quad \mathbf{z} = \left[ \begin{array}{c} z_0 \\ z_1 \\ \vdots \\ z_m \end{array} \right] \nonumber \]

    1. If \(\mathbf{z}\) is any solution to the normal equations

      \[(M^TM)\mathbf{z} = M^T\mathbf{y} \nonumber \]

      then the function

      \[z_0f_0(x) + z_1f_1(x) + \dots + z_mf_m(x) \nonumber \]

      is the best approximation for these data among all functions of the form \(r_{0}f_{0}(x) + r_{1}f_{1}(x) + \dots + r_{m}f_{m}(x)\) where the \(r_{i}\) are in \(\mathbb{R}\).

    2. If \(M^{T}M\) is invertible (that is, if \(\mbox{rank}(M) = m + 1\)), then \(\mathbf{z}\) is uniquely determined; in fact, \(\mathbf{z} = (M^{T}M)^{-1}(M^{T}\mathbf{y})\).

    Clearly Theorem [thm:017041] contains Theorem [thm:016951] as a special case, but there is no simple test in general for whether \(M^{T}M\) is invertible. Conditions for this to hold depend on the choice of the functions \(f_{0}(x), f_{1}(x), \dots, f_{m}(x)\).

    Example. Given the data pairs \((-1, 0)\), \((0, 1)\), and \((1, 4)\), find the least squares approximating function of the form \(r_{0}x + r_{1}2^{x}\).

    The functions are \(f_{0}(x) = x\) and \(f_{1}(x) = 2^{x}\), so the matrix \(M\) is

    \[M = \left[ \begin{array}{cc} f_0(x_1) & f_1(x_1) \\ f_0(x_2) & f_1(x_2) \\ f_0(x_3) & f_1(x_3) \end{array} \right] = \left[ \begin{array}{rl} -1 & 2^{-1} \\ 0 & 2^0 \\ 1 & 2^1 \end{array} \right] =\frac{1}{2} \left[ \begin{array}{rr} -2 & 1 \\ 0 & 2 \\ 2 & 4 \end{array} \right] \nonumber \]

    In this case \(M^TM = \frac{1}{4} \left[ \begin{array}{rr} 8 & 6 \\ 6 & 21 \end{array} \right]\) is invertible, so the normal equations

    \[\frac{1}{4} \left[ \begin{array}{rr} 8 & 6 \\ 6 & 21 \end{array} \right] \mathbf{z} = \left[ \begin{array}{r} 4 \\ 9 \end{array} \right] \nonumber \]

    have a unique solution \(\mathbf{z} = \frac{1}{11} \left[ \begin{array}{r} 10 \\ 16 \end{array} \right].\) Hence the best-fitting function of the form \(r_0x + r_12^x\) is \(\overline{f}(x) = \frac{10}{11}x + \frac{16}{11}2^x\). Note that \(\overline{f}(\mathbf{x}) = \left[ \begin{array}{c} \overline{f}(-1) \\ \overline{f}(0) \\ \overline{f}(1) \\ \end{array} \right] = \def\arraystretch{1.5} \left[ \begin{array}{r} \frac{-2}{11} \\ \frac{16}{11} \\ \frac{42}{11} \end{array} \right]\), compared with \(\mathbf{y} = \left[ \begin{array}{r} 0 \\ 1 \\ 4 \end{array} \right]\).
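    Finally, the same recipe works for this last example as well; the sketch below (assuming NumPy) builds \(M\) from the columns \(f_0(x) = x\) and \(f_1(x) = 2^x\) and recovers \(\mathbf{z} = \frac{1}{11}\left[ \begin{array}{r} 10 \\ 16 \end{array} \right]\).

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])
y = np.array([ 0.0, 1.0, 4.0])

M = np.column_stack([x, 2.0**x])        # columns f_0(x) = x and f_1(x) = 2^x
z = np.linalg.solve(M.T @ M, M.T @ y)   # normal equations
print(z)        # [0.9091 1.4545] = [10/11, 16/11]
print(M @ z)    # [-0.1818  1.4545  3.8182], compare with y = [0, 1, 4]
```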
