Section 26 Least Squares Approximation
26.1 Introduction
Let’s start with a summary of Least Squares Approximation.
The Why: Given a matrix \(A\) and a vector \(\mathsf{b}\) that is not in \(W = \mathrm{Col}(A)\), we want to find the “best approximate solution” \(\hat{\mathsf{b}} \in W\). In other words, we want to pick the best possible \(\hat{\mathsf{b}} \approx \mathsf{b}\) that lies in the column space of \(A\).
The What: The answer is to use projections.
- This “best approximation” is the projection \(\hat{\mathsf{b}} = \mbox{proj}_W \mathsf{b}\).
- The residual vector \(\mathsf{r} = \mathsf{b} - \hat{\mathsf{b}}\) is in \(W^{\perp}\).
- The length \(\| \mathsf{r} \|\) of the residual vector measures the closeness of the approximation.
- The approximate solution to our original problem is the vector \(\hat{\mathsf{x}}\) such that \(A \hat{\mathsf{x}} = \hat{\mathsf{b}}\).
The How: A clever way to solve this is to use the normal equations. The best choice for \(\hat{\mathsf{x}}\) satisfies \[ A^{\top} A \hat{\mathsf{x}} = A^{\top} \mathsf{b}. \]
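Where do the normal equations come from? Since the residual \(\mathsf{r} = \mathsf{b} - A\hat{\mathsf{x}}\) must lie in \(W^{\perp} = \mathrm{Col}(A)^{\perp} = \mathrm{Nul}(A^{\top})\), applying \(A^{\top}\) to the residual gives \[ A^{\top}(\mathsf{b} - A\hat{\mathsf{x}}) = \mathsf{0} \quad \Longleftrightarrow \quad A^{\top} A \hat{\mathsf{x}} = A^{\top} \mathsf{b}. \]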
26.2 Example 1
Here is an example that we did by hand in class. We now see how to do it in R.
Find the least-squares solution to \({\mathsf{A}}\mathsf{x} = {\mathsf{b}}\) if \[\begin{equation} {\mathsf{A}}= \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 0 & -1 \\ \end{bmatrix} \quad \mbox{and} \quad {\mathsf{b}}= \begin{bmatrix} 1 \\ 1 \\ 3 \end{bmatrix}. \tag{26.1} \end{equation}\] First, for good measure, let's verify that the system is inconsistent:
A = cbind(c(1,1,0),c(1,2,-1))
b = c(1,1,3)
Ab = cbind(A,b)
Ab
##            b
## [1,] 1  1 1
## [2,] 1  2 1
## [3,] 0 -1 3
rref(Ab)  # rref() is not in base R; it comes from an add-on package such as pracma
##           b
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
True indeed: the reduced row echelon form has a pivot in the last (augmented) column, so the system is inconsistent and \({\mathsf{b}}\not \in \mathrm{Col}({\mathsf{A}})\).
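Another way to check consistency is to compare ranks: the system is consistent exactly when \(\mathrm{rank}(A) = \mathrm{rank}([\,A \mid \mathsf{b}\,])\). A minimal sketch using base R's qr() (this check was not part of the class example):
# consistent iff A and the augmented matrix [A|b] have the same rank
qr(A)$rank == qr(Ab)$rank   # FALSE here: rank 2 vs. rank 3, so inconsistent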
Now we compute the normal equations to see what they look like:
t(A) %*% A
##      [,1] [,2]
## [1,]    2    3
## [2,]    3    6
t(A) %*% b
##      [,1]
## [1,]    2
## [2,]    0
So instead of (26.1) we solve the following normal equations: \[\begin{equation} \begin{bmatrix} 2 & 3 \\ 3 & 6 \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix}. \tag{26.2} \end{equation}\]
We can do this with a single R command:
(xhat = solve(t(A) %*% A, t(A) %*% b))
##      [,1]
## [1,]    4
## [2,]   -2
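As a sanity check (an alternative we did not use in class), base R's qr.solve() finds the least-squares solution of an over-determined system directly via a QR decomposition:
# should agree with xhat above
qr.solve(A, b)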
To compute \(\hat {\mathsf{b}}\) we use
(bhat = A %*% xhat)
##      [,1]
## [1,]    2
## [2,]    0
## [3,]    2
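For the curious, \(\hat{\mathsf{b}}\) can also be computed as \(P\mathsf{b}\), where \(P = A(A^{\top}A)^{-1}A^{\top}\) is the projection matrix onto \(\mathrm{Col}(A)\). A minimal sketch (the name P is ours, for illustration):
# projection matrix onto Col(A); P %*% b reproduces bhat
P = A %*% solve(t(A) %*% A) %*% t(A)
P %*% b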
And to get the residual, we use
(r = b - bhat)
##      [,1]
## [1,]   -1
## [2,]    1
## [3,]    1
The length of the residual measures how far \(\hat{\mathsf{b}}\) is from \(\mathsf{b}\):
sqrt(t(r) %*% r)
##          [,1]
## [1,] 1.732051
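The expression sqrt(t(r) %*% r) is just the Euclidean length of r. Equivalent base R spellings, if you prefer them, are:
# two equivalent ways to compute the length of the residual
sqrt(sum(r^2))
norm(r, type = "2")  # the 2-norm of a one-column matrix is its Euclidean length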
We can also check that the residual is orthogonal to \(Col({\mathsf{A}})\):
t(A) %*% r
##      [,1]
## [1,]    0
## [2,]    0
26.3 Template
The following R code does it all. You can use this as a template for future problems. Just enter the matrix A and the vector b.
# Given: the matrix A
A = cbind(c(1,1,0),c(1,2,-1))
# Given: the target vector b
b = c(1,1,3)

# solve the normal equations
(xhat = solve(t(A) %*% A, t(A) %*% b))
##      [,1]
## [1,]    4
## [2,]   -2
# find the projection
(bhat = A %*% xhat)
##      [,1]
## [1,]    2
## [2,]    0
## [3,]    2
# find the residual vector
(r = b - bhat)
##      [,1]
## [1,]   -1
## [2,]    1
## [3,]    1
# check that r is orthogonal to Col(A)
t(A) %*% r
##      [,1]
## [1,]    0
## [2,]    0
# measure the distance between bhat and b
sqrt( t(r) %*% r)
##          [,1]
## [1,] 1.732051
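If you find yourself doing this repeatedly, you can wrap the template in a function. Here is a minimal sketch (the name least_squares is ours, not part of the notes):
# returns the least-squares solution, the projection, the residual,
# and the length of the residual for a given matrix A and target vector b
least_squares = function(A, b) {
  xhat = solve(t(A) %*% A, t(A) %*% b)
  bhat = A %*% xhat
  r    = b - bhat
  list(xhat = xhat, bhat = bhat, r = r, dist = sqrt(sum(r^2)))
}
least_squares(A, b)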
26.4 Fitting for a Linear Function
Here are some points that we’d like to fit to a linear function \(y = a_0 + a_1 x\).
Note: Here we use y instead of b because we like to write linear equations as “\(y = cx + d\),” so the expression “\(b = a_0 + a_1 x\)” looks funny to us. We will therefore talk about y and yhat instead of b and bhat.
x = c(1,2,3,4,5,6)
y = c(7,2,1,3,7,7)
plot(x,y,pch=19,ylim=c(0,10))
grid()
The linear equations that we want to fit are as follows. \[ \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} 7 \\ 2 \\ 1 \\ 3 \\ 7 \\ 7 \end{bmatrix}. \]
These equations are inconsistent, so we solve the normal equations \(A^{\top} A \hat{\mathsf{x}} = A^{\top} \mathsf{y}\) and find an approximate solution instead. Pro Tip: a clever way to create the desired matrix \(A\) is to use the fact that \(x^0=1\) for any number \(x\).
(A = cbind(x^0,x))
##        x
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 4
## [5,] 1 5
## [6,] 1 6
Let’s take a look at the normal equations:
t(A) %*% A
##       x
##    6 21
## x 21 91
t(A) %*% y
##   [,1]
##     27
## x  103
So the normal equations to solve are below. Notice that even though there are 6 equations (one per data point), we only have to solve a 2×2 system, since there are only 2 unknowns. \[ \begin{bmatrix} 6 & 21 \\ 21 & 91 \\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} 27 \\ 103 \end{bmatrix}. \]
(xhat = solve(t(A) %*% A, t(A) %*% y))
##        [,1]
##   2.8000000
## x 0.4857143
This tells us that the desired intercept is \(a_0 = 2.8\), the desired slope is \(a_1 \approx 0.4857\), and the linear model is \(y = 2.8 + 0.4857x\).
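As a sanity check (not something we need for the hand computation), base R's built-in lm() performs exactly this least-squares fit, so its coefficients should match xhat:
# lm() solves the same least-squares problem
lm(y ~ x)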
We can plot the points together with the solution using:
#plot the original set of points
plot(x,y,pch=19,xlim=c(0,7),ylim=c(0,10), main='the best-fit linear function')
# generate points for the fitted line and plot it
tt = seq(1,6,len=100)
lines(tt,xhat[1]+xhat[2]*tt,col='blue')
# get yhat
yhat = A %*% xhat
# add the residuals to the plot
for (i in 1:length(x)) {
lines(c(x[i],x[i]),c(y[i],yhat[i]), col='red')
}
#add yhat to the plot
points(x,yhat,pch=19,col='orange')
#put the original points back on the plot last so we can see them
points(x,y,pch=19,col="black")
grid()
In this visualization we see the following:
- The black points: the original data points cbind(x,y). These represent the entries of the desired target vector y.
- The blue curve: the fitted line, created from the approximate solution xhat.
- The orange points: the approximations cbind(x,yhat) of the data points cbind(x,y). These represent the entries of the projection yhat.
- The red line segments: the distances between the original data points (black dots) and their approximations (orange dots). The lengths of these red segments are the entries of the residual vector r.
Let’s look at the residual and see that it is indeed orthogonal to the columns of \(A\).
yhat = A %*% xhat
r = y - yhat
res = cbind(y,yhat,r)
colnames(res) = c("y","yhat","r")
res
##      y     yhat         r
## [1,] 7 3.285714  3.714286
## [2,] 2 3.771429 -1.771429
## [3,] 1 4.257143 -3.257143
## [4,] 3 4.742857 -1.742857
## [5,] 7 5.228571  1.771429
## [6,] 7 5.714286  1.285714
t(A) %*% r
##            [,1]
##   -8.881784e-16
## x  3.552714e-15
These entries are on the order of \(10^{-15}\): zero up to floating-point roundoff, so the residual is indeed orthogonal to the columns of \(A\).
sqrt(t(r) %*% r)
##          [,1]
## [1,] 5.947388
26.5 Fitting for a Quadratic Function
The data we have been working with has a quadratic look to it, so let’s try adding an \(x^2\) term. That is, we will fit the model \(y = a_0 + a_1 x + a_2 x^2\). The equations we want to solve are: \[ \begin{bmatrix} 1 & 1 & 1\\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \\ 1 & 6 & 36 \\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 7 \\ 2 \\ 1 \\ 3 \\ 7 \\ 7 \end{bmatrix}. \]
It is easy enough to add this to our matrix \(A\).
(A = cbind(x^0,x,x^2))
##        x
## [1,] 1 1  1
## [2,] 1 2  4
## [3,] 1 3  9
## [4,] 1 4 16
## [5,] 1 5 25
## [6,] 1 6 36
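An aside: to fit a polynomial of any degree \(k\), one convenient way (our suggestion, not from the notes) to build this Vandermonde-style matrix is outer():
# column j holds x^(j-1); this reproduces cbind(x^0, x, x^2)
outer(x, 0:2, "^")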
In this case our normal equations are 3×3:
t(A) %*% A
##        x
##    6  21   91
## x 21  91  441
##   91 441 2275
t(A) %*% y
##   [,1]
##     27
## x  103
##    499
\[ \begin{bmatrix} 6 & 21 & 91 \\ 21 & 91 & 441 \\ 91 & 441 & 2275 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 27 \\ 103 \\ 499 \end{bmatrix}. \] The solution is computed by:
(xhat = solve(t(A) %*% A, t(A) %*% y))
##         [,1]
##   10.3000000
## x -5.1392857
##    0.8035714
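As a quick check (our addition, using base R), lm() with an \(x^2\) term should reproduce these coefficients:
# I(x^2) tells lm() to include x^2 as a regressor
lm(y ~ x + I(x^2))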
Notice that our solution is now \(y = 10.3 - 5.1393 x + 0.8036 x^2\). The linear term is now negative, but there is a positive quadratic term. Let’s look at the same plot but with the added \(x^2\) term. We see that the residuals are smaller and, importantly, the model appears to better fit the data.
#plot the original set of points
plot(x,y,pch=19,xlim=c(0,7),ylim=c(0,10), main='the best-fit quadratic function')
# generate points for the fitted line and plot it
tt = seq(0,7,len=100)
lines(tt,xhat[1]+xhat[2]*tt+xhat[3]*tt^2,col='blue')
# get yhat
yhat = A %*% xhat
# add the residuals to the plot
for (i in 1:length(x)) {
lines(c(x[i],x[i]),c(y[i],yhat[i]), col='red')
}
#add yhat to the plot
points(x,yhat,pch=19,col='orange')
#put the original points back on the plot last so we can see them
points(x,y,pch=19,col="black")
grid()
Let’s look again at the residual, check that it is indeed orthogonal to the columns of \(A\), and see that it got shorter.
yhat = A %*% xhat
r = y - yhat
res = cbind(y,yhat,r)
colnames(res) = c("y","yhat","r")
res
##      y     yhat         r
## [1,] 7 5.964286  1.035714
## [2,] 2 3.235714 -1.235714
## [3,] 1 2.114286 -1.114286
## [4,] 3 2.600000  0.400000
## [5,] 7 4.692857  2.307143
## [6,] 7 8.392857 -1.392857
t(A) %*% r
##            [,1]
##   -1.820766e-14
## x -2.486900e-14
##   -1.847411e-13
sqrt(t(r) %*% r)
##          [,1]
## [1,] 3.356231
Indeed, the residual length dropped from about 5.95 for the linear fit to about 3.36 for the quadratic fit.