Section 26 Least Squares Approximation
26.1 Introduction
Let’s start with a summary of Least Squares Approximation.
The Why: Given a matrix \(A\) and a vector \(\mathsf{b}\) that is not in \(W = \mathrm{Col}(A)\), we want to find the “best approximate solution” \(\hat{\mathsf{b}} \in W\). In other words, we want to pick the best possible \(\hat{\mathsf{b}} \approx \mathsf{b}\) that lies in the column space of \(A\).
The What: The answer is to use projections.
- This “best approximation” is the projection \(\hat{\mathsf{b}} = \mbox{proj}_W \mathsf{b}\).
- The residual vector \(\mathsf{r} = \mathsf{b} - \hat{\mathsf{b}}\) is in \(W^{\perp}\).
- The length \(\| \mathsf{r} \|\) of the residual vector measures the closeness of the approximation.
- The approximate solution to our original problem is the vector \(\hat{\mathsf{x}}\) such that \(A \hat{\mathsf{x}} = \hat{\mathsf{b}}\).
The How: A clever way to solve this is to use the normal equations. The best choice for \(\hat{\mathsf{x}}\) satisfies \[ A^{\top} A \hat{\mathsf{x}} = A^{\top} \mathsf{b}. \]
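Where do the normal equations come from? Since the residual \(\mathsf{r} = \mathsf{b} - A\hat{\mathsf{x}}\) must lie in \(W^{\perp} = \mathrm{Col}(A)^{\perp} = \mathrm{Nul}(A^{\top})\), applying \(A^{\top}\) to the residual gives \[ A^{\top}(\mathsf{b} - A\hat{\mathsf{x}}) = \mathsf{0} \quad \Longleftrightarrow \quad A^{\top} A \hat{\mathsf{x}} = A^{\top} \mathsf{b}. \]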
26.2 Example 1
Here is an example that we did by hand in class. We now see how to do it in R.
Find the least-squares solution to \({\mathsf{A}}\mathsf{x} = {\mathsf{b}}\) if \[\begin{equation} {\mathsf{A}}= \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 0 & -1 \\ \end{bmatrix} \quad \mbox{and} \quad {\mathsf{b}}= \begin{bmatrix} 1 \\ 1 \\ 3 \end{bmatrix}. \tag{26.1} \end{equation}\] First, for good measure, let's verify that the system is inconsistent:
A = cbind(c(1,1,0),c(1,2,-1))
b = c(1,1,3)
Ab = cbind(A,b)
Ab
##            b
## [1,] 1  1 1
## [2,] 1  2 1
## [3,] 0 -1 3
rref(Ab)  # rref() is not in base R; it comes from an add-on package such as pracma
##           b
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
True indeed: the reduced row echelon form has a pivot in the last (augmented) column, so the system is inconsistent and \({\mathsf{b}}\not \in \mathrm{Col}({\mathsf{A}})\).
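Another way to check consistency is to compare ranks: the system is consistent exactly when \(\mathrm{rank}(A) = \mathrm{rank}([\,A \mid \mathsf{b}\,])\). A minimal sketch using base R's qr() (this check was not part of the class example):
# consistent iff A and the augmented matrix [A|b] have the same rank
qr(A)$rank == qr(Ab)$rank   # FALSE here: rank 2 vs. rank 3, so inconsistent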
Now we compute the normal equations to see what they look like:
t(A) %*% A
##      [,1] [,2]
## [1,]    2    3
## [2,]    3    6
t(A) %*% b
##      [,1]
## [1,]    2
## [2,]    0
So instead of (26.1) we solve the following normal equations: \[\begin{equation} \begin{bmatrix} 2 & 3 \\ 3 & 6 \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix}. \tag{26.2} \end{equation}\]
We can do this with a single R command:
(xhat = solve(t(A) %*% A, t(A) %*% b))
##      [,1]
## [1,]    4
## [2,]   -2
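As a sanity check (an alternative we did not use in class), base R's qr.solve() finds the least-squares solution of an over-determined system directly via a QR decomposition:
# should agree with xhat above
qr.solve(A, b)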
To compute \(\hat {\mathsf{b}}\) we use
(bhat = A %*% xhat)
##      [,1]
## [1,]    2
## [2,]    0
## [3,]    2
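For the curious, \(\hat{\mathsf{b}}\) can also be computed as \(P\mathsf{b}\), where \(P = A(A^{\top}A)^{-1}A^{\top}\) is the projection matrix onto \(\mathrm{Col}(A)\). A minimal sketch (the name P is ours, for illustration):
# projection matrix onto Col(A); P %*% b reproduces bhat
P = A %*% solve(t(A) %*% A) %*% t(A)
P %*% b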
And to get the residual, we use
(r = b - bhat)
##      [,1]
## [1,]   -1
## [2,]    1
## [3,]    1
The length of the residual measures how far \(\hat{\mathsf{b}}\) is from \(\mathsf{b}\):
sqrt(t(r) %*% r)
##          [,1]
## [1,] 1.732051
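The expression sqrt(t(r) %*% r) is just the Euclidean length of r. Equivalent base R spellings, if you prefer them, are:
# two equivalent ways to compute the length of the residual
sqrt(sum(r^2))
norm(r, type = "2")  # the 2-norm of a one-column matrix is its Euclidean length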
We can also check that the residual is orthogonal to \(Col({\mathsf{A}})\):
t(A) %*% r
##      [,1]
## [1,]    0
## [2,]    0
26.3 Template
The following R code does it all. You can use this as a template for future problems. Just enter the matrix A and the vector b.
# Given: the matrix A
A = cbind(c(1,1,0),c(1,2,-1))
# Given: the target vector b
b = c(1,1,3)

# solve the normal equations
(xhat = solve(t(A) %*% A, t(A) %*% b))
##      [,1]
## [1,]    4
## [2,]   -2
# find the projection
(bhat = A %*% xhat)
##      [,1]
## [1,]    2
## [2,]    0
## [3,]    2
# find the residual vector
(r = b - bhat)
##      [,1]
## [1,]   -1
## [2,]    1
## [3,]    1
# check that r is orthogonal to Col(A)
t(A) %*% r
##      [,1]
## [1,]    0
## [2,]    0
# measure the distance between bhat and b
sqrt( t(r) %*% r)
##          [,1]
## [1,] 1.732051
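If you find yourself doing this repeatedly, you can wrap the template in a function. Here is a minimal sketch (the name least_squares is ours, not part of the notes):
# returns the least-squares solution, the projection, the residual,
# and the length of the residual for a given matrix A and target vector b
least_squares = function(A, b) {
  xhat = solve(t(A) %*% A, t(A) %*% b)
  bhat = A %*% xhat
  r    = b - bhat
  list(xhat = xhat, bhat = bhat, r = r, dist = sqrt(sum(r^2)))
}
least_squares(A, b)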
26.4 Fitting for a Linear Function
Here are some points that we’d like to fit to a linear function \(y = a_0 + a_1 x\).
Note: Here we use y instead of b because we like to write linear equations as “\(y = cx + d\),” so the expression “\(b = a_0 + a_1 x\)” looks funny to us. We will therefore talk about y and yhat instead of b and bhat.
x = c(1,2,3,4,5,6)
y = c(7,2,1,3,7,7)
plot(x,y,pch=19,ylim=c(0,10))
grid()
The linear equations that we want to fit are as follows. \[ \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} 7 \\ 2 \\ 1 \\ 3 \\ 7 \\ 7 \end{bmatrix}. \]
These equations are inconsistent, so we solve the normal equations \(A^{\top} A \hat{\mathsf{x}} = A^{\top} \mathsf{y}\) and find an approximate solution instead. Pro Tip: a clever way to create the desired matrix \(A\) is to use the fact that \(x^0=1\) for any number \(x\).
(A = cbind(x^0,x))
##        x
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 4
## [5,] 1 5
## [6,] 1 6
Let’s take a look at the normal equations:
t(A) %*% A
##       x
##    6 21
## x 21 91
t(A) %*% y
##   [,1]
##     27
## x  103
So the normal equations to solve are below. Notice that even though there are 6 equations (one per data point), we only have to solve a 2×2 system, since there are only 2 unknowns. \[ \begin{bmatrix} 6 & 21 \\ 21 & 91 \\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} 27 \\ 103 \end{bmatrix}. \]
(xhat = solve(t(A) %*% A, t(A) %*% y))
##        [,1]
##   2.8000000
## x 0.4857143
This tells us that the desired intercept is \(a_0 = 2.8\), the desired slope is \(a_1 \approx 0.4857\), and the linear model is \(y = 2.8 + 0.4857x\).
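As a sanity check (not something we need for the hand computation), base R's built-in lm() performs exactly this least-squares fit, so its coefficients should match xhat:
# lm() solves the same least-squares problem
lm(y ~ x)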
We can plot the points together with the solution using:
#plot the original set of points
plot(x,y,pch=19,xlim=c(0,7),ylim=c(0,10), main='the best-fit linear function')
# generate points for the fitted line and plot it
tt = seq(1,6,len=100)
lines(tt,xhat[1]+xhat[2]*tt,col='blue')
# get yhat
yhat = A %*% xhat
# add the residuals to the plot
for (i in 1:length(x)) {
lines(c(x[i],x[i]),c(y[i],yhat[i]), col='red')
}
#add yhat to the plot
points(x,yhat,pch=19,col='orange')
#put the original points back on the plot last so we can see them
points(x,y,pch=19,col="black")
grid()
In this visualization we see the following:
- The black points: the original data points cbind(x,y). These represent the entries of the desired target vector y.
- The blue curve: the fitted line, created from the approximate solution xhat.
- The orange points: the approximations cbind(x,yhat) of the data points cbind(x,y). These represent the entries of the projection yhat.
- The red line segments: the distances between the original data points (black dots) and their approximations (orange dots). The lengths of these red segments are the entries of the residual vector r.
Let’s look at the residual and see that it is indeed orthogonal to the columns of \(A\).
yhat = A %*% xhat
r = y - yhat
res = cbind(y,yhat,r)
colnames(res) = c("y","yhat","r")
res
##      y     yhat         r
## [1,] 7 3.285714  3.714286
## [2,] 2 3.771429 -1.771429
## [3,] 1 4.257143 -3.257143
## [4,] 3 4.742857 -1.742857
## [5,] 7 5.228571  1.771429
## [6,] 7 5.714286  1.285714
t(A) %*% r
##            [,1]
##   -8.881784e-16
## x  3.552714e-15
These entries are on the order of \(10^{-15}\): zero up to floating-point roundoff, so the residual is indeed orthogonal to the columns of \(A\).
sqrt(t(r) %*% r)
##          [,1]
## [1,] 5.947388
26.5 Fitting for a Quadratic Function
The data we have been working with has a quadratic look to it, so let’s try adding an \(x^2\) term. That is, we will fit the model \(y = a_0 + a_1 x + a_2 x^2\). The equations we want to solve are: \[ \begin{bmatrix} 1 & 1 & 1\\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \\ 1 & 6 & 36 \\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 7 \\ 2 \\ 1 \\ 3 \\ 7 \\ 7 \end{bmatrix}. \]
It is easy enough to add this to our matrix \(A\).
(A = cbind(x^0,x,x^2))
##        x
## [1,] 1 1  1
## [2,] 1 2  4
## [3,] 1 3  9
## [4,] 1 4 16
## [5,] 1 5 25
## [6,] 1 6 36
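An aside: to fit a polynomial of any degree \(k\), one convenient way (our suggestion, not from the notes) to build this Vandermonde-style matrix is outer():
# column j holds x^(j-1); this reproduces cbind(x^0, x, x^2)
outer(x, 0:2, "^")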
In this case our normal equations are 3×3:
t(A) %*% A
##        x
##    6  21   91
## x 21  91  441
##   91 441 2275
t(A) %*% y
##   [,1]
##     27
## x  103
##    499
\[ \begin{bmatrix} 6 & 21 & 91 \\ 21 & 91 & 441 \\ 91 & 441 & 2275 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 27 \\ 103 \\ 499 \end{bmatrix}. \] The solution is computed by:
(xhat = solve(t(A) %*% A, t(A) %*% y))
##         [,1]
##   10.3000000
## x -5.1392857
##    0.8035714
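As a quick check (our addition, using base R), lm() with an \(x^2\) term should reproduce these coefficients:
# I(x^2) tells lm() to include x^2 as a regressor
lm(y ~ x + I(x^2))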
Notice that our solution is now \(y = 10.3 - 5.1393 x + 0.8036 x^2\). The linear term is now negative, but there is a positive quadratic term. Let’s look at the same plot but with the added \(x^2\) term. We see that the residuals are smaller and, importantly, the model appears to better fit the data.
#plot the original set of points
plot(x,y,pch=19,xlim=c(0,7),ylim=c(0,10), main='the best-fit quadratic function')
# generate points for the fitted line and plot it
tt = seq(0,7,len=100)
lines(tt,xhat[1]+xhat[2]*tt+xhat[3]*tt^2,col='blue')
# get yhat
yhat = A %*% xhat
# add the residuals to the plot
for (i in 1:length(x)) {
lines(c(x[i],x[i]),c(y[i],yhat[i]), col='red')
}
#add yhat to the plot
points(x,yhat,pch=19,col='orange')
#put the original points back on the plot last so we can see them
points(x,y,pch=19,col="black")
grid()
Let’s look again at the residual, check that it is indeed orthogonal to the columns of \(A\), and see that it got shorter.
yhat = A %*% xhat
r = y - yhat
res = cbind(y,yhat,r)
colnames(res) = c("y","yhat","r")
res
##      y     yhat         r
## [1,] 7 5.964286  1.035714
## [2,] 2 3.235714 -1.235714
## [3,] 1 2.114286 -1.114286
## [4,] 3 2.600000  0.400000
## [5,] 7 4.692857  2.307143
## [6,] 7 8.392857 -1.392857
t(A) %*% r
##            [,1]
##   -1.820766e-14
## x -2.486900e-14
##   -1.847411e-13
sqrt(t(r) %*% r)
##          [,1]
## [1,] 3.356231
Indeed, the residual length dropped from about 5.95 for the linear fit to about 3.36 for the quadratic fit.