Section 20 Least Squares Approximation
20.1 Least Squares
The Why: Given a matrix \(A\) and a vector \(\mathsf{b}\) that is not in \(W = \mathrm{Col}(A)\), the system \(A\mathsf{x} = \mathsf{b}\) has no solution, so we look for the best approximation \(\hat{\mathsf{b}} \in W\) to \(\mathsf{b}\). In other words, we want to pick the best possible \(\hat{\mathsf{b}} \approx \mathsf{b}\) that lies in the column space of \(A\).
The What: The answer is to use projections.
- This “best approximation” is the projection \(\hat{\mathsf{b}} = \mbox{proj}_W \mathsf{b}\).
- The residual vector \(\mathsf{r} = \mathsf{b} - \hat{\mathsf{b}}\) is in \(W^{\perp}\).
- The length \(\| \mathsf{r} \|\) of the residual vector measures the closeness of the approximation.
- The approximate solution to our original problem is the vector \(\hat{\mathsf{x}}\) such that \(A \hat{\mathsf{x}} = \hat{\mathsf{b}}\). In practice we find it by solving the normal equations, derived below.
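Where do the normal equations come from? Since the residual \(\mathsf{r} = \mathsf{b} - A\hat{\mathsf{x}}\) lies in \(W^{\perp} = \mathrm{Col}(A)^{\perp} = \mathrm{Nul}(A^T)\), we must have \[ A^T(\mathsf{b} - A\hat{\mathsf{x}}) = \mathsf{0}, \quad \mbox{that is,} \quad A^T A \, \hat{\mathsf{x}} = A^T \mathsf{b}. \] This square system is called the normal equations; it is always consistent, and solving it produces the least-squares solution \(\hat{\mathsf{x}}\).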
20.2 Example
Find the least-squares solution to \(\mathsf{A}\mathsf{x} = \mathsf{b}\) if \[\begin{equation} {\mathsf{A}}= \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 0 & -1 \\ \end{bmatrix} \quad \mbox{and} \quad \mathsf{b}= \begin{bmatrix} 1 \\ 1 \\ 3 \end{bmatrix}. \tag{20.1} \end{equation}\] First, for good measure, let's check that the system is inconsistent:
library(pracma)   # rref() below (and dot() later) come from the pracma package
A = cbind(c(1,1,0),c(1,2,-1))
b = c(1,1,3)
Ab = cbind(A,b)
Ab
## b
## [1,] 1 1 1
## [2,] 1 2 1
## [3,] 0 -1 3
rref(Ab)
## b
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
True indeed: the RREF has a pivot in the augmented column, so the system is inconsistent and \(\mathsf{b} \notin \mathrm{Col}(\mathsf{A})\).
Now we compute the normal equations to see what they look like:
t(A) %*% A
## [,1] [,2]
## [1,] 2 3
## [2,] 3 6
t(A) %*% b
## [,1]
## [1,] 2
## [2,] 0
So instead of (20.1), we solve the following normal equations: \[\begin{equation} \begin{bmatrix} 2 & 3 \\ 3 & 6 \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix}. \tag{20.2} \end{equation}\]
We can do this with a single R command (wrapping an assignment in parentheses makes R print the result):
(xhat = solve(t(A) %*% A, t(A) %*% b))
## [,1]
## [1,] 4
## [2,] -2
To compute \(\hat{\mathsf{b}}\) we use
(bhat = A %*% xhat)
## [,1]
## [1,] 2
## [2,] 0
## [3,] 2
And to get the residual, we use
(r = b - bhat)
## [,1]
## [1,] -1
## [2,] 1
## [3,] 1
Its length is
sqrt(t(r) %*% r)
## [,1]
## [1,] 1.732051
We can also check that the residual is orthogonal to \(Col({\mathsf{A}})\):
t(A) %*% r
## [,1]
## [1,] 0
## [2,] 0
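As an aside (this shortcut is not used in the notes below), when the columns of \(\mathsf{A}\) are linearly independent we can also compute \(\hat{\mathsf{b}}\) in one step with the projection matrix \(P = A(A^TA)^{-1}A^T\):
# sketch: projection matrix onto Col(A); P %*% b reproduces bhat = (2, 0, 2)
P = A %*% solve(t(A) %*% A) %*% t(A)
P %*% b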
20.3 Template
The following R code does it all. You can use this as a template for future problems. Just enter the matrix A and the vector b.
# Given: the matrix A
A = cbind(c(1,1,0),c(1,2,-1))
# Given: the target vector b
b = c(1,1,3)
# solve the normal equations
(xhat = solve(t(A) %*% A, t(A) %*% b))
# find the projection
(bhat = A %*% xhat)
# find the residual vector
(r = b - bhat)
# check that r is orthogonal to Col(A)
t(A) %*% r
# measure the distance between bhat and b
sqrt(dot(r,r))   # dot() is from pracma; sqrt(t(r) %*% r) also works
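If you find yourself repeating these steps, here is a minimal sketch that wraps the template in a reusable function (the name lsq_approx and the returned list are our own choices, not part of the notes):
# sketch: the whole template as one function
lsq_approx = function(A, b) {
  xhat = solve(t(A) %*% A, t(A) %*% b)   # solve the normal equations
  bhat = A %*% xhat                      # projection of b onto Col(A)
  r = b - bhat                           # residual vector
  list(xhat = xhat, bhat = bhat, r = r, dist = sqrt(sum(r^2)))
}
lsq_approx(A, b)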
20.4 Example from Class: Dec 1
Use the template from the previous section.
# Given: the matrix A
A = cbind(c(1,0,1,2),c(2,1,2,1))
# Given: the target vector b
b = c(3,5,9,9)
# solve the normal equations
(xhat = solve(t(A) %*% A, t(A) %*% b))
## [,1]
## [1,] 3
## [2,] 2
# find the projection
(bhat = A %*% xhat)
## [,1]
## [1,] 7
## [2,] 2
## [3,] 7
## [4,] 8
# find the residual vector
(r = b - bhat)
## [,1]
## [1,] -4
## [2,] 3
## [3,] 2
## [4,] 1
# check that r is orthogonal to Col(A)
t(A) %*% r
## [,1]
## [1,] 0
## [2,] 0
# measure the distance between bhat and b
sqrt(dot(r,r))
## [1] 5.477226
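As a cross-check (not part of the template), base R's qr.solve computes a least-squares solution of an overdetermined system directly from a QR factorization, so it should return the same xhat:
# least squares via QR instead of the normal equations
qr.solve(A, b)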
20.5 Fitting a Linear Function
Here are some points that we’d like to fit to a linear function \(y = a_0 + a_1 x\).
Note: Here we use y instead of b because we like to write linear equations as “\(y = cx + d\).” So the expression “\(b = a_0 + a_1 x\)” looks funny to us, and we will talk about y and yhat instead of b and bhat.
x = c(1,2,3,4,5,6)
y = c(7,2,1,3,7,7)
plot(x,y,pch=19,ylim=c(0,10))
grid()
The linear equations that we want to fit are as follows. \[ \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} 7 \\ 2 \\ 1 \\ 3 \\ 7 \\ 7 \end{bmatrix}. \]
These equations are inconsistent, so we solve the normal equations \(A^T A x = A^T y\) and find an approximate solution instead. Pro Tip: a clever way to create the desired matrix \(A\) is to use the fact that \(x^0=1\) for any number \(x\).
(A = cbind(x^0,x))
## x
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 4
## [5,] 1 5
## [6,] 1 6
Let’s take a look at the normal equations:
t(A) %*% A
## x
## 6 21
## x 21 91
t(A) %*% y
## [,1]
## 27
## x 103
So the normal equations to solve are below. It is remarkable that, even though the original system has 6 equations, we only have to solve a 2x2 system, since there are only 2 unknowns. \[ \begin{bmatrix} 6 & 21 \\ 21 & 91 \\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} 27 \\ 103 \end{bmatrix}. \]
(xhat = solve(t(A) %*% A, t(A) %*% y))
## [,1]
## 2.8000000
## x 0.4857143
This tells us that the desired intercept is \(a_0 = 2.8\), the desired slope is \(a_1 \approx 0.4857\), and the linear model is \(y = 2.8 + 0.4857x\).
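As a sanity check (lm is not used elsewhere in these notes), R's built-in linear-model function fits the same least-squares line and should report the same intercept and slope:
# built-in least-squares fit; coefficients should be 2.8 and 0.4857143
lm(y ~ x)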
We can plot the points together with the solution using:
#plot the original set of points
plot(x,y,pch=19,xlim=c(0,7),ylim=c(0,10), main='the best-fit linear function')
# generate points for the fitted line and plot it
tt = seq(1,6,len=100)
lines(tt,xhat[1]+xhat[2]*tt,col='blue')
# get yhat
yhat = A %*% xhat
# add the residuals to the plot
for (i in 1:length(x)) {
lines(c(x[i],x[i]),c(y[i],yhat[i]), col='red')
}
#add yhat to the plot
points(x,yhat,pch=19,col='orange')
#put the original points back on the plot last so we can see them
points(x,y,pch=19,col="black")
grid()
In this visualization we see the following:
- The black points: the original data points cbind(x,y). These represent the entries of the desired target vector y.
- The blue line: the fitted line, created from the approximate solution xhat.
- The orange points: the approximations cbind(x,yhat) of the data points cbind(x,y). These represent the entries of the projection yhat.
- The red line segments: the distances between the original data points (black dots) and their approximations (orange dots). The lengths of these red segments are the entries of the residual vector r.
Let’s look at the residual and see that it is indeed orthogonal to the columns of \(A\).
yhat = A %*% xhat
r = y - yhat
res = cbind(y,yhat,r)
colnames(res) = c("y","yhat","r")
res
## y yhat r
## [1,] 7 3.285714 3.714286
## [2,] 2 3.771429 -1.771429
## [3,] 1 4.257143 -3.257143
## [4,] 3 4.742857 -1.742857
## [5,] 7 5.228571 1.771429
## [6,] 7 5.714286 1.285714
t(A) %*% r
## [,1]
## -8.881784e-16
## x 3.552714e-15
t(r) %*% r
## [,1]
## [1,] 35.37143
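Note that t(r) %*% r is the squared length of the residual; the length itself is \(\|\mathsf{r}\| = \sqrt{35.37143} \approx 5.947\).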
20.6 Fitting a Quadratic Function
The data we have been working with has a quadratic look to it, so let’s try adding an \(x^2\) term. That is, we will fit the model \(y = a_0 + a_1 x + a_2 x^2\). The equations we want to solve are: \[ \begin{bmatrix} 1 & 1 & 1\\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \\ 1 & 6 & 36 \\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 7 \\ 2 \\ 1 \\ 3 \\ 7 \\ 7 \end{bmatrix}. \]
It is easy enough to add this to our matrix \(A\).
(A = cbind(x^0,x,x^2))
## x
## [1,] 1 1 1
## [2,] 1 2 4
## [3,] 1 3 9
## [4,] 1 4 16
## [5,] 1 5 25
## [6,] 1 6 36
In this case our normal equations are 3x3:
t(A) %*% A
## x
## 6 21 91
## x 21 91 441
## 91 441 2275
t(A) %*% y
## [,1]
## 27
## x 103
## 499
\[ \begin{bmatrix} 6 & 21 & 91 \\ 21 & 91 & 441 \\ 91 & 441 & 2275 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 27 \\ 103 \\ 499 \end{bmatrix}, \] whose solution is computed by
(xhat = solve(t(A) %*% A, t(A) %*% y))
## [,1]
## 10.3000000
## x -5.1392857
## 0.8035714
Notice that our solution is now \(y = 10.3 - 5.1393 x + 0.8036 x^2\). The linear term is now negative, but there is a positive quadratic term. Let’s look at the same plot but with the added \(x^2\) term. We see that the residuals are smaller and, importantly, the model appears to fit the data better.
#plot the original set of points
plot(x,y,pch=19,xlim=c(0,7),ylim=c(0,10), main='the best-fit quadratic function')
# generate points for the fitted curve and plot it
tt = seq(0,7,len=100)
lines(tt,xhat[1]+xhat[2]*tt+xhat[3]*tt^2,col='blue')
# get yhat
yhat = A %*% xhat
# add the residuals to the plot
for (i in 1:length(x)) {
lines(c(x[i],x[i]),c(y[i],yhat[i]), col='red')
}
#add yhat to the plot
points(x,yhat,pch=19,col='orange')
#put the original points back on the plot last so we can see them
points(x,y,pch=19,col="black")
grid()
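Before moving on, the same kind of sanity check (not in the original notes) with R's built-in lm; the I(x^2) wrapper tells R to include \(x^2\) as a predictor:
# built-in quadratic fit; coefficients should match xhat
lm(y ~ x + I(x^2))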
Let’s look again at the residual: check that it is indeed orthogonal to the columns of \(A\), and notice that it got shorter.
yhat = A %*% xhat
r = y - yhat
res = cbind(y,yhat,r)
colnames(res) = c("y","yhat","r")
res
## y yhat r
## [1,] 7 5.964286 1.035714
## [2,] 2 3.235714 -1.235714
## [3,] 1 2.114286 -1.114286
## [4,] 3 2.600000 0.400000
## [5,] 7 4.692857 2.307143
## [6,] 7 8.392857 -1.392857
t(A) %*% r
## [,1]
## -1.820766e-14
## x -2.486900e-14
## -1.847411e-13
t(r) %*% r
## [,1]
## [1,] 11.26429
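Comparing the two fits: the squared residual length dropped from 35.37 for the line to 11.26 for the parabola, i.e. \(\|\mathsf{r}\|\) dropped from about 5.95 to about 3.36.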
20.7 Checkpoint 6.5
Here we have 4 data points:
xdata = c(1,2,3,4)
ydata = c(6,10,12,11)
plot(xdata,ydata,xlim=c(0,5),ylim=c(0,15),xlab='x',ylab='y',col='red',pch=19)
20.7.1 Linear Fit
Fit a line to the data: \(a + b x\)
A = cbind(xdata^0,xdata^1)
t(A)%*% A
## [,1] [,2]
## [1,] 4 10
## [2,] 10 30
t(A)%*% ydata
## [,1]
## [1,] 39
## [2,] 106
(xhat = solve(t(A)%*%A,t(A)%*%ydata))
## [,1]
## [1,] 5.5
## [2,] 1.7
And plot it:
t = seq(0,6,by=.01)
plot(t,xhat[1] + xhat[2]*t,col='blue',xlim=c(0,5),ylim=c(0,15),xlab='x',ylab='y',type='l')
points(xdata,ydata,col='red',pch=19)
Compute the residual:
(yhat = A %*% xhat)
## [,1]
## [1,] 7.2
## [2,] 8.9
## [3,] 10.6
## [4,] 12.3
(r = ydata - yhat)
## [,1]
## [1,] -1.2
## [2,] 1.1
## [3,] 1.4
## [4,] -1.3
t(A) %*% r
## [,1]
## [1,] 7.105427e-15
## [2,] 3.552714e-15
sqrt(t(r) %*% r)
## [,1]
## [1,] 2.50998
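Optionally (not part of the checkpoint), lm should reproduce the intercept 5.5 and slope 1.7:
# built-in least-squares line for the checkpoint data
lm(ydata ~ xdata)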
20.7.2 Quadratic Fit
Fit a parabola to the data: \(a + b x + c x^2\)
A = cbind(xdata^0,xdata^1,xdata^2)
t(A)%*% A
## [,1] [,2] [,3]
## [1,] 4 10 30
## [2,] 10 30 100
## [3,] 30 100 354
t(A)%*% ydata
## [,1]
## [1,] 39
## [2,] 106
## [3,] 330
(xhat = solve(t(A)%*%A,t(A)%*%ydata))
## [,1]
## [1,] -0.75
## [2,] 7.95
## [3,] -1.25
And plot it:
t = seq(0,6,by=.01)
plot(t,xhat[1] + xhat[2]*t+ xhat[3]*t^2,col='blue',xlim=c(0,5),ylim=c(0,15),xlab='x',ylab='y',type='l')
points(xdata,ydata,col='red',pch=19)
Compute the residual:
(yhat = A %*% xhat)
## [,1]
## [1,] 5.95
## [2,] 10.15
## [3,] 11.85
## [4,] 11.05
(r = ydata - yhat)
## [,1]
## [1,] 0.05
## [2,] -0.15
## [3,] 0.15
## [4,] -0.05
t(A) %*% r
## [,1]
## [1,] 4.440892e-15
## [2,] 5.062617e-14
## [3,] 3.286260e-14
sqrt(t(r) %*% r)
## [,1]
## [1,] 0.2236068
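Note how much better the parabola does: the residual length dropped from about 2.51 for the line to about 0.22, so the quadratic model fits these four points much more closely.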
And add the residuals to the plot:
t = seq(0,6,by=.01)
plot(t,xhat[1] + xhat[2]*t + xhat[3]*t^2,col='blue',xlim=c(0,5),ylim=c(0,15),xlab='x',ylab='y',type='l')
for (i in 1:length(xdata)){lines(c(xdata[i],xdata[i]),c(ydata[i],yhat[i]),type='l')}
points(xdata,ydata,col='red',pch=19)
points(xdata,yhat,pch=16)