Matrix Decompositions¶

variationalform https://variationalform.github.io/¶

Just Enough: progress at pace¶

https://variationalform.github.io/

https://github.com/variationalform

Simon Shaw https://www.brunel.ac.uk/people/simon-shaw.

This work is licensed under CC BY-SA 4.0 (Attribution-ShareAlike 4.0 International)

Visit http://creativecommons.org/licenses/by-sa/4.0/ to see the terms.

This document uses python and also makes use of LaTeX in Markdown

What this is about:¶

You will be introduced to ...

  • Methods for decomposing matrices into alternative forms.
  • Eigenvalues and Eigenvectors of square matrices.
  • The Singular Value Decomposition (SVD) for non-square matrices.
  • Using numpy to compute these decompositions.

As usual our emphasis will be on doing rather than proving: just enough: progress at pace

Assigned Reading¶

For this worksheet you should read Chapter 7 of [FCLA], and Chapter 4 of [MML].

  • MML: Mathematics for Machine Learning, by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Cambridge University Press. https://mml-book.github.io.
  • FCLA: A First Course in Linear Algebra, by Ken Kuttler, https://math.libretexts.org/Bookshelves/Linear_Algebra/A_First_Course_in_Linear_Algebra_(Kuttler)

All of the above can be accessed legally and without cost.

There are also these useful references for coding:

  • PT: python: https://docs.python.org/3/tutorial
  • NP: numpy: https://numpy.org/doc/stable/user/quickstart.html
  • MPL: matplotlib: https://matplotlib.org

Eigenvalues and Eigenvectors¶

Consider this matrix and vector,

$$ \boldsymbol{B} = \left(\begin{array}{rrr} 3 & -2 & 4 \\ -6 & 6 & -11 \\ 6 & 2 & 5 \\ \end{array}\right) \qquad\text{ and }\qquad \boldsymbol{x} = \left(\begin{array}{r} -2 \\ 3 \\ 1 \\ \end{array}\right) \qquad\text{ then }\qquad \boldsymbol{B}\boldsymbol{x} = \left(\begin{array}{r} -8 \\ 19 \\ -1 \\ \end{array}\right) $$

and notice that $\boldsymbol{B}$ mixes up $\boldsymbol{x}$ to such an extent that $\boldsymbol{B}\boldsymbol{x}$ bears no relationship to the original $\boldsymbol{x}$. This isn't so surprising when you think about it.

We can do this in python using numpy as follows...

In [1]:
import numpy as np
B = np.array( [[3, -2, 4],[-6, 6, -11],[ 6, 2, 5 ]])
x = np.array([[-2], [3], [1]])
f = B.dot(x)
print('f = \n', f) 
f = 
 [[-8]
 [19]
 [-1]]

On the other hand, consider this matrix with a different vector,

$$ \boldsymbol{B} = \left(\begin{array}{rrr} 3 & -2 & 4 \\ -6 & 6 & -11 \\ 6 & 2 & 5 \\ \end{array}\right) \qquad\text{ and }\qquad \boldsymbol{w} = \left(\begin{array}{r} -2 \\ 5 \\ 2 \\ \end{array}\right). $$

This time,

$$ \boldsymbol{B}\boldsymbol{w} = \left(\begin{array}{rrr} 3 & -2 & 4 \\ -6 & 6 & -11 \\ 6 & 2 & 5 \\ \end{array}\right) \left(\begin{array}{r} -2 \\ 5 \\ 2 \\ \end{array}\right) = \left(\begin{array}{r} -8 \\ 20 \\ 8 \\ \end{array}\right) = 4\left(\begin{array}{r} -2 \\ 5 \\ 2 \\ \end{array}\right) = 4\boldsymbol{w} $$

So $\boldsymbol{B}\boldsymbol{w}=4\boldsymbol{w}$, and all $\boldsymbol{B}$ does is magnify $\boldsymbol{w}$ to be $4$ times longer.


Exercise:¶

Use python to show that $\frac{1}{4}\boldsymbol{B}\boldsymbol{w}-\boldsymbol{w}=\boldsymbol{0}$.

In [2]:
B = np.array( [[3, -2, 4],[-6, 6, -11],[ 6, 2, 5 ]])
w = np.array([[-2], [5], [2]])
f = 0.25*B.dot(w)
print('f - w = \n', f-w) 
# or many other variants, such as
print('result = \n', 0.25*B.dot(w)-w) 
f - w = 
 [[0.]
 [0.]
 [0.]]
result = 
 [[0.]
 [0.]
 [0.]]
$$ \boldsymbol{B}\boldsymbol{w} = \left(\begin{array}{rrr} 3 & -2 & 4 \\ -6 & 6 & -11 \\ 6 & 2 & 5 \\ \end{array}\right) \left(\begin{array}{r} -2 \\ 5 \\ 2 \\ \end{array}\right) = \left(\begin{array}{r} -8 \\ 20 \\ 8 \\ \end{array}\right) = 4\left(\begin{array}{r} -2 \\ 5 \\ 2 \\ \end{array}\right) = 4\boldsymbol{w} $$

This seems more surprising. Here $4$ is called an eigenvalue ('own value') of $\boldsymbol{B}$ and $\boldsymbol{w}$ is the corresponding eigenvector.

In general, for any square matrix $\boldsymbol{B}$ the problem of finding scalars $\lambda$ and vectors $\boldsymbol{v}$ such that

$$ \boldsymbol{B}\boldsymbol{v}=\lambda\boldsymbol{v} $$

is called an eigenvalue problem. The following facts are known to be true:

The Eigenvalue Theorem. Every square matrix of dimension $n$ has $n$ eigenvalue-eigenvector pairs, $(\lambda_1,\boldsymbol{v}_1), (\lambda_2,\boldsymbol{v}_2),\ldots, (\lambda_n,\boldsymbol{v}_n)$. The eigenvalues need not be distinct, and the eigenvector lengths are arbitrary. A matrix which has one or more zero eigenvalues is not invertible. On the other hand, if the eigenvalues of a matrix are all non-zero then that matrix is invertible, and it has full rank. The determinant of a matrix is the product of its eigenvalues.

NOTE: if $\boldsymbol{B}\boldsymbol{v}=\lambda\boldsymbol{v}$ then it can be shown that we need $\det(\boldsymbol{B}-\lambda\boldsymbol{I})=0$. This is not practically useful, but is of central importance for theory.
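As a quick numerical sanity check of that fact, here is a minimal sketch (re-using the matrix $\boldsymbol{B}$ and the eigenvalue $\lambda=4$ found above): the determinant of $\boldsymbol{B}-4\boldsymbol{I}$ should vanish.

import numpy as np
B = np.array([[3, -2, 4], [-6, 6, -11], [6, 2, 5]])
# det(B - lambda*I) should be zero for the eigenvalue lambda = 4
print(np.linalg.det(B - 4*np.eye(3)))   # expect (numerically) zero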

Example 1¶

For $\boldsymbol{B}$ above we have that

$$ \boldsymbol{B}\boldsymbol{v}=\lambda\boldsymbol{v} $$

for $(\lambda_1,\boldsymbol{v}_1), (\lambda_2,\boldsymbol{v}_2), (\lambda_3,\boldsymbol{v}_3)$ given by

$$ 9 \text{ with } \left(\begin{array}{r} 17 \\ -45 \\ 3 \\ \end{array}\right), \qquad\qquad 1 \text{ with } \left(\begin{array}{r} -1 \\ 1 \\ 1 \end{array}\right) \qquad\text{ and }\qquad 4 \text{ with } \left(\begin{array}{r} -2 \\ 5 \\ 2 \end{array}\right). $$

We have already seen the case $\lambda = 4$ and $\boldsymbol{v}=(-2,5,2)^T$ above.

The eigenvectors are not unique - they can be multiplied by an arbitrary (non-zero) scalar and they remain eigenvectors. For that reason it is usual to normalize an eigenvector by dividing through by its length, as given by the (Euclidean, Pythagorean) $2$-norm, $\Vert\cdot\Vert_2$. For example, for $\boldsymbol{v}=(-2,5,2)^T$ we have

$$ \Vert\boldsymbol{v}\Vert_2 = \sqrt{(-2)^2+5^2+2^2} = \surd{33} $$

which means that

$$ \boldsymbol{v} = \frac{1}{\sqrt{33}}\left(\begin{array}{r} -2 \\ 5 \\ 2 \end{array}\right) = \left(\begin{array}{r} -0.348155311911396 \\ 0.870388279778489 \\ 0.348155311911395 \end{array}\right) $$

is also an eigenvector for the eigenvalue $\lambda = 4$.
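Here is a minimal sketch of that normalization in numpy (re-using B and w from the cells above), checking that the unit-length vector is still an eigenvector for $\lambda=4$:

# normalize w by its 2-norm and check it remains an eigenvector for lambda = 4
v = w / np.linalg.norm(w)
print('v = \n', v)
print('B v - 4 v = \n', B.dot(v) - 4*v)   # expect (numerically) zero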

THINK ABOUT: An eigenpair satisfies $\boldsymbol{B}\boldsymbol{v}=\lambda\boldsymbol{v}$. Choose a non-zero real number $\alpha$ and write $\boldsymbol{w}=\alpha\boldsymbol{v}$. Is it true that $\boldsymbol{B}\boldsymbol{w}=\lambda\boldsymbol{w}$? Can you see why eigenvectors are not unique in length?

THINK ABOUT: If we choose $\alpha = \Vert\boldsymbol{v}\Vert_2^{-1}$ above what can you say about the value of $\Vert\boldsymbol{w}\Vert_2$?

If we normalize each of the eigenvectors above by its own length we get something like this (the decimals may go on for ever - why?):

$$ 9 \text{ with } \left(\begin{array}{r} 0.3527\ldots \\ -0.9336\ldots \\ 0.0622\ldots \\ \end{array}\right), \qquad\qquad 1 \text{ with } \left(\begin{array}{r} -0.5773\ldots \\ 0.5773\ldots \\ 0.5773\ldots \end{array}\right) \qquad\text{ and }\qquad 4 \text{ with } \left(\begin{array}{r} -0.3481\ldots \\ 0.8703\ldots \\ 0.3481\ldots \end{array}\right). $$

Let's use numpy to calculate the eigensystem for $\boldsymbol{B}$. It goes like this...

In [3]:
w, V = np.linalg.eig(B)
print(w)
print(V)
[9. 1. 4.]
[[ 0.35271531 -0.57735027 -0.34815531]
 [-0.93365819  0.57735027  0.87038828]
 [ 0.06224388  0.57735027  0.34815531]]

We can see that two quantities are returned, w and V. The eigenvalues are collected in w and the corresponding eigenvectors are the columns of V.

Note that these columns of V agree with our calculations above for the normalized eigenvectors.

The Eigen-System¶

As indicated by this computation, we can stack the length-normalized eigenvectors next to each other as the columns of a matrix, with the eigenvalues on the leading diagonal of an otherwise zero matrix. We make sure that the eigenvectors appear in the same order as the corresponding eigenvalues and then use the fact that $\boldsymbol{B}\boldsymbol{v}=\lambda \boldsymbol{v}$ for each eigen-pair. The entire eigen-system can then be represented in one equation. For example,

$$ \begin{split} \left(\begin{array}{rrr} 3 & -2 & 4 \\ -6 & 6 & -11 \\ 6 & 2 & 5 \\ \end{array}\right) \left(\begin{array}{rrr} 0.3527\ldots & -0.5773\ldots & -0.3481\ldots \\ -0.9336\ldots & 0.5773\ldots & 0.8703\ldots \\ 0.0622\ldots & 0.5773\ldots & 0.3481\ldots \\ \end{array}\right) \\\ \\ \qquad {}= \left(\begin{array}{rrr} 0.3527\ldots & -0.5773\ldots & -0.3481\ldots \\ -0.9336\ldots & 0.5773\ldots & 0.8703\ldots \\ 0.0622\ldots & 0.5773\ldots & 0.3481\ldots \\ \end{array}\right) \left(\begin{array}{rrr} 9 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 4 \\ \end{array}\right) \end{split} $$

We can write this as

$$ \boldsymbol{B}\boldsymbol{V} = \boldsymbol{V}\boldsymbol{D} $$

where the columns of $\boldsymbol{V}$ are the eigenvectors of $\boldsymbol{B}$, and the diagonal matrix $\boldsymbol{D}$ has the eigenvalues on the leading diagonal. The left-to-right order of the eigenvalues in $\boldsymbol{D}$ matches the order that the eigenvectors appear in $\boldsymbol{V}$.

All three of these matrices are square and they each have the same dimension as $\boldsymbol{B}$.

Now, suppose that $\det(\boldsymbol{V})\ne 0$, then $\boldsymbol{V}^{-1}$ exists and we can (pre-)multiply both sides of $\boldsymbol{B}\boldsymbol{V} = \boldsymbol{V}\boldsymbol{D}$ by $\boldsymbol{V}^{-1}$ and get,

$$ \boldsymbol{V}^{-1}\boldsymbol{B}\boldsymbol{V} = \boldsymbol{V}^{-1}\boldsymbol{V}\boldsymbol{D} = \boldsymbol{D}. $$

We see that this has produced a diagonal matrix $\boldsymbol{V}^{-1}\boldsymbol{B}\boldsymbol{V}$ that is similar to $\boldsymbol{B}$ in the sense that it has the same eigenvalues (why are they the same? What can you say about the eigenvalues of diagonal matrices?). Such an operation is called a similarity transformation.
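Here is a minimal numerical sketch of that similarity transformation, re-using the matrix $\boldsymbol{B}$ from above:

# similarity transformation: V^{-1} B V should reproduce the diagonal matrix D
w, V = np.linalg.eig(B)     # eigenvalues in w, eigenvectors in the columns of V
D = np.diag(w)              # eigenvalues placed on the leading diagonal
Vinv = np.linalg.inv(V)
print(Vinv @ B @ V)         # expect D, up to rounding error
print(D)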

On the other hand, we can (post-)multiply both sides of $\boldsymbol{B}\boldsymbol{V} = \boldsymbol{V}\boldsymbol{D}$ by $\boldsymbol{V}^{-1}$ and get,

$$ \boldsymbol{B} = \boldsymbol{B}\boldsymbol{V}\boldsymbol{V}^{-1} = \boldsymbol{V}\boldsymbol{D}\boldsymbol{V}^{-1}. $$

One reason why this is useful is that we can now easily raise $\boldsymbol{B}$ to powers. For example,

$$ \boldsymbol{B}^2 = (\boldsymbol{V}\boldsymbol{D}\boldsymbol{V}^{-1}) (\boldsymbol{V}\boldsymbol{D}\boldsymbol{V}^{-1}) =\boldsymbol{V}\boldsymbol{D}^2\boldsymbol{V}^{-1}. $$

and so on.
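As a quick check of the squaring identity (a sketch, re-using B, V, D and Vinv from the cell above):

# B^2 should equal V D^2 V^{-1}; for a diagonal D, D**2 squares the diagonal entries
print(B @ B)
print(V @ D**2 @ Vinv)      # expect the same matrix, up to rounding error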

However, to do this we needed to assume that $\det(\boldsymbol{V})\ne 0$, and this need not be the case. Matrices for which this is true are called diagonalizable, and matrices for which it isn't true are called defective.

Eigen-systems of Symmetric Matrices¶

The eigenvalues and eigenvectors of a general square matrix could be complex numbers (which we aren't going to be too concerned with), and for large matrices the inverse $\boldsymbol{V}^{-1}$ could be hard to find explicitly.

However, in the special case of a symmetric matrix $\boldsymbol{A}$, the eigensystem $\boldsymbol{A}\boldsymbol{v}=\lambda\boldsymbol{v}$ is made up exclusively of real numbers.

Furthermore, the matrix of normalized eigenvectors, $\boldsymbol{V}$, is an orthogonal matrix. This means that $\boldsymbol{V}^{-1}=\boldsymbol{V}^T$ - which is very easy to calculate once $\boldsymbol{V}$ is known.

This is called the Spectral Theorem - see [MML, Theorem 4.15]

Spectral Theorem (for matrices) If $\boldsymbol{A}$ is real and symmetric (hence square) then its eigenvalues are all real and its eigenvector matrix $\boldsymbol{V}$ can be taken as orthogonal so that $\boldsymbol{V}^{-1}=\boldsymbol{V}^T$.

Let's see an example of this.

$$ \text{if } \boldsymbol{A} = \left(\begin{array}{rrr} 3 & -2 & 4 \\ -2 & 6 & 2 \\ 4 & 2 & 5 \end{array}\right) $$

then (with some rounding),

$$ \boldsymbol{D} \approx \left( \begin{array}{rrr} -1.217 & 0 & 0 \\ 0 & 8.217 & 0 \\ 0 & 0 & 7 \end{array} \right) \quad\text{ and }\quad \boldsymbol{V} \approx \left( \begin{array}{rrr} .726 & .522 & -.447 \\ .363 & .261 & .894 \\ -.584 & .812 & 0 \\ \end{array} \right) $$

We'll take a look at how to do that in the lab.

We will verify that $\boldsymbol{V}^T\boldsymbol{V}=\boldsymbol{I}$ up to rounding error.

We'll also confirm that $\boldsymbol{A}\boldsymbol{V}=\boldsymbol{V}\boldsymbol{D}$.
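If you want to try it straight away, a minimal sketch looks like this (we will repeat it more carefully in the lab):

# eigen-system of the symmetric matrix A, with two checks from the Spectral Theorem
A = np.array([[3, -2, 4], [-2, 6, 2], [4, 2, 5]])
w, V = np.linalg.eig(A)
D = np.diag(w)
print(V.T @ V)        # expect the identity matrix, up to rounding error
print(A @ V - V @ D)  # expect the zero matrix, up to rounding error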

The Eigen-Decomposition¶

Next, by post-multiplying by $\boldsymbol{V}^{-1}$ we note that we can also write $\boldsymbol{A}\boldsymbol{V}=\boldsymbol{V}\boldsymbol{D}$ as $\boldsymbol{A}=\boldsymbol{V}\boldsymbol{D}\boldsymbol{V}^T$ which, in expanded form, is,

$$ \boldsymbol{A} = (\boldsymbol{v}_1\ \boldsymbol{v}_2\ \boldsymbol{v}_3\ \ldots) \left(\begin{array}{rrrr} \lambda_1 \\ & \lambda_2 \\ & & \lambda_3 \\ & & & \ \ \ddots \end{array}\right) \left(\begin{array}{rrrr} \boldsymbol{v}_1^T\\ \boldsymbol{v}_2^T\\ \boldsymbol{v}_3^T\\ \vdots \end{array}\right) $$

(we haven't shown all the zero elements). Then simplifying this we get

$$ \boldsymbol{A} = (\boldsymbol{v}_1\ \boldsymbol{v}_2\ \boldsymbol{v}_3\ \ldots) \left(\begin{array}{cccc} \lambda_1\boldsymbol{v}_1^T \\ \lambda_2\boldsymbol{v}_2^T \\ \lambda_3\boldsymbol{v}_3^T \\ \vdots \end{array}\right) = \lambda_1\boldsymbol{v}_1\boldsymbol{v}_1^T + \lambda_2\boldsymbol{v}_2\boldsymbol{v}_2^T + \lambda_3\boldsymbol{v}_3\boldsymbol{v}_3^T + \cdots = \sum_{k=1}^n \lambda_k\boldsymbol{v}_k\boldsymbol{v}_k^T. $$

Let's see this in python for a specific example. We start by getting the eigen-system for $\boldsymbol{A}$ as above.

In [4]:
A = np.array([[3,-2,4],[-2,6,2],[4,2,5]])
w, V = np.linalg.eig(A)
D=np.diag(w)
print('D = \n', D)
D = 
 [[-1.21699057  0.          0.        ]
 [ 0.          8.21699057  0.        ]
 [ 0.          0.          7.        ]]

Let's look at each decomposed term in turn. First, for the $k=1$ term,

In [5]:
print( D[0,0]*V[:,0:1]*V[:,0:1].T )
[[-0.64159712 -0.32079856  0.51600297]
 [-0.32079856 -0.16039928  0.25800148]
 [ 0.51600297  0.25800148 -0.41499417]]

Let's think about what is going on here. We are printing out the quantity

D[0,0]*V[:,0:1]*V[:,0:1].T

In this, the D[0,0] factor is just the eigenvalue from the top left (first row, first column) of the $\boldsymbol{D}$ matrix.

Then, next, V[:,0:1] is a numpy slice.

In this type of expression [c,a:b] means take the elements in row c that occupy columns a through to b-1. The expression [:,a:b] means take all of the rows. Remember that column and row numbering starts at zero in numpy.

So, V[:,0:1] says take the first column of V - a column vector, and V[:,0:1].T says take the transpose of the first column of V.

It is important to note that we have to write our slicing expressions in the form V[:,0:1] rather than V[:,0], otherwise we lose the shape. See e.g. https://stackoverflow.com/questions/29635501/row-vs-column-vector-return-during-numpy-array-slicing

Here is a demo:

In [6]:
print('V = \n', V)
print('V[:,0:1] = \n', V[:,0:1])
print('V[:,0] = \n', V[:,0])
V = 
 [[ 7.26085219e-01  5.22302838e-01 -4.47213595e-01]
 [ 3.63042610e-01  2.61151419e-01  8.94427191e-01]
 [-5.83952325e-01  8.11787954e-01  2.85088132e-16]]
V[:,0:1] = 
 [[ 0.72608522]
 [ 0.36304261]
 [-0.58395233]]
V[:,0] = 
 [ 0.72608522  0.36304261 -0.58395233]

Therefore, for $k=1$ we can write $\lambda_k\boldsymbol{v}_k\boldsymbol{v}_k^T$ in code as D[0,0]*V[:,0:1]*V[:,0:1].T.

The full reconstruction of $\boldsymbol{A}$ is then...

In [7]:
print('\n The reconstruction of A ...')
print(D[0,0]*V[:,0:1]*V[:,0:1].T + D[1,1]*V[:,1:2]*V[:,1:2].T + D[2,2]*V[:,2:3]*V[:,2:3].T)
 The reconstruction of A ...
[[ 3. -2.  4.]
 [-2.  6.  2.]
 [ 4.  2.  5.]]

Implication: approximation of matrices¶

This is quite a big deal: it means, in effect, that we can break a square symmetric matrix into pieces and consider as many or as few of those pieces as we wish.

This point of view really suggests an approximation scheme. If $\boldsymbol{A}$ is $n\times n$ and symmetric then,

$$ \boldsymbol{A} = \sum_{k=1}^n \lambda_k\boldsymbol{v}_k\boldsymbol{v}_k^T \qquad\text{ which suggests that }\qquad \boldsymbol{A} \approx \sum_{k=1}^m \lambda_k\boldsymbol{v}_k\boldsymbol{v}_k^T $$

for $m < n$. We would want to sort the eigenvalues so that the most dominant ones come first in this sum, which is the same as saying

$$ \vert\lambda_1\vert \ge \vert\lambda_2\vert \ge \vert\lambda_3\vert \ge \cdots $$

We'll look more closely at this in the workshop session where we will see the effect on the amount of data storage we can save.

Building on these examples we can infer that if a large matrix can be well approximated by just a few terms in the eigen-expansion then the amount of storage required in computer memory can be vastly reduced.
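As a preview of the workshop, here is a minimal sketch of the idea for the symmetric matrix $\boldsymbol{A}$ above, keeping only the $m=2$ eigenvalues of largest magnitude:

# truncated eigen-expansion: keep only the m eigenvalues of largest magnitude
w, V = np.linalg.eig(A)
order = np.argsort(-np.abs(w))     # indices sorted by decreasing |lambda|
m = 2
A_approx = np.zeros_like(A, dtype=float)
for k in order[:m]:
    A_approx += w[k] * V[:, k:k+1] * V[:, k:k+1].T
print('A approximated with', m, 'terms:\n', A_approx)
print('error:\n', A - A_approx)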

This is very useful. But it only applies to symmetric square matrices.

We can extend it to non-symmetric matrices by introducing complex numbers but we can't extend this to non-square matrices because they don't have eigenvalues.

Fortunately, there is another - even more powerful - tool at our disposal.

SVD: The Singular Value Decomposition¶

Only square matrices have eigenvalues, and not all square matrices are diagonalizable via the similarity transform. For general matrices (containing real numbers for us) there is a tool called the SVD - the Singular Value Decomposition.

Let $\boldsymbol{K}$ be an $n$-row by $m$-column matrix of real numbers.

Then $\boldsymbol{K} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T$ - this is called the Singular Value Decomposition of $\boldsymbol{K}$. In this:

  • $\boldsymbol{U}$ is an $n\times n$ orthogonal square matrix
  • $\boldsymbol{\Sigma}$ is an $n\times m$ rectangular diagonal matrix
  • $\boldsymbol{V}^T$ is an $m\times m$ orthogonal square matrix

The entries on the diagonal of $\boldsymbol{\Sigma}$ are called the singular values of $\boldsymbol{K}$ and the number of non-zero singular values gives the rank of $\boldsymbol{K}$.

The columns of $\boldsymbol{U}$ (resp. $\boldsymbol{V}$) are called the left (resp. right) singular vectors of $\boldsymbol{K}$.

Let's see an example of $\boldsymbol{K} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T$.

$$ \text{If } \boldsymbol{K}=\left(\begin{array}{rrr} 1 & 2 & 5 \\ 5 & -6 & 1 \\ \end{array}\right) \text{ then } \boldsymbol{U}=\left(\begin{array}{rr} -0.06213\ldots & 0.99806\ldots \\ 0.99806\ldots & 0.06213\ldots \\ \end{array}\right), $$$$ \boldsymbol{\Sigma}=\left(\begin{array}{rrr} 7.88191\ldots & 0 & 0 \\ 0 & 5.46584\ldots & 0 \\ \end{array}\right) \text{ and } \boldsymbol{V}=\left(\begin{array}{rrr} 0.62525\ldots & 0.23944\ldots &-0.74278\ldots \\ -0.77553\ldots & 0.29699\ldots &-0.55708\ldots \\ 0.08720\ldots & 0.92437\ldots & 0.37139\ldots \\ \end{array}\right) $$

If we use these we can indeed check that

\begin{align*} & \scriptsize \left(\begin{array}{rr} -0.062\ldots & 0.998\ldots \\ 0.998\ldots & 0.062\ldots \\ \end{array}\right) \left(\begin{array}{rrr} 7.881\ldots & 0 & 0 \\ 0 & 5.465\ldots & 0 \\ \end{array}\right) \left(\begin{array}{rrr} 0.625\ldots & 0.239\ldots &-0.742\ldots \\ -0.775\ldots & 0.296\ldots &-0.557\ldots \\ 0.087\ldots & 0.924\ldots & 0.371\ldots \\ \end{array}\right)^T \\ &\qquad{} = \left(\begin{array}{rrr} 1 & 2 & 5 \\ 5 & -6 & 1 \\ \end{array}\right) \end{align*}

as required. We aren't going to do it by hand though, that's what computers are for!

In [8]:
K = np.array([[1,2,5],[5,-6,1]])
U, S, VT = np.linalg.svd(K)
print(U)
print(S)
print(VT)
[[-0.06213744  0.9980676 ]
 [ 0.9980676   0.06213744]]
[7.88191065 5.4658471 ]
[[ 0.62525456 -0.77553283  0.08720987]
 [ 0.23944227  0.29699158  0.9243719 ]
 [-0.74278135 -0.55708601  0.37139068]]

Note two things:

  • np.linalg.svd returns $\boldsymbol{V}^T$, not $\boldsymbol{V}$.
  • The shape of S doesn't agree with $\boldsymbol{\Sigma}$.

We'll need to pad S - and then we can check the reconstruction $\boldsymbol{K} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T$.

The padding is a bit awkward - here it is...

In [9]:
S = np.hstack(( np.diag(S), np.zeros((2,1)) ))
print(S)
[[7.88191065 0.         0.        ]
 [0.         5.4658471  0.        ]]

Now we can check the reconstruction $\boldsymbol{K} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T$.

Note that we can also use @ to perform matrix multiplication.

In [10]:
print(K - U @ S @ VT)
[[-4.44089210e-16 -4.44089210e-16 -1.77635684e-15]
 [-8.88178420e-16  1.77635684e-15  1.11022302e-16]]

This is zero (to machine precision) as expected.

There is a great deal that can be said about the SVD, but we're going to stay narrowly focussed and explore its value in data science and machine learning.

Let's start with the following observation. If we denote the $n$-th column of $\boldsymbol{U}$ by $\boldsymbol{u}_n$, and the $n$-th column of $\boldsymbol{V}$ by $\boldsymbol{v}_n$, then the statement $\boldsymbol{K} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T$ becomes (we saw something very similar to this above with eigenvalues),

$$ \boldsymbol{K} = (\boldsymbol{u}_1\ \boldsymbol{u}_2\ \cdots\ \boldsymbol{u}_n) \left(\begin{array}{rrrr} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_n \end{array}\right) \left(\begin{array}{r} \boldsymbol{v}_1^T \\ \boldsymbol{v}_2^T \\ \vdots\ \\ \boldsymbol{v}_n^T \\ \end{array}\right) = (\boldsymbol{u}_1\ \boldsymbol{u}_2\ \cdots\ \boldsymbol{u}_n) \left(\begin{array}{r} \sigma_1\boldsymbol{v}_1^T \\ \sigma_2\boldsymbol{v}_2^T \\ \vdots\ \\ \sigma_n\boldsymbol{v}_n^T \\ \end{array}\right) $$

(we haven't shown all the zero elements of $\boldsymbol{\Sigma}$). Simplifying this further then gives,

$$ \boldsymbol{K} = \sigma_1\boldsymbol{u}_1\boldsymbol{v}_1^T + \sigma_2\boldsymbol{u}_2\boldsymbol{v}_2^T + \cdots + \sigma_n\boldsymbol{u}_n\boldsymbol{v}_n^T $$

THINK ABOUT: $\boldsymbol{u}\boldsymbol{v}^T$ is a rank 1 matrix - why?
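Here is a minimal sketch of that sum of rank-one pieces, re-using K, U, the padded S and VT from the cells above:

# rebuild K as a sum of rank-1 terms sigma_j * u_j * v_j^T
K_sum = np.zeros_like(K, dtype=float)
for j in range(2):                 # p = min(m, n) = 2 terms for our 2 x 3 matrix
    u_j  = U[:, j:j+1]             # j-th left singular vector (a column)
    v_jT = VT[j:j+1, :]            # j-th right singular vector (a row of V^T)
    K_sum += S[j, j] * u_j * v_jT  # S is the padded Sigma from above
    print('rank of u_j v_j^T:', np.linalg.matrix_rank(u_j * v_jT))   # expect 1
print(K - K_sum)                   # expect (numerically) zero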

The SVD Theorem¶

Here are the full details. For $\boldsymbol{B}\in\mathbb{R}^{m\times n}$

$$ \boldsymbol{B} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T =\sum_{j=1}^{p} \sigma_j \boldsymbol{u}_j\boldsymbol{v}_j^T $$

where: $\boldsymbol{U}\in\mathbb{R}^{m\times m}$, $\boldsymbol{\Sigma}\in\mathbb{R}^{m\times n}$, $\boldsymbol{V}\in\mathbb{R}^{n\times n}$ and $p=\min\{m,n\}$.

Note that $\boldsymbol{\Sigma}=\text{diag}(\sigma_1,\ldots,\sigma_p)$ and we can always arrange this so that $\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_p \ge 0$.

Since $\boldsymbol{B}$ is real here (we aren't interested in complex matrices), $\boldsymbol{U}$ and $\boldsymbol{V}$ are real and orthogonal.

If $\sigma_r\ne 0$ and $\sigma_j= 0$ for all $j>r$ then $r$ is the rank of $\boldsymbol{B}$.

Note that storing $\boldsymbol{B}$ requires $mn$ numbers, whereas storing the rank-$r$ form of the SVD requires $r(m+n+1)$. The ratio is

$$ \frac{r(m+n+1)}{mn} = \frac{r}{n} + \frac{r}{m} + \frac{r}{mn} $$
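For a hypothetical example, with $m=n=1000$ and $r=20$ this ratio is about $0.04$, i.e. roughly $4\%$ of the original storage:

# storage ratio for a rank-r truncated SVD of an m x n matrix (hypothetical sizes)
m, n, r = 1000, 1000, 20
print(r*(m + n + 1) / (m*n))   # about 0.04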

The Thin SVD¶

We saw above that numpy returned only the non-zero part of $\boldsymbol{\Sigma}$ (the vector of singular values) when we computed the SVD. Dropping the parts of the factorization that only multiply zeros gives what is called the Thin SVD.

Here we still have $\boldsymbol{B}\in\mathbb{R}^{m\times n}$ (with $m\ge n$, say), but now

$$ \boldsymbol{B} = \boldsymbol{U}_1\boldsymbol{\Sigma}_1\boldsymbol{V}^T =\sum_{j=1}^{n} \sigma_j \boldsymbol{u}_j\boldsymbol{v}_j^T $$

where $\boldsymbol{U}_1\in\mathbb{R}^{m\times n}$ and $\boldsymbol{\Sigma}_1\in\mathbb{R}^{n\times n}$.

It is called the thin SVD because we drop the values that make no contribution (i.e. the zeros).
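In numpy the thin form is available via the full_matrices=False flag; here is a minimal sketch using the matrix K from above:

# thin SVD: full_matrices=False drops the parts that only multiply zeros
U1, S1, VT1 = np.linalg.svd(K, full_matrices=False)
print(U1.shape, S1.shape, VT1.shape)    # (2, 2), (2,), (2, 3) for our 2 x 3 matrix K
print(K - U1 @ np.diag(S1) @ VT1)       # expect (numerically) zero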

Don't worry too much about this. We'll let numpy do the hard work for us.

HOMEWORK - very important¶

In the lab we are going to see how the SVD can be used to compress data.

We'll use image compression as an example.

Take a good quality jpeg colour photo (e.g. on your phone) of something vivid, detailed and colourful and save it on your account (OneDrive, for example) so that your Jupyter notebook in Anaconda can use it.

We are going to use the SVD to compress the image.

Review¶

  • we understand that matrices can be decomposed into multiplicative factors.
  • we saw how the spectral theorem allows us to approximate square symmetric matrices using the eigen-system.
  • we have seen how rectangular matrices can be similarly expanded in terms of the singular value decomposition.
  • we saw how we can access this functionality using numpy in python.