https://variationalform.github.io/
https://github.com/variationalform
Simon Shaw https://www.brunel.ac.uk/people/simon-shaw.
This work is licensed under CC BY-SA 4.0 (Attribution-ShareAlike 4.0 International). Visit http://creativecommons.org/licenses/by-sa/4.0/ to see the terms.
This document uses python and also makes use of LaTeX in Markdown.
This is a very quick recap of essential (for us) concepts in probability and statistics.
As usual our emphasis will be on doing rather than proving: just enough: progress at pace
For this worksheet you are recommended Chapter 6 of [MML] and Appendix C of [DSML].
MML: Mathematics for Machine Learning, by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Cambridge University Press. https://mml-book.github.io.
DSML: Data Science and Machine Learning, Mathematical and Statistical Methods by Dirk P. Kroese, Zdravko I. Botev, Thomas Taimre, Radislav Vaisman https://people.smp.uq.edu.au/DirkKroese/DSML/DSML.pdf
There are also various resources here: https://stats.libretexts.org/Bookshelves
These can be accessed legally and without cost. NOTE: we haven't referred to the second of these before.
There are also these useful references for coding:
- python: https://docs.python.org/3/tutorial
- numpy: https://numpy.org/doc/stable/user/quickstart.html
- matplotlib: https://matplotlib.org

And, DSML (as above): Appendix D has a very useful python primer.
Probability is a subtle topic, and also not without its interpretational controversy.
Suppose you flip a coin $S$ times and it lands heads $n$ times.
We assign a probability to the event $H$, 'the coin lands heads', by looking at the relative frequency and say:

$$ \mathrm{P}(H) \approx \frac{n}{S}, \qquad\text{with the approximation becoming exact as } S\to\infty. $$

This seems fine - we can flip a coin as many times as we like to approximate $S\to\infty$.
We can also introduce prior beliefs. If the coin is fair, a judgement we make by appeal to its physical symmetry and the laws of physics, then we can assert that
$\mathrm{P}(H) = 1/2$, and
$\mathrm{P}(T) = 1/2$.
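For illustration, here is a minimal sketch, assuming a fair coin and using `numpy`'s random number generator (the seed and variable names are just for illustration), showing the relative frequency $n/S$ settling near $1/2$ as $S$ grows.

import numpy as np
# simulate S coin flips and look at the relative frequency of heads
rng = np.random.default_rng(seed=42)                  # seeded for reproducibility
for S in (10, 100, 10_000, 1_000_000):
    flips = rng.integers(0, 2, size=S)                # 1 = heads, 0 = tails
    n = flips.sum()                                   # number of heads
    print(f'S = {S:8d},  n/S = {n/S:.4f}')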
Sometimes though we can't appeal to this type of simple intuitive interpretation of probability.
There is a 70% chance of rain tomorrow.
Really? What does that mean? It isn't like the coin toss. We can't repeat 'tomorrow' $S$ times and count the number $n$ of times it rains.
What this means is that for $10$ meteorologically similar days we can expect to need an umbrella on 7 of them.
There is a lot of history and lively debate around these questions of interpretation. See for example, https://plato.stanford.edu/entries/probability-interpret/
We're fortunate though. We will usually be able to run our codes many times on large enough data sets, and so we can think about relative frequencies.
We think of running an experiment. We will have a sample space $\Omega$ of all the possible outcomes of the experiment, and an event space $\mathcal{E}$ whose elements, the events, are sets of outcomes. The event space is the power set of $\Omega$ (the set of all subsets of $\Omega$).
For example, if we toss a coin three times there are $2^3$ possible outcomes.
$\Omega = \{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT\}$.
Examples of events are

(i) 'Two heads and one tail': $\{ HHT,HTH,THH \}$.

(ii) 'Head on first fall': $\{ HHH, HHT, HTH, HTT \}$.
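As a quick sketch we can enumerate $\Omega$ with `itertools` (the helper names below are just for illustration) and count the outcomes in each event:

from itertools import product
# enumerate the sample space for three coin tosses and count outcomes in each event
omega = [''.join(t) for t in product('HT', repeat=3)]   # all 2**3 = 8 outcomes
print('Omega =', omega)
two_heads_one_tail = [w for w in omega if w.count('H') == 2]
head_on_first_fall = [w for w in omega if w[0] == 'H']
print('P(two heads and one tail) =', len(two_heads_one_tail), '/', len(omega))
print('P(head on first fall)     =', len(head_on_first_fall), '/', len(omega))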
There is a function $\mathrm{P}\colon\mathcal{E}\to [0,1]$ that assigns a probability $\mathrm{P}(E)$ to each event $E\in\mathcal{E}$.
On the assumption that this is revision, we won't work through examples here.
Suppose you want to know the probability that $A$ occurs given that $B$ does occur.

We write this conditional probability as $\mathrm{P}(A\mid B)$.
To understand it, suppose that in $S$ trials $A$ and $B$ have simultaneously occurred $m$ times while $B$ has occurred $n$ times. We must have that $n\ge m$ and so the probability that $A$ and $B$ both occurred given that $B$ occurred is reasoned out like this:
$$ \mathrm{P}(A\mid B) \approx \frac{m}{n} = \frac{m}{S}\frac{S}{n} = \frac{m}{S}\left(\frac{n}{S}\right)^{-1} \to \frac{\mathrm{P}(A\mathrm{\ and\ }B)}{\mathrm{P}(B)}. $$

We take the right hand side as the definition of the left hand side, given the intuitive calculation in the middle.
This is very useful. It allows us to switch the conditioning around.
It is useful to recognise that $\mathrm{P}(A\mathrm{\ and\ }B) = \mathrm{P}(A\mid B)\mathrm{P}(B) = \mathrm{P}(B\mid A)\mathrm{P}(A)$.
$\mathrm{P}(A\mathrm{\ or\ }B) = \mathrm{P}(A)+\mathrm{P}(B)-\mathrm{P}(A\mathrm{\ and\ }B)$.
$\mathrm{P}(A\mathrm{\ or\ }B) = \mathrm{P}(A)+\mathrm{P}(B)$ if $A$ and $B$ are mutually exclusive.
$\mathrm{P}(A\mathrm{\ and\ }B) = \mathrm{P}(A)\mathrm{P}(B)$ if $A$ and $B$ are independent.
$\mathrm{P}(A)+\mathrm{P}(\neg A) = 1$ where $\neg A$ ('not' $A$) means that $A$ did not occur.
The Partition Theorem (or the Law of Total Probability)
$\mathrm{P}(A) = \mathrm{P}(A\mathrm{\ and\ }B) + \mathrm{P}(A\mathrm{\ and\ }\neg B)$
$\mathrm{P}(A) = \mathrm{P}(A\mid B)\mathrm{P}(B) + \mathrm{P}(A\mid \neg B)\mathrm{P}(\neg B)$
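Here is a small numerical check of the partition theorem, a sketch using the three-coin-toss events above and assuming equally likely outcomes (the helper functions are just for illustration):

from itertools import product
# A = 'head on first fall', B = 'two heads and one tail'
omega = [''.join(t) for t in product('HT', repeat=3)]     # equally likely outcomes
A = {w for w in omega if w[0] == 'H'}
B = {w for w in omega if w.count('H') == 2}
notB = set(omega) - B

def prob(E):                    # probability of an event (equally likely outcomes)
    return len(E) / len(omega)

def cond(E, F):                 # conditional probability P(E | F)
    return len(E & F) / len(F)

lhs = prob(A)
rhs = cond(A, B) * prob(B) + cond(A, notB) * prob(notB)
print('P(A) =', lhs, 'and P(A|B)P(B) + P(A|notB)P(notB) =', rhs)    # both 0.5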
Bayes formula (reprised)
$$ \mathrm{P}(B\mid A) = \frac{\mathrm{P}(A\mathrm{\ and\ }B)}{\mathrm{P}(A)} = \frac{\mathrm{P}(A\mid B)\mathrm{P}(B)} {\mathrm{P}(A\mid B)\mathrm{P}(B) + \mathrm{P}(A\mid \neg B)\mathrm{P}(\neg B)} $$

We won't have too much need for these, but we are interested in the connection to confusion matrices...
Recall that for a binary classifier our confusion matrices took the very specific form:
$$ \begin{array}{rcc} \begin{array}{r} \text{target, or true} \\ \text{label/class} \end{array}\quad & \begin{array}{c} Y \\ N \end{array}\!\! & \left( \begin{array}{cc} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \\ \end{array} \right) \\ & & \begin{array}{cc} + & - \end{array} \\ & & \text{output, or predicted} \\ & & \text{label/class} \\ \end{array} $$

These numbers represent estimates (that get better as $S\to\infty$) of conditional probabilities...
Example: suppose we have this set of results where $Y$ or $N$ are the known labels and $+$ and $-$ are the predictions:
$$ \begin{array}{rcc} \begin{array}{r} \text{label} \\ \text{} \end{array}\quad & \begin{array}{c} Y \\ N \end{array}\!\! & \left( \begin{array}{cc} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \\ \end{array} \right) \\ & & \begin{array}{cc} + & - \end{array} \\ & & \text{predicted} \\ \end{array} \qquad\text{ with, specifically,}\qquad \left( \begin{array}{cc} \mathrm{62} & \mathrm{5} \\ \mathrm{9} & \mathrm{44} \\ \end{array} \right). $$

There are $S=120$ results. Look along the first row - these are the actual numbers in the sample which are labelled as `Y` (healthy, innocent, passed, safe, ...) as opposed to `N` (sick, guilty, failed, unsafe, ...). So $62+5$ are in the `Y` class out of a total of $120$. If this sample represents the population then we can estimate...
$\mathrm{P}(Y)=67/120$. Similarly, $\mathrm{P}(+)=(62+9)/120$.

Further, in the second row, we know that `N` occurs, for these are all in the `N` class. So, with similar reasoning ...

$\mathrm{P}(+\mid N)=9/(9+44)$. Similarly, $\mathrm{P}(Y\mid -)=5/(5+44)$.
Let's see all the calculations...
import numpy as np
cm = np.array([[62,5],[9,44]])
N = cm.sum()
print('Number of samples: ', N,' with base rates...')
print('P(Y) = (62+5)/120 = ', (62+5)/120, end=' and ')
print('P(N) = (9+44)/120 = ', (9+44)/120, ' = 1-P(Y)')
print('P(+) = (62+9)/120 = ', (62+9)/120, end=' and ')
print('P(-) = (5+44)/120 = ', (5+44)/120, ' = 1-P(+)')
print('Conditionals...')
print('P(Y|+) = 62/(62+9) = ', 62/(62+9), end=' and ')
print('P(N|+) = 9/(62+9) = ', 9/(62+9))
print('P(Y|-) = 5/(5+44) = ', 5/(5+44), end=' and ')
print('P(N|-) = 44/(5+44) = ', 44/(5+44))
print('P(+|Y) = 62/(62+5) = ', 62/(62+5), end=' and ')
print('P(-|Y) = 5/(62+5) = ', 5/(62+5))
print('P(+|N) = 9/(9+44) = ', 9/(9+44), end=' and ')
print('P(-|N) = 44/(9+44) = ', 44/(9+44))
Number of samples:  120  with base rates...
P(Y) = (62+5)/120 =  0.5583333333333333 and P(N) = (9+44)/120 =  0.44166666666666665  = 1-P(Y)
P(+) = (62+9)/120 =  0.5916666666666667 and P(-) = (5+44)/120 =  0.4083333333333333  = 1-P(+)
Conditionals...
P(Y|+) = 62/(62+9) =  0.8732394366197183 and P(N|+) = 9/(62+9) =  0.1267605633802817
P(Y|-) = 5/(5+44) =  0.10204081632653061 and P(N|-) = 44/(5+44) =  0.8979591836734694
P(+|Y) = 62/(62+5) =  0.9253731343283582 and P(-|Y) = 5/(62+5) =  0.07462686567164178
P(+|N) = 9/(9+44) =  0.16981132075471697 and P(-|N) = 44/(9+44) =  0.8301886792452831
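As a quick cross-check of Bayes' formula against these numbers, here is a minimal sketch (the variable names are just for illustration) showing that $\mathrm{P}(Y\mid +)$ obtained by switching the conditioning around agrees with the direct value:

# Bayes: P(Y|+) = P(+|Y)P(Y) / ( P(+|Y)P(Y) + P(+|N)P(N) )
P_Y, P_N = 67/120, 53/120                      # base rates from the confusion matrix
P_plus_given_Y, P_plus_given_N = 62/67, 9/53   # conditionals from the rows
bayes = P_plus_given_Y * P_Y / (P_plus_given_Y * P_Y + P_plus_given_N * P_N)
print('P(Y|+) via Bayes =', bayes)             # should equal 62/(62+9)
print('P(Y|+) directly  =', 62/(62+9))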
Earlier for a binary classifier we defined some useful terms for measuring performance. Some of these can be related to conditional probabilities.
$$ \begin{array}{rcc} \begin{array}{r} \text{target, or true} \\ \text{label/class} \end{array}\quad & \begin{array}{c} Y \\ N \end{array}\!\! & \left( \begin{array}{cc} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \\ \end{array} \right) \\ & & \begin{array}{cc} + & - \end{array} \\ & & \text{output, or predicted} \\ & & \text{label/class} \\ \end{array} $$

Recall that we used $\mathrm{P}$ and $\mathrm{N}$ for the number of positives and negatives overall in the test set.
Prevalence: $\mathrm{Prevalence} = \frac{\mathrm{P}}{\mathrm{P}+\mathrm{N}} = \mathrm{P}(Y)$
TPR: True Positive Rate, sensitivity, recall: $\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{P}} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} = \mathrm{P}(+\mid Y)$
And also...
$$ \begin{array}{rcc} \begin{array}{r} \text{target, or true} \\ \text{label/class} \end{array}\quad & \begin{array}{c} Y \\ N \end{array}\!\! & \left( \begin{array}{cc} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \\ \end{array} \right) \\ & & \begin{array}{cc} + & - \end{array} \\ & & \text{output, or predicted} \\ & & \text{label/class} \\ \end{array} $$

FPR: False Positive Rate: $\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{N}} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}} = \mathrm{P}(+\mid N)$
FNR: False Negative Rate: $\mathrm{FNR} = \frac{\mathrm{FN}}{\mathrm{P}} = \frac{\mathrm{FN}}{\mathrm{FN}+\mathrm{TP}} = \mathrm{P}(-\mid Y)$
PPV: Positive Predictive Value, precision: $\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} = \mathrm{P}(Y\mid +)$
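To make the link explicit, here is a short sketch (reusing the confusion matrix numbers from above; the variable names are just for illustration) that computes each rate and compares it with the corresponding conditional probability:

import numpy as np
cm = np.array([[62, 5], [9, 44]])     # rows: true Y, N; columns: predicted +, -
TP, FN = cm[0]
FP, TN = cm[1]
P_, N_ = TP + FN, FP + TN             # totals of true positives and true negatives
print('Prevalence = P/(P+N)  =', P_ / (P_ + N_), ' = P(Y)')
print('TPR = TP/(TP+FN)      =', TP / (TP + FN), ' = P(+|Y)')
print('FPR = FP/(FP+TN)      =', FP / (FP + TN), ' = P(+|N)')
print('FNR = FN/(FN+TP)      =', FN / (FN + TP), ' = P(-|Y)')
print('PPV = TP/(TP+FP)      =', TP / (TP + FP), ' = P(Y|+)')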
In addition to needing an understanding of how to infer probabilities from our results, we'll also need to understand some related concepts from Mathematical Statistics.
We review these terms:
A random variable is a function $Z\colon \Omega \to\mathbb{R}$. We can then talk about the probability that $Z$ takes a particular value.

For example, a die is thrown and $Z$ is assigned the value shown on the upward face.
The Expected Value of the random variable is the sum of all its possible values, each weighted by its probability:

$$ \mathbb{E}(Z) = \sum_{k} Z_k\,\mathrm{P}(Z = Z_k). $$
This coincides with the notion of average or mean value. Why?
The numerical data we work with will typically be lists of samples of the random variable, with each sampled value occurring with equal probability.
For example, if the random variable, $Z$, takes one of $N$ equally probable values $Z_1, Z_2, \ldots, Z_N$, then the probability that a given value is taken is $N^{-1}$ and then the expected value of $Z$ is,
$$ \mathbb{E}(Z) = \sum_{k=1}^N Z_k\,\mathrm{P}(Z_k) = \frac{1}{N}\sum_{k=1}^N Z_k = \bar{Z}. $$

This, the expected value of $Z$, is called the mean, or average, value of $Z$.
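For instance, a minimal sketch for a fair six-sided die, where each face is equally likely, gives $\mathbb{E}(Z) = 3.5$:

import numpy as np
faces = np.arange(1, 7)               # possible values 1..6
probs = np.full(6, 1/6)               # fair die: each face has probability 1/6
EZ = np.sum(faces * probs)            # sum of values weighted by probabilities
print('E(Z) for a fair die =', EZ)    # 3.5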
We use $\bar{Z}$ to denote the sample mean. It is common to denote the population mean by $\mu_Z$, but this isn't usually accessible to us - we'll almost always be working with samples and so we write $\bar{Z}\approx\mu_Z$.
The mean is a measure of the centre of a distribution. Two other measures are also in common use. Confining ourselves to the discrete case these are...
median: this is the value in the middle of an ordered set. For example $\{1,3,4,78,90\}$ has median $4$. When the set has an even number of elements the median can be taken as the average of the two centre elements.
mode: this is the most frequently occurring value. The set above doesn't have a mode (or all elements are modes). The set $\{1,3,3,78,90\}$ has mode $3$.
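Both are easy to compute, as in this small sketch using `np.median` and `collections.Counter` (one simple way, among several, to find a mode):

import numpy as np
from collections import Counter
print('median of {1,3,4,78,90} =', np.median([1, 3, 4, 78, 90]))   # 4.0
vals = [1, 3, 3, 78, 90]
mode, count = Counter(vals).most_common(1)[0]                       # most frequent value
print('mode of {1,3,3,78,90}   =', mode, '(occurs', count, 'times)')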
The variance of the random variable, $X$, is defined (for our purposes) as
$\mathrm{Var}(X) = \mathbb{E}(X^2) - \big(\mathbb{E}(X)\big)^2$.
For us, with sample size $N$, this is,
$$ \mathrm{Var}(X) = \frac{1}{N}\sum_{k=1}^N\big(X_k-\mathbb{E}(X)\big)^2. $$

Also, the standard deviation is given by
$\sigma_X = \mathrm{Std}(X) = \sqrt{\mathrm{Var}(X)}$.
These formulae are sometimes altered slightly for smaller sample sizes, with the denominator $N$ replaced by $N-1$ to get an unbiased estimate. When $N$ is large this has negligible effect.
Let's see concrete examples with the data set $X\in\{1,3,4,5,7\}$.
We'll see that `numpy` can make life easy for us...
X = np.array([1,3,4,5,7])
N = X.shape[0]
Xbar = X.sum()/N
print('E(X) = mean = ', Xbar, ' or with numpy: ', X.mean())
# centre X using mean, then sum of squares using dot product
Xc = X-Xbar
VarX = Xc.T.dot(Xc) / N
print('Var(X) = variance = ', VarX, ' or with numpy: ', X.var())
print('SD(X) = Std Dev = ', np.sqrt(VarX), ' or with numpy: ', X.std())
# or, the unbiased result..
VarX = Xc.T.dot(Xc) / (N-1)
print('Var(X) = variance = ', VarX, ' or with numpy: ', X.var(ddof=1))
print('SD(X) = Std Dev = ', np.sqrt(VarX), ' or with numpy: ', X.std(ddof=1))
E(X) = mean =  4.0  or with numpy:  4.0
Var(X) = variance =  4.0  or with numpy:  4.0
SD(X) = Std Dev =  2.0  or with numpy:  2.0
Var(X) = variance =  5.0  or with numpy:  5.0
SD(X) = Std Dev =  2.23606797749979  or with numpy:  2.23606797749979
Often we have more than one random variable in play. We saw four numerical columns in the penguins data set for example. We can calculate stats for each column as shown above, but how can we assess how related these variables might be?
We define the covariance of two random variables as
$$ \mathrm{Cov}(X,Y) = \mathbb{E}\Big(\big(X-\mathbb{E}(X)\big)\big(Y-\mathbb{E}(Y)\big)\Big) = \frac{1}{N}\sum_{k=1}^N \big(X_k-\bar{X}\big)\big(Y_k-\bar{Y}\big) $$

and the correlation coefficient of two random variables as
$$ \rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y} $$

It is easy to see that $\mathrm{Cov}(X,X)=\mathrm{Var}(X)$ and that $\rho_{XX} = 1$.
These measurements indicate how strongly related the random variables are to each other: positive correlations indicate that both tend to grow or diminish together, while negative correlations indicate that one grows as the other shrinks. A zero correlation indicates that there is no linear relationship between the variables (they may still be related in a nonlinear way).
We'll normally work with covariance rather than correlation. Let's see an example - using penguins again...
Grab the data and clean it up just like before.
import numpy as np
import seaborn as sns
dfp = sns.load_dataset('penguins')
dfp.head()
dfp = dfp.dropna()
dfp = dfp.reset_index(drop=True)
dfp.head()
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
Now assign the numerical data in columns 3 to 6 (zero-based columns 2 to 5) to `X`
X = dfp.iloc[:, 2:6].values
X[:4,:]
array([[  39.1,   18.7,  181. , 3750. ],
       [  39.5,   17.4,  186. , 3800. ],
       [  40.3,   18. ,  195. , 3250. ],
       [  36.7,   19.3,  193. , 3450. ]])
Each column represents a random variable: $X_0$, $X_1$, $X_2$, $X_3$. We can calculate means, variances and covariances. For example...
print('Mean of column 1 (indexed at 0) : ', X[:,0].mean())
print('Std Dev of column 3 (population): ', X[:,2].std())
print('Std Dev of column 3 (unbiased) : ', X[:,2].std(ddof=1))
Mean of column 1 (indexed at 0) :  43.99279279279279
Std Dev of column 3 (population):  13.994704772576716
Std Dev of column 3 (unbiased) :  14.015765288287879
# remember that we can access some summary stats like this...
dfp.describe()
 | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
---|---|---|---|---|
count | 333.000000 | 333.000000 | 333.000000 | 333.000000 |
mean | 43.992793 | 17.164865 | 200.966967 | 4207.057057 |
std | 5.468668 | 1.969235 | 14.015765 | 805.215802 |
min | 32.100000 | 13.100000 | 172.000000 | 2700.000000 |
25% | 39.500000 | 15.600000 | 190.000000 | 3550.000000 |
50% | 44.500000 | 17.300000 | 197.000000 | 4050.000000 |
75% | 48.600000 | 18.700000 | 213.000000 | 4775.000000 |
max | 59.600000 | 21.500000 | 231.000000 | 6300.000000 |
What about the covariance? Let's calculate $\mathrm{Cov}(X_1,X_2)$...
# first center the data using the column means...
X1 = X[:,[1]] - X[:,[1]].mean()
X2 = X[:,[2]] - X[:,[2]].mean()
# then multiply, sum and take the unbiased average
N = X.shape[0]
CV12 = np.sum(X1*X2)/(N-1)
print("Cov(X1,X2) = ", CV12)
Cov(X1,X2) = -15.94724845327255
Rather than `np.sum()`, we can use the dot product, $\boldsymbol{X}_1\cdot\boldsymbol{X}_2 = \boldsymbol{X}_1^T\boldsymbol{X}_2$, like this...
CV12 = X1.T @ X2 / (N-1)
print("Cov(X1,X2) = ", CV12, " or as a scalar Cov(X1,X2) = ", float(CV12) )
Cov(X1,X2) = [[-15.94724845]] or as a scalar Cov(X1,X2) = -15.94724845327255
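Dividing the covariance by the two standard deviations gives the correlation coefficient $\rho_{X_1 X_2}$. Here is a minimal sketch using the unbiased `ddof=1` estimates (so the $N-1$ factors match), with a cross-check against `np.corrcoef`:

# rho_{12} = Cov(X1, X2) / (sigma_1 * sigma_2), using the unbiased (ddof=1) estimates
sigma1 = X[:, 1].std(ddof=1)
sigma2 = X[:, 2].std(ddof=1)
rho12 = float(CV12) / (sigma1 * sigma2)
print('rho(X1,X2) =', rho12)
# cross-check with numpy's built-in correlation coefficient matrix
print('np.corrcoef =', np.corrcoef(X[:, 1], X[:, 2])[0, 1])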
A useful concept is the covariance matrix. For our four variables it takes this form:
$$ \boldsymbol{M} = \left(\begin{array}{llll} \mathrm{Var}(X_0) & \mathrm{Cov}(X_0,X_1) & \mathrm{Cov}(X_0,X_2) & \mathrm{Cov}(X_0,X_3) \\ \mathrm{Cov}(X_1,X_0) & \mathrm{Var}(X_1) & \mathrm{Cov}(X_1,X_2) & \mathrm{Cov}(X_1,X_3) \\ \mathrm{Cov}(X_2,X_0) & \mathrm{Cov}(X_2,X_1) & \mathrm{Var}(X_2) & \mathrm{Cov}(X_2,X_3) \\ \mathrm{Cov}(X_3,X_0) & \mathrm{Cov}(X_3,X_1) & \mathrm{Cov}(X_3,X_2) & \mathrm{Var}(X_3) \\ \end{array}\right) $$

Recall that $\mathrm{Cov}(X,X)=\mathrm{Var}(X)$ and note that $\mathrm{Cov}(X,Y)=\mathrm{Cov}(Y,X)$. This matrix is therefore symmetric and so has real eigenvalues.
The covariance matrix is also positive semidefinite. This means that
$$ \boldsymbol{u}\cdot\boldsymbol{M}\boldsymbol{u} \ge 0 $$

for all vectors $\boldsymbol{u}$. This in turn means that the eigenvalues of the covariance matrix are non-negative. To see this inequality, assume without loss of generality that the $X_i$'s are already centered and collect the observed values into the column vectors $\boldsymbol{X}_i$. Then,
$$ (N-1)\boldsymbol{M} = \left(\begin{array}{llll} \boldsymbol{X}_0\cdot\boldsymbol{X}_0 & \boldsymbol{X}_0\cdot\boldsymbol{X}_1 & \boldsymbol{X}_0\cdot\boldsymbol{X}_2 & \boldsymbol{X}_0\cdot\boldsymbol{X}_3 \\ \boldsymbol{X}_1\cdot\boldsymbol{X}_0 & \boldsymbol{X}_1\cdot\boldsymbol{X}_1 & \boldsymbol{X}_1\cdot\boldsymbol{X}_2 & \boldsymbol{X}_1\cdot\boldsymbol{X}_3 \\ \boldsymbol{X}_2\cdot\boldsymbol{X}_0 & \boldsymbol{X}_2\cdot\boldsymbol{X}_1 & \boldsymbol{X}_2\cdot\boldsymbol{X}_2 & \boldsymbol{X}_2\cdot\boldsymbol{X}_3 \\ \boldsymbol{X}_3\cdot\boldsymbol{X}_0 & \boldsymbol{X}_3\cdot\boldsymbol{X}_1 & \boldsymbol{X}_3\cdot\boldsymbol{X}_2 & \boldsymbol{X}_3\cdot\boldsymbol{X}_3 \\ \end{array}\right) = \left(\begin{array}{l} \boldsymbol{X}_0^T \\ \boldsymbol{X}_1^T \\ \boldsymbol{X}_2^T \\ \boldsymbol{X}_3^T \\ \end{array}\right) \left(\begin{array}{llll} \boldsymbol{X}_0 & \boldsymbol{X}_1 & \boldsymbol{X}_2 & \boldsymbol{X}_3 \\ \end{array}\right) $$

Write this as $(N-1)\boldsymbol{M} = \boldsymbol{K}^T\boldsymbol{K}$ and then, for arbitrary $\boldsymbol{u}$,
$$ \boldsymbol{u}\cdot\boldsymbol{M}\boldsymbol{u} = \frac{1}{N-1} \boldsymbol{u}^T\boldsymbol{K}^T\boldsymbol{K}\boldsymbol{u} = \frac{1}{N-1} \big(\boldsymbol{K}\boldsymbol{u}\big)^T\boldsymbol{K}\boldsymbol{u} \ge 0. $$

We have seen how to get a covariance matrix entry using `numpy`, but there are a lot more - and this is for just four columns in the data set. Lots of work... Fortunately `numpy` can do the heavy lifting for us...
# note the transpose...
print(np.cov(X.T))
[[ 2.99063334e+01 -2.46209134e+00  5.00581949e+01  2.59562330e+03]
 [-2.46209134e+00  3.87788831e+00 -1.59472485e+01 -7.48456122e+02]
 [ 5.00581949e+01 -1.59472485e+01  1.96441677e+02  9.85219165e+03]
 [ 2.59562330e+03 -7.48456122e+02  9.85219165e+03  6.48372488e+05]]
We can see that in the third column, second row we have $\mathrm{Cov}(X_1,X_2) = -15.94724845\ldots$ as expected.
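As a quick numerical sketch of the claims above, we can form $\boldsymbol{K}$ from the centered columns, check that $\boldsymbol{K}^T\boldsymbol{K}/(N-1)$ reproduces `np.cov`, and confirm that the eigenvalues are non-negative (up to rounding):

# centre each column, form K, and compare K^T K / (N-1) with np.cov
K = X - X.mean(axis=0)
M = K.T @ K / (N - 1)
print('max difference from np.cov :', np.abs(M - np.cov(X.T)).max())
# M is symmetric, so use eigvalsh; all eigenvalues should be >= 0
print('eigenvalues of M           :', np.linalg.eigvalsh(M))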
But, the `pandas` library that gives us the data frames has already thought of both covariance and correlation, like this:
# note: with newer versions of pandas you may need dfp.cov(numeric_only=True) to skip the non-numeric columns
dfp.cov()
 | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
---|---|---|---|---|
bill_length_mm | 29.906333 | -2.462091 | 50.058195 | 2595.623304 |
bill_depth_mm | -2.462091 | 3.877888 | -15.947248 | -748.456122 |
flipper_length_mm | 50.058195 | -15.947248 | 196.441677 | 9852.191649 |
body_mass_g | 2595.623304 | -748.456122 | 9852.191649 | 648372.487699 |
dfp.corr()
 | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
---|---|---|---|---|
bill_length_mm | 1.000000 | -0.228626 | 0.653096 | 0.589451 |
bill_depth_mm | -0.228626 | 1.000000 | -0.577792 | -0.472016 |
flipper_length_mm | 0.653096 | -0.577792 | 1.000000 | 0.872979 |
body_mass_g | 0.589451 | -0.472016 | 0.872979 | 1.000000 |
THINK ABOUT: do you need both `flipper_length_mm` and `body_mass_g` in your analysis?
We covered just enough, to make progress at pace. We looked at `python` tools. Now we can start putting all of this material to work.