https://variationalform.github.io/
https://github.com/variationalform
Simon Shaw https://www.brunel.ac.uk/people/simon-shaw.
This work is licensed under CC BY-SA 4.0 (Attribution-ShareAlike 4.0 International). Visit http://creativecommons.org/licenses/by-sa/4.0/ to see the terms.
This document uses python, and also makes use of LaTeX in Markdown.
You will be introduced to ...

- python programming: just enough: progress at pace
- python will be introduced and used as a tool
- python syntax, tools and techniques will be taught

The Quality Assurance Agency for Higher Education (QAA, https://www.qaa.ac.uk) defines one academic credit as nominally equal to 10 hours of study (see https://www.qaa.ac.uk/docs/qaa/quality-code/higher-education-credit-framework-for-england.pdf).
Therefore, this 15 credit block requires nominally 150 hours of your time. Although every one of us is different and may choose to spend our time in different ways, the following sketch of these 150 hours is fairly accurate.
There will be $33$ hours of contact time: two lectures plus one seminar/lab, making three hours in each of eleven weeks. There will be a two hour exam, to which you could assign $25$ hours of preparation/revision time. This accounts for $33+2+25 = 60$ hours.
In addition there is an assignment to which you could allocate $20$ hours, making $80$ hours in total. This leaves $70$ of the $150$ hours. In each of the $10$ weeks of term you will be required to engage with set tasks and problems, and to read sections of set books and sources, both to strengthen your understanding of the material and to prepare you for the next topics. These $70$ hours average out to $7$ hours per week over those $10$ weeks.
Note that as a full-time equivalent student you study $4\times 15 = 60$ credits at a time. Using the figures above you can think of this as $4\times 3 = 12$ contact hours plus $4 \times 7 = 28$ private study hours per week: a $40$ hour week.
Note that engaging at this level does not guarantee any outcome, whether that be a bare pass or an A grade. It is a guideline only. If despite engaging at this level you are struggling to progress and achieve in the module then seek help and advice.
Further, these '$40$ hours' have to be high quality inquisitive engagement. Writing and re-writing notes, procrastinating, and looking at but not engaging with learning materials don't really count. You'll know when you're actually working - you'll feel it. Have a read of this https://en.wikipedia.org/wiki/Flow_(psychology) and make learning a daily rewarding habit.
The first few of these are debatable, evolving and subject to change and interpretation. It's worth searching and reading for yourself. These are fast-growing areas.
A blend of mathematics, computer science and statistics brought to bear with some form of domain expertise.
Systematic computational analysis of data, used typically to discover value and insights.
The stewardship, cleaning, warehousing and preparation of data to support its pipelining to its exploitation.
The development and deployment of digital systems that can effectively substitute for humans in tasks beyond the routine application of fixed rules. When you talk to your home assistant, your phone, your satellite TV receiver, your car, your laptop, and so on, it has no idea what you are going to say. It doesn't have a bank of pre-answered questions; instead it responds dynamically to what it hears. It has been trained on data, and it has learned how to respond. Incidentally, how do you think these systems even understand what you said? As a child, it took you months to begin to understand human speech...
The development and deployment of algorithms that are able to learn from data without explicit instructions, and then analyze, predict, or otherwise draw inferences from unseen data. These algorithms would typically be expected to add measurable value through their performance.
Consider for example an algorithm that predicted tails for every coin flip. It's right half the time - but there's no value in that.
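To make that concrete, here is a minimal sketch (our own illustration, not from the source) that simulates fair coin flips with numpy and scores an 'algorithm' that always predicts tails:

import numpy as np

rng = np.random.default_rng(42)           # seeded for reproducibility
flips = rng.integers(0, 2, size=10_000)   # 0 = heads, 1 = tails (simulated)
predictions = np.ones_like(flips)         # the 'algorithm': always predict tails
accuracy = (predictions == flips).mean()
print('accuracy:', accuracy)              # roughly 0.5 - right half the time, no value added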
Machine learning models do not have intrinsic knowledge but instead learn from data. Typically a data set comprises a list of items each of which has one or more features which correspond to a label. We'll see some examples of this below.
We think of the features as being inputs to the machine learning model, and the label as being the output. Typically we want to be able to feed in new features, and have the model predict the label.
To do this we need a training data set so that the model can learn how to map the features to the label: the input to the output.
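As a taster, here is a minimal sketch of that loop in scikit-learn; the numbers and the choice of DecisionTreeClassifier are illustrative assumptions of ours, not prescribed by anything above:

from sklearn.tree import DecisionTreeClassifier

# training data: each row is one item with two features (made-up numbers)
X_train = [[150, 50], [160, 60], [180, 80], [190, 95]]   # e.g. height, weight
y_train = ['child', 'child', 'adult', 'adult']           # the labels

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # learn the mapping from features to label
print(model.predict([[175, 72]]))    # predict the label for unseen features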
There are three basic learning paradigms:

- supervised learning
- unsupervised learning
- reinforcement learning

This last is a major topic in its own right and will not be covered in these lectures. We will see examples of the first two.
Our algorithms will be developed to perform one of the following tasks:
Regression: here the output, the label, can take any value in a continuous set. For example, the height of a tree, given local climate, soil type, genus, age since planting, could be considered to be any non-negative real number (although not with equal probability).
Classification: in this case the label will be deemed to be one of a certain class. For example, in the handwritten digits example above, the output will be one of the digits $\{0,1,2,3,\ldots,9\}$.
Some of the algorithms we study will be able to perform both the regression and classification tasks, although we won't always delve deeply into both capabilities.
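A small sketch contrasting the two tasks (the models and numbers here are our own illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# regression: the label can take any value in a continuous set
reg = LinearRegression().fit(X, [1.4, 2.1, 2.9, 4.2])
print(reg.predict([[2.5]]))    # some real number

# classification: the label is one of a fixed set of classes
clf = KNeighborsClassifier(n_neighbors=1).fit(X, [0, 0, 1, 1])
print(clf.predict([[2.5]]))    # one of the classes 0 or 1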
For data science, our main sources of information are as follows:
All of the above can be accessed legally and without cost.
There are also these useful references for coding:
- python: https://docs.python.org/3/tutorial
- numpy: https://numpy.org/doc/stable/user/quickstart.html
- matplotlib: https://matplotlib.org

The capitalized abbreviations will be used throughout to refer to these sources. For example, we could say "See [MLFCES, Chap. 2, Sec. 1] for more discussion of supervised learning". This would just be a quick way of saying

Look in Section 1 of Chapter 2 of Machine Learning: A First Course for Engineers and Scientists, by Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, Thomas B. Schön, for more discussion of supervised learning.
There will be other sources shared as we go along. For now these will get us a long way.
python and some data sets

For each of our main topics we will see some example data, discuss a means of working with it, and then implement those means in code. We will develop enough theory to understand how the codes work, but our main focus will be the intuition behind each method, and effective problem solving using code.
We choose python
because its use in both the commercial and academic data science
arena seems to be pre-eminent.
The data science techniques and algorithms we will study, and the supporting technology
like graphics and number crunching, are implemented in well-known and well-documented
python
libraries. These are the main ones we will use:
- matplotlib: used to create visualizations, plotting 2D graphs in particular.
- numpy: numerical python, used for array processing, which for us will usually mean numerical calculations involving vectors and matrices.
- scikit-learn: a set of well-documented and easy-to-use tools for predictive data analysis.
- pandas: a data analysis tool, used for storing and manipulating data.
- seaborn: a data visualization library for attractive and informative statistical graphics.

There will be others, but these are the main ones. Let's look at some examples of how to use them.
Eventually we will use the Anaconda distribution to access python and the libraries we need. The coding itself will be carried out in a Jupyter notebook; we'll go through this in an early lab session. We'll start, though, with Binder - click here:
https://mybinder.org/v2/gh/variationalform/FML.git/HEAD
Let's see some code and some data. In the following cell we import seaborn and look at the names of the built-in data sets. The seaborn library, https://seaborn.pydata.org, is designed for data visualization. It uses matplotlib, https://matplotlib.org, which is a graphics library for python.
If you want to dig deeper, you can look at https://blog.enterprisedna.co/how-to-load-sample-datasets-in-python/ and https://github.com/mwaskom/seaborn-data for the background - but you don't need to.
import seaborn as sns
# we can now refer to the seaborn library functions using 'sns'
# note that you can use another character string - but 'sns' is standard.
# note that # is used to write 'comments'
# Now let's get the names of the built-in data sets.
sns.get_dataset_names()
# type SHIFT+RETURN to execute the highlighted (active) cell
['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'dowjones', 'exercise', 'flights', 'fmri', 'geyser', 'glue', 'healthexp', 'iris', 'mpg', 'penguins', 'planets', 'seaice', 'taxis', 'tips', 'titanic']
The taxis data set

# let's take a look at 'taxis'
dft = sns.load_dataset('taxis')
# this just displays the first few rows of the data
dft.head()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
# this will display the last few rows... There are 6433 records (Why?)
dft.tail()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn |
What we are seeing here is a data frame. It is furnished by the pandas library, https://pandas.pydata.org, which is used by the seaborn library to store its example data sets.

Each row of the data frame corresponds to a single data point, which we could also call an observation or a measurement (depending on context).

Each column (except the left-most) corresponds to a feature of the data point. The first column is just an index giving the row number. Note that this index starts at zero - so, for example, the third row will be labelled/indexed as $2$. Be careful of this - it can be confusing.

Here the variable dft is a pandas data frame: dft stands for 'data frame taxis'.
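To make the zero-based indexing concrete, here is a small check you can run (our own illustration):

# the third row is at index 2 because indexing starts at zero
print(dft.iloc[2])    # position-based lookup: the third row
print(dft.loc[2])     # label-based lookup: the same row here, because the
                      # index labels are simply the row numbers 0, 1, 2, ...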
# let's print the data frame...
print(dft)
pickup dropoff passengers distance fare \ 0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 ... ... ... ... ... ... 6428 2019-03-31 09:51:53 2019-03-31 09:55:27 1 0.75 4.5 6429 2019-03-31 17:38:00 2019-03-31 18:34:23 1 18.74 58.0 6430 2019-03-23 22:55:18 2019-03-23 23:14:25 1 4.14 16.0 6431 2019-03-04 10:09:25 2019-03-04 10:14:29 1 1.12 6.0 6432 2019-03-13 19:31:22 2019-03-13 19:48:02 1 3.85 15.0 tip tolls total color payment pickup_zone \ 0 2.15 0.0 12.95 yellow credit card Lenox Hill West 1 0.00 0.0 9.30 yellow cash Upper West Side South 2 2.36 0.0 14.16 yellow credit card Alphabet City 3 6.15 0.0 36.95 yellow credit card Hudson Sq 4 1.10 0.0 13.40 yellow credit card Midtown East ... ... ... ... ... ... ... 6428 1.06 0.0 6.36 green credit card East Harlem North 6429 0.00 0.0 58.80 green credit card Jamaica 6430 0.00 0.0 17.30 green cash Crown Heights North 6431 0.00 0.0 6.80 green credit card East New York 6432 3.36 0.0 20.16 green credit card Boerum Hill dropoff_zone pickup_borough dropoff_borough 0 UN/Turtle Bay South Manhattan Manhattan 1 Upper West Side South Manhattan Manhattan 2 West Village Manhattan Manhattan 3 Yorkville West Manhattan Manhattan 4 Yorkville West Manhattan Manhattan ... ... ... ... 6428 Central Harlem North Manhattan Manhattan 6429 East Concourse/Concourse Village Queens Bronx 6430 Bushwick North Brooklyn Brooklyn 6431 East Flatbush/Remsen Village Brooklyn Brooklyn 6432 Windsor Terrace Brooklyn Brooklyn [6433 rows x 14 columns]
Rows and rows of numbers aren't that helpful.
seaborn makes visualization easy - here is a scatter plot of the data.
sns.scatterplot(data=dft, x="distance", y="fare")
<AxesSubplot:xlabel='distance', ylabel='fare'>
THINK ABOUT: it looks like fare is roughly proportional to distance. But what could cause the outliers?
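One way to start investigating (the thresholds below are arbitrary choices of ours) is to filter for short trips with large fares:

# short trips with large fares look suspicious - tolls? fixed airport rates?
suspicious = dft.loc[(dft['fare'] > 50) & (dft['distance'] < 2)]
print(suspicious[['distance', 'fare', 'tolls', 'pickup_zone', 'dropoff_zone']])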
# here's another example
sns.scatterplot(data=dft, x="pickup_borough", y="tip")
<AxesSubplot:xlabel='pickup_borough', ylabel='tip'>
# is the tip proportional to the fare?
sns.scatterplot(data=dft, x="fare", y="tip")
<AxesSubplot:xlabel='fare', ylabel='tip'>
# is the tip proportional to the distance?
sns.scatterplot(data=dft, x="distance", y="tip")
<AxesSubplot:xlabel='distance', ylabel='tip'>
The tips data set

Let's look now at the tips data set. Along the way we'll see a few more ways we can use the data frame object.
# load the data - dft: data frame tips
# note that this overwrites the previous 'value/meaning' of dft
dft = sns.load_dataset('tips')
dft.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
An extensive list of data frame methods/functions can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame. Let's look at some of them.
# note: dft.info (without parentheses) refers to the method itself, so
# printing it shows the bound method and the frame; dft.info() would
# instead print a concise summary of the frame
print(dft.info)
print('The shape of the data frame is: ', dft.shape)
print('The size of the data frame is: ', dft.size)
print('Note that 244*7 =', 244*7)
<bound method DataFrame.info of total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4 .. ... ... ... ... ... ... ... 239 29.03 5.92 Male No Sat Dinner 3 240 27.18 2.00 Female Yes Sat Dinner 2 241 22.67 2.00 Male Yes Sat Dinner 2 242 17.82 1.75 Male No Sat Dinner 2 243 18.78 3.00 Female No Thur Dinner 2 [244 rows x 7 columns]> The shape of the data frame is: (244, 7) The size of the data frame is: 1708 Note that 244*7 = 1708
Again, numbers aren't always that helpful. Plots often give us more insight.
# plot every numeric column against the row index
dft.plot()
<AxesSubplot:>
sns.scatterplot(data=dft, x="total_bill", y="tip")
<AxesSubplot:xlabel='total_bill', ylabel='tip'>
You're assumed to be familiar with basic terms and concepts in these areas, but we will revise and review those that we need later.
We can get some basic stats for our data set with the describe()
method...
# here are some descriptive statistics
dft.describe()
total_bill | tip | size | |
---|---|---|---|
count | 244.000000 | 244.000000 | 244.000000 |
mean | 19.785943 | 2.998279 | 2.569672 |
std | 8.902412 | 1.383638 | 0.951100 |
min | 3.070000 | 1.000000 | 1.000000 |
25% | 13.347500 | 2.000000 | 2.000000 |
50% | 17.795000 | 2.900000 | 2.000000 |
75% | 24.127500 | 3.562500 | 3.000000 |
max | 50.810000 | 10.000000 | 6.000000 |
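Each entry in this table can also be computed one at a time; for example, for the tip column:

# reproducing some of the describe() output for a single column
print('mean tip   :', dft['tip'].mean())
print('std tip    :', dft['tip'].std())
print('median tip :', dft['tip'].median())   # the 50% row above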
The anscombe data set

This is pretty famous. There are four sets of 11 coordinate pairs. When plotted they look completely different, but they have the same summary statistics (at least the common ones).
See https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Image Credit: https://upload.wikimedia.org/wikipedia/commons/7/7e/Julia-anscombe-plot-1.png
Let's load the data set and take a look at it - we can look at the head and tail of the table just as we did above.
dfa = sns.load_dataset('anscombe')
# look at how we get an apostrophe in the string...
print("The size of Anscombe's data set is:", dfa.shape)
The size of Anscombe's data set is: (44, 3)
dfa.head()
dataset | x | y | |
---|---|---|---|
0 | I | 10.0 | 8.04 |
1 | I | 8.0 | 6.95 |
2 | I | 13.0 | 7.58 |
3 | I | 9.0 | 8.81 |
4 | I | 11.0 | 8.33 |
dfa.tail()
dataset | x | y | |
---|---|---|---|
39 | IV | 8.0 | 5.25 |
40 | IV | 19.0 | 12.50 |
41 | IV | 8.0 | 5.56 |
42 | IV | 8.0 | 7.91 |
43 | IV | 8.0 | 6.89 |
It looks like the four data sets are in the dataset
column. How can we extract them as separate items?
Well, one way is to print the whole dataset and see which rows correspond to each dataset. Like this...
print(dfa)
dataset x y 0 I 10.0 8.04 1 I 8.0 6.95 2 I 13.0 7.58 3 I 9.0 8.81 4 I 11.0 8.33 5 I 14.0 9.96 6 I 6.0 7.24 7 I 4.0 4.26 8 I 12.0 10.84 9 I 7.0 4.82 10 I 5.0 5.68 11 II 10.0 9.14 12 II 8.0 8.14 13 II 13.0 8.74 14 II 9.0 8.77 15 II 11.0 9.26 16 II 14.0 8.10 17 II 6.0 6.13 18 II 4.0 3.10 19 II 12.0 9.13 20 II 7.0 7.26 21 II 5.0 4.74 22 III 10.0 7.46 23 III 8.0 6.77 24 III 13.0 12.74 25 III 9.0 7.11 26 III 11.0 7.81 27 III 14.0 8.84 28 III 6.0 6.08 29 III 4.0 5.39 30 III 12.0 8.15 31 III 7.0 6.42 32 III 5.0 5.73 33 IV 8.0 6.58 34 IV 8.0 5.76 35 IV 8.0 7.71 36 IV 8.0 8.84 37 IV 8.0 8.47 38 IV 8.0 7.04 39 IV 8.0 5.25 40 IV 19.0 12.50 41 IV 8.0 5.56 42 IV 8.0 7.91 43 IV 8.0 6.89
From this and the head
and tail
output above we can infer that there are four
data sets: I, II, III and IV. They each contain $11$ pairs $(x,y)$.
However, this kind of technique is not going to be useful if we have a data set with millions of data points (rows). We certainly won't want to print them all like we did above.
Is there another way to determine the number of distinct feature values in a given column of the data frame?
Fortunately, yes. We want to know how many different values the dataset
column
has. We can do it like this.
dfa.dataset.unique()
array(['I', 'II', 'III', 'IV'], dtype=object)
We can count the number of different ones automatically too, by asking
for the shape
of the returned value. Here we go:
dfa.dataset.unique().shape
(4,)
This tells us that there are 4 items - as expected.

Don't worry too much about it saying (4,) rather than just 4. We'll come to that later when we discuss numpy (Numerical python: https://numpy.org).
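Incidentally, pandas can also count the distinct values directly, which sidesteps the shape question; for example:

print(dfa['dataset'].nunique())        # 4 distinct values
print(dfa['dataset'].value_counts())   # ...and how many rows each one has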
Now, we want to extract each of the four datasets as separate data sets so we can work
with them. We can do that by using loc
to get the row-wise locations where each
value of the dataset
feature is the same.
For example, using the hints here
https://stackoverflow.com/questions/17071871/how-do-i-select-rows-from-a-dataframe-based-on-column-values,
to get the data for the sub-data-set I
we can do this:
dfa.loc[dfa['dataset'] == 'I']
dataset | x | y | |
---|---|---|---|
0 | I | 10.0 | 8.04 |
1 | I | 8.0 | 6.95 |
2 | I | 13.0 | 7.58 |
3 | I | 9.0 | 8.81 |
4 | I | 11.0 | 8.33 |
5 | I | 14.0 | 9.96 |
6 | I | 6.0 | 7.24 |
7 | I | 4.0 | 4.26 |
8 | I | 12.0 | 10.84 |
9 | I | 7.0 | 4.82 |
10 | I | 5.0 | 5.68 |
Now that we have this subset of the data we can examine it - with a scatter plot, for example.
sns.scatterplot(data=dfa.loc[dfa['dataset'] == 'I'], x="x", y="y")
<AxesSubplot:xlabel='x', ylabel='y'>
To really work properly with each subset we should extract them and give each of them a name that is meaningful.
dfa1 = dfa.loc[dfa['dataset'] == 'I']
dfa2 = dfa.loc[dfa['dataset'] == 'II']
dfa3 = dfa.loc[dfa['dataset'] == 'III']
dfa4 = dfa.loc[dfa['dataset'] == 'IV']
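An equivalent and more scalable approach - just a sketch, since we won't rely on it here - is to let pandas group the rows for us:

# the same four subsets via groupby, without hard-coding the labels
for name, group in dfa.groupby('dataset'):
    print(name, group.shape)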
Now let's look at each of the four data sets in a scatter plot,
and use the describe
method to examine the summary statistics.
The outcome is quite surprising...
sns.scatterplot(data=dfa1, x="x", y="y")
dfa1.describe()
x | y | |
---|---|---|
count | 11.000000 | 11.000000 |
mean | 9.000000 | 7.500909 |
std | 3.316625 | 2.031568 |
min | 4.000000 | 4.260000 |
25% | 6.500000 | 6.315000 |
50% | 9.000000 | 7.580000 |
75% | 11.500000 | 8.570000 |
max | 14.000000 | 10.840000 |
sns.scatterplot(data=dfa2, x="x", y="y")
dfa2.describe()
x | y | |
---|---|---|
count | 11.000000 | 11.000000 |
mean | 9.000000 | 7.500909 |
std | 3.316625 | 2.031657 |
min | 4.000000 | 3.100000 |
25% | 6.500000 | 6.695000 |
50% | 9.000000 | 8.140000 |
75% | 11.500000 | 8.950000 |
max | 14.000000 | 9.260000 |
sns.scatterplot(data=dfa3, x="x", y="y")
dfa3.describe()
x | y | |
---|---|---|
count | 11.000000 | 11.000000 |
mean | 9.000000 | 7.500000 |
std | 3.316625 | 2.030424 |
min | 4.000000 | 5.390000 |
25% | 6.500000 | 6.250000 |
50% | 9.000000 | 7.110000 |
75% | 11.500000 | 7.980000 |
max | 14.000000 | 12.740000 |
sns.scatterplot(data=dfa4, x="x", y="y")
dfa4.describe()
x | y | |
---|---|---|
count | 11.000000 | 11.000000 |
mean | 9.000000 | 7.500909 |
std | 3.316625 | 2.030579 |
min | 8.000000 | 5.250000 |
25% | 8.000000 | 6.170000 |
50% | 8.000000 | 7.040000 |
75% | 8.000000 | 8.190000 |
max | 19.000000 | 12.500000 |
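The similarity extends beyond describe(): a quick check with numpy's polyfit (our own addition) shows that the least-squares line is essentially the same for all four sets:

import numpy as np

# fit y = a*x + b for each subset; the coefficients barely differ
for name, group in dfa.groupby('dataset'):
    a, b = np.polyfit(group['x'], group['y'], 1)
    print(f'{name}: y = {a:.3f} x + {b:.3f}')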
For the taxis data set (here ds holds the data frame, e.g. ds = sns.load_dataset('taxis')):

1: sns.scatterplot(data=ds, x="dropoff_borough", y="tip")
2: sns.scatterplot(data=ds, x="distance", y="tip")

For the tips data set (e.g. ds = sns.load_dataset('tips')):

1: ds.describe()
2: sns.scatterplot(data=ds, x="total_bill", y="tip")
3: sns.scatterplot(data=ds, x="day", y="total_bill")
4: sns.scatterplot(data=ds, x="sex", y="tip")