https://variationalform.github.io/
https://github.com/variationalform
Simon Shaw https://www.brunel.ac.uk/people/simon-shaw.
This work is licensed under CC BY-SA 4.0 (Attribution-ShareAlike 4.0 International). Visit http://creativecommons.org/licenses/by-sa/4.0/ to see the terms.
This document uses python and also makes use of LaTeX in Markdown.
This worksheet is based on the material in the notebooks. Note that while the 'lecture' notebooks are prefixed with 1_, 2_ and so on, to indicate the order in which they should be studied, the worksheets are prefixed with A_, B_, ...
Refer back to the section where we introduced the use of numpy
for vector calculations.
There we had set up the vectors $\boldsymbol{a}$ and $\boldsymbol{p}$ and illustrated arithmetic like this:
import numpy as np
a = np.array([3, -2, 1])
p = np.array([5, 2, -10])
g = a-p
print(g)
a = g+p
print(a)
1. Repeat the calculations above but using numpy column arrays.
2. Use np.sqrt(16) to print out $\sqrt{16}$ (e.g. https://numpy.org/doc/stable/reference/generated/numpy.sqrt.html).
3. Use np.power(2,3) to print out $2^3$ (e.g. https://numpy.org/doc/stable/reference/generated/numpy.power.html).
4. Use np.pi to print out the area of a circle with radius $5$ (e.g. https://numpy.org/doc/stable/reference/constants.html).
5. Set up $\boldsymbol{x}$ and $\boldsymbol{y}$ as numpy row arrays.
6. Print out $\boldsymbol{x}+2\boldsymbol{y}$.
7. Repeat 5 and 6 but with $x$ and $y$ as numpy column arrays.
# put your working in here - make new cells if you like
a = np.array([[3], [-2], [1]])
p = np.array([[5], [2], [-10]])
g = a-p
print(g)
a = g+p
print(a)
print('sqrt(16) = ', np.sqrt(16))
print('2 cubed = ', np.power(2,3))
print('pi*r-squared =', np.pi*5**2)
x = np.array([3, -5, np.pi, 7.2, np.sqrt(9)])
y = np.array([-3, 16, 1, 1089, 15])
print ('x + 2y = ', x + 2*y)
x.shape = (5,1)
y.shape = (5,1)
print ('x + 2y = \n', x + 2*y)
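A note on the shape assignment used above: `x.shape = (5,1)` reshapes the array in place. An equivalent that leaves the original untouched is `reshape()`, which returns a reshaped view of the same data. A minimal sketch, assuming numpy:

```python
import numpy as np

x = np.array([3.0, -5.0, 7.2])
col = x.reshape(3, 1)        # a column view of the same data; x itself is unchanged
print(col.shape)             # (3, 1)
print(x.shape)               # still (3,)
print((col + 2*col).T)       # arithmetic on columns works just as it does on rows
```

Either form works for the exercises here; `reshape()` is handy when you want to keep the row version around as well.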
What is special about 1089?
Take three integers from $\{1,2,3,4,5,6,7,8,9\}$ and use them to make a three digit number. Reverse the digits and subtract the smaller from the larger to get a new three digit number (put a zero in front if it is only two digits). Add this number to its own reversal.
For example, $2$, $6$, $8$ gives $268$. Reversing and subtracting gives $862-268=594$. Reversing and adding then gives $594+495 = ?$
Try this for other numbers - is it always the same?
print(862-268)
print(594+495)
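Rather than trying choices one at a time, we can let python check every choice for us. A quick brute-force sketch (note the trick needs the first and last digits to differ, otherwise the subtraction gives zero):

```python
# Collect the final answer for every three-digit choice a,b,c from {1,...,9}.
results = set()
for a in range(1, 10):
    for b in range(1, 10):
        for c in range(1, 10):
            if a == c:
                continue                      # reversal equals the number: difference is 0
            n = 100*a + 10*b + c
            d = abs(n - (100*c + 10*b + a))   # subtract the smaller from the larger
            d3 = str(d).zfill(3)              # put a zero in front if only two digits
            results.add(d + int(d3[::-1]))    # add the number to its own reversal
print(results)                                # every valid choice gives the same answer
```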
We saw the vector $p$-norm for any $p\ge 1$ given by
$$ \Vert\boldsymbol{v}\Vert_p = \left\{\begin{array}{ll}\displaystyle \sqrt[p]{\vert v_1\vert^p + \vert v_2\vert^p +\cdots+\vert v_n\vert^p}, &\text{if } 1\le p < \infty; \\ \max\{\vert v_k\vert\colon k=1,2,\ldots,n\}, &\text{if } p = \infty. \end{array}\right. $$

Although $p<1$ is not allowed here, we do sometimes use phoney norms for cases when $p<1$.
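As a concrete check of this formula, assuming numpy, we can compare the definition with np.linalg.norm() for $p=3$, and check the infinity norm against the largest absolute entry:

```python
import numpy as np

v = np.array([3.0, -4.0, 12.0])
p = 3
by_hand = (np.abs(v)**p).sum()**(1.0/p)   # p-th root of the sum of |v_k|^p
builtin = np.linalg.norm(v, p)            # norm() takes the order p as its second argument
print(by_hand, builtin)                   # the two values agree

# the infinity norm is just the largest absolute entry
print(np.abs(v).max(), np.linalg.norm(v, np.inf))
```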
Using numpy.linalg.norm() with suitable values of its order argument, calculate these quantities and check your answers:
Hint: https://numpy.org/doc/stable/reference/generated/numpy.log.html
Furthermore, use numpy
to calculate these phoney norms:
Hint: https://numpy.org/doc/stable/reference/generated/numpy.sqrt.html
# put your working in here - make new cells if you like
import numpy as np
w = np.array([3, 0, -5, 4, np.pi**2, 0, -23, 7, -99])
print(w)
print('||w||_2 = ', np.linalg.norm(w,2))
print('||w||_1 = ', np.linalg.norm(w,1))
print('||w||_inf = ', np.linalg.norm(w,np.inf)) # note how we denote infinity
print('||w||_3 = ', np.linalg.norm(w,3))
print('||w||_ln7 = ', np.linalg.norm(w,np.log(7)))
print('||w||_23 = ', np.linalg.norm(w,23))
print('||w||_0.414... = ', np.linalg.norm(w,np.sqrt(2)-1))
print('||w||_0 = ', np.linalg.norm(w,0))
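Why is $p<1$ only a 'phoney' norm? One reason is that the triangle inequality $\Vert\boldsymbol{u}+\boldsymbol{v}\Vert \le \Vert\boldsymbol{u}\Vert + \Vert\boldsymbol{v}\Vert$ can fail. A quick check with $p=1/2$, assuming numpy:

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
p = 0.5                                              # norm() accepts non-integer orders, even p < 1
lhs = np.linalg.norm(u + v, p)                       # (1**0.5 + 1**0.5)**(1/0.5) = 4
rhs = np.linalg.norm(u, p) + np.linalg.norm(v, p)    # 1 + 1 = 2
print(lhs, '>', rhs)                                 # the triangle inequality fails for p = 1/2
```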
We look back at some of the data and learn a bit more about python and how to manipulate raw data. We start with the taxis data sets and see how we can convert some non-numeric data into numeric form.
import seaborn as sns
dft = sns.load_dataset('taxis')
dft.head(9)
In the following the question mark denotes values that you have to determine.

- Use dft.shape to determine and print the number of rows and columns in the data set.
- Use dft.shape[?] to determine and print the number of rows in the data set.
- Use dft.shape[?] to determine and print the number of columns in the data set.
- Use dft.iat[?,?] to print out the contents of the fourth column of the third row.
- Use dft.loc[?].iat[?] to print out the contents of the twelfth column of the sixth row.
- Use dft.loc[?] to print out the contents of the ninth row.

# put your working in here - make new cells if you like
print('The number of rows and columns are: ', dft.shape)
print('The number of rows is: ', dft.shape[0])
print('The number of columns is: ', dft.shape[1])
print('dft.iat[2,3] = ', dft.iat[2,3])
print('dft.loc[5].iat[11] = ', dft.loc[5].iat[11],'\n')
print('dft.loc[8] = ')
dft.loc[8]
datetime

We have already seen that the first column in the dft.head() output can be ignored - that is just a label for each observation and has nothing to do with the taxi ride data.
The pickup and dropoff columns are dates and times, and we ignored them earlier when we first worked with this data set because they aren't simple numerical values that can be put into a vector.
However, if we think of those as the number of seconds since 1 January 1970 then they are also just numbers (integers if we ignore fractions of a second). These two numbers can then go in a vector, just as those values in columns three to eight did.
Let's see how to convert the dates and times to simpler integers.
The following material on date-time conversion was adapted from that at
https://docs.python.org/3/library/datetime.html and
https://stackoverflow.com/questions/11743019/convert-python-datetime-to-epoch-with-strftime -
it uses the datetime
module.
We'll remind ourselves of the first few data points, and then import datetime.
dft.head(3)
# here is a useful alternative
dft[0:3]
Show only rows 5 to 9 (inclusive) of the data set.
Hint: do you think that [4:9]
refers to rows four to nine?
If not, then what?
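The hint can be checked with a plain python list: a slice [a:b] takes positions a up to, but not including, b. A minimal sketch:

```python
rows = list(range(10))   # stand-in labels 0,1,...,9, like the data-frame's row labels
print(rows[4:9])         # positions 4,5,6,7,8 - i.e. the fifth to ninth items
```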
# put your working in here - make new cells if you like
# import the module
from datetime import datetime
# and show the first few of rows again
dft[0:3]
# use the pickup time from the first data point
put = datetime(2019,3,23,20,21,9)
print('the date and time are ', put)
# the elapsed time since 1 January 1970 is
print('elapsed time since 1 January 1970 = ', (put - datetime(1970,1,1)).total_seconds() )
# a quicker way to get this is
print('timestamp = ', put.timestamp())
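The conversion also runs in reverse: datetime.fromtimestamp() turns seconds-since-the-epoch back into a date and time. It is interpreted in local time, so the round trip below assumes the local timezone is used consistently on both sides:

```python
from datetime import datetime

put = datetime(2019, 3, 23, 20, 21, 9)
ts = put.timestamp()                 # seconds since the epoch
print(datetime.fromtimestamp(ts))    # recovers the original date and time
```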
Now, we don't want to be typing all of this ourselves so let's start to automate the process.
First, we use dft.iat[0,0]
to get the first element in the first row of the
data-frame. We print it out and see that it is just a character string giving
the date and time we saw above.
print(dft.iat[0,0])
This string is dft.iat[0,0], which we interpret with the format %Y-%m-%d %H:%M:%S; we can then use strptime() to get the datetime variable. This page, https://www.digitalocean.com/community/tutorials/python-string-to-datetime-strptime, was a useful source for writing these notes.
Once done, it's then easy to get the timestamp - just as above.
dt = datetime.strptime(str(dft.iat[0,0]), '%Y-%m-%d %H:%M:%S' )
print('The date and time are ', dt, ' with timestamp ', dt.timestamp())
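The inverse of strptime() is strftime(): given a datetime variable and the same format string, it produces the string back again.

```python
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S'
dt = datetime.strptime('2019-03-23 20:21:09', fmt)
s = dt.strftime(fmt)     # format the datetime back into a string
print(s)                 # -> 2019-03-23 20:21:09
```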
Note that we had to write str(dft.iat[0,0]) rather than just dft.iat[0,0].
This is because (at least at the time of writing) if we try to run this
notebook in binder we will get an error.
The str()
function converts its argument into a string. The next two commands show the
type
of dft.iat[0,0]
. If you see str
(i.e. string) for both then you don't need
to replace dft.iat[0,0]
with str(dft.iat[0,0])
.
In binder these don't both give str
, and so we have to force it to be a string.
type( str(dft.iat[0,0]) )
type(dft.iat[0,0])
Let's remind ourselves of the first line of the data, and then get the timestamps for the first two entries:
dft[0:1]
print(datetime.strptime( str(dft.iat[0,0]), '%Y-%m-%d %H:%M:%S' ).timestamp())
print(datetime.strptime( str(dft.iat[0,1]), '%Y-%m-%d %H:%M:%S' ).timestamp())
The time taken between pickup and dropoff, 2019-03-23 20:21:09
and
2019-03-23 20:27:24
is $6$ minutes and $15$ seconds. This is
print(6*60+15,' seconds')
Now let's look at the difference between the two time stamps
ts1 = datetime.strptime( str(dft.iat[0,0]), '%Y-%m-%d %H:%M:%S' ).timestamp()
ts2 = datetime.strptime( str(dft.iat[0,1]), '%Y-%m-%d %H:%M:%S' ).timestamp()
print('ts 1 = ', ts1)
print('ts 2 = ', ts2)
print('ts2-ts1 = ', ts2-ts1)
and so the timestamp is just the conversion of a date to seconds.
You may wonder, then, when was zero seconds? Well, it was midnight on 1 January 1970 UTC (Coordinated Universal Time). To see this, we need to make sure we specify the correct timezone and then:
from datetime import timezone
print(datetime(1970,1,1,0,0,0,tzinfo=timezone.utc).timestamp())
So, here then is our vector for the zero-th taxi data. Note that it now includes the first two values as well as the six we had in the lecture.
Note that we use concatenate
to join two numpy
arrays.
r0 = np.array(dft.iloc[0,2:8])
print(r0)
r0 = np.concatenate(([ts1, ts2], np.array(dft.iloc[0,2:8])), axis=None)
print(r0)
In the following the question mark denotes values that you have to determine.

- ... the taxis datasets.
- What do % and // do in python? (Hint: print(10//7,10%7)).
- Use datetime to find the elapsed time in seconds for row 256. For the last, use timestamp() to represent the date-time values.

Also, look at what these commands do. They might help you join two arrays without explicitly using concatenate.
A = np.array([1,2])
B = np.array([3,4,5,6,7,8])
print(A,B)
C = np.r_[A,B]
print(C)
# put your working in here - make new cells if you like
print(10//7,10%7)
print('dft[252:256] = ')
dft[252:256]
ts1 = datetime.strptime(str(dft.iat[255,0]), '%Y-%m-%d %H:%M:%S' ).timestamp()
ts2 = datetime.strptime(str(dft.iat[255,1]), '%Y-%m-%d %H:%M:%S' ).timestamp()
print('elapsed time in seconds: ', ts2-ts1)
mins = (ts2-ts1) // 60
secs = (ts2-ts1) % 60
print('elapsed time in min:secs: ', mins, ':', secs)
dft[13:14]
ts1 = datetime.strptime(str(dft.iat[13,0]), '%Y-%m-%d %H:%M:%S' ).timestamp()
ts2 = datetime.strptime(str(dft.iat[13,1]), '%Y-%m-%d %H:%M:%S' ).timestamp()
ts12 = np.array([ts1,ts2])
print(ts12)
row = np.array(dft.iloc[13,2:8])
print(row)
row = np.r_[ts12,row]
print(row)
THINK ABOUT: What information could be lost as a result of converting the date-time to a number?
For the taxis
data set:
dft = sns.load_dataset('taxis')
dft.head()
sns.scatterplot(data=dft, x="dropoff_borough", y="tip")
sns.scatterplot(data=dft, x="distance", y="fare")
# put your working in here - make new cells if you like
For the tips
data set:
dftp = sns.load_dataset('tips')
dftp.describe()
sns.scatterplot(data=dftp, x="total_bill", y="tip")
sns.scatterplot(data=dftp, x="day", y="total_bill")
sns.scatterplot(data=dftp, x="sex", y="tip")
# put your working in here - make new cells if you like
Work through the following material and have a go at the questions at the end. Make a note of anything you don't understand, and ask in the next session.
The anscombe data set

As discussed in the lectures, this is pretty famous. There was a lot to take in during the walk-through of the lecture notebook so this is another opportunity to slow things down and read at your own pace. See https://en.wikipedia.org/wiki/Anscombe%27s_quartet.
Image Credit: https://upload.wikimedia.org/wikipedia/commons/7/7e/Julia-anscombe-plot-1.png
dfa = sns.load_dataset('anscombe')
# look at how we get an apostrophe...
print("The size of Anscombe's data set is:", dfa.shape)
Let's take a look at the data set - we can look at the head and tail of the table just as we did above.
dfa.head()
dfa.tail()
It looks like the four data sets are in the dataset
column. How can we extract them as separate items?
Well, one way is to print the whole dataset and see which rows correspond to each dataset. Like this...
print(dfa)
From this we can see that there are four data sets: I, II, III and IV. They each contain $11$ pairs $(x,y)$.
However, this kind of technique is not going to be useful if we have a data set with millions of data points (rows). We certainly won't want to print them all like we did above.
Is there another way to determine the number of distinct feature values in a given column of the data frame?
Fortunately, yes. We want to know how many different values the dataset
column
has. We can do it like this.
dfa.dataset.unique()
We can count the number of different ones automatically too, by asking
for the shape
of the returned value. Here we go:
dfa.dataset.unique().shape
This tells us that there are 4 items - as expected.
Don't worry too much about it saying (4,) rather than just 4.
We've seen what shape
refers to earlier.
Now, we want to extract each of the four datasets as separate data sets so we can work
with them. We can do that by using loc
to get the row-wise locations where each
value of the dataset
feature is the same.
(Ref: https://stackoverflow.com/questions/17071871/how-do-i-select-rows-from-a-dataframe-based-on-column-values)
For example, to get the data for the sub-data-set I
we can do this:
dfa.loc[dfa['dataset'] == 'I']
Now we have this subset of data we can examine it - with a scatter plot for example.
sns.scatterplot(data=dfa.loc[dfa['dataset'] == 'I'], x="x", y="y")
To really work properly with each subset we should extract them and give each of them a name that is meaningful.
# On the other hand:
dfa1 = dfa.loc[dfa['dataset'] == 'I']
dfa2 = dfa.loc[dfa['dataset'] == 'II']
dfa3 = dfa.loc[dfa['dataset'] == 'III']
dfa4 = dfa.loc[dfa['dataset'] == 'IV']
sns.scatterplot(data=dfa1, x="x", y="y")
dfa1.describe()
sns.scatterplot(data=dfa2, x="x", y="y")
dfa2.describe()
sns.scatterplot(data=dfa3, x="x", y="y")
dfa3.describe()
sns.scatterplot(data=dfa4, x="x", y="y")
dfa4.describe()
For the Anscombe data set:

Look at the diamonds data set.

1:
dfd = sns.load_dataset('diamonds')
dfd.shape

This gives 53940 rows and 10 columns.

2:
sns.scatterplot(data=dfd, x="carat", y="price")