https://variationalform.github.io/
https://github.com/variationalform
Simon Shaw https://www.brunel.ac.uk/people/simon-shaw.
This work is licensed under CC BY-SA 4.0 (Attribution-ShareAlike 4.0 International). Visit http://creativecommons.org/licenses/by-sa/4.0/ to see the terms.
This document uses python and also makes use of LaTeX in Markdown.
This worksheet is based on the material in the notebooks. Note that while the 'lecture' notebooks are prefixed with 1_, 2_ and so on, to indicate the order in which they should be studied, the worksheets are prefixed with A_, B_, ...
Refer back to the section where we introduced the use of numpy
for vector calculations.
There we had set up the vectors $\boldsymbol{a}$ and $\boldsymbol{p}$ and illustrated arithmetic like this:
import numpy as np
a = np.array([3, -2, 1])
p = np.array([5, 2, -10])
g = a-p
print(g)
a = g+p
print(a)
1. Repeat the calculations above but using numpy column arrays.
2. Use np.sqrt(16) to print out $\sqrt{16}$ (e.g. https://numpy.org/doc/stable/reference/generated/numpy.sqrt.html).
3. Use np.power(2,3) to print out $2^3$ (e.g. https://numpy.org/doc/stable/reference/generated/numpy.power.html).
4. Use np.pi to print out the area of a circle with radius $5$ (e.g. https://numpy.org/doc/stable/reference/constants.html).
5. Set up $\boldsymbol{x}$ and $\boldsymbol{y}$ as numpy row arrays.
6. Print out $\boldsymbol{x}+2\boldsymbol{y}$.
7. Repeat 5 and 6 but with $x$ and $y$ as numpy column arrays.
# put your working in here - make new cells if you like
a = np.array([[3], [-2], [1]])
p = np.array([[5], [2], [-10]])
g = a-p
print(g)
a = g+p
print(a)
print('sqrt(16) = ', np.sqrt(16))
print('2 cubed = ', np.power(2,3))
print('pi*r-squared =', np.pi*5**2)
x = np.array([3, -5, np.pi, 7.2, np.sqrt(9)])
y = np.array([-3, 16, 1, 1089, 15])
print ('x + 2y = ', x + 2*y)
x.shape = (5,1)
y.shape = (5,1)
print ('x + 2y = \n', x + 2*y)
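A note on the shape assignment used above: `x.shape = (5,1)` reshapes the array in place. An equivalent that leaves the original untouched is `reshape()`, which returns a reshaped view of the same data. A minimal sketch, assuming numpy:

```python
import numpy as np

x = np.array([3.0, -5.0, 7.2])
col = x.reshape(3, 1)        # a column view of the same data; x itself is unchanged
print(col.shape)             # (3, 1)
print(x.shape)               # still (3,)
print((col + 2*col).T)       # arithmetic on columns works just as it does on rows
```

Either form works for the exercises here; `reshape()` is handy when you want to keep the row version around as well.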
What is special about 1089?
Take three integers from $\{1,2,3,4,5,6,7,8,9\}$ and use them to make a three digit number. Reverse the digits and subtract the smaller from the larger to get a new three digit number (put a zero in front if it is only two digits). Add this number to its own reversal.
For example, $2$, $6$, $8$ gives $268$. Reversing and subtracting gives $862-268=594$. Reversing and adding then gives $594+495 = ?$
Try this for other numbers - is it always the same?
print(862-268)
print(594+495)
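Rather than trying choices one at a time, we can let python check every choice for us. A quick brute-force sketch (note the trick needs the first and last digits to differ, otherwise the subtraction gives zero):

```python
# Collect the final answer for every three-digit choice a,b,c from {1,...,9}.
results = set()
for a in range(1, 10):
    for b in range(1, 10):
        for c in range(1, 10):
            if a == c:
                continue                      # reversal equals the number: difference is 0
            n = 100*a + 10*b + c
            d = abs(n - (100*c + 10*b + a))   # subtract the smaller from the larger
            d3 = str(d).zfill(3)              # put a zero in front if only two digits
            results.add(d + int(d3[::-1]))    # add the number to its own reversal
print(results)                                # every valid choice gives the same answer
```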
We saw the vector $p$-norm for any $p\ge 1$ given by
$$ \Vert\boldsymbol{v}\Vert_p = \left\{\begin{array}{ll}\displaystyle \sqrt[p]{\vert v_1\vert^p + \vert v_2\vert^p +\cdots+\vert v_n\vert^p}, &\text{if } 1\le p < \infty; \\ \max\{\vert v_k\vert\colon k=1,2,\ldots,n\}, &\text{if } p = \infty. \end{array}\right. $$

Although $p<1$ is not allowed here, we do sometimes use phoney norms for cases when $p<1$.
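As a concrete check of this formula, assuming numpy, we can compare the definition with np.linalg.norm() for $p=3$, and check the infinity norm against the largest absolute entry:

```python
import numpy as np

v = np.array([3.0, -4.0, 12.0])
p = 3
by_hand = (np.abs(v)**p).sum()**(1.0/p)   # p-th root of the sum of |v_k|^p
builtin = np.linalg.norm(v, p)            # norm() takes the order p as its second argument
print(by_hand, builtin)                   # the two values agree

# the infinity norm is just the largest absolute entry
print(np.abs(v).max(), np.linalg.norm(v, np.inf))
```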
Using numpy.linalg.norm() with suitable values of its order argument, calculate these quantities and check your answers:
Hint: https://numpy.org/doc/stable/reference/generated/numpy.log.html
Furthermore, use numpy
to calculate these phoney norms:
Hint: https://numpy.org/doc/stable/reference/generated/numpy.sqrt.html
# put your working in here - make new cells if you like
import numpy as np
w = np.array([3, 0, -5, 4, np.pi**2, 0, -23, 7, -99])
print(w)
print('||w||_2 = ', np.linalg.norm(w,2))
print('||w||_1 = ', np.linalg.norm(w,1))
print('||w||_inf = ', np.linalg.norm(w,np.inf)) # note how we denote infinity
print('||w||_3 = ', np.linalg.norm(w,3))
print('||w||_ln7 = ', np.linalg.norm(w,np.log(7)))
print('||w||_23 = ', np.linalg.norm(w,23))
print('||w||_0.414... = ', np.linalg.norm(w,np.sqrt(2)-1))
print('||w||_0 = ', np.linalg.norm(w,0))
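Why is $p<1$ only a 'phoney' norm? One reason is that the triangle inequality $\Vert\boldsymbol{u}+\boldsymbol{v}\Vert \le \Vert\boldsymbol{u}\Vert + \Vert\boldsymbol{v}\Vert$ can fail. A quick check with $p=1/2$, assuming numpy:

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
p = 0.5                                              # norm() accepts non-integer orders, even p < 1
lhs = np.linalg.norm(u + v, p)                       # (1**0.5 + 1**0.5)**(1/0.5) = 4
rhs = np.linalg.norm(u, p) + np.linalg.norm(v, p)    # 1 + 1 = 2
print(lhs, '>', rhs)                                 # the triangle inequality fails for p = 1/2
```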
We look back at some of the data and learn a bit more about python and how to manipulate raw data. We start with the taxis data sets and see how we can convert some non-numeric data into numeric form.
import seaborn as sns
dft = sns.load_dataset('taxis')
dft.head(9)
In the following the question mark denotes values that you have to determine.

- Use dft.shape to determine and print the number of rows and columns in the data set.
- Use dft.shape[?] to determine and print the number of rows in the data set.
- Use dft.shape[?] to determine and print the number of columns in the data set.
- Use dft.iat[?,?] to print out the contents of the fourth column of the third row.
- Use dft.loc[?].iat[?] to print out the contents of the twelfth column of the sixth row.
- Use dft.loc[?] to print out the contents of the ninth row.

# put your working in here - make new cells if you like
print('The number of rows and columns are: ', dft.shape)
print('The number of rows is: ', dft.shape[0])
print('The number of columns is: ', dft.shape[1])
print('dft.iat[2,3] = ', dft.iat[2,3])
print('dft.loc[5].iat[11] = ', dft.loc[5].iat[11],'\n')
print('dft.loc[8] = ')
dft.loc[8]
datetime

We have already seen that the first column in the dft.head() output can be ignored - that is just a label for each observation and has nothing to do with the taxi ride data.
The pickup and dropoff columns are dates and times, and we ignored them earlier when we first worked with this data set because they aren't simple numerical values that can be put into a vector.
However, if we think of those as the number of seconds since 1 January 1970 then they are also just numbers (integers if we ignore fractions of a second). These two numbers can then go in a vector, just as those values in columns three to eight did.
Let's see how to convert the dates and times to simpler integers.
The following material on date-time conversion was adapted from that at
https://docs.python.org/3/library/datetime.html and
https://stackoverflow.com/questions/11743019/convert-python-datetime-to-epoch-with-strftime -
it uses the datetime
module.
We'll remind ourselves of the first few data points, and then import datetime.
dft.head(3)
# here is a useful alternative
dft[0:3]
Show only rows 5 to 9 (inclusive) of the data set.
Hint: do you think that [4:9]
refers to rows four to nine?
If not, then what?
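The hint can be checked with a plain python list: a slice [a:b] takes positions a up to, but not including, b. A minimal sketch:

```python
rows = list(range(10))   # stand-in labels 0,1,...,9, like the data-frame's row labels
print(rows[4:9])         # positions 4,5,6,7,8 - i.e. the fifth to ninth items
```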
# put your working in here - make new cells if you like
# import the module
from datetime import datetime
# and show the first few of rows again
dft[0:3]
# use the pickup time from the first data point
put = datetime(2019,3,23,20,21,9)
print('the date and time are ', put)
# the elapsed time since 1 January 1970 is
print('elapsed time since 1 January 1970 = ', (put - datetime(1970,1,1)).total_seconds() )
# a quicker way to get this is
print('timestamp = ', put.timestamp())
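The conversion also runs in reverse: datetime.fromtimestamp() turns seconds-since-the-epoch back into a date and time. It is interpreted in local time, so the round trip below assumes the local timezone is used consistently on both sides:

```python
from datetime import datetime

put = datetime(2019, 3, 23, 20, 21, 9)
ts = put.timestamp()                 # seconds since the epoch
print(datetime.fromtimestamp(ts))    # recovers the original date and time
```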
Now, we don't want to be typing all of this ourselves so let's start to automate the process.
First, we use dft.iat[0,0]
to get the first element in the first row of the
data-frame. We print it out and see that it is just a character string giving
the date and time we saw above.
print(dft.iat[0,0])
This string is dft.iat[0,0], which we interpret with the format %Y-%m-%d %H:%M:%S; we can then use strptime() to get the datetime variable. This page, https://www.digitalocean.com/community/tutorials/python-string-to-datetime-strptime, was a useful source for writing these notes.
Once done, it's then easy to get the timestamp - just as above.
dt = datetime.strptime(str(dft.iat[0,0]), '%Y-%m-%d %H:%M:%S' )
print('The date and time are ', dt, ' with timestamp ', dt.timestamp())
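The inverse of strptime() is strftime(): given a datetime variable and the same format string, it produces the string back again.

```python
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S'
dt = datetime.strptime('2019-03-23 20:21:09', fmt)
s = dt.strftime(fmt)     # format the datetime back into a string
print(s)                 # -> 2019-03-23 20:21:09
```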
Note that we had to write str(dft.iat[0,0]) rather than just dft.iat[0,0].
This is because (at least at the time of writing) if we try to run this
notebook in binder we will get an error.
The str()
function converts its argument into a string. The next two commands show the
type
of dft.iat[0,0]
. If you see str
(i.e. string) for both then you don't need
to replace dft.iat[0,0]
with str(dft.iat[0,0])
.
In binder these don't both give str
, and so we have to force it to be a string.
type( str(dft.iat[0,0]) )
type(dft.iat[0,0])
Let's remind ourselves of the first line of the data, and then get the timestamps for the first two entries:
dft[0:1]
print(datetime.strptime( str(dft.iat[0,0]), '%Y-%m-%d %H:%M:%S' ).timestamp())
print(datetime.strptime( str(dft.iat[0,1]), '%Y-%m-%d %H:%M:%S' ).timestamp())
The time taken between pickup and dropoff, 2019-03-23 20:21:09
and
2019-03-23 20:27:24
is $6$ minutes and $15$ seconds. This is
print(6*60+15,' seconds')
Now let's look at the difference between the two time stamps
ts1 = datetime.strptime( str(dft.iat[0,0]), '%Y-%m-%d %H:%M:%S' ).timestamp()
ts2 = datetime.strptime( str(dft.iat[0,1]), '%Y-%m-%d %H:%M:%S' ).timestamp()
print('ts 1 = ', ts1)
print('ts 2 = ', ts2)
print('ts2-ts1 = ', ts2-ts1)
and so the timestamp is just the conversion of a date to seconds.
You may wonder, then, when was zero seconds? Well, it was midnight on 1 January 1970 UTC (Coordinated Universal Time). To see this, we need to make sure we specify the correct timezone and then:
from datetime import timezone
print(datetime(1970,1,1,0,0,0,tzinfo=timezone.utc).timestamp())
So, here then is our vector for the zero-th taxi data. Note that it now includes the first two values as well as the six we had in the lecture.
Note that we use concatenate
to join two numpy
arrays.
r0 = np.array(dft.iloc[0,2:8])
print(r0)
r0 = np.concatenate(([ts1, ts2], np.array(dft.iloc[0,2:8])), axis=None)
print(r0)
In the following the question mark denotes values that you have to determine.

- ... the taxis datasets.
- What do % and // do in python? (Hint: print(10//7,10%7)).
- Use datetime to find the elapsed time in seconds for row 256. For the last, use timestamp() to represent the date-time values.

Also, look at what these commands do. They might help you join two arrays without explicitly using concatenate.
A = np.array([1,2])
B = np.array([3,4,5,6,7,8])
print(A,B)
C = np.r_[A,B]
print(C)
# put your working in here - make new cells if you like
print(10//7,10%7)
print('dft[252:256] = ')
dft[252:256]
ts1 = datetime.strptime(str(dft.iat[255,0]), '%Y-%m-%d %H:%M:%S' ).timestamp()
ts2 = datetime.strptime(str(dft.iat[255,1]), '%Y-%m-%d %H:%M:%S' ).timestamp()
print('elapsed time in seconds: ', ts2-ts1)
mins = (ts2-ts1) // 60
secs = (ts2-ts1) % 60
print('elapsed time in min:secs: ', mins, ':', secs)
dft[13:14]
ts1 = datetime.strptime(str(dft.iat[13,0]), '%Y-%m-%d %H:%M:%S' ).timestamp()
ts2 = datetime.strptime(str(dft.iat[13,1]), '%Y-%m-%d %H:%M:%S' ).timestamp()
ts12 = np.array([ts1,ts2])
print(ts12)
row = np.array(dft.iloc[13,2:8])
print(row)
row = np.r_[ts12,row]
print(row)
THINK ABOUT: What information could be lost as a result of converting the date-time to a number?
For the taxis
data set:
dft = sns.load_dataset('taxis')
dft.head()
sns.scatterplot(data=dft, x="dropoff_borough", y="tip")
sns.scatterplot(data=dft, x="distance", y="fare")
# put your working in here - make new cells if you like
For the tips
data set:
dftp = sns.load_dataset('tips')
dftp.describe()
sns.scatterplot(data=dftp, x="total_bill", y="tip")
sns.scatterplot(data=dftp, x="day", y="total_bill")
sns.scatterplot(data=dftp, x="sex", y="tip")
# put your working in here - make new cells if you like
Work through the following material and have a go at the questions at the end. Make a note of anything you don't understand, and ask in the next session.
The anscombe data set

As discussed in the lectures, this is pretty famous. There was a lot to take in during the walk-through of the lecture notebook so this is another opportunity to slow things down and read at your own pace. See https://en.wikipedia.org/wiki/Anscombe%27s_quartet.
Image Credit: https://upload.wikimedia.org/wikipedia/commons/7/7e/Julia-anscombe-plot-1.png
dfa = sns.load_dataset('anscombe')
# look at how we get an apostrophe...
print("The size of Anscombe's data set is:", dfa.shape)
Let's take a look at the data set - we can look at the head and tail of the table just as we did above.
dfa.head()
dfa.tail()
It looks like the four data sets are in the dataset
column. How can we extract them as separate items?
Well, one way is to print the whole dataset and see which rows correspond to each dataset. Like this...
print(dfa)
From this we can see that there are four data sets: I, II, III and IV. They each contain $11$ pairs $(x,y)$.
However, this kind of technique is not going to be useful if we have a data set with millions of data points (rows). We certainly won't want to print them all like we did above.
Is there another way to determine the number of distinct feature values in a given column of the data frame?
Fortunately, yes. We want to know how many different values the dataset
column
has. We can do it like this.
dfa.dataset.unique()
We can count the number of different ones automatically too, by asking
for the shape
of the returned value. Here we go:
dfa.dataset.unique().shape
This tells us that there are 4 items - as expected.
Don't worry too much about it saying (4,) rather than just 4.
We've seen what shape
refers to earlier.
Now, we want to extract each of the four datasets as separate data sets so we can work
with them. We can do that by using loc
to get the row-wise locations where each
value of the dataset
feature is the same.
(Ref: https://stackoverflow.com/questions/17071871/how-do-i-select-rows-from-a-dataframe-based-on-column-values)
For example, to get the data for the sub-data-set I
we can do this:
dfa.loc[dfa['dataset'] == 'I']
Now we have this subset of data we can examine it - with a scatter plot for example.
sns.scatterplot(data=dfa.loc[dfa['dataset'] == 'I'], x="x", y="y")
To really work properly with each subset we should extract them and give each of them a name that is meaningful.
# On the other hand:
dfa1 = dfa.loc[dfa['dataset'] == 'I']
dfa2 = dfa.loc[dfa['dataset'] == 'II']
dfa3 = dfa.loc[dfa['dataset'] == 'III']
dfa4 = dfa.loc[dfa['dataset'] == 'IV']
sns.scatterplot(data=dfa1, x="x", y="y")
dfa1.describe()
sns.scatterplot(data=dfa2, x="x", y="y")
dfa2.describe()
sns.scatterplot(data=dfa3, x="x", y="y")
dfa3.describe()
sns.scatterplot(data=dfa4, x="x", y="y")
dfa4.describe()
For the Anscombe data set:

Look at the diamonds data set.

1:
dfd = sns.load_dataset('diamonds')
dfd.shape

This gives 53940 rows and 10 columns.

2:
sns.scatterplot(data=dfd, x="carat", y="price")