13  Pandas

Jupyter Notebook

Pandas is a Python package built to work with spreadsheet-like data, and it is very good at its job. Pandas stores data in something called a “dataframe”. A dataframe is simply data stored in rows and columns. As an example, here is some sample data taken by an accelerometer sitting on an elevator floor:

time,gFx,gFy,gFz
0.007,-0.0056,-0.0046,1.012
0.008,0.0007,0.0024,1.0022
0.008,0,0.0059,1.0039
0.009,0.0054,-0.0022,1.0032
0.009,-0.0015,-0.0056,1.0042
0.009,0.0037,-0.002,0.9951
0.01,-0.002,-0.002,1.002
0.014,0.009,-0.0024,1.0159
0.015,0.0012,-0.0037,1.01
0.017,-0.0115,-0.002,1.0012
0.019,-0.0022,-0.0015,1.001
0.021,0.0024,-0.0022,1.0166

and here is the same data loaded into a dataframe.

   time     gFx     gFy     gFz  
0  0.007 -0.0056 -0.0046  1.0120  
1  0.008  0.0007  0.0024  1.0022  
2  0.008  0.0000  0.0059  1.0039  
3  0.009  0.0054 -0.0022  1.0032  
4  0.009 -0.0015 -0.0056  1.0042  
5  0.009  0.0037 -0.0020  0.9951  
6  0.010 -0.0020 -0.0020  1.0020  
7  0.014  0.0090 -0.0024  1.0159  
8  0.015  0.0012 -0.0037  1.0100  
9  0.017 -0.0115 -0.0020  1.0012 

Every dataframe has labels attached to its columns and rows. In this example, the row labels are just the first 10 integers and the column labels are “time”, “gFx”, “gFy”, and , “gFz”. Labeling the rows and columns is nice because you can access data using the row and column labels instead of indices.

13.1 Creating dataframes

A dataframe can be initiated in many ways but the most common are: from file, from a dictionary, and from an array or list. We’ll discuss each way separately.

13.1.1 Reading from a .csv file.

The most-used function from the pandas module is read_csv which is used to read a csv-formatted file into a data frame. To use it, simply call the pandas.read_csv function and pass in the path to the .csv file. (Here is a link to the file.)

from pandas import read_csv

elevator_data = read_csv("elevator.csv",index_col = 0)
print(elevator_data)
    time     gFx     gFy     gFz
0  0.007 -0.0056 -0.0046  1.0120
1  0.008  0.0070  0.0024  1.0022
2  0.008  0.0000  0.0059  1.0039
3  0.009  0.0054 -0.0022  1.0032
4  0.009 -0.0015 -0.0056  1.0042
5  0.009  0.0037 -0.0020  0.9951
6  0.010 -0.0020 -0.0020  1.0020
7  0.014  0.0090 -0.0025  1.0159
8  0.015  0.0012 -0.0037  1.0100
9  0.017 -0.0115 -0.0020  1.0012

The keyword argument index_col = 0 indicates that the row labels should be taken from the first column in the csv file. There are many, many possible keyword arguments that can be used to customize the way read_csv reads a file into a dataframe. I’ll highlight just a few and refer you to the documentation for the rest:

  1. delimiter- use this to specify the character that separates the data from each other. The default is “,” for .csv files.
  2. header - use this to specify which row contains the column names. Usually this occurs on the first row (header = 0) but not always.
  3. usecols - use this to specify which columns from the file should be included in the dataframe.
  4. skiprows - line numbers to skip when building the dataframe. Can be either a single integer (skip the first n lines) or a list of integers and it will skip all rows in the list.

The rename() function can be used if you don’t like the default index values (integers if you read from file) and want to reassign them.

from pandas import read_csv

elevator_data = read_csv("elevator.csv",index_col = 0)
elevator_data = elevator_data.rename(index = {0:"A",1:"B",2:"C",3:"D",4:"E",5:"F",6:"G",7:"H",8:"I",9:"J"})
print(elevator_data)
    time     gFx     gFy     gFz
A  0.007 -0.0056 -0.0046  1.0120
B  0.008  0.0070  0.0024  1.0022
C  0.008  0.0000  0.0059  1.0039
D  0.009  0.0054 -0.0022  1.0032
E  0.009 -0.0015 -0.0056  1.0042
F  0.009  0.0037 -0.0020  0.9951
G  0.010 -0.0020 -0.0020  1.0020
H  0.014  0.0090 -0.0025  1.0159
I  0.015  0.0012 -0.0037  1.0100
J  0.017 -0.0115 -0.0020  1.0012

13.1.2 Create from a dictionary

If your data is in a dictionary you can use the DataFrame function (case sensitive) to initialize the dataframe. Key values in the dictionary correspond to columns in the dataframe.

from pandas import DataFrame
elevator = {"time":[0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017],"gFx":[-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115],"gFy":[-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002],"gFz":[1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]}
elevator_data = DataFrame(elevator)
print(elevator_data)
    time     gFx     gFy     gFz
0  0.007 -0.0056 -0.0046  1.0120
1  0.008  0.0070  0.0024  1.0022
2  0.008  0.0000  0.0059  1.0039
3  0.009  0.0054 -0.0022  1.0032
4  0.009 -0.0015 -0.0056  1.0042
5  0.009  0.0037 -0.0020  0.9951
6  0.010 -0.0020 -0.0020  1.0020
7  0.014  0.0090 -0.0025  1.0159
8  0.015  0.0012 -0.0037  1.0100
9  0.017 -0.0115 -0.0020  1.0012

The row indices will default to the a set of integers starting at 0. If you want to index the rows with other labels, use the index keyword argument.

from pandas import DataFrame
elevator = {"time":[0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017],"gFx":[-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115],"gFy":[-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002],"gFz":[1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]}
elevator_data = DataFrame(elevator,index = ["A","B","C","D","E","F","G","H","J","K"])
print(elevator_data)
    time     gFx     gFy     gFz
A  0.007 -0.0056 -0.0046  1.0120
B  0.008  0.0070  0.0024  1.0022
C  0.008  0.0000  0.0059  1.0039
D  0.009  0.0054 -0.0022  1.0032
E  0.009 -0.0015 -0.0056  1.0042
F  0.009  0.0037 -0.0020  0.9951
G  0.010 -0.0020 -0.0020  1.0020
H  0.014  0.0090 -0.0025  1.0159
J  0.015  0.0012 -0.0037  1.0100
K  0.017 -0.0115 -0.0020  1.0012

13.1.3 Create from a list

Sometimes you have data in a list or multiple list and would like to combine all of that data and form a dataframe. This can also be done using the DataFrame function (case sensitive remember!). When initializing with lists, you have to also use the columns keyword argument to specify what you want the column labels to be.

from pandas import DataFrame
from numpy import transpose

time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]
gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]
gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]
gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]
elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = ["time", "gFx","gFy", "gFz"])

The row labels will default to integers unless you specify otherwise with the index keyword.

from pandas import DataFrame
from numpy import transpose

time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]
gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]
gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]
gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]
elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = ["time", "gFx","gFy", "gFz"],index = ["A","B","C","D","E","F","G","H","J","K"])

13.2 Extracting data

Extracting data from the dataframe could mean several things. The most common possibilities include:

  1. accessing a single number using row and column labels or row and column indices.
  2. accessing one or several columns of the dataframe.
  3. accessing one or several rows of the dataframe.
  4. slicing from the “middle” of the dataframe. (i.e. Not entire columns or entire rows.)
  5. accessing only elements in the dataframe that meet a certain criteria.

We’ll cover each of these tasks one at a time.

13.2.1 Extracting General Information

Sometimes the dataframe is quite large and you’d like to inspect just a small portion of it. You can use dataframe.head(n) to look at the first n rows in the dataframe and dataframe.tail(n) to look at the last n rows.

elevator_data.head(3)
elevator_data.tail(4)
time gFx gFy gFz
G 0.010 -0.0020 -0.0020 1.0020
H 0.014 0.0090 -0.0025 1.0159
J 0.015 0.0012 -0.0037 1.0100
K 0.017 -0.0115 -0.0020 1.0012

To get a list of the column or row labels, use the variables dataframe.columns and dataframe.index.

elevator_data.columns
elevator_data.index
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K'], dtype='object')

13.2.2 Extracting a single number

To extract a single number from a dataframe, use the dataframe.at[] or dataframe.iat[] objects. The at object should be used if you want to locate the number using row and column labels and iat should be used when you want to access the number using row and column indices. Let’s see an example:

from pandas import DataFrame
from numpy import transpose

time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]
gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]
gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]
gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]
elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = ["time", "gFx","gFy", "gFz"],index = ["A","B","C","D","E","F","G","H","J","K"])

elevator_data.at["A","time"]
elevator_data.iat[0,0]
0.007

The loc and iloc dictionaries (see below) can also be used to extract a single entry in the dataframe.

13.2.3 Extracting entire columns

An entire column of the dataframe can be extract by indexing it like a dictionary, using the column name as the key value

elevator_data["gFz"]
elevator_data["time"]
elevator_data["time"][0:3]

You can bundle the column names into a list and extract multiple columns at once.

elevator_data[["gFx", "gFy", "gFz"]]

13.2.4 Accessing entire rows

Accessing rows in a dataframe is done with the help of the dataframe.loc[n] or dataframe.iloc[n] dictionary. loc should be used if you want to slice using row and column labels whereas iloc should be used if you want to slice out using indices.

from pandas import DataFrame
from numpy import transpose

time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]
gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]
gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]
gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]
elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = ["time", "gFx","gFy", "gFz"],index = ["A","B","C","D","E","F","G","H","J","K"])

elevator_data.loc["A"]
elevator_data.iloc[0]  
time    0.0070
gFx    -0.0056
gFy    -0.0046
gFz     1.0120
Name: A, dtype: float64

Multiple rows can be accessed using the A:B syntax to slice out the desired rows.

from pandas import DataFrame
from numpy import transpose

time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]
gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]
gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]
gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]
elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = ["time", "gFx","gFy", "gFz"],index = ["A","B","C","D","E","F","G","H","J","K"])

elevator_data.loc["A":"D"]
elevator_data.iloc[0:4]  
time gFx gFy gFz
A 0.007 -0.0056 -0.0046 1.0120
B 0.008 0.0070 0.0024 1.0022
C 0.008 0.0000 0.0059 1.0039
D 0.009 0.0054 -0.0022 1.0032

13.2.5 Slicing rows and columns

The loc and iloc dictionaries can be used to slice across rows and columns. Simple separate the two slices with a comma:

from pandas import DataFrame
from numpy import transpose

time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]
gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]
gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]
gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]
elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = ["time", "gFx","gFy", "gFz"],index = ["A","B","C","D","E","F","G","H","J","K"])

elevator_data.loc["A":"D","time":"gFy"]
elevator_data.iloc[0:4,0:2]  
time gFx
A 0.007 -0.0056
B 0.008 0.0070
C 0.008 0.0000
D 0.009 0.0054

13.2.6 Boolean Slicing

We have already seen boolean slicing in Numpy Arrays and we can use something similar on Pandas. Let’s see an example:

from pandas import DataFrame
from numpy import transpose

time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]
gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]
gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]
gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]
elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = ["time", "gFx","gFy", "gFz"],index = ["A","B","C","D","E","F","G","H","J","K"])

elevator_data[elevator_data["time"] < 0.01]  # Slice only the rows where time < 0.01
time gFx gFy gFz
A 0.007 -0.0056 -0.0046 1.0120
B 0.008 0.0070 0.0024 1.0022
C 0.008 0.0000 0.0059 1.0039
D 0.009 0.0054 -0.0022 1.0032
E 0.009 -0.0015 -0.0056 1.0042
F 0.009 0.0037 -0.0020 0.9951

The statement elevator_data["time"] < 0.01 produces a boolean sequence which can be used as a set of indices to access only those entries where time is less than 0.01.

More complex boolean slicing can be done with the use of the query function which allows you to be more specific about your boolean conditions. The example below will extract all rows in the dataframe where time > 0.008 and gFz is greater than 1.

from pandas import DataFrame
from numpy import transpose

time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]
gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]
gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]
gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]
elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = ["time", "gFx","gFy", "gFz"],index = ["A","B","C","D","E","F","G","H","J","K"])

elevator_data.query("time > 0.008 and gFz > 1")
time gFx gFy gFz
D 0.009 0.0054 -0.0022 1.0032
E 0.009 -0.0015 -0.0056 1.0042
G 0.010 -0.0020 -0.0020 1.0020
H 0.014 0.0090 -0.0025 1.0159
J 0.015 0.0012 -0.0037 1.0100
K 0.017 -0.0115 -0.0020 1.0012

13.2.7 Special Characters in Column Name

Often you will find that your column labels will contain spaces and/or other special characters in them. To correctly specify the column label in these cases, you must enclose the name in graves accents (It’s located above the tab key on your keyboard.). For example, if your column label was ” time-s” instead of just “time” like in the example above, you could use the query function like this:

from pandas import DataFrame
from numpy import transpose

time = [0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017]
gFx = [-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115]
gFy = [-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002]
gFz = [1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012]
elevator_data = DataFrame(transpose([time,gFx,gFy,gFz]),columns = [" time-s", "gFx","gFy", "gFz"],index = ["A","B","C","D","E","F","G","H","J","K"])

elevator_data.query("` time-s` > 0.008 and gFz > 1")
time-s gFx gFy gFz
D 0.009 0.0054 -0.0022 1.0032
E 0.009 -0.0015 -0.0056 1.0042
G 0.010 -0.0020 -0.0020 1.0020
H 0.014 0.0090 -0.0025 1.0159
J 0.015 0.0012 -0.0037 1.0100
K 0.017 -0.0115 -0.0020 1.0012

13.2.8 Using Query to match strings

Often your dataframe will contain strings as the data instead of numbers. If you are trying to use query to find dataframe entries that contain a specified string, you’ll have enclose the boolean expression in single quotes (’’) and enclose the string you are trying to match in double quotes (““). As an example, imagine a dataframe very similar to the one used in the previous example, but with an additional column that specifies whether the object

from pandas import DataFrame
from numpy import transpose

dataDict = {" time-s":[0.007,0.008,0.008,0.009,0.009,0.009,0.01,0.014,0.015,0.017], "gFx":[-0.0056,0.007,0,0.0054,-0.0015,0.0037,-0.002,0.009,0.0012,-0.0115],"gFy":[-0.0046,0.0024,0.0059,-0.0022,-0.0056,-0.002,-0.002,-0.0025,-0.0037,-0.002],"gFz":[1.012,1.0022,1.0039,1.0032,1.0042,0.9951,1.002,1.0159,1.01,1.0012], "Here":["Yes","Yes","No","No","Yes","No","Yes","No","Yes","No"]}

elevator_data = DataFrame(dataDict,index = ["A","B","C","D","E","F","G","H","J","K"])

elevator_data.query('` time-s` > 0.008 and Here == "No"')
time-s gFx gFy gFz Here
D 0.009 0.0054 -0.0022 1.0032 No
F 0.009 0.0037 -0.0020 0.9951 No
H 0.014 0.0090 -0.0025 1.0159 No
K 0.017 -0.0115 -0.0020 1.0012 No

To Do:

  1. Add print statements to the cell above until you understand what each line of code does.
  2. Add comments next to the line of code to help you remember.

13.3 Performing Calculations

Dataframes are similar to numpy arrays in that you can do math across an entire dataset. This makes mathematical calculations very easy. Look at the example below and try to guess what calculation is being performed

from numpy import sqrt

elevator_data[["gFx", "gFy", "gFz"]] *= 9.8
a_magnitude  = sqrt(elevator_data[["gFx", "gFy", "gFz"]]**2).sum(axis = 1)

max_accel = a_magnitude.max()

elevator_data
time-s gFx gFy gFz Here
A 0.007 -0.05488 -0.04508 9.91760 Yes
B 0.008 0.06860 0.02352 9.82156 Yes
C 0.008 0.00000 0.05782 9.83822 No
D 0.009 0.05292 -0.02156 9.83136 No
E 0.009 -0.01470 -0.05488 9.84116 Yes
F 0.009 0.03626 -0.01960 9.75198 No
G 0.010 -0.01960 -0.01960 9.81960 Yes
H 0.014 0.08820 -0.02450 9.95582 No
J 0.015 0.01176 -0.03626 9.89800 Yes
K 0.017 -0.11270 -0.01960 9.81176 No

You may have noticed that sum method that was used to sum up each column. The keyword argument axis = 1 indicates that you want to sum over rows, not over columns. (axis = 0 would result in summing over columns.) There are a few other handy methods for common mathematical operations.

  1. min() - Find the minimum value.
  2. max() - Find the maximum value.
  3. cumsum() - Cumulative sum, just like in numpy.
  4. std() - Standard deviation.
  5. mean() - Mean or average.
  6. quantile(q) - Find value of a give quantile q.

There are a multitude of other useful math functions available. See here for a more comprehensive list. Basically, any function that numpy has, pandas will also have.

13.4 Modifying the dataframe

13.4.1 Adding new columns

A new column can be added to a dataframe by typing dataframe[columnname] = followed by a list,tuple, or array containing the new entries. For example:

elevator_data["a_mag"]  = sqrt( (elevator_data[["gFx", "gFy", "gFz"]]**2).sum(axis = 1))

13.4.2 Adding a new row

To add a single (or multiple) rows to a dataframe you should use the concat() function (short for concatenate). This function will join multiple dataframes into one. Below is an example of the usage.

from pandas import DataFrame,concat

to_add = DataFrame({" time-s":0.02,"gFx":-0.028,"gFy":0.018,"gFz":1.028},index = ["L"]) # Build the dataframe to be added.
final = concat([elevator_data,to_add])  # Append using a dictionary.

elevator_data2 = DataFrame({"time":[0.02,0.025],"gFx":[-0.028,-0.022],"gFy":[0.018,-0.012],"gFz":[1.028,1.042]})

final2 = concat([elevator_data,to_add,elevator_data2]) #Combine all three data frames into one.

13.5 Other useful methods

13.5.1 Getting a summary of your dataframe.

The describe() function will calculate several useful statistical quantities and display them in a dataframe.

elevator_data.describe()
time-s gFx gFy gFz a_mag
count 10.000000 10.000000 10.000000 10.000000 10.000000
mean 0.010600 0.005586 -0.015974 9.848706 9.848937
std 0.003438 0.060314 0.033201 0.059088 0.059124
min 0.007000 -0.112700 -0.054880 9.751980 9.752067
25% 0.008250 -0.018375 -0.033320 9.820090 9.820186
50% 0.009000 0.005880 -0.020580 9.834790 9.834958
75% 0.013000 0.048755 -0.019600 9.883790 9.883886
max 0.017000 0.088200 0.057820 9.955820 9.956241

13.5.2 Plotting your dataframe.

A histogram of each column can be easily generate with the dataframe.hist() function.

elevator_data.hist()
array([[<Axes: title={'center': ' time-s'}>,
        <Axes: title={'center': 'gFx'}>],
       [<Axes: title={'center': 'gFy'}>, <Axes: title={'center': 'gFz'}>],
       [<Axes: title={'center': 'a_mag'}>, <Axes: >]], dtype=object)

To make a scatter plot of two columns in your dataframe, use dataframe.plot.scatter(). The x and y keyword arguments should be used to specify which columns to plot.

elevator_data.plot.scatter(x = " time-s", y = "gFz")
<Axes: xlabel=' time-s', ylabel='gFz'>

You can also plot a single column vs. the row labels using the plot function

elevator_data["gFz"].plot()
<Axes: >

You can also use this function to plot a single row vs the column labels.

elevator_data.iloc[2][[" time-s","gFx","gFy","gFz"]].plot()
<Axes: >

A wealth of other functions exist for pandas and I will not exemplify or explain them here because it is beyond the scope of the class. A nice summary sheet for pandas can be found here

13.5.3 Writing your dataframe to file

You have already used savetxt to save an entire array of numbers to a file in one step. If your data is in a pandas dataframe, saving that dataframe to file couldn’t be easier; just use dataframe.to_csv("filename").

elevator_data.to_csv("myelevatorData.csv")

Several helpful keyword arguments are available when writing to a file. I’ll list a few of them below.

  1. sep - Delimiter or character used to separate the data as a length-one string. Default is a comma (“,”).
  2. columns - specify which columns in your dataframe to write as a list of column labels.
  3. index - True if you want the row labels written and False if you don’t
  4. compression - specify the compression scheme as a string. Options are ‘zip’,‘gzip’, ‘bz2’,‘zstd’, and ‘tar’.

Depending on the type of data you are working with, there is a good argument for always using to_csv to write data to file, never needing savetxt.

13.6 Flashcards

  1. How does read_csv work? Give a short example.
  2. What is the keyword argument usecols used for when reading a dataframe from file?
  3. How does the DataFrame function work and what type of arguments can you pass it? Give a few short examples.
  4. How do you access/extract multiple columns in their entirety from a dataframe.
  5. How do you access/extract multiple rows in their entirety from a dataframe.
  6. How do you access/extract the middle part of a dataframe?
  7. How do you access/extract only the numbers in a dataframe that meet a given criteria?
  8. How do you add a new column to a dataframe?
  9. How do you add a new row to a dataframe?
  10. Recite Mosiah 4:27.

13.7 Exercises

  1. In homework problem 3 from the chapter on dictionaries and sets, you built a dictionary containing planetary information for your top three favorite planets.
    1. Create a single dataframe that contains this information in it. The columns should be the properties of the planets and the rows should be the planets.
    2. Make the name of the planet be the row label.
    3. Write this dataframe to a csv file.
 # Python code here
  1. In homework problem 2 from the chapter on I/O, you calculated the moment of inertia for 1000 different objects. I’ve copied the code from that homework problem below
    1. Create a dataframe out of the mass, radius, length, and inertia arrays. The dataframe should have four columns and 1000 rows.
    2. Use describe to get a summary of the data.
    3. Write the data to a .csv file and inspect the file to ensure that you did it correctly.
from numpy.random import normal

mass = normal(5,.5,1000)
length = normal(2,.2,1000)
radius = normal(1,.1,1000)
  1. In homework problem 3 from the chapter on I/O, you read planetary data into a .csv file using genfromtxt.
    1. Read the same file into a dataframe using read_csv. Let the name of the planets be the row label and the physical property be the column label.
    2. The sixth column contains the acceleration due to gravity for all of the planets in units of m/s\(^2\). Recalculate these values to be in terms of earth’s value (\(g = 9.78\) m/s\(^2\)) and modify the sixth column.
    3. The seventh column contains the orbital period for all of the planets in units of earth years. Recalculate these values to be in earth days and modify that column accordingly.
    4. Add a new column labeled “Surface Temperature” that contains the surface temperature for each planet in Kelvins as given here.
    5. Add a new row for a hypothetical planet (“planet X”) and set its physical properties equal to the average of the other 9 planets that are already in the dataframe. Set the index for this row to be “Planet X”.
    6. Determine which planets have an acceleration due to gravity that is less than earth’s value ($9.8 $ m/s\(^2\))
    7. Determine which planets have an eccentricity that is greater than \(0.04\).
    8. Determine which planet(s) have an inclination greater than \(1.5^\circ\) and have 10 or more satellites.
    9. Write a .csv file containing the following columns in the dataframe: “Name”, “SurfaceGravity”, “OrbitalPeriod”, and “Satellites”.
    10. Write the entire dataframe to a separate file.
    Hints:
    1. For part 5 you will want to use describe to generate a dataframe of summary statistics. Then use concat to add the row labeled “mean” to your original planetary database. Then use rename to rename that row from mean to Planet X.
    2. The column label for the acceleration due to gravity start with a space: ” SurfaceGravity” not “SurfaceGravity”.
    3. For part 8 use query. To include the space in the column names, enclose the name in backticks “`”. (i.e. ” ` Inclination` > 0.04”)
 # Python code here