10  File I/O (Input/Output)

Up to now, we have been working with computer-generated or manually typed data sets. Often in a scientific setting your data will be stored in a file and you will need to read the contents of the file into Python so you can perform an analysis. Most data files are text files, but there is a large variety of these, and they differ mostly in the way the information is formatted. The file extension (i.e., the 3 or 4 letters after the period at the end of a file name) usually indicates the formatting of the file. For example, a .csv file (short for comma-separated values) uses commas to separate the values on each line.

10.1 The os module

Before we learn how to read and write data files, we should first learn how to navigate our computer’s file system from within Python. For this, we will use a module called os (short for operating system). The os module allows you to perform standard operations on the files and folders on your computer. When you open a file by name alone, Python looks for it in the current working directory, which is usually the directory containing the notebook or script you are running. If you want to open files located in other directories, you can either give the full path to the file or use the os module to navigate there first. The following functions are the most commonly used ones:

| Function | Description |
|---|---|
| getcwd() | Short for "get current working directory." Returns the current working directory as a string. |
| chdir(path) | Change the current working directory to path. |
| listdir() | List all of the files and folders in the current working directory. |
| mkdir(path) | Make a new directory at location path. |

In the cell below, we show the usage of these functions.

from os import getcwd,chdir,listdir

currentdir = getcwd()  
myfiles = listdir()
# chdir("/path/to/new/directory")  # The path must be given as a string.

To Do:

  1. Add print statements to the cell above and inspect the results to understand what each function does.
  2. Modify the third statement above (the one that uses chdir) to change the current working directory to one that actually exists on your machine.
  3. Most computers have a “Downloads” folder. Use the chdir function to change the current working directory to the Downloads folder.

The os module has many, many more functions that do useful things but you will mostly use the functions mentioned above. As you get more experience using your computer at the command line, you will be better equipped to understand the usefulness of the rest of this library.
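The four functions in the table can be tried out safely with a short sketch like the one below. It is only an illustration: it uses the standard tempfile module (not covered in this chapter) to create a throwaway directory, and the names scratch and "data" are made up for the example.

```python
import os
import tempfile

start_dir = os.getcwd()        # Remember where we started.

# Work inside a temporary directory so nothing on your machine is disturbed.
with tempfile.TemporaryDirectory() as scratch:
    os.chdir(scratch)          # Move into the temporary directory.
    os.mkdir("data")           # Make a new folder named "data" there.
    contents = os.listdir()    # List everything in the current directory.
    print(contents)
    os.chdir(start_dir)        # Go back before the temporary directory is deleted.
```

After the block finishes, the current working directory is back where it started and the temporary directory has been removed.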

10.2 Reading Files

10.2.1 Reading Line by Line

The first way to read a file is to use a for loop to iterate over the file line by line. Admittedly, this is not the most elegant or efficient way to read a file, but we present it first because it works for any text file, no matter how it is formatted. First, the file is opened using the open command. The file should be assigned to a variable for later use. Next, the readlines() function is used to read the file in as a list of strings, one string for each line in the file. We then use a for loop to iterate over this list, effectively parsing the file line by line. Finally, the file should be closed when you’re finished reading it. Let’s see an example for reading in the following file, which will be named squares.csv. (You can download squares.csv here if you want to execute the cells below without an error.)

1,1
2,4
3,9
4,16
5,25
6,36
7,49
8,64
9,81
10,100

file = open("squares.csv")
lines = file.readlines() # Read the file into a list of strings, one string for each line in the file.
for line in lines:  # Iterate over the list of strings.
    print(line)    # Do something 

file.close()
1,1

2,4

3,9

4,16

5,25

6,36

7,49

8,64

9,81

10,100

You can see how each line gets read separately and printed off. But this isn’t super useful yet because we’d probably like to have the numbers stored in lists for our forthcoming analysis. We can fix this by creating some empty lists and appending the appropriate values as they are read in.

numbers = []
squares = []

file = open("squares.csv")
for line in file.readlines():
    numbers.append( int(line.split(',')[0]) )
    squares.append( int(line.split(',')[1]) )

file.close()

print(numbers)
print(squares)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

Now the data from the file is saved in Python lists and our analysis can proceed.
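To see what happens to a single line inside the loop above, it can help to run the same steps on one string by hand. The sketch below uses a made-up line (with a space after the comma and a trailing newline, as a line read from a file would have) rather than reading squares.csv.

```python
line = "3, 9\n"             # One line as it might come back from readlines().

parts = line.split(",")     # Split at the comma: ['3', ' 9\n']
number = int(parts[0])      # Convert the first piece to an integer.
square = int(parts[1])      # int() ignores the leading space and the newline.

print(parts)
print(number, square)
```

Because int() ignores surrounding whitespace, the trailing newline does not need to be removed before converting.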

One final note: if all you need to do is read the first few lines of a file, the readline() function can be used (not readlines()). Each call to this function reads the next line in the file.

file = open("squares.csv")
lineOne = file.readline()
lineTwo = file.readline()
lineThree = file.readline()
file.close()

10.2.2 Using NumPy’s genfromtxt function

If the data file is highly structured (every line looks the same, the separator character is consistent across the file, etc.) then NumPy’s genfromtxt function can read the data very efficiently into something called an array. The genfromtxt function requires only one argument (the file name). An optional argument (delimiter) is typically included to specify the character used to separate the data. Below is an example of using this function to read in the .csv data we have been working with.

import numpy as np

data = np.genfromtxt('squares.csv',delimiter = ",")


print(data)
[[  1.   1.]
 [  2.   4.]
 [  3.   9.]
 [  4.  16.]
 [  5.  25.]
 [  6.  36.]
 [  7.  49.]
 [  8.  64.]
 [  9.  81.]
 [ 10. 100.]]

The data from the file is stored in something called a 2D NumPy array, which is similar to a nested list. It’s best to think of a 2D NumPy array as having rows and columns; to pick out an element you provide the desired row and column. We can index these arrays much like we index nested lists, but instead of using multiple sets of brackets ([]), simply place the desired row and column number in a single set of brackets, separated by a comma, as shown in the cell below. Use a colon (:) to select an entire row or column.

import numpy as np

data = np.genfromtxt('squares.csv',delimiter = ",")


numbers = data[:,0]  # Extract the entire first column of the array
squares = data[:,1]  # Extract the entire second column of the array.
mySlice = data[1:3,1] # Extract the second and third rows (indices 1 and 2) of the second column.
print(numbers)
print(squares)
print(mySlice)
[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
[  1.   4.   9.  16.  25.  36.  49.  64.  81. 100.]
[4. 9.]
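The indexing rules can also be checked on a small array typed in by hand. The sketch below does not read any file; the array table and the variable names are made up for illustration.

```python
import numpy as np

table = np.array([[1, 1],
                  [2, 4],
                  [3, 9],
                  [4, 16]])

element = table[2, 1]        # Row index 2, column index 1: the number 9.
first_row = table[0, :]      # Entire first row: 1 and 1.
second_col = table[:, 1]     # Entire second column: 1, 4, 9, 16.
middle = table[1:3, 1]       # Rows with index 1 and 2 of the second column: 4 and 9.

print(element)
print(first_row)
print(second_col)
print(middle)
```

Note that a slice such as 1:3 includes index 1 and index 2 but stops before index 3, just as it does for lists.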

Another useful feature of arrays is that you can do math with them and the mathematical operations are performed on every number in the array automatically. This is in contrast to lists, which do not support element-by-element math: multiplying a list by 2, for example, repeats the list rather than doubling each number. See the example in the cell below:

import numpy as np

data = np.genfromtxt('squares.csv',delimiter = ",")

numbers = data[:,0]  # Extract the entire first column of the array
squares = data[:,1]  # Extract the entire second column of the array.

calc = np.sqrt(squares)  # Take the square root of every number in the "squares" array.
print(calc)
[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
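The difference between lists and arrays is easy to see side by side. The short sketch below uses made-up values and applies the same operation to a list and to an array built from it.

```python
import numpy as np

values_list = [1, 4, 9]
values_array = np.array(values_list)

doubled_list = values_list * 2      # Repeats the list: [1, 4, 9, 1, 4, 9]
doubled_array = values_array * 2    # Doubles each element: 2, 8, 18

print(doubled_list)
print(doubled_array)
```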

Other optional arguments that can be used with the genfromtxt function are given in the table below:

| Argument | Description |
|---|---|
| delimiter | The string used to separate values. By default, whitespace acts as the delimiter. |
| skip_header | The number of lines to skip at the beginning of a file. |
| skip_footer | The number of lines to skip at the end of a file. |
| usecols | Specify which columns to read, with 0 being the first. For example, usecols = (0,2,5) will read the 1st, 3rd, and 6th columns. |
| comments | The character used to indicate the start of a comment. Everything on a line after this character will be discarded. |
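As a rough illustration of how several of these arguments work together, the sketch below reads a small made-up data set that has a header row, a text column, and a comment line. To keep the example self-contained, the text is wrapped in io.StringIO so that genfromtxt can read it as if it were a file; with a real file you would pass the file name instead.

```python
from io import StringIO
import numpy as np

# Made-up file contents: a header row, a label column, and one comment line.
text = """label,number,square
a,1,1
# this line is a comment and will be ignored
b,2,4
c,3,9
"""

data = np.genfromtxt(
    StringIO(text),
    delimiter=",",     # Values are separated by commas.
    skip_header=1,     # Skip the row of column names.
    usecols=(1, 2),    # Read only the 2nd and 3rd columns (the numbers).
    comments="#",      # Discard anything after a "#" character.
)

print(data)
```

The result is an array with three rows and two columns; the header, the label column, and the comment line are all left out.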

10.3 Writing Files

Writing Python data to a file is as simple as reading a file. Just as with reading, the method you should use depends on the type of data that you are writing. If your data is strictly numerical information stored in an array, NumPy has a function that will quickly save the data to a file. If your data is rife with inconsistencies, non-numerical data, etc., you’ll have to use Python’s built-in write function.

10.3.1 Writing Line by Line

Sometimes the data file that you want to write includes some text or other non-numerical data. For example, what if you wanted to write the following data to file:

| Planet | Acceleration due to gravity (m/s\(^2\)) |
|---|---|
| Earth | 9.8 |
| Moon | 1.6 |
| Mars | 3.7 |
| Venus | 8.83 |
| Saturn | 11.2 |
| Uranus | 10.5 |
| Neptune | 13.3 |
| Pluto | 0.61 |
| Jupiter | 24.5 |
| Sun | 275 |

In this case you must open the file you want to write to and write each line of the file one by one. This requires that you use a loop to iterate over the data. An example is given below.

planets = ["Earth","Moon","Mars","Venus","Saturn","Uranus","Neptune","Pluto","Jupiter","Sun"]
g=[9.8,1.6,3.7,8.83,11.2,10.5,13.3,0.61,24.5,275]

f= open("planets.txt","w")

f.write("Planet   g (m/s^2)\n")
f.write("------------------\n")

for idx,planet in enumerate(planets):
    f.write(f"{planet:10s}  {g[idx]:5.2f} \n" )

f.close()
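The format codes inside the f-string above control the width of each column. As a small sketch, the lines below build one such string for a single made-up planet and gravity value without writing anything to a file.

```python
planet = "Mars"
gravity = 3.7

# :10s pads the name with spaces to 10 characters.
# :5.2f writes the number with 2 decimal places in a field 5 characters wide.
line = f"{planet:10s}  {gravity:5.2f} \n"

print(repr(line))   # repr() makes the spaces and the newline visible.
```

Fixed widths like these are what make the columns line up when the file is opened in a text editor.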

10.3.2 The writelines function

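If the data you want to write to file is already in a list of strings, the writelines() function can write all of the lines with a single call. Note that writelines() does not add newline characters for you, so each string must end with "\n" or everything will land on one line of the file. Below, the newline is added to each string as it is written.

planets = ["Planet   g (m/s^2)","------------------", "Earth   9.8","Moon   1.6","Mars   3.7","Venus  8.83","Saturn   11.2","Uranus   10.5","Neptune   13.3","Pluto   0.61","Jupiter   24.5","Sun   275"]

f = open("planets.txt","w")

f.writelines(line + "\n" for line in planets)  # Append a newline to each string.

f.close()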

10.4 The with statement

Forgetting to close a file that you are writing to can be problematic so you should always include the close() function when you are done. If you are worried about forgetting it, you can use a with block that will automatically close the file once the block terminates. An example is given below:

planets = ["Earth","Moon","Mars","Venus","Saturn","Uranus","Neptune","Pluto","Jupiter","Sun"]
g=[9.8,1.6,3.7,8.83,11.2,10.5,13.3,0.61,24.5,275]

with open("planets.txt","w") as f:
    f.write("Planet   g (m/s^2)\n")
    f.write("------------------\n")

    for idx,planet in enumerate(planets):
        f.write(f"{planet:10s}  {g[idx]:5.2f} \n" )

Even though no close() function is called, the file planets.txt will be automatically closed once the with block is terminated. A with block can also be used when reading data files.
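As a sketch of what reading inside a with block might look like, the example below first writes a tiny made-up version of squares.csv into a temporary directory (using the standard tempfile module, which is an addition for this example) and then reads it back. Iterating directly over the file object visits the file one line at a time.

```python
import os
import tempfile

numbers = []
squares = []

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "squares.csv")

    # Write a small sample file so the example is self-contained.
    with open(path, "w") as f:
        f.write("1,1\n2,4\n3,9\n")

    # Read it back; the file is closed automatically when the block ends.
    with open(path) as f:
        for line in f:
            first, second = line.strip().split(",")
            numbers.append(int(first))
            squares.append(int(second))

print(numbers)
print(squares)
print(f.closed)   # True: the with block closed the file for us.
```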

10.4.1 Using NumPy’s savetxt function

If the data is strictly numerical and contains no text, the savetxt function is a fast and efficient way to write the data to file. For example, maybe you have a two dimensional array containing various columns of planetary data.

| Radius (\(\times 10^{6}\) meters) | Mass (\(\times 10^{23}\) kg) | Acceleration due to gravity (m/s\(^2\)) |
|---|---|---|
| 6.37 | 59.8 | 9.8 |
| 1.74 | 0.736 | 1.6 |
| 3.38 | 6.42 | 3.7 |
| 6.07 | 48.8 | 8.83 |
| 58.2 | 5680 | 11.2 |
| 23.5 | 868 | 10.5 |
| 22.7 | 1030 | 13.3 |
| 1.15 | 0.131 | 0.61 |
| 69.8 | 19000 | 24.5 |
| 696 | 19890000 | 275 |

from numpy import savetxt

data =[[6.37,59.8,9.8],[1.74,0.736,1.6],[3.38,6.42,3.7],[6.07,48.8,8.83],[58.2,5680,11.2],[23.5,868,10.5],[22.7,1030,13.3],[1.15,.131,0.61],[69.8,19000,24.5],[696,19890000,275]]

savetxt("planetsData.txt",data,fmt = "%5.2e")

The fmt keyword argument can be added to specify how to format the data. If the data is stored in separate one-dimensional arrays, you can pack them into a single list (or tuple) when using the savetxt function and it will write each data set to its own line in the file.

import numpy as np

radius = [6.37,1.74,3.38,6.07,58.2,23.5,22.7,1.15,69.8,696]
mass = [59.8,0.736,6.42,48.8,5680,868,1030,0.131,19000,19890000]
g = [9.8,1.6,3.7,8.83,11.2,10.5,13.3,0.61,24.5,275]

np.savetxt("planetsDataTwo.txt",(radius,mass,g),fmt = "%5.2e")

In this example, radius, mass, and g will be written to file in separate rows. If you’d like them to be written in separate columns, you must use np.transpose to flip the array before writing it to file, as shown below:

import numpy as np
radius = [6.37,1.74,3.38,6.07,58.2,23.5,22.7,1.15,69.8,696]
mass = [59.8,0.736,6.42,48.8,5680,868,1030,0.131,19000,19890000]
g = [9.8,1.6,3.7,8.83,11.2,10.5,13.3,0.61,24.5,275]

np.savetxt("planetsDataTwo.txt",np.transpose((radius,mass,g)),fmt = "%5.2e")

Other optional arguments that can be used with the savetxt function are given in the table below:

| Argument | Description |
|---|---|
| delimiter | String or character to separate columns. |
| newline | String or character to separate lines. |
| header | String to be written at the beginning of the file. |
| footer | String to be written at the end of the file. |
| comments | String that will be prepended to the header and footer string to mark them as comments. |
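To illustrate a few of these arguments, the sketch below writes two short columns of the planetary data to a comma-separated file with a header, then reads the file back with genfromtxt. The file name planets_demo.csv and the temporary directory (from the standard tempfile module) are choices made for this example only.

```python
import os
import tempfile
import numpy as np

radius = [6.37, 1.74, 3.38]
g = [9.8, 1.6, 3.7]
columns = np.transpose((radius, g))    # One column per data set.

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "planets_demo.csv")

    # Write comma-separated values with a header line.
    np.savetxt(path, columns, fmt="%.2f", delimiter=",", header="radius,g")

    with open(path) as f:
        first_line = f.readline()      # The header, marked as a comment.

    # The header starts with "#", so genfromtxt treats it as a comment.
    reloaded = np.genfromtxt(path, delimiter=",")

print(first_line)
print(reloaded)
```

Because the default value of comments is "# ", the header is written as a comment line, and genfromtxt skips it automatically when the file is read back.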

10.5 Flashcards

  1. What does os.getcwd() do?
  2. What does os.chdir() do?
  3. What does os.mkdir() do?
  4. What does os.listdir() do?
  5. How do you open a file and read all of the lines in that file using readlines()?
  6. How do you open a file and read all of the lines in that file using numpy.genfromtxt()?
  7. What is the delimiter keyword argument used for when reading a file with numpy.genfromtxt()?
  8. How do you open and write to a file using the write function?
  9. How do you write to a file using numpy.savetxt?

10.6 Exercises

  1. This file contains the coefficients of \(100\) second-order polynomials. Each line in the file represents a single second order polynomial: \(y = a x^2 + b x+ c\) with the first number equal to \(a\), the second equal to \(b\) and the third equal to \(c\).

    1. Use any method of your choosing to read this file into python.
    2. Use a loop (I’ll let you decide what kind of loop to use) to evaluate each polynomial at \(x = 2.5\) and add them all up. Hint: You should find that the sum of all of the polynomials is \(5285\).
# Python code here
  2. In the cell below, you will find arrays containing \(1000\) values each for mass, length, and radius.
    1. Using a single line of code, calculate the moments of inertia for the entire data set using the familiar formula: \[ I = {1\over 4} M R^2 + {1\over 12} M L^2\]
    2. Write this data to file first using the open and write functions. Mass values should be in column 1, lengths in column 2, radii in column 3, and moments of inertia in column 4. Add a header to the file to label the columns. Open the file and inspect it to verify that you did it right.
    3. Repeat part 2 but using the numpy.savetxt function. Open the file and inspect it to verify that you did it right.
from numpy.random import normal

mass = normal(5,.5,1000)
length = normal(2,.2,1000)
radius = normal(1,.1,1000)
  3. This file contains planetary data for all of the planets in our solar system formatted as a “.csv” file.
    1. Download the file and move it to a location on your computer that you are familiar with.
    2. Read the file using the open and readlines functions.
    3. Repeat part 2 using numpy.genfromtxt().
    4. Using a single line of code, calculate the average eccentricity of the planets.
    5. Using a single line of code each, determine the maximum and minimum acceleration due to gravity of the planets.
    Tip: You’ll need to use the skip_header and usecols keyword arguments to tell genfromtxt to skip the first row and column.
# Python code here
  4. This file contains the output from a quantum mechanical calculation on a cobalt-based alloy. The file is pretty large and the data is not well-structured at all, even though there is valuable information inside. Reading this file is definitely a job for open and readlines rather than numpy.genfromtxt(). Periodically in this file there are lines that look like this:
    free energy TOTEN = -72.3092 eV
    and you need to extract the numbers from all of these lines. Use the following steps to build a list containing these numbers:
    1. Use the open and readlines functions to read this file into a list of strings.
    2. Iterate over this list and look for lines that contain the string “TOTEN”.
    3. When you find a line that has this string in it, split the string and extract the number. Make sure you convert the number into a float.
    4. Append this number to a list that you defined initially to be empty.
    When you have built the list of energies, determine:
    1. The number of energies in your list.
    2. The average of the last 5 entries.
    3. The minimum and maximum energies in your list.
# Python code here