Tutorial: Automatically download and process ALL the data

In this tutorial, we'll show how to process thousands of hours of LIGO data with a simple Python script. First we make a list of all the files that we want to process; then the processing code fetches these files one by one, recording in another file which ones have been processed. The resulting code can be stopped, restarted, and interrupted many times (for example, if the internet connection breaks), but eventually it will have processed each file exactly once.

Listing the files

Let us assume that a Run has been chosen from those available (see here), and let us suppose that it is the S5 run. The following code uses the web-service interface to the tile catalog to find all of the relevant files between two given times for detector H2. The response from the server is JSON: some metadata about the query, followed by a list of file records. From each record we want the URL of the HDF5 file.
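Before writing the loop, it helps to see the shape of the JSON that the code below relies on. Here is a toy sample (the values are invented and the URLs truncated; only the key names match what the code reads):

```python
import json

# A made-up catalog response with the keys the tutorial code uses
sample = '''{
  "dataset": "S5",
  "GPSstart": 825155213,
  "GPSend":   825232014,
  "strain": [
    {"format": "hdf5", "UTCstart": "2006-02-27", "url": "https://gwosc.org/...hdf5"},
    {"format": "gwf",  "UTCstart": "2006-02-27", "url": "https://gwosc.org/...gwf"}
  ]
}'''

tiles = json.loads(sample)
# Keep only the HDF5 records, as the code below does
hdf5 = [entry['url'] for entry in tiles['strain'] if entry['format'] == 'hdf5']
print(len(hdf5))    # → 1
```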

import json
import urllib.request

dataset = 'S5'
GPSstart = 825155213   # GPS start time
GPSend   = 825232014   # GPS end time
detector = 'H2'

urlformat = 'https://gwosc.org/archive/links/{0}/{1}/{2}/{3}/json/'
url = urlformat.format(dataset, detector, GPSstart, GPSend)
print("Tile catalog URL is", url)

r = urllib.request.urlopen(url).read()   # get the list of files
tiles = json.loads(r)                    # parse the JSON

print(tiles['dataset'])
print(tiles['GPSstart'])
print(tiles['GPSend'])

output_list = open('files', 'w')
for entry in tiles['strain']:
    if entry['format'] == 'hdf5':
        print("found file from", entry['UTCstart'])
        output_list.write(entry['url'] + '\n')
output_list.close()


Processing the files

As an example of processing the data, we have chosen to fold it with some period (32 seconds in the code below), meaning that the time of each sample is evaluated modulo 32 seconds and the strain is coadded, converting thousands of hours of strain into a single 32-second folded and averaged version.
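To make the folding operation concrete, here is a toy version with the numbers scaled down (a 4-sample period instead of 32 seconds at 4096 Hz; the variable names mirror the tutorial code but the signal is invented):

```python
import numpy

FoldTime = 4                      # toy folding period, in samples
signal = numpy.arange(12, dtype=numpy.float64)  # 12 samples: 0, 1, ..., 11

foldsum = numpy.zeros(FoldTime)   # running sum in each fold bin
foldnum = numpy.zeros(FoldTime)   # how many samples went into each bin

for t in range(len(signal)):
    m = t % FoldTime              # which bin of the fold this sample lands in
    foldsum[m] += signal[t]       # coadd the sample into its bin
    foldnum[m] += 1

foldavg = foldsum / foldnum       # the folded average
print(foldavg)                    # → [4. 5. 6. 7.]
```

Bin 0 averages samples 0, 4, and 8; bin 1 averages 1, 5, and 9; and so on, which is exactly what the real code does with whole seconds of strain.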

So the code loops through the file names from above; for each, it builds a URL, fetches the (~120 Mbyte) file, folds it, and accumulates the result into a sum file called "foldsum.npy". Another file, "foldnum.npy", records how many seconds were added into each second of the sum, so that the average is foldsum divided by foldnum.
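Because foldsum holds Sec samples for every one count in foldnum, the division has to expand the counts first. A toy sketch of that average (Sec shrunk to 3, values invented):

```python
import numpy

Sec = 3                                           # toy samples per second (4096 in the real code)
foldsum = numpy.array([2., 4., 6., 9., 9., 9.])   # two folded seconds of summed strain
foldnum = numpy.array([2, 3])                     # seconds coadded into each folded second

# Expand the per-second counts to per-sample counts, then divide
foldavg = foldsum / numpy.repeat(foldnum, Sec)
print(foldavg)                                    # → [1. 2. 3. 3. 3. 3.]
```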

However, such a long-running process, fetching thousands of large files, is likely to be interrupted, which would require starting the whole process again. Therefore, we keep track of which files have been processed, so that if/when a restart is required, we do not start at the beginning, but rather with the next unprocessed file. The code looks like this:
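The bookkeeping can be boiled down to two small helpers. This is an illustrative sketch of the pattern, not part of the tutorial code (a temporary path stands in for the real 'filesdone'):

```python
import os, tempfile

def load_done(path):
    """Return the set of lines already processed, or an empty set."""
    try:
        with open(path) as f:
            return set(f.readlines())
    except FileNotFoundError:
        return set()

def mark_done(line, path):
    """Append a line to the done-list as soon as it has been processed."""
    with open(path, 'a') as f:
        f.write(line)

# Usage: skip anything already recorded, record each success immediately
path = os.path.join(tempfile.mkdtemp(), 'filesdone')
for line in ['a.hdf5\n', 'b.hdf5\n']:
    if line in load_done(path):
        continue
    # ... fetch and fold the file here ...
    mark_done(line, path)
```

Recording each file the moment it finishes means a crash loses at most the file that was in flight, which is then simply reprocessed on restart.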

import os, urllib.request
import numpy
import readligo

Sec = 4096      # samples per second
Hour = 4096     # seconds per file
FoldTime = 32   # seconds for folding
detector = 'H2' # which detector to use

# First we get a list of all the files that are already processed
try:
    f = open('filesdone')
    donelist = f.readlines()
    f.close()
except IOError:
    donelist = []            # no record yet: nothing is done

# Now open the list of files that are to be processed
for line in open('files').readlines():
    if line in donelist: continue    # already processed

    # Now build the URL for each file
    # Example line: 815792128/H-H1_LOSC_4_V1-815886336-4096.hdf5
    tok = line.strip().split('/')
    url = 'https://gwosc.org/archive/data/S5/' + line.strip()
    filename = tok[1]

    tol = filename.split('-')   #  H   H2_LOSC_4_V1   843599872   4096.hdf5
    gpstime = int(tol[2])

    try:
        foldsum = numpy.load('foldsum.npy')   # load the sum output file
        foldnum = numpy.load('foldnum.npy')   # load the file of how many
    except IOError:
        foldsum = numpy.zeros(FoldTime*Sec, numpy.float32)  # or make a new one with zeros
        foldnum = numpy.zeros(FoldTime, numpy.int32)

    try:
        print("Fetching data file from", url)
        r = urllib.request.urlopen(url).read()
        f = open('cache/' + filename, 'wb')   # write it to the right filename
        f.write(r)
        f.close()

        strain, time, chan_dict = readligo.loaddata('cache/' + filename, detector)
        # Let's only count the CAT4 (best) data
        okmask = chan_dict['CBCHIGH_CAT4']
        secs = 0
        for t in range(0, Hour):
            if okmask[t]:
                m = (gpstime + t) % FoldTime            # second of the fold
                foldsum[m*Sec:(m+1)*Sec] += strain[t*Sec:(t+1)*Sec]  # vector coadd operation
                foldnum[m] += 1
                secs += 1
        print("     %d seconds processed, fold from %d to %d" % (secs, foldnum.min(), foldnum.max()))

        numpy.save('foldsum.npy', foldsum)     # save the output file
        numpy.save('foldnum.npy', foldnum)     # save the output file
        f = open('filesdone', 'a')             # mark the file as done
        f.write(line)
        f.close()
        os.system("rm cache/*")                # delete the processed file
    except Exception as e:
        print("Something went wrong:", e)

Making Sound from the Fold

This folding operation could reveal how the time of day affects LIGO operation, by using a 24-hour folding time. It could also be used to look for signals in data folded by the sidereal day (86164.09 seconds), which is the time between risings of a fixed star.
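A sidereal period is not a whole number of seconds, so the fold bin has to be computed with a floating-point modulo. A hedged sketch (the function and its bin layout are our illustration, not part of the tutorial code):

```python
SiderealDay = 86164.09        # seconds between risings of a fixed star
Sec = 4096                    # samples per second, as in the tutorial code

def fold_bin(gpstime, t):
    """Which one-second bin of a sidereal-day fold second t of a file
    starting at gpstime lands in."""
    phase = (gpstime + t) % SiderealDay   # floating-point phase in seconds
    return int(phase)                     # truncate to a whole-second bin

print(fold_bin(825155213, 0))
```

For a fold this long, foldsum and foldnum would need to be sized for 86165 one-second bins rather than 32.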

The folded data product represented by "foldsum" and "foldnum" can be converted to a sound file. The scipy library has tools to convert a numpy array, such as the folded average that we made above, to a sound file. Here is a minimal program:

import numpy
import scipy.io.wavfile

Sec = 4096
foldsum = numpy.load('foldsum.npy')
foldnum = numpy.load('foldnum.npy')
print("Found %d seconds" % len(foldnum))
for i in range(len(foldnum)):   # make the average
    if foldnum[i] > 0:
        foldsum[i*Sec:(i+1)*Sec] /= foldnum[i]
# Compress the dynamic range with a sign-preserving square root, then scale to 16 bits
normed = foldsum / numpy.max(numpy.abs(foldsum))
scaled = numpy.int16(numpy.sign(normed) * numpy.sqrt(numpy.abs(normed)) * 32767)
scipy.io.wavfile.write('fold.wav', Sec, scaled)

The resulting wav file can be heard using standard tools such as QuickTime or iTunes.