Tutorial: Download all Strain Files from a Run

In this tutorial, we'll show how to download all the available strain files in a run.

Fetch a list of files for a given run

Let us suppose that we want to download and eventually process all available strain files of a certain type for the O3b run. The example script below shows how to fetch a list of strain file download links for a given run, using GWOSC's web API.

import requests


def fetch_run_gps_times(run):
    "Fetch gwosc archive and return the (start, end) GPS time tuple of the run."
    response = requests.get("https://gwosc.org/archive/all/json/")
    response.raise_for_status()
    runs = response.json()["runs"]
    run_info = runs.get(run)
    if run_info is None:
        raise ValueError(f"Could not find run {run}. Available runs: {list(runs)}")
    return run_info["GPSstart"], run_info["GPSend"]


def fetch_strain_list(run, detector, gps_start=None, gps_end=None):
    "Return the list of strain file info for `run` and `detector`."
    if gps_start is None or gps_end is None:
        start, end = fetch_run_gps_times(run)
        gps_start = gps_start or start
        gps_end = gps_end or end

    # Get the strain list
    fetch_url = (
        f"https://gwosc.org/archive/links/"
        f"{run}/{detector}/{gps_start}/{gps_end}/json/"
    )
    response = requests.get(fetch_url)
    response.raise_for_status()
    return response.json()["strain"]


def main():
    strain_files = fetch_strain_list("O3b_4KHZ_R1", "H1")
    print(len(strain_files))


if __name__ == "__main__":
    main()
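Each entry in the returned strain list is a JSON object; as the download script later in this tutorial relies on, it includes at least a "url" and a "format" key. Filtering the list down to one format can be sketched as follows (the entries here are made up for illustration; real entries carry additional metadata):

```python
# Hypothetical entries mirroring the shape of GWOSC's strain-list JSON;
# real entries contain more fields alongside "url" and "format".
strain_files = [
    {"url": "https://gwosc.org/archive/data/example-1.hdf5", "format": "hdf5"},
    {"url": "https://gwosc.org/archive/data/example-1.gwf", "format": "gwf"},
    {"url": "https://gwosc.org/archive/data/example-2.hdf5", "format": "hdf5"},
]

# Keep only the HDF5 download links.
hdf5_urls = [entry["url"] for entry in strain_files if entry["format"] == "hdf5"]
print(len(hdf5_urls))  # 2
```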

Download strain files only once

After downloading the strain file list and checking its length, we see that a run can have thousands of associated strain files. Downloading them all is a long-running process that is likely to be interrupted, which would otherwise force us to start over from the beginning. To avoid that situation, we can keep track of already downloaded files in an auxiliary text file.
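The bookkeeping pattern, isolated from the download logic, can be sketched as below. The helper names `load_done_list` and `mark_done` are our own for this sketch (the full script further down inlines the same logic), and the temporary directory is only there to keep the example self-contained:

```python
import os
import tempfile


def load_done_list(path):
    "Return the set of URLs already recorded as downloaded."
    try:
        with open(path) as fp:
            return {line.strip() for line in fp}
    except FileNotFoundError:
        # First run: no record file exists yet.
        return set()


def mark_done(url, path):
    "Append a URL to the record of completed downloads."
    with open(path, "a") as fp:
        fp.write(f"{url}\n")


with tempfile.TemporaryDirectory() as tmpdir:
    record = os.path.join(tmpdir, "filesdone.txt")
    assert load_done_list(record) == set()  # nothing recorded yet
    mark_done("https://gwosc.org/archive/data/example.hdf5", record)
    done = load_done_list(record)
    print("https://gwosc.org/archive/data/example.hdf5" in done)  # True
```

Because each completed URL is appended immediately after its download finishes, an interrupted run loses at most the file that was in flight.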

In the following example we expand on our previous script with a download function that uses requests to stream the download, plus a safeguard that skips files that have already been downloaded.

The final analysis is usually done with specialized software such as GWpy. We recommend reading the documentation of whichever analysis library you choose. As an example, the script below also shows how to open and read an HDF5-format file with GWpy.

import requests
from gwpy.timeseries import TimeSeries


def fetch_run_gps_times(run):
    "Fetch gwosc archive and return the (start, end) GPS time tuple of the run."
    response = requests.get("https://gwosc.org/archive/all/json/")
    response.raise_for_status()
    runs = response.json()["runs"]
    run_info = runs.get(run)
    if run_info is None:
        raise ValueError(f"Could not find run {run}. Available runs: {list(runs)}")
    return run_info["GPSstart"], run_info["GPSend"]


def fetch_strain_list(run, detector, gps_start=None, gps_end=None):
    "Return the list of strain file info for `run` and `detector`."
    if gps_start is None or gps_end is None:
        start, end = fetch_run_gps_times(run)
        gps_start = gps_start or start
        gps_end = gps_end or end

    # Get the strain list
    fetch_url = (
        f"https://gwosc.org/archive/links/"
        f"{run}/{detector}/{gps_start}/{gps_end}/json/"
    )
    response = requests.get(fetch_url)
    response.raise_for_status()
    return response.json()["strain"]


def download_strain_file(download_url):
    "Download the strain file on the given url and save to disk."
    # Here we parse the file name from the download URL.
    # Ideally, the file name would be taken from the
    # Content-Disposition response header instead.
    filename = download_url.split("/")[-1]
    with requests.get(download_url, stream=True) as r:
        r.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return filename


def main():
    strain_files = fetch_strain_list("O3b_4KHZ_R1", "H1")
    try:
        with open("filesdone.txt", "r") as fp:
            donelist = {line.strip() for line in fp}
    except FileNotFoundError:
        donelist = set()
    for afile in strain_files:
        if afile["url"] in donelist:
            continue
        if afile["format"] == "hdf5":
            print(f"Downloading {afile['url']}")
            fname = download_strain_file(afile["url"])
            tseries = TimeSeries.read(fname, format="hdf5.gwosc")
            with open("filesdone.txt", "a") as fp:
                fp.write(f"{afile['url']}\n")
            # process tseries here


if __name__ == "__main__":
    main()
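As the comment in download_strain_file notes, a more robust file name comes from the Content-Disposition response header when the server sends one. A minimal sketch of that parsing is below; the header value is a made-up example, and the simple regex deliberately ignores the RFC 5987 `filename*` form, so treat this as a starting point rather than a complete parser:

```python
import re


def filename_from_headers(headers, fallback):
    "Extract a file name from a Content-Disposition header, or use the fallback."
    value = headers.get("Content-Disposition", "")
    match = re.search(r'filename="?([^";]+)"?', value)
    return match.group(1) if match else fallback


# Hypothetical header value for illustration.
headers = {"Content-Disposition": 'attachment; filename="strain.hdf5"'}
print(filename_from_headers(headers, "download.bin"))  # strain.hdf5
print(filename_from_headers({}, "download.bin"))  # download.bin
```

Inside download_strain_file, the response's headers (`r.headers`, a case-insensitive mapping in requests) could be passed to such a helper, with the URL-derived name as the fallback.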

To learn more about GWpy and how to manipulate a TimeSeries object, check the official documentation.