Tutorial: Download all Strain Files from a Run
In this tutorial, we'll show how to download all the available strain files in a run.
Tutorials in this series:
- Discover and Download Bulk-release Strain Files
- Download Bulk-release Strain Files Programmatically
- Download all Strain Files from a Run
Fetch a list of files for a given run
Let us suppose that we want to download and eventually process all available strain files of a certain type for the O3b run. The example script below fetches a list of strain file download links for a given run, using GWOSC's web API.
import requests


def fetch_run_gps_times(run):
    "Fetch the GWOSC archive and return the (start, end) GPS time tuple of the run."
    response = requests.get("https://gwosc.org/archive/all/json/")
    # Check the HTTP status before decoding the JSON body.
    response.raise_for_status()
    runs = response.json()["runs"]
    run_info = runs.get(run)
    if run_info is None:
        raise ValueError(f"Could not find run {run}. Available runs: {list(runs)}")
    return run_info["GPSstart"], run_info["GPSend"]


def fetch_strain_list(run, detector, gps_start=None, gps_end=None):
    "Return the list of strain file info for `run` and `detector`."
    if gps_start is None or gps_end is None:
        # Fall back to the run boundaries for whichever limit was not given.
        start, end = fetch_run_gps_times(run)
        gps_start = start if gps_start is None else gps_start
        gps_end = end if gps_end is None else gps_end
    # Get the strain list
    fetch_url = (
        f"https://gwosc.org/archive/links/"
        f"{run}/{detector}/{gps_start}/{gps_end}/json/"
    )
    response = requests.get(fetch_url)
    response.raise_for_status()
    return response.json()["strain"]


def main():
    strain_files = fetch_strain_list("O3b_4KHZ_R1", "H1")
    print(len(strain_files))


if __name__ == "__main__":
    main()
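Each entry in the returned list is a dictionary, and the later script in this tutorial relies on its "format" and "url" keys. Before downloading anything, it can help to count how many entries exist per format. The sketch below does this on a small hand-written list standing in for the API response; the helper names `count_by_format` and `urls_for_format` and the example URLs are illustrative assumptions, not GWOSC output.

```python
from collections import Counter


def count_by_format(strain_files):
    """Count strain file entries per value of their "format" key."""
    return Counter(entry["format"] for entry in strain_files)


def urls_for_format(strain_files, file_format):
    """Return the download URLs of the entries with the given format."""
    return [e["url"] for e in strain_files if e["format"] == file_format]


# Hand-written stand-in for the list returned by fetch_strain_list().
# The URLs are placeholders, not real GWOSC links.
sample_files = [
    {"format": "hdf5", "url": "https://example.org/a.hdf5"},
    {"format": "gwf", "url": "https://example.org/a.gwf"},
    {"format": "hdf5", "url": "https://example.org/b.hdf5"},
]

counts = count_by_format(sample_files)
hdf5_urls = urls_for_format(sample_files, "hdf5")
print(counts)
print(hdf5_urls)
```

With a real run, passing the result of `fetch_strain_list` instead of `sample_files` should give a quick estimate of how many downloads a given format implies.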
Download strain files only once
After downloading the strain file list and checking its size, we see that a run can have thousands of strain files associated with it. Such a long-running download is likely to be interrupted at some point, which would mean starting the whole process again. To avoid that, we can keep track of the files already downloaded in an auxiliary text file.
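The bookkeeping idea can be sketched on its own, without any network access. In this minimal sketch the helper names (`load_done`, `mark_done`, `process_all`) and the placeholder URLs are assumptions made for illustration; the handler passed in merely records which URLs it was asked to process.

```python
import os
import tempfile


def load_done(path):
    """Return the set of URLs recorded in the checkpoint file at `path`."""
    try:
        with open(path, "r") as fp:
            return {line.strip() for line in fp if line.strip()}
    except FileNotFoundError:
        # No checkpoint file yet: nothing has been processed.
        return set()


def mark_done(path, url):
    """Append `url` to the checkpoint file, creating it if needed."""
    with open(path, "a") as fp:
        fp.write(f"{url}\n")


def process_all(urls, checkpoint, handle):
    """Call `handle(url)` for every URL not yet in the checkpoint file."""
    done = load_done(checkpoint)
    for url in urls:
        if url in done:
            continue
        handle(url)
        # Record the URL only after the handler finished without raising.
        mark_done(checkpoint, url)
        done.add(url)


# Demonstration with placeholder URLs and a handler that only records calls.
urls = ["https://example.org/a.hdf5", "https://example.org/b.hdf5"]
checkpoint = os.path.join(tempfile.mkdtemp(), "filesdone.txt")
first_run, second_run = [], []
process_all(urls, checkpoint, first_run.append)
process_all(urls, checkpoint, second_run.append)
print(first_run)   # both URLs are handled on the first pass
print(second_run)  # nothing is left to do on the second pass
```

Because a URL is written to the checkpoint file only after its handler returns, a file whose download was interrupted is simply retried on the next run.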
In the following example we expand on the script from the previous section, adding a download function that uses requests to stream each file and a safeguard that skips files that have already been downloaded.
The final analysis is usually done with specialized software such as GWpy. We recommend reading the documentation of whichever analysis library you choose. As an example, the script below also shows how to open and read an HDF5 file with GWpy.
import requests
from gwpy.timeseries import TimeSeries


def fetch_run_gps_times(run):
    "Fetch the GWOSC archive and return the (start, end) GPS time tuple of the run."
    response = requests.get("https://gwosc.org/archive/all/json/")
    # Check the HTTP status before decoding the JSON body.
    response.raise_for_status()
    runs = response.json()["runs"]
    run_info = runs.get(run)
    if run_info is None:
        raise ValueError(f"Could not find run {run}. Available runs: {list(runs)}")
    return run_info["GPSstart"], run_info["GPSend"]


def fetch_strain_list(run, detector, gps_start=None, gps_end=None):
    "Return the list of strain file info for `run` and `detector`."
    if gps_start is None or gps_end is None:
        # Fall back to the run boundaries for whichever limit was not given.
        start, end = fetch_run_gps_times(run)
        gps_start = start if gps_start is None else gps_start
        gps_end = end if gps_end is None else gps_end
    # Get the strain list
    fetch_url = (
        f"https://gwosc.org/archive/links/"
        f"{run}/{detector}/{gps_start}/{gps_end}/json/"
    )
    response = requests.get(fetch_url)
    response.raise_for_status()
    return response.json()["strain"]


def download_strain_file(download_url):
    "Download the strain file at the given URL and save it to disk."
    # The file name is taken from the last segment of the download URL.
    # Ideally, it should be read from the
    # Content-Disposition response header.
    filename = download_url.split("/")[-1]
    # Stream the response so the whole file is never held in memory.
    with requests.get(download_url, stream=True) as r:
        r.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return filename


def main():
    strain_files = fetch_strain_list("O3b_4KHZ_R1", "H1")
    # Load the URLs already handled by a previous (possibly interrupted) run.
    try:
        with open("filesdone.txt", "r") as fp:
            donelist = {line.strip() for line in fp}
    except FileNotFoundError:
        donelist = set()
    for afile in strain_files:
        if afile["url"] in donelist:
            continue
        if afile["format"] == "hdf5":
            print(f"Downloading {afile['url']}")
            fname = download_strain_file(afile["url"])
            tseries = TimeSeries.read(fname, format="hdf5.gwosc")
            # Record the URL only after the file was downloaded and read.
            with open("filesdone.txt", "a") as fp:
                fp.write(f"{afile['url']}\n")
            # process tseries here


if __name__ == "__main__":
    main()
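The comment in `download_strain_file` notes that the file name would ideally come from the Content-Disposition response header. One possible way to do that with the standard library is sketched below. The helper name `filename_from_response` is hypothetical, the example header and URL are placeholders, and whether the server sends this header for a given file is not something the script above verifies, so the sketch falls back to the last segment of the URL path.

```python
import posixpath
from email.message import Message
from urllib.parse import urlsplit


def filename_from_response(headers, url):
    """Pick a file name from Content-Disposition, else from the URL path."""
    # Fallback: last path segment of the URL, ignoring any query string.
    fallback = posixpath.basename(urlsplit(url).path)
    disposition = headers.get("Content-Disposition")
    if disposition:
        # email.message.Message knows how to parse header parameters.
        msg = Message()
        msg["Content-Disposition"] = disposition
        name = msg.get_filename()
        if name:
            # Keep only the final component so the header cannot pick a directory.
            return posixpath.basename(name.replace("\\", "/")) or fallback
    return fallback


# Placeholder URL and header values, for illustration only.
url = "https://example.org/data/H-H1_EXAMPLE-0-4096.hdf5?download=1"
with_header = filename_from_response(
    {"Content-Disposition": 'attachment; filename="strain.hdf5"'}, url
)
without_header = filename_from_response({}, url)
print(with_header)
print(without_header)
```

Inside `download_strain_file`, such a helper could be called as `filename_from_response(r.headers, download_url)` once the streamed response `r` is open, since `r.headers` supports `.get` with case-insensitive keys.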
To learn more about GWpy and how to manipulate a TimeSeries object, check the official documentation.