Exoplanet atmospheric retrieval is an inverse modeling technique in which atmospheric properties are inferred from exoplanet observations (primary transit, secondary eclipse, or direct observation). Using a radiative transfer (RT) code, thousands to millions of forward models are evaluated, compared to the observations, and probabilistically accepted or rejected by a Bayesian sampler (see the review by Madhusudhan 2018). The RT calculations dominate the runtime, which is typically on the order of days of compute time.
Himes et al. (2021) demonstrated that the runtime of retrievals can be reduced by using a machine learning model that approximates RT, with minimal loss in accuracy. This webpage hosts the Reproducible Research Compendium (RRC) for the paper.
The RRC is supplied as .tar.gz files to reduce the data size and to let users download only the parts they are interested in. The RRC's full size once extracted is ~330 GB. Running MARGE requires an additional ~230 GB for the data in TFRecords format plus the predictions on the validation and test sets.
To reconstruct the RRC, place all .tar.gz files in the same directory, and extract each one. On Unix-based systems, users may enter:
for foo in RRC-HimesEtal2021*.tar.gz; do tar -zxvf "$foo"; done
into a terminal to accomplish this. The same commands are available as an executable script below (build_rrc.sh).
Windows users can use archiving programs like 7-Zip or WinZip to extract the .tar.gz files.
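For users who prefer a platform-independent route, the extraction step can also be sketched in Python with the standard-library tarfile module. This is a minimal sketch, not part of the RRC: the function name extract_archives and its arguments are hypothetical, and it assumes the archives were downloaded from a trusted source into a single directory.

```python
import tarfile
from pathlib import Path


def extract_archives(src_dir=".", dest_dir=".",
                     pattern="RRC-HimesEtal2021*.tar.gz"):
    """Extract every archive in src_dir that matches pattern into dest_dir.

    Hypothetical helper (not shipped with the RRC). Returns the sorted
    list of archive paths that were extracted.
    """
    archives = sorted(Path(src_dir).glob(pattern))
    for archive in archives:
        # "r:gz" opens a gzip-compressed tar archive for reading.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=dest_dir)
    return archives
```

Calling extract_archives() with no arguments from the download directory mirrors the shell loop: it extracts every matching archive into the current directory.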
Himes et al. (2021) present two software packages, released under the Reproducible Research Software License. The Machine learning Algorithm for Radiative transfer of Generated Exoplanets (MARGE) is a Python package that trains a user-specified neural network architecture to approximate a deterministic process, based on data generated by a forward model. The Helper Of My Eternal Retrievals (HOMER) is a Python package that performs Bayesian inverse inference using a MARGE-trained model. For more details, see the user manuals at their GitHub pages.
The training, validation, and test sets are stored in the NumPy binary (NPY) format. Each file contains a 2D array of 64 data vectors, where each vector is made up of the 12 inputs followed by the 6821 outputs. The training set has 2,446,784 cases, the validation set has 689,536 cases, and the test set has 322,112 cases.
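As an illustration of the file layout described above, one data file could be read and split into inputs and outputs roughly as follows. This is a sketch under assumptions: it assumes each row of the array is one data vector (so a file has shape 64 x 6833), and the helper name load_cases is hypothetical rather than part of MARGE or HOMER.

```python
import numpy as np

N_INPUTS = 12     # inputs per data vector, as stated above
N_OUTPUTS = 6821  # spectrum points per data vector, as stated above


def load_cases(npy_path):
    """Load one NPY file and split it into (inputs, outputs) arrays.

    Hypothetical helper; assumes one data vector per row.
    """
    data = np.load(npy_path)
    if data.ndim != 2 or data.shape[1] != N_INPUTS + N_OUTPUTS:
        raise ValueError(f"unexpected array shape {data.shape}")
    inputs = data[:, :N_INPUTS]   # first 12 columns
    outputs = data[:, N_INPUTS:]  # remaining 6821 columns
    return inputs, outputs
```

The shape check is there so that a file with a different layout fails with a clear error instead of being sliced incorrectly without warning.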
The data inputs consist of:
The atmospheric models have 100 log-uniform layers spanning 10⁻⁸–100 bar. Molecular abundances are assumed to be uniform over the range of pressures. Output spectra are computed using the radiative transfer package of the Bayesian Atmospheric Radiative Transfer code (BART; Harrington et al. 2021, submitted to PSJ; Cubillos et al. 2021, submitted to PSJ; Blecic et al. 2021, submitted to PSJ). Each spectrum is in erg s⁻¹ cm⁻¹ and spans 280–7100 cm⁻¹ at a resolution of 1.0 cm⁻¹. For more details, refer to Section 2.1 and Table 1 of Himes et al. (2021).
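The grids described above can be reconstructed in a few lines of NumPy, which also serves as a consistency check on the 6821 outputs per data vector. This is a sketch, not code from BART or MARGE: it assumes both wavenumber endpoints are included, and the ordering of the pressure layers (lowest pressure first) is an arbitrary choice made here.

```python
import numpy as np

# 100 layers evenly spaced in log10(pressure) between 1e-8 and 1e2 bar.
# Ordering (lowest pressure first) is an assumption of this sketch.
pressure = np.logspace(-8, 2, 100)  # bar

# 280 to 7100 cm^-1 inclusive, in steps of 1 cm^-1.
wavenumber = np.arange(280, 7101, dtype=float)  # cm^-1

print(pressure.size, wavenumber.size)
```

With both endpoints included, the wavenumber grid has 7100 - 280 + 1 = 6821 points, which matches the number of outputs in each data vector.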
The research team thanks the NASA Exoplanet Archive for hosting this RRC and gratefully acknowledges the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. This research was supported by the NASA Fellowship Activity under NASA Grant 80NSSC20K0682 and NASA Exoplanets Research Program grant NNX17AB62G.
If you find this useful for your own work, please cite Himes et al. (2021) with the NASA Exoplanet Archive's standard acknowledgement.
Last updated: 25 May 2021