Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Datasets and Benchmarks Track
Eirini Angeloudi, Jeroen Audenaert, Micah Bowles, Benjamin M. Boyd, David Chemaly, Brian Cherinka, Ioana Ciucă, Miles Cranmer, Aaron Do, Matthew Grayling, Erin E. Hayes, Tom Hehir, Shirley Ho, Marc Huertas-Company, Kartheik Iyer, Maja Jablonska, Francois Lanusse, Henry Leung, Kaisey Mandel, Rafael Martínez-Galarza, Peter Melchior, Lucas Meyer, Liam Parker, Helen Qu, Jeff Shen, Michael T Smith, Connor Stone, Mike Walmsley, John Wu
We present the Multimodal Universe, a large-scale multimodal dataset of scientific astronomical data, compiled specifically to facilitate machine learning research. Overall, our dataset contains hundreds of millions of astronomical observations, constituting 100 TB of multi-channel and hyper-spectral images, spectra, and multivariate time series, as well as a wide variety of associated scientific measurements and metadata. In addition, we include a range of benchmark tasks representative of standard practices for machine learning methods in astrophysics. This massive dataset will enable the development of large multimodal models specifically targeted towards scientific applications. All code used to compile the dataset and a description of how to access the data are available at https://github.com/MultimodalUniverse/MultimodalUniverse
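As a minimal sketch of how such a dataset might be accessed for machine learning experiments, the snippet below assumes the data is distributed through the Hugging Face `datasets` library and streams a single subset rather than downloading the full 100 TB corpus. The repository and subset identifiers are hypothetical placeholders; consult the GitHub repository above for the actual access instructions.

```python
# Sketch: stream one (hypothetical) Multimodal Universe subset with Hugging Face datasets.
from datasets import load_dataset

# Streaming avoids downloading the full corpus to local disk.
dset = load_dataset(
    "MultimodalUniverse/example_survey",  # placeholder identifier, not a confirmed subset name
    split="train",
    streaming=True,
)

# Inspect a single record: image/spectrum/time-series arrays plus scientific metadata fields.
example = next(iter(dset))
print(example.keys())
```

Streaming access of this kind is one common way to work with corpora of this scale, since individual benchmark tasks typically only require a small, task-specific slice of the data.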