Harvard Releases a Dataset of Nearly One Million Books Spanning 394 Million Scanned Pages

June 20, 2025

IBL News | New York

Harvard University has released Institutional Books 1.0, a dataset of library books for researchers that spans roughly 394 million scanned pages, according to the AP.

These materials, preserved and organized by generations of librarians, comprise nearly one million books in 254 languages, dating back to the 15th century.

The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law, and agriculture.

Supported financially by Microsoft and OpenAI, the maker of ChatGPT, the Harvard-based Institutional Data Initiative is collaborating with libraries and museums worldwide on how to prepare their collections for AI.

“Librarians have always been the stewards of data and the stewards of information,” said Aristana Scourtas, who manages research at Harvard Law School’s Library Innovation Lab.

The dataset was shared this month on Hugging Face, a platform that hosts datasets and open-source AI models that anyone can download.
