From 19th-century classics to early 20th-century publications, 983,000 books—totaling 386 million pages—have finally been released as structured data. The massive text dataset Institutional Books, published by the Harvard Law School Library, not only opens new doors for academic research but also symbolizes the convergence of cultural heritage and artificial intelligence.
What is “Institutional Books”?
Released in 2025 by the Harvard Law School Library, Institutional Books is a scholarly dataset created by extracting and refining text from Google Books scan data.
Key features include:
- Number of books: 983,000
- Total pages: Approx. 386 million
- Languages: 254 (English 43%, followed by German, French, etc.)
- Time period: Primarily books from the 1800s to early 1900s
- Fields covered: Literature (24%), Law (13%), Philosophy & Religion (12%), Science (11%), and more
The OCR-processed texts are optimized for use in machine learning and natural language processing and are available for non-commercial use.
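To give the headline figures some scale, here is a rough back-of-the-envelope calculation in Python. It uses only the rounded numbers quoted above, so the results are approximations, and the "other" bucket is simply whatever the four listed fields leave over; the variable and function names are illustrative.

```python
# Rounded figures quoted in the article; not exact dataset statistics.
BOOKS = 983_000
PAGES = 386_000_000
FIELD_SHARE = {
    "Literature": 0.24,
    "Law": 0.13,
    "Philosophy & Religion": 0.12,
    "Science": 0.11,
}


def pages_per_book(pages: int, books: int) -> float:
    """Average number of pages per volume."""
    return pages / books


def books_by_field(books: int, shares: dict[str, float]) -> dict[str, int]:
    """Approximate volume count for each field from its percentage share."""
    return {field: round(books * share) for field, share in shares.items()}


avg_pages = pages_per_book(PAGES, BOOKS)
field_counts = books_by_field(BOOKS, FIELD_SHARE)
# Volumes not covered by the four fields named in the article.
other_count = BOOKS - sum(field_counts.values())

print(f"Average pages per book: about {avg_pages:.0f}")
for field, count in field_counts.items():
    print(f"{field}: about {count:,} books")
print(f"All other fields: about {other_count:,} books")
```

On these rounded inputs the average comes out at a little under 400 pages per volume, and the four named fields account for roughly 60% of the collection.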
Why Is This Project Important?
- Building Academic Infrastructure
This project directly addresses the challenge of digitizing primary sources in academic research. Structuring historical texts as data will accelerate studies across disciplines such as digital humanities, law, linguistics, and the history of ideas.
- A Bridge Between Public Knowledge and AI
In this initiative, Google’s vast body of scanned books is organized and made publicly available by a university, which offers a potential new model for the “public use of AI.” It shows how private technology companies and public academia can collaborate in service of collective knowledge.
- Rediscovering the Past Through Digitization
The project also re-examines each book's language classification using the OCR text. This has led to discoveries such as books previously labeled as Latin that actually contain a substantial amount of French.
Such examples suggest how machine processing and humanistic expertise can work together to reinterpret written works.
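As a toy illustration of how such a mismatch might be spotted, the Python sketch below guesses a language for each page from a handful of common function words and flags a volume when enough pages disagree with its catalog label. This is not the project's actual method: the word lists, the function names, and the 20% threshold are all assumptions made for the example, and a real pipeline would rely on a trained language-identification model.

```python
import re
from collections import Counter

# Tiny illustrative function-word lists (assumed for this example only).
STOPWORDS = {
    "latin": {"ad", "cum", "quod", "sed", "ut", "atque", "enim", "autem", "sunt", "esse"},
    "french": {"le", "la", "les", "des", "que", "dans", "une", "pour", "avec", "nous"},
}


def guess_language(text: str) -> str:
    """Return the language whose function words occur most often, or 'unknown'."""
    words = re.findall(r"[a-zà-ÿ]+", text.lower())
    hits = {
        lang: sum(1 for word in words if word in stops)
        for lang, stops in STOPWORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "unknown"


def page_language_mix(pages: list[str]) -> Counter:
    """Count how many pages were guessed as each language."""
    return Counter(guess_language(page) for page in pages)


def flag_mixed(catalog_lang: str, pages: list[str], threshold: float = 0.2) -> bool:
    """Flag a volume if the share of pages in another language reaches the threshold."""
    mix = page_language_mix(pages)
    total = sum(mix.values())
    if total == 0:
        return False
    # Pages that are neither in the catalog language nor undetermined.
    other = total - mix[catalog_lang] - mix["unknown"]
    return other / total >= threshold


# Hypothetical three-page volume catalogued as Latin.
pages = [
    "Gallia est omnis divisa in partes tres, sed aliam Aquitani incolunt.",
    "Le roi et la reine sont dans le château avec les enfants.",
    "Quod erat demonstrandum, ut supra dictum est cum ratione.",
]
print(page_language_mix(pages))
print(flag_mixed("latin", pages))
```

A flag raised this way would only be a prompt for a human expert to take a closer look, which is the kind of cooperation between machine processing and humanistic judgment described above.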
Challenges and Future Directions
- Copyright Barriers
Although the dataset draws on Google Books scans, modern works and translated publications still raise copyright concerns. Restricting the dataset to non-commercial use reflects a cautious approach to these legal issues.
- Bias and Eurocentrism in the Data
The language distribution reveals a Western European focus: English, German, French, Italian, and so on. Moving forward, incorporating texts from Asia and Africa, along with sources that document colonial histories, will be vital for a more inclusive body of knowledge.
How Do We Read the Future of Books?
Institutional Books marks the moment when printed volumes gain “textual life” in digital form.
In an era where AI can interpret narratives, learn legal reasoning, and read poetry, how should we preserve and expand the legacy of books as cultural heritage? This is not just a technical question—it challenges us to rethink the future of the humanities.
How we use this dataset depends on our imagination. Now that this vast inheritance of human knowledge has been released into the digital sea, it is not only researchers, but also developers, educators, and readers themselves, who must begin writing the next chapter.