The Newspaper Navigator project at the Library of Congress finds and extracts visual content from 16 million pages found in Chronicling America. Although LC is closed to the public, work continues behind the scenes on projects just like this.
Newspaper Navigator Project
Innovator in Residence Ben Lee trained a machine-learning model to identify visual content in historic newspapers at Chronicling America. Lee said,
The 16+ million historical newspapers within Chronicling America are fascinating to me on so many levels. They are a portal back in time and reveal the rich history of the United States in a way that is unique to historic newspapers, from local histories to fun advertisements.
But what excites me most about Chronicling America is how it reaches such a wide range of the American public, including school groups, genealogists, journalists, local historians, researchers, and even people looking to recreate old cooking recipes!
With Newspaper Navigator, I have been fortunate to be able to build on the work of Beyond Words volunteers who identified visual content in World War I-era newspapers; using the annotations volunteers created, I have trained a machine learning model to identify visual content in historic newspapers.
In the first phase of Newspaper Navigator, I have used this model to find and extract all of the visual content included in over 16 million pages in Chronicling America. The resulting dataset, which I am calling the Newspaper Navigator dataset, includes headlines, photographs, illustrations, maps, comics, editorial cartoons, and advertisements.
A doctoral student in Computer Science and Engineering at the University of Washington, Lee was drawn to this project by his own family history.
My interest in using computer science to help identify and analyze images in digitized cultural heritage collections stems from my time as a Digital Humanities Associate Fellow at the United States Holocaust Memorial Museum (USHMM).
My grandmother is a survivor of Auschwitz-Birkenau concentration camp. When I first visited the USHMM with her in 2007 as a teenager, a reference librarian told us about the International Tracing Service, one of the largest repositories of documents pertaining to the Holocaust. A decade later, while a fellow at the USHMM, I learned about the idiosyncrasies and difficulties of manually searching the digital archive.
As a result, I developed a project to aid the search process by using machine learning to identify different types of documents in the archive. Read more about this work here. The idea of facilitating access to meaningful materials has stuck with me. It has been a large motivation in pursuing this line of research with cultural heritage collections with both the Library of Congress and the University of Washington.
Using the Newspaper Navigator Project Dataset
A compilation of extracted images from the Newspaper Navigator dataset (Courtesy, Library of Congress)
Nancy E. Loe, MA, MLS, is a genealogy researcher and educator. After a long career in libraries and archives, Nancy now writes and lectures on her specializations: organizing research and U.S. and European records. She appears frequently at regional, national, and international genealogy conferences. She recently completed two Salt Lake Institute of Genealogy classes on Nordic research and reading German handwriting and Fraktur.