Search solution for newspaper archive handles PDF text with embedded images
FultonHistory.com has scanned more than 150 years of newspapers into searchable PDFs. The archive allows users to search through 34 million online pages dating back to the 18th century. The site covers more than 1.5 million searchable newspapers from the upstate and central New York region, drawn from many different titles.
FultonHistory.com wanted a solution that would allow visitors to search the entire collection, with highlighted hits appearing right on the scanned newspaper images. The newspaper archive chose dtSearch, which lets researchers, historians and students search the database for a keyword and see not only which archived newspapers contain the word but also exactly where it appears on the scanned newspaper image. Contegra Systems integrated the PDF highlighting application and assisted with dtSearch optimization.
Tom Tryniski, architect of FultonHistory.com, says, “I searched high and low for a product that would handle PDF text with embedded images. I converted microfilm of the newspapers to 'hidden text' PDFs, and then used dtSearch to search the resulting PDFs. I used the dtSearch hit-highlighting feature to highlight words seemingly right on the scans. Because of the potential for OCR errors when scanning old newspapers, dtSearch’s fuzzy searching is really important.”
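To illustrate why fuzzy searching matters for OCR text, here is a minimal Python sketch of edit-distance matching. It is not dtSearch's algorithm, which this article does not describe; the function names, the sample text and the one-edit tolerance are all assumptions made for illustration. The word index returned for each hit stands in for the page coordinates that a real hidden-text PDF would supply for highlighting.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_hits(query: str, ocr_text: str, max_edits: int = 1) -> list[tuple[int, str]]:
    """Return (word_index, word) for each OCR word within max_edits of the query.

    Hypothetical helper: a real system would return page coordinates
    from the PDF text layer rather than a word index.
    """
    target = query.lower()
    hits = []
    for index, raw in enumerate(ocr_text.split()):
        word = raw.strip(".,;:!?\"'()").lower()
        if word and edit_distance(target, word) <= max_edits:
            hits.append((index, raw))
    return hits


# Made-up OCR output in which "Titanic" was misread twice.
sample = "The Titanlc sank; the Titanic was lost. Tilanic survivors arrive"
matches = fuzzy_hits("Titanic", sample)
```

With a tolerance of one edit, the misread spellings are found alongside the correct one; with `max_edits=0` only the exact spelling matches, which is the case an exact-match search would be limited to.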
David Thede, president of dtSearch, says, “I did a search for the Titanic and instantly pulled up actual copies of newspapers from the original disaster in 1912. The results even spanned more modern coverage of the discovery of the underwater wreck. To instantly pull up such an historical reservoir is amazing.”