Text Correction for Historical Documents
University of Sheffield Collaborating Faculties: Faculty of Arts & Humanities, Faculty of Engineering, Digital Humanities Institute (DHI)
External Partner: British Library
Related Links: https://www.dhi.ac.uk/text-correction-for-mining-historical-documents/
Overview: This project addresses the problem of correcting noisy OCR output for historical documents, focusing on the British Library Newspapers (BLN) collection. BLN is a major corpus of scanned British newspapers spanning more than 200 years and drawn from over 240 newspapers, with textual data, visual data, and metadata available. The scanned newspaper images have undergone OCR (optical character recognition) processing, but the degradation of the original documents has resulted in inaccurate transcriptions. The project aims to employ advanced deep-learning techniques to improve the quality of these transcriptions. The final outputs will be high-quality corrected transcriptions of BLN and open-source code for OCR text correction, both of which should serve as valuable resources for humanities researchers.
Motivation: Since the early 2000s, significant digitisation efforts have been undertaken to preserve historical primary sources such as newspapers, early printed books, and handwritten documents, and to make them accessible. While these efforts have been instrumental in advancing humanities research, the low quality of OCR transcriptions remains a significant barrier to discovering new historical insights. The successful completion of this project promises both short-term and long-term benefits. In the short term, it will significantly enhance the transcription quality of BLN, enabling accurate and efficient searching within the collection and unlocking the potential for text mining, which was previously impractical due to low transcription quality. In the long term, the project’s success could transform research on other large collections of historical documents, allowing researchers to track content changes, language evolution, and shifts in thought across different time periods. Because the approach is language-independent, its impact could extend to historical documents worldwide, advancing research on a global scale.