Text Correction for Historical Documents

Alan Thomas, Robert Gaizauskas, Valeria Vitale, Michael Pidd, Robert Shoemaker, Haiping Lu

Jul 6, 2023

University of Sheffield Collaborating Faculties: Faculty of Arts & Humanities, Faculty of Engineering, Digital Humanities Institute (DHI)

External Partner: British Library

Overview: This project addresses the critical issue of correcting noisily OCR’d historical documents, focusing on the British Library Newspapers (BLN) collection. BLN is a major corpus spanning over 200 years of scanned British newspapers, drawn from over 240 titles, with textual data, visual data, and metadata available. The scanned newspaper images have undergone OCR (optical character recognition) processing, but degradation of the original documents has left the resulting transcriptions inaccurate. The project aims to employ advanced deep-learning techniques to improve the quality of these transcriptions. The final outputs will be high-quality corrected transcriptions of BLN and open-source code for OCR text correction, both of which will serve as valuable resources for humanities researchers.

Motivation: Since the early 2000s, significant digitisation efforts have been undertaken to preserve and make accessible historical primary sources such as newspapers, early printed books, and handwritten documents. While these efforts have been instrumental in advancing humanities research, the low quality of OCR transcriptions remains a significant barrier to discovering new historical insights. The successful completion of this project promises both short-term and long-term benefits. In the short term, it will significantly enhance the transcription quality of BLN, enabling accurate and efficient searching within the collection and unlocking the potential for text mining, which was previously impractical because of the low transcription quality. In the long term, the project’s success could revolutionise research on other large collections of historical documents, allowing researchers to track content changes, language evolution, and shifts in thought across different time periods. Because the approach is language-independent, its impact could extend to historical documents worldwide, advancing research on a global scale.

Digital Humanities Multimodal AI Foundation Model


Alan Thomas

AI Research Engineer

Robert Gaizauskas

Professor of Computer Science, Co-Director of CDT in Speech and Language Technologies, and Member of the Natural Language Processing (NLP) Research Group

Valeria Vitale

Lecturer at Digital Humanities Institute

Michael Pidd

Director of The Digital Humanities Institute

Robert Shoemaker

Professor of Eighteenth-Century British History

Haiping Lu

Head of AI Research Engineering, Professor of Machine Learning, and Turing Academic Lead