Text Correction for Historical Documents

University of Sheffield Collaborating Faculties: Faculty of Arts & Humanities, Faculty of Engineering, Digital Humanities Institute (DHI)

External Partner: British Library

Related Links: https://www.dhi.ac.uk/text-correction-for-mining-historical-documents/

Overview: This project addresses the critical issue of correcting noisy optical character recognition (OCR) transcriptions of historical documents, focusing on the British Library Newspapers (BLN) collection. BLN is a major corpus of scanned British newspapers spanning over 200 years and more than 240 titles, with textual data, visual data, and metadata available. The scanned newspaper images have undergone OCR processing, which has produced inaccurate transcriptions because of the degradation of the original documents. The project aims to employ advanced deep-learning techniques to improve the quality of these transcriptions. The final outputs will be high-quality corrected transcriptions of BLN and open-source code for OCR text correction, both of which will serve as valuable resources for humanities researchers.

Motivation: Since the early 2000s, significant digitisation efforts have been undertaken to preserve and make accessible historical primary sources such as newspapers, early printed books, and handwritten documents. While these efforts have been instrumental in advancing humanities research, the low quality of OCR transcriptions remains a significant barrier to discovering new historical insights. The successful completion of this project promises both short-term and long-term benefits. In the short term, it will significantly enhance the transcription quality of BLN, enabling accurate and efficient searching within the collection and unlocking the potential for text mining, which was previously impractical because of low transcription quality. In the long term, the project’s success could revolutionise research on other large collections of historical documents, allowing researchers to track content changes, language evolution, and shifts in thought across different time periods. Because the approach is language-independent, its impact could extend to historical documents worldwide, advancing research on a global scale.

Alan Thomas
AI Research Engineer
Robert Gaizauskas
Professor of Computer Science, Co-Director of CDT in Speech and Language Technologies, and Member of the Natural Language Processing (NLP) Research Group
Valeria Vitale
Lecturer at Digital Humanities Institute
Michael Pidd
Director of The Digital Humanities Institute
Robert Shoemaker
Professor of Eighteenth-Century British History
Haiping Lu
Head of AI Research Engineering, Professor of Machine Learning, and Turing Academic Lead