Files in this item



application/pdf: iConference21_poster580.pdf (667kB)


application/zip: Jiang-The Guten ... st Parallel (492kB)
(no description provided)


Title: The Gutenberg-HathiTrust Parallel Corpus: A Real-World Dataset for Noise Investigation in Uncorrected OCR Texts
Author(s): Jiang, Ming; Hu, Yuerong; Worthey, Glen; Dubnicek, Ryan C.; Capitanu, Boris; Kudeki, Deren; Downie, J. Stephen
Contributor(s): HathiTrust Research Center
Subject(s): Parallel Text Dataset; Optical Character Recognition; Digital Library; Digital Humanities; Data Curation
Abstract: This paper proposes large-scale parallel corpora of English-language publications for exploring how optical character recognition (OCR) errors in the scanned text of digitized library collections affect corpus-based research. We collected data from (1) Project Gutenberg (Gutenberg) for a human-proofread clean corpus and (2) HathiTrust Digital Library (HathiTrust) for an uncorrected OCR-impacted corpus. The two corpora are parallel in content. To our knowledge, this is the first large-scale benchmark dataset intended to evaluate the effects of text noise in digital libraries. In total, we collected and aligned 19,049 pairs of uncorrected OCR-impacted and human-proofread books in six domains published from 1780 to 1993.
Issue Date: 2021-03-17
Genre: Conference Poster
Sponsor: HathiTrust and its member community
Rights Information: Copyright 2021 is held by Ming Jiang, Yuerong Hu, Glen Worthey, Ryan C. Dubnicek, Boris Capitanu, Deren Kudeki, and J. Stephen Downie. Copyright permissions, when appropriate, must be obtained directly from the authors.
Date Available in IDEALS: 2021-03-19