Files in this item

File: 2pt6_Organisciak-Access.pdf (882kB)
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Access to Billions of Pages for Large-Scale Text Analysis
Author(s): Organisciak, Peter; Capitanu, Boris; Underwood, Ted; Downie, J. Stephen
Subject(s): Non-consumptive research
Feature extraction
Large-scale text analysis
Datasets
Text mining
Abstract: Consortial collections have led to unprecedented scales of digitized corpora, but the insights that they enable are hampered by the complexities of access, particularly to in-copyright or orphan works. Pursuing a principle of non-consumptive access, we developed the Extracted Features (EF) dataset, a dataset of quantitative counts for every page of nearly 5 million scanned books. The EF includes unigram counts, part of speech tagging, header and footer extraction, counts of characters at both sides of the page, and more. Distributing book data with features already extracted saves resource costs associated with large-scale text use, improves the reproducibility of research done on the dataset, and opens the door to datasets on copyrighted books. We describe the coverage of the dataset and demonstrate its useful application through duplicate book alignment and identification of their cleanest scans, topic modeling, word list expansion, and multifaceted visualization.
Issue Date: 2017
Publisher: iSchools
Citation Info: Organisciak, P., Capitanu, B., Underwood, T. & Downie, J. S. (2017). Access to Billions of Pages for Large-Scale Text Analysis. In iConference 2017 Proceedings, Vol. 2 (pp. 66-76). https://doi.org/10.9776/17014
Series/Report: iConference 2017 Proceedings Vol. 2
Genre: Conference Paper / Presentation
Type: Text
Language: English
URI: http://hdl.handle.net/2142/98873
DOI: https://doi.org/10.9776/17014
Rights Information: Copyright 2017 is held by the authors.
Date Available in IDEALS: 2017-12-05

