Files in this item

File: FictionWorkset1.tsv (7MB)
Description: (no description provided)
Format: Unknown (application/octet-stream)

Description

Title: HathiTrust English-language fiction, 1700-1899, workset 0.1.
Author(s): Underwood, Ted
Subject(s): literary history
Machine Learning
English literature
genre classification
Abstract: This workset is data in support of the article "Mapping Mutable Genres in Structurally Complex Volumes," http://arxiv.org/abs/1309.3323. It is a .tsv file containing 32,209 lines, each of which corresponds to a volume in the HathiTrust Digital Library. The first column contains volume identifiers keyed to that library. The next three columns hold probabilities generated by naive Bayes classification: the probability that the volume is written in first person, the probability that it is fiction according to a classifier trained on 18c texts, and the probability that it is fiction according to a classifier trained on 19c texts. The remaining columns hold metadata extracted from MARC records provided by HathiTrust. Many metadata fields are left blank because the information was not available. This workset is immutable documentation for the article described above: it will not change. But please understand that the underlying research is very much in progress, as the workset version number (0.1) is meant to imply. Some volumes included here will turn out not to be fiction. More importantly, many works of English fiction in HathiTrust will not be included here, because of errors in our automated classification or in HathiTrust metadata, or because the works are contained in periodicals or miscellanies that are difficult to classify as wholes. Also note that 32,209 volume records ≠ 32,209 distinct titles. The workset has not been deduplicated; many volumes are reprints or parts of a multivolume work.
Issue Date: 2013-09-16
Citation Info: Underwood, Ted. HathiTrust English-language fiction, 1700-1899, workset 0.1. Tab-separated text file. September 2013.
Genre: Data
Type: Text
Language: English
URI: http://hdl.handle.net/2142/45713
Publication Status: unpublished
Peer Reviewed: not peer reviewed
Sponsor: Work on this project was supported by the Andrew W. Mellon Foundation and by the National Endowment for the Humanities.
Date Available in IDEALS: 2013-09-16
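
The column layout described in the abstract can be read with a few lines of Python. The following is a minimal sketch, not the author's code. It assumes that the file has no header row, that the three probability columns appear in the order the abstract lists them (first person, 18c fiction classifier, 19c fiction classifier), and that every remaining column can be kept as an uninterpreted metadata string. The names `VolumeRecord` and `read_workset` are hypothetical, introduced only for illustration.

```python
import csv
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class VolumeRecord:
    """One line of the workset (field names are illustrative assumptions)."""
    volume_id: str          # HathiTrust volume identifier (column 1)
    p_first_person: float   # assumed column 2
    p_fiction_18c: float    # assumed column 3
    p_fiction_19c: float    # assumed column 4
    metadata: List[str] = field(default_factory=list)  # remaining MARC-derived columns


def read_workset(lines: Iterable[str]) -> Iterator[VolumeRecord]:
    """Yield one VolumeRecord per well-formed tab-separated line."""
    # QUOTE_NONE: treat quotation marks in titles as ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # too short to hold an identifier plus three probabilities
        try:
            probs = [float(value) for value in row[1:4]]
        except ValueError:
            continue  # non-numeric values, e.g. a header row if one exists
        # Blank metadata fields are kept as empty strings, not dropped.
        yield VolumeRecord(row[0], probs[0], probs[1], probs[2], row[4:])
```

To use it on the downloaded file, pass an open file handle, for example `open("FictionWorkset1.tsv", encoding="utf-8", newline="")`; the encoding is an assumption. Because the workset has not been deduplicated, the number of records returned counts volumes, not distinct titles.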

