Files in this item



classP-stats.csv.bz2 (39MB), text/csv
Dataset, CSV file


Title: Term Frequencies for 235k Language and Literature Texts
Author(s): Organisciak, Peter
Subject(s): text analysis; digital library
Abstract: Corpus-level term statistics are valuable for many text analysis tasks, such as term weighting and probability distribution smoothing. When the corpus at hand is too small to calculate such statistics reliably, falling back on a general corpus of similar texts is useful. This dataset provides statistics for a collection of 235k books from the HathiTrust that are classified as Language and Literature (i.e., class P in the Library of Congress Classification, LCC). For each term seen in these books, three counts are provided: book frequency, page frequency, and term frequency. Book frequency is the number of books in which the term appears, page frequency is the number of pages on which the term appears, and term frequency is the total number of occurrences of the term across the collection. The data is derived from the holdings of the HathiTrust, using the Extracted Features dataset from the HathiTrust Research Center.
Issue Date: 2016-03
Type: Dataset / Spreadsheet
Date Available in IDEALS: 2016-03-15
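
The uses named in the abstract (term weighting and probability smoothing) can be illustrated with a minimal Python sketch. It is not part of the dataset or its documentation. The column names `term`, `bf`, `pf`, and `tf` are assumptions, since the record does not describe the file's header; check the first row of the CSV and pass the real names. The helper names, the smoothed IDF variant, and the Jelinek-Mercer interpolation are likewise illustrative choices, not the author's stated method.

```python
import bz2
import csv
import math

# Approximate collection size, taken from the "235k books" in the abstract.
N_BOOKS = 235_000


def load_term_stats(path, term_col="term", bf_col="bf", pf_col="pf", tf_col="tf"):
    """Read a bz2-compressed CSV into {term: (book_freq, page_freq, term_freq)}.

    The default column names are hypothetical; adjust them to the real header.
    """
    stats = {}
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            stats[row[term_col]] = (
                int(row[bf_col]),
                int(row[pf_col]),
                int(row[tf_col]),
            )
    return stats


def idf(book_freq, n_books=N_BOOKS):
    """Smoothed inverse document frequency, treating each book as a document."""
    return math.log((1 + n_books) / (1 + book_freq)) + 1.0


def jelinek_mercer(doc_tf, doc_len, corpus_tf, corpus_total, lam=0.5):
    """Interpolate a document's term probability with the corpus-level one.

    lam is the weight on the document model; (1 - lam) goes to the corpus.
    """
    doc_prob = doc_tf / doc_len if doc_len else 0.0
    corpus_prob = corpus_tf / corpus_total if corpus_total else 0.0
    return lam * doc_prob + (1 - lam) * corpus_prob
```

Under these assumptions, `stats = load_term_stats("classP-stats.csv.bz2")` followed by `idf(stats["whale"][0])` would give a book-level weight for a term, and summing the third element of every tuple would give the `corpus_total` needed for smoothing. Loading the whole file into a dictionary keeps the sketch short; for a file of this size, streaming the rows or using a dataframe library may be more practical.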
