Files in this item



classP-idf.csv.bz2 (37MB), text/csv, CSV file
(no description provided)


Title: Term Weights for 235k Language and Literature Texts
Author(s): Organisciak, Peter
Subject(s): data, text analysis, digital library
Abstract: A popular form of term weighting for texts is TF*IDF, which takes a text's term frequencies and weights them by Inverse Document Frequency (IDF), a measure derived from document frequency. This dataset provides IDF weights for terms in 235k books from the HathiTrust that are classified as Language and Literature (i.e., class P in the Library of Congress Classification, LCC). For each term seen in these books, inverse book frequency and inverse page frequency are provided. Book frequency is the number of books in which the term occurs; page frequency is the number of pages on which it occurs. The data is derived from the holdings of the HathiTrust, using the Extracted Features dataset from the HathiTrust Research Center.
Issue Date: 2016-03
Type: Dataset / Spreadsheet
Date Available in IDEALS: 2016-03-17
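
The weighting described in the abstract can be illustrated with a minimal sketch. It assumes the common formulation IDF = log(N / df), where N is the number of units (books or pages) and df is the number of units containing the term; the record does not state which IDF variant the dataset uses, so the formula is an assumption. The function names and the toy corpus are hypothetical and do not reflect the columns of the CSV file.

```python
import math
from collections import Counter

def inverse_frequencies(unit_freq, n_units):
    """Map each term to log(n_units / df), one common IDF variant (assumed here)."""
    return {term: math.log(n_units / df) for term, df in unit_freq.items()}

def tf_idf(tokens, idf):
    """Weight a text's raw term counts by IDF, skipping terms that have no weight."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}

# Toy corpus standing in for the real data: three "books", each a list of tokens.
books = [
    ["whale", "sea", "ship"],
    ["sea", "island"],
    ["ship", "sea", "sea"],
]

# Book frequency: the number of books in which each term occurs at least once.
book_freq = Counter(term for book in books for term in set(book))

idf = inverse_frequencies(book_freq, len(books))
weights = tf_idf(books[2], idf)
```

In this toy corpus, "sea" occurs in every book and so receives a weight of zero, while rarer terms such as "whale" receive higher weights. Inverse page frequency follows from the same function by counting pages instead of books.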
