Files in this item

File: classP-idf.csv.bz2 (37MB)
Description: (no description provided)
Format: CSV file (text/csv)

Description

Title: Term Weights for 235k Language and Literature Texts
Author(s): Organisciak, Peter
Subject(s): data, text analysis, digital library
Abstract: A popular form of term weighting for texts is TF*IDF, which takes a text's term frequencies and weights them by Inverse Document Frequency (IDF), a measure derived from document frequency. This dataset provides IDF weights for terms in 235k books from the HathiTrust that are classified as Language and Literature (i.e., class P in the Library of Congress Classification, LCC). For each term seen in these books, inverse book frequency and inverse page frequency are provided. Book frequency is the number of books in which the term occurs; page frequency is the number of pages that contain the term. This data is derived from the holdings of the HathiTrust, using the Extracted Features dataset from the HathiTrust Research Center.
Issue Date: 2016-03
Genre: Data
Type: Dataset / Spreadsheet
Language: English
URI: http://hdl.handle.net/2142/89691
Date Available in IDEALS: 2016-03-17
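
The TF*IDF weighting described in the abstract can be illustrated with a minimal sketch. It assumes the common log(N / df) form of IDF; the record does not state which IDF variant or which column names the dataset uses, so the function names and all counts below are hypothetical and are not taken from classP-idf.csv.bz2.

```python
import math


def idf(doc_count: int, doc_freq: int) -> float:
    """Inverse document frequency, assuming the log(N / df) variant."""
    if doc_freq <= 0 or doc_freq > doc_count:
        raise ValueError("doc_freq must be between 1 and doc_count")
    return math.log(doc_count / doc_freq)


def tf_idf(term_freqs: dict[str, int], idf_weights: dict[str, float]) -> dict[str, float]:
    """Weight each term frequency by its IDF; terms with no known weight are skipped."""
    return {
        term: tf * idf_weights[term]
        for term, tf in term_freqs.items()
        if term in idf_weights
    }


# Hypothetical counts for illustration only (not values from the dataset).
N_BOOKS = 235_000
book_freq = {"the": 235_000, "whale": 2_350}

# Inverse book frequency: a term found in every book gets weight 0.
weights = {term: idf(N_BOOKS, bf) for term, bf in book_freq.items()}

# Term frequencies for one hypothetical text.
scores = tf_idf({"the": 120, "whale": 3}, weights)
```

Under this assumption, a term that appears in every book contributes nothing to the weighted score however often it occurs in a text, while a rarer term is boosted. The same calculation applies to page frequency by substituting a page count for the book count.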

