Targeted Query Expansions as a Method for Searching Mixed Quality Digitized Cultural Heritage Documents
archives and records
Digitization of cultural heritage is a huge ongoing effort in many countries. In digitized historical documents, words may occur in different surface forms due to three types of variation: morphological variation, historical variation, and errors in optical character recognition (OCR). Because individual documents may differ significantly from each other in the level of such variation, digitized collections may contain documents of mixed quality. Different types of documents may require different types of retrieval methods. We suggest using targeted query expansions (QE) to access documents in mixed-quality text collections. In QE, the user-given search term is replaced by a set of expansion keys (search words); in targeted QE, the selection of expansion keys is based on the type of surface-level variation occurring in the particular text searched. We illustrate our approach with Finnish, a highly inflectional, compounding language, although such variation occurs across all natural languages. We report a minimal-scale experiment based on the QE method and discuss the need to support targeted QEs in the search interface.
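The idea of targeted QE described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `GENERATORS` table, and all variant rules (a handful of Finnish case endings, a v/w spelling swap, and a few OCR character confusions) are hypothetical assumptions chosen only to show how the type of variation selects which expansion keys replace the search term.

```python
from typing import Callable, Dict, List

# Toy variant generators. Every rule below is an illustrative assumption,
# not an expansion rule taken from the paper.

def morphological_variants(term: str) -> List[str]:
    # Append a few Finnish case endings (hypothetical; ignores stem changes).
    suffixes = ["n", "a", "ssa", "sta"]
    return [term + s for s in suffixes]

def historical_variants(term: str) -> List[str]:
    # Assumed historical spelling rule: "w" written where modern Finnish has "v".
    return [term.replace("v", "w")] if "v" in term else []

def ocr_variants(term: str) -> List[str]:
    # Apply assumed OCR character confusions, one pair at a time.
    confusions = [("l", "1"), ("m", "rn"), ("e", "c")]
    return [term.replace(a, b) for a, b in confusions if a in term]

# Maps the variation type found in the searched text to a generator.
GENERATORS: Dict[str, Callable[[str], List[str]]] = {
    "morphological": morphological_variants,
    "historical": historical_variants,
    "ocr": ocr_variants,
}

def targeted_expansion(term: str, variation_type: str) -> List[str]:
    """Replace a search term with expansion keys for one variation type."""
    variants = GENERATORS[variation_type](term)
    # Keep the original term first and drop duplicates, preserving order.
    return list(dict.fromkeys([term] + variants))

if __name__ == "__main__":
    for kind in GENERATORS:
        print(kind, targeted_expansion("talvi", kind))
```

In this sketch the caller states the variation type explicitly; how that type would be determined for a given document or collection is left open here.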
Series/Report Name or Number
iConference 2015 Proceedings
Type of Resource
Copyright and License Information
Copyright 2015 is held by the authors. Copyright permissions, when appropriate, must be obtained directly from the authors.