Files in this item

File: An Empirical St ... Information Retrieval.pdf (183kB)
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: An Empirical Study of Tokenization Strategies for Biomedical Information Retrieval
Author(s): Jiang, Jing; Zhai, ChengXiang
Subject(s): bioinformatics; tokenization; information retrieval
Abstract: Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical IR. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical IR test collections with two representative retrieval methods. We also studied the effect of stemming and stop word removal on the retrieval performance. As expected, our experimental results show that tokenization can significantly affect the retrieval accuracy; appropriate tokenization can improve the performance by up to 80%. In particular, it is shown that different query types require different tokenization heuristics, stemming is effective only for certain queries, and stop word removal in general does not improve the retrieval performance in biomedical text.
Issue Date: 2006-05
Genre: Technical Report
Type: Text
URI: http://hdl.handle.net/2142/11207
Other Identifier(s): UIUCDCS-R-2006-2733
Rights Information: You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
Date Available in IDEALS: 2009-04-21

