Files in this item



LUCIC-DISSERTATION-2017.pdf (application/pdf, 2MB)


Title: Automatically identifying facet roles from comparative structures to support biomedical text summarization
Author(s): Lucic, Ana
Director of Research: Blake, Catherine Lesley
Doctoral Committee Chair(s): Blake, Catherine Lesley
Doctoral Committee Member(s): Girju, Corina Roxana; Efron, Miles; Renear, Allen H.; Downie, J. Stephen
Department / Program: Information Sciences
Discipline: Library & Information Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Comparison sentences; Natural language processing; Text mining; Text summarization; Information extraction
Abstract: Within the context of biomedical scholarly articles, comparison sentences represent a rhetorical structure commonly used to communicate findings. More generally, comparison sentences are rich with information about how the properties of one or more entities relate to one another. So far, in the biomedical domain, the emphasis has been on recognizing comparative sentences in the text. This dissertation goes beyond sentence-level recognition and aims to automate the identification of the integral parts of a comparison sentence, which are called comparative facets. The facets are the two compared entities, the basis (or endpoint) of the comparison, and the result, that is, the relationship that binds the entities and the basis. Only sentences that contain all four facets are of interest in this thesis. With respect to the first compared entity, the system achieves an average F1, on a random sample of sentences, of 0.65 on short sentences (11 to 21 words), 0.70 on medium sentences (22 to 28 words), 0.60 on long sentences (29 to 36 words), and 0.60 on very long sentences (more than 36 words). With respect to the prediction of the basis of comparison (the endpoint), the average F1 was 0.66 on short, 0.57 on medium, 0.56 on long, and 0.50 on very long sentences. The average F1 achieved with respect to the second compared entity was 0.91 on short, 0.85 on medium, 0.81 on long, and 0.72 on very long sentences. In the area of semantic relation identification, performance was also sensitive to sentence length: the average F1 was 0.80 on short sentences and 0.71, 0.56, and 0.51 on medium, long, and very long sentences, respectively. Thus, the methods developed in this dissertation work better on shorter sentences (<= 28 words) and on sentences that do not contain multiple claims or disjunctive conjunctions.
When applied to a previously unseen collection of breast cancer articles, the performance achieved in identifying the compared entities and the endpoint was comparable to the results achieved on the collection that was used for building and testing the models. This result is promising with respect to the potential for applying this model to other collections of scholarly articles in the biomedical sciences.
Issue Date: 2017-06-26
Rights Information: Copyright 2017 Ana Lucic
Date Available in IDEALS: 2017-09-29
Date Deposited: 2017-08