Evaluating pre-trained language modeling approaches for author name disambiguation
Kim, Jenna
This item's files can only be accessed by the System Administrators group.
Permalink
https://hdl.handle.net/2142/125802
Description
- Title
- Evaluating pre-trained language modeling approaches for author name disambiguation
- Author(s)
- Kim, Jenna
- Issue Date
- 2024-07-10
- Director of Research
- Diesner, Jana
- Doctoral Committee Chair(s)
- Diesner, Jana
- Committee Member(s)
- Ludäscher, Bertram
- Torvik, Vetle
- Wang, Haohan
- Department of Study
- Information Sciences
- Discipline
- Information Sciences
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- author name disambiguation
- machine learning
- deep learning
- large language model
- bibliographic data
- Abstract
- Distinguishing between different authors who share the same name, or identifying instances where different names refer to the same individual, remains a persistent challenge in bibliometric research. This complexity impedes accurate cataloging and indexing in digital libraries, affecting the integrity of academic databases and the reliability of scholarship evaluation based on bibliographic data. Although various machine learning (ML) methods have been explored to tackle author name disambiguation (AND), traditional ML methods often fail to capture the subtle linguistic and contextual nuances essential for effective disambiguation. Moreover, while several studies have suggested that neural network models may surpass conventional ML models in AND tasks, the full potential of deep learning (DL) using advanced pre-trained language models like BERT has not been exhaustively examined. This dissertation examines the application of pre-trained language models to AND in scholarly databases and identifies their potential and limitations compared to traditional ML approaches. This is a novel and significant endeavor for improving the accuracy and functionality of digital library systems and bibliometric assessments. Specifically, this research implements and evaluates three pre-trained models (BERT, MiniLM, and MPNet) against traditional ML algorithms and a neural network model across four established datasets and a newly introduced challenging dataset. The models are rigorously assessed using metrics including accuracy, precision, recall, and F1 score, with an emphasis on integrating abstract features from academic texts to enhance model comprehension and performance.
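The pairwise evaluation metrics named above (precision, recall, and F1) can be sketched in a few lines. The helper below and its toy mention IDs are illustrative only, not drawn from the dissertation's code or datasets; it treats AND as a set of predicted "same-author" links compared against gold links.

```python
def pairwise_scores(true_pairs, pred_pairs):
    """Precision, recall, and F1 over same-author mention pairs.

    true_pairs / pred_pairs: sets of frozensets, each holding two
    mention IDs judged (or predicted) to share one real author.
    """
    tp = len(true_pairs & pred_pairs)  # correctly predicted links
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: mentions m1 and m2 are the same "J. Kim";
# the model also (wrongly) links m3 and m4.
truth = {frozenset({"m1", "m2"})}
preds = {frozenset({"m1", "m2"}), frozenset({"m3", "m4"})}
p, r, f = pairwise_scores(truth, preds)
# p = 0.5 (one of two predicted links is correct), r = 1.0
```

Pairwise scoring of this kind is one common way to compare classification- and clustering-based AND systems on a shared footing, since both ultimately induce a set of same-author links.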
The findings confirm that pre-trained language models significantly outperform traditional approaches, particularly in recall and F1 scores, demonstrating their ability to handle the complex linguistic patterns and contextual cues vital for accurately differentiating between authors with similar names. Incorporating abstract text features further boosts model performance, highlighting the critical role of semantic context in AND tasks. This dissertation contributes novel insights in several key areas. First, it pioneers the application of state-of-the-art pre-trained language models to AND tasks, providing a comparative analysis with conventional ML and neural network approaches. Second, it broadens the features employed in existing studies by including abstract texts alongside the metadata records typically used in current AND research. Third, it integrates the workflows of various high-performing ML and DL methods for classification and clustering into an open-source framework, making the implementation of different AND methods transparent and facilitating direct comparisons across methods. The publicly available code and dataset serve as a benchmark framework, enabling AND researchers to validate this study and develop new models more efficiently and with fewer errors. The dissertation discusses the procedural benefits of using pre-trained language models, such as the reduced need for manual feature extraction and selection, while also addressing implementation challenges, including substantial computational demands and the need for greater transparency in the decision-making processes involved in implementing AND methods. Future work, as detailed in the Conclusion, will focus on optimizing these models for greater efficiency, enhancing their interpretability, and extending their application to multilingual datasets to increase their global applicability.
This research advances the technological framework for AND and improves the reliability of bibliographic data. The dissertation strongly advocates for adopting pre-trained language models to resolve the complexities of AND, marking a significant advancement over traditional methods and paving the way for more sophisticated, context-aware computational solutions in the academic field. By elucidating the similarities and differences between various ML- and DL-based AND approaches, this research enhances the robustness of findings that resolve author name ambiguity, thus supporting more accurate scientific analysis and decision-making based on bibliographic data.
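The classification-and-clustering workflow described in the abstract can be illustrated with a minimal blocking-plus-linking sketch. The union-find grouping, the toy records, and the affiliation-equality classifier below are illustrative stand-ins for the trained pairwise models and datasets the dissertation actually evaluates; a real system would plug a learned classifier (e.g. one built on BERT-style embeddings) into `same_author`.

```python
from itertools import combinations

def cluster_mentions(mentions, same_author, block_key):
    """Group author mentions into author clusters.

    mentions:    dict mapping mention ID -> record (any object)
    same_author: pairwise classifier, True if two records match
    block_key:   cheap blocking key (e.g. normalized surname + initial)
    """
    parent = {m: m for m in mentions}

    def find(x):  # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare only mentions that share a block, then merge matches.
    blocks = {}
    for mid, rec in mentions.items():
        blocks.setdefault(block_key(rec), []).append(mid)
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if same_author(mentions[a], mentions[b]):
                parent[find(a)] = find(b)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), set()).add(m)
    return list(clusters.values())

# Toy records: (name block, affiliation). m1/m2 should merge;
# m4 shares an affiliation with m1 but sits in a different block.
mentions = {
    "m1": ("kim j", "UIUC"),
    "m2": ("kim j", "UIUC"),
    "m3": ("kim j", "KAIST"),
    "m4": ("lee s", "UIUC"),
}
clusters = cluster_mentions(mentions,
                            same_author=lambda a, b: a[1] == b[1],
                            block_key=lambda r: r[0])
```

Blocking keeps the number of pairwise classifier calls tractable on large bibliographic databases, since only mentions within the same name block are ever compared.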
- Graduation Semester
- 2024-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/125802
- Copyright and License Information
- Copyright 2024 Jenna Kim
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)