Evaluating pre-trained language modeling approaches for author name disambiguation
Kim, Jenna
This item's files can only be accessed by the System Administrators group.
Permalink
https://hdl.handle.net/2142/125802
Description
- Title
- Evaluating pre-trained language modeling approaches for author name disambiguation
- Author(s)
- Kim, Jenna
- Issue Date
- 2024-07-10
- Director of Research
- Diesner, Jana
- Doctoral Committee Chair(s)
- Diesner, Jana
- Committee Member(s)
- Ludäscher, Bertram
- Torvik, Vetle
- Wang, Haohan
- Department of Study
- Information Sciences
- Discipline
- Information Sciences
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- author name disambiguation
- machine learning
- deep learning
- large language model
- bibliographic data
- Abstract
- Distinguishing between different authors who share the same name, or identifying instances where different names refer to the same individual, remains a persistent challenge in bibliometric research. This complexity impedes accurate cataloging and indexing in digital libraries, affecting the integrity of academic databases and the reliability of scholarship evaluation based on bibliographic data. Although various machine learning (ML) methods have been explored to tackle author name disambiguation (AND), traditional ML methods often fail to capture the subtle linguistic and contextual nuances essential for effective disambiguation. Moreover, while several studies have suggested that neural network models may surpass conventional ML models in AND tasks, the full potential of deep learning (DL) using advanced pre-trained language models like BERT has not been exhaustively examined. This dissertation examines the application of pre-trained language models to AND in scholarly databases and identifies their potential and limitations compared to traditional ML approaches. This is a novel and significant endeavor for improving the accuracy and functionality of digital library systems and bibliometric assessments. Specifically, this research implements and evaluates three pre-trained models (BERT, MiniLM, and MPNet) against traditional ML algorithms and a neural network model across four established datasets and a newly introduced challenging dataset. The models are rigorously assessed using metrics including accuracy, precision, recall, and F1 score, with an emphasis on integrating abstract features from academic texts to enhance model comprehension and performance.
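The pairwise evaluation metrics named above (precision, recall, and F1) can be sketched in a few lines. The helper below and its toy mention IDs are illustrative only, not drawn from the dissertation's code or datasets; it treats AND as a set of predicted "same-author" links compared against gold links.

```python
def pairwise_scores(true_pairs, pred_pairs):
    """Precision, recall, and F1 over same-author mention pairs.

    true_pairs / pred_pairs: sets of frozensets, each holding two
    mention IDs judged (or predicted) to share one real author.
    """
    tp = len(true_pairs & pred_pairs)  # correctly predicted links
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: mentions m1 and m2 are the same "J. Kim";
# the model also (wrongly) links m3 and m4.
truth = {frozenset({"m1", "m2"})}
preds = {frozenset({"m1", "m2"}), frozenset({"m3", "m4"})}
p, r, f = pairwise_scores(truth, preds)
# p = 0.5 (one of two predicted links is correct), r = 1.0
```

Pairwise scoring of this kind is one common way to compare classification- and clustering-based AND systems on a shared footing, since both ultimately induce a set of same-author links.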
The findings confirm that pre-trained language models significantly outperform traditional approaches, particularly in recall and F1 scores, demonstrating their ability to handle the complex linguistic patterns and contextual cues vital for accurately differentiating between authors with similar names. Incorporating abstract text features further boosts model performance, highlighting the critical role of semantic context in AND tasks. This dissertation contributes novel insights in several key areas. First, it pioneers the application of state-of-the-art pre-trained language models to AND tasks, providing a comparative analysis with conventional ML and neural network approaches. Second, it broadens the features employed in existing studies by including abstract texts alongside the metadata records typically used in current AND research. Third, it integrates the workflows of various high-performing ML and DL methods for classification and clustering into an open-source framework, making the implementation of different AND methods transparent and facilitating direct comparisons across methods. The publicly available code and dataset serve as a benchmark framework, enabling AND researchers to validate this study and develop new models more efficiently and with fewer errors. The dissertation discusses the procedural benefits of using pre-trained language models, such as the reduced need for manual feature extraction and selection, while also addressing implementation challenges, including substantial computational demands and the need for greater transparency in the decision-making processes involved in implementing AND methods. Future work, as detailed in the Conclusion, will focus on optimizing these models for greater efficiency, enhancing their interpretability, and extending their application to multilingual datasets to increase their global applicability.
This research advances the technological framework for AND and improves the reliability of bibliographic data. The dissertation strongly advocates for adopting pre-trained language models to resolve the complexities of AND, marking a significant advancement over traditional methods and paving the way for more sophisticated, context-aware computational solutions in the academic field. By elucidating the similarities and differences between various ML- and DL-based AND approaches, this research enhances the robustness of findings that resolve author name ambiguity, thus supporting more accurate scientific analysis and decision-making based on bibliographic data.
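The classification-and-clustering workflow described in the abstract can be illustrated with a minimal blocking-plus-linking sketch. The union-find grouping, the toy records, and the affiliation-equality classifier below are illustrative stand-ins for the trained pairwise models and datasets the dissertation actually evaluates; a real system would plug a learned classifier (e.g. one built on BERT-style embeddings) into `same_author`.

```python
from itertools import combinations

def cluster_mentions(mentions, same_author, block_key):
    """Group author mentions into author clusters.

    mentions:    dict mapping mention ID -> record (any object)
    same_author: pairwise classifier, True if two records match
    block_key:   cheap blocking key (e.g. normalized surname + initial)
    """
    parent = {m: m for m in mentions}

    def find(x):  # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare only mentions that share a block, then merge matches.
    blocks = {}
    for mid, rec in mentions.items():
        blocks.setdefault(block_key(rec), []).append(mid)
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if same_author(mentions[a], mentions[b]):
                parent[find(a)] = find(b)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), set()).add(m)
    return list(clusters.values())

# Toy records: (name block, affiliation). m1/m2 should merge;
# m4 shares an affiliation with m1 but sits in a different block.
mentions = {
    "m1": ("kim j", "UIUC"),
    "m2": ("kim j", "UIUC"),
    "m3": ("kim j", "KAIST"),
    "m4": ("lee s", "UIUC"),
}
clusters = cluster_mentions(mentions,
                            same_author=lambda a, b: a[1] == b[1],
                            block_key=lambda r: r[0])
```

Blocking keeps the number of pairwise classifier calls tractable on large bibliographic databases, since only mentions within the same name block are ever compared.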
- Graduation Semester
- 2024-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/125802
- Copyright and License Information
- Copyright 2024 Jenna Kim
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)