Disambiguating academic institution names: a comprehensive study of authority files, linguistic variations, and computational evaluation in PubMed affiliations
Guan, Yingjun
This item's files can only be accessed by the System Administrators group.
Permalink
https://hdl.handle.net/2142/130147
Description
Title
Disambiguating academic institution names: a comprehensive study of authority files, linguistic variations, and computational evaluation in PubMed affiliations
Author(s)
Guan, Yingjun
Issue Date
2025-07-11
Director of Research (if dissertation) or Advisor (if thesis)
Torvik, Vetle I
Doctoral Committee Chair(s)
Torvik, Vetle I
Committee Member(s)
Downie, Stephen
Ludäscher, Bertram
Renear, Allen
Department of Study
Information Sciences
Discipline
Information Sciences
Degree Granting Institution
University of Illinois Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
Natural Language Processing
Institution Name Disambiguation
Data Mining
Text Mining
Authority Control
Abstract
Accurate representation of academic institution names is critical for bibliometric analysis, scholarly communication, and digital library systems. However, variations in naming conventions, institutional hierarchies, and multilingual expressions pose significant challenges for consistent affiliation metadata. This dissertation presents a comprehensive investigation into Institution Name Disambiguation (IND), with a focus on authority files, linguistic analysis, and empirical evaluation using PubMed affiliation data.
First, the study reviews the conceptual foundations of institutional name ambiguity and examines 21 prominent authority files, including VIAF, ROR, and Wikidata. A new integrated authority dataset is developed to enhance institutional name standardization. Second, a manually annotated dataset is created from real-world PubMed affiliation records to capture synonym patterns, structural inconsistencies, and user-written variations. Third, the dissertation evaluates both the coverage of the authority files and the performance of computational tools for institutional name recognition and disambiguation across multiple metrics.
Key contributions include a structured evaluation of authority file quality, benchmark datasets for institutional name analysis, InsVar and AffiNorm, and experimental insights into best practices for combining authority control with computational methods. The findings have broad implications for improving data integrity in academic publishing, digital knowledge infrastructures, and citation indexing systems.