Identifying undiagnosed patients with rare genetic aortopathies using open-source large language models
Yang, Ze
Permalink
https://hdl.handle.net/2142/129587
Description
Title
Identifying undiagnosed patients with rare genetic aortopathies using open-source large language models
Author(s)
Yang, Ze
Issue Date
2025-04-28
Director of Research (if dissertation) or Advisor (if thesis)
Kindratenko, Volodymyr
Department of Study
Electrical & Computer Eng
Discipline
Electrical & Computer Engr
Degree Granting Institution
University of Illinois Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Large Language Model
Medical Diagnosis
Abstract
Rare genetic aortopathies are frequently missed in clinical practice due to their phenotypic heterogeneity. Although timely genetic testing can prevent catastrophic cardiovascular events, current diagnostic pathways rely on primary care physicians to recognize subtle clinical indicators and initiate referrals. This dependency often leads to missed or delayed diagnoses, particularly in patients with atypical presentations. Broader and more systematic approaches are needed to identify at-risk individuals who fall outside conventional diagnostic patterns.
Free-text clinical notes offer detailed, unstructured insights into a patient’s history that are often overlooked in automated systems. Given the ability of large language models (LLMs) to process unstructured text, we developed an open-source LLM-based pipeline that recommends genetic testing for rare aortopathies based on patient progress notes. The pipeline uses retrieval-augmented generation (RAG) with a curated corpus of aortopathy-related knowledge to improve prediction accuracy, especially in ambiguous cases.
We validated the pipeline using 22,510 notes from 500 individuals (250 diagnosed cases and 250 controls) in the Penn Medicine BioBank (PMBB). The model correctly identified 425 out of 499 patients, achieving a recommendation accuracy of 0.852, precision of 0.889, recall of 0.803, F1 score of 0.844, and F3 score of 0.811. Our results show that LLMs can effectively analyze clinical notes to recommend genetic testing, enabling earlier detection of rare genetic aortopathies. The pipeline is generalizable, requires no pre-processing of notes, and can be adapted to other disease domains for broader clinical impact.
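For readers unfamiliar with the F3 score, the reported metrics can be related to one another with the standard F-beta formula, which weights recall beta times as heavily as precision. The sketch below is illustrative and is not taken from the thesis: the function name `f_beta` is a hypothetical helper, and the inputs are the rounded precision and recall quoted in the abstract, so the results agree with the reported values only to three decimal places.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta = 1 gives the usual F1 score.
    (Hypothetical helper for illustration, not code from the thesis.)
    """
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded values as quoted in the abstract.
precision = 0.889
recall = 0.803

accuracy = 425 / 499                 # correctly identified patients / evaluated patients
f1 = f_beta(precision, recall, 1)    # precision and recall weighted equally
f3 = f_beta(precision, recall, 3)    # recall weighted more heavily than precision

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.852
print(f"F1       = {f1:.3f}")        # F1       = 0.844
print(f"F3       = {f3:.3f}")        # F3       = 0.811
```

Because missing a patient who should be tested is generally costlier than recommending an unnecessary test, a recall-weighted score such as F3 is a natural complement to F1 in this setting; that reading is an interpretation and is not stated in the abstract.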