AI4Scientist: Accelerating and democratizing scientific research lifecycle
Wang, Qingyun
Permalink
https://hdl.handle.net/2142/129195
Description
- Title
- AI4Scientist: Accelerating and democratizing scientific research lifecycle
- Author(s)
- Wang, Qingyun
- Issue Date
- 2025-04-15
- Director of Research (if dissertation) or Advisor (if thesis)
- Ji, Heng
- Doctoral Committee Chair(s)
- Ji, Heng
- Committee Member(s)
- Han, Jiawei
- Hakkani-Tur, Dilek
- Zhao, Han
- Neubig, Graham
- Hope, Tom
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Natural language generation
- Scientific information extraction
- Scientific knowledge reasoning
- Scientific knowledge dissemination
- AI4Scientist
- Abstract
- Millions of scientific papers are published annually, resulting in information overload. Beyond this, some scientific papers, known as "sleeping beauties," remain largely unnoticed for long periods before suddenly attracting great attention. Moreover, the process of discovering new scientific hypotheses has remained slow, expensive, and highly specialist-dependent because experiments are increasingly complex. There is a pressing need to help scientists digest and evaluate relevant papers and thereby facilitate scientific discovery. However, computer-human collaboration in scientific hypothesis discovery is still exploratory and lacks a unified framework for analyzing the relevant tasks. This thesis tackles the problem of automating scientific literature understanding and scientific discovery by proposing AI4Scientist. Unlike AI4Science, which creates AI-driven solutions for scientific challenges, AI4Scientist focuses on empowering scientists with AI tools that enhance their research lifecycle. Recent advancements in large language models (LLMs) raise the prospect that they may be able to solve these problems. Despite their impressive progress, LLMs often fail to incorporate domain-specific knowledge effectively and to support their generations with sufficient evidence. Additionally, expert-curated databases represent only a small fraction of the knowledge in an entire domain because annotation is costly. To address these issues and lower the entry barrier for interdisciplinary collaboration, we develop AI tools to accelerate the entire research lifecycle for scientists, from knowledge acquisition, hypothesis generation, multimedia procedure planning for experiment design, and experiment execution to writing and evaluating the paper draft. We divide the scientific research lifecycle into three main components.
Few-shot Scientific Knowledge Acquisition: Scientific knowledge acquisition is the foundation of many downstream tasks.
During the COVID-19 pandemic, we proposed COVID-KG to extract fine-grained multimedia knowledge graphs (KGs) from scientific papers for drug repurposing reports. However, fine-grained information extraction systems usually require large amounts of expert-annotated examples to perform effectively. Methods that can learn from only a few examples, known as few-shot approaches, are therefore vital in scientific knowledge acquisition. To address this limitation, we propose to utilize the knowledge consistency between input text and output knowledge elements for few-shot fine-grained scientific entity extraction.
Integrating Domain Knowledge with Scientific LLM Reasoning: Based on the scientific KGs from the previous component, we investigate scientific hypothesis generation, as well as experiment planning and execution. Simulating the human research process, we propose to augment LLMs with external heterogeneous KGs from previous papers as "inspirations" to generate novel scientific hypotheses and to iteratively boost the novelty of the generated hypotheses. Furthermore, extending the hypothesis generation approach to the biochemical domain, we retrieve relevant contextual information from multiple databases and propose a new benchmark for generating enzyme sequences from given chemical reactions. Finally, to incorporate external knowledge from both vision and text, we introduce a new multimedia procedure learning framework that produces visually trackable, inductive, and diverse task scripts.
Explainable Scientific Knowledge Dissemination: Communicating a new idea to readers clearly and faithfully requires evaluating the paper's quality, which is crucial to preventing distorted scientific dissemination. Therefore, we build an explainable paper review generation system that generates explainable review scores and comments, along with detailed evidence, based on KGs and papers.
Beyond the paper text, tables present complex information concisely and clearly, facilitating comparisons and enhancing readability. To improve scientific table reasoning, we propose to decompose complex table reasoning tasks into a high-granularity atomic skill set. We further propose an incremental training procedure that ensures accurate information alignment during reasoning rather than indiscriminate connection to all available contexts. This work on AI4Scientist aims to open doors to an autonomous scientific research lifecycle by equipping machines with both structured and unstructured knowledge from previous literature and enabling reasoning across different knowledge modalities. By automating the scientific research lifecycle, AI4Scientist can accelerate scientific discovery and reduce the time and resources required for research.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129195
- Copyright and License Information
- Copyright 2025 Qingyun Wang
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Graduate Theses and Dissertations at Illinois