Scalable foundation models
Chen, Yangyi
Permalink
https://hdl.handle.net/2142/132558
Description
- Title
- Scalable foundation models
- Author(s)
- Chen, Yangyi
- Issue Date
- 2025-12-03
- Director of Research (if dissertation) or Advisor (if thesis)
- Ji, Heng
- Doctoral Committee Chair(s)
- Ji, Heng
- Committee Member(s)
- Zhang, Tong
- Peng, Hao
- Yang, Zhengyuan
- Ping, Wei
- Department of Study
- Siebel School of Computing and Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- foundation models, scalable AI, multimodal
- Abstract
- The continual growth in computational resources and human annotators, driven by advances in hardware architecture and the increasing accessibility of crowdsourcing platforms, has created unprecedented opportunities for artificial intelligence (AI) models. However, without scalable AI solutions (e.g., training algorithms, model architectures), much of this additional compute and annotation may be underutilized or yield diminishing returns. Furthermore, as real-world applications demand increasingly sophisticated AI capabilities, scalable models offer a clear path to achieving higher levels of intelligence by taking full advantage of available computational and annotation resources. This makes scalability not just a technical consideration, but a fundamental requirement for advancing the field of AI in parallel with hardware developments. This dissertation investigates the fundamental trajectory toward scalable foundation models through three successive research milestones. (1) Predictable Scaling: We examine the scaling laws that govern the development of foundation models, analyzing how models' capabilities correlate with computational resources. Our research establishes principled solutions for forecasting model behaviors and resource requirements across different scales, enabling scientific and reliable scaling of AI models. (2) Scalable Modeling: We explore model architectures and training recipes optimized for multimodal learning, demonstrating how these approaches can effectively utilize increasing computational resources and data to achieve continuous performance improvements. Our findings reveal architectural principles and training strategies that maintain efficiency at scale while avoiding common bottlenecks in previous modeling strategies. (3) Scalable Oversight: We study scalable post-training approaches that enable continuous model improvement and alignment with human values even as model capabilities expand beyond human expertise.
This research introduces novel supervision techniques that scale in parallel with model complexity and capability, ensuring the responsible advancement of AI models. In Chapter 1, we describe the key research problems and preview the featured research presented in the following chapters.
Predictable Scaling. In Chapter 2, we study how to estimate the actual capabilities (i.e., downstream performance) of large language models (LLMs) by addressing the challenges posed by LLMs' emergent abilities. We focus on the pre-training loss as a more computation-efficient metric for performance estimation. We present FLP, a two-stage approach to performance prediction: first, we estimate a function that maps computational resources (e.g., FLOPs) to the pre-training Loss using a series of sampling models; then, we map the pre-training loss to downstream task Performance after the critical "emergent phase".
Scalable Modeling. In Chapter 3, we present a scalable code-guided visual representation learning method and a single-transformer architecture for scalable vision-language modeling. A single unified Transformer architecture can effectively address the scalability concerns of previous large vision-language models (LVLMs); however, its limited adoption in modern contexts likely stems from the absence of reliable training recipes that balance both modalities and ensure stable training for billion-scale models. We introduce the first open-source training recipe for developing unified LVLMs using moderate academic resources (8× A100 80GB GPUs). In addition, we revisit the next-token prediction loss in vision-language pre-training and argue that it can be a false proxy for the actual capabilities of LVLMs. We propose a new algorithm, ViStruct, to scale up vision-language pre-training. The results show that ViStruct scales better with more data and compute.
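The two-stage FLP pipeline described above can be sketched as follows. This is a minimal illustration, not the dissertation's actual implementation: the measurements, the power-law form for stage one, and the linear loss-to-performance map for stage two are all assumptions made for the example.

```python
import numpy as np

# Stage 1: hypothetical (FLOPs, pre-training loss) pairs from small sampling models.
flops = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([3.2, 2.9, 2.6, 2.3])

# Fit an assumed power law, loss ~ A * FLOPs^alpha, as a line in log-log space.
alpha, log_A = np.polyfit(np.log(flops), np.log(loss), 1)  # alpha < 0 here
predict_loss = lambda C: np.exp(log_A) * C ** alpha

# Stage 2: hypothetical (loss, downstream accuracy) pairs collected after the
# "emergent phase", where performance varies smoothly with loss.
loss_pts = np.array([2.9, 2.6, 2.3])
acc_pts = np.array([0.35, 0.48, 0.62])

# Fit an assumed linear map from pre-training loss to task performance.
slope, intercept = np.polyfit(loss_pts, acc_pts, 1)
predict_acc = lambda L: slope * L + intercept

# Chain the two stages: forecast downstream accuracy at a larger compute budget.
target_flops = 1e22
est_loss = predict_loss(target_flops)
est_acc = predict_acc(est_loss)
```

The key design point is the intermediate loss variable: compute-to-loss extrapolates smoothly even when compute-to-performance does not, so chaining the two fits sidesteps the discontinuity that emergent abilities introduce.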
Scalable Oversight. In Chapter 4, we investigate a novel approach to AI supervision through learning from AI feedback. We introduce a scalable alignment framework that harnesses the strong capabilities of large language models (LLMs) to guide the development of LVLMs. Our framework advances beyond conventional numerical reward signals by leveraging natural language feedback as the primary mechanism for model optimization and refinement. This methodology enables the systematic refinement of model responses, promoting helpfulness, truthfulness, and safety while also enhancing the models' capacity for sustained multi-turn interactions. Our approach demonstrates how advanced LLMs can serve as effective supervisors in the training pipeline, offering a scalable solution to the challenge of model alignment. In addition, we use AI feedback to supervise the reasoning consistency of LVLMs. On our curated benchmark, which targets the chain-of-thought (CoT) reasoning performance and consistency of LVLMs, the results show that supervising the reasoning process yields better reasoning capabilities in LVLMs.
- Graduation Semester
- 2025-12
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/132558
- Copyright and License Information
- Copyright 2025 Yangyi Chen
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)