CrystalCoder and CrystalChat: Illuminating LLM abilities on language and code
Tao, Tianhua
Permalink
https://hdl.handle.net/2142/124537
Description
- Title
- CrystalCoder and CrystalChat: Illuminating LLM abilities on language and code
- Author(s)
- Tao, Tianhua
- Issue Date
- 2024-04-25
- Director of Research (if dissertation) or Advisor (if thesis)
- Peng, Hao
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Natural Language Processing
- Language Model
- Language
- eng
- Abstract
- Large Language Models (LLMs) specializing in code generation, often referred to as code LLMs (e.g., StarCoder and Code Llama), play increasingly critical roles in software development. Many applications, such as retrieving code snippets from natural language queries and generating usage instructions for code, require code LLMs to possess strong abilities in both code generation and natural language. The intricate interaction between acquiring language and coding skills complicates the development of strong code LLMs. Among open-source LLMs, we observe a prevalent issue: most models specialize in either language or code, not both. For example, Llama is proficient at natural language tasks but weak at code tasks, while Code Llama is the opposite. Furthermore, there is a lack of thorough prior study on LLM pretraining strategies that mix code and natural language. In this work, we propose a pretraining strategy designed to integrate natural language and coding capabilities within a single LLM. Specifically, it consists of three pretraining phases with appropriately adjusted code/language ratios. The resulting model, CrystalCoder, achieves remarkable capability in both domains, attaining natural language and coding performance comparable to that of Llama 2 and Code Llama, respectively. CrystalCoder is also more data-efficient, using 1.4 trillion tokens compared to the more than 2 trillion used by Llama 2 and Code Llama. We further fine-tuned the pretrained model on a collection of open-source datasets to produce our instruction-following model, CrystalChat. We verify our pretraining strategy by analyzing the training process and observing consistent improvements on most benchmarks. To foster research within the community, we commit to open-sourcing every detail of the pretraining, including our training datasets, code, and 136 checkpoints collected throughout training. (An illustrative sketch of such a phase-mixing schedule follows this record.)
- Graduation Semester
- 2024-05
- Type of Resource
- Text
- Handle URL
- https://hdl.handle.net/2142/124537
- Copyright and License Information
- Copyright 2024 Tianhua Tao
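The abstract describes a three-phase pretraining schedule with adjusted code/language ratios but, as a summary, gives neither the phase boundaries nor the ratios. Below is a minimal Python sketch of what such a schedule could look like; the phase names, token budgets, and mixing ratios are illustrative assumptions (only the ~1.4T-token total comes from the abstract), not the values used to train CrystalCoder.

```python
# Hypothetical sketch of a three-phase code/language data-mixing schedule.
# Phase boundaries and mixing ratios are illustrative assumptions, not the
# actual values used to train CrystalCoder; only the ~1.4T-token total is
# taken from the abstract.

import random
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    tokens_b: int        # token budget for this phase, in billions
    code_ratio: float    # fraction of sampled documents drawn from code data


# Illustrative schedule: language-heavy first, progressively more code,
# summing to 1,400B (1.4T) tokens.
SCHEDULE = [
    Phase("phase1_language", 700, 0.10),
    Phase("phase2_mixed",    500, 0.50),
    Phase("phase3_code",     200, 0.80),
]


def sample_source(phase: Phase, rng: random.Random) -> str:
    """Pick the data source for the next document under the phase's ratio."""
    return "code" if rng.random() < phase.code_ratio else "language"


if __name__ == "__main__":
    rng = random.Random(0)
    for phase in SCHEDULE:
        draws = [sample_source(phase, rng) for _ in range(10_000)]
        frac = draws.count("code") / len(draws)
        print(f"{phase.name}: {phase.tokens_b}B tokens, "
              f"empirical code fraction ≈ {frac:.2f}")
```

Running the sketch prints the empirical code fraction per phase, which converges to each phase's configured ratio; in a real pretraining pipeline this per-document sampling decision would instead steer which shards feed the data loader during each phase.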
Owning Collections
- Graduate Dissertations and Theses at Illinois (Primary)
- Dissertations and Theses - Computer Science
- Dissertations and Theses from the Siebel School of Computer Science