CrystalCoder and CrystalChat: Illuminating LLM abilities on language and code
Tao, Tianhua
Permalink
https://hdl.handle.net/2142/124537
Description
- Title
- CrystalCoder and CrystalChat: Illuminating LLM abilities on language and code
- Author(s)
- Tao, Tianhua
- Issue Date
- 2024-04-25
- Director of Research (if dissertation) or Advisor (if thesis)
- Peng, Hao
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Natural Language Processing
- Language Model
- Language
- eng
- Abstract
- Large Language Models (LLMs) specializing in code generation, often referred to as code LLMs (e.g., StarCoder and Code Llama), play increasingly critical roles in software development. Many applications, such as retrieving code snippets from natural language queries and generating usage instructions for code, require code LLMs to possess strong abilities in both code generation and natural language. The intricate interaction between acquiring language and coding skills complicates the development of strong code LLMs. Among open-source LLMs, we observe a prevalent issue: most models specialize in either language or code, not both. For example, Llama is proficient at natural language tasks but weak at code tasks, while Code Llama is the opposite. Furthermore, there is a lack of thorough prior study on LLM pretraining strategies that mix code and natural language. In this work, we propose a pretraining strategy designed to integrate natural language and coding capabilities within a single LLM. Specifically, it consists of three pretraining phases with appropriately adjusted code/language ratios. The resulting model, CrystalCoder, achieves remarkable capability in both domains, attaining natural language and coding performance comparable to that of Llama 2 and Code Llama, respectively. CrystalCoder is also more data-efficient, using 1.4 trillion tokens compared to the more than 2 trillion used by Llama 2 and Code Llama. We further fine-tuned the pretrained model on a collection of open-source datasets to produce our instruction-following model, CrystalChat. We verify our pretraining strategy by analyzing the training process and observing consistent improvements on most benchmarks. To foster research within the community, we commit to open-sourcing every detail of the pretraining, including our training datasets, code, and 136 checkpoints collected throughout training. (An illustrative sketch of such a phase-mixing schedule follows this record.)
- Graduation Semester
- 2024-05
- Type of Resource
- Text
- Handle URL
- https://hdl.handle.net/2142/124537
- Copyright and License Information
- Copyright 2024 Tianhua Tao
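The abstract describes a three-phase pretraining schedule with adjusted code/language ratios but, as a summary, gives neither the phase boundaries nor the ratios. Below is a minimal Python sketch of what such a schedule could look like; the phase names, token budgets, and mixing ratios are illustrative assumptions (only the ~1.4T-token total comes from the abstract), not the values used to train CrystalCoder.

```python
# Hypothetical sketch of a three-phase code/language data-mixing schedule.
# Phase boundaries and mixing ratios are illustrative assumptions, not the
# actual values used to train CrystalCoder; only the ~1.4T-token total is
# taken from the abstract.

import random
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    tokens_b: int        # token budget for this phase, in billions
    code_ratio: float    # fraction of sampled documents drawn from code data


# Illustrative schedule: language-heavy first, progressively more code,
# summing to 1,400B (1.4T) tokens.
SCHEDULE = [
    Phase("phase1_language", 700, 0.10),
    Phase("phase2_mixed",    500, 0.50),
    Phase("phase3_code",     200, 0.80),
]


def sample_source(phase: Phase, rng: random.Random) -> str:
    """Pick the data source for the next document under the phase's ratio."""
    return "code" if rng.random() < phase.code_ratio else "language"


if __name__ == "__main__":
    rng = random.Random(0)
    for phase in SCHEDULE:
        draws = [sample_source(phase, rng) for _ in range(10_000)]
        frac = draws.count("code") / len(draws)
        print(f"{phase.name}: {phase.tokens_b}B tokens, "
              f"empirical code fraction ≈ {frac:.2f}")
```

Running the sketch prints the empirical code fraction per phase, which converges to each phase's configured ratio; in a real pretraining pipeline this per-document sampling decision would instead steer which shards feed the data loader during each phase.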
Owning Collections
- Graduate Dissertations and Theses at Illinois (Primary)
- Dissertations and Theses - Computer Science
- Dissertations and Theses from the Siebel School of Computer Science