Files in this item

ZHANG-THESIS-2019.pdf (application/pdf, 4MB)

Title:Classifying GitHub repositories with minimal human efforts
Author(s):Zhang, Yu
Advisor(s):Han, Jiawei
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):Weak Supervision
Abstract:GitHub is a great platform for sharing software code, data, and other resources. To improve search and analysis of the vast spectrum of resources on GitHub, it is necessary to conduct automatic, flexible, and user-guided classification of GitHub repositories. In this paper, we study how to build a customized repository classifier with minimal human annotation. Previous document classification methods cannot be directly applied to our task due to three unique challenges: (1) multi-modal signals: besides text, signals in other formats need to be explored to uncover the topic of a repository; (2) low data quality: GitHub README files, which usually contain code and commands, are noisier than typical text data such as scientific papers and news; and (3) limited ground truth: users cannot afford to label many repositories for training a good classifier. To deal with these challenges, we propose GitClass, a framework to classify GitHub repositories under weak supervision. Three key modules (heterogeneous network construction and embedding, keyword extraction and topic modeling, and pseudo document generation) are used to tackle the three challenges, respectively. We conduct extensive experiments on three large-scale GitHub repository datasets and observe an evident performance boost over state-of-the-art embedding and classification algorithms.
Issue Date:2019-04-26
Rights Information:Copyright 2019 Yu Zhang
Date Available in IDEALS:2019-08-23
Date Deposited:2019-05
