Files in this item

FilesDescriptionFormat

application/pdf

application/pdfZHONG-THESIS-2019.pdf (415kB)Restricted to U of Illinois
(no description provided)PDF

Description

Title:Generating regular expressions from natural language specifications: A semantics-based approach and an empirical study
Author(s):Zhong, Zexuan
Advisor(s):Xie, Tao
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Degree:M.S.
Genre:Thesis
Subject(s):Regular Expression
Natural Language Processing
Abstract:Translating natural language descriptions into executable programs is a fundamental problem for computational linguistics. Recent research proposes neural-network-based approaches to address the problem. These approaches typically train a sequence-to-sequence learning model using a syntax-based objective: maximum likelihood estimation (MLE). Such syntax-based approaches do not effectively address the goal of generating semantically correct programs, because these approaches fail to handle Program Aliasing, i.e., semantically equivalent programs may have many syntactically different forms. In this thesis, we focus on generating regular expressions from natural language, an important task of the program-synthesis problem. In particular, we study the task in two aspects. First, we address the issue of Program Aliasing, and propose a semantics-based approach named SemRegex. Different from the existing syntax-based approaches, SemRegex trains the model by maximizing the expected semantic correctness of the generated regular expressions. The semantic correctness is measured using the DFA-equivalence oracle, random test cases, and distinguishing test cases. The experiments on three public datasets demonstrate the superiority of SemRegex over the existing state-of-the-art approaches. Second, given that existing approaches use only synthetic data in both training datasets and validation/test datasets, we raise a question: are these approaches effective to address various real-world situations? To explore this question, we conduct a characteristic study on comparing two synthetic datasets used by the recent research and a real-world dataset collected from the Internet, and conduct an experimental study on applying an existing state-of-the-art approach on the real-world dataset. Our study results suggest the existence of distinct characteristics between the synthetic datasets and the real-world dataset, and the existing state-of-the-art approach achieves extremely low effectiveness when evaluated on real-world data. We also provide initial analysis on some of those challenging cases and discuss future directions.
Issue Date:2019-04-26
Type:Text
URI:http://hdl.handle.net/2142/105101
Rights Information:Copyright 2019 Zexuan Zhong
Date Available in IDEALS:2019-08-23
Date Deposited:2019-05


This item appears in the following Collection(s)

Item Statistics