Files in this item

File: ARORA-THESIS-2021.pdf (379 kB)
Description: (no description provided)
Format: PDF (application/pdf)
Description

Title: Domain-agnostic named entity recognition on unstructured text
Author(s): Arora, Jatin
Advisor(s): Han, Jiawei
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: M.S.
Genre: Thesis
Subject(s): named entity recognition
ner
information extraction
knowledge extraction
deep learning
bert
biomedical entity extraction
general domain ner
entity chunking
entity identification
word patterns
sequence labeling
question answering
span detection
phrase detection
entity typing
phrase classification
span classification
mention detection
natural language processing
nlp
text mining
data mining
machine learning
neural networks
Abstract: Named Entity Recognition (NER) is the task of extracting informative entities belonging to predefined semantic classes from raw text. These semantic classes can be general-purpose, such as person or location, or domain-specific, such as gene and protein names in biomedical text. NER has widespread applications in natural language processing (NLP) and serves as the foundation for applications such as question answering, information retrieval, and machine translation. Recently, the NER task has gained considerable traction in the research community with the advent of deep learning models like BERT, which capture textual semantics very well. In this work, we present a detailed study approaching the NER task from three different perspectives, namely, sequence labeling, question answering (QA), and span-based classification. We propose a simple span detection and classification pipeline that first detects all mention spans irrespective of entity type and then feeds each mention span as input to a model and expects an entity type as output. This setup is the reverse of a traditional QA-based NER system, where we feed the entity type as input and expect mention spans as output. We also introduce explicit pattern embeddings, which complement character embeddings to learn better word representations even with less training data. Experimental results demonstrate the effectiveness of our proposed domain-agnostic techniques on multiple datasets. We set a new state-of-the-art for BioNLP13CG and achieve competitive performance on the CoNLL 2003 and JNLPBA datasets. Additionally, we probe into the BERT model and show that mere concatenation of external feature vectors with BERT outputs may not train effectively at the low learning rates recommended for BERT; more sophisticated feature fusion is essential.
Issue Date: 2021-04-26
Type: Thesis
URI: http://hdl.handle.net/2142/110555
Rights Information: Copyright 2021 Jatin Arora
Date Available in IDEALS: 2021-09-17
Date Deposited: 2021-05
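
The abstract above describes a two-stage pipeline: type-agnostic mention detection followed by per-span entity typing. The sketch below only illustrates that control flow; the detect_spans heuristic, the classify_span gazetteer lookup, and the label set are hypothetical placeholders, not the thesis models.

# Minimal sketch of a detect-then-classify NER pipeline.
# Stage 1 finds mention spans without regard to entity type;
# stage 2 assigns a type to each detected span.
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    start: int        # token index where the span begins (inclusive)
    end: int          # token index where the span ends (exclusive)
    label: str = ""   # filled in by the second stage


def detect_spans(tokens: List[str]) -> List[Mention]:
    """Stage 1: type-agnostic mention detection (placeholder heuristic).

    A learned span detector would go here; this sketch simply treats
    runs of capitalized tokens as candidate mentions.
    """
    spans, start = [], None
    for i, tok in enumerate(tokens + [""]):   # sentinel flushes the last run
        if tok[:1].isupper():
            start = i if start is None else start
        elif start is not None:
            spans.append(Mention(start, i))
            start = None
    return spans


def classify_span(tokens: List[str], span: Mention) -> str:
    """Stage 2: assign an entity type to one detected span.

    A real implementation would feed the span (with its context) to a
    classifier such as a fine-tuned BERT; this lookup is illustrative only.
    """
    gazetteer = {"Illinois": "LOC", "Jatin": "PER"}
    return gazetteer.get(tokens[span.start], "MISC")


def ner_pipeline(tokens: List[str]) -> List[Mention]:
    """Detect first, then type each mention (the reverse of a QA-style
    setup, where the type is the input and the span is the output)."""
    mentions = detect_spans(tokens)
    for m in mentions:
        m.label = classify_span(tokens, m)
    return mentions


if __name__ == "__main__":
    sentence = "Jatin studies at the University of Illinois".split()
    for m in ner_pipeline(sentence):
        print(" ".join(sentence[m.start:m.end]), "->", m.label)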
