Files in this item

File: HALL-THESIS-2017.pdf (3MB)
File description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Examination of machine learning methods for multi-label classification of intellectual property documents
Author(s): Hall, John William
Advisor(s): Shih, Chilin
Department / Program: Linguistics
Discipline: Linguistics
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: M.A.
Genre: Thesis
Subject(s): Machine learning
Multi-label classification
Multi-class classification
Patent classification
Document classification
Abstract: This thesis explores the performance of a variety of machine learning techniques on the task of multi-label document classification, applied to a corpus of United States patent grants. The rapid rise in the number of patent applications over the past several decades has created a growing need for better automatic patent processing tools, and automated document classification in particular has been targeted as an important point of research. However, the development of adequate tools has been limited in part by the esoteric writing style particular to intellectual property and by the overlapping categorizations of the branched hierarchical Cooperative Patent Classification (CPC) system. A patent document corpus offers a large, publicly available training set consisting of both structured and unstructured data, and applying machine learning techniques to this corpus may help relieve the increasing need for highly trained human classifiers. The contributions of the present work are twofold. First, it constructs a patent document corpus by gathering 4500 patent documents from 2014 and 2015 and compiling the structured and textual data relevant to an automated classification task. Second, it examines five machine learning techniques as automated classifiers of patent documents by section. Trials under different preprocessing conditions, using principal component analysis and word selection, were run in training the supervised learning classifiers. Principal component analysis of the patent documents without further feature selection yielded the greatest performance for all machine learning models. This approach also revealed an effect of dataset size: increasing the size of the training set increased the overall performance of the Decision Tree, Support Vector Machine, Logistic Regression, and Neural Net models. By contrast, some classifiers trained on data not subject to principal component analysis showed decreasing performance metrics with increasing data sizes.
Issue Date: 2017-04-24
Type: Text
URI: http://hdl.handle.net/2142/97430
Rights Information: Copyright 2017 John Hall
Date Available in IDEALS: 2017-08-10
Date Deposited: 2017-05

