Files in this item



XU-THESIS-2019.pdf (application/pdf, 2MB)


Title: NLP driven large scale financial data analysis
Author(s): Xu, Zhe
Advisor(s): Li, Bo
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Abstract: Stock market volatility is influenced by several factors, including but not limited to information release, dissemination, and public acceptance. For information, traders read financial reports released by companies, 8-K reports, abundant but informal social media content such as tweets, and financial or market-related reports from professional news agencies such as Reuters, Bloomberg, and Yahoo Finance. Among these sources, financial news is relatively informative and abundant, so I take it as my source of data. Natural language processing (NLP), a classical yet still thriving research domain, has seen astonishing progress with the development of deep learning and is widely used for text-related tasks such as sentiment analysis and recommendation systems. However, owing to several practical issues, there is currently no widely acknowledged framework for using NLP techniques to mine finance- or marketing-related news articles for valuable information about the stock market or stock prices. My contributions are as follows. I analyze the influence of several factors across my whole pipeline for NLP-based stock trend prediction. In particular, I crawl and preprocess online data to build a new dataset and compare it with a Reuters news dataset. I compare the underlying word embeddings, constructed from different corpora and preprocessing techniques and later used as input for my prediction models, in terms of prediction accuracy as well as the corresponding models' robustness against adversarial attacks. I conduct adversarial analysis and provide a hypothesis that might be useful for further developing NLP-based frameworks to mine information for stock market related tasks such as stock trend prediction.
My hypothesis from the adversarial analysis is that the current model, despite its success on texts rich in sentiment, is vulnerable to adversarial attacks and not stable enough with low-frequency terms such as domain-specific multi-word phrases, and with numbers whose semantic meaning should not be represented by numbers. These two types of tokens should be given attention if a more robust model is desired. My experiments suggest that raw data quality and the underlying embeddings used for NLP models are of vital importance to prediction performance, and that although current state-of-the-art NLP models are capable of learning the sentiment-wise semantic meaning of words, they might not be properly structured to utilize more complex and heterogeneous information such as financial or marketing news.
Issue Date: 2019-04-25
Rights Information: Copyright 2019 Zhe Xu
Date Available in IDEALS: 2019-08-23
Date Deposited: 2019-05
