Files in this item



application/pdf: 3.61_414_Ma-Doc ... ble Corpora Clustering.pdf (2MB)


Title: Documents representation for comparable corpora clustering: A preliminary study
Author(s): Ma, Shutian; Zhang, Chengzhi
Subject(s): Comparable corpora clustering; Document representation method
Abstract: With increasing globalization, digital libraries tend to provide access to multilingual documents. A large amount of text covering the same or similar topics is now available in multiple languages; such collections are known as comparable corpora. To better organize this information with clustering techniques, we previously explored three document representation methods for the task of comparable corpora clustering: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and Doc2Vec (D2V). The comparable corpora used in that earlier work were small, with sizes on the order of hundreds. In this poster, we use comparable corpora of regular size. The methods are found to perform differently as the dimensionality of the representation changes. Clustering results are examined for each representation method, and the choice of the best method for comparable corpora clustering is discussed.
Issue Date: 2017
Citation Info: Ma, S., & Zhang, C. (2017). Documents Representation for Comparable Corpora Clustering: A Preliminary Study. In iConference 2017 Proceedings (pp. 876-880).
Series/Report: iConference 2017 Proceedings
Genre: Conference Poster
Rights Information: Copyright 2017 Shutian Ma and Chengzhi Zhang
Date Available in IDEALS: 2017-07-27
