Files in this item



application/pdfBHATTI-DISSERTATION-2018.pdf (1MB)
(no description provided)PDF


Title:Scalable centralized and distributed spectral clustering
Author(s):Bhatti, Shahzad Fazal
Director of Research:Beck, Carolyn L.
Doctoral Committee Chair(s):Beck, Carolyn L.
Doctoral Committee Member(s):Srikant, Rayadurgam; Salapaka, Srinivasa M.; Marla, Lavanya
Department / Program:Industrial&Enterprise Sys Eng
Discipline:Industrial Engineering
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):Clustering algorithms
community detection
graph partitioning
random walk
distributed algorithms
Abstract:Spectral clustering approaches have led to well-accepted algorithms for finding accurate clusters in a given dataset. However, their application to large-scale datasets has been hindered by the computational complexity of eigenvalue decompositions. Several algorithms have been proposed in the recent past to accelerate spectral clustering, however, they compromise on the accuracy of the spectral clustering to achieve faster speed. In this paper, we propose a novel spectral clustering algorithm based on a mixing process on a graph. Unlike the existing spectral clustering algorithms, our algorithm does not require computing eigenvectors. Specifically, it finds the equivalent of a linear combination of eigenvectors of the normalized similarity matrix weighted with corresponding eigenvalues. This linear combination is then used to partition the dataset into meaningful clusters. Simulations on real datasets show that partitioning datasets based on such linear combinations of eigenvectors achieve better accuracy than standard spectral clustering methods as the number of clusters increase. Our algorithm can easily be implemented for parallel processing. In the past few years, the size of a typical dataset has grown exponentially making it impossible to the store the data in a single system. Thus distributed systems are employed to store the data. Most of the clustering algorithms are tailored towards data stored in a centralized system which makes them inappropriate for the distributed system. Moreover, the large scale of the data prohibits us from moving it to a central location to use a centralized algorithm. Our approach to distributed spectral clustering works in two phases. In phase 1, individual machines generate a set of representative points of the local data and communicate it to a central machine. In phase 2, the central machine performs spectral clustering on the data and communicates the cluster assignment of the representative points to the corresponding nodes. We have explored various algorithms to generate the representative points and compare their trade-offs and accuracy. Our algorithm can easily be cast in the MapReduce framework.
Issue Date:2018-06-26
Rights Information:Copyright 2018 Shahzad Bhatti
Date Available in IDEALS:2018-09-27
Date Deposited:2018-08

This item appears in the following Collection(s)

Item Statistics