Files in this item
Files  Description  Format 

application/pdf PANTHESIS2019.pdf (645kB)  (no description provided) 
Description
Title:  Query Kmeans clustering for crowdsourcing 
Author(s):  Pan, Chao 
Advisor(s):  Milenkovic, Olgica 
Department / Program:  Electrical & Computer Eng 
Discipline:  Electrical & Computer Engr 
Degree Granting Institution:  University of Illinois at UrbanaChampaign 
Degree:  M.S. 
Genre:  Thesis 
Subject(s):  Kmeans clustering
active learning semisupervised learning coupon collector's problem crowdsourcing 
Abstract:  This thesis focuses on solving the $K$means clustering problem approximately with side information provided by crowdsourcing. Both binary samecluster oracle and general crowdsourcing framework are considered. It can be shown that, under some mild assumptions on the smallest cluster size, one can obtain a $(1+\epsilon)$approximation for the optimal potential with probability at least $1\delta$, where $\epsilon>0$ and $\delta\in(0,1)$, using an expected number of $O(\frac{K^3}{\epsilon \delta})$ noiseless samecluster queries and comparisonbased clustering of complexity $O(ndK + \frac{K^3}{\epsilon \delta})$; here, $n$ denotes the number of points and $d$ the dimension of space. Compared to a handful of other known approaches that perform importance sampling to account for small cluster sizes, the proposed query technique reduces the number of queries by a factor of roughly $O(\frac{K^6}{\epsilon^3})$, at the cost of possibly missing very small clusters. This setting is extended to the case where some queries to the oracle produce erroneous information, and where certain points, termed outliers, do not belong to any clusters. Incorporating stateoftheart results in crowdsourcing can further improve the performance of the algorithm. Note that the proof techniques used in this thesis differ from previous methods used for $K$means clustering analysis, as they rely on estimating the sizes of the clusters and the number of points needed for accurate centroid estimation and subsequent nontrivial generalizations of the double Dixie cup problem. The performances of proposed algorithms are illustrated on both synthetic and real datasets, including MNIST and CIFAR $10$. 
Issue Date:  20191202 
Type:  Text 
URI:  http://hdl.handle.net/2142/106229 
Rights Information:  Copyright 2019 Chao Pan 
Date Available in IDEALS:  20200302 
Date Deposited:  201912 
This item appears in the following Collection(s)

Dissertations and Theses  Electrical and Computer Engineering
Dissertations and Theses in Electrical and Computer Engineering 
Graduate Dissertations and Theses at Illinois
Graduate Theses and Dissertations at Illinois