Files in this item

Files:PAUL-DISSERTATION-2017.pdf (14MB)
Description:(no description provided)
Format:PDF (application/pdf)

Description

Title:Consistent community detection in uni-layer and multi-layer networks
Author(s):Paul, Subhadeep
Director of Research:Chen, Yuguo
Doctoral Committee Chair(s):Chen, Yuguo
Doctoral Committee Member(s):Chen, Xiaohui; Hajek, Bruce; Portnoy, Stephen; Simpson, Douglas
Department / Program:Statistics
Discipline:Statistics
Degree Granting Institution:University of Illinois at Urbana-Champaign
Degree:Ph.D.
Genre:Dissertation
Subject(s):Community detection
Consistency
Co-regularization
Invariant subspaces
Minimax rates
Multi-layer networks
Multi-layer null models
Multi-layer modularity
Multi-layer stochastic block model
Network analysis
Non-negative matrix factorization
Neuroimaging
Random effects stochastic block model
Stochastic block model
Spectral clustering
Variational EM
Abstract:Over the last two decades, we have witnessed a massive explosion of our data collection abilities and the birth of a "big data" age. This has led to an enormous interest in statistical inference for a new type of complex data structure, a graph or network. The surge in interdisciplinary interest in the statistical analysis of network data has been driven by applications in neuroscience, genetics, social sciences, computer science, economics and marketing. A network consists of a set of nodes or vertices, representing a set of entities, and a set of edges, representing the relations or interactions among the entities. Networks are flexible frameworks that can model many complex systems. In the majority of the network examples dealt with in the literature, the relations between nodes are assumed to be of the same type, such as web page linkage, friendship, co-authorship or protein-protein interaction. However, the complex networks in many modern applications are often multi-layered in the sense that they consist of multiple types of edges/relations among a group of entities. Each of those different types of relations can be viewed as creating its own network, called a layer of the multi-layer network. Multi-layer networks are a more accurate representation of many complex systems since many entities in those systems are involved simultaneously in multiple interactions. In this dissertation we view multi-layer networks in the broad sense that includes multiple types of relations as well as multiple information sources on the same set of nodes (e.g., multiple trials or multiple subjects). The problem of detecting communities or clusters of nodes in a network has received considerable attention in the literature. As with uni-layer networks, community detection is an important task in multi-layer networks.
This dissertation aims to develop new methods and theory for community detection in both uni-layer and multi-layer networks that can be used to answer scientific questions from experimental data. For community detection in uni-layer and multi-layer graphs, we take three approaches: (1) based on statistical random graph models, (2) based on maximizing quality functions, e.g., the modularity score, and (3) based on spectral and matrix factorization methods. In Chapter 2 we consider two random graph models for community detection in multi-layer networks, the multi-layer stochastic block model (MLSBM) and a model with a restricted parameter space, the restricted multi-layer stochastic block model (RMLSBM). We derive consistency results for community assignments of the maximum likelihood estimators (MLEs) in both models where MLSBM is assumed to be the true model, and either the number of nodes or the number of types of edges or both grow. We compare the MLEs in the two models with each other and with other baseline approaches, both theoretically and through simulations. We also derive minimax error rates and thresholds for achieving consistency of community detection in MLSBM, which are then used to show the advantage of the multi-layer model over a traditional alternative, the aggregate stochastic block model. In simulations RMLSBM is shown to have an advantage over MLSBM when either the growth rate of the number of communities is high or the growth rate of the average degree of the component graphs in the multi-graph is low. A popular method of community detection in uni-layer networks is maximization of a partition quality function called modularity. In Chapter 3 we introduce several multi-layer network modularity measures based on different random graph null models, motivated by empirical observations from diverse fields of application.
In particular, we derive different modularities by defining the multi-layer configuration model, the multi-layer expected degree model and their various modifications as null models for multi-layer networks. These measures are then optimized to detect the optimal community assignment of nodes. We apply the methods to five real multi-layer networks: three social networks from the website Twitter, a complete neuronal network of the nematode C. elegans, and a classroom friendship network of 7th-grade students. In Chapter 4 we present a method based on the orthogonal symmetric non-negative matrix tri-factorization of the normalized Laplacian matrix for community detection in complex networks. While an exact factorization of a given order may not exist and is NP-hard to compute, we obtain an approximate factorization by solving an optimization problem. We establish the connection of the factors obtained through the factorization to a non-negative basis of an invariant subspace of the estimated matrix, drawing a parallel with spectral clustering. Using such a factorization for clustering in networks is motivated by analyzing a block-diagonal Laplacian matrix with the blocks representing the connected components of a graph. The method is shown to be consistent for community detection in graphs generated from the stochastic block model and the degree-corrected stochastic block model. Simulation results and real data analysis show the effectiveness of these methods under a wide variety of situations, including sparse and highly heterogeneous graphs where the usual spectral clustering is known to fail. Our method also performs better than the state of the art on popular benchmark network datasets, e.g., the political web blogs and the karate club data. In Chapter 5 we once again consider the problem of estimating a consensus community structure by combining information from multiple layers of a multi-layer network or multiple snapshots of a time-varying network.
Numerous methods based on spectral clustering or low-rank matrix factorization have been proposed in the literature over the past decade for the more general problem of multi-view clustering. As a general theme, these "intermediate fusion" methods involve obtaining a low column rank matrix by optimizing an objective function and then using the columns of the matrix for clustering. Such methods can be adapted for community detection in multi-layer networks with minimal modifications. However, the theoretical properties of these methods remain largely unexplored, and most authors have relied on performance in synthetic and real data to assess the goodness of the procedures. In the absence of statistical guarantees on the objective functions, it is difficult to determine if the algorithms optimizing the objective will return a good community structure. We apply some of these methods for consensus community detection in multi-layer networks and investigate the consistency properties of the global optimizer of the objective functions under the multi-layer stochastic block model. We derive several new asymptotic results showing consistency of the intermediate fusion techniques, along with the spectral clustering of the mean adjacency matrix, under a high-dimensional setup where both the number of nodes and the number of layers of the multi-layer graph grow. We complement the asymptotic analysis with a thorough numerical study to compare the finite sample performance of the methods. Motivated by multi-subject and multi-trial experiments in neuroimaging studies, in Chapter 6 we develop a modeling framework for joint community detection in a group of related networks. The proposed model, which we call the random effects stochastic block model, facilitates the study of group differences and subject-specific variations in the community structure.
In contrast to the previously proposed multi-layer stochastic block models, our model allows community memberships of nodes to vary in each component network or layer with a transition probability matrix, thus modeling the variation in community structure across a group of subjects or trials. We propose two approaches to estimating the parameters of the model: a variational-EM algorithm, and two non-parametric "two-step" methods based on spectral clustering and matrix factorization, respectively. We also develop several hypothesis tests, with p-values obtained through resampling (permutation test), for differences in community structure between two groups of subjects at both the whole-network level and the node level. The methodology is applied to publicly available fMRI datasets from multi-subject experiments involving schizophrenia patients along with healthy controls. Our methods reveal an overall putative community structure representative of the groups as well as subject-specific variations within each group. Using our network-level hypothesis tests we are able to ascertain a statistically significant difference in community structure between the two groups, while our node-level tests help determine the nodes that are driving the difference.
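The resampling-based p-values mentioned in the abstract can be illustrated with a generic two-sample permutation test. This is only a minimal sketch and not the dissertation's procedure: the function name `permutation_p_value`, the difference-in-means test statistic, and the toy per-subject scores are all assumptions made for illustration, since the abstract does not specify the test statistics used.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Hypothetical helper: the abstract does not give the actual test
    statistics, so an absolute difference in means stands in here.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_diff(pooled)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel subjects at random under the null
        if mean_diff(pooled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Made-up per-subject summary scores for two groups (illustration only).
patients = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69]
controls = [0.41, 0.45, 0.38, 0.50, 0.43, 0.47]
p_value = permutation_p_value(patients, controls)
```

Shuffling the pooled scores mimics the null hypothesis that group labels are exchangeable; a small `p_value` then suggests the observed group difference is unlikely under that hypothesis. A node-level version would repeat the same idea with one statistic per node, which would also call for a multiple-testing adjustment.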
Issue Date:2017-06-30
Type:Thesis
URI:http://hdl.handle.net/2142/98136
Rights Information:Copyright 2017 Subhadeep Paul
Date Available in IDEALS:2017-09-29
Date Deposited:2017-08

