Subsampling based inference for network data

Chakrabarty, Sayan

This item's files can only be accessed by the System Administrators group.

Permalink

https://hdl.handle.net/2142/125814

Description

Title

Subsampling based inference for network data

Author(s)

Chakrabarty, Sayan

Issue Date

2024-07-11

Director of Research (if dissertation) or Advisor (if thesis)

Chen, Yuguo
Sengupta, Srijan

Doctoral Committee Chair(s)

Chen, Yuguo

Committee Member(s)

Shao, Xiaofeng
Simpson, Douglas

Department of Study

Statistics

Discipline

Statistics

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Blockmodels
Community Detection
Large Networks
Model Selection
Network Cross-validation
Network Subsampling
Random Dot Product Graph

Language

eng

Abstract

Contemporary systems often comprise interactions among numerous agents, which are typically represented as networks. Network data are widespread across disciplines such as the social sciences, biological sciences, information technology, and computer science. As technology advances rapidly, the networks arising in these fields are becoming increasingly large and complex. Analyzing them effectively poses challenges of computational feasibility and of choosing suitable analytical models. This dissertation addresses two such problems.

Large networks are becoming widespread in scientific fields, but performing statistical analysis on them is challenging because of high computation time and memory requirements. In the second chapter of this dissertation, we introduce a subsampling-based divide-and-conquer algorithm, SONNET, for detecting communities in large networks. The algorithm divides the original network into several subnetworks that share an overlapping part and applies a community detection algorithm to each subnetwork. The results from the subnetworks are combined using a label matching approach to determine the final community labels. This method significantly reduces both memory and computation costs, since it only requires processing and storing the smaller subnetworks. It is also parallelizable, which further improves its speed. The theoretical and numerical performance of the algorithm is also presented in this chapter.

Complex and extensive networks are increasingly common in scientific applications across various fields. Despite the availability of numerous network models and methodologies, cross-validation on networks remains difficult because of the unique structure of network data. In the third chapter, we propose a general subsampling-based cross-validation procedure for networks, CROISSANT. The proposed algorithm splits the original network into multiple subnetworks with a shared overlap, creating a training set comprising the subnetworks and a test set comprising the node pairs between the subnetworks. This train-test split forms the basis for a network cross-validation procedure that can be used for a broad range of model selection and parameter tuning problems for network data. The method is computationally efficient for large networks, as it uses smaller subnetworks for training. It can also be adapted to specific network model selection and parameter tuning problems, and theoretical justifications are provided. Numerical results show that the proposed algorithm accurately performs model selection and parameter tuning on various simulated and real networks from diverse models. They also indicate that the method is faster than existing network cross-validation methods.
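The abstract describes the two procedures only at a high level, so the following Python sketch is an illustration under stated assumptions and not the dissertation's implementation. It shows one way to split nodes into subnetworks with a shared overlap, to list the node pairs that fall between subnetworks (the kind of test set the abstract describes), and to align community labels from two subnetworks using the overlap nodes. The function names, the uniformly random choice of overlap nodes, and the brute-force search over label permutations are all assumptions made for this example.

```python
import itertools
import random


def overlapping_split(n_nodes, n_subnets, overlap_size, seed=0):
    """Split nodes 0..n_nodes-1 into node sets that share one overlap.

    Assumption: the overlap is chosen uniformly at random.
    """
    rng = random.Random(seed)
    nodes = list(range(n_nodes))
    rng.shuffle(nodes)
    overlap = nodes[:overlap_size]
    rest = nodes[overlap_size:]
    # Deal the non-overlap nodes round-robin into disjoint parts.
    parts = [rest[i::n_subnets] for i in range(n_subnets)]
    subnets = [sorted(overlap + part) for part in parts]
    return sorted(overlap), parts, subnets


def held_out_pairs(parts):
    """Node pairs whose endpoints lie in different non-overlap parts.

    No single subnetwork contains both endpoints of such a pair.
    """
    pairs = []
    for a, b in itertools.combinations(range(len(parts)), 2):
        pairs.extend((min(u, v), max(u, v)) for u in parts[a] for v in parts[b])
    return pairs


def match_labels(ref, other, overlap, n_comm):
    """Relabel `other` so it agrees with `ref` on the overlap nodes.

    ref, other: dicts mapping node -> label in range(n_comm).
    Brute force over permutations, so only sensible for small n_comm.
    """
    best = max(
        itertools.permutations(range(n_comm)),
        key=lambda perm: sum(perm[other[v]] == ref[v] for v in overlap),
    )
    return {v: best[lab] for v, lab in other.items()}


# Example: 20 nodes, 3 subnetworks, 5 shared overlap nodes.
overlap, parts, subnets = overlapping_split(20, 3, 5)
pairs = held_out_pairs(parts)
```

In this sketch each subnetwork contains the overlap plus one disjoint part, so a community detection routine could be run on each subnetwork independently (and in parallel) before `match_labels` reconciles the label names through the shared nodes.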

Graduation Semester

2024-08

Type of Resource

Thesis

Handle URL

https://hdl.handle.net/2142/125814

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Statistics
