Dept. of Statistics
http://hdl.handle.net/2142/17361
Thu, 14 Dec 2017 14:50:16 GMT
http://hdl.handle.net/2142/97696
Methods and applications for space-time data
Shand, Lyndsay Elizabeth
Spatial and spatio-temporal data come in a variety of forms and require a unique set of techniques to analyze. The goal of such analyses is often to estimate the spatial and/or temporal dependency structure of the underlying random field, which in turn can be used to make inference about the underlying random process. Recurring challenges with spatial data include the lack of multiple realizations of the process, i.e. a lack of replicates, and the estimation of dependency structures despite this limitation. In this work, I contribute to solving this problem for both geostatistical and areal data, using likelihood and Bayesian methods respectively.
A nonstationary spatio-temporal model is proposed which applies the concept of the dimension expansion method in Bornn et al. (2012). The estimation of this model is investigated and simulations are conducted for both separable and nonseparable space-time covariance models. The model is also illustrated with wind speed and streamflow datasets. Both simulation and data analyses show that modeling nonstationarity in both space and time can improve the predictive performance over stationary covariance models or models that are nonstationary in space but stationary in time.
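The dimension-expansion idea can be sketched in a few lines: learn latent extra coordinates so that a stationary variogram fitted in the expanded space matches the empirical one. Below is a minimal illustration under an assumed exponential variogram with synthetic stations; the penalty weight and variogram range are made-up values, and this is not the thesis's space-time model.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)

# Synthetic setup: 12 stations on a line, but the true process lives on a
# folded surface, so 1-D distances understate the true separations
n = 12
s = np.linspace(0.0, 1.0, n)[:, None]     # observed coordinates
z_true = np.abs(s - 0.5)                  # hidden extra coordinate (a fold)
gamma_hat = 1.0 - np.exp(-3.0 * squareform(pdist(np.hstack([s, z_true]))))

def loss(z_flat, lam=0.05):
    # Match a stationary exponential variogram in the expanded space [s, z]
    # to the "empirical" one, with an L1 penalty on the learned coordinate
    d = squareform(pdist(np.hstack([s, z_flat[:, None]])))
    iu = np.triu_indices(n, 1)
    gamma = 1.0 - np.exp(-3.0 * d)
    return np.sum((gamma[iu] - gamma_hat[iu]) ** 2) + lam * np.sum(np.abs(z_flat))

z0 = rng.normal(scale=0.1, size=n)
res = minimize(loss, z0, method="L-BFGS-B")   # learned latent coordinates in res.x
```

Distances in the expanded space are invariant to shifts and sign flips of the latent coordinate, so the learned `res.x` matches `z_true` only up to such transformations.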
Motivated by the need to predict new HIV diagnosis rates from publicly available HIV data that are abundant in space but have few points in time, a class of spatially varying autoregressive (SVAR) models compounded with conditional autoregressive (CAR) spatial correlation structures is proposed. A copula approach coupled with a flexible CAR formulation is employed to model the dependency between adjacent counties. These models allow for spatial and temporal correlation as well as space-time interactions, and are naturally suited to predicting spatio-temporal disease data with such a structure. They also allow us to estimate the spatially varying evolution pattern of the disease. We apply the proposed models to HIV data from Florida, California, and the New England states, and compare them to a range of linear mixed models that have recently been popular for modeling spatio-temporal disease data. The results show that, for such data, our proposed models outperform the others in terms of prediction.
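As background for the CAR component, a proper-CAR precision matrix over a county adjacency graph is Q = tau(D - rho W), and a spatially varying AR(1) series can be simulated with CAR-correlated innovations. A minimal sketch with a made-up five-county adjacency; the copula coupling and model fitting of the thesis are not shown.

```python
import numpy as np

# Toy adjacency for 5 counties on a line (made-up neighborhood structure)
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
D = np.diag(W.sum(axis=1))

rho, tau = 0.9, 2.0        # spatial dependence and precision scale (illustrative)
Q = tau * (D - rho * W)    # proper-CAR precision matrix, positive definite for |rho| < 1
Sigma = np.linalg.inv(Q)   # implied spatial covariance across counties

# Spatially varying AR(1) in time with CAR-correlated innovations
rng = np.random.default_rng(1)
phi = np.array([0.5, 0.6, 0.4, 0.7, 0.55])   # county-specific AR coefficients
L = np.linalg.cholesky(Sigma)
y = np.zeros((20, 5))                        # 20 time points x 5 counties
for t in range(1, 20):
    y[t] = phi * y[t - 1] + L @ rng.standard_normal(5)
```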
Spatio-temporal data; Nonstationary models; Areal data; Conditional autoregressive (CAR) model; Copula models
Thu, 13 Apr 2017 00:00:00 GMT
http://hdl.handle.net/2142/97613
Sequential mastery detection and Bayesian learning promotion under cognitive diagnosis models
Ye, Sangbeak
E-learning assessments are becoming a common educational medium for instructing fine-grained skills in modern pedagogy. To realize the advantages of e-learning assessments, it is vital to automate the process of instruction and advancement in accordance with each individual's learning progress. Although computerized learning assessments can adopt established developments from the computerized adaptive testing (CAT) literature, guidelines for facilitating computerized adaptive learning assessments that aim for didactic outcomes remain underdeveloped.
To power the automated process of instruction in an e-learning setting, statistical tools that detect mastery and promote learning were developed. First, we consider the use of sequential change-detection methods under cognitive diagnosis models, introducing change-detection methods that involve different sets of information. We further introduce a model for the didactic value of items that readily leads to a sequential learning-enhancement and learning-detection procedure. Bayesian measurements of the one-step-ahead posterior probability of mastery and the expected sum of attributes are combined with a simple model for learning for use in item selection, and stopping rules are developed for the detection of learning that control the rate of false discovery. Simulation studies showed that the delays of mastery detection can be minimized and mastery acquisition hastened with the statistical methods introduced.
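One standard sequential change-detection device of this kind is a CUSUM statistic on the response log-likelihood ratio between the mastery and non-mastery states. A minimal sketch assuming simple Bernoulli responses with made-up success rates and threshold; the CDM-based procedures in the thesis are richer than this.

```python
import numpy as np

rng = np.random.default_rng(2)
p0, p1, h = 0.3, 0.8, 4.0   # pre/post-mastery success rates and threshold (made up)

def llr(x):
    # Log-likelihood ratio of one 0/1 response: mastery vs non-mastery
    return x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))

def cusum_detect(responses):
    s = 0.0
    for t, x in enumerate(responses):
        s = max(0.0, s + llr(x))    # CUSUM recursion: accumulate evidence, floor at 0
        if s >= h:
            return t                # first item at which mastery is flagged
    return None

# A learner who masters the skill after item 15
x = np.concatenate([rng.random(15) < p0, rng.random(25) < p1]).astype(int)
t_detect = cusum_detect(x)
```

Raising `h` lengthens the detection delay but lowers the false-alarm rate, the trade-off the stopping rules in the thesis are designed to control.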
Cognitive diagnosis model; Sequential analysis; Change detection; Learning promotion
Fri, 21 Apr 2017 00:00:00 GMT
http://hdl.handle.net/2142/97366
Statistical methods for learning sparse features
Hu, Jianjun
With the fast development of networking, data storage, and data collection capacity, big data are now rapidly expanding in all science and engineering domains. When dealing with such data, it is appealing to extract their hidden sparse structure, since sparse structures allow us to understand and interpret the information better. The aim of this thesis is to develop algorithms that can extract such hidden sparse structures in the context of both supervised and unsupervised learning.
In chapter 1, this thesis first examines the limitation of the classical Fisher Discriminant Analysis (FDA), a supervised dimension reduction algorithm for multi-class classification problems. This limitation has been discussed by Cui (2012), and she has proposed a new objective function in her thesis, which is named Complementary Dimension Analysis (CDA) since each sequentially added new dimension boosts the discriminative power of the reduced space. A couple of extensions of CDA are discussed in this thesis, including sparse CDA (sCDA) in which the reduced subspace involves only a small fraction of the features, and Local CDA (LCDA) that handles multimodal data more appropriately by taking the local structure of the data into consideration. A combination of sCDA and LCDA is shown to work well with real examples and can return sparse directions from data with subtle local structures.
In chapter 2, this thesis considers the problem of matrix decomposition that arises in many real applications such as gene repressive identification and context mining. The goal is to retrieve a multi-layer low-rank sparse decomposition from a high dimensional data matrix. Existing algorithms are all sequential: the first layer is estimated, and then the remaining layers are estimated one by one, conditioning on the previous layers. As discussed in this thesis, such sequential approaches have some limitations. A new algorithm is proposed to address those limitations, in which all the layers are solved simultaneously instead of sequentially.
The proposed algorithm in chapter 2 assumes a complete data matrix. In many real applications and cross-validation procedures, however, one must work with a data matrix that has missing values. How to operate the proposed matrix decomposition algorithm in the presence of missing values is the main focus of chapter 3. The proposed solution differs slightly from some existing work, such as penalized matrix decomposition (PMD).
In chapter 4, this thesis considers a Bayesian approach to sparse principal component analysis (PCA). An efficient algorithm, which is based on a hybrid of Expectation-Maximization (EM) and Variational-Bayes (VB), is proposed and it can be shown to achieve selection consistency when both p and n go to infinity. Empirical studies have demonstrated the competitive performance of the proposed algorithm.
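As a point of reference for sparse PCA, the following sketch uses truncated power iteration (keep only the k largest-magnitude loading entries at each step) on a synthetic spiked covariance. It is a generic stand-in with made-up dimensions and signal strength, not the EM/VB algorithm proposed in the chapter.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 200, 50, 5                       # samples, features, sparsity level
v = np.zeros(p); v[:k] = 1 / np.sqrt(k)    # sparse true loading vector
X = rng.standard_normal((n, 1)) * 3.0 @ v[None, :] + rng.standard_normal((n, p))
S = X.T @ X / n                            # sample covariance, spiked along v

def sparse_pc(S, k, iters=100):
    # Truncated power iteration: after each multiply, keep only the
    # k largest-magnitude entries and renormalize
    w = rng.standard_normal(S.shape[0])
    for _ in range(iters):
        w = S @ w
        w[np.argsort(np.abs(w))[:-k]] = 0.0
        w /= np.linalg.norm(w)
    return w

w = sparse_pc(S, k)
support = set(np.flatnonzero(w))   # should concentrate on the first 5 features
```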
Dimension reduction; Sparsity; Principal component analysis; Matrix decomposition; Regularization; Thresholding; Variational-Bayes; Selection consistency
Thu, 20 Apr 2017 00:00:00 GMT
http://hdl.handle.net/2142/93003
Dimension reduction and efficient recommender system for large-scale complex data
Bi, Xuan
Large-scale complex data, which play an important role in information technology and biomedical research, have drawn great attention in recent years. In this thesis, we address three challenging issues: sufficient dimension reduction for longitudinal data, nonignorable missing data with refreshment samples, and large-scale recommender systems.
In the first part of this thesis, we incorporate correlation structure into sufficient dimension reduction for longitudinal data. Existing sufficient dimension reduction approaches assuming independence may lead to substantial loss of efficiency. We use the quadratic inference function to incorporate the correlation information and the transformation method to recover the central subspace. The proposed estimators are shown to be consistent and more efficient than those assuming independence, and the estimated central subspace also gains efficiency when the correlation information is taken into account. We compare the proposed method with other dimension reduction approaches through simulation studies, and apply the new approach to an environmental health study.
In the second part of this thesis, we address nonignorable missing data, which occur frequently in longitudinal studies and can cause biased estimates. Refreshment samples, which recruit new subjects in subsequent waves from the original population, can mitigate the bias. In this thesis, we introduce a mixed-effects estimating equation approach that enables one to incorporate refreshment samples and recover missing information. We show that the proposed method achieves consistency and asymptotic normality for fixed-effect estimation under shared-parameter models, and we extend it to a more general nonignorable-missing framework. Our finite sample simulation studies show the effectiveness and robustness of the proposed method under different missing mechanisms. In addition, we apply our method to election poll longitudinal survey data with refreshment samples from the 2007-2008 Associated Press–Yahoo! News panel.
In the third part of this thesis, we develop a novel recommender system that tracks users' preferences and recommends items of interest effectively. We propose a group-specific method that utilizes dependency information from users and items sharing similar characteristics under the singular value decomposition framework. The new approach is effective for the "cold-start" problem, where information on new users and new items is not available in the existing data. One advantage of the proposed model is that we are able to incorporate information from the missing mechanism and group-specific features through clustering based on variables associated with missing patterns. In addition, we propose a new algorithm that embeds a back-fitting step into alternating least squares, which avoids large-matrix operations and heavy memory storage and therefore makes scalable computing feasible. Our simulation studies and MovieLens data analysis both indicate that the proposed group-specific method improves prediction accuracy significantly compared with existing competitive recommender-system approaches.
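The alternating-least-squares backbone such recommender systems build on can be sketched directly: each user factor solves a small ridge regression against the item factors on that user's observed ratings, and vice versa. A minimal sketch on synthetic ratings with made-up sizes; the group-specific terms and embedded back-fitting of the proposed method are not shown.

```python
import numpy as np

rng = np.random.default_rng(4)
n_users, n_items, r = 30, 20, 3
R = rng.standard_normal((n_users, r)) @ rng.standard_normal((n_items, r)).T
mask = rng.random(R.shape) < 0.5           # which ratings are observed

def als(R, mask, r, lam=0.1, iters=20):
    U = rng.standard_normal((R.shape[0], r))
    V = rng.standard_normal((R.shape[1], r))
    for _ in range(iters):
        for i in range(R.shape[0]):        # each user factor: ridge vs item factors
            J = mask[i]
            U[i] = np.linalg.solve(V[J].T @ V[J] + lam * np.eye(r), V[J].T @ R[i, J])
        for j in range(R.shape[1]):        # each item factor: ridge vs user factors
            I = mask[:, j]
            V[j] = np.linalg.solve(U[I].T @ U[I] + lam * np.eye(r), U[I].T @ R[I, j])
    return U, V

U, V = als(R, mask, r)
rmse = np.sqrt(np.mean((R - U @ V.T)[mask] ** 2))   # fit on observed entries
```

Each inner update is a closed-form r x r solve, which is what makes the alternating scheme scale to large rating matrices.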
Machine learning; matrix factorization; longitudinal data; estimating equation
Tue, 14 Jun 2016 00:00:00 GMT
http://hdl.handle.net/2142/92928
Sampling for conditional inference on contingency tables, multigraphs, and high dimensional tables
Eisinger, Robert David
We propose new sequential importance sampling methods for sampling contingency tables with fixed margins, loopless undirected multigraphs, and high-dimensional tables. In each case, the proposals are constructed by leveraging approximations to the total number of structures (tables, multigraphs, or high-dimensional tables), based on results in the literature. The methods generate structures whose distribution is very close to the target uniform distribution. Together with their importance weights, the sampled structures are used to approximate the null distribution of test statistics. In the case of contingency tables, we apply the methods to a number of applications and demonstrate an improvement over competing methods. For loopless undirected multigraphs, we apply the method to ecological and security problems and demonstrate excellent performance. In the case of high-dimensional tables, we apply the sequential importance sampling method to the analysis of multimarker linkage disequilibrium data, again with excellent performance.
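The flavor of sequential importance sampling for tables is easiest to see in the simplest case: fill a 2 x n table cell by cell, proposing uniformly over each cell's feasible range, with the importance weight equal to the product of the range sizes. Averaging the weights then estimates the number of tables with the given margins. This uses naive uniform proposals; the proposals in the dissertation are sharper, built from counting approximations.

```python
import numpy as np

rng = np.random.default_rng(5)

def sis_two_row(row_sums, col_sums):
    """Sample one 2 x n table with the given margins; return it together with
    its importance weight (reciprocal of the sequential proposal probability)."""
    rem = row_sums[0]                      # first-row total still to place
    top, weight = [], 1.0
    for j, c in enumerate(col_sums):
        tail = sum(col_sums[j + 1:])       # first-row capacity in later columns
        lo, hi = max(0, rem - tail), min(c, rem)
        x = int(rng.integers(lo, hi + 1))  # uniform over the feasible range
        weight *= hi - lo + 1
        top.append(x)
        rem -= x
    bottom = [c - x for c, x in zip(col_sums, top)]
    return np.array([top, bottom]), weight

# Estimate the number of 2 x 3 tables with row sums (3, 3), column sums (2, 2, 2):
# the average importance weight is an unbiased estimate of the table count
ws = [sis_two_row([3, 3], [2, 2, 2])[1] for _ in range(5000)]
est = sum(ws) / len(ws)
```

Direct enumeration gives 7 such tables, and the weight average converges to that count.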
Monte Carlo method; Sequential importance sampling; Counting problem; Contingency Table
Fri, 08 Jul 2016 00:00:00 GMT
http://hdl.handle.net/2142/92827
Scalable algorithms for Bayesian variable selection
Wang, Jin
The innovation of modern technologies drives research and development on high-dimensional data analysis in diverse fields, where variable selection plays a pivotal role to ensure credible model estimation. We focus on scalable algorithms for variable selection that can handle large data sets.
Firstly, we propose an EM algorithm that returns the MAP estimate of the set of relevant variables. Thanks to its particular updating scheme, the algorithm can be implemented efficiently. We also show that the MAP estimate returned by our EM algorithm achieves variable selection consistency. In practice, the EM algorithm tends to get stuck at local modes, so we propose an ensemble version: repeatedly apply the EM algorithm to bootstrap samples of the data and then aggregate the results. Empirical studies demonstrate the superior performance of this Bayesian bootstrap EM algorithm.
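The ensemble idea is independent of the particular sparse fitter: fit on bootstrap resamples and keep the variables selected in a majority of fits. A minimal sketch using an ISTA lasso solver as a stand-in for the EM step; the design, signal sizes, and penalty level are made up.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 120, 30
beta = np.zeros(p); beta[[0, 3, 7]] = [2.0, -1.5, 1.0]   # true sparse signal
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

def soft(z, t):
    # Soft-thresholding operator (prox of the L1 penalty)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(X, y, lam=0.3, iters=300):
    # Proximal-gradient (ISTA) lasso solver, standing in for the EM step
    n, p = X.shape
    b = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the gradient
    for _ in range(iters):
        grad = -X.T @ (y - X @ b) / n
        b = soft(b - grad / L, lam / L)
    return b

# Ensemble: refit on bootstrap resamples, keep variables selected in a majority
B = 50
freq = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, n)            # bootstrap resample of the rows
    freq += ista_lasso(X[idx], y[idx]) != 0
selected = set(np.flatnonzero(freq / B > 0.5))
```

Aggregating selection frequencies across resamples stabilizes the support estimate against any single run landing in a bad mode, which is the motivation the abstract gives for the ensemble.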
Secondly, we propose a hybrid computational framework for Bayesian variable selection. The new algorithm, SAB, combines the classical EM algorithm with the variational Bayes algorithm and is very fast in handling high-dimensional data with a large number of covariates. To address a critical biological problem, we apply SAB to a state-of-the-art cancer genomics data set with the goal of understanding the complex regulatory relationship between miRNAs and mRNAs in cancer.
In the third part, we study the asymptotic behavior of the SAB algorithm in detail and prove that SAB achieves selection consistency, Bayesian consistency, and an oracle property when the number of covariates grows exponentially with the sample size.
Lastly, we extend the hybrid framework of Bayesian variable selection to logistic models, where we adopt the Polya-Gamma specification and show that it is equivalent to the local approximation method in the variational Bayes framework.
Variable Selection; EM; Ensemble; Variational Bayes; Asymptotic Analysis; Logistic model
Thu, 14 Jul 2016 00:00:00 GMT
http://hdl.handle.net/2142/92763
Statistical analysis of networks with community structure and bootstrap methods for big data
Sengupta, Srijan
This dissertation is divided into two parts, concerning two areas of statistical methodology. The first part concerns the statistical analysis of networks with community structure; the second concerns bootstrap methods for big data.
Statistical analysis of networks with community structure:
Networks are ubiquitous in today's world --- network data arise in fields as varied as scientific studies, sociology, technology, social media, and the Internet, to name a few. An interesting aspect of many real-world networks is the presence of community structure, and with it the problem of detecting that structure.
In the first chapter, we consider heterogeneous networks, which seem not to have been considered in the statistical community detection literature. We propose a blockmodel for heterogeneous networks with community structure and introduce a heterogeneous spectral clustering algorithm for community detection in such networks. Theoretical properties of the clustering algorithm under the proposed model are studied, along with simulation studies and a data analysis.
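For a homogeneous two-block network, the spectral clustering step reduces to splitting on the sign of the second-leading eigenvector of the adjacency matrix. A minimal sketch under a standard (not heterogeneous) stochastic blockmodel with made-up connection probabilities:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
z = np.repeat([0, 1], n // 2)                       # true community labels
P = np.where(z[:, None] == z[None, :], 0.30, 0.05)  # within/between-block probabilities
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                      # symmetric adjacency, no self-loops

# Spectral step: the sign of the second-leading eigenvector splits the blocks
vals, vecs = np.linalg.eigh(A)
order = np.argsort(vals)[::-1]                      # eigenvalues, largest first
labels = (vecs[:, order[1]] > 0).astype(int)
acc = max(np.mean(labels == z), np.mean(labels != z))  # accuracy up to label swap
```

With more than two communities one runs k-means on the rows of the leading eigenvector matrix instead of a single sign split.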
A network feature closely associated with community structure is the popularity of nodes within different communities. Neither the classical stochastic blockmodel nor its degree-corrected extension can satisfactorily capture the dynamics of node popularity. In the second chapter, we propose a popularity-adjusted blockmodel for flexible modeling of node popularity. We establish the consistency of likelihood modularity for community detection under the proposed model, and illustrate the improved empirical insights the methodology can provide by analyzing the political blogs network and the British MP network, as well as through simulation studies.
Bootstrap methods for big data:
Resampling provides a powerful means of evaluating the precision of a wide variety of statistical inference procedures. However, the complexity and massive size of big data make it infeasible to apply traditional resampling methods to big data.
In the first chapter, we consider the problem of resampling for irregularly spaced dependent data. Traditional block-based resampling or subsampling schemes for stationary data are difficult to implement when the data are irregularly spaced, as it takes careful programming effort to partition the sampling region into complete and incomplete blocks. We develop a resampling method called Dependent Random Weighting (DRW) for irregularly spaced dependent data, in which random weights, rather than blocks, are used to resample the data. By allowing the random weights to be dependent, the dependency structure of the data is preserved in the resamples. We study the theoretical properties of this resampling method as well as its numerical performance in simulations.
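The DRW idea can be sketched directly: instead of resampling blocks, draw a positive random weight field whose correlation decays with distance, so nearby observations receive similar weights, and recompute the statistic under those weights. A minimal sketch on a synthetic irregularly spaced series; the weight construction and bandwidth here are illustrative assumptions, not the exact scheme studied in the chapter.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(8)
n = 100
locs = np.sort(rng.uniform(0, 50, n))      # irregularly spaced locations
D = squareform(pdist(locs[:, None]))       # pairwise distances
# A dependent series at those locations (exponential covariance, range 2)
x = np.linalg.cholesky(np.exp(-D / 2.0) + 1e-9 * np.eye(n)) @ rng.standard_normal(n)

def drw_replicates(x, D, ell=5.0, B=500):
    """Dependent random weighting: the weights form a smooth positive random
    field over the locations, so nearby points get similar weights."""
    Lw = np.linalg.cholesky(np.exp(-D / ell) + 1e-9 * np.eye(len(x)))
    reps = np.empty(B)
    for b in range(B):
        w = np.exp(Lw @ rng.standard_normal(len(x)))   # positive, spatially dependent
        reps[b] = np.sum(w / w.sum() * x)              # weighted-mean replicate
    return reps

reps = drw_replicates(x, D)
se_hat = reps.std()    # dependence-respecting standard error for the sample mean
```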
In the second chapter, we consider the problem of resampling massive data, where traditional methods like the bootstrap (for independent data) or the moving block bootstrap (for dependent data) can be computationally infeasible, since each resample has an effective size of the same order as the sample. We develop a new resampling method called the subsampled double bootstrap (SDB) for both independent and stationary data. SDB works by choosing a small random subset of the massive data and then constructing a single resample from that subset using the bootstrap (for independent data) or the moving block bootstrap (for stationary data). We study the theoretical properties of SDB as well as its numerical performance on simulated and real data.
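The SDB recipe itself is short: draw a small subset, take one full-size bootstrap resample from it, and center the replicate at the subset statistic. A minimal sketch for the mean of i.i.d. data, with made-up subset and replicate sizes:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.exponential(size=100_000)        # a "massive" i.i.d. sample

def sdb(x, b, B, stat=np.mean):
    """Subsampled double bootstrap: a small subset of size b, then ONE
    full-size resample from that subset, centered at the subset statistic."""
    n = len(x)
    reps = np.empty(B)
    for i in range(B):
        sub = rng.choice(x, size=b, replace=False)           # cheap subset step
        reps[i] = stat(rng.choice(sub, size=n)) - stat(sub)  # one resample only
    return reps

reps = sdb(x, b=500, B=200)
se_hat = reps.std()
```

For this exponential sample the replicate standard deviation should land near the true standard error of the mean, sd/sqrt(n) = 1/sqrt(100000), at a fraction of the cost of resampling the full data B times.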
Extending the underlying ideas of the second chapter, we introduce two new resampling strategies for big data in the third chapter. The first, aggregation of little bootstraps (ALB), is a generalized resampling technique that includes SDB as a special case. The second, subsampled residual bootstrap (SRB), is a fast version of the residual bootstrap intended for massive regression models. We study both methods through simulations.
network data; resampling; community structure; big data
Fri, 08 Jul 2016 00:00:00 GMT
http://hdl.handle.net/2142/88025
Weak signal identification and inference in penalized model selection
Shi, Peibei
Weak signal identification and inference are very important in penalized model selection, yet they are underdeveloped and not well studied. Existing inference procedures for penalized estimators focus mainly on strong signals. This thesis proposes an identification procedure for weak signals in finite samples and provides a transition phase between noise and strong signal strengths. A new two-step inferential method is introduced to construct better confidence intervals for the identified weak signals. Both theory and numerical studies indicate that the proposed method leads to better confidence coverage for weak signals than asymptotic inference. In addition, the proposed method outperforms the perturbation and bootstrap resampling approaches. The method is illustrated on HIV antiretroviral drug susceptibility data to identify genetic mutations associated with HIV drug resistance.
We also provide an inference method for signals based on the exact distribution of the penalized estimator. The finite-sample distribution is quite different from its asymptotic counterpart and can be highly non-normal, with a point mass at zero. Numerical studies indicate that the density-based approach works well when the true parameter is moderately large, but it cannot provide accurate inference when the signal is weak.
model selection; weak signal; inference
Thu, 16 Jul 2015 00:00:00 GMT
http://hdl.handle.net/2142/87422
Building a Nonparametric Model After Dimension Reduction
Liu, Li
Building a regression model effectively with a large number of covariates is no easy task. We consider applying dimension reduction before building a parametric or spline model. The dimension reduction procedure is based on a canonical correlation analysis between the predictor variables and a spline basis generated for the response variable. One important question in dimension reduction is deciding on the number of effective dimensions needed. We study four tests of dimensionality: a chi-square test, a Wald-type test on eigenvalues, a modified Wald-type test, and a matrix rank test. These tests are motivated by different aspects of the problem and have their own strengths and weaknesses. We discuss and compare them both theoretically and through Monte Carlo simulations, on the basis of which specific recommendations for determining dimensionality are made. Additive regression splines are first fitted to the data in the space of reduced dimensionality. A Tukey-type test of additivity is proposed and compared with Rao's score test. When the hypothesis of additivity is rejected, tensor product splines can be used for model building.
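The dimension-reduction step described here amounts to a canonical correlation analysis between the predictors and a basis expansion of the response. A minimal sketch on a synthetic single-index model, using standardized polynomial terms as a stand-in for the spline basis; the dimensions, link, and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 500, 6
beta = np.array([1.0, -1.0, 0, 0, 0, 0]) / np.sqrt(2)   # true index direction
X = rng.standard_normal((n, p))
y = (X @ beta) ** 3 + 0.2 * rng.standard_normal(n)

# Basis for the response: standardized polynomial terms standing in for splines
ys = (y - y.mean()) / y.std()
Bmat = np.column_stack([ys, ys ** 2, ys ** 3])

def cca_directions(X, B):
    Xc, Bc = X - X.mean(0), B - B.mean(0)
    Sxx, Sbb = Xc.T @ Xc / len(X), Bc.T @ Bc / len(B)
    Sxb = Xc.T @ Bc / len(X)
    # Whiten each block with an inverse Cholesky factor, then SVD the cross-cov
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Wb = np.linalg.inv(np.linalg.cholesky(Sbb))
    U, s, _ = np.linalg.svd(Wx @ Sxb @ Wb.T)
    dirs = Wx.T @ U                     # canonical directions in predictor space
    return dirs / np.linalg.norm(dirs, axis=0), s

dirs, corrs = cca_directions(X, Bmat)
align = abs(dirs[:, 0] @ beta)          # leading direction vs true index direction
```

The singular values `corrs` are the sample canonical correlations whose decay the dimensionality tests in the abstract examine.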
Statistics
Sat, 01 Jan 2000 00:00:00 GMT
http://hdl.handle.net/2142/87423
Contributions to Estimation in Item Response Theory
Trachtenberg, Felicia Lynn
In logistic item response theory models, the number of parameters tends to infinity together with the sample size. Thus, there has been a longstanding question of whether the joint maximum likelihood estimates for these models are consistent. The main contribution of this work is the study of the asymptotic properties and computation of the joint maximum likelihood estimates, as well as an alternative estimation procedure, one-step estimation. The one-step estimates are much easier to compute, yet are consistent and first-order equivalent to the joint maximum likelihood estimates under certain conditions on the sample sizes, provided the marginal distribution of the ability parameter is correctly specified. The one-step estimates are also highly robust against modest misspecifications of the ability distribution. We also study the accuracy of variance estimates for the one-step estimates. Finally, we study goodness-of-fit tests for the models and show that Rao's score test is superior to the existing chi-square tests.
Statistics
Sat, 01 Jan 2000 00:00:00 GMT