Dept. of Statistics
http://hdl.handle.net/2142/17361
Statistical modeling of heterogeneous data
http://hdl.handle.net/2142/45589
Statistical modeling of heterogeneous data
Liu, Yufei
This dissertation is centered on the modeling of heterogeneous data, which is ubiquitous in this digital information age. From the statistical point of view, heterogeneous data are composed of dissimilar components, where the objects within each component are themselves homogeneous. One real-world example is stock return data: stocks in the same industry segment tend to move closely together, while different segments show distinct movement patterns.
Clustering is one of the most popular ways to characterize data heterogeneity, and a classical problem in unsupervised learning. We review the major clustering approaches in Chapter 1. In recent years, non-parametric Bayesian mixture models have attracted increasing attention in the clustering literature; since they are closely related to our work, we review the Mixture of Dirichlet Process model in Chapter 2.
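As a rough, illustrative sketch (not code from the dissertation): the clustering behavior implied by a Dirichlet process mixture can be simulated through the Chinese restaurant process, in which each new object joins an existing cluster with probability proportional to its size, or starts a new cluster with probability proportional to a concentration parameter alpha. All names below are hypothetical.

```python
import random

def crp_partition(n, alpha, seed=0):
    """Sample a partition of n items from the Chinese restaurant process
    with concentration parameter alpha -- the clustering prior implied
    by a Dirichlet process mixture."""
    rng = random.Random(seed)
    tables = []       # tables[k] = number of items in cluster k
    assignments = []  # cluster label for each item
    for i in range(n):
        # item i joins cluster k w.p. tables[k]/(i+alpha),
        # or a new cluster w.p. alpha/(i+alpha)
        r = rng.uniform(0.0, i + alpha)
        cum = 0.0
        for k, size in enumerate(tables):
            cum += size
            if r < cum:
                tables[k] += 1
                assignments.append(k)
                break
        else:
            tables.append(1)
            assignments.append(len(tables) - 1)
    return assignments

labels = crp_partition(100, alpha=1.0)
```

Smaller alpha concentrates the items into fewer clusters; the number of clusters grows roughly like alpha * log(n).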
The main body of the dissertation consists of three generic statistical methods for modeling heterogeneity in different scenarios. As data become increasingly abundant, traditional clustering tasks are often accompanied by additional information about the objects to be clustered, known as side information. Such side information has the potential to complement clustering algorithms and yield more accurate and meaningful results. In Chapter 3 we describe the Two-view Clustering method, a novel non-parametric clustering model capable of robustly incorporating noisy side information. We demonstrate the effectiveness of this new model with three real-world applications in Chapter 4.
Our second work is driven by market segmentation, a key factor in a modern business's success through accurately recognizing customer groups with varying needs. Market segmentation involves dividing a larger market into sub-markets based upon a variety of factors such as customers' demographic information and product preferences. In Chapter 5 we propose a multi-task learning framework to solve this problem.
Our third work, in Chapter 6, addresses a problem arising from citation analysis for research evaluation. In bibliometrics, one central task is to characterize the statistical distribution of citations. This problem is regarded as challenging for two reasons: (i) the citation distributions of almost all subject areas are highly right-skewed; (ii) citation behaviors across subject areas can be drastically different. We propose a mixture model to formally characterize the statistical distribution of citation data. Based on this model, we develop new criteria to evaluate the impact of journals and the performance of research institutes.
Statistical Learning
Clustering
Non-parametric Bayes
Dirichlet Process
Mixture Model
Heterogeneous Data
Thu, 22 Aug 2013 16:48:49 GMT
Statistical models with diverging dimensionality
http://hdl.handle.net/2142/45563
Statistical models with diverging dimensionality
Li, Bin
Nowadays, many statistical applications involve models whose complexity increases with the sample size. Such models pose a challenge to traditional statistical analysis and call for new methodologies and new asymptotic studies, which are exactly the focus of my thesis. In particular, the thesis consists of three parts: (i) a novel non-parametric qualification procedure for lysate protein microarrays; (ii) a theoretical analysis of one-way ANOVA with diverging dimensionality; and (iii) a statistical analysis of multi-task learning.
lysate protein microarray
non-parametric qualification
regularization
one-way analysis of variance (ANOVA)
g-prior
multi-task learning
Generalized information criterion (GIC)
Group Lasso
Thu, 22 Aug 2013 16:47:56 GMT
Statistical inference for dependent data
http://hdl.handle.net/2142/45543
Statistical inference for dependent data
Zhang, Xianyang
Functional data analysis has emerged as an important area of statistics that provides convenient and informative tools for the analysis of data objects of high dimension or high resolution. In the literature, the emphasis has been placed on independent functional data, or on models where the covariates and errors are assumed to be independent. However, the independence assumption is often too strong to be realistic in many applications, especially when the data are collected sequentially over time, as with climate data and high-frequency financial data. Motivated by our ongoing research on the development of high-resolution climate projections through statistical downscaling, we consider the change point problem and the two sample problem for temporally dependent functional data. Specifically, in Chapter 1 we develop a self-normalization based test of the structural stability of temporally dependent functional observations. In Chapter 2 we propose new tests to detect differences in the covariance operators, and their associated characteristics, of two functional time series. The self-normalization approach introduced in the first two chapters is closely linked to the fixed-b asymptotic scheme in the econometrics literature. Motivated by recent studies on heteroskedasticity and autocorrelation consistent robust inference, in Chapter 3 we propose a class of estimators of the asymptotic covariance matrix of the generalized method of moments estimator in stationary time series models. Under mild conditions, we establish the first order asymptotic distribution of the Wald statistic when the smoothing parameter is held fixed. Furthermore, we derive higher order Edgeworth expansions for the finite sample distribution of the Wald statistic in the Gaussian location model under the fixed-smoothing paradigm. The results are used to justify the second order correctness of a new bootstrap method, the Gaussian dependent bootstrap, in the context of the Gaussian location model. Finally, in Chapter 4, we describe an extension of the fixed-b approach to the empirical likelihood estimation framework.
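A minimal sketch of the self-normalization idea, for a scalar mean rather than the functional setting of the thesis: the normalizer is built from recursive partial sums of the data itself, so no bandwidth choice or long-run variance estimate is needed. The function below is illustrative only.

```python
def self_normalized_stat(x, mu0=0.0):
    """Self-normalized statistic for H0: E[x_t] = mu0.
    T_n = n * (xbar - mu0)^2 / V_n, where the normalizer
    V_n = n^{-2} * sum_t (S_t - (t/n) S_n)^2 is built from the
    centered partial sums S_t, avoiding any bandwidth parameter."""
    n = len(x)
    s, total = [], 0.0
    for xt in x:
        total += xt - mu0
        s.append(total)          # S_t = sum_{i<=t} (x_i - mu0)
    sn = s[-1]
    vn = sum((s[t] - (t + 1) / n * sn) ** 2 for t in range(n)) / n ** 2
    return (sn / n) ** 2 * n / vn

t = self_normalized_stat([0.1, -0.2, 0.3, 0.05, -0.1, 0.2, 0.0, 0.15])
```

Because both numerator and normalizer scale the same way under a change of units, the statistic is scale invariant, which is what makes the approach free of nuisance smoothing parameters.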
Functional data
Change-point problem
Two sample problem
Self-normalization
High order expansion
Bootstrap
Empirical likelihood
Thu, 22 Aug 2013 16:47:13 GMT
Model selection for correlated data and moment selection from high-dimensional moment conditions
http://hdl.handle.net/2142/45291
Model selection for correlated data and moment selection from high-dimensional moment conditions
Cho, Hyun Keun
High-dimensional correlated data arise frequently in many studies. My primary research interests lie broadly in statistical methodology for correlated data, such as longitudinal data and panel data. In this thesis, we address two important but challenging issues: model selection for correlated data with a diverging number of parameters, and consistent moment selection from high-dimensional moment conditions.
Longitudinal data arise frequently in biomedical and genomic research, where repeated measurements within subjects are correlated. It is important to select relevant covariates when the dimension of the parameters diverges as the sample size increases. We propose the penalized quadratic inference function to perform model selection and estimation simultaneously in the framework of a diverging number of regression parameters. The penalized quadratic inference function can easily take correlation information from clustered data into account, yet it does not require specifying the likelihood function. This is advantageous compared to existing model selection methods for discrete data with large cluster size. In addition, the proposed approach enjoys the oracle property: it identifies non-zero components consistently with probability tending to 1, and any finite linear combination of the estimated non-zero components has an asymptotic normal distribution. We propose an efficient algorithm that selects an effective tuning parameter to solve the penalized quadratic inference function. Monte Carlo simulation studies show that the proposed method selects the correct model with high frequency and estimates covariate effects accurately even when the dimension of the parameters is high. We illustrate the proposed approach by analyzing periodontal disease data.
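The abstract does not spell out the penalty, but the keyword list names the smoothly clipped absolute deviation (SCAD) of Fan and Li (2001); a sketch of that penalty function, with the conventional choice a = 3.7:

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(theta): behaves like the lasso penalty
    near zero (encouraging exact zeros) but flattens out for large
    coefficients, which is what yields the oracle property."""
    t = abs(theta)
    if t <= lam:
        # linear (lasso-like) region
        return lam * t
    if t <= a * lam:
        # quadratic transition region
        return -(t * t - 2 * a * lam * t + lam * lam) / (2 * (a - 1))
    # constant region: large coefficients are not further penalized
    return (a + 1) * lam * lam / 2
```

The three pieces join continuously at |theta| = lam and |theta| = a*lam, and the penalty is symmetric in theta.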
The generalized method of moments (GMM) approach combines moment conditions optimally to obtain efficient estimation without specifying the full likelihood function. However, the GMM estimator can be infeasible when the number of moment conditions exceeds the sample size. This research addresses issues arising when the dimension of the estimating equations or moment conditions far exceeds the sample size, as in selecting an informative correlation structure or modeling dynamic panel data. We propose a Bayesian information criterion-type criterion to select the optimal number of linear combinations of moment conditions. In theory, we show that the proposed criterion leads to consistent selection of the number of principal components for the weighting matrix in the GMM. Monte Carlo studies indicate that the proposed method outperforms existing methods in reducing bias and improving the efficiency of estimation. We also illustrate moment selection with a real data example using dynamic panel data models.
diverging number of parameters
dynamic panel data models
generalized method of moments
high-dimensional moment conditions
moment selection
longitudinal data
model selection
oracle property
quadratic inference function
smoothly clipped absolute deviation (SCAD)
singular matrix
Thu, 22 Aug 2013 16:35:00 GMT
Nonparametric testing for random effects in mixed effects models based on the piecewise linear interpolate of the log characteristic function
http://hdl.handle.net/2142/42126
Nonparametric testing for random effects in mixed effects models based on the piecewise linear interpolate of the log characteristic function
Bawawana, Bavwidinsi
Traditional linear mixed effects models assume that the distributions of the random effects and errors are normal with mean zero and homoscedastic variance σ². This thesis presents a new nonparametric test of normality for the random effects distribution, based on the piecewise linear interpolate of the log characteristic function along a grid, designed for settings where the number of replications is low. The ideas behind this approach were first presented by Meintanis and Portnoy (2011). The best initial grid and grid length were determined, and the empirical powers of the presented approach were compared with those of the Kolmogorov-Smirnov test under two scenarios: the first assuming the variance of the random errors is less than the variance of the random effects distribution, and the second assuming the contrary. The method was proved to be root-n consistent. A real data set from the Modification of Diet in Renal Disease (MDRD) Trial, Study A -- a longitudinal study -- was used to test for normality of the random effects and error distributions; further, the empirical density functions were estimated through deconvolution of the empirical characteristic functions of the random effects and errors.
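As an illustration of the main ingredients (not the thesis's tuned procedure): the empirical characteristic function, and a piecewise linear interpolate of its logarithm along a grid, can be computed as follows. The data set and grid here are illustrative placeholders.

```python
import cmath

def ecf(data, t):
    """Empirical characteristic function: phi_n(t) = mean of exp(i t X_j)."""
    return sum(cmath.exp(1j * t * x) for x in data) / len(data)

def log_ecf_interpolant(data, grid):
    """Piecewise linear interpolate of the log empirical characteristic
    function along a grid of t-values (grid choice is illustrative,
    not the optimized grid studied in the thesis)."""
    knots = [(t, cmath.log(ecf(data, t))) for t in grid]
    def interp(t):
        for (t0, y0), (t1, y1) in zip(knots, knots[1:]):
            if t0 <= t <= t1:
                w = (t - t0) / (t1 - t0)
                return (1 - w) * y0 + w * y1
        raise ValueError("t outside grid")
    return interp

data = [0.2, -0.5, 1.1, 0.0, -0.3, 0.7]
f = log_ecf_interpolant(data, [0.0, 0.5, 1.0])
```

Since phi_n(0) = 1 always, the interpolate passes through 0 at t = 0; under normality with mean zero, the log characteristic function is the quadratic -sigma^2 t^2 / 2, which is what a test can compare the interpolate against.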
Linear Mixed Effect Model (LMEM)
Characteristic function
Normal distribution
Nonparametric testing
Power
Deconvolution
Density function
Sat, 01 Dec 2012 00:00:00 GMT
Ensemble filtering for state space models
http://hdl.handle.net/2142/34574
Ensemble filtering for state space models
Yun, Jong Hyun
The state space model has been widely used in various fields including economics, finance, bioinformatics, oceanography, and tomography. The goal of the filtering problem is to find the posterior distribution of the hidden state given the current and past observations. The first part of my thesis focuses on designing efficient proposal distributions for particle filters. I propose a new approach named the augmented particle filter (APF), which combines two sets of particles from the observation and state equations. The APF can be applied to general state space models, and it does not require special structures of the model or any approximation to the target or proposal distribution. I find through simulation studies that the APF performs similarly to or better than other filtering algorithms in the literature. The convergence of the augmented particle filter has been established.
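For context, a standard bootstrap particle filter -- the baseline that proposal-design work such as the APF improves on, not the APF itself -- can be sketched for a simple linear-Gaussian state space model. All parameters here are illustrative.

```python
import math
import random

def bootstrap_filter(obs, n_particles=500, phi=0.9, sq=1.0, sr=1.0, seed=1):
    """Bootstrap particle filter for the illustrative model
        x_t = phi * x_{t-1} + N(0, sq^2),   y_t = x_t + N(0, sr^2).
    Returns the filtered posterior mean of the hidden state at each time."""
    rng = random.Random(seed)
    parts = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    means = []
    for y in obs:
        # propagate particles through the state equation (the "blind" proposal)
        parts = [phi * x + rng.gauss(0.0, sq) for x in parts]
        # weight each particle by the observation density of y given x
        w = [math.exp(-0.5 * ((y - x) / sr) ** 2) for x in parts]
        tot = sum(w)
        w = [wi / tot for wi in w]
        means.append(sum(wi * x for wi, x in zip(w, parts)))
        # multinomial resampling to combat weight degeneracy
        parts = rng.choices(parts, weights=w, k=n_particles)
    return means

est = bootstrap_filter([0.5, 0.8, 0.2, -0.1, 0.4])
```

The weakness visible here is that the proposal ignores the current observation; incorporating observation information into the proposal is exactly the kind of improvement the augmented particle filter pursues.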
The second part of my thesis develops localization methods for particle filters in high dimensional state space models. In such models, computational constraints prevent us from using the large number of particles needed to avoid degeneracy of the importance weights. When the dimension of the state vector is high, it is common that only a few components of the state vector depend on any single component, or small set of components, of the observation vector. In filtering problems, the idea of localization is to use the information in the components of the observation vector to update only the corresponding few components of the hidden state vector.
I propose the localized augmented particle filter. This new approach divides the state vector into small blocks and updates each block through the state dynamics and the observations. By considering blocks, the influence of the observations in updating the state vector is restricted to a few blocks, so the localized augmented particle filter allows the proposal distribution to be constructed in a lower dimension than the original model. The localized augmented particle filter can outperform many other methods in the literature. Its convergence has been proved for a class of models.
Finally, a method to improve particle filters by dividing the particles into independent batches is presented. The development of this method is motivated by the particle Markov chain Monte Carlo method of Andrieu et al. (2010). The combination of particle filters run in batches often outperforms the standard particle filter, and parallel computing techniques can easily be applied to speed up the implementation. The convergence of the batched particle filter has been established: as the number of batches goes to infinity, the estimate based on the combination of batches converges to the target.
Nonlinear filtering
Sequential Monte Carlo
Particle filter
Ensemble Kalman filter
State space model
Target tracking
Augmented particle filter
Localized augmented particle filter
Particle Markov chain Monte Carlo
Lorenz model
Particle filtering with independent batches
Tue, 18 Sep 2012 21:26:13 GMT
Contributions to modeling parasite dynamics and dimension reduction
http://hdl.handle.net/2142/32062
Contributions to modeling parasite dynamics and dimension reduction
Cui, Na
For my thesis, I have worked on two projects: modeling parasite dynamics (Chapter 2) and complementary dimensionality analysis (Chapter 3).
In the first project, we study longitudinal data on infection with the parasite Giardia lamblia among children in Kenya. Understanding infection and recovery rates for parasitic infections is valuable for public health planning. Two challenges in modeling these rates are that (1) infection status is only observed at discrete times even though infection and recovery take place in continuous time, and (2) detectability of infection is imperfect. We address these issues through a Bayesian hierarchical model based on a random effects Weibull distribution. The model incorporates heterogeneity of the infection and recovery rates among individuals and allows for imperfect detectability. We estimate the model by a Markov chain Monte Carlo algorithm with data augmentation, and present simulation studies along with an application to the Kenyan infection study.
The second project focuses on supervised dimension reduction (SDR), whose goal is to find a compact yet informative representation of the original data space via some transformation. Most SDR algorithms are formulated as an optimization problem whose objective is a linear function of the second order statistics of the data. However, such an objective tends to overemphasize directions that already achieve large between-class distances while contributing little to classification accuracy. To address this issue, we introduce two objective functions directly linked to classification accuracy, and then present an algorithm that sequentially solves these nonlinear objectives.
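The classic instance of an objective built from second order statistics is Fisher discriminant analysis (listed among the keywords), whose direction is w ∝ S_w⁻¹(μ₁ − μ₀); a small two-class, two-dimensional sketch with hypothetical data:

```python
def fisher_direction(x0, x1):
    """Fisher discriminant direction w = S_w^{-1} (mu1 - mu0) for two
    classes of 2-D points: a dimension-reduction direction determined
    entirely by means and within-class scatter (second order statistics)."""
    def mean(pts):
        n = len(pts)
        return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]
    def scatter(pts, m):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for p in pts:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s
    m0, m1 = mean(x0), mean(x1)
    s0, s1 = scatter(x0, m0), scatter(x1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # solve the 2x2 system S_w w = diff by the explicit inverse
    return [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
            (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]

x0 = [(0, 0), (1, 0), (0, 1), (1, 1)]   # class 0 (hypothetical)
x1 = [(4, 0), (5, 0), (4, 1), (5, 1)]   # class 1 (hypothetical)
w = fisher_direction(x0, x1)
```

With isotropic within-class scatter, as here, the direction simply points along the mean difference; the thesis's critique is that objectives of this form can keep favoring such already well-separated directions.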
Bayesian hierarchical model
Infection rate
Markov chain Monte Carlo
Panel data
Dimension reduction
Fisher discriminant analysis
Wed, 27 Jun 2012 21:31:05 GMT
Image classification and feature selection
http://hdl.handle.net/2142/31979
Image classification and feature selection
Chen, Gang
Tissue classification and feature selection have been increasingly studied during the last two decades; however, the available methods are still limited and need improvement. In this manuscript, we develop tissue classification and feature selection methods based on dynamic Adaboost with logistic regression as its weak learner, and a new variational Bayesian (VB) logistic regression with regularization. Furthermore, we investigate the statistical properties of these methods and extend VB logistic regression to handle large scale data.
In Chapter 1, we introduce key concepts such as ultrasound tissue classification, the level set segmentation method, Bayesian versions of the Lasso and Elastic Net, and variational Bayesian approximation. In Chapter 2, we introduce a framework for tumor segmentation and feature extraction from ultrasound B-mode images, as well as a semi-parametric model for the texture features. In Chapter 3, we apply the Adaboost method with logistic regression as the weak learner for tumor classification; a genetic algorithm (GA) is used for stochastic-search-based feature selection, and the algorithm is parallelized to accelerate the computation. In Chapter 4, we propose a new variational Bayesian logistic regression incorporating Lasso and Elastic Net type regularization for feature selection. In Chapter 5, we extend this VB logistic regression to large scale data via map/reduce cloud computing.
We illustrate the experimental results in each chapter using simulated data and ultrasound image data from our research.
Ultrasound tissue classification
feature selection
Level Set Segmentation
B-mode image
logistic regression
Variational Bayesian
Lasso
Elastic Net
Adaboost
Genetic Algorithm
map/reduce
cloud computing
Wed, 27 Jun 2012 21:22:52 GMT
Varying Coefficients in Logistic Regression with Applications to Marketing Research
http://hdl.handle.net/2142/30947
Varying Coefficients in Logistic Regression with Applications to Marketing Research
Condon, Erin
In the marketing research world today, companies have access to massive amounts of data on the purchase behavior of consumers. Researchers study these data to understand how outside factors, such as demographics and marketing tools, affect the probability that a given consumer will make a purchase. Using panel data, we tackle these questions and propose a logistic regression model in which the coefficients can vary based on a consumer's purchase history. We also introduce a two-step procedure for model selection that uses a group LASSO penalty to decide which variables are informative and which need varying coefficients in the model.
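For illustration (the grouping here is hypothetical, not the thesis's actual design): the group LASSO penalizes the Euclidean norm of each coefficient group, so an entire group -- for example, all varying-coefficient terms belonging to one variable -- is either retained or zeroed out together.

```python
import math

def group_lasso_penalty(beta_groups, lam):
    """Group LASSO penalty: lam * sum_g sqrt(p_g) * ||beta_g||_2,
    where p_g is the size of group g. The unsquared norm is what makes
    whole groups shrink to exactly zero in the penalized fit."""
    return lam * sum(math.sqrt(len(g)) * math.sqrt(sum(b * b for b in g))
                     for g in beta_groups)

# three hypothetical coefficient groups; the second is entirely zero
pen = group_lasso_penalty([[3.0, 4.0], [0.0, 0.0], [1.0]], lam=2.0)
```

A zero group contributes nothing to the penalty, mirroring how the selection procedure drops uninformative variables as whole blocks.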
varying coefficients
group LASSO
marketing research
logistic regression
least absolute shrinkage and selection operator (LASSO)
Tue, 22 May 2012 00:17:42 GMT
Bayesian empirical likelihood for quantile regression
http://hdl.handle.net/2142/29522
Bayesian empirical likelihood for quantile regression
Yang, Yunwen
Bayesian inference provides a flexible way of combining data with prior information. However, quantile regression is not equipped with a parametric likelihood, and therefore Bayesian inference for quantile regression demands careful investigation. This thesis considers a Bayesian empirical likelihood approach to quantile regression. Taking the empirical likelihood into a Bayesian framework, we show that the resulting posterior is asymptotically normal: its mean shrinks towards the true parameter values, and its variance approaches that of the maximum empirical likelihood estimator. Through empirical likelihood, the proposed method enables us to explore various forms of commonality across quantiles for efficiency gains in the estimation of multiple quantiles. By using an MCMC algorithm in the computation, we avoid the daunting task of directly maximizing empirical likelihoods. The finite sample performance of the proposed method is investigated empirically; substantial efficiency gains are demonstrated with informative priors on common features across quantile levels.
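For background (a textbook ingredient of quantile regression, not the Bayesian empirical likelihood machinery itself): the tau-th quantile is defined through the check loss rho_tau(u) = u * (tau - 1{u < 0}), and a sample quantile minimizes the total check loss over the data.

```python
def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0}):
    an asymmetric absolute loss weighting under- and over-shooting
    by tau and 1 - tau respectively."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def sample_quantile(y, tau):
    """The tau-th sample quantile minimizes the total check loss; a
    brute-force search over the observations suffices because a
    minimizer can always be taken at a data point."""
    return min(y, key=lambda q: sum(check_loss(yi - q, tau) for yi in y))

med = sample_quantile([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0], 0.5)
```

With tau = 0.5 the check loss reduces to half the absolute loss, recovering the median; other tau values trace out the whole conditional distribution, which is where borrowing strength across quantile levels pays off.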
Efficiency
Empirical likelihood
High quantiles
Quantile regression
Prior
Posterior
Wed, 01 Feb 2012 00:53:52 GMT