Dissertations and Theses - Statistics
http://hdl.handle.net/2142/17362
Sat, 24 Mar 2018 04:43:41 GMT
Dependence testing in high dimension
http://hdl.handle.net/2142/99102
Dependence testing in high dimension
Yao, Shun
The study of dependence for high dimensional data arises in many different areas of contemporary research. While much existing work focuses on measuring linear and monotone dependence for fixed dimensional data, comparatively little addresses more complex dependence structures, especially when the dimension is allowed to grow. In this thesis, we propose different testing procedures for various independence/dependence related statistical testing problems in high dimension.
In the first part of the thesis, we introduce sum-of-squares type tests for testing mutual independence and banded dependence structure for high dimensional data. The test is constructed based on the pairwise distance covariance and it accounts for the non-linear and non-monotone dependencies among the data. Our test can be conveniently implemented in practice as the limiting null distribution of the test statistic is shown to be standard normal. It exhibits excellent finite sample performance in our simulation studies even when the sample size is small and the dimension is high, and is shown to successfully identify nonlinear dependence in empirical data analysis. On the theory side, asymptotic normality of our test statistic is shown under quite mild moment assumptions and with little restriction on the growth rate of the dimension as a function of sample size. As a demonstration of good power properties for our distance covariance based test, we further show that an infeasible version of our test statistic is rate optimal in the class of Gaussian distributions with equal correlation.
In the second part, we study distance covariance and the related independence test in the high dimension, low sample size setting. We show that the sample distance covariance between two random vectors can be approximated by the sum of squared component-wise sample cross-covariances up to a constant factor. This demonstrates that the distance covariance can only capture linear dependence in high dimension. As a result, it is shown that the distance correlation based "joint" test developed by Székely and Rizzo (2013a) for independence only has trivial power when the two random vectors are nonlinearly dependent but component-wise uncorrelated. This phenomenon is further confirmed in our simulation study. As a remedy, we propose a distance covariance based "marginal" test and show its superior power behavior against its "joint" counterpart.
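For readers unfamiliar with the quantity discussed above, the following is a minimal sketch of the squared sample distance covariance in its classical double-centering (V-statistic) form. It is an illustration only: the function names are hypothetical, and the thesis may use a different (for example U-statistic based) estimator. The toy data show a nonlinear, uncorrelated dependence in low dimension, where distance covariance does detect dependence.

```python
import numpy as np

def _centered_distances(z):
    """Double-centered pairwise Euclidean distance matrix for the rows of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]  # treat a 1-D input as n observations of a scalar
    diff = z[:, None, :] - z[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    row_means = d.mean(axis=1, keepdims=True)
    col_means = d.mean(axis=0, keepdims=True)
    return d - row_means - col_means + d.mean()

def sample_dcov_sq(x, y):
    """Squared sample distance covariance (biased V-statistic version)."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    return float((a * b).mean())

# Toy data (assumed for illustration): y_dep = x^2 is uncorrelated with x
# but fully dependent on it; y_ind is drawn independently of x.
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y_dep = x ** 2
y_ind = rng.standard_normal(500)

dcov_dependent = sample_dcov_sq(x, y_dep)
dcov_independent = sample_dcov_sq(x, y_ind)
dcov_constant = sample_dcov_sq(x, np.ones(500))
```

In this low-dimensional toy case the statistic is much larger for the dependent pair than for the independent pair, and it vanishes when one argument is constant.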
Non-linear dependence; High dimensionality; Distance covariance; U-statistics
Fri, 14 Jul 2017 00:00:00 GMT
Individualized learning and integration for multi-modality data
http://hdl.handle.net/2142/98381
Individualized learning and integration for multi-modality data
Tang, Xiwei
Individualized modeling and multi-modality data integration have experienced explosive growth in recent years, with many important applications in biomedical research, personalized education and marketing. Conventional statistical models usually fail to capture significant variation due to subject-specific effects and heterogeneity of data from multiple sources. Consequently, it has become critical to incorporate individuals’ and modalities’ heterogeneous characteristics in order to efficiently explore the data structure and enhance the prediction power. In this thesis, we address three challenging issues: mixture modeling for longitudinal data, individualized variable selection and multi-modality tensor learning with an application in medical imaging analysis.
In the first part of the thesis, we develop a model-based subgrouping method for longitudinal data. Specifically, we propose an unbiased estimating equation approach for a two-component mixture model with correlated response data. In contrast to most existing longitudinal data clustering methods, the proposed model allows subgroup membership to change for each individual over time. Furthermore, we incorporate a correlation structure on unobservable latent indicator variables. Another advantage of our approach is that we do not require any information about the joint likelihood function for each subject. The proposed model is shown to have more efficient parameter estimators for both mixing proportions and component densities. In addition, by utilizing within-subject serial correlations, the proposed approach enhances classification power compared to existing methods, especially for boundary observations.
In the second part of the thesis, we propose an individualized variable selection approach to select different relevant variables for different individuals. The conventional homogeneous model, which assumes all subjects share the same effects of certain predictors, may wash out important information due to heterogeneous variation. For example, in personalized medicine, some individuals could have positive responses to the treatment while some individuals could have negative ones. Hence the population average effect could be close to zero. In this thesis, we construct a separation penalty with multi-directional shrinkages including zero, which facilitates individualized modeling to distinguish strong signals from noisy ones. As a byproduct, the proposed model identifies subgroups among which individuals share similar effects, and thus improves estimation efficiency and personalized prediction accuracy. Finite sample simulation studies and an application to HIV longitudinal data demonstrate the model efficiency and the prediction power of the new approach compared to a variety of existing penalization models.
In the third part of the thesis, we are interested in employing medical imaging data for diagnosis. This work is motivated by breast cancer imaging data produced by a multimodality multiphoton optical imaging technique. We develop an innovative multilayer tensor learning method to predict disease status effectively through utilizing subject-wise imaging information. In particular, we propose an individualized multilayer model which leverages an additional layer of individual structure of imaging shared by multiple modalities in addition to employing a high-order tensor decomposition shared by populations. One major advantage of our approach is that we are able to capture the spatial information of microvesicles observed in certain modalities of optical imaging through integrating multimodality imaging data. Our simulation studies and real data analysis both indicate that the proposed multilayer learning method improves prediction accuracy significantly compared to existing competitive statistical and machine learning methods.
Mixture modeling; Individualized variable selection; Tensor; Imaging analysis
Thu, 13 Jul 2017 00:00:00 GMT
Longitudinal principal components analysis for binary and continuous data
http://hdl.handle.net/2142/98374
Longitudinal principal components analysis for binary and continuous data
Kinson, Christopher Leron
Large-scale data, or big data, is an enormously popular term in the data science and statistics communities. These datasets are often collected over periods of time, at hourly and weekly rates, with the help of technological advancements in physical and cloud-based storage. The information stored is useful, especially in biomedicine, insurance, and retail, where patients and customers are crucial to business survival. In this thesis, we develop new statistical methodologies for handling two types of datasets: continuous data and binary data.
Time-varying associations among store products provide important information to capture changes in consumer shopping behavior. In the first part of this thesis, we propose a longitudinal principal component analysis (LPCA) using a random-effects eigen-decomposition, where the eigen-decomposition utilizes longitudinal information over time to model time-varying eigenvalues and eigenvectors of the corresponding covariance matrices. Our method can effectively analyze large marketing data containing sales information for selected consumer products from hundreds of stores over an 11-year time period. The proposed method leads to more accurate estimation and interpretation compared to comparable approaches, which is illustrated through finite sample simulations. We show our method's capabilities and provide an interpretation of the eigenvector estimates in an application to IRI marketing data.
In the second part of this thesis, we formulate the LPCA problem for binary data. We propose capturing the associations among the products or variables through the odds ratios, where a two by two contingency table contains probabilities representing the joint distribution of two binary products. The eigen-decomposition utilizes longitudinal information over time to model time-varying eigenvalues and eigenvectors of the corresponding odds ratio matrices. These odds ratio matrices measure the pairwise associations among the binary products and are more appropriate to use than the Pearson correlation coefficient. Our method illustrates an improvement in visualization and interpretation through simulation studies and an application to IRI panel data of individual customer purchases.
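As a small illustration of the association measure named above, the sample odds ratio for two binary variables can be computed from the counts of their two by two contingency table. This is a minimal sketch with a hypothetical function name and made-up purchase indicators; it is not the estimator or the eigen-decomposition used in the thesis.

```python
import numpy as np

def odds_ratio(u, v):
    """Sample odds ratio (n11 * n00) / (n10 * n01) for two binary vectors."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    n11 = int(np.sum(u & v))    # both products purchased
    n10 = int(np.sum(u & ~v))   # only the first purchased
    n01 = int(np.sum(~u & v))   # only the second purchased
    n00 = int(np.sum(~u & ~v))  # neither purchased
    if n10 == 0 or n01 == 0:
        # The plain ratio is undefined with an empty off-diagonal cell;
        # a continuity correction would be needed in that case.
        raise ValueError("odds ratio undefined: empty off-diagonal cell")
    return (n11 * n00) / (n10 * n01)

# Made-up purchase indicators for two products across eight shopping trips.
product_a = [1, 1, 1, 0, 0, 0, 1, 0]
product_b = [1, 1, 0, 0, 0, 1, 1, 0]
example_or = odds_ratio(product_a, product_b)  # (3 * 3) / (1 * 1) = 9.0
```

An odds ratio above 1 indicates that the two products tend to be purchased together; a value of 1 corresponds to no association.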
Time-varying; Longitudinal; Eigen-decomposition; Non-parametric spline; Odds ratio
Wed, 12 Jul 2017 00:00:00 GMT
Statistical inference of multivariate time series and functional data using new dependence metrics
http://hdl.handle.net/2142/98188
Statistical inference of multivariate time series and functional data using new dependence metrics
Lee, Chung Eun
In this thesis, we focus on inference problems for time series and functional data and develop new methodologies by using new dependence metrics, which can be viewed as extensions of the Martingale Difference Divergence (MDD) [see Shao and Zhang (2014)] that quantifies the conditional mean dependence of two random vectors. In one part, new approaches to dimension reduction of multivariate time series for the conditional mean and conditional variance are proposed by applying new metrics, the so-called Martingale Difference Divergence Matrix (MDDM), Volatility Martingale Difference Divergence (VMDDM), and vec Volatility Martingale Difference Divergence (vecVMDDM). In the other part, we propose a nonparametric conditional mean independence test for a response variable Y given a covariate variable X, both of which can be function-valued or vector-valued. The test is built upon the Functional Martingale Difference Divergence (FMDD), which fully measures the conditional mean independence of Y on X.
Conditional mean; Dimension reduction; Nonlinear dependence
Fri, 30 Jun 2017 00:00:00 GMT
Fast algorithms for Bayesian variable selection
http://hdl.handle.net/2142/98201
Fast algorithms for Bayesian variable selection
Huang, Xichen
Variable selection for regression and classification models is an important but challenging problem. There are generally two approaches, one based on penalized likelihood, and the other based on a Bayesian framework. We focus on the Bayesian framework, in which a hierarchical prior is imposed on all unknown parameters including the unknown variable set. The Bayesian approach has many advantages: for example, we can obtain the posterior distribution of the sub-models, and more accurate prediction may be obtained by model averaging.
However, as the posterior distribution of the model parameters is usually not in closed form, posterior inference that relies on Markov Chain Monte Carlo (MCMC) has a high computational cost, especially in high-dimensional settings, which makes Bayesian approaches undesirable. In order to deal with datasets with a large number of features, we aim to develop fast algorithms for Bayesian variable selection, which approximate the true posterior distribution yet still return the right inference (at least asymptotically).
In this thesis, we start with a variational algorithm for linear regression. Our algorithm is based on the work by Carbonetto and Stephens (2012), with essential modifications including the updating scheme and truncation of posterior inclusion probabilities. We show that our algorithm achieves both frequentist and Bayesian variable selection consistency.
Then we extend our variational algorithm to logistic regression by incorporating the Polya-Gamma data-augmentation trick (Polson et al., 2013), which links our algorithm for linear regression with logistic regression. However, as the variational algorithm needs to update the variational distribution of all the latent Polya-Gamma random variables, one per observation, at every iteration, this algorithm is slow when there is a huge number of observations, and can even be infeasible when the data is too large to be loaded into computer memory. We propose an online algorithm for logistic regression, under the framework of online convex optimization. Our algorithm is fast, and achieves similar accuracy (log-loss) to the state-of-the-art algorithm (Follow-the-Regularized-Proximal algorithm).
Bayesian variable selection; Variational Bayesian methods; Online learning
Mon, 10 Jul 2017 00:00:00 GMT
Heterogeneity modeling and longitudinal clustering
http://hdl.handle.net/2142/98160
Heterogeneity modeling and longitudinal clustering
Zhu, Xiaolu
Personalization has broad applications in many fields these days. Due to significant subject variations, it has become critical to incorporate subjects' heterogeneous characteristics in order to efficiently allocate personalized treatment or marketing strategies tailored to subject-specific needs. In this thesis, we develop several types of methods and theory to accommodate heterogeneity modeling in various personalization applications for longitudinal data.
In the first application, we propose a personalized drug dosage recommendation scheme. Specifically, we model patients' heterogeneity using subject-specific random effects, and propose an adaptive procedure to estimate new patients' random effects and provide dosage recommendations for new patients over time. An advantage of our approach is that we do not impose any distribution assumption on estimating random effects. Moreover, the new approach can accommodate general time-varying covariates corresponding to random effects. We show that the proposed method is more efficient compared to existing approaches, especially when covariates are time-varying.
In the second part of the thesis, we develop an efficient cluster analysis approach to subgroup longitudinal profiles using a penalized regression method. We utilize a pairwise-grouping penalization on the parameters corresponding to the individual nonparametric B-spline models, and thereby identify clusters based on different patterns of the predicted longitudinal curves. One advantage of the proposed method is that we approximate the longitudinal profiles and cluster trajectories into subgroups simultaneously. To implement the proposed method, we develop an alternating direction method of multipliers (ADMM) algorithm which has the desirable convergence property. In theory, we establish the consistency properties asymptotically. In addition, we show that our method outperforms the existing competitive approaches in our simulation studies and real data example.
In the third part of the thesis, we are interested in marketing segmentation, where customers are clustered into different subgroups due to their heterogeneous responses to the same marketing strategy. Specifically, we propose a pairwise subgrouping approach to identify and categorize similar marketing effects into subgroups. We model customers' purchase decisions as binary responses under the generalized linear model framework and incorporate their longitudinal correlation. We impose penalization on pairwise distances of individual effects to formulate subgroups, where different subgroups are associated with different marketing effects. In theory, we establish the consistency of subgroup identification in the sense that the true underlying segmentation structure can be recovered successfully, in addition to model estimation consistency. We apply the proposed approach to a real data application using IRI marketing data on in-store display marketing effects, where the proposed method performs favorably in terms of subgrouping identification and effects estimation.
Clustering; Heterogeneity modeling; Longitudinal data; Subgrouping
Wed, 31 May 2017 00:00:00 GMT
Sampling for network motif detection and estimation of Q-matrix and learning trajectories in DINA model
http://hdl.handle.net/2142/98162
Sampling for network motif detection and estimation of Q-matrix and learning trajectories in DINA model
Chen, Yinghan
Monte Carlo methods provide tools to conduct statistical inference on models that are difficult or impossible to compute analytically and are widely used in many areas of statistical applications, such as bioinformatics and psychometrics. This thesis develops several sampling algorithms to address open issues in network analysis and educational assessments.
The first problem we investigate is network motif detection. Network motifs are substructures that appear significantly more often in the given network than in other random networks. Motif detection is crucial for discovering new characteristics in biological, developmental, and social networks. We propose a novel sequential importance sampling strategy to estimate subgraph frequencies and detect network motifs. The method is developed by sampling subgraphs sequentially node by node using a carefully chosen proposal distribution. The method generates subgraphs from a distribution close to uniform and performs better than competing methods. We apply the method to four real-world networks and demonstrate outstanding performance in practical examples.
The other two issues are related to educational measurement in psychometrics. Cognitive diagnosis models (CDMs) are partially ordered latent class models to classify students into skill mastery profiles. In educational assessment, these models help researchers analyze students' mastery of skills and learning process based on their responses to test items. The deterministic inputs, noisy "AND" gate model (DINA) is a popular psychometric model for cognitive diagnosis. We investigate the estimation of the Q-matrix in the DINA model. The Q-matrix is a binary matrix which maps each test item to its corresponding required attributes. We propose a Bayesian framework for estimating the DINA Q-matrix. The proposed algorithms ensure that the estimated Q-matrices always satisfy the identifiability constraints. We present Monte Carlo simulations to support the accuracy of parameter recovery and apply our algorithms to Tatsuoka's fraction-subtraction dataset.
The last project is related to the recovery of the learning process. The increasing presence of electronic and online learning resources presents challenges and opportunities for psychometric techniques that can assist in the measurement of abilities and even hasten their mastery. CDMs can assist in carefully navigating through the training and assessment of these skills in e-learning applications. We propose a class of CDMs for modeling changes in attributes, which we refer to as learning trajectories. We focus on the development of Bayesian procedures for estimating parameters of a first-order hidden Markov model and apply the developed model to a spatial rotation experimental intervention.
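To make the role of the Q-matrix concrete, the sketch below evaluates the standard DINA item response probabilities for one student: an item's ideal response is 1 only when the student has mastered every attribute the Q-matrix row requires, and the slip and guess parameters perturb that ideal response. The function name and all numeric values are hypothetical, and the sketch does not implement the Bayesian Q-matrix estimation proposed in the thesis.

```python
import numpy as np

def dina_correct_prob(alpha, Q, slip, guess):
    """P(correct response) per item under the DINA model.

    alpha : binary attribute mastery profile of one student, shape (K,)
    Q     : binary item-by-attribute matrix, shape (J, K)
    slip  : per-item slip probabilities, shape (J,)
    guess : per-item guessing probabilities, shape (J,)
    """
    alpha = np.asarray(alpha)
    Q = np.asarray(Q)
    slip = np.asarray(slip, dtype=float)
    guess = np.asarray(guess, dtype=float)
    # eta[j] = 1 iff the student masters every attribute required by item j
    eta = np.all(alpha[None, :] >= Q, axis=1).astype(float)
    return (1.0 - slip) ** eta * guess ** (1.0 - eta)

# Hypothetical example: three items, two attributes.
Q = [[1, 0],   # item 1 requires attribute 1
     [0, 1],   # item 2 requires attribute 2
     [1, 1]]   # item 3 requires both attributes
alpha = [1, 0]  # the student has mastered attribute 1 only
probs = dina_correct_prob(alpha, Q, slip=[0.1, 0.1, 0.2], guess=[0.2, 0.2, 0.1])
# probs is [0.9, 0.2, 0.1]: 1 - slip for item 1, guess for items 2 and 3
```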
Network motif; Cognitive diagnosis model; Monte Carlo methods; Sequential importance sampling; Gibbs sampling; Bayesian statistics
Tue, 06 Jun 2017 00:00:00 GMT
Statistical algorithms using multisets and statistical inference of heterogeneous networks
http://hdl.handle.net/2142/98245
Statistical algorithms using multisets and statistical inference of heterogeneous networks
Huang, Weihong
Computational statistics, including methods such as Markov chain Monte Carlo (MCMC), the bootstrap, and approximate Bayesian computation, is an important part of modern statistics and has been widely used in many areas, such as Bayesian statistics, computational biology, and computational physics. In this thesis, we study three problems: improving the efficiency of the EM algorithm, improving the efficiency of the MCMC method, and statistical analysis for heterogeneous networks.
The expectation-maximization (EM) algorithm is widely used in computing maximum likelihood estimates when the observations can be viewed as incomplete data. However, the convergence rate of the EM algorithm can be slow, especially when a large portion of the data is missing. In Chapter 2, we propose the multiset EM algorithm, which can improve the convergence of the EM algorithm. The key idea is to augment the system with a multiset of the missing component, and construct an appropriate joint distribution of the augmented complete data. We demonstrate that the multiset EM algorithm can outperform the EM algorithm, especially when EM has difficulties in convergence and the E-step involves Monte Carlo approximation.
The multiset sampler proposed by Leman et al. (2009) has been shown to be an effective algorithm to sample from complex multimodal distributions, but the multiset sampler requires that the parameters in the target distribution can be divided into two parts: the parameters of interest and the nuisance parameters. In Chapter 3, we propose a new self-multiset sampler (SMSS) which extends the multiset sampler to distributions without nuisance parameters. We also generalize our method to distributions with unbounded or infinite support. Numerical results show that the SMSS and its generalization have a substantial advantage in sampling multimodal distributions compared to the ordinary Markov chain Monte Carlo algorithm and some popular variants.
Heterogeneous networks are useful for modeling complex systems, which consist of different types of objects. However, there are limited statistical models to deal with heterogeneous networks. In Chapter 4, we propose a statistical model for community detection in heterogeneous networks. To allow heterogeneity in the data and the content dependent property of the pairwise relationship, we formulate the heterogeneous version of the mixed membership stochastic blockmodel. We also apply a variational algorithm for posterior inference. We demonstrate the advantage of the proposed method, in modeling overlapping communities and multiple memberships, through simulation studies and applications to the DBLP data.
Multisets; Expectation-maximization (EM) algorithm; Metropolis-Hastings algorithm; Heterogeneous network; Clustering; Mixed membership model; Variational algorithm
Tue, 27 Jun 2017 00:00:00 GMT
Consistent community detection in uni-layer and multi-layer networks
http://hdl.handle.net/2142/98136
Consistent community detection in uni-layer and multi-layer networks
Paul, Subhadeep
Over the last two decades, we have witnessed a massive explosion of our data collection abilities and the birth of a "big data" age. This has led to an enormous interest in statistical inference of a new type of complex data structure, a graph or network. The surge in interdisciplinary interest on statistical analysis of network data has been driven by applications in Neuroscience, Genetics, Social sciences, Computer science, Economics and Marketing. A network consists of a set of nodes or vertices, representing a set of entities, and a set of edges, representing the relations or interactions among the entities. Networks are flexible frameworks that can model many complex systems.
In the majority of the network examples dealt with in the literature, the relations between nodes are assumed to be of the same type such as web page linkage, friendship, co-authorship or protein-protein interaction. However, the complex networks in many modern applications are often multi-layered in the sense that they consist of multiple types of edges/relations among a group of entities. Each of those different types of relations can be viewed as creating its own network, called a layer of the multi-layer network. Multi-layer networks are a more accurate representation of many complex systems since many entities in those systems are involved simultaneously in multiple interactions. In this dissertation we view multi-layer networks in the broad sense that includes multiple types of relations as well as multiple information sources on the same set of nodes (e.g., multiple trials or multiple subjects).
The problem of detecting communities or clusters of nodes in a network has received considerable attention in the literature. As with uni-layer networks, community detection is an important task in multi-layer networks. This dissertation aims to develop new methods and theory for community detection in both uni-layer and multi-layer networks that can be used to answer scientific questions from experimental data.
For community detection in uni-layer and multi-layer graphs, we take three approaches: (1) based on statistical random graph models, (2) based on maximizing quality functions, e.g., the modularity score, and (3) based on spectral and matrix factorization methods.
In Chapter 2 we consider two random graph models for community detection in multi-layer networks, the multi-layer stochastic block model (MLSBM) and a model with a restricted parameter space, the restricted multi-layer stochastic block model (RMLSBM). We derive consistency results for community assignments of the maximum likelihood estimators (MLEs) in both models where MLSBM is assumed to be the true model, and either the number of nodes or the number of types of edges or both grow. We compare MLEs in the two models among themselves and with other baseline approaches both theoretically and through simulations. We also derive minimax error rates and thresholds for achieving consistency of community detection in MLSBM, which are then used to show the advantage of the multi-layer model over a traditional alternative, the aggregate stochastic block model. In simulations RMLSBM is shown to have an advantage over MLSBM when either the growth rate of the number of communities is high or the growth rate of the average degree of the component graphs in the multi-graph is low.
A popular method of community detection in uni-layer networks is maximization of a partition quality function called modularity. In Chapter 3 we introduce several multi-layer network modularity measures based on different random graph null models, motivated by empirical observations from diverse fields of application. In particular, we derive different modularities by defining the multi-layer configuration model, the multi-layer expected degree model and their various modifications as null models for multi-layer networks. These measures are then optimized to detect the optimal community assignment of nodes. We apply the methods to five real multi-layer networks: three social networks from the website Twitter, a complete neuronal network of a nematode, C. elegans, and a classroom friendship network of 7th-grade students.
In Chapter 4 we present a method based on the orthogonal symmetric non-negative matrix tri-factorization of the normalized Laplacian matrix for community detection in complex networks. While the exact factorization of a given order may not exist and is NP-hard to compute, we obtain an approximate factorization by solving an optimization problem. We establish the connection of the factors obtained through the factorization to a non-negative basis of an invariant subspace of the estimated matrix, drawing a parallel with spectral clustering. Using such a factorization for clustering in networks is motivated by analyzing a block-diagonal Laplacian matrix with the blocks representing the connected components of a graph. The method is shown to be consistent for community detection in graphs generated from the stochastic block model and the degree corrected stochastic block model. Simulation results and real data analysis show the effectiveness of these methods under a wide variety of situations, including sparse and highly heterogeneous graphs where the usual spectral clustering is known to fail. Our method also performs better than the state of the art on popular benchmark network datasets, e.g., the political web blogs and the karate club data.
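For orientation, the sketch below computes the classical uni-layer (Newman-Girvan) modularity of a given partition, which compares the observed within-community edge weight with its expectation under a configuration-type null model. It is only the standard single-layer quantity with a hypothetical function name and a toy graph; the multi-layer modularity measures and null models introduced in the dissertation are not reproduced here.

```python
import numpy as np

def modularity(A, labels):
    """Newman-Girvan modularity of a partition of an undirected graph.

    A      : symmetric adjacency matrix, shape (n, n)
    labels : community label of each node, shape (n,)
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)                       # node degrees
    two_m = k.sum()                         # twice the number of edges
    same = labels[:, None] == labels[None, :]
    expected = np.outer(k, k) / two_m       # null-model expected edge weight
    return float(((A - expected) * same).sum() / two_m)

# Toy graph (assumed for illustration): two triangles joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

q_two_blocks = modularity(A, [0, 0, 0, 1, 1, 1])  # equals 5/14
q_one_block = modularity(A, [0, 0, 0, 0, 0, 0])   # equals 0
```

Splitting the toy graph into its two triangles yields a positive score, while putting all nodes in one community yields zero, which is why maximizing modularity favors the two-block partition here.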
In Chapter 5 we once again consider the problem of estimating a consensus community structure by combining information from multiple layers of a multi-layer network or multiple snapshots of a time-varying network. Numerous methods have been proposed in the literature for the more general problem of multi-view clustering in the past decade based on the spectral clustering or a low-rank matrix factorization. As a general theme, these "intermediate fusion" methods involve obtaining a low column rank matrix by optimizing an objective function and then using the columns of the matrix for clustering. Such methods can be adapted for community detection in multi-layer networks with minimal modifications. However, the theoretical properties of these methods remain largely unexplored and most authors have relied on performance in synthetic and real data to assess the goodness of the procedures. In the absence of statistical guarantees on the objective functions, it is difficult to determine if the algorithms optimizing the objective will return a good community structure. We apply some of these methods for consensus community detection in multi-layer networks and investigate the consistency properties of the global optimizer of the objective functions under the multi-layer stochastic block model. We derive several new asymptotic results showing consistency of the intermediate fusion techniques along with the spectral clustering of mean adjacency matrix under a high dimensional setup where both the number of nodes and the number of layers of the multi-layer graph grow. We complement the asymptotic analysis with a thorough numerical study to compare the finite sample performance of the methods.
Motivated by multi-subject and multi-trial experiments in neuroimaging studies, in Chapter 6 we develop a modeling framework for joint community detection in a group of related networks. The proposed model, which we call the random effects stochastic block model, facilitates the study of group differences and subject specific variations in the community structure. In contrast to the previously proposed multi-layer stochastic block models, our model allows community memberships of nodes to vary in each component network or layer with a transition probability matrix, thus modeling the variation in community structure across a group of subjects or trials. We propose methods to estimate the parameters of the model: a variational-EM algorithm and two non-parametric "two-step" methods based on spectral and matrix factorization respectively. We also develop several hypothesis tests with p-values obtained through resampling (permutation test) for differences in community structure in two groups of subjects both at the whole network level and node level. The methodology is applied to publicly available fMRI datasets from multi-subject experiments involving schizophrenia patients along with healthy controls. Our methods reveal an overall putative community structure representative of the groups as well as subject-specific variations within each group. Using our network level hypothesis tests we are able to ascertain a statistically significant difference in community structure between the two groups, while our node level tests help determine the nodes that are driving the difference.
Community detection; Consistency; Co-regularization; Invariant subspaces; Minimax rates; Multi-layer networks; Multi-layer null models; Multi-layer modularity; Multi-layer stochastic block model; Network analysis; Non-negative matrix factorization; Neuroimaging; Random effects stochastic block model; Stochastic block model; Spectral clustering; Variational expectation-maximization (EM)
Fri, 30 Jun 2017 00:00:00 GMT
http://hdl.handle.net/2142/98136
2017-06-30T00:00:00Z
Paul, Subhadeep
Effect size estimation and robust classification for irregularly sampled functional data
http://hdl.handle.net/2142/98126
Effect size estimation and robust classification for irregularly sampled functional data
Park, Yeon Joo
Functional data arise frequently in numerous scientific fields with the development of modern technology. Accordingly, functional data analysis, which extracts information from curves or functions, is an important area of investigation. In this thesis, we address two key issues: measuring the effect size of a variable of interest in the functional analysis of variance (fANOVA) model, and developing a robust probabilistic classifier in the functional response model. We especially consider irregular functional data, where curves are collected over varying or non-overlapping intervals.
First, we develop an approach to quantify the effect size on functional data, perform functional ANOVA hypothesis tests, and conduct power analysis. We introduce the functional signal-to-noise ratio ($fSNR$), visualize the magnitude of effects over the interval of interest, and perform bootstrapped inference. The approach is applicable when the individual curves are sampled at irregularly spaced points or collected over varying intervals. The proposed methods are applied in the analysis of functional data from inter-laboratory quantitative ultrasound measurements and in a reanalysis of Canadian weather data. Moreover, we express the asymptotic power of the functional ANOVA test as a function of the proposed measure. The agreement between the asymptotic and empirical results is examined and found to be quite good even for small sample sizes. The asymptotic lower bound on power can reasonably be used to determine sample size when planning an experimental design.
Second, we build a robust probabilistic classifier for functional data, which predicts the class membership of a given input and provides an informative posterior probability distribution over the set of classes. The method combines Bayes' formula with a semiparametric mixed effects model that has a robust tuning parameter. We aim to make the method robust to outlying curves, especially in providing a robust degree of certainty in prediction, which is crucial in medical diagnosis. It is applicable to various practical structures, such as unequally and sparsely collected samples or repeatedly measured curves retaining between-curve correlation, with a very flexible spatial covariance function. As an illustration, we conduct simulation studies to investigate the sensitivity of the probability estimates to outlying curves under a Gaussian assumption and compare our proposed classifier with other functional classification approaches. Performance is evaluated with a criterion that penalizes confident but false predictions more heavily. The value of the proposed approach lies in its simplicity, flexibility, and computational efficiency. We illustrate the issues and methodology on quantitative ultrasound data (backscatter coefficient vs. frequency functional data, commonly obtained in irregular form) and on a public dataset with artificial contamination. We also show how to implement the proposed classifier in R.
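The abstract does not name the criterion used to penalize confident but false predictions. As an assumption, the sketch below uses logarithmic loss, one standard scoring rule with this property, to show how such a penalty behaves. The function name `log_loss` and the `eps` clipping constant are illustrative choices, not taken from the thesis.

```python
import math


def log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    true_labels: list of integer class indices.
    predicted_probs: list of per-class probability lists, one per sample.
    eps: lower clip on probabilities to avoid log(0) (an assumed constant).
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1.0)
        total -= math.log(p)
    return total / len(true_labels)
```

For a sample whose true class is 0, a prediction of `[0.01, 0.99]` is wrong and confident, while `[0.4, 0.6]` is wrong but hesitant. The first incurs a much larger loss, which is the behavior the evaluation criterion is meant to reward or punish.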
Effect size; Functional analysis of variance (ANOVA); Functional central limit theorem; Functional random effect model; Irregular functional data; Power analysis; Probabilistic classification; Quantitative image analysis; Robustness; Signal-to-noise ratio
Mon, 10 Jul 2017 00:00:00 GMT
http://hdl.handle.net/2142/98126
2017-07-10T00:00:00Z
Park, Yeon Joo