Abstract: Modern applications of statistical approaches involve high-dimensional complex data, where variable selection plays an important role in model construction. In this thesis, we address the following challenging issues in the variable selection problem: variable selection consistency when irrepresentable conditions fail, block-wise missing data from multiple sources, and heterogeneous mediator selection for high-dimensional data.
In the first project, we propose a new Semi-standard PArtial Covariance (SPAC) approach that reduces correlation effects from other predictors while incorporating the magnitude of coefficients. The proposed SPAC variable selection is effective in choosing covariates that have a direct association with the response variable, while removing predictors that are not directly associated with the response but are highly correlated with the relevant predictors. We show that the proposed method with the Lasso penalty and the SCAD penalty enjoys strong sign consistency in both finite-dimensional and high-dimensional settings under regularity conditions. Numerical studies and the ‘HapMap’ gene data application also confirm that the proposed method outperforms the traditional Lasso, adaptive Lasso, SCAD, and Peter–Clark-simple methods for highly correlated predictors.
In the second project, we propose a Multiple Block-wise Imputation (MBI) approach, which incorporates imputations based on both complete and incomplete observations. Specifically, for a given missing pattern group, the imputations in MBI incorporate more samples from groups with fewer observed variables in addition to the group with complete observations. We propose to construct estimating equations based on all available information, and optimally integrate informative estimating functions to achieve efficient estimators. We show that the proposed method has estimation and model selection consistency under both fixed-dimensional and high-dimensional settings. Moreover, the proposed estimator is asymptotically more efficient than the estimator based on a single imputation from complete observations only. In addition, the proposed method is not restricted to data that are missing completely at random. Numerical studies and an ADNI data application confirm that the proposed method outperforms existing variable selection methods under various missing mechanisms.
In the third project, we propose a new mediator selection method, which can simultaneously identify sub-populations and select mediators in each sub-population from high-dimensional data. Specifically, we use the sum of squared residuals of a subject in the mediator models and the outcome model of a sub-population as a distance between the subject and the sub-population. For each subject, we find the smallest distance between this subject and all the sub-populations to determine which sub-population the subject should belong to. We then estimate the parameters for each sub-population based on the subjects assigned to that sub-population. To select mediators instead of just variables, we propose a new joint penalty which penalizes the effect from the independent variable to a mediator (independent-mediator effect) and the effect from the mediator to the outcome (mediator-outcome effect) together. In addition, we propose a difference of convex-smooth gradient descent (DC-SmGD) algorithm to implement the proposed method. Numerical studies show that the proposed method performs better than existing methods for heterogeneous data.
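The residual-based assignment step described in the third project can be sketched as follows. This is only a minimal illustration under assumptions not stated in the abstract: linear mediator and outcome models without intercepts, a single independent variable, and sub-population parameters that are already given. The function names and the parameter names `alphas`, `betas`, and `gammas` are hypothetical. The joint penalty and the DC-SmGD algorithm are not reproduced here.

```python
import numpy as np


def residual_distances(x, M, y, alphas, betas, gammas):
    """Return an (n, K) matrix of subject-to-sub-population distances.

    x: (n,) independent variable; M: (n, p) mediators; y: (n,) outcome.
    alphas: (K, p) independent-mediator effects (assumed linear, no intercept).
    betas:  (K, p) mediator-outcome effects.
    gammas: (K,)   direct effects of x on y.
    """
    # Mediator-model residuals M_ij - alpha_kj * x_i, shape (n, K, p).
    med_res = M[:, None, :] - x[:, None, None] * alphas[None, :, :]
    # Outcome-model residuals y_i - gamma_k * x_i - M_i . beta_k, shape (n, K).
    out_res = y[:, None] - x[:, None] * gammas[None, :] - M @ betas.T
    # Distance = sum of squared residuals over mediator and outcome models.
    return (med_res ** 2).sum(axis=2) + out_res ** 2


def assign_subpopulations(x, M, y, alphas, betas, gammas):
    """Assign each subject to the sub-population with the smallest distance."""
    return residual_distances(x, M, y, alphas, betas, gammas).argmin(axis=1)
```

In a full procedure, this assignment would presumably alternate with re-estimating the parameters within each sub-population; the sketch covers only the assignment given fixed parameters.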