A Comparison of Inference Methods in High-Dimensional Linear Regression

Spring 2022

Document Type

Open Access Thesis

Statistics

Karl B. Gregory

Abstract

Building confidence and credible intervals for high-dimensional (p >> n) linear models has been a subject of exploration for many years. In this thesis, we explore three specific setups. First, we consider the Bayesian LASSO: a double-exponential prior is placed on the regression coefficients, and the resulting posterior distribution supplies the quantiles needed to construct credible intervals for the coefficients. Second, we study the de-sparsified LASSO and use the asymptotic normality of its estimates to construct confidence intervals for the model coefficients. Finally, we incorporate an adaptive LASSO model, for which we construct confidence intervals from quantiles obtained by the residual and perturbation bootstrap methods.

All three methods are compared in a simulation study on coverage of the true coefficient values and on interval width. We fix the sample size n, include a set of correlated covariates among the true predictors, and consider p = 200, 500, and 1000 covariates. We also compare the time it takes to complete 10 runs of each setup on a personal computer. Two correlation structures are assumed for the data: the first is AR-1, and the second is known as compound symmetry. In the AR-1 cases with p >> n, the Bayesian LASSO provides better coverage for the true non-zero coefficients, especially when the correlation is near 0.9 or 0.5. In the compound symmetry cases, the de-sparsified LASSO comes closest to nominal coverage regardless of the value of the correlation coefficient ρ, although its coverage is around 0.90 for p = 1000. This better coverage comes at the cost of wider intervals in the highly correlated cases; under moderate correlation, the de-sparsified LASSO intervals are even narrower for the true non-zero coefficients.
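The two correlation structures used in the simulation study can be written down directly. The following NumPy sketch (not part of the thesis; the seed and dimensions are illustrative choices) builds an AR-1 covariance matrix, Σ_ij = ρ^|i−j|, and a compound symmetry matrix, with 1 on the diagonal and ρ elsewhere, then draws a correlated design matrix:

```python
import numpy as np

def ar1_cov(p, rho):
    """AR-1 covariance: Sigma[i, j] = rho^|i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def compound_symmetry_cov(p, rho):
    """Compound symmetry: 1 on the diagonal, rho off the diagonal."""
    sigma = np.full((p, p), rho)
    np.fill_diagonal(sigma, 1.0)
    return sigma

# Draw an n x p design matrix with correlated covariates (p >> n).
rng = np.random.default_rng(0)
n, p, rho = 100, 200, 0.5
X = rng.multivariate_normal(np.zeros(p), ar1_cov(p, rho), size=n)
print(X.shape)  # (100, 200)
```

Under AR-1, correlation decays geometrically with the distance between covariate indices; under compound symmetry, every pair of covariates is equally correlated, which is what makes the highly correlated cases difficult.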
The bootstrap-generated intervals for the adaptive LASSO tend to provide coverage around 0.90 in the uncorrelated cases, regardless of the number of covariates in the model, but they appear to suffer when the predictors are highly correlated. When the correlation is low and the number of predictors is not too much greater than the sample size, the perturbation bootstrap provides close to nominal coverage. In addition, the adaptive LASSO with the perturbation bootstrap runs faster than the other methods in most cases of our simulation setup. In conjunction with the simulation study, we illustrate the aforementioned methods for building confidence/credible intervals on two real datasets.
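The mechanics of a residual-bootstrap percentile interval can be illustrated in a few lines. This is a minimal sketch, not the thesis procedure: it substitutes ordinary least squares in a low-dimensional problem for the adaptive LASSO, and the dimensions, seed, and number of bootstrap replicates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Low-dimensional stand-in: OLS in place of the adaptive LASSO,
# to show only the residual-bootstrap interval mechanics.
n, p = 50, 3
beta_true = np.array([2.0, 0.0, -1.0])
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
resid = resid - resid.mean()  # center residuals before resampling

B = 1000
boot = np.empty((B, p))
for b in range(B):
    # Refit on data regenerated from the fitted model plus resampled residuals.
    y_star = X @ beta_hat + rng.choice(resid, size=n, replace=True)
    boot[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]

# 95% percentile intervals for each coefficient.
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)
for j in range(p):
    print(f"beta_{j}: [{lower[j]:.2f}, {upper[j]:.2f}]")
```

The perturbation bootstrap differs in that, instead of resampling residuals, each refit reweights the observations (or their residuals) with random multipliers; the quantile step at the end is the same.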
