Date of Award
Summer 2025
Document Type
Open Access Dissertation
Department
Statistics
First Advisor
Karl Gregory
Second Advisor
Xianzheng Huang
Abstract
This dissertation develops new methodology for valid statistical inference following model selection in various regression settings. We first consider the logistic regression model for a binary response that can only be observed at the group level, as in group testing, and is subject to testing error. With the true responses only partially observed, we employ the expectation-maximization (EM) algorithm to account for the missing information in the response data and simultaneously conduct variable selection by maximizing the LASSO-penalized log-likelihood. After variable selection, we extend an existing post-selection inference method based on the polyhedral lemma (Lee et al., 2016) to make inferences on the selected covariates, adjusting for selection bias. Despite the complications caused by the missing response information, the proposed method produces more reliable post-selection inferences than the naive method that ignores selection bias and uncertainty. One drawback of this method, however, is its conservativeness, which leads to wide confidence intervals for the regression coefficients.

To address this limitation, we explore an alternative post-selection inference method based on parametric programming (PP; Le Duy and Takeuchi, 2022). This framework, originally developed for linear models, is extended here to generalized linear models (GLMs), including logistic, Poisson, and beta regression. It improves upon polyhedral-based inference by avoiding conditioning on the signs of the selected coefficients. In addition to achieving higher statistical efficiency, the PP-based method accommodates more complex regression settings in which the selection event need not be a polyhedron in the sample space. We propose a two-step procedure, called parametric programming after linearization (PPL), for selection-adjusted inference in GLMs. In the first step, the GLM is linearized by rewriting the maximum likelihood estimation problem as a weighted least squares problem, thereby generating pseudo-data. In the second step, variable selection and inference are performed on the pseudo-data using the PP framework as in linear regression. Simulation results confirm that the PPL method corrects the inflated Type I error characteristic of naive approaches and yields narrower confidence intervals than the polyhedral method. Importantly, the PPL method supports valid inference even when the tuning parameter is chosen in a data-driven manner (e.g., via a train-test split), thus overcoming a key limitation of the polyhedral-based approach, which requires the tuning parameter to be fixed or selected independently of the data.

Extending this two-step PPL framework, we address post-selection inference for error-prone group testing data with partially observed binary outcomes. Simulation studies show that the method maintains nominal Type I error control across different penalty strengths and remains robust in the presence of correlated covariates. More broadly, the parametric programming strategy opens the door to valid statistical inference following model selection in a wide range of regression settings where the post-selection sampling distribution of a statistic, such as an estimator of a regression coefficient or a test statistic, cannot be derived analytically.
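For reference, the linearization in the first PPL step can be sketched using the standard iteratively reweighted least squares (IRLS) working response; the dissertation's exact pseudo-data construction may differ in detail. With current estimate \hat\beta, linear predictor \eta_i = x_i^\top \hat\beta, mean \hat\mu_i = g^{-1}(\eta_i) for link function g, and variance function V, the pseudo-data are

z_i = \eta_i + (y_i - \hat\mu_i)\, g'(\hat\mu_i), \qquad w_i = \bigl[g'(\hat\mu_i)^2\, V(\hat\mu_i)\bigr]^{-1},

so that the maximum likelihood step solves the weighted least squares problem \min_\beta \sum_{i=1}^n w_i (z_i - x_i^\top \beta)^2. For logistic regression with the logit link, this reduces to w_i = \hat\mu_i(1 - \hat\mu_i) and z_i = \eta_i + (y_i - \hat\mu_i)/\{\hat\mu_i(1 - \hat\mu_i)\}.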
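A minimal sketch of the two PPL steps for logistic regression, assuming standard scientific Python tools, is given below; variable names and tuning choices are illustrative rather than the author's code, and the selective (parametric programming) inference step itself is omitted.

import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

# Illustrative sketch of the two PPL steps for logistic regression:
# (1) linearize the GLM at a (nearly) unpenalized fit to obtain pseudo-data,
# (2) run the lasso on the weighted pseudo-data.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = 1.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# Step 1: IRLS working response at an essentially unpenalized logistic fit
# (large C makes the default ridge penalty negligible).
fit = LogisticRegression(C=1e6).fit(X, y)
eta = fit.intercept_[0] + X @ fit.coef_.ravel()   # linear predictor
mu = 1.0 / (1.0 + np.exp(-eta))                   # fitted probabilities
w = mu * (1.0 - mu)                               # IRLS weights (logit link)
z = eta + (y - mu) / w                            # pseudo-response

# Step 2: lasso selection on the weighted pseudo-data; the PP framework would
# then be applied to this weighted least squares problem to obtain
# selection-adjusted p-values and confidence intervals.
Xw = X * np.sqrt(w)[:, None]
zw = z * np.sqrt(w)
selected = np.flatnonzero(Lasso(alpha=0.1).fit(Xw, zw).coef_)
print("selected covariates:", selected)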
Rights
© 2025, Qinyan Shen
Recommended Citation
Shen, Q. (2025). Post-Selection Inference in Regression Models. (Doctoral dissertation). Retrieved from https://scholarcommons.sc.edu/etd/8355