Date of Award

Summer 2025

Document Type

Open Access Dissertation

Department

Epidemiology and Biostatistics

First Advisor

Jiajia Zhang

Abstract

Electronic health record (EHR) data from millions of patients are now routinely collected across diverse healthcare institutions, offering unprecedented opportunities to advance personalized medicine and improve healthcare quality. EHR data contain rich information, including patient demographics, diagnoses, laboratory test results, medication prescriptions, medical images, and clinical notes. However, analyzing EHR data accurately is challenging because of their large scale, high dimensionality, and inherent complexity. Traditional statistical models often rely on strong assumptions and lack the flexibility needed to accommodate complex, nonlinear relationships among variables. In this dissertation, we propose incorporating deep learning methods into several types of survival models to improve their risk prediction functions, with the specific aim of handling the high-dimensional, nonlinear associations present in EHR data. The three projects presented in this dissertation are motivated by statewide COVID-19 infection and vaccination data from South Carolina.

Among patients vaccinated against COVID-19, some may never experience a breakthrough infection, owing to either effective vaccine-induced immunity or natural immunity. In the first project, we develop a predictive framework based on a mixture cure model (MCM), which accommodates a potentially "cured" subpopulation. The MCM has two components: an incidence component, modeled using logistic regression, which estimates the probability of experiencing a breakthrough infection, and a latency component, modeled with Cox proportional hazards (Cox PH) regression, which estimates the timing of breakthrough events among those who experience them. Deep learning methods are applied to capture the nonlinear risk functions within both the incidence and latency components, and the Expectation-Maximization (EM) algorithm is employed to estimate the model parameters efficiently.
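As an illustration of how the EM algorithm treats the unobserved cure status, the following minimal sketch computes the E-step weights of a standard mixture cure model, in which the population survival is (1 - pi) + pi * S_u(t). The function name, argument names, and toy inputs are assumptions made for this example; in the proposed framework pi and S_u would come from the deep-learning incidence and latency components, and the dissertation's exact formulation may differ.

```python
import numpy as np


def e_step_weights(event, pi_uncured, surv_uncured):
    """Posterior probability that each subject is uncured (susceptible).

    event        : 1 if a breakthrough infection was observed, 0 if censored
    pi_uncured   : incidence output, P(uncured | covariates)
    surv_uncured : latency output, S_u(t | covariates) at the observed time
    All names here are hypothetical and chosen for illustration only.
    """
    event = np.asarray(event, dtype=float)
    pi_uncured = np.asarray(pi_uncured, dtype=float)
    surv_uncured = np.asarray(surv_uncured, dtype=float)

    # Population survival: cured subjects never fail, uncured follow S_u.
    surv_pop = (1.0 - pi_uncured) + pi_uncured * surv_uncured

    # Censored subjects get a Bayes-rule weight; subjects with an observed
    # event are uncured with certainty, so their weight is 1.
    w_censored = pi_uncured * surv_uncured / surv_pop
    return event + (1.0 - event) * w_censored


# Toy inputs (made up): one event, two censored subjects.
weights = e_step_weights(
    event=[1, 0, 0],
    pi_uncured=[0.6, 0.6, 0.6],
    surv_uncured=[0.5, 0.5, 0.0],
)
```

In the M-step these weights would typically enter the incidence likelihood as soft labels and the latency likelihood as case weights, and the two steps alternate until convergence.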

The second project uses deep learning to predict the nonlinear risk function within the Generalized Odds Rate (GOR) model, an alternative to the PH model that flexibly accommodates non-proportional hazards. This flexibility makes the GOR model particularly suitable for dynamic health data in which hazard ratios may vary over time. To simplify the model further and account for unobserved heterogeneity, a gamma-distributed frailty term is incorporated into the GOR model. The EM algorithm is used here as well, with the frailty treated as a latent variable in the frailty-adjusted framework.

In the third project, we extend the methodologies developed in the first two projects by combining the MCM and GOR approaches into a GOR mixture cure model. This combined model uses deep learning to predict the risk functions for both the incidence and latency components, allowing for complex, nonlinear associations across different subpopulations. The EM algorithm is again used to estimate the parameters of both components.
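To make the GOR family used in the second and third projects concrete, here is a minimal sketch of one common parameterization, S(t | x) = [1 + rho * Lambda0(t) * exp(f(x))]^(-1/rho). This form arises when a gamma frailty with mean 1 and variance rho is integrated out of a PH model; rho = 0 recovers the PH model and rho = 1 gives the proportional odds model. The function and argument names are assumptions for this example, and the dissertation's own parameterization may differ.

```python
import numpy as np


def gor_survival(cum_baseline, risk_score, rho):
    """Survival probability under one common GOR parameterization.

    cum_baseline : baseline cumulative hazard Lambda0(t), non-negative
    risk_score   : risk function value f(x), e.g. a neural network output
    rho          : non-negative odds-rate parameter (0 gives the PH model)
    All names are hypothetical and used for illustration only.
    """
    cum_baseline = np.asarray(cum_baseline, dtype=float)
    risk_score = np.asarray(risk_score, dtype=float)
    scaled = cum_baseline * np.exp(risk_score)

    if rho == 0.0:
        # Limiting case: proportional hazards.
        return np.exp(-scaled)
    # General case; rho = 1 is the proportional odds model.
    return (1.0 + rho * scaled) ** (-1.0 / rho)


# Made-up inputs: Lambda0(t) = 1 and f(x) = 0.
s_ph = float(gor_survival(1.0, 0.0, rho=0.0))
s_po = float(gor_survival(1.0, 0.0, rho=1.0))
s_near_ph = float(gor_survival(1.0, 0.0, rho=1e-6))
```

Comparing `s_near_ph` with `s_ph` shows the survival curve approaching the PH form as rho shrinks toward zero.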

All three proposed methods are evaluated through extensive simulation studies, demonstrating their robustness and flexibility. To prevent overfitting, an early-stopping criterion is implemented to identify an optimal stopping point during model training. Finally, the methods are applied to statewide COVID-19 infection and vaccination data, highlighting their practical utility and potential impact on real-world healthcare applications.
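The abstract does not specify which early-stopping criterion is used. As a hedged illustration only, the sketch below implements a common patience-based rule on a validation loss: training stops once the loss has failed to improve for a set number of consecutive epochs, and the best epoch seen so far is kept. The function name, the `patience` and `min_delta` parameters, and the loss values are all assumptions made for this example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the best epoch seen before training would stop.

    val_losses : validation losses in epoch order
    patience   : consecutive non-improving epochs tolerated before stopping
    min_delta  : minimum decrease that counts as an improvement
    This is a generic patience rule, not necessarily the dissertation's.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # Stop training; later epochs are never evaluated.

    return best_epoch


# Made-up validation losses for illustration.
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.75, 0.60]
best_short = early_stopping_epoch(losses, patience=3)
best_long = early_stopping_epoch(losses, patience=5)
```

With these toy losses the shorter patience stops before the late improvement at the final epoch, which illustrates the trade-off between training cost and the risk of stopping too early.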

Rights

© 2025, Xiaowen Sun

Available for download on Tuesday, August 31, 2027

Included in

Biostatistics Commons
