Siyuan Guo

Date of Award

Summer 2022

Document Type

Open Access Dissertation



First Advisor

Jiajia Zhang


Longitudinal measurements are important components of electronic health record (EHR) data. In practice, using longitudinal EHR history is expected to improve estimation or prediction performance when studying outcomes of interest, such as binary or time-to-event outcomes. However, the longitudinal observations in EHR data are complex because EHR visits are irregular and sparse. Modelling longitudinal data and linking them to different types of outcomes is therefore challenging. In this dissertation, we aim to develop methodology to, first, describe the patterns of longitudinal predictors with continuous or binary observations and, second, model scalar or time-to-event outcomes using the longitudinal predictors. The three projects in this dissertation are motivated by the South Carolina statewide HIV data set, which is collected by the South Carolina Department of Health and Environmental Control (SC DHEC).

Longitudinal laboratory measurements of people living with HIV (PLWH) in SC, including CD4 counts and viral load (VL), have been available in the data set since 2004. In all three projects, latent processes are assumed for the longitudinal observations, and functional principal component analysis (FPCA) is used to model the latent processes. In the first project, we propose a functional multi-variable logistic regression model and use 10-year historical longitudinal laboratory results to predict the future viral suppression status of PLWH in SC, a scalar binary outcome. The components of the functional principal component model are estimated by a penalized spline method. The patterns of the longitudinal predictors are summarized by their functional principal component scores. A logistic regression framework is used to link the functional principal component scores to the binary outcome.
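The two-stage idea of the first project (summarize each trajectory by a functional principal component score, then feed the score into a logistic regression) can be illustrated with a minimal sketch. It is not the dissertation's estimator: it assumes simulated curves observed on a common dense grid rather than sparse, irregular visits, replaces the penalized spline estimation with a plain power iteration on the sample covariance, and keeps a single component. All names and parameter values are hypothetical.

```python
import math
import random

random.seed(0)

# Assumed setup: n subjects, each observed on the same dense grid of m points in [0, 1].
n, m = 200, 25
dt = 1.0 / m
grid = [(j + 0.5) * dt for j in range(m)]
phi_true = [math.sqrt(2) * math.sin(math.pi * t) for t in grid]  # unit L2 norm

# Simulated curves: mean 5 + score * eigenfunction + measurement noise.
true_scores = [random.gauss(0, 2) for _ in range(n)]
curves = [[5 + s * p + random.gauss(0, 0.3) for p in phi_true] for s in true_scores]

# Simulated binary outcome driven by the latent score.
outcome = [1 if random.random() < 1 / (1 + math.exp(-(0.5 + s))) else 0
           for s in true_scores]

# Step 1: estimate the mean curve and center each trajectory.
mean_curve = [sum(c[j] for c in curves) / n for j in range(m)]
centered = [[c[j] - mean_curve[j] for j in range(m)] for c in curves]

# Step 2: sample covariance of the centered curves on the grid.
cov = [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(m)]
       for j in range(m)]

# Step 3: leading eigenvector of the covariance by power iteration.
v = [1.0] * m
for _ in range(100):
    w = [sum(cov[j][k] * v[k] for k in range(m)) for j in range(m)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
if sum(v) < 0:  # fix the arbitrary sign of the eigenvector
    v = [-x for x in v]
phi_hat = [x / math.sqrt(dt) for x in v]  # rescale so that sum(phi^2) * dt = 1

# Step 4: functional principal component scores by numerical integration.
scores = [sum(c[j] * phi_hat[j] for j in range(m)) * dt for c in centered]


def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of logit P(y = 1) = b0 + b1 * x."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            wt = p * (1 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += wt
            h01 += wt * xi
            h11 += wt * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Step 5: link the scores to the binary outcome.
intercept, slope = fit_logistic(scores, outcome)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

With sparse, irregularly timed measurements, the covariance in Step 2 cannot be computed pointwise like this, which is where a smoothing-based estimator such as the penalized spline approach named above comes in.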

In the second project, we aim to use the longitudinal covariates to predict the mean residual life of PLWH at different ages. Mean residual life is defined as the average remaining lifetime after a specific time point, given survival to that point. As in the first project, the longitudinal predictors, here covering 5 years, are summarized by functional principal component scores, and a varying-coefficient model is built to predict the mean residual life, with coefficients that depend on the current age of PLWH. The coefficients are estimated by an inverse probability of censoring weighting approach, and a local polynomial smoother is used to extend the coefficient estimates to any time point of interest.
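To make the mean residual life concrete: for a survival function S, it equals the integral of S(u) from t onward, divided by S(t). The sketch below computes a covariate-free version of that quantity from a Kaplan-Meier curve, restricted to a user-chosen upper limit `t_max`. This is a deliberately simpler estimator than the varying-coefficient model with inverse probability of censoring weighting described above. The function names and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(event_time, S(event_time))] for the Kaplan-Meier estimator.

    times:  observed follow-up times
    events: 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1 - deaths / n_at_risk
            steps.append((t, surv))
        n_at_risk -= removed
    return steps


def survival_at(steps, t):
    """Evaluate the right-continuous step function S at time t."""
    s = 1.0
    for time, value in steps:
        if time > t:
            break
        s = value
    return s


def mean_residual_life(steps, t, t_max):
    """Restricted mean residual life: integral of S(u) over [t, t_max], divided by S(t)."""
    s_t = survival_at(steps, t)
    if s_t == 0:
        return 0.0
    area, left, current = 0.0, t, s_t
    for time, value in steps:
        if time <= t:
            continue
        if time >= t_max:
            break
        area += current * (time - left)
        left, current = time, value
    area += current * (t_max - left)
    return area / s_t


# Toy data without censoring, so the result can be checked by hand.
toy_steps = kaplan_meier([2, 4, 6, 8], [1, 1, 1, 1])
print(mean_residual_life(toy_steps, 0, 8))
print(mean_residual_life(toy_steps, 4, 8))
```

In the toy data, the subjects still alive just after time 4 have event times 6 and 8, so their remaining lifetimes are 2 and 4, which gives a quick hand check on the second call.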

In the third project, we study the time from HIV diagnosis to first cancer diagnosis, where only a proportion of PLWH are assumed to eventually develop cancer. PLWH in the insusceptible group are all censored. A joint model of the longitudinal predictors and time-to-cancer data is proposed, using FPCA for the longitudinal variables and a mixture cure model for the survival outcome. A proportional hazards (PH) model is used for the time to cancer of PLWH in the susceptible group, which is associated with both longitudinal predictors and baseline variables. Note that only baseline variables are associated with the incidence rate. An expectation-maximization (EM) algorithm is proposed to estimate the parameters of both components.
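The structure of a mixture cure model can be sketched in a few lines. The population survival function mixes an insusceptible fraction, who never experience the event, with a susceptible fraction whose event times follow a proportional hazards model. The sketch below assumes a logistic model for the probability of being susceptible and a constant baseline hazard for the susceptible group. Both are illustrative assumptions, as are all function and parameter names; the dissertation's actual specification may differ. The last function shows the standard E-step quantity for such models: the posterior probability that a subject is susceptible given the observed data.

```python
import math


def incidence(z, gamma):
    """P(susceptible | baseline covariates z), assuming a logistic model."""
    eta = sum(g * v for g, v in zip(gamma, z))
    return 1 / (1 + math.exp(-eta))


def latency_survival(t, x, beta, baseline_rate):
    """S_u(t | x) for susceptible subjects under a PH model.

    Assumes a constant baseline hazard `baseline_rate` for simplicity;
    x would hold FPC scores and baseline variables.
    """
    linear = sum(b * v for b, v in zip(beta, x))
    return math.exp(-baseline_rate * t * math.exp(linear))


def population_survival(t, x, z, beta, gamma, baseline_rate):
    """S_pop(t) = (1 - pi) + pi * S_u(t): cured fraction plus susceptible survivors."""
    pi = incidence(z, gamma)
    return (1 - pi) + pi * latency_survival(t, x, beta, baseline_rate)


def e_step_weight(t, delta, x, z, beta, gamma, baseline_rate):
    """Posterior probability of being susceptible given the observed data.

    delta = 1: the event was observed, so the subject is susceptible.
    delta = 0: censored, so the subject is either cured or a susceptible
               subject who has not yet had the event.
    """
    if delta == 1:
        return 1.0
    pi = incidence(z, gamma)
    s_u = latency_survival(t, x, beta, baseline_rate)
    return pi * s_u / ((1 - pi) + pi * s_u)


# Hypothetical parameter values: intercept-only incidence with pi = 0.5.
gamma, z = [0.0], [1.0]
beta, x = [0.4], [1.2]
for t in (0.0, 5.0, 50.0):
    print(t, population_survival(t, x, z, beta, gamma, 0.1),
          e_step_weight(t, 0, x, z, beta, gamma, 0.1))
```

As follow-up time grows, the population survival levels off at the cured fraction rather than dropping to zero, and the weight of a censored subject shrinks toward zero, reflecting growing evidence that the subject is insusceptible.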

Available for download on Saturday, October 05, 2024

Included in

Biostatistics Commons