Author

Date of Award

Fall 2025

Document Type

Open Access Dissertation

Department

Statistics

First Advisor

Jiajia Zhang

Abstract

Electronic health records (EHRs) provide valuable information for monitoring disease progression, which can significantly improve disease management and reduce healthcare burdens. Each encounter records a patient’s demographics, treatment progress, medications, vital signs, past medical history, immunizations, and laboratory data. The resulting longitudinal EHR data are complex: they are irregularly sampled, sparse, and non-linear. Additionally, clinical notes recorded by clinicians are unstructured. Consequently, using EHRs in public health research presents numerous challenges, particularly in data management and analysis. Dynamic prediction updates predictions as new data become available, which is particularly advantageous in healthcare settings where patient data are continuously collected. The objective of this dissertation is to develop methodology for constructing dynamic prediction algorithms across three projects. These projects model longitudinal numeric and unstructured predictors with scalar or time-to-event outcomes.

The first project is motivated by COVID-19 EHRs from India. We propose a two-step landmark competing risk model that first summarizes historical laboratory measurements using functional principal component analysis (FPCA) and then fits a landmark competing risk model for prediction. Alternative approaches for handling longitudinal observations in the first step, including the baseline measurement, the mean, last value carried forward, and linear regression, are compared with the proposed method via the weighted Harrell’s C-index, multi-class area under the curve, and Brier score.

The second project is motivated by the EHRs of Prisma Hospital in South Carolina. Our objective is to dynamically predict the risk of all-cause mortality among patients using a landmark large language model that deciphers the chronological comorbidity history. Longitudinal features are first extracted from each patient’s concatenated history of comorbidity descriptions via Bidirectional Encoder Representations from Transformers (BERT) and its variants. A binary classification model is then used to predict all-cause mortality.

The third project is motivated by the Medical Information Mart for Intensive Care III (MIMIC-III) dataset. Our focus is on patients who have been discharged from the intensive care unit (ICU). Dynamic predictions are made regarding the time of discharge, and discharge clinical notes are used as predictors of 360-day mortality for these patients. BERT and its variants are employed to decipher the clinical notes, and a Cox proportional hazards model is integrated to model the time to death within 360 days.

All three proposed methods are applied to their motivating datasets to demonstrate their practical use in real-world scenarios.

Rights

© 2025 Cai

Available for download on Thursday, December 31, 2026
