Date of Award
Summer 2022
Document Type
Open Access Dissertation
Department
Statistics
First Advisor
Yen-Yi Ho
Abstract
Advances in high-throughput technologies have made it possible to generate vast amounts of "omics" data, including genomics, proteomics, transcriptomics, epigenomics, metabolomics, and microbiomics. Combining multiple data sources and jointly analyzing all available information together with the phenotypic outcome can illuminate various aspects of complex biological systems, such as revealing regulatory processes, discovering novel associations between biological entities, and identifying relevant biomarkers for particular diseases or phenotypic outcomes. This dissertation focuses on developing statistical models for analyzing multi-omics data. It comprises three topics: (1) integrative analysis of multi-omics data with missing observations in intermediate variables; (2) modeling dynamic gene co-expression (DC) in a genome-wide search space using variable selection techniques in a Bayesian framework; and (3) a mixed-effect variable selection model for identifying DC from scRNA-seq count data, together with a DC-based strategy for subject subgroup classification.
In Chapter 2, we propose a novel integrative multi-omics analytical framework based on $p$-value weight adjustment in order to incorporate observations with a large proportion of missing values in the analysis. The occurrence of missing values is an inevitable issue in multi-omics data because some measurements, such as mRNA gene expression levels, often require invasive tissue sampling from patients. To incorporate the incomplete information from measurements with missing records, we split the data into a complete set with full information and an incomplete set with missing measurements, and introduce mechanisms to derive weights and weight-adjusted $p$-values from the two sets. Through simulation analyses, we demonstrate that the proposed framework achieves considerable statistical power gains compared to a complete case analysis or multiple imputation approaches. In experimental data analysis, the implementation of our proposed framework is illustrated by a joint analysis of DNA methylation, mRNA, and the phenotypic outcome in a study of preterm infant birth weight.
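As a rough illustration of the general idea of $p$-value weighting, the sketch below applies a weighted Benjamini-Hochberg procedure: each $p$-value is divided by a nonnegative weight (normalized to average one), and the usual step-up rule is applied to the adjusted values. This is a generic technique, not the dissertation's specific weighting mechanism, which the abstract does not detail. The function name `weighted_bh` and the example weights are hypothetical; in this setting the weights would stand in for information derived from the incomplete set.

```python
import numpy as np

def weighted_bh(pvals, weights, alpha=0.05):
    """Weighted Benjamini-Hochberg step-up procedure (illustrative sketch).

    pvals   : p-values, e.g. computed on the complete set (assumption)
    weights : nonnegative prior weights, e.g. derived from the incomplete
              set (assumption); rescaled here so that they average to one
    Returns a boolean array marking rejected hypotheses.
    """
    pvals = np.asarray(pvals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.mean()

    # Weight-adjusted p-values; a zero weight means "never reject".
    adjusted = np.ones_like(pvals)
    positive = weights > 0
    adjusted[positive] = np.minimum(pvals[positive] / weights[positive], 1.0)

    # Standard BH step-up rule applied to the adjusted p-values.
    m = len(adjusted)
    order = np.argsort(adjusted)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = adjusted[order] <= thresholds

    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[: k + 1]] = True
    return reject

if __name__ == "__main__":
    p = [0.001, 0.02, 0.04, 0.5]
    print(weighted_bh(p, [1, 1, 1, 1]))          # equal weights: plain BH
    print(weighted_bh(p, [0.5, 0.5, 2.5, 0.5]))  # third test is up-weighted
```

With equal weights the procedure reduces to ordinary Benjamini-Hochberg; up-weighting a hypothesis makes it easier to reject at the cost of the down-weighted ones.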
Chapter 3 proposes two models for the genome-wide identification of dynamic gene co-expression in genomic datasets. In a biological system, genetic interactions are tightly regulated and often highly dynamic: they can change in response to internal cellular signals or external stimuli. Previous studies have developed statistical methods to examine these dynamic changes in genetic interactions. However, a common challenge for existing approaches is the computational burden imposed by the massive number of gene combinations that must be considered in a typical genomic dataset, even though only a much smaller proportion of gene interactions usually exhibit dynamic co-expression changes.
To address this problem, we propose variable selection methods in Bayesian frameworks with spike-and-slab priors. The proposed algorithms focus on subsets of promising gene combinations in the search space. We also adopt a Bayesian false discovery control procedure for testing the significance of dynamic gene co-expression changes. A series of simulation studies compares our proposed approaches with the existing exhaustive search heuristics. We also demonstrate the implementation of our proposed approaches in an experimental data analysis of gene co-expression changes associated with colorectal cancer recurrence-free survival.
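One common form of Bayesian false discovery control can be sketched as follows, assuming each candidate gene pair has a posterior inclusion probability from the spike-and-slab model. Candidates are ranked by their posterior probability of being null, and the largest set whose average posterior null probability stays at or below the target level is selected. Whether the dissertation uses exactly this rule is not stated in the abstract; the function name `bayesian_fdr_select` and the example probabilities are hypothetical.

```python
import numpy as np

def bayesian_fdr_select(post_incl_prob, alpha=0.05):
    """Select candidates while controlling the Bayesian FDR (illustrative sketch).

    post_incl_prob : posterior inclusion probabilities, one per candidate
                     gene pair (assumed to come from a spike-and-slab fit)
    alpha          : target Bayesian false discovery rate
    Returns a boolean array marking the selected candidates.
    """
    pip = np.asarray(post_incl_prob, dtype=float)
    local_fdr = 1.0 - pip  # posterior probability that each candidate is null

    # Rank from most to least promising and track the running average
    # of the null probabilities, i.e. the estimated FDR of each top-k set.
    order = np.argsort(local_fdr)
    running_fdr = np.cumsum(local_fdr[order]) / np.arange(1, len(pip) + 1)

    selected = np.zeros(len(pip), dtype=bool)
    ok = running_fdr <= alpha
    if ok.any():
        k = np.nonzero(ok)[0].max()
        selected[order[: k + 1]] = True
    return selected

if __name__ == "__main__":
    print(bayesian_fdr_select([0.99, 0.97, 0.90, 0.40], alpha=0.05))
```

Because the candidates are sorted by null probability, the running average is nondecreasing, so the selected set is always a top-ranked prefix.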
In Chapter 4, we develop subject-specific methods that combine a mixed-effect model with spike-and-slab variable selection to identify DC in scRNA-seq count data. In a typical scRNA-seq dataset, gene expression profiles are usually collected from multiple subjects, such as different patients or tissues. Because accounting for subject-specific characteristics can increase the accuracy of identifying DC, and because DC signals are sparse in scRNA-seq datasets, we propose a mixed-effect variable selection model for identifying subject-specific DC gene pairs that incorporates subject-specific random effects, across-subject fixed effects, zero-inflation, and over-dispersion in scRNA-seq datasets. We also propose a DC-based strategy to classify subjects into subgroups using subject-specific DC as inputs. Through simulation studies, we show that our proposed ME-SPSL model outperforms both the mixed-effect model without variable selection and the existing method that does not consider subject-specific random effects. We also demonstrate the implementation of our proposed method on a melanoma scRNA-seq dataset, estimating subject-specific DC and using the DC gene pairs as biomarkers to classify the melanoma samples into immunotherapy-resistant and non-resistant groups.
Rights
© 2022, Wenda Zhang
Recommended Citation
Zhang, W. (2022). Statistical Methods for Analyzing Multi-omics Data: Dependence Structure and Missing Values. (Doctoral dissertation). Retrieved from https://scholarcommons.sc.edu/etd/6970