Date of Award

Spring 2025

Document Type

Open Access Dissertation

Department

Epidemiology and Biostatistics

First Advisor

Feifei Xiao

Abstract

Copy number variants (CNVs) and single-nucleotide polymorphisms (SNPs) are major sources of genetic variation in the genomes of both haploid and diploid organisms. CNVs play important roles in human diseases and in the dynamic evolution of species. Accurate methods for CNV detection and association testing, including causal effect estimation, are therefore warranted. Modern genotyping technologies have developed rapidly, enabling the characterization of CNVs at various resolutions, such as SNP arrays at the sample level and whole-genome DNA sequencing at single- or low-cell resolution. In addition, CNV detection methods need careful calibration to account for the differences between haploid and diploid organisms and between technologies; for example, the low breadth and depth of coverage in single-cell sequencing may induce a high degree of DNA amplification bias. However, current statistical methodologies for CNV detection, association testing, and causal effect estimation still face challenges from genetic and technological factors that may lead to spurious results. First, most methods are developed for diploid genomes and do not account for how haploid genomes differ, making them unsuitable for haploid data. Second, the traditional two-stage strategy, in which CNVs are first detected and their association with disease is then tested, may generate false CNV calls that bias the association testing in the second stage. Third, in Mendelian randomization (MR), correlated horizontal pleiotropy (CHP) and uncorrelated horizontal pleiotropy (UHP) exist when SNPs also affect the outcome through unknown confounders. Conventional methods either do not consider CHP or UHP effects or assume only homogeneous CHP effects, ignoring the fact that complex diseases are usually caused by multiple exposures, which may lead to heterogeneous CHP effects if those exposures are not accounted for.
In this study, we developed three methods. (1) HapCNV is a comprehensive statistical framework for data normalization and CNV detection in haploid single- or low-cell DNA sequencing data. HapCNV constructs a novel genomic-location-specific pseudo-reference that selects unbiased references and effectively preserves both rare and common CNV signals after normalization. (2) OSCAA is a one-stage CNV-disease association analysis method that uses a two-dimensional Gaussian mixture model (GMM) to assess the association between CNV and disease risk in a known CNV region (CNVR). OSCAA incorporates the probabilities of false CNV calls into the association testing and thus improves statistical power and controls type I error. (3) MR-GFL is a Mendelian randomization method that adaptively identifies groups of instrumental variables (IVs) with distinct CHP effects by using the generalized fused lasso and estimates the causal effect using this group information. MR-GFL enables the identification of heterogeneous CHP effects, and accounting for them improves the accuracy and precision of the estimated causal effect. The proposed framework provides a broad application platform for subsequent studies on the role of CNVs in human diseases and in the dynamic evolution of both haploid and diploid organisms, which will inform new approaches for precision medicine.
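To illustrate why heterogeneous CHP effects matter for causal effect estimation, the following is a minimal sketch of the conventional inverse-variance weighted (IVW) MR estimator applied to toy summary statistics. It is not MR-GFL and does not reproduce any method from the dissertation; the function names (`wald_ratios`, `ivw_estimate`) and all numbers are hypothetical and chosen only for illustration. The sketch assumes that a subgroup of IVs carries an extra pleiotropic effect proportional to the SNP-exposure effect.

```python
# Toy illustration only: conventional IVW estimator, NOT MR-GFL.
# All data below are made up; names are hypothetical.

def wald_ratios(beta_x, beta_y):
    """Per-IV ratio estimates: SNP-outcome effect / SNP-exposure effect."""
    return [by / bx for bx, by in zip(beta_x, beta_y)]

def ivw_estimate(beta_x, beta_y, se_y):
    """Inverse-variance weighted causal estimate from summary statistics."""
    num = sum(bx * by / s ** 2 for bx, by, s in zip(beta_x, beta_y, se_y))
    den = sum(bx ** 2 / s ** 2 for bx, s in zip(beta_x, se_y))
    return num / den

true_effect = 0.5                    # assumed causal effect of exposure on outcome
beta_x = [0.1, 0.2, 0.3, 0.4]        # SNP-exposure effects (toy values)
se_y = [0.01, 0.01, 0.01, 0.01]      # standard errors of SNP-outcome effects

# Scenario A: all IVs are valid, so beta_y = true_effect * beta_x.
beta_y_valid = [true_effect * bx for bx in beta_x]

# Scenario B: the last two IVs share an extra pleiotropic effect (0.3 * beta_x),
# so their ratio estimates form a second, distinct group.
extra = [0.0, 0.0, 0.3, 0.3]
beta_y_chp = [(true_effect + e) * bx for bx, e in zip(beta_x, extra)]

ivw_valid = ivw_estimate(beta_x, beta_y_valid, se_y)
ivw_chp = ivw_estimate(beta_x, beta_y_chp, se_y)
ratios_chp = wald_ratios(beta_x, beta_y_chp)

print("IVW, valid IVs:", ivw_valid)
print("IVW, with pleiotropic group:", ivw_chp)
print("Per-IV ratios, with pleiotropic group:", ratios_chp)
```

In this toy setting the IVW estimate recovers 0.5 when all IVs are valid but shifts to about 0.75 once the pleiotropic group is included, while the per-IV ratios separate into two clusters. Grouping IVs by such distinct effects is the kind of structure the abstract describes MR-GFL as identifying with the generalized fused lasso.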

Rights

© 2025, Xuanxuan Yu

Available for download on Monday, May 31, 2027

Included in

Biostatistics Commons
