Author

Xizhi Luo

Date of Award

Summer 2021

Document Type

Open Access Dissertation

Department

Statistics

First Advisor

Feifei Xiao

Abstract

Copy number variations (CNVs), which are gains or losses of DNA segments, are a major source of genetic variation in the human genome. CNVs have gained considerable interest because they play important roles in complex human diseases. Therefore, accurate detection of CNVs from data generated by modern genotyping technologies, such as SNP arrays and whole-exome sequencing (WES), is a critical step toward a better understanding of disease etiology. However, current statistical methodologies for CNV detection still face analytical challenges due to numerous genetic and technological factors that may lead to spurious findings. First, existing methods assume that genetic intensities are independent observations along the whole genome, an assumption that is often violated in genetic data. Second, neither SNP arrays nor WES offers full coverage of the genome at its genotyping resolution, so analyzing each data type separately leads to a significant number of missed variant calls. Third, conventional methods adopt a single-sample strategy that suffers from high false discovery rates due to prominent data noise.

In this study, we developed (1) a SNP array CNV calling algorithm, LDcnv, that integrates the genomic correlation structure with a local search strategy into the statistical modelling of genetic intensities, which improves both detection accuracy and robustness; (2) a WES CNV detection method, CORRseq, that extends the methodological work of LDcnv and couples it with a median normalization procedure, which gives a significant power gain for CNV identification; and (3) a Bayesian Multi-sample and Integrative CNV (BMI-CNV) profiling method for matched samples assayed by both WES and microarray, which uses a Bayesian probit stick-breaking process model coupled with Gaussian mixture model estimation for multiple-sample integration. BMI-CNV enables accurate CNV identification by integrating multiple genotyping platforms at a genome-wide scale. The performance of the proposed methods was evaluated through extensive simulation studies and real data analyses. The methods were further applied to the 1000 Genomes Project and an international lung cancer study to identify lung cancer susceptibility genes.

The proposed framework applies broadly across multiple study designs for investigating the role of CNVs in various complex diseases, which can help reveal the vital roles of CNVs in disease development and inspire new approaches for precision medicine.

Rights

© 2021, Xizhi Luo

Included in

Biostatistics Commons
