Date of Award


Document Type

Open Access Dissertation


Epidemiology and Biostatistics



First Advisor

Hongmei Zhang


Epigenetics is the study of chemical reactions that are orchestrated for the development and maintenance of an organism. Genetic or epigenetic variants (GEVs) encompass different types of genetic measures, such as deoxyribonucleic acid (DNA) methylation at different CpG sites, expression levels of genes, or single nucleotide polymorphisms (SNPs). With the development of technology, a huge amount of genetic and epigenetic information is being produced. However, this rich information also brings challenges in data analysis, so it is necessary to reduce the dimension of the data to improve efficiency. In my dissertation, I focus on two directions of dimension reduction: variable selection and clustering.

The first project on dimension reduction was motivated by an epigenetic project aiming to identify GEVs that are associated with a health outcome. Because GEVs may interact with each other in an unknown and complex, possibly non-linear, way, we designed a backward variable selection procedure to select informative GEVs. The procedure is built upon a reproducing kernel-based method for evaluating the joint effect of a set of GEVs, e.g., a set of CpG sites. Simulation studies indicate that the selection method is robust to different types of interaction effects, linear or non-linear. We demonstrate the method using two data sets: in the first, we select important SNPs that are associated with lung function; in the second, we identify important CpG sites whose methylation is jointly associated with active smoking, measured by cotinine levels.
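As a rough illustration of kernel-based backward elimination, the sketch below scores a set of variables by the centered alignment between a Gaussian kernel on those variables and the outcome, then repeatedly drops the variable whose removal costs the least. This is not the dissertation's procedure: the alignment score, the Gaussian kernel, and the names `gaussian_kernel`, `alignment`, `backward_select`, `tol`, and `bandwidth` are all assumptions chosen to keep the example short and dependency-free.

```python
import math


def gaussian_kernel(X, cols, bandwidth=1.0):
    """Gaussian kernel matrix computed from the columns listed in `cols` only."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((X[i][c] - X[j][c]) ** 2 for c in cols)
            K[i][j] = math.exp(-d2 / (2.0 * bandwidth ** 2))
    return K


def center(K):
    """Double-center a symmetric kernel matrix (subtract row/column means)."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
            for i in range(n)]


def alignment(X, y, cols, bandwidth=1.0):
    """Centered kernel-target alignment: an assumed stand-in for a joint-effect score."""
    if not cols:
        return 0.0
    n = len(y)
    Kc = center(gaussian_kernel(X, cols, bandwidth))
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    num = sum(Kc[i][j] * yc[i] * yc[j] for i in range(n) for j in range(n))
    k_norm = math.sqrt(sum(v * v for row in Kc for v in row))
    y_norm = sum(v * v for v in yc)  # Frobenius norm of the outer product yc yc^T
    if k_norm == 0.0 or y_norm == 0.0:
        return 0.0
    return num / (k_norm * y_norm)


def backward_select(X, y, tol=0.01, bandwidth=1.0, min_keep=1):
    """Drop variables one at a time while the score falls by at most `tol`."""
    active = list(range(len(X[0])))
    current = alignment(X, y, active, bandwidth)
    while len(active) > min_keep:
        # Score every candidate set obtained by removing a single variable.
        candidates = [
            (alignment(X, y, [c for c in active if c != v], bandwidth), v)
            for v in active
        ]
        best_score, drop_var = max(candidates)
        if current - best_score > tol:
            break  # every removal hurts the joint score too much; stop
        active.remove(drop_var)
        current = best_score
    return active
```

Because the kernel is evaluated on the whole set of active variables at once, the score reflects their joint contribution rather than one marginal effect at a time, which is the property the paragraph above relies on. The stopping threshold `tol` is arbitrary here and would need tuning on real data.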

The second project was motivated by the potential heterogeneity in clusters identified by existing methods. Traditional approaches focus on clustering either subjects or (response) variables, but clusters formed in this way may lack homogeneity. To improve the quality of clusters, we propose a joint clustering method. Specifically, the variables are first clustered based on the agreement of the (unknown) relationships between variable measures and covariates of interest, and then subjects within each variable cluster are further clustered to form refined joint clusters. A Bayesian method is proposed for this purpose, in which a semi-parametric model is used to evaluate any unknown relationship between variables and covariates of interest, and a Dirichlet process is utilized in the second-step clustering of subjects. The proposed method is able to produce homogeneous clusters, each composed of a number of subjects sharing common features in the relationship between some (response) variables and covariates. Simulation studies are used to examine the performance and efficiency of the proposed method. The method is then applied to DNA methylation measures of multiple CpG sites.
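To give a feel for why a Dirichlet process suits the subject-clustering step, the sketch below draws a random partition of subjects from the Chinese restaurant process, a standard representation of the partition a Dirichlet process induces. It shows only a draw from the prior, with no data and no posterior inference, and it is not the sampler used in the dissertation; the function name `crp_partition` and the concentration parameter `alpha` are illustrative assumptions.

```python
import random


def crp_partition(n, alpha, rng):
    """Draw cluster labels for n subjects from a Chinese restaurant process.

    Subject i (0-based) joins existing cluster k with probability
    counts[k] / (i + alpha) and opens a new cluster with probability
    alpha / (i + alpha). Requires alpha > 0.
    """
    labels = []
    counts = []
    for i in range(n):
        u = rng.random() * (i + alpha)
        acc = 0.0
        chosen = len(counts)  # default: open a new cluster
        for k, c in enumerate(counts):
            acc += c
            if u < acc:
                chosen = k
                break
        if chosen == len(counts):
            counts.append(0)
        counts[chosen] += 1
        labels.append(chosen)
    return labels, counts


# Example: partition 50 subjects; the number of clusters is not fixed in advance.
labels, counts = crp_partition(50, alpha=1.0, rng=random.Random(0))
```

The number of clusters is determined by the draw rather than specified beforehand, and larger values of `alpha` tend to produce more clusters. In a full Bayesian analysis these prior probabilities would be combined with a likelihood for each subject's data.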