Document Type

Open Access Thesis


Epidemiology and Biostatistics



First Advisor

Alexander C. McLain


Clustering is a fundamental technique in data science and is frequently used in preliminary data analysis. Batch effects are a substantial source of variation that can arise at many points in data collection and can distort the data. We propose a method that simultaneously removes batch effects and performs cluster analysis. We treat a batch effect as a fixed value added to every observation in a batch, and we make no assumptions about the distribution of the batch effects. We model the data with a Gaussian mixture model and use the EM algorithm to estimate the cluster means, the cluster covariance matrices, and the batch effects; each observation is then assigned to a cluster according to its posterior probabilities. We also give two tests for detecting the presence of batch effects in the data. The gap statistic is used to determine the number of clusters.
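
As an illustration of the model described above, the following is a minimal sketch in Python (NumPy only); it is not the thesis implementation. It assumes that an observation in cluster k and batch b is distributed as N(mu_k + beta_b, Sigma_k), fixes the effect of the first batch at zero so that the cluster means and batch effects are identifiable, and updates the batch effects in a conditional maximization step (an ECM-style variant of EM). The names `fit_gmm_batch`, `responsibilities`, and `log_gauss`, the farthest-point initialization, and the small ridge term are all assumptions made for this example.

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Log density of N(mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def responsibilities(X, batch, pi, mu, Sigma, beta):
    """E-step: posterior cluster probabilities after subtracting batch effects."""
    Y = X - beta[batch]
    logp = np.stack([np.log(pi[k]) + log_gauss(Y, mu[k], Sigma[k])
                     for k in range(len(pi))], axis=1)
    logp -= logp.max(axis=1, keepdims=True)  # stabilize before exponentiating
    R = np.exp(logp)
    return R / R.sum(axis=1, keepdims=True)

def fit_gmm_batch(X, batch, K, n_iter=100, seed=0, ridge=1e-6):
    """Gaussian mixture with additive batch effects (hypothetical sketch).

    X: (n, d) float array; batch: (n,) integer labels 0..B-1.
    Assumption: beta[0] is fixed at zero for identifiability.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    B = int(batch.max()) + 1
    # Farthest-point initialization of the cluster means (an assumption).
    centers = [X[rng.integers(n)]]
    for _ in range(1, K):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    mu = np.array(centers)
    Sigma = np.stack([np.atleast_2d(np.cov(X.T)) + ridge * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    beta = np.zeros((B, d))
    for _ in range(n_iter):
        R = responsibilities(X, batch, pi, mu, Sigma, beta)
        # M-step for mixing weights, means, and covariances given beta.
        Y = X - beta[batch]
        Nk = R.sum(axis=0) + 1e-12
        pi = Nk / Nk.sum()
        for k in range(K):
            mu[k] = R[:, k] @ Y / Nk[k]
            diff = Y - mu[k]
            Sigma[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + ridge * np.eye(d)
        # Conditional M-step for each batch effect given means and covariances:
        # beta_b solves (sum_ik r_ik P_k) beta_b = sum_ik r_ik P_k (x_i - mu_k).
        P = np.linalg.inv(Sigma)  # (K, d, d) precision matrices
        for b in range(1, B):
            idx = batch == b
            A = np.einsum("k,kde->de", R[idx].sum(axis=0), P)
            resid = X[idx][:, None, :] - mu[None, :, :]  # (n_b, K, d)
            rhs = np.einsum("ik,kde,ike->d", R[idx], P, resid)
            beta[b] = np.linalg.solve(A, rhs)
    R = responsibilities(X, batch, pi, mu, Sigma, beta)
    return pi, mu, Sigma, beta, R
```

Calling `fit_gmm_batch(X, batch, K)` returns the mixing weights, cluster means, cluster covariance matrices, batch effects, and the matrix of posterior probabilities; `R.argmax(axis=1)` then gives a predicted cluster for each observation. Choosing K (for example with the gap statistic) and testing for the presence of batch effects are outside the scope of this sketch.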

We compare our method in simulation studies with standard K-means and with K-means applied after the batch effects have been removed. In our simulation studies, our method gives better prediction results than both of these approaches. Our method does not assume that the batch effects follow any particular distribution, and it works on data with larger batch effects as well as on data with an interaction between clusters and batches.


© 2015, Yifan Tang

Included in

Epidemiology Commons