Date of Award


Document Type

Open Access Dissertation


Computer Science and Engineering

First Advisor

Jijun Tang


Bioinformatics studies develop methods and software tools to analyze biological data and provide insight into the mechanisms of biological processes. Machine learning techniques have been widely used by researchers for disease prediction, disease diagnosis, and biomarker identification. Using machine learning algorithms to diagnose diseases offers clear advantages. Rather than relying solely on doctors' experience and stereotyped formulas, researchers can use learning algorithms to analyze sophisticated, high-dimensional, and multimodal biomedical data, and construct prediction/classification models that make decisions even when some information is incomplete, unknown, or contradictory. In this study, we first built an automated computational pipeline to reconstruct phylogenies and ancestral genomes for two real, high-resolution yeast whole-genome datasets. We compared the results with recent studies and publications to show that the reconstructed phylogenies and ancestors are accurate and robust. We also identified and analyzed conserved syntenic blocks among the reconstructed ancestral genomes and present-day yeast species. Next, we analyzed a metabolic-level dataset obtained from positive mass spectrometry of human blood samples. We applied machine learning and feature selection algorithms to construct diagnosis models for chronic kidney disease (CKD). We also identified the most critical metabolite features and studied the correlations between these features and the progression of CKD stages. The selected metabolite features provided insights into early-stage CKD diagnosis, pathophysiological mechanisms, CKD treatments, and medicine development. Finally, we used deep learning techniques to build accurate Down syndrome (DS) prediction/screening models based on the analysis of a newly introduced Illumina human genome genotyping array.
We proposed a bi-stream convolutional neural network (CNN) architecture with ten layers and two merged CNN models, which took two input chromosome SNP maps in combination. We evaluated the performance of our CNN DS prediction models and compared it with that of conventional machine learning algorithms. We visualized the feature maps and trained filter weights from intermediate layers of our trained CNN model. We further discussed the advantages of our method and the underlying reasons for the differences in performance.