Efficient Machine Learning on Scientific Data Using Bayesian Optimization

Rui Xin, University of South Carolina

Abstract

Deep learning is pivotal in advancing data analysis across scientific fields, from genomics to materials discovery. Despite its widespread use, learning efficiently from limited data under resource constraints remains a significant challenge, and it often limits the potential of deep learning where data is scarce or resources are restricted. This dissertation explores active learning and automated machine learning (AutoML) powered by Bayesian optimization to improve the efficiency of machine learning across multiple disciplines. It focuses on algorithm optimization and data management through three interconnected studies.

In the first study, we investigate how active learning, a data management technique, helps discover new materials with target properties from a limited dataset, given the vast chemical design space. We propose an active generative inverse design method that combines active learning with a deep autoencoder neural network and a generative adversarial deep neural network to discover new materials with a target property across the whole chemical design space. Our experiments demonstrate that although active learning may select chemically infeasible candidates, these samples are beneficial for training robust screening models. These models effectively filter the hypothetical materials produced by the generative model and identify those with the desired properties. The results confirm the success of our active generative inverse design approach.

In the second study, we explore cancer heterogeneity and specificity through the analysis of mutational signatures, using collinearity analysis and machine learning techniques. The machine learning techniques are a decision tree-based ensemble model and a flexible neural network-based method with automated hyperparameter optimization, the latter customizing a neural network for each subtask. Through thorough training and independent validation, our results reveal that although most mutational signatures are distinct, certain pairs of signatures are similar in both mutation patterns and signature abundance. These observations could help determine the etiology of mutational signatures whose origins remain elusive. Further machine learning analysis indicates that specific mutational signatures are relevant to particular cancer types, with skin cancer showing the strongest specificity among all cancer types.

In the third study, we analyze cancer heterogeneity by examining immune cell compositions in tumor microenvironments, using AutoML to develop tailored models for classification subtasks. We analyze transcriptome profiles from 11,274 patients across 33 cancer types to identify 22 immune cell types. We employ neural architecture search to build models that classify cancer type and distinguish tumor from normal tissue, and we use the Shannon index to measure immune cell diversity and Cox regression for prognostic evaluation. Our findings reveal significant immune cell differences between tumors and normal tissues, although the direction of these differences is not consistent across all cancers. Immune cell composition patterns modestly differentiate cancer types, and sixteen significant prognostic associations are identified, for example in kidney renal clear cell carcinoma. Additionally, immune cell diversity shows marked differences in seven cancer types and correlates positively with survival in some cases, underscoring the lack of a universal standard across all cancers.
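For readers unfamiliar with the Shannon index mentioned in the third study, the following is a minimal sketch of the standard definition, H = -sum(p_i * ln p_i), applied to a vector of immune cell fractions. The function name `shannon_index` and the example inputs are illustrative assumptions and are not taken from the dissertation; the natural logarithm is also an assumption, since the abstract does not state the base.

```python
import math


def shannon_index(fractions):
    """Return the Shannon diversity H = -sum(p_i * ln p_i).

    `fractions` holds non-negative abundances (hypothetical immune cell
    fractions for one sample). They are renormalized to sum to 1, and
    zero entries contribute nothing to the sum.
    """
    if any(f < 0 for f in fractions):
        raise ValueError("fractions must be non-negative")
    total = sum(fractions)
    if total <= 0:
        raise ValueError("fractions must sum to a positive value")

    h = 0.0
    for f in fractions:
        p = f / total
        if p > 0:
            h -= p * math.log(p)
    return h


# Illustrative inputs only: 22 equally abundant cell types give the
# maximum possible value, ln(22); one dominant type gives 0.
uniform = [1.0] * 22
single = [1.0] + [0.0] * 21
print(shannon_index(uniform), shannon_index(single))
```

Under this definition, higher values indicate a more even mix of immune cell types in a sample, and a value of zero indicates that a single cell type accounts for the entire composition.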