Theses and Dissertations

Machine Learning Based Disease Gene Identification and MHC Immune Protein-peptide Binding Prediction

Zhonghao Liu, University of South Carolina - Columbia

Date of Award

2018

Document Type

Open Access Dissertation

Department

Computer Science and Engineering

First Advisor

Jianjun Hu

Abstract

Machine learning and deep learning methods have been increasingly applied to challenging and important bioinformatics problems such as protein structure prediction, disease gene identification, and drug discovery. However, the performance of existing machine learning based predictive models is still not satisfactory. The question of how to exploit the specific properties of bioinformatics data and couple them with the unique capabilities of the learning algorithms remains open. In this dissertation, we propose advanced machine learning and deep learning algorithms to address two important problems: mislocation-related cancer gene identification and major histocompatibility complex (MHC)-peptide binding affinity prediction. Our first contribution is a kernel-based logistic regression algorithm for identifying potential mislocation-related genes among known cancer genes. The algorithm takes protein-protein interaction networks, gene expression data, and subcellular location gene ontology data as input, which makes it particularly lightweight compared with existing methods. The experimental results demonstrate that the proposed pipeline is capable of identifying mislocation-related cancer genes. Our second contribution addresses the modeling and prediction of human leukocyte antigen (HLA)-peptide binding in the human immune system. We present an allele-specific convolutional neural network model with one-hot encoding. In an extensive evaluation on the standard IEDB datasets, our model performs better than all existing prediction models. To achieve further improvement, we propose a novel pan-specific model for peptide-HLA class I binding affinity prediction, which allows us to exploit all the training samples of different HLA alleles. Our sequence-based pan model is currently the only algorithm that does not use pseudo-sequence encoding, a dominant structure-based encoding method in this area. The benchmark studies show that our method achieves state-of-the-art performance. The proposed model could be integrated into existing ensemble methods to improve their overall prediction capabilities on highly diverse MHC alleles. Finally, we present an LSTM-CNN deep learning model with an attention mechanism for predicting peptide-HLA class II binding affinities and binding cores. This model achieved very good performance and outperformed existing methods on half of the tested alleles. With the help of the attention mechanism, the model can directly output the peptide binding core based on the attention weights, without any additional pre- or post-processing.
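The one-hot encoding mentioned in the abstract can be illustrated with a minimal sketch. This is not the dissertation's code: the function name `one_hot_encode`, the alphabetical ordering of the 20 standard amino acids, and the zero-padding to a fixed length are all assumptions made for illustration.

```python
# Illustrative sketch only; names and padding scheme are assumptions,
# not taken from the dissertation.

# The 20 standard amino acids in one-letter code (ordering is arbitrary).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide, max_len=None):
    """Return a list of rows, one per position, each a 20-element 0/1 vector.

    If max_len is given, positions beyond the peptide are left as all-zero
    rows so that peptides of different lengths share one input shape.
    """
    length = len(peptide) if max_len is None else max_len
    if len(peptide) > length:
        raise ValueError("peptide is longer than max_len")

    matrix = [[0] * len(AMINO_ACIDS) for _ in range(length)]
    for pos, aa in enumerate(peptide):
        if aa not in AA_INDEX:
            raise ValueError(f"unknown amino acid: {aa!r}")
        matrix[pos][AA_INDEX[aa]] = 1
    return matrix


# Example: encode a 9-residue peptide, padded to 11 positions.
encoded = one_hot_encode("SIINFEKLA", max_len=11)
```

Each residue becomes a row with a single 1, so the peptide is represented as a length-by-20 matrix that a convolutional layer can scan along the sequence axis.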

Recommended Citation

Liu, Z.(2018). Machine Learning Based Disease Gene Identification and MHC Immune Protein-peptide Binding Prediction. (Doctoral dissertation). Retrieved from https://scholarcommons.sc.edu/etd/5084

Included in

Computer Sciences Commons
