Date of Award


Document Type

Open Access Dissertation


Computer Science and Engineering

First Advisor

Jianjun Hu


Computational prediction of protein subcellular localization can greatly help to elucidate protein function. Despite the existence of dozens of protein localization prediction algorithms, prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve prediction performance, and these usually combine 10 or more individual localization algorithms. However, their performance is still limited by computational cost and by redundancy among the individual prediction algorithms. In the first part of the dissertation, we propose a novel method for the rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm combines a feature-selection-based filter with a logistic regression classifier. Using a novel concept of contribution scores, we analyzed the issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets, Yeast and Human, and compared its performance with that of current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only one-third to one-half of the individual predictors used by current ensemble algorithms, which greatly reduces computational complexity and running time. Compared to the best individual predictor, our ensemble algorithm improved the AUC score from 0.558 to 0.707 on the Yeast dataset and from 0.628 to 0.646 on the Human dataset.
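The general idea of a filter followed by a logistic regression ensemble can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the dissertation: the filter simply ranks predictors by their individual AUC, the function names (`auc`, `select_predictors`, `fit_logistic`) are hypothetical, and the scores are toy values for three made-up predictors rather than real localization data. The contribution-score analysis is not reproduced.

```python
import math

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def column(X, j):
    return [row[j] for row in X]

def select_predictors(X, y, k):
    """Assumed filter: keep the k predictors with the highest individual AUC."""
    ranked = sorted(range(len(X[0])), key=lambda j: auc(column(X, j), y), reverse=True)
    return sorted(ranked[:k])

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Logistic regression fitted by batch gradient descent; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for row, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - target
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy data (hypothetical): rows are proteins, columns are the confidence scores
# of three individual predictors for one compartment; y marks true membership.
X = [
    [0.9, 0.5, 0.2], [0.8, 0.4, 0.3], [0.3, 0.6, 0.8], [0.2, 0.5, 0.9],
    [0.1, 0.5, 0.2], [0.4, 0.6, 0.1], [0.1, 0.4, 0.4], [0.2, 0.5, 0.1],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

single_aucs = [auc(column(X, j), y) for j in range(len(X[0]))]
selected = select_predictors(X, y, k=2)
X_small = [[row[j] for j in selected] for row in X]
w, b = fit_logistic(X_small, y)
ensemble_scores = [sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) for row in X_small]
ensemble_auc = auc(ensemble_scores, y)

print("individual AUCs:", single_aucs)
print("selected predictors:", selected)
print("ensemble AUC:", ensemble_auc)
```

On this toy input the filter drops the uninformative middle predictor, and the two remaining predictors, each of which misses a different half of the positives, complement each other once combined. A real design would also need to account for redundancy between highly correlated predictors, which a per-predictor ranking like this one does not detect.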

In the second part of the dissertation, we propose a computational method, SeqNLS, to predict nuclear localization signals (NLSs). The major difficulty of NLS prediction is that NLSs are known to have diverse patterns, but knowledge of these patterns is limited and only a portion of NLSs are covered by the known NLS motifs. In SeqNLS, on the one hand, we propose a sequential-pattern approach to effectively detect potential NLS segments without being constrained by the limited knowledge of NLS patterns. On the other hand, we introduce a model for NLS prediction that exploits the fact that NLSs are a type of linear motif. Our experimental results show that the sequential-pattern approach is effective at searching extensively for potential NLSs. Our method consistently finds over 50% of NLSs with a prediction precision of at least 0.7 on the two independent datasets, and it can outperform state-of-the-art NLS prediction methods in terms of F1-score.
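As a rough guide to how these figures relate to the F1-score, the boundary case can be worked through. This assumes that "over 50% of NLSs found" is read as a recall of R = 0.5 and that the precision is P = 0.7. It is an illustration only, not a value reported in the dissertation:

```latex
\[
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.7 \times 0.5}{0.7 + 0.5} = \frac{0.7}{1.2} \approx 0.58
\]
```

Because F1 increases with both precision and recall, any operating point with P of at least 0.7 and R above 0.5 would score above this value under the same reading.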

The binding affinity between a nuclear localization signal (NLS) and its import receptor is closely related to the corresponding nuclear import activity. Modulation of the binding affinity between an NLS and its import receptor by post-translational modification (PTM) is one of the best-understood mechanisms for regulating the nuclear import of proteins. However, identifying such regulation mechanisms is challenging because it is difficult to assess the impact of a PTM on the corresponding nuclear import activity. In the third part of the dissertation, we propose NIpredict, an effective algorithm to predict the nuclear import activity of a protein given its NLS. In NIpredict, molecular interaction energy components (MIECs) are used to characterize the NLS-import receptor interaction, and support vector regression (SVR) is used to learn the relationship between the characterized interaction and the corresponding nuclear import activity. Our experiments showed that the change in nuclear import activity caused by a change in the NLS could be accurately predicted by the NIpredict algorithm. Based on NIpredict, we developed a systematic framework to identify potential PTM-based nuclear import regulation for human and yeast nuclear proteins. Application of this approach has uncovered potential nuclear import regulation mechanisms involving phosphorylation and/or acetylation for three nuclear proteins: SF1, histone H1, and ORC6.