Date of Award


Document Type

Campus Access Dissertation


Computer Science and Engineering

First Advisor

Jianjun Hu


With the availability of an overwhelming amount of high-throughput biological data, biologists and medical researchers increasingly depend on computational algorithms for hypothesis generation and prediction. One area of bioinformatics research is the development of algorithms for predicting the subcellular localization of both monoplex and multiplex proteins. Most current localization prediction algorithms employ features derived from protein sequence data and external functional annotations such as gene ontology or physicochemical properties. However, no existing method exploits the rich localization information in protein-protein correlation networks, even though correlated proteins tend to be co-localized within the cell. Here we propose NetLoc, a novel algorithm based on diffusion kernels and logistic regression that predicts protein localization by exploiting protein correlation networks. NetLoc is applied to yeast protein localization prediction using four types of protein networks: physical protein-protein interaction (PPI) networks, genetic PPI networks, mixed PPI networks, and co-expressed PPI networks. Experiments showed that protein networks can provide rich information for localization prediction, achieving an AUC score of up to 0.93. We also showed that networks with high connectivity and a high percentage of co-localized PPIs lead to better prediction performance. Compared to a previous network-feature-based prediction algorithm with an AUC score of 0.52 on the yeast PPI network, NetLoc achieved significantly better overall performance, with an AUC of 0.74 on the same dataset. We also investigated how the prediction performance of NetLoc was affected by network characteristics such as the ratio of the number of co-localized PPIs (coPPI) to the number of non-co-localized PPIs (ncPPI) and the density of annotated coPPI in the network. For a given network with a specific number of proteins, NetLoc performance increases with increasing coPPI/ncPPI ratio and increasing density of annotated coPPI.
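
To make the combination of a diffusion kernel with logistic regression concrete, the following is a minimal sketch and not the NetLoc implementation. It assumes the standard graph diffusion kernel K = exp(-beta * L), with L = D - A the graph Laplacian, and a plain kernel logistic regression trained by gradient descent for a single location. The toy network, the value of beta, the learning rate, and all function names are hypothetical choices made for illustration.

```python
import math

def diffusion_kernel(adj, beta=0.5, terms=30):
    """Approximate K = exp(-beta * (D - A)) with a truncated Taylor series."""
    n = len(adj)
    # H = beta * (A - D): off-diagonal entries are edges, diagonal is -degree.
    h = [[beta * (adj[i][j] - (sum(adj[i]) if i == j else 0.0))
          for j in range(n)] for i in range(n)]
    kernel = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]
    for k in range(1, terms):
        # term becomes H^k / k!
        term = [[sum(term[i][m] * h[m][j] for m in range(n)) / k
                 for j in range(n)] for i in range(n)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return kernel

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_kernel_logreg(kernel, train_idx, labels, lr=0.5, epochs=500):
    """Fit f(i) = bias + sum_j alpha_j * K[i][j] over annotated proteins j."""
    alpha = [0.0] * len(train_idx)
    bias = 0.0
    n = len(train_idx)
    for _ in range(epochs):
        grad_alpha = [0.0] * n
        grad_bias = 0.0
        for i, y in zip(train_idx, labels):
            z = bias + sum(a * kernel[i][j] for a, j in zip(alpha, train_idx))
            err = sigmoid(z) - y
            grad_bias += err
            for m, j in enumerate(train_idx):
                grad_alpha[m] += err * kernel[i][j]
        alpha = [a - lr * g / n for a, g in zip(alpha, grad_alpha)]
        bias -= lr * grad_bias / n
    return alpha, bias

def predict(kernel, train_idx, alpha, bias, query):
    """Probability that the query protein resides in the modeled location."""
    z = bias + sum(a * kernel[query][j] for a, j in zip(alpha, train_idx))
    return sigmoid(z)

# Hypothetical toy network: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adjacency = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    adjacency[u][v] = adjacency[v][u] = 1.0

kernel = diffusion_kernel(adjacency)
train_idx = [0, 1, 4, 5]   # annotated proteins
labels = [1, 1, 0, 0]      # 1 = in the location of interest, 0 = elsewhere
alpha, bias = fit_kernel_logreg(kernel, train_idx, labels)
p2 = predict(kernel, train_idx, alpha, bias, 2)
p3 = predict(kernel, train_idx, alpha, bias, 3)
print(p2, p3)
```

In this sketch, unannotated protein 2 sits in the cluster of positively annotated proteins and receives a probability above 0.5, while protein 3 sits in the other cluster and receives a probability below 0.5. This mirrors the intuition that network neighbors tend to be co-localized.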

Another limitation of current protein localization algorithms is that they are not capable of predicting multi-location proteins. NetLoc addresses this limitation by calculating probabilistic scores for all locations for each query protein. Evaluation on the yeast multi-localization protein dataset showed that the overall success rate of NetLoc is 88%, which is much higher than that of the existing method (73%) tested on the same dataset. Finally, we proposed and evaluated two methods for network-based localization prediction using multiple protein correlation networks. One constructs a unified protein correlation network; the other uses multiple network kernels. Experiments showed that both methods improve NetLoc performance compared to using an original individual network.
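
The two strategies for using multiple networks can be illustrated with a short sketch. The abstract does not state how the unified network is built or how the kernels are combined, so both functions below are assumptions: `unify_networks` takes the union of edges across networks, and `combine_kernels` takes a weighted sum of precomputed kernel matrices. The function names and the example matrices are hypothetical.

```python
def unify_networks(adjacencies):
    """Union of edges: link two proteins if any input network links them."""
    n = len(adjacencies[0])
    return [[1.0 if any(a[i][j] for a in adjacencies) else 0.0
             for j in range(n)] for i in range(n)]

def combine_kernels(kernels, weights=None):
    """Weighted sum of kernel matrices (equal weights by default)."""
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    n = len(kernels[0])
    return [[sum(w * k[i][j] for w, k in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

# Hypothetical example: a physical PPI network and a co-expression network
# over the same three proteins.
physical = [[0, 1, 0],
            [1, 0, 0],
            [0, 0, 0]]
coexpressed = [[0, 0, 0],
               [0, 0, 1],
               [0, 1, 0]]
unified = unify_networks([physical, coexpressed])

# Hypothetical precomputed kernels for two networks over two proteins.
kernel_a = [[1.0, 0.0], [0.0, 1.0]]
kernel_b = [[0.5, 0.5], [0.5, 0.5]]
combined = combine_kernels([kernel_a, kernel_b])
print(unified, combined)
```

A weighted sum of valid kernels is itself a valid kernel when the weights are non-negative, so the combined matrix can be passed to the same classifier as a kernel computed from a single network.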