Date of Award


Document Type

Open Access Dissertation


Computer Science and Engineering

First Advisor

John Rose


High-throughput proteomics analysis involves the rapid identification and characterization of large sets of proteins in complex biological samples. Tandem mass spectrometry (MS/MS) has become the leading approach for the experimental identification of proteins. Accurate analysis of the data produced is a computationally challenging process that relies on a detailed understanding of molecular dynamics, signal processing, and pattern classification. In this work, we address these modeling and classification problems and introduce an additional data-driven evolutionary information source into the analysis pipeline.

The particular problem addressed is peptide sequencing via MS/MS: deciphering the amino acid sequence of digested proteins (peptides) from the MS/MS spectra produced in a typical experimental protocol. Our approach is de novo: it sequences peptides using only the information contained in the experimental spectrum, together with distributions of amino acid usage learned from large sets of protein sequence data. In this dissertation we pursue three main objectives: a neural-network ion classifier that selects informative ions from the spectrum, a peptide sequencer that uses dynamic programming and a scoring function to generate candidate peptide sequences, and a scoring function for candidate peptides. Candidate peptide sequences are generated via a dynamic programming graph algorithm and then scored using a combination of the neural network score, the amino acid usage score, and an edge frequency score. In addition to building a complete de novo peptide sequencer, we examine the use of amino acid usage models on their own for reranking candidate peptides.
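
As a rough illustration of the dynamic programming graph search described above, the following minimal sketch finds the highest-scoring path through a small spectrum graph. It is not the dissertation's implementation. The function name `best_path`, the six-residue mass table, the mass tolerance, and the per-node scores (a stand-in for the neural network ion score) are all assumptions chosen for the example, and the amino acid usage and edge frequency scores are left out.

```python
# Monoisotopic residue masses (Da) for a small, illustrative subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "P": 97.05276, "V": 99.06841, "L": 113.08406,
}


def best_path(nodes, tol=0.01):
    """Return (score, sequence) for the best path from the first to the last node.

    nodes: list of (prefix_mass, node_score) tuples sorted by mass. The first
    node is the empty prefix (mass 0.0) and the last is the full residue mass.
    Returns None when no chain of single-residue edges reaches the last node.
    """
    n = len(nodes)
    # best[i] holds the best (score, sequence) ending at node i, or None.
    best = [None] * n
    best[0] = (nodes[0][1], "")
    for j in range(1, n):
        for i in range(j):
            if best[i] is None:
                continue  # node i is not reachable from the start
            gap = nodes[j][0] - nodes[i][0]
            for aa, mass in RESIDUE_MASS.items():
                # An edge exists when the mass gap matches one residue.
                if abs(gap - mass) <= tol:
                    cand = (best[i][0] + nodes[j][1], best[i][1] + aa)
                    if best[j] is None or cand[0] > best[j][0]:
                        best[j] = cand
    return best[-1]


# Hypothetical prefix-mass nodes for the peptide GASP, plus two distractors:
# 71.03711 starts a lower-scoring "AG..." path, and 100.0 is an unreachable noise peak.
nodes = [
    (0.0, 0.0),
    (57.02146, 0.8),
    (71.03711, 0.2),
    (100.0, 0.9),
    (128.05857, 0.7),
    (215.09060, 0.6),
    (312.14336, 0.5),
]

score, sequence = best_path(nodes)
print(sequence, round(score, 2))
```

In this toy graph the noise peak has the highest node score, but it is never used because no residue mass connects it to the start node. A fuller sequencer would keep several candidate paths per node rather than one, so that the candidates can later be rescored and reranked.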