Date of Award

2013

Document Type

Open Access Dissertation

Department

Statistics

First Advisor

Ian L Dryden

Abstract

Current methods for protein identification in tandem mass spectrometry (MS/MS) rely on database searches or de novo peptide sequencing. Database searches are the standard method, but they run into problems when the species is not in the database. Both approaches also suffer from chemical noise, overly complex fragments, and incomplete b and y ion sequences. Here we present a Bayesian approach to identifying peptides. Our model uses prior information about the average relative abundances of bond cleavages and the prior probability of any particular amino acid sequence. The proposed likelihood function is composed of two overall distance measures, which quantify how close an observed spectrum is to the theoretical spectrum for a peptide. A Markov chain Monte Carlo (MCMC) algorithm is employed to simulate candidate peptides from the posterior distribution of the peptide sequence. The true peptide is estimated as the peptide with the largest posterior density. In addition, our method ranks the top candidate peptides according to their approximate posterior densities, which allows one to see the relative uncertainty in the "best" choice. A simulation study was carried out to assess the accuracy of the algorithm under two noise structures: a Laplace noise structure and a Poisson noise structure. The simulation results indicate that the method is promising. Our motivating data come from the Pacific Northwest National Laboratory (PNNL); the dataset is a set of doubly charged tryptic peptides from Salmonella typhimurium. When our method was applied to peptides from this dataset, the true peptide was captured among the list of the top estimated peptides.
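The general idea of sampling peptide sequences by MCMC and ranking them by posterior density can be illustrated with a minimal sketch. It is not the dissertation's model: the flat prior, the single toy distance (the abstract describes two distance measures and informative priors), the fixed peptide length, the singly charged fragment ions, the one-residue mutation proposal, and all function names are assumptions made only for illustration.

```python
import math
import random

# Monoisotopic residue masses (Da) for a subset of amino acids. Isoleucine is
# left out because it is isobaric with leucine and cannot be told apart by mass.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333,
}
ALPHABET = sorted(RESIDUE_MASS)
PROTON, WATER = 1.00728, 18.01056


def theoretical_spectrum(peptide):
    """Singly charged b- and y-ion m/z values for a peptide string."""
    masses = [RESIDUE_MASS[a] for a in peptide]
    peaks = []
    for i in range(1, len(masses)):
        peaks.append(sum(masses[:i]) + PROTON)          # b ion: first i residues
        peaks.append(sum(masses[i:]) + WATER + PROTON)  # y ion: remaining residues
    return sorted(peaks)


def distance(observed, peptide):
    """Toy distance (assumed): gap from each theoretical peak to the nearest observed peak."""
    return sum(min(abs(t - o) for o in observed)
               for t in theoretical_spectrum(peptide))


def log_posterior(observed, peptide, scale=1.0):
    """Unnormalized log posterior: flat prior, likelihood proportional to exp(-distance/scale)."""
    return -distance(observed, peptide) / scale


def sample_peptides(observed, length, n_iter=2000, seed=0):
    """Metropolis sampler over sequences; returns visited peptides ranked by log posterior."""
    rng = random.Random(seed)
    current = "".join(rng.choice(ALPHABET) for _ in range(length))
    current_lp = log_posterior(observed, current)
    visited = {current: current_lp}
    for _ in range(n_iter):
        # Propose a change to one randomly chosen residue (a symmetric proposal).
        pos = rng.randrange(length)
        proposal = current[:pos] + rng.choice(ALPHABET) + current[pos + 1:]
        proposal_lp = log_posterior(observed, proposal)
        # With a symmetric proposal the acceptance ratio is the posterior ratio.
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        visited[current] = current_lp
    return sorted(visited.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Noise-free "observed" spectrum generated from a known peptide.
    observed = theoretical_spectrum("PEPTK")
    for peptide, score in sample_peptides(observed, length=5)[:5]:
        print(peptide, round(score, 3))
```

The ranked list returned by `sample_peptides` plays the role of the top-candidate list described in the abstract: the gap in log posterior between the first few entries gives a rough sense of how uncertain the top choice is.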
