Date of Award


Document Type

Open Access Dissertation


Computer Science and Engineering

First Advisor

Jijun Tang


Over the long history of genome evolution, genes get rearranged under events such as rearrangements, losses, insertions and duplications, which in all change the ordering and content along the genome. Recent progress in genome-scale sequencing renews the challenges in the reconstructions of phylogeny and ancestral genomes with gene-order data. Such problems have been proved so interesting that a large number of algorithms have been developed rigorously over the past few years in attempts to tackle these problems following various principles. However, difficulties and limitations in performance and scalability largely prevent us from analyzing emerging modern whole-genome data, our study presented in this dissertation focuses on developing appropriate evolutionary models and robust algorithms for solving the phylogenetic and ancestral inference problems using gene-order data under the whole-genome evolution, along with their applications.

To reconstruct phylogenies from gene-order data, we developed a collection of closely-related methods following the principle of likelihood maximization. To the best of our knowledge, it was the first successful attempt to apply maximum likelihood optimization technique into the analysis of gene-order phylogenetic problem. Later we proposed MLWD (in collaboration with Lin and Moret) in which we described an effective transition model to account for the transitions between presence and absence states of an gene adjacency. Besides genome rearrangements, other evolutionary events modify gene contents such as gene duplications and gene insertion/deletion (indels) can be naturally processed as well. We present our results from extensive testing on simulated data showing that our approach returns very accurate results very quickly.

With a known phylogeny, a subsequent problem is to reconstruct the gene-order of ancestral genomes from their living descendants. To solve this problem, we adopted an adjacency-based probabilistic framework, and developed a method called PMAG. PMAG decomposes gene orderings into a set of gene adjacencies and then infers the probability of observing each adjacency in the ancestral genome. We conducted extensive simulation experiments and compared PMAG with InferCarsPro, GASTS, GapAdj and SCJ. According to the results, PMAG demonstrated great performance in terms of the true positive rate of gene adjacency. PMAG also achieved comparable running time to the other methods, even when the traveling sales man problem (TSP) were exactly solved.

Although PMAG can give good performance, it is strongly restricted from analyzing datasets underwent only rearrangements. To infer ancestral genomes under a more general model of evolution with an arbitrary rate of indels , we proposed an enhanced method PMAG+ based on PMAG. PMAG+ includes a novel approach to infer ancestral gene contents and a detail description to reduce the adjacency assembly problem to an instance of TSP. We designed a series of experiments to validate PMAG+ and compared the results with the most recent and comparable method GapAdj. According to the results, ancestral gene contents predicted by PMAG+ coincided highly with the actual contents with error rates less than 1%. Under various degrees of indels, PMAG+ consistently achieved more accurate prediction of ancestral gene orders and at the same time, produced contigs very close to the actual chromosomes.