Date of Award

8-9-2014

Document Type

Open Access Dissertation

Department

Computer Science and Engineering

First Advisor

Jijun Tang

Abstract

Genome rearrangement analysis has attracted a lot of attentions in phylogenetic com- putation and comparative genomics. Solving the median problems based on various distance definitions has been a focus as it provides the building blocks for maximum parsimony analysis of phylogeny and ancestral genomes. The Median Problem (MP) has been proved to be NP-hard and although there are several exact or heuristic al- gorithms available, these methods all are difficulty to compute distant three genomes containing high evolution events. Such as current approaches, MGR[1] and GRAPPA [2], are restricted on small collections of genomes and low-resolution gene order data of a few hundred rearrangement events. In my work, we focus on heuristic algorithms which will combine genomic sorting algorithm with genetic algorithm (GA) to pro- duce new methods and directions for whole-genome median solver, ancestor inference and phylogeny reconstruction.

In equal median problem, we propose a DCJ sorting operation based genetic algorithms measurements, called GA-DCJ. Following classic genetic algorithm frame, we develop our algorithms for every procedure and substitute for each traditional genetic algorithm procedure. The final results of our GA-based algorithm are optimal median genome(s) and its median score. In limited time and space, especially in large scale and distant datasets, our algorithm get better results compared with GRAPPA and AsMedian.

Extending the ideas of equal genome median solver, we develop another genetic algorithm based solver, GaDCJ-Indel, which can solve unequal genomes median prob- lem (without duplication). In DCJ-Indel model, one of the key steps is still sorting operation[3]. The difference with equal genomes median is there are two sorting di- rections: minimal DCJ operation path or minimal indel operation path. Following different sorting path, in each step scenario, we can get various genome structures to fulfill our population pool. Besides that, we adopt adaptive surcharge-triangle inequality instead of classic triangle inequality in our fitness function in order to fit unequal genome restrictions and get more efficient results. Our experiments results show that GaDCJ-Indel method not only can converge to accurate median score, but also can infer ancestors that are very close to the true ancestors.

An important application of genome rearrangement analysis is to infer ancestral genomes, which is valuable for identifying patterns of evolution and for modeling the evolutionary processes. However, computing ancestral genomes is very difficult and we have to rely on heuristic methods that have various limitations. We propose a GA-Tree algorithm which adapts meta-population [4], co-evolution and repopulation pool methods In this paper, we describe and illuminate the first genetic algorithm for ancestor inference step by step, which uses fitness scores designed to consider co- evolution and uses sorting-based methods to initialize and evolve populations. Our extensive experiments show that compared with other existing tools, our method is accurate and can infer ancestors that are much closer to true ancestors.

Share

COinS