Date of Award


Document Type

Campus Access Dissertation


Computer Science and Engineering

First Advisor

Jijun Tang


Phylogenetic reconstruction attempts to determine the evolutionary relationships that connect species by comparing their genetic information. In recent decades, research has opened a new frontier in this field: gene order and content data. This data format uses the order and presence of genes on chromosomes to measure large-scale evolutionary events and distances.

The first problem addressed in this work is the inversion distance median problem, which involves finding a genome that minimizes the sum of the pairwise inversion distances between itself and three other genomes. This median problem is known to be NP-hard, and all existing solvers are extremely slow when the genomes are distant. We present a new inversion median heuristic based on commuting reversals. Testing on simulated data sets shows that this method offers a better trade-off between speed and accuracy than existing methods.
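To make the median objective concrete, the following minimal sketch scores a candidate genome against three input genomes and finds an optimal median by exhaustive search. It is not the commuting-reversal heuristic presented in this work: it is a brute-force baseline that is only usable on very small genomes. The function names, the representation of a genome as a tuple of signed integers, and the assumption of a single linear chromosome with equal gene content are illustrative choices, not details taken from the dissertation.

```python
from collections import deque
from itertools import permutations, product


def reversals(perm):
    """Yield every signed permutation exactly one reversal away from perm."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            # Reverse the segment perm[i..j] and flip the sign of each gene in it.
            segment = tuple(-g for g in reversed(perm[i:j + 1]))
            yield perm[:i] + segment + perm[j + 1:]


def inversion_distance(source, target):
    """Exact inversion distance by breadth-first search (tiny genomes only)."""
    source, target = tuple(source), tuple(target)
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        for nxt in reversals(perm):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("genomes must contain the same set of genes")


def median_score(candidate, genomes):
    """Sum of inversion distances from the candidate to each input genome."""
    return sum(inversion_distance(candidate, g) for g in genomes)


def brute_force_median(genomes):
    """Try every signed permutation of the genes and keep a best-scoring one."""
    genes = sorted(abs(g) for g in genomes[0])
    candidates = (
        tuple(sign * gene for sign, gene in zip(signs, order))
        for order in permutations(genes)
        for signs in product((1, -1), repeat=len(genes))
    )
    return min(candidates, key=lambda c: median_score(c, genomes))
```

The search space contains n! * 2^n signed permutations for n genes, which is why exact approaches become impractical quickly and heuristics are of interest.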

Next, this work addresses the current state of the art in gene order evolutionary measures, the Double-Cut-and-Join (DCJ) distance. Using the DCJ distance metric and a new concept called Prosthetic Chromosomes, we demonstrate an elegant solution for the situation where gene order data sets have unequal content. Egchel (Extended Gene Content HEuristic Layer) is our implementation, which creates an equal-content gene order data set to emulate the behavior of data sets that include insertion and deletion events. Testing on simulated data indicates that data sets which previously contained too few common genes can now be analyzed using Egchel with practical speed and improved accuracy.
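For readers unfamiliar with the DCJ distance, the sketch below computes it in the simplest setting: two genomes that each consist of one circular chromosome and contain exactly the same genes, each gene once. In that case the distance equals the number of genes minus the number of cycles in the adjacency graph of the two genomes. The sketch does not implement Prosthetic Chromosomes or Egchel, and it does not handle unequal gene content; the function names and the list-of-signed-integers representation are assumptions made for illustration.

```python
def adjacencies(genome):
    """Map each gene extremity to its neighbor in a circular signed genome."""
    def ends(g):
        # A positive gene is read tail -> head; a negative gene head -> tail.
        if g > 0:
            return (abs(g), "t"), (abs(g), "h")
        return (abs(g), "h"), (abs(g), "t")

    partner = {}
    n = len(genome)
    for i in range(n):
        right_end = ends(genome[i])[1]
        left_end_of_next = ends(genome[(i + 1) % n])[0]  # wrap around: circular
        partner[right_end] = left_end_of_next
        partner[left_end_of_next] = right_end
    return partner


def dcj_distance_circular(a, b):
    """DCJ distance for two circular, single-chromosome, equal-content genomes."""
    if sorted(map(abs, a)) != sorted(map(abs, b)):
        raise ValueError("this sketch requires equal gene content")
    pa, pb = adjacencies(a), adjacencies(b)
    seen = set()
    cycles = 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Alternate between an adjacency of a and an adjacency of b
        # until the walk returns to its starting extremity.
        while x not in seen:
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    return len(a) - cycles
```

For example, `dcj_distance_circular([1, 2, 3], [1, -2, 3])` evaluates to 1, since a single inversion of gene 2 is one DCJ operation.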

Finally, gene order phylogenetic analysis currently lacks a convincing means of statistically validating trees. Existing literature has attempted to do so by resampling data sets with a jackknifing procedure adapted from sequence data analysis, but this approach has significant theoretical weaknesses. We have attempted to correct these weaknesses by incorporating a DCJ error model into a resampling method for verifying gene order based trees.
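The jackknifing procedure referred to above can be sketched as follows: a random subset of genes is kept, the same subset is removed from every genome, and a tree would then be rebuilt from each reduced replicate. This illustrates only the existing resampling idea, not the DCJ error model proposed in this work. The function name, the `keep_fraction` parameter and its default value, and the dictionary-of-lists input format are hypothetical.

```python
import random


def jackknife_replicate(genomes, keep_fraction=0.6, rng=None):
    """Return one replicate in which the same random gene subset is kept everywhere.

    genomes maps a taxon name to a list of signed gene identifiers; all taxa
    are assumed to share the same gene content. keep_fraction is an
    illustrative value, not one taken from the dissertation.
    """
    rng = rng or random.Random()
    genes = sorted({abs(g) for g in next(iter(genomes.values()))})
    keep_count = max(1, round(keep_fraction * len(genes)))
    kept = set(rng.sample(genes, keep_count))
    # Dropping genes preserves the relative order and sign of the remaining ones.
    return {
        name: [g for g in order if abs(g) in kept]
        for name, order in genomes.items()
    }
```

Repeating this many times and counting how often each branch reappears in the rebuilt trees gives a support value for that branch.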

These three methods work together to extend the speed, accuracy, and utility of phylogenetic analysis using gene order and content data.