Date of Award

12-15-2014

Document Type

Open Access Dissertation

Department

Computer Science and Engineering

First Advisor

John R. Rose

Abstract

Computational Genomics, or Computational Genetics, refers to the use of computational and statistical analysis for understanding the structure and the function of genetic material in organisms. The primary focus of research in computational genomics in the past three decades has been the understanding of genomes and their functional elements by analyzing biological sequence data. The high demand for low-cost sequencing has driven the development of highthroughput sequencing technologies, next-generation sequencing (NGS), that parallelize the sequencing process, producing thousands or millions of sequences concurrently. Moore’s Law is the observation that the number of transistors on integrated circuits doubles approximately every two years; correspondingly, the cost per transistor halves. The cost of DNA sequencing declines much faster, which implies more new DNA data will be obtained. This large-scale sequence data, produced with high throughput sequencing technologies, needs to be processed in a time-effective and cost-effective manner. In this dissertation, we present a high-performance meta-genome gene identification framework. This framework includes four modules: filter, alignment, error correction, and gene identification. The following chapters describe the proposed design and evaluation of this pipeline. The most computationally expensive kernel in the framework is the alignment procedure. Thus, the filter module is developed to determine unnecessary alignment operations. Without the filter module, the alignment module requires 1.9 hours to complete all-to-all alignment on a test file of size 512,000 sequences with each sequence average length 750 base pairs by using ten Kepler K20 NVIDIA GPU. On the other hand, when combined with the filter kernel, the total time is 11.3 minutes. Note that the ideal speedup is nearly 91.4 times faster when new alignment kernel is run on ten GPUs ( 10*9.14). We conclude that accuracy can be achieved at the expense of more resources while operating frequency can still be maintained.

Share

COinS