Date of Award


Document Type

Campus Access Dissertation


Environmental Health Sciences

First Advisor

Robert S Norman


Microorganisms make up half of Earth's biomass and are the largest known reservoir of biodiversity and genetic potential. They play an important role in maintaining environmental equilibrium by taking part in several biogeochemical cycles. Microbes are found in every habitat, irrespective of environmental conditions. In the human body, microbes outnumber human cells by a factor of ten to one. Although their influence is not entirely understood, it is probable that microbes play a major role in several metabolic processes. The gastrointestinal (GI) microbiota is the most abundant and diverse microbial community of the human body. Several studies have shown that alterations in the GI microbiota are associated with obesity, inflammatory bowel disease (IBD), Crohn's disease, and even colorectal cancer. Colorectal cancer is the second leading cause of cancer death in the US, mainly because of the lack of early detection. If diagnosed at the polyp stage, it can be treated to avert progression of the cancer. However, invasive and expensive detection procedures such as CT-colonoscopy and endoscopy make early screening inconvenient and difficult. Alternative diagnostic strategies, such as identifying possible biomarkers, could therefore help detect colon cancer at an early stage.

So far, only 1-15% of microbes have been studied, because most cannot be readily cultured. The past few decades have witnessed remarkable advances in microbiology as researchers began adopting 'omics' methods of investigation. These new approaches, such as metagenomics, have created a paradigm shift in our understanding of microbial community structure and function. Metagenomics bypasses the need to culture microbes and is therefore a powerful tool for investigating microbial communities in their natural environment. Advances in sequencing technology, such as high-throughput sequencing, have made it possible to take full advantage of metagenomics. This has helped researchers tap into greater depths of microbial diversity and has changed our view of the relationship between microbes and other organisms. However, next-generation sequencing produces large amounts of data that can be challenging to process and analyze. Flexible and efficient data management pipelines that can be tailored to the needs of an individual research group are therefore essential. The current data management pipeline was designed for processing, analyzing, and storing metagenomic sequences obtained with 454-pyrosequencing technology.

The data management pipeline was optimized using microbial metagenomes isolated from colon tissue and stool samples of APCmin/+ mice (a mouse model of colon cancer). A total of 171,405 unprocessed sequences were generated from proximal and distal colon sections as well as stool samples using 454-pyrosequencing. After preprocessing, 72% of the reads were retained as high-quality reads and used for further analyses. These high-quality sequences were then classified by aligning them against the Ribosomal Database Project (RDP) database. The relative abundance of each classified sequence was compared across samples to identify differences between the microbial communities of tumor and non-tumor mice. The proximal region showed higher variation than the other samples. In addition, Prevotella was found in higher abundance in both tissue and stool samples of mice that developed tumors. Finally, the clean and formatted reads were stored in a relational database created with MySQL. This database contains not only sequence information such as read length, average quality score, and GC content, but also related information such as metadata, habitat, and sequence annotation. A user interface was created in PHP for data retrieval and sharing.
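
The per-read attributes and the relative-abundance comparison described above could be computed along the following lines. This is only a minimal sketch, not the pipeline's actual code: the function names `read_stats` and `relative_abundance` are hypothetical, the example read and genus labels are made-up toy input rather than study data, and quality scores are assumed to be already decoded to integers.

```python
from collections import Counter


def read_stats(sequence, qualities):
    """Return length, mean quality score, and GC percentage for one read.

    Assumes `qualities` is a list of integer quality scores, one per base.
    """
    length = len(sequence)
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return {
        "length": length,
        "mean_quality": sum(qualities) / len(qualities) if qualities else 0.0,
        "gc_percent": 100.0 * gc / length if length else 0.0,
    }


def relative_abundance(genus_calls):
    """Map each genus to its fraction of the classified reads in one sample."""
    counts = Counter(genus_calls)
    total = sum(counts.values())
    return {genus: n / total for genus, n in counts.items()}


# Toy input for illustration only (not data from the study).
stats = read_stats("ACGTGGCC", [30, 32, 28, 35, 31, 29, 33, 30])
abundance = relative_abundance(
    ["Prevotella", "Prevotella", "Bacteroides", "Lactobacillus"]
)
```

Values such as those in `stats` correspond to the sequence information stored per read, while per-sample dictionaries like `abundance` are what would be compared between tumor and non-tumor samples.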