Date of Award

2017

Document Type

Open Access Dissertation

Department

Computer Science and Engineering

Sub-Department

College of Engineering and Computing

First Advisor

Jijun Tang

Abstract

The 3D structure of chromosomes plays a fundamental role in essential cellular functions such as gene regulation, gene expression, and evolution, and the Hi-C technique measures the interaction density between loci on chromosomes. In this dissertation, we developed multiple algorithms, with a focus on deep learning approaches, to study Hi-C datasets and 3D genome structures.

Building the 3D structure of the genome is one of the most important applications of the Hi-C technique. Recently, several approaches have been developed to reconstruct 3D models of chromosomes from Hi-C data. However, each of these methods is tied to a particular mathematical model and lacks the flexibility to accommodate new developments. We introduce a novel approach based on a genetic algorithm. Our approach is flexible enough to accept any mathematical model for building a 3D chromosomal structure, and it outperforms current techniques in accuracy.
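
As a rough illustration of what "flexible to accept any mathematical model" can mean in practice, the sketch below passes the objective function to a small genetic algorithm as an argument, so a different model can be swapped in without touching the search loop. It is not the dissertation's implementation: the function names, the inverse-frequency conversion from contacts to distances, the squared-error objective, and all parameter values are assumptions made for this example.

```python
import math
import random

def contacts_to_distances(contacts, alpha=1.0):
    """Assumed conversion: distance = (1 / contact frequency) ** alpha."""
    n = len(contacts)
    return [[0.0 if i == j else (1.0 / contacts[i][j]) ** alpha
             for j in range(n)] for i in range(n)]

def distance_error(coords, target):
    """One possible objective: squared error between model and target distances."""
    n = len(coords)
    return sum((math.dist(coords[i], coords[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def evolve(target, objective, pop_size=40, generations=150, sigma=0.3, seed=0):
    """Minimize objective(coords, target) with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    n = len(target)
    population = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n)]
                  for _ in range(pop_size)]
    history = []  # best objective value seen at each generation
    for _ in range(generations):
        population.sort(key=lambda s: objective(s, target))
        history.append(objective(population[0], target))
        survivors = population[: pop_size // 2]  # elitism: keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Uniform crossover: each locus is copied from one of the two parents.
            child = [list(a[k]) if rng.random() < 0.5 else list(b[k])
                     for k in range(n)]
            # Mutation: perturb the 3D position of one randomly chosen locus.
            k = rng.randrange(n)
            child[k] = [x + rng.gauss(0, sigma) for x in child[k]]
            children.append(child)
        population = survivors + children
    population.sort(key=lambda s: objective(s, target))
    history.append(objective(population[0], target))
    return population[0], history

# Toy contact map in which frequency decays with genomic separation.
toy_contacts = [[1.0 / (1 + abs(i - j)) for j in range(6)] for i in range(6)]
toy_target = contacts_to_distances(toy_contacts)
best_structure, best_history = evolve(toy_target, distance_error)
```

Because `evolve` only calls `objective(coords, target)`, replacing `distance_error` with another scoring function changes the model being fitted while the search procedure stays the same.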

Although an increasing number of Hi-C datasets have been generated in a variety of tissue and cell types, the resolution of most Hi-C datasets is coarse because of high sequencing cost, so they cannot be used to infer important biological functions (e.g., to detect enhancer-promoter interactions or to link disease-related non-coding variants to their target genes). To address this challenge, we develop HiCPlus, a computational approach based on a deep convolutional neural network, to infer high-resolution Hi-C interaction matrices from low-resolution Hi-C data. Through extensive testing, we demonstrate that HiCPlus can impute interaction matrices highly similar to the original ones while using as few as 1/16 of the total sequencing reads. We observe that Hi-C interaction matrices contain unique local features that are consistent across different cell types, and that such features can be effectively captured by the deep learning framework. We further apply HiCPlus to enhance and expand the usability of Hi-C datasets in a variety of tissue and cell types. In summary, our work not only provides a framework to generate high-resolution Hi-C matrices at a fraction of the sequencing cost but also reveals features underlying the formation of 3D chromatin interactions.
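
To make the "1/16 of the total sequencing reads" setting concrete, the sketch below shows one way a low-coverage contact matrix could be simulated from a high-coverage one by keeping each read independently with probability 1/16. This is an assumption about how such input might be produced for testing, not a description of the HiCPlus pipeline; the function name and the toy matrix are hypothetical.

```python
import random

def downsample_contacts(matrix, fraction, seed=0):
    """Keep each read with probability `fraction`, preserving matrix symmetry."""
    rng = random.Random(seed)
    n = len(matrix)
    low = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # sample the upper triangle, then mirror it
            kept = sum(1 for _ in range(matrix[i][j]) if rng.random() < fraction)
            low[i][j] = low[j][i] = kept
    return low

# Toy high-coverage matrix: integer read counts that decay away from the diagonal.
high_coverage = [[400 // (1 + abs(i - j)) for j in range(10)] for i in range(10)]
low_coverage = downsample_contacts(high_coverage, 1 / 16)
```

A pair such as (`low_coverage`, `high_coverage`) is the kind of input/target pair a model for resolution enhancement could be trained and evaluated on.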

The noise level in Hi-C data is high, and the structure of the noise is complicated. Moreover, even under the strictest experimental conditions, completely noise-free Hi-C data cannot be obtained. We proposed a novel approach to learn a denoising network without clean data. Our approach employs a Siamese structure, using two replicates from the same experimental setting to train the model; the resulting model can then be applied to datasets where only one replicate is available. We applied our new approach to enhance Hi-C data, an important type of data for exploring three-dimensional genomic structures. The results show that the model trained by our method significantly reduces the noise level in Hi-C data.
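
The intuition behind training without clean data can be shown with a deliberately tiny stand-in for the network. If the noise in two replicates is independent and zero-mean, a model fitted to predict one replicate from the other is pulled toward the signal they share. The sketch below replaces the network with a single shrinkage weight `w` and uses simulated Gaussian data; the function names, the one-parameter model, and the simulation are all assumptions for illustration and do not reproduce the Siamese architecture itself.

```python
import random

def simulate_replicates(n, noise_sd=1.0, seed=0):
    """Simulated data: one hidden signal observed twice with independent noise."""
    rng = random.Random(seed)
    signal = [rng.gauss(0, 1) for _ in range(n)]
    rep1 = [s + rng.gauss(0, noise_sd) for s in signal]
    rep2 = [s + rng.gauss(0, noise_sd) for s in signal]
    return signal, rep1, rep2

def fit_shrinkage(rep1, rep2):
    """Least-squares weight w minimizing sum((w * rep1 - rep2) ** 2).

    Only the two noisy replicates are used; the clean signal is never seen.
    """
    return sum(a * b for a, b in zip(rep1, rep2)) / sum(a * a for a in rep1)

def mse(xs, ys):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

signal, rep1, rep2 = simulate_replicates(5000)
w = fit_shrinkage(rep1, rep2)        # trained on replicate pairs only
denoised = [w * x for x in rep1]     # applied to a single replicate
error_before = mse(rep1, signal)     # the hidden signal is used for evaluation only
error_after = mse(denoised, signal)
```

Once `w` is fitted, applying it needs only one replicate, which mirrors the use case described above for datasets where a second replicate is unavailable.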

In the past few years, we have seen an explosion of Hi-C data in a variety of cell and tissue types. While these publicly available data present an unprecedented opportunity to interrogate chromosomal architecture, quantitatively comparing Hi-C data from different tissues and identifying tissue-specific chromatin interactions remain challenging. We developed HiCComp, a comprehensive framework for comparing Hi-C data. HiCComp uses convolutional neural networks to extract key features from Hi-C interaction matrices in a fully automatic way. The core component of HiCComp is a triplet network, which contains three identical convolutional neural networks with shared parameters. The inputs to our network are three Hi-C matrices: two of them are biological replicates from the same cell type, and the third is from another cell type. The HiCComp network takes advantage of the two biological replicates to estimate the natural variation in the experiments and then uses that estimate to identify significant variations between Hi-C matrices from different cell types. Furthermore, we incorporate a systematic occlusion method into our framework so that we can identify dynamic interaction regions from Hi-C maps. Finally, we show that the dynamic regions between two cell types are enriched for transcription factor binding sites and histone modifications that are associated with cis-regulatory functions, suggesting that these variations in 3D genome structure are potentially gene regulatory events.
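
The triplet comparison and the occlusion scan described above can be sketched in a few lines. In this sketch the shared convolutional network is replaced by a hypothetical `embed` function that simply flattens the matrix, the loss is a standard margin-based triplet loss, and occlusion zeroes out a square window in both matrices; none of these choices is taken from HiCComp itself, and all names and values are assumptions.

```python
import math

def embed(matrix):
    """Hypothetical stand-in for the shared CNN: flatten the matrix."""
    return [v for row in matrix for v in row]

def distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(rep1, rep2, other, margin=1.0):
    """Zero when replicates are closer than the other cell type by at least `margin`."""
    d_pos = distance(embed(rep1), embed(rep2))   # same cell type
    d_neg = distance(embed(rep1), embed(other))  # different cell type
    return max(0.0, d_pos - d_neg + margin)

def occlude(matrix, row, col, size):
    """Return a copy with a size x size window starting at (row, col) set to zero."""
    out = [r[:] for r in matrix]
    for i in range(row, row + size):
        for j in range(col, col + size):
            out[i][j] = 0.0
    return out

def occlusion_scan(a, b, size=2):
    """Score each window by how much hiding it reduces the distance between a and b."""
    n = len(a)
    base = distance(embed(a), embed(b))
    scores = {}
    for row in range(n - size + 1):
        for col in range(n - size + 1):
            hidden = distance(embed(occlude(a, row, col, size)),
                              embed(occlude(b, row, col, size)))
            scores[(row, col)] = base - hidden
    return scores

# Toy data: two cell types that differ only in a 2 x 2 block of interactions.
cell_a = [[1.0] * 6 for _ in range(6)]
cell_b = [row[:] for row in cell_a]
for i in (2, 3):
    for j in (2, 3):
        cell_b[i][j] += 2.0
window_scores = occlusion_scan(cell_a, cell_b)
most_dynamic_window = max(window_scores, key=window_scores.get)
```

Windows with the highest scores are the regions that contribute most to the difference between the two matrices, which is the role the occlusion step plays in locating dynamic interaction regions.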

Rights

© 2017, Yan Zhang
