Open Access Dissertation
Computer Science and Engineering
Gene expression underlies cell differentiation and development. Although all cells in an organism carry essentially the same DNA, cell types and activities differ because of changes in gene expression. Gene expression is shaped by many regulatory mechanisms. RNA editing increases the diversity of RNA and proteins through single-nucleotide substitutions. Reverse transcription can alter the expression status of genes by inducing genetic diversity and polymorphism via novel insertions, deletions, and recombination events. Gene regulation is critical to normal development because it enables cells to respond rapidly to environmental changes. However, identifying gene regulation events from genome data remains challenging due to the repetitive nature of eukaryotic genomes and their high structural diversity.
Deep learning techniques emerged in the 2000s and quickly gained traction in a variety of disciplines due to their strong prediction performance on large datasets. Since then, numerous applications in computational biology have been proposed, including image resolution enhancement and analysis, the detection of DNA function, and protein structure prediction. As a result, deep learning is widely regarded as a promising technique for advancing bioinformatics. In this dissertation, we explore deep learning-based methods to solve the following gene regulation problems: 1) RNA editing identification, 2) novel LINE-1 retrotransposon gene synthesis, and 3) RNA editing identification and classification across tissues.
First, we took the RNA editing identification task as an example to fully explore deep learning-based methods for solving gene regulation problems. While millions of RNA editing sites have been reported in the human genome, far more sites are believed to be editable and remain to be identified. We constructed convolutional neural network (CNN) models to predict human RNA editing events in both Alu and non-Alu regions. Experimental results showed that our method achieved outstanding performance on two validation datasets. We deployed our CNN models as a web service named EditPredict. In addition to the human genome, EditPredict supports the genomes of other model organisms, including the bumblebee, fruit fly, mouse, and squid.
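As an illustration of how a sequence can be presented to a CNN, the sketch below one-hot encodes a window of bases centred on a candidate editing site. It is a minimal sketch under stated assumptions, not the encoding used by EditPredict: the function name `one_hot_window`, the `flank` parameter, the A/C/G/U channel order, and the zero-padding rule are all hypothetical choices made for this example.

```python
from typing import List

# Assumed channel order; unknown bases (e.g. N) become all-zero rows.
BASES = "ACGU"


def one_hot_window(seq: str, site: int, flank: int = 3) -> List[List[int]]:
    """One-hot encode the 2*flank+1 bases centred on index `site`.

    Positions that fall outside the sequence are zero-padded.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA alphabets
    rows = []
    for pos in range(site - flank, site + flank + 1):
        base = seq[pos] if 0 <= pos < len(seq) else "N"
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


# Hypothetical candidate site: the adenosine at index 5 of a toy sequence.
window = one_hot_window("GGCATACGT", site=5, flank=2)
```

Each row of `window` corresponds to one position and each column to one base, which is the kind of two-dimensional input a one-dimensional convolution can slide over.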
Second, we explored the advantages of deep learning methods in synthesizing novel genes. Long interspersed nuclear element-1 (LINE-1) retrotransposons are the only autonomously active transposable elements in the human genome. While numerous bioinformatics methods have been developed to assist in detecting and classifying LINE-1 retrotransposons, they remain limited in reliability, precision, and efficiency. We proposed an interpretable generative adversarial network to learn the operation pattern of the LINE-1 retrotransposon and then generate synthetic sequences of up to 201 nucleotides. Experimental results showed that the synthetic sequences generated by our model are highly similar to natural LINE-1 retrotransposon sequences. We also optimized the generated sequences for desired properties, such as sequence structure for a particular biological function and protein secondary structure.
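One simple way to quantify how closely a synthetic sequence resembles a natural one is to compare their k-mer count profiles. The sketch below does this with cosine similarity. The abstract does not say which similarity measure the dissertation uses, so the metric, the choice of k = 3, and the function names `kmer_counts` and `cosine_similarity` are assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two k-mer count profiles (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy sequences standing in for a natural and a generated fragment.
natural = kmer_counts("ACGTACGTAC")
synthetic = kmer_counts("ACGTACGTTC")
score = cosine_similarity(natural, synthetic)
```

A score near 1 indicates similar k-mer composition; it says nothing about function or structure, which would need separate checks.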
Third, we extended this work by using deep learning methods to explore RNA editing across human tissues, where editing is known to vary. Our study has two major parts: RNA editing similarity and RNA editing specificity across human tissues. We analyzed the distribution of RNA editing and presented an atlas comprising millions of A-to-I events identified in six tissues. We then used a transfer learning technique to identify RNA editing across tissues and hybrid models to classify it. Our models achieved relatively good identification and classification performance. Finally, we counted the RNA editing events associated with human disorders and categorized them into different groups. We found that specific RNA editing events are consistently associated with specific human tissue diseases.
Wang, J. (2023). Utilizing Deep Learning Methods in the Identification and Synthesis of Gene Regulations. (Doctoral dissertation). Retrieved from https://scholarcommons.sc.edu/etd/7339