Multi-Scale Deep Representation Learning in Synthetic Biology
Abstract
Synthetic biology combines the expertise of engineers and biologists, bridging the gap between engineering and natural life. The field, which develops new biological components, networks, and systems to reprogram organisms, is generally divided into two broad branches. The first branch uses synthetic molecules to mimic natural biological functions. The second branch assembles natural biological components in novel ways, aiming to produce systems with unique, practical functions. The de novo engineering of biological modules and synthetic pathways is applied in practical bioengineering, for example in drug-targeting strategies and microbial product manufacturing. Synthetic biology therefore represents a new paradigm in scientific exploration and innovation, with wide-ranging implications for our understanding and optimization of biological systems. Over the past decades, the amount of available whole-genome sequencing data and experimental data has increased significantly, driven by new automation technologies such as high-content imaging, high-throughput screening, and sequencing. Given the growth of these data sets, researchers can no longer summarize the data from experience and memory alone. Stable and efficient computational methods are therefore required to integrate these data and to predict or reveal phenomena and insights that have not yet been discovered. However, incomplete knowledge of metabolic processes impairs the accuracy of models of biological systems, hindering advances in systems biology and metabolic engineering. Additionally, some fundamental challenges remain. Firstly, problems in systems biology are often cross-scale and multi-modal, yet existing computational methods for problem definition and model design are often single-scale and single-modal.
Secondly, biological systems are multi-scale, unbalanced, and noisy, which makes structuring and benchmarking such complex data very difficult. Thirdly, the complete biosynthetic pathways of most natural or otherwise valuable products are unknown, so computer-aided biosynthesis planning holds significant value. To address these challenges, we introduce multi-scale, deep learning-based representation learning methodologies to understand and optimize downstream tasks in systems biology, such as metabolic pathway inference, missing reaction prediction in genome-scale metabolic models (GEMs), and retrosynthesis prediction. Specifically, our first study introduces a novel Multi-View Multi-Label learning framework for Metabolic Pathway Inference (MVML-MPI), which outperforms state-of-the-art (SOTA) methods by accurately representing the complex relationships between compounds and pathways. In the second study, to address the limitation of incomplete metabolic knowledge in GEMs, we propose CLOSEgaps, a novel framework that integrates a hypergraph convolutional network with an attention mechanism to predict gaps in metabolic networks. It is a comprehensive deep learning-driven tool that represents the hyper-topological information of GEMs and fills gaps through hyperlink prediction, thereby enhancing the accuracy of phenotypic predictions. In the third study, we propose RetroCaptioner, a novel end-to-end framework for one-step retrosynthesis that combines a graph encoder, which integrates learnable structural information, with the capability to sequentially translate drug molecules, thereby efficiently capturing chemically plausible information. This research presents an advancement in systems biology by introducing a suite of multi-scale deep learning methodologies.
These methodologies tackle key challenges: MVML-MPI enhances our understanding of complex metabolic pathways, CLOSEgaps fills gaps in metabolic models, and RetroCaptioner facilitates retrosynthesis. Taken together, they form a comprehensive and integrated approach that advances the capabilities of synthetic biology.
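
To make the hypergraph view of a metabolic network concrete, the following minimal Python sketch encodes reactions as hyperedges over metabolites and builds the corresponding incidence matrix. A candidate "gap" reaction is then just one more hyperedge (column) whose plausibility a hyperlink-prediction model would score. This is an illustration only, not the CLOSEgaps implementation: the reaction and metabolite names, and the helper functions `incidence_matrix` and `add_candidate`, are hypothetical, and no scoring model is included.

```python
# Toy illustration: a metabolic network as a hypergraph incidence matrix.
# Reaction IDs and metabolite names are made up for this example.

def incidence_matrix(reactions):
    """Return (metabolites, matrix), where matrix[i][j] == 1 if
    metabolite i participates in reaction j, and 0 otherwise."""
    metabolites = sorted({m for members in reactions.values() for m in members})
    index = {m: i for i, m in enumerate(metabolites)}
    matrix = [[0] * len(reactions) for _ in metabolites]
    for j, members in enumerate(reactions.values()):
        for m in members:
            matrix[index[m]][j] = 1
    return metabolites, matrix


def add_candidate(reactions, name, members):
    """Return a copy of the reaction set with one candidate hyperedge added."""
    extended = dict(reactions)
    extended[name] = set(members)
    return extended


# Each reaction is a hyperedge: the set of metabolites it connects.
reactions = {
    "R1": {"A", "B", "C"},
    "R2": {"C", "D"},
}
metabolites, H = incidence_matrix(reactions)

# A candidate gap-filling reaction becomes an extra column (and possibly
# extra rows for metabolites that were not in the network before).
candidate = add_candidate(reactions, "R3_candidate", {"D", "E"})
metabolites_ext, H_ext = incidence_matrix(candidate)
```

In this representation, a reaction with any number of participants is a single column, which an ordinary pairwise graph cannot express directly. A learned model would take the incidence structure as input and assign a score to each candidate column.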