BS8 - Generating New Protein Samples to Improve Accuracy of Protein Structure Prediction
SCURS Disciplines
Computer Sciences
Document Type
General Presentation (Oral)
Invited Presentation Choice
Not Applicable
Abstract
Because protein structure prediction is important for biochemical studies, researchers have developed many deep learning-based AI models that predict protein structure from sequence information alone. To further improve the performance of these models, diverse and realistic new protein samples need to be generated. These samples can provide additional patterns that enhance training of the prediction system. Traditionally, researchers use a single generative model to learn the complex data distribution of the whole dataset, which is a very challenging task. To overcome this problem, we propose the Clustering Diffusion Model (CDM). In CDM, the whole dataset is divided into a multi-level tree, and each ‘branch’ is represented by one Diffusion Model (DM). The DM is a widely used generative model for producing realistic and complex images and videos. In contrast to a single generative model, each ‘branch’ of this multi-model system focuses on learning the distinct data distribution of only one protein family. The ‘branches’ in our system work together to capture the complex data distribution of the entire dataset. This strategy can greatly reduce the learning difficulty for each generative model and has the potential to produce more realistic and diverse protein samples than the traditional approach. Experimental results demonstrate that CDM further improves protein structure prediction performance compared with a single generative model. With parallel training strategies, the training time of CDM decreases significantly. Reducing the training time can help researchers speed up new drug exploration and other biochemical studies.
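The partition-then-model idea described in the abstract can be illustrated with a minimal sketch. This is not the CDM implementation: it assumes a flat, one-level split by a known family label instead of a multi-level tree, uses a one-dimensional toy feature in place of real protein data, and substitutes a simple Gaussian fit for each branch where CDM would train a Diffusion Model. All names (`make_toy_dataset`, `GaussianBranch`, `fit_branches`, `sample_from_branches`) are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def make_toy_dataset(seed=0):
    """Hypothetical toy data: (family_label, feature_value) pairs.

    Assumption: a single float stands in for a protein sample; real inputs
    would be sequences or structures.
    """
    rng = random.Random(seed)
    centers = {"family_a": -5.0, "family_b": 0.0, "family_c": 5.0}
    return [(fam, rng.gauss(mu, 0.5))
            for fam, mu in centers.items()
            for _ in range(200)]


class GaussianBranch:
    """Stand-in for one per-branch generative model (NOT a diffusion model)."""

    def fit(self, values):
        # Estimate the branch's own distribution from its subset only.
        self.mean = statistics.fmean(values)
        self.std = statistics.stdev(values)
        self.count = len(values)
        return self

    def sample(self, rng):
        return rng.gauss(self.mean, self.std)


def fit_branches(dataset):
    """Group samples by family label and fit one model per group."""
    groups = defaultdict(list)
    for family, value in dataset:
        groups[family].append(value)
    # Each fit uses only its own group's data, so the fits are independent.
    return {family: GaussianBranch().fit(values)
            for family, values in groups.items()}


def sample_from_branches(branches, n, seed=1):
    """Draw n new samples, picking a branch in proportion to its data size."""
    rng = random.Random(seed)
    families = list(branches)
    weights = [branches[f].count for f in families]
    samples = []
    for _ in range(n):
        family = rng.choices(families, weights=weights, k=1)[0]
        samples.append((family, branches[family].sample(rng)))
    return samples


dataset = make_toy_dataset()
branches = fit_branches(dataset)
samples = sample_from_branches(branches, 300)
```

Because each branch in this sketch is fitted on its own subset without reference to the others, the fits could in principle run concurrently, which is one plausible reading of the parallel training mentioned in the abstract.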
Keywords
Protein Structure Prediction, Generative Model, Diffusion Model, Generative Adversarial Network, Deep Learning
Start Date
10-4-2026 4:25 PM
Location
CASB 102
End Date
10-4-2026 4:40 PM