Research Thesis
Title
Computational Intelligence in the Discovery of Natural Products from Agriculturally Important Metagenomics Data
Objectives
1. To develop an approach for binning of metagenomics data using advanced learning techniques
2. To develop a framework based on AI/ML techniques for identifying the natural products in the bins of metagenomics data
3. To identify the natural products in agriculturally important metagenomics data using the developed approaches
Abstract
Microorganisms are diverse, invisible to the naked eye, and ecologically important: they drive biosphere processes, but they also cause plant diseases with serious agricultural consequences. Metagenomics uses high-throughput sequencing to study microbial communities and overcomes the limitations of traditional methods. In agriculturally important metagenomics, DNA is sequenced directly from soil, plants, and cattle, revealing previously hidden microbial diversity and genetic components. Reconstructing individual genomes from the resulting complex mixture of DNA sequences nevertheless remains challenging. Binning, which groups sequences into clusters corresponding to individual genomes or taxonomically related groups, lays the foundation for identifying Natural Products (NPs) and is a pivotal first step in extracting insight from metagenomic data. NPs are organic compounds synthesized by living organisms, including bacteria, fungi, plants, and marine life, and their applications range from medicine to agriculture. In microbes, NPs are often produced by biosynthetic gene clusters (BGCs) within the genome. Identifying NPs and their associated BGCs is therefore a central task in metagenomics, offering the prospect of discovering novel compounds with potential agricultural applications. Computational intelligence techniques, which enable efficient analysis of metagenomics data and prediction of NPs, have become indispensable tools for this task. This study introduces new approaches for binning metagenomics data and for identifying NPs, and applies them to an agriculturally important metagenomics dataset.
Two novel binning strategies were introduced, based on Deep Embedded Clustering (DEC) and Variational Autoencoders (VAE). Both outperformed the existing unsupervised methods and were on par with semi-supervised techniques, with DEC excelling in cluster quality and VAE achieving a high silhouette index. For NP identification from the bins of metagenomics data, this research presents a comprehensive approach to BGC identification. The study focuses on five NP classes: Polyketide Synthase (PKS), Non-Ribosomal Peptide Synthetase (NRPS), Ribosomally synthesized and Post-translationally modified Peptides (RiPP), Terpenes, and Hybrid PKS-NRPS. Data were gathered from the MIBiG database in GBK format. Protein sequences were extracted from each file, and sequences under the same BGC ID were combined. Physicochemical properties were calculated, and sequence embeddings specific to each NP class were generated using NLP techniques (CountVec, TFIDF, and Word2Vec). An integrated feature matrix was created by merging the physicochemical properties with the generated embeddings. This matrix was used to train and test nine ML models: Logistic Regression (LR), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), Artificial Neural Network (ANN), and Categorical Boosting (CatBoost). The study compared training with and without SMOTE data balancing and used Grid Search for parameter optimization. Three embeddings combined with two balancing settings gave six datasets, and nine models per dataset gave 54 models in total. The LR model using TFIDF with SMOTE was the most effective, achieving an accuracy of 0.96, an AUC of 0.9912, and other strong metrics. Based on the proposed approach, we developed an AI-based tool, NaturePred (http://login1.cabgrid.res.in:5101/), for NP class prediction and protein physicochemical property calculation.
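The feature-integration step and the experimental grid described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the pipeline used in the thesis: the abstract does not say how sequences were tokenized, so overlapping 3-mers are assumed here; the three descriptors (length, mean Kyte-Doolittle hydropathy, charged-residue fraction) stand in for whatever physicochemical properties were actually computed; the TF-IDF weighting is one common smoothed variant; and the toy sequences and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def kmers(seq, k=3):
    """Split a protein sequence into overlapping k-mer 'words' (assumed tokenization)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_matrix(token_lists):
    """Return (vocabulary, rows) using a smoothed TF-IDF weighting."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    # Document frequency: number of sequences containing each k-mer.
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against sequences shorter than k
        rows.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, rows


def physchem(seq):
    """Illustrative descriptors: length, mean hydropathy, charged-residue fraction."""
    gravy = sum(KD[a] for a in seq) / len(seq)
    charged = sum(seq.count(a) for a in "DEKR") / len(seq)
    return [float(len(seq)), gravy, charged]


def integrated_features(seqs, k=3):
    """Concatenate physicochemical descriptors with TF-IDF k-mer weights per sequence."""
    vocab, rows = tfidf_matrix([kmers(s, k) for s in seqs])
    return vocab, [physchem(s) + row for s, row in zip(seqs, rows)]


# Hypothetical toy sequences, not taken from MIBiG.
toy_seqs = ["MKTAYIAKQR", "MKTLLVAGGA", "GGAVLKTAYI"]
vocab, X = integrated_features(toy_seqs)

# Experimental grid from the abstract: 3 embeddings x 2 balancing settings x 9 models.
embeddings = ["CountVec", "TFIDF", "Word2Vec"]
balancing = ["SMOTE", "no SMOTE"]
models = ["LR", "NB", "DT", "RF", "KNN", "XGBoost", "SVM", "ANN", "CatBoost"]
datasets = list(product(embeddings, balancing))
runs = [(e, b, m) for (e, b) in datasets for m in models]

print(f"{len(X)} sequences, {len(X[0])} features each")
print(f"{len(datasets)} datasets, {len(runs)} model runs")
```

Each row of `X` is one combined BGC sequence; in a real run, a matrix of this shape would be passed to a classifier, optionally after balancing, for each of the 54 combinations.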
The approach was applied to a real agriculturally important metagenomics dataset collected from the mustard rhizosphere soil in the Mau district of Uttar Pradesh (UP). The analysis revealed a high proportion (more than 40%) of Ribosomally synthesized and Post-translationally modified Peptides (RiPPs), suggesting robust plant-microbiome interactions and good soil health. By combining new binning strategies, NLP techniques, and machine learning, this study lays a foundation for future work in agricultural and microbial research. AI tools such as NaturePred can help uncover agriculturally useful compounds that remain hidden within microbial genomes.
Publications (4)
A Deep Clustering-based Novel Approach for Binning of Metagenomics Data.
Sharanbasappa, Dwijesh Chandra Mishra, Anu Sharma, Sanjeev Kumar, Arpan Maji K, Neeraj Budhlakoti, Dipro Sinha, Anil Rai
A New Insight into Binning of Metagenomics Data Using Unsupervised Deep Learning Approaches.
Sharanbasappa, Dwijesh Chandra Mishra, Anu Sharma, Neeraj Budhlakoti, Sudhir Srivastava, Mohammed Samir Farooqi, Ulavappa Angadi, Krishna Kumar Chaturvedi
NaturePred: A Tool for Revolutionizing Natural Product Classification with Artificial Intelligence.
Madival, S. D., Mishra, D. C., Chaturvedi, K. K., Sharma, A., Budhlakoti, N., Angadi, U. B., ... & Jha, G. K. (2024).
RFBGCpred: A Random forest based tool for prediction of biosynthetic gene clusters.
Sharanbasappa D. Madival, Dwijesh Chandra Mishra, Krishna Kumar Chaturvedi, Neeraj Budhlakoti, Mohammad Samir Farooqi, Sudhir Srivastava, Anu Sharma, Shivadarshan S. Jirli, Alka Arora, Girish K. Jha, Shesh N. Rai