Publications

Thesis Student Research

A Study on Development of Artificial Intelligence-Based Methodology for Identification of Copy Number Variation in Crops

Author: Parinita Das

2020-21

Copy number variations (CNVs), which comprise deletions and duplications of DNA segments, are critical genomic features that influence gene expression, adaptation, and phenotypic variation. These structural variations play a pivotal role in genome evolution, trait expression, and environmental adaptability across plants. This research introduces MLDeCNV, a novel machine learning-based framework for the accurate detection and interpretation of CNVs in next-generation sequencing (NGS) data. Traditional CNV detection methods often struggle with small CNVs or those in regions with low read-depth signals, leading to incomplete detection. To overcome these challenges, MLDeCNV integrates 32 features derived from NGS data and combines outputs from multiple CNV detection tools with experimental validation using PCR and array comparative genomic hybridization (aCGH). A key aspect of the framework is the SMOTE-Tomek Links data-balancing technique, which improves the model’s accuracy by addressing the class imbalance common in CNV prediction. MLDeCNV outperforms existing CNV detection tools such as Delly, CNVnator, and Manta, and performs robustly across species, including rice, Arabidopsis, and pomegranate, with an AUC of 0.96. The study also demonstrates the practical utility of MLDeCNV through a web-based tool that accepts standard genomic inputs and offers an intuitive interface for classifying regions as deletions, duplications, or no CNV. Additionally, the research presents a genome-wide analysis of CNVs in black pepper and bitter gourd, uncovering thousands of CNVs and mapping them to critical agronomic traits such as stress resilience and pathogen defense.
This work contributes to agricultural genomics by showing how CNVs can be leveraged for crop improvement, marker-assisted breeding, and understanding species adaptation. The findings underscore the value of integrating CNV data with genome-wide association studies (GWAS) to identify loci linked to key traits, positioning MLDeCNV as a valuable resource for genomic research in agriculture and evolutionary biology.
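
The data-balancing step named in the abstract can be illustrated with a small sketch. This is not the MLDeCNV implementation: it is a minimal, pure-Python illustration of the general SMOTE plus Tomek links idea on a two-class toy problem, and the function names, the two-dimensional feature vectors, and the class labels are all assumptions made for the example (the thesis uses 32 features and three classes). It needs at least two minority samples.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: euclidean(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + t * (m - b) for b, m in zip(base, mate)))
    return synthetic


def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours
    but carry different class labels."""
    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: euclidean(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def smote_tomek(points, labels, minority_label, seed=0):
    """Oversample the minority class to parity, then drop both members
    of every Tomek link to clean the class boundary."""
    minority = [p for p, l in zip(points, labels) if l == minority_label]
    n_new = max(len(points) - 2 * len(minority), 0)
    new_points = smote(minority, n_new, seed=seed)
    pts = list(points) + new_points
    lbs = list(labels) + [minority_label] * len(new_points)
    drop = {i for pair in tomek_links(pts, lbs) for i in pair}
    keep = [i for i in range(len(pts)) if i not in drop]
    return [pts[i] for i in keep], [lbs[i] for i in keep]
```

In this sketch both members of a Tomek link are removed; some variants remove only the majority-class member, and the abstract does not say which was used.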


Thesis Student Research

Development of Hapmap Database and Visualization Tool for Tea

Author: Dipankar Mandal

2022-23


Thesis Student Research

Integrating GWAS Module with HtP-DAP for SNP-trait Associations Mining

Author: Surapuram Aswini

2022-23

Genome-wide association studies (GWAS) are a key methodology for identifying genetic variants associated with traits. They are important for understanding the genetic basis of complex traits, which can aid in improving crop performance, human health, and livestock breeding. This thesis integrates a GWAS analysis tool into the existing phenomics data analysis platform, HtP-DAP, to streamline GWAS workflows. The tool addresses key challenges in GWAS by offering robust preprocessing capabilities, including data filtering based on allele frequency thresholds, imputation of missing genotypic data, and file conversion to ensure compatibility with various analysis pipelines. A major feature of the tool is its set of relatedness analysis functions, which include kinship estimation, principal component analysis (PCA), and multi-dimensional scaling (MDS). These analyses give insight into the genetic structure of the population, which supports more accurate GWAS results. The GWAS analysis itself is flexible, supporting both single-locus models, which test individual markers for trait associations, and multi-locus models, which consider multiple markers simultaneously. Result visualization is a key component of the tool. Users can generate Manhattan plots to highlight significant associations, circular Manhattan plots for a more compact genome-wide view, and Q-Q plots to assess the quality of the GWAS results, all in a form suitable for publication or further research. The tool’s backend uses the GAPIT R package, known for its efficiency and scalability in handling large genomic datasets.
GAPIT manages the computational load of the GWAS analyses, so the tool continues to perform well on large-scale datasets. By incorporating this GWAS tool within the HtP-DAP platform, this study bridges the gap between phenotypic data from high-throughput phenotyping and genotypic data from modern genomic studies. The integration supports a holistic approach to genetic research, allowing users to move from data collection to meaningful biological insights within a single platform.
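
The preprocessing steps mentioned above (allele frequency filtering and imputation of missing genotypes) can be sketched as follows. This is an illustrative example only, not the HtP-DAP or GAPIT code: the 0/1/2 genotype coding, the use of None for missing calls, mean imputation, the 0.05 threshold, and all function names are assumptions chosen for the sketch.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency of one marker.

    genotypes: alternate-allele counts per individual (0, 1 or 2),
    with None marking a missing call (assumed diploid coding).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def impute_mean(genotypes):
    """Replace missing calls with the mean of the observed calls.
    Mean imputation is one simple choice among many."""
    called = [g for g in genotypes if g is not None]
    mean = sum(called) / len(called) if called else 0.0
    return [mean if g is None else g for g in genotypes]


def filter_by_maf(markers, threshold=0.05):
    """Keep markers whose minor allele frequency meets the threshold.

    markers: dict mapping marker name -> list of genotype calls.
    """
    return {
        name: calls
        for name, calls in markers.items()
        if minor_allele_frequency(calls) >= threshold
    }


# Toy genotype matrix: three markers scored on four individuals.
markers = {
    "snp1": [0, 0, 0, 0],     # monomorphic, removed by the filter
    "snp2": [0, 1, 2, None],  # one missing call
    "snp3": [2, 2, 2, 1],
}
kept = filter_by_maf(markers)
imputed = {name: impute_mean(calls) for name, calls in kept.items()}
```

Filtering before imputation, as done here, avoids imputing markers that will be discarded anyway; the abstract does not specify the order the tool uses.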


Thesis Student Research

Development of Computational Tool for Mining Intron Length Polymorphism Markers and Designing Primers

Author: Soumya Shivamurti

2022-23


Thesis Student Research

Standardizing Workflow for Identifying Stress-Tolerance Contributing Non-Coding RNAs in Vigna and Developing a Comprehensive ncRNA Database for Legumes

Author: Ashok S

2022-23


Thesis Student Research

A Study on Machine Learning Based Approach for long non-coding RNA Subcellular Localization Prediction

Author: Baibhav Kumar

2019-20


Thesis Student Research

Web Tool for CRISPR/Cas9 off target prediction in Plants

Author: Abhishek Anand

2021-22


Thesis Student Research

Discovery of Molecular Markers and Development of Database for Rice Bean

Author: Ravi

2021-22


Thesis Student Research

Novel and efficient Pipeline for Metagenomics Binning

Author: Subham Ghosh

2021-22

Metagenomics examines communities of microorganisms, and a pivotal task in the field is piecing together the genetic makeup of the distinct organisms present. This task is challenging because many organisms are difficult to isolate and clone under in vitro conditions. Metagenomics is also termed environmental genomics, eco-genomics, or community genomics. Reconstructing the fragmented sequences obtained from shotgun sequencing relies heavily on genome assembly. However, a significant hurdle arises when attempting to separate and reassemble genomes from different organisms, because the genomes differ in abundance and their reads are intermingled. Shotgun sequencing produces reads that contain fragments from the genomes of diverse microorganisms. To make reconstruction possible, these reads must be classified into separate bins corresponding to distinct microorganisms. Various clustering techniques have emerged for categorizing these intertwined genomes. These techniques include binning, boosting, bagging, and stacking, of which binning is currently the most widely used. In other words, genomes are grouped into operational taxonomic units (OTUs) to support subsequent taxonomic profiling and functional analysis, and this process of OTU clustering is commonly referred to as binning. Binning employs a variety of clustering methods, such as k-means, k-medoids, Hidden Markov Models (HMMs), and hierarchical clustering, each of which has its own limitations. None of the existing methods uses motif-based binning.
This thesis proposes an approach to metagenomic binning that constructs a frequency table of motifs, or segments, identified by gapped local alignment, with the constraint that aligned segments must not overlap. K-means, PAM, and DBSCAN clustering were applied to cluster the contigs based on these segments and motifs, and K-means performed best. The Rand index for this approach tends to 1, indicating that it is well suited to metagenomic binning, and it also performs better than the existing binning tools MaxBin and MetaBAT. The approach leaves considerable scope for extension: more advanced clustering methods could replace simple K-means, and features such as GC content and tetranucleotide frequency could be added to improve performance. The approach also highlights mutations and conserved regions, which are essential for understanding evolutionary biology.
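
Two pieces of the pipeline described above lend themselves to a short illustration: building a per-contig frequency table of non-overlapping segments, and scoring a clustering with the Rand index. The sketch below is a simplified stand-in, not the thesis pipeline: it counts exact motif matches rather than gapped local alignments, and the function names, motifs, and toy contigs are assumptions made for the example.

```python
from itertools import combinations


def motif_frequency_table(contigs, motifs):
    """Count non-overlapping exact occurrences of each motif per contig.

    contigs: dict mapping contig name -> nucleotide sequence.
    Returns a dict mapping contig name -> list of counts, one per motif.
    str.count already counts non-overlapping occurrences.
    """
    return {
        name: [sequence.count(motif) for motif in motifs]
        for name, sequence in contigs.items()
    }


def rand_index(true_labels, predicted_labels):
    """Fraction of item pairs on which two labelings agree, where
    agreement means both place the pair in the same cluster or both
    place it in different clusters. A value of 1 is perfect agreement."""
    if len(true_labels) != len(predicted_labels) or len(true_labels) < 2:
        raise ValueError("need two labelings of equal length >= 2")
    pairs = list(combinations(range(len(true_labels)), 2))
    agreements = sum(
        (true_labels[i] == true_labels[j])
        == (predicted_labels[i] == predicted_labels[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy example: the rows of the table are the feature vectors that a
# clustering method such as K-means would receive.
table = motif_frequency_table(
    {"contig1": "ATATAT", "contig2": "GGGG"},
    ["ATAT", "GG"],
)
```

The Rand index ignores how clusters are named, so a clustering that recovers the true bins under different labels still scores 1.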


Thesis Student Research

A Semi-Supervised Approach For Binning Of Metagenomics Data

Author: Deeksha P M

2021-22


Thesis Student Research

Computational Intelligence in the Discovery of Natural Products from agriculturally important Metagenomics Data

Author: Sharanbasappa

2020-21

Microorganisms are diverse, microscopic, and ecologically important organisms that drive biosphere processes, while also imposing constraints on agriculture in the form of plant diseases. This work explores metagenomics, a field that overcomes the limitations of traditional methods through high-throughput sequencing. Agriculturally significant metagenomics, characterized by direct DNA sequencing from soil, plants, and cattle, reveals previously hidden microbial diversity and genetic components. Despite this, reconstructing individual genomes from the complex mixture of DNA sequences remains challenging. Binning, which groups sequences from diverse microorganisms into genomes or taxonomically related groups, lays the foundation for identifying Natural Products (NPs) and is a pivotal preliminary step for extracting insights from metagenomic data. NPs are organic compounds synthesized by living organisms, including bacteria, fungi, plants, and marine life, and have a vast range of applications, from medicine to agriculture. NPs are often produced by biosynthetic gene clusters (BGCs) within microbial genomes. Identifying these NPs and their associated BGCs is a central task in metagenomics, offering the prospect of discovering novel compounds with potential agricultural applications. Computational intelligence techniques, which enable efficient analysis of metagenomics data and prediction of NPs, have become indispensable tools in this endeavor. This study introduces approaches for binning metagenomics data and identifying NPs, and applies these methodologies to agriculturally significant metagenomics datasets.
Two novel binning strategies were introduced, Deep Embedded Clustering (DEC) and Variational Autoencoders (VAE). Both outperformed the existing unsupervised methods and were on par with semi-supervised techniques, with DEC excelling in cluster quality and VAE achieving a high silhouette index. For NP identification from the bins of metagenomics data, this research presents a comprehensive approach to BGC identification. The study focuses on five NP classes: Polyketide Synthase (PKS), Non-Ribosomal Peptide Synthetase (NRPS), Ribosomally synthesized and Post-translationally modified Peptides (RiPP), Terpenes, and Hybrid PKS-NRPS. Data was gathered from the MIBiG database in GBK format. Protein sequences were extracted from each file, and sequences under the same BGC ID were combined. Physicochemical properties were calculated, and sequence embeddings were generated for each NP class using NLP techniques, namely CountVec, TFIDF, and Word2Vec. An integrated feature matrix was created by merging the physicochemical properties with the generated embeddings. This matrix was then used for training and testing nine ML models: Logistic Regression (LR), Naïve Bayes (NB), Decision Tree (DT), Random Forests (RF), K-Nearest Neighbors (KNN), Extreme Gradient Boosting (XGBoost), Support Vector Machines (SVM), Artificial Neural Networks (ANN), and Categorical Boosting (CatBoost). The study explored data balancing, comparing results with and without SMOTE, and employed Grid Search for parameter optimization. Combining the three embeddings with the two balancing settings gave six datasets and, across the nine models, 54 trained models. The LR model, using TFIDF with SMOTE, emerged as the most effective, achieving an accuracy of 0.96, an AUC of 0.9912, and other strong metrics. With the proposed approach, we developed an AI-based tool called NaturePred (http://login1.cabgrid.res.in:5101/) for NP class prediction and protein physicochemical property calculation.
Applied to a real agriculturally important metagenomics dataset collected from mustard rhizosphere soil in the Mau district of Uttar Pradesh (UP), the study reveals a rich presence of RiPPs, at more than 40%, signalling robust plant-microbiome interactions and soil health. By combining innovative binning strategies, advanced NLP techniques, and machine learning, this study lays a foundation for future advances in agriculture and microbial research. The integration of AI tools, exemplified by NaturePred, promises to unlock untapped agricultural potential and to uncover compounds still hidden within microbial genomes.
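
The sequence-embedding and experiment-grid steps described above can be sketched in a few lines. This is a hedged illustration, not the NaturePred code: treating overlapping 3-mers of a protein sequence as "words", the particular smoothed IDF formula, and all function and variable names are assumptions made for the example. The grid at the end simply reproduces the arithmetic behind the six datasets and 54 models reported in the abstract.

```python
import math
from collections import Counter
from itertools import product


def kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def tfidf_matrix(sequences, k=3):
    """Return (vocabulary, matrix) of TF-IDF weights for k-mer words.

    TF is the k-mer count divided by the number of k-mers in the
    sequence; IDF is the smoothed form ln((1 + n) / (1 + df)) + 1.
    """
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocabulary = sorted(set().union(*docs))
    n = len(docs)
    doc_freq = {t: sum(1 for d in docs if t in d) for t in vocabulary}
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    matrix = []
    for d in docs:
        total = sum(d.values()) or 1  # guard against sequences shorter than k
        matrix.append([d[t] / total * idf[t] for t in vocabulary])
    return vocabulary, matrix


# Toy protein fragments (hypothetical, for illustration only).
vocabulary, matrix = tfidf_matrix(["MKVL", "MKVA"])

# Experiment grid from the abstract: 3 embeddings x 2 balancing
# settings = 6 datasets, each used to train 9 models = 54 models.
embeddings = ["CountVec", "TFIDF", "Word2Vec"]
balancing = ["SMOTE", "no SMOTE"]
models = ["LR", "NB", "DT", "RF", "KNN", "XGBoost", "SVM", "ANN", "CatBoost"]
datasets = list(product(embeddings, balancing))
runs = list(product(datasets, models))
```

In the full workflow the TF-IDF columns would be concatenated with the physicochemical properties to form the integrated feature matrix.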


Thesis Student Research

Development of an approach for identification of core microbiome

Author: Sorna A M

2021-22

