Publications
Browse our research publications and academic works
Development of Computational Approach for Natural Products in Plants
Author: Nimai C. Mahanandia
2020-21
Scientific research on plant-derived natural products has grown rapidly in recent years, and new natural compounds with important therapeutic uses are reported regularly in the scientific literature. These natural products have been used to treat a wide range of diseases, from infection to cancer, as well as to combat pests in various crops. In this doctoral thesis, a database of plant-derived natural products for crop protection, “NatProCP,” has been developed to serve as a comprehensive resource for researchers by providing detailed information on natural products. The database contains information on 5,281 unique natural compounds from 262 plant species. It includes data on medicinal and drug-likeness properties of the natural compounds, along with their 2D and 3D structures. The natural compounds were collected using text mining approaches, and their 3D structures were optimized with the MMFF94 force field via an in-house Python script. The database also provides information on the antifungal and antiviral potency of these compounds, which can be used for virtual screening studies. The primary purpose of the database is to provide a library of natural compounds for virtual screening studies against various therapeutic proteins. Additionally, a machine learning based virtual screening web server (ML-VSPred) was developed for predicting the binding affinity scores of protein-ligand complexes. In this study, eight machine learning (ML) regression models, namely linear regression (LR), random forest (RF), decision tree (DT), support vector regression (SVR), polynomial SVR (PSVR), XGBoost (XGB), gradient boosting regression (GBR) and deep neural network (DNN), were trained on various protein-ligand structural features derived from the PDBbind dataset to predict protein-ligand binding affinity scores.
The results show that the XGBoost model (R² = 0.84 ± 0.012) performed best, followed by GBR, RF, DNN, DT, LR, PSVR and SVR. The ML-VSPred prediction server accurately screened natural compounds against various target proteins. This prediction web server offers a valuable resource for advancing research in natural product-based compound discovery and crop protection. Thereafter, the SDH1 protein of M. oryzae, the fungal pathogen that causes rice blast, was selected for screening against the NatProCP database using the ML-VSPred web server, followed by docking and MD simulation to find natural compounds that inhibit the SDH1 protein. The MM-PBSA method was used to perform the binding free energy analysis. Two compounds, Quercetin and Cinchonine, were identified, showing strong binding affinity with binding free energy (ΔG_bind) values of -89.27 kJ/mol and -82.03 kJ/mol, respectively, compared with the reference compound Azoxystrobin (-76.82 kJ/mol). These in-silico findings can be further validated through biochemical and structural investigation to explore the potential of these natural compounds for targeting the M. oryzae receptor protein. Keywords: NatProCP, MMFF94, ML-VSPred, linear regression (LR), random forest (RF), decision tree (DT), support vector regression (SVR), polynomial SVR (PSVR), XGBoost (XGB), gradient boosting regression (GBR), deep neural network (DNN), PDBbind, SDH1, docking, MD simulation, MM-PBSA.
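As a small worked example of the comparison reported in this abstract, the sketch below ranks the three compounds by their MM-PBSA binding free energy and shows the difference from the reference compound. The energy values come from the abstract; the function name and the conversion to kcal/mol are illustrative additions, not part of the thesis workflow.

```python
# Binding free energies (kJ/mol) as reported in the abstract.
DELTA_G_KJ = {
    "Quercetin": -89.27,
    "Cinchonine": -82.03,
    "Azoxystrobin": -76.82,  # reference compound
}
KJ_PER_KCAL = 4.184  # thermochemical calorie


def rank_against_reference(energies, reference):
    """Return (name, dG, dG - dG_ref) rows, most negative dG first."""
    ref = energies[reference]
    rows = [(name, dg, round(dg - ref, 2)) for name, dg in energies.items()]
    return sorted(rows, key=lambda row: row[1])


ranking = rank_against_reference(DELTA_G_KJ, "Azoxystrobin")
for name, dg, diff in ranking:
    kcal = dg / KJ_PER_KCAL
    print(f"{name:13s} {dg:8.2f} kJ/mol ({kcal:7.2f} kcal/mol)  vs reference: {diff:+.2f}")
```

A more negative value indicates more favourable predicted binding, so both candidates rank ahead of the reference in this comparison.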
CRISPR-GATE: a one-stop repository and guide to computational resources for genome editing experimentation
Author: Asif Ali Vadakkethil, Sonali Panda, Aranya Mitra, Manaswini Dash, Mirza J. Baig, Ulavappa B. Angadi, Dinesh Kumar, Sarika Jaiswal, Mir Asif Iquebal, Kutubuddin A. Molla
2025
Deciphering microbial diversity and predicting metabolic functionalities in fermented pigmented rice water using culture-independent characterization
Author: Shruti Mishra, Asif Ali Vadakkethil, Mir Asif Iquebal, Sarika Jaiswal, Dinesh Kumar, Bhim Pratap Singh, Said Ajlouni, C. Senaka Ranadheera, S. Chakkaravarthi
2025
Development of Multiclass Model to Classify the Major Group of Microbes from Metagenomic Data
Author: Rajarshi Mondal
2023-24
Metagenomics involves the analysis of genetic material directly extracted from environmental samples, enabling the study of microbial communities without the need for isolation or cultivation. However, the complexity and diversity of microbial genomes, along with varying abundance levels, make accurate classification and assembly of sequences a challenging task. Shotgun sequencing generates short reads from mixed microbial genomes, necessitating taxonomic binning for effective profiling and functional analysis. This study proposes a taxonomy-dependent classification framework based on marker genes to improve binning accuracy. Feature extraction and selection were performed on marker sequences, followed by training machine learning models using 10-fold cross-validation. A two-stage hierarchical classification scheme was designed to address misclassification biases caused by genome size differences among viruses, bacteria, and fungi. The first stage distinguishes viral sequences, while the second stage classifies bacterial and fungal sequences. Results show that the proposed framework effectively reduces misclassification and improves taxonomic assignment balance. Logistic Regression was the best-performing model, with an overall accuracy of 85.06%. Class-wise analysis showed strong predictive power for bacteria and fungi, but lower performance for viruses. A web server named MultiMetaCC was implemented to make the model accessible (https://rajmondal.shinyapps.io/MultiMetaCC/). The approach demonstrates robust performance and offers flexibility for enhancement through the integration of ensemble methods (e.g., Random Forest, XGBoost) and deep learning models (e.g., CNNs, RNNs) to capture complex patterns and relationships. This method lays a foundation for more refined, scalable, and accurate classification in metagenomic studies.
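The two-stage scheme described in this abstract can be sketched as a simple dispatch: a first model decides whether a sequence is viral, and only non-viral sequences reach the second model. The classifiers below are hypothetical stand-ins with made-up thresholds; the thesis trains real models on marker-gene features, which are not reproduced here.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def two_stage_predict(
    features: Features,
    is_viral: Callable[[Features], bool],
    bacteria_or_fungi: Callable[[Features], str],
) -> str:
    """Stage 1 separates viral sequences; stage 2 labels the remainder."""
    if is_viral(features):
        return "virus"
    return bacteria_or_fungi(features)


# Hypothetical stand-ins for trained models (thresholds are invented).
def toy_stage1(x: Features) -> bool:
    return x[0] < 0.2


def toy_stage2(x: Features) -> str:
    return "bacteria" if x[1] < 0.5 else "fungi"


samples = [(0.1, 0.9), (0.6, 0.3), (0.7, 0.8)]
labels = [two_stage_predict(x, toy_stage1, toy_stage2) for x in samples]
print(labels)
```

Separating the viral decision from the bacteria/fungi decision lets each stage be trained and tuned on its own class balance, which is the motivation the abstract gives for the hierarchy.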
Advanced Statistical Approach for Metagenomics Analysis Addressing Data Heterogeneity and Covariates
Author: Shylin Joe S
2023-24
Metagenomics is the direct genetic analysis of genomes contained within an environmental sample. Data heterogeneity and covariates are two main challenges in the statistical analysis of metagenomics data. The core microbiome comprises the microbial taxa that are consistently present in a particular environment; it maintains plant health, ecosystem stability, and various biological functions. Differential abundance analysis aims to identify taxa whose abundances vary significantly across conditions. Several tools and packages are available for core microbiome identification and differential abundance analysis, but each has its own limitations. This study addresses these gaps by introducing an innovative approach for core microbiome identification and differential abundance analysis and implementing it as a user-friendly web tool. In this study, Arabidopsis thaliana core root microbiome data have been used as a demo dataset. The developed approach entails multiple phases involving filtering, normalization, exploratory analysis, diversity analysis, core microbiome identification, testing the significance of the identified core, differential abundance analysis, adjusting effects of covariates, and visualization of results. To mitigate data heterogeneity, five filtering methods (abundance, occurrence, abundance and occurrence, membership, and hard cut-off filter) and eleven normalization methods (TMM, TMMwsp, RLE, GMPR, TSS, CSS, CLR, SRS, upperquartile, rrarefy, and invlogit) are provided. By revealing condition-specific microbial patterns, the identification of the core microbiome by group improves biological understanding, functional significance, and targeted applications. The significance of the identified core can be tested using four statistical methods (F-test, Kruskal-Wallis test, Levene’s test, and Fligner-Killeen test).
Further, this tool supports exploratory analysis (boxplot, density plot, and MDS plot) and diversity analysis such as alpha diversity (richness, evenness, Shannon and Simpson indices) and beta diversity (Bray-Curtis, Jaccard, and Euclidean). For differential abundance analysis, various statistical tests such as exact test, quasi-likelihood ratio test and quasi-likelihood F test have been provided, along with the options for covariate adjustment and multiple testing correction. Finally, a web tool for Core Microbiome Identification and Differential abundance analysis (CoreMDA) has been developed which is freely accessible at https://dabin-iasri.shinyapps.io/CoreMDA/. This is an interactive tool which allows researchers to perform core microbiome identification and differential abundance analysis using customized workflows based on user-defined objectives by uploading the datasets.
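To make a few of the steps named in this abstract concrete, here is a minimal sketch of TSS normalization, the Shannon and Simpson indices, Bray-Curtis dissimilarity, and an occurrence-based core filter. It uses the textbook definitions; the function names, the 0.8 occurrence cut-off, and the toy table are assumptions for illustration and are not taken from CoreMDA.

```python
import math


def tss(counts):
    """Total sum scaling: convert raw counts to relative abundances."""
    total = sum(counts)
    return [c / total for c in counts]


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero abundance."""
    return -sum(p * math.log(p) for p in tss(counts) if p > 0)


def simpson(counts):
    """Simpson index, reported here as 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in tss(counts))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples of counts."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def core_taxa(table, min_occurrence=0.8):
    """Taxa detected (count > 0) in at least min_occurrence of the samples."""
    n_samples = len(next(iter(table.values())))
    return sorted(
        taxon
        for taxon, row in table.items()
        if sum(c > 0 for c in row) / n_samples >= min_occurrence
    )


# Toy taxon-by-sample count table (made-up numbers).
table = {
    "taxonA": [10, 12, 9, 11, 8],
    "taxonB": [0, 5, 0, 0, 1],
    "taxonC": [3, 0, 4, 6, 2],
}
print(core_taxa(table))
```

An occurrence filter of this kind is only one of the five filtering options the abstract lists; abundance-based and combined filters would add a threshold on the counts themselves.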
Multi-Omics Approaches for Drug Design and its Application in Animal Science
Author: Mamatha Y S
2020-21
Multi-omics approaches have transformed the understanding of complex molecular mechanisms by integrating diverse datasets. Key genes, as central regulators within molecular networks, offer critical insights into disease mechanisms, biomarkers, and therapeutic targets. However, single-omics studies fail to capture cross-omics interactions, necessitating integrative approaches. Despite this need, few studies have identified key genes using multi-omics data. To address this gap, we introduce MultiKey, an R Shiny application (https://iasri.shinyapps.io/multikey) that integrates genetic, DNA methylation, and proteomic datasets to identify biologically significant key genes. MultiKey employs correlation matrices and precision matrix estimation to construct correlation-based networks while preserving biologically meaningful interactions. Centrality measures identify disease- and control-specific key genes, with validation via bootstrapping. We demonstrate MultiKey’s effectiveness using simulated datasets and a case study on Johne's disease, a chronic intestinal condition in ruminants, revealing key genes linked to disease progression and potential therapeutic targets. Expanding on this multi-omics framework, we applied the MultiKey methodology to bovine mastitis, a major challenge in the dairy industry caused by Staphylococcus aureus. By integrating DNA methylation, transcriptomics, and proteomics, we identified key genes and pathways involved in host-pathogen interactions. Clumping Factor A (ClfA), a key S. aureus virulence factor, emerged as a promising drug target. Molecular docking and dynamics simulations revealed stable binding interactions between ClfA and bovine host proteins, validated through MM/PBSA free energy calculations. To identify potential ClfA inhibitors, a library of 52 natural compounds was screened using structure-based virtual screening and molecular docking. 
Among them, Oridonin and Salvianolic acid A exhibited the strongest binding affinities (≤ -8.0 kcal/mol) and favourable ADMET properties. MD simulations confirmed the stability of these interactions. These findings suggest their potential for preclinical evaluation as novel therapeutics for bovine mastitis. This study underscores the power of multi-omics integration in advancing systems biology and precision medicine. By combining computational methodologies with natural product screening, we provide a pathway for targeted drug discovery, reducing antibiotic resistance and improving dairy productivity.
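The idea of ranking genes by centrality in a correlation-based network can be illustrated with a simplified sketch. It thresholds plain Pearson correlations and computes degree centrality; this is a deliberate simplification, since MultiKey as described also uses precision matrix estimation and bootstrapping, which are omitted. The gene names, profiles, and the 0.8 threshold are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def degree_centrality(profiles, threshold=0.8):
    """Fraction of other genes linked to each gene by |r| >= threshold."""
    genes = list(profiles)
    degree = {g: 0 for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(profiles[g], profiles[h])) >= threshold:
                degree[g] += 1
                degree[h] += 1
    return {g: d / (len(genes) - 1) for g, d in degree.items()}


# Made-up profiles across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.1, 2.0, 0.9],
    "geneD": [1.0, -1.0, 1.0, -1.0],
}
centrality = degree_centrality(profiles)
for gene, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(gene, round(score, 2))
```

Genes with the highest centrality are the candidates for "key genes" in this simplified view; comparing the rankings obtained from disease and control networks separately would mirror the disease- and control-specific analysis mentioned in the abstract.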
Development of artificial intelligence-based fish specific long non-coding RNA biomarkers discovery tool and web genomic resource
Author: Jutan Das
2019-20
Long noncoding RNAs (lncRNAs) are a subclass of RNA molecules longer than 200 nucleotides that do not encode proteins. However, they play crucial roles in regulating gene expression and most cellular processes. Although computationally challenging, especially with less-studied organisms such as fish, predictions and functional characterization of lncRNAs are urgently needed. This work bridges the gap by creating state-of-the-art machine learning/deep learning models for identifying and analyzing lncRNAs in fish, contributing to improving aquaculture using genetic insight. A carefully curated dataset from the Ensembl database contained equal amounts of lncRNA and coding RNA sequences from 14 different fish species, totaling 48,006 sequences. This dataset was enriched with a comprehensive feature extraction process, which combined traditional sequence-based techniques and advanced embedding-based techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to ensure a strong representation of the biological information inherent in the RNA sequences. Six features extracted from the sequences were used for training and testing our ML models. In DL applications, TF-IDF was used. For machine learning, we relied mainly on two feature selection techniques, namely Random Forest (RF) and Univariate Selection (Mutual Information), along with their combination, RFMI (Random Forest intersect Mutual Information). A total of twelve different machine learning methods, seven deep learning methods, and three hybrid methods were employed to classify the lncRNAs. In a rigorous evaluation, the Light Gradient Boosting Machine (LGBM) model with RFMI feature selection on 45 features outperformed the other models, achieving an accuracy of 98.36%.
The effectiveness of the LGBM model was further validated by comparative analysis against six popular lncRNA prediction tools using an independent dataset derived from a fish species (Salmo trutta) not included in the training set. This independent validation underscores the robustness and accuracy of the model in real-world scenarios. This work also introduces FishLncPred, a user-friendly web server (http://46.202.167.198:5000/) developed to facilitate the real-time prediction of lncRNA biomarkers in fish. This tool uses the trained LGBM model to give predictions and downloadable results for user-submitted sequences, making the process of lncRNA identification much easier for researchers in aquaculture. To validate the practicality of this classifier, a case study was conducted on the economically important fish species, common carp (Cyprinus carpio). A total of 33,990 lncRNAs and 22,854 circular RNAs (circRNAs) were identified. The classifier was further applied to identify lncRNAs in common carp from RNA-seq data. This application not only validated the utility of the classifier but also provided insights into the RNA regulatory mechanisms in common carp. In parallel with the prediction tool, a comprehensive genomic resource for the common carp, CCncRNAdb (http://backlin.cabgrid.res.in/ccncrnadb/), was developed. CCncRNAdb harbors the identified lncRNAs and circRNAs, making it a useful resource for the scientific community and for further research in fish genomics. In conclusion, this research significantly advances the computational identification and functional analysis of lncRNAs in fish, providing tools and resources that improve the understanding of their roles in aquaculture. The output of this research will lead to more resilient and productive aquaculture practices that could be beneficial for developing more sustainable techniques of fish farming.
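As an illustration of how TF-IDF can be applied to RNA sequences, the sketch below treats each sequence as a "document" of overlapping k-mers. The choice of k = 3, the smoothed inverse document frequency, and the toy sequences are assumptions; the abstract does not state the tokenization or weighting used in the thesis.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (U is mapped to T)."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(sequences, k=3):
    """Return one {k-mer: weight} dict per sequence.

    tf  = k-mer count / number of k-mers in the sequence
    idf = ln((1 + N) / (1 + df)) + 1   (a common smoothed variant)
    """
    docs = [Counter(kmers(s, k)) for s in sequences]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in doc.items()
        })
    return vectors


# Toy sequences (not real transcripts).
vectors = tfidf(["ATGCATGC", "ATGAAAAA"])
for vec in vectors:
    print({term: round(weight, 3) for term, weight in sorted(vec.items())})
```

In this toy case "ATG" occurs in both sequences, so it receives a lower weight than "TGC", which occurs equally often in the first sequence but is absent from the second.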
Comprehensive Analysis of Copy Number Variation in diverse Bitter Gourd accessions.
Author: Das, Parinita, Jaiswal, Sarika, Iquebal, MA, Angadi, UB, Kumar, Dinesh.
2025
Power of Discrimination in Gene Expression based on trait Heritability in Bovine: A Meta-Analysis Approach
Author: Naina Kumari
2020-21
Livestock are important drivers of the sustainable development goals, promoting resilience, productivity among small farmers, and participation in markets. Cattle (Bos taurus) and buffalo (Bubalus bubalis) play important roles in Asian and Indian economies through other significant products in addition to milk and meat. Genetic improvement efforts in buffalo are hindered by the limited availability of genomic and transcriptomic resources. To fill this gap, we compared 2,429 transcriptomes from 438 BioSamples in 23 BioProjects, spanning 76 river and swamp buffalo tissues and cell types. This led to the creation of BuffExDB (http://46.202.167.198/buffex/), an easy-to-use, filterable database that provides tissue-specific gene expression and Tau scores for tissue-specific genes, along with functional annotations. It is the first database of its kind that allows users to browse, filter, and visualize the expression level of each tissue in multiple samples at once. In addition, we performed a meta-analysis of bovine transcriptome datasets to determine crucial genes involved in bovine tuberculosis (BTB) in cattle by combining multiple independent studies using a unified bioinformatics workflow. In the present research, we determined major genes, pathways, and ontologies related to the BTB disease process. RNA-Seq technology has revolutionized transcriptomic research with insights into gene expression in varying biological conditions. However, optimizing RNA-Seq experimental design remains a challenge, particularly in determining appropriate sample sizes based on heritability and statistical power. We created a statistical tool based on a linear model to calculate RNA-Seq sample sizes from heritability, the first of its kind. This method considers false discovery rates, heritability, tissue type, fold change, power and sample-to-sample variation, making differential expression studies more reliable.
The findings of the study reveal that the required sample size is inversely related to trait heritability, i.e., when heritability is low, more replicates are needed to achieve the required statistical power than for traits with medium or high heritability. To further assist researchers, we introduced HEssRNA (https://cran.r-project.org/web/packages/HEssRNA/index.html) and HEssRNA-Shiny, an R package available on CRAN and a web-based Shiny tool, respectively, for sample size estimation in RNA-Seq studies based on bovine gene expression data using the model developed. The web tool is an easy-to-use resource for non-programmers to estimate sample size based on heritability. Both the package and the tool offer an option for power calculation starting from an RNA-Seq count matrix. Although designed for bovine data, the tools can be customized for other species based on input data and heritability values. Collectively, these resources form a strong platform for transcriptomic studies, enabling data-informed experimental design and enhancing the reproducibility of gene expression research in cattle and buffalo. Our research helps to advance bovine functional genomics and enables precision livestock research.
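The Tau score mentioned for BuffExDB is commonly defined as tau = sum(1 - x_i / x_max) / (n - 1) over the n tissues, ranging from 0 (evenly expressed) to 1 (expressed in a single tissue). The sketch below implements that standard definition; whether BuffExDB applies a log transform or other preprocessing before computing Tau is not stated in the abstract, so the inputs here are simply assumed to be non-negative expression values.

```python
def tau_index(expression):
    """Tissue-specificity index Tau for one gene.

    expression: non-negative expression values, one per tissue.
    Returns a value between 0 (ubiquitous) and 1 (single-tissue).
    """
    n = len(expression)
    x_max = max(expression)
    if n < 2:
        raise ValueError("Tau needs at least two tissues")
    if x_max <= 0:
        raise ValueError("Tau is undefined when the gene is not expressed")
    return sum(1 - x / x_max for x in expression) / (n - 1)


# Toy expression profiles across tissues (made-up values).
print(tau_index([5.0, 5.0, 5.0, 5.0]))   # evenly expressed
print(tau_index([0.0, 0.0, 10.0, 0.0]))  # expressed in one tissue only
print(tau_index([10.0, 5.0, 0.0]))       # intermediate
```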
Genome-wide identification of copy number variation in black pepper and development of its atlas.
Author: Das, Parinita, Sheeja TE, Saha, Bibek, Fayad A, Chandra, Tilak, Angadi, UB, Shivakumar MS, Muhammed Azharudheen TP, Jaiswal, Sarika, Iquebal, Mir Asif, Kumar, Dinesh.
2025
Integrating a Module with HTP-DAP for QTL Mining using High Throughput Phenomics and Genomics
Author: Satendra Shivam
2022-23
A Study on Development of Artificial Intelligence-Based Methodology for Identification of Copy Number Variation in Crops
Author: Parinita Das
2020-21
Copy number variations (CNVs), encompassing deletions and duplications of DNA segments, are critical genomic features that influence gene expression, adaptation, and phenotypic variation. These structural variations play a pivotal role in genome evolution, trait expression, and environmental adaptability across plants. This research introduces MLDeCNV, a novel machine learning-based framework for the accurate detection and interpretation of CNVs in next-generation sequencing (NGS) data. Traditional CNV detection methods often struggle with small CNVs or those in regions with low read-depth signals, leading to incomplete detection. To overcome these challenges, MLDeCNV integrates 32 features derived from NGS data and combines outputs from multiple CNV detection tools with experimental validation using PCR and aCGH. A key aspect of the framework is the application of the SMOTE-Tomek Links data-balancing technique, which improves the model’s accuracy by addressing the class imbalance commonly found in CNV prediction. MLDeCNV outperforms existing CNV detection tools such as Delly, CNVnator, and Manta, demonstrating robust performance across different species, including rice, Arabidopsis, and pomegranate, with an AUC of 0.96. The study also highlights the practical utility of MLDeCNV by developing a web-based tool that simplifies CNV detection for researchers by accepting standard genomic inputs and offering an intuitive interface for classifying CNVs into deletions, duplications, or no CNV. Additionally, the research presents a genome-wide analysis of CNVs in black pepper and bitter gourd, uncovering thousands of CNVs and mapping them to critical agronomic traits such as stress resilience and pathogen defense.
This work contributes significantly to the field of agricultural genomics, showing how CNVs can be leveraged for crop improvement, marker-assisted breeding, and understanding species adaptation. The study’s findings underscore the value of integrating CNV data with genome-wide association studies (GWAS) to identify important loci linked to key traits, positioning MLDeCNV as a valuable resource for advancing genomic research in agriculture and evolutionary biology.
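The Tomek-links half of the data-balancing technique named in this abstract can be sketched in a few lines. A Tomek link is a pair of points from different classes that are each other's nearest neighbour; removing such pairs (or their majority-class member) cleans the class boundary after oversampling. The SMOTE step that synthesizes minority samples is omitted here, and the points and labels are invented; in practice a library implementation such as imbalanced-learn's `SMOTETomek` would more likely be used than this brute-force version.

```python
import math


def nearest(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and labels[i] != labels[j] and nearest(j, points) == i:
            links.append((i, j))
    return links


# Toy two-dimensional feature vectors (made-up values).
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
labels = ["cnv", "no_cnv", "cnv", "cnv", "no_cnv"]
print(tomek_links(points, labels))
```

Only the first two points form a link in this example: they are mutual nearest neighbours with different labels, whereas the last point's nearest neighbour has a closer neighbour of its own.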