Publications

Browse our research publications and academic works

Publication Types

Thesis (97)
Research Papers (87)
Publication Student Research

A comprehensive review on genomic resources in medicinally and industrially important major spices for future breeding programs: status, utility and challenges.

Author: Das, Parinita, Chandra, Tilak, Negi, Ankita, Jaiswal, Sarika, Iquebal, Mir Asif, Rai, Anil and Kumar, Dinesh

2023

Thesis Student Research

Development of Advanced Learning Model for Prediction of Epigenetic Modifications in Crop(s)

Author: Dipro Sinha

2018-19

Thesis Student Research

Deep learning based algorithm for identification of copy number variation

Author: Nitesh Kumar Sharma

2019-20

Copy number variations (CNVs) are a significant class of variants with a role in the etiology of numerous disease manifestations. Finding CNVs in genomic data remains challenging, and the approaches currently in use have an unacceptably high false positive rate. Before moving to downstream analysis or experimental validation, human specialists must carefully check the original CNV calls to weed out false positives. Here, we present a deep learning-based tool designed to take over this human intervention when validating CNV calls, with emphasis on the calls made by the PennCNV tool, one of the most reliable CNV callers reported in the literature. An ensemble model was developed that outperformed traditional machine learning techniques, with an improved accuracy of 0.9807 in CNV calls and a near-ideal area under the receiver operating characteristic curve of 0.9985. The model's improvement reduced the false positives and the instances in which CNV association results could not be replicated. This study also presents a CNV prediction server (http://backlin.cabgrid.res.in/eqcnvdb/index.php), based on an ensemble deep learning technique with minimum false discovery rate (FDR), which can be used in other related species. This work is the first genome-wide, chromosome-wise, breed-wise CNV atlas of Indian equine breeds, EqCNVDb, available at http://backlin.cabgrid.res.in/eqcnvdb/index.php. The output of this study can be further evaluated for horse breed signatures and for evolutionary studies, including the adaptive response of equine germplasm to biotic and abiotic stresses.
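The two headline metrics in this abstract, accuracy and area under the ROC curve, can be illustrated with a minimal sketch. It does not reproduce the thesis model; the function names, the 0.5 decision threshold and the toy inputs are assumptions made for illustration only.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen true CNV call (label 1)
    receives a higher score than a randomly chosen false call (label 0).
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of calls classified correctly at an assumed score threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)
```

Unlike accuracy, the AUC does not depend on a threshold, which is why the two numbers are usually reported side by side.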

Thesis Student Research

Development of Advanced Learning Based Classification Approach for Fungal Metagenomic Data

Author: Ritwika Das

2017-18

Microorganisms are an inevitable part of the ecosystem, playing beneficial roles such as nutrient mineralization, bioremediation and organic matter decomposition, as well as posing harmful effects as pathogens. Rapid advancement in NGS technologies has given rise to a new field of study, “Metagenomics”, for understanding microbial community composition and functions directly from any environmental sample such as human gut, skin, soil, ocean, crop rhizosphere etc. Accurate binning and taxonomic annotation of raw metagenomic reads is an essential step before the subsequent functional analysis. Computational approaches, especially machine learning and deep learning algorithms, have been found to efficiently classify prokaryotic microorganisms, viz. bacteria and archaea, from metagenomic datasets as compared to the reference-based method using BLAST. However, identification of fungal species from metagenomic data is a highly challenging task due to the complexity of eukaryotic genomes. The Internal Transcribed Spacer (ITS) region is the most widely used DNA marker for the taxonomic annotation of a majority of fungal species. In the present study, a convolutional neural network based approach, CNN_FunBar, has been developed using UNITE+INSDC reference ITS datasets for classifying fungal ITS sequences at all six taxonomic levels, viz., species, genus, family, order, class and phylum, while varying convolution kernel size, filter numbers, k-mer size, unique category numbers and category-wise ITS sequence frequencies. The proposed CNN_FunBar models have produced > 93% average accuracy for classifying ITS sequences from balanced datasets with 500 sequences per category and 6-mer frequency features at all the taxonomic levels. Species and genus level CNN_FunBar models, viz., Species_Model.h5 and Genus_Model.h5, could identify 62 species and 41 genera from the simulated fungal metagenomic dataset with a classification accuracy of 91.93% and 95.16% respectively.
The comparative study has suggested that CNN_FunBar could outperform existing fungal taxonomy prediction tools (funbarRF, Mothur, RDP Classifier, and SINTAX) as well as competitive machine learning-based algorithms (SVM, KNN, Naive-Bayes, and Random Forest). A web application, CNN_FunBar has been developed for extracting oligonucleotide frequency features from the input ITS sequences followed by their classification using proposed CNN_FunBar models at various taxonomic levels. The developed tool is freely available at https://github.com/ritwika1993/CNN_FunBar_ITS.
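The oligonucleotide (k-mer) frequency features mentioned above can be sketched as follows. This is an illustrative sketch, not the CNN_FunBar implementation: the function name, the alphabetical ordering of k-mers and the choice to skip windows containing ambiguous bases are all assumptions.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the relative frequency of every possible DNA k-mer in `seq`.

    The vector has 4**k entries (4096 for k=6), ordered alphabetically
    over the alphabet ACGT. Windows containing any other character
    (for example N) are skipped, which is an assumption of this sketch.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    # Avoid division by zero for sequences shorter than k.
    return [counts[km] / total if total else 0.0 for km in kmers]
```

A fixed-length vector of this kind is what allows ITS sequences of different lengths to be fed to a single classifier.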

Thesis Student Research

Development of computational approaches to understand plant-pathogen interactions

Author: Sneha Murmu

2017-18

Identifying protein-protein interactions (PPIs) in plant-pathogen systems is an intriguing and demanding field of research that is necessary to comprehend the complex molecular mechanisms of plant defense and pathogen virulence. Because identifying plant-pathogen PPIs experimentally requires so much time and effort, computational techniques are emerging as a helpful way to augment experimental methods. In the present study, the accuracy of well-established computational techniques for predicting plant-pathogen PPIs was investigated, such as the interolog-based approach, which relies on similarity searches. Due to the low sensitivity of the interolog technique, a machine learning (ML)-based ensemble model was employed to construct a multi-species plant-pathogen PPI predictor using diverse sequence encodings and multiple learning algorithms. Several amino acid sequence encoding schemes were evaluated. Auto-covariance (AC), conjoint triad (CT), and local descriptor (LD) schemes were selected based on their performance in terms of various evaluation metrics such as accuracy, sensitivity, precision, recall, Matthews correlation coefficient and F1-score. The selected features were combined with multiple learning algorithms such as random forest (RF), support vector machine (SVM), and artificial neural network (ANN). It was observed that AC and CT attained high accuracy with SVM (~96% and ~94% respectively) whereas LD performed better with RF (~95%). The predictions of these three individual models were further combined to yield an ensemble model with improved accuracy (~97%). The developed ensemble model was compared with existing similar tools for predicting PPIs between plant and pathogen, using an independent test dataset. The comparative assessment showed the promising potential of the classifier in this domain.
Hence, the developed model is proposed as an efficient tool for the prediction of multispecies plant-pathogen PPIs. Furthermore, to demonstrate the utility of the proposed classifier, it was employed to predict PPIs involved in wheat blast, caused by Magnaporthe oryzae pathotype Triticum (MoT). Wheat blast is a comparatively recent fungal disease but has become a serious threat to global wheat production. Most of the wheat proteins involved in the cross-talk between wheat and MoT were involved in energy production mechanisms in response to the fungal attack. The fungal effector proteins were involved in biological processes that support the growth of the pathogen. Finally, a web-based prediction server, named PlantPathoPPI, was developed using the proposed model to extend support to diverse levels of end-users. The prediction server is freely accessible and is available at http://login1.cabgrid.res.in:5080/. Taken together, PlantPathoPPI can serve as a valuable tool to accelerate the investigation of plant-pathogen interactions.
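The abstract states that the predictions of three individual models (AC with SVM, CT with SVM, LD with RF) were combined into an ensemble, but it does not say how. One common way to do this is simple majority voting, sketched below under that assumption; the function name and the 0/1 label encoding are hypothetical.

```python
def majority_vote(*model_predictions):
    """Combine binary predictions (1 = interacting pair, 0 = non-interacting)
    from an odd number of models by majority vote.

    Majority voting is an assumed combination rule for illustration;
    the source does not specify the rule that was actually used.
    """
    n_models = len(model_predictions)
    if n_models == 0 or n_models % 2 == 0:
        raise ValueError("provide an odd number of prediction lists")
    if len({len(p) for p in model_predictions}) != 1:
        raise ValueError("all prediction lists must have the same length")
    # A pair is called interacting when more than half of the models say so.
    return [1 if 2 * sum(votes) > n_models else 0
            for votes in zip(*model_predictions)]
```

For example, with predictions `[1, 0, 1, 1]`, `[1, 1, 0, 1]` and `[0, 0, 1, 1]` from the three models, the ensemble output is `[1, 0, 1, 1]`.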

Thesis Student Research

Development of database of genes and gene families responsible for nutritional traits in field crops

Author: Soumya Sharma

2016-17

Nutritional insecurity is a major challenge in developing countries that are largely dependent on cereal-based diets. Soil and plant scientists have accumulated much information on the concentration of minerals in the leaves of food crops. Major problems with food plants have been attributed to their lower than desired concentration of protein, inadequate essential amino acid ratios in plant proteins, and low digestibility of the proteins and carbohydrates in plants. Nutritionally dense crops offer an inexpensive and sustainable solution to the problem of malnutrition. A comprehensive search strategy was followed to obtain the genes responsible for nutritional traits in plants. The genes for mineral transportation, vitamin biosynthesis and essential amino acid biosynthesis were retrieved from four databases, viz. GenBank, EnsemblPlants, Gramene, and UniProt, using advanced searches with gene ontology keywords for specific nutrients, plants, crops and their nutrient-related roles, in conjunction with Boolean operators such as OR/AND. A total of 7695 sequences for mineral transportation, 1480 sequences for vitamin biosynthesis and 2583 sequences for essential amino acids were obtained. This study was oriented towards the application and comparison of different machine learning techniques (namely, support vector machine, random forest, Naïve Bayes and K nearest neighbour) for the development of classification models for nutritional trait (mineral transportation, vitamin biosynthesis and essential amino acid biosynthesis) related gene sequences in flowering plants. First, the machine learning techniques were applied to develop three binary classification models, one each for mineral transportation, vitamin biosynthesis and essential amino acid biosynthesis genes. Afterwards, multiclass classification models for mineral transportation, vitamin biosynthesis and essential amino acid biosynthesis genes were developed using each of the four classifiers.
Five-fold cross-validation was performed to compare the performance of the four classifiers independently, and the results suggested that random forest, SVM and KNN performed best for both binary and multiclass classification. The performance of Naïve Bayes was comparatively lower. Finally, a database of nutritional trait (mineral transportation, vitamin biosynthesis and essential amino acid biosynthesis) related gene sequences in flowering plants has been developed.
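The cross-validation step can be sketched as an index splitter: the samples are shuffled once, divided into five folds, and each fold serves as the test set exactly once. The function name, the fixed seed and the strided fold assignment are assumptions of this sketch, not details from the thesis.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Every sample appears in exactly one test fold. The seed makes the
    shuffle reproducible; its value is arbitrary.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slicing gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Using the same splits for all four classifiers keeps the comparison between them fair, since each one is then scored on identical held-out sequences.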

Thesis Student Research

Comparative Genomic studies for Domestication Related traits in Vigna species

Author: Shweta Kumari

2017-18

Thesis Student Research

Identification of Deep Learning Models to study microbial diversity in North Indian River Systems

Author: Nalini K Choudhury

2016-17

The Ganga and the Yamuna rivers constitute the major north Indian river system. These rivers play an important role in irrigation, fishing, transportation, health, etc. Besides, they function as sinks for major microbial density and diversity. The type and abundance of microbial populations enable several bio-remedial and biogeochemical studies, including metagenomics. With the advent of high-throughput technologies, a large amount of metagenomic data is available in the public domain. It has thus become a challenge to process such large amounts of metagenome data and to classify the unknown/unclassified microbes into known groups such as bacteria, archaea, fungi, viruses, and others. Meanwhile, machine learning and deep learning techniques have become widely used for handling large volumes of data. In river system metagenomics, and particularly in north Indian river metagenomics, the application of such learning techniques to the binning/classification of unknown microbial populations is yet to be fully explored. Further, there is a great demand for online servers with tools/pipelines for analyzing metagenomic data from the user's point of view. Thus, the present investigation has been carried out with the objectives to: i) study deep learning model based procedures to analyze the microbial communities of the major north Indian river system, ii) compare the performance of deep learning model based procedures with the existing procedures meant for metagenome data analysis, and iii) develop a user-friendly interface with the developed and existing procedures of metagenome analysis. To achieve these objectives, river sediment samples collected by ICAR-CIFRI, Barrackpore, from three sites each at Kanpur and Farakka for the Ganga and at Delhi for the Yamuna were used.
The raw metagenome data generated from the collected samples was pre-processed for quality checks, and metagenome assembly was subsequently carried out to obtain contigs and scaffolds. BLAST was initially applied to the scaffolds to identify the number of known microbial classes. It was found that there were broadly five classes present in the metagenome data. The scaffolds that remained unclassified were kept as a separate group. The entire metagenomic data with the extracted features was subjected to iterative K-means clustering to classify the microbes into five categories. The identified group/class labels along with the extracted feature data were used to train and test the machine learning (SVM, RF, GBDT, XGBoost, AdaBoost) and deep learning (BiLSTM) models. A 10-fold cross-validation technique was also employed to assess the performance of the learning classifiers in terms of metrics such as sensitivity, specificity, accuracy, etc. The comparison of classifier performance showed that Random Forest achieved the highest accuracy (89%) among the classifiers. A software package based on RF is available at https://github.com/Nalinikanta7/metagenomics. Also, the results revealed that Acetobacter, Achromobacter, Bacteroidetes, Fadolivirus, Indivirus, Gaeumannomyces, Phoenix, Strongyloides, Halobacterium, Haloferax, Halogeometricum, and Halosimplex microbes are most abundantly present in the metagenome data. Further, 66 percent of the unknown microbes have also been classified into the five identified known categories. The deep learning models have shown an accuracy range of 87 to 89 percent for the analysis of metagenomic data. Thus, a web server, “The Deep Machine in river metagenome”, has also been developed based on deep learning models for users to analyze river metagenome data at cabgrid.iasri.res.in/deepmachine.
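The metrics named above (sensitivity, specificity and accuracy) can be computed for one class of a five-class problem by treating that class as positive and all others as negative. The sketch below assumes this one-vs-rest convention; the function name and the example labels are hypothetical.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity and accuracy for one class against the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    def ratio(numerator, denominator):
        # Undefined ratios (no examples of that kind) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "accuracy": ratio(tp + tn, len(pairs)),
    }
```

In a cross-validation setting, these values would typically be computed per fold and then averaged.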

Thesis Student Research

Computational approach for prediction of AMP and molecular markers in black pepper germplasm

Author: Ankita Negi

2018-19

Black pepper (Piper nigrum, 2n = 52), an important spice, has traditionally been used for the treatment of various diseases owing to its therapeutic properties. Antimicrobial peptides (AMPs), also known as nature's antibiotics or host defense peptides, are produced by living organisms as an innate immune response against microbes. To overcome the problems of antimicrobial resistance (AMR), toxic antibiotic residues and related issues, a computational approach tuned to AMP prediction and whole-genome analysis would be advantageous. In this work, in silico prediction of genome-wide AMPs in black pepper and development of a species-specific prediction server using machine learning techniques were carried out, along with identification of molecular markers and miRNAs in the black pepper genome. Web genomic resources were developed, including the species-specific AMP candidate prediction server BPepAMPred (http://login1.cabgrid.res.in:5040/), based on a bidirectional gated recurrent unit (GRU)-based deep neural network architecture, and BPepAMPdb (http://backlin.cabgrid.res.in/blackpepper_amp_db/), which catalogues 43759 predicted AMP candidates across the black pepper proteome along with 10935 functionally associated unique genes with detailed features. BlackP2MSATdb (http://webtom.cabgrid.res.in/blackp2msatdb/), reported as the first and largest web resource for black pepper genomic SSRs and polymorphic SSRs, was also developed using ddRAD GBS data of 29 black pepper genotypes from across India. A total of 276230 genomic SSRs, with an average distance of 2.76 Kb between SSRs and a relative density of 362.88 SSRs per Mb, and 3176 polymorphic SSRs from the 29 black pepper genotypes, of which 2015 were hypervariable, were identified. The study also reports 2029 putative conserved miRNAs and 4207 miRNA target coding sequences of black pepper, which were functionally characterised to determine the possible post-transcriptional regulatory processes.
This information can be used by researchers to study genetic diversity among the different black pepper varieties, post-transcriptional regulation, and its role in various black pepper diseases. The markers provided can be used in QTL discovery, variety signature, traceability of produce and product, including GI certification if needed, and in improvement programs.
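Genomic SSR mining of the kind described above can be illustrated with a small regular-expression scanner. This is a sketch, not the pipeline used in the thesis: the function names are hypothetical, and the minimum repeat counts (10 for mono-, 6 for di-, 5 for tri- to hexanucleotide motifs) are assumed thresholds of the sort commonly used by SSR-mining tools.

```python
import re

# Assumed minimum repeat counts per motif length (1 to 6 bp).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def _is_redundant(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq):
    """Return (start, motif, repeat_count) tuples for perfect SSRs, sorted by start."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # A captured motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if _is_redundant(motif):
                continue  # already reported under the shorter motif
            hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)


def ssr_density_per_mb(n_ssrs, genome_length_bp):
    """Relative density expressed as SSRs per megabase."""
    return n_ssrs / genome_length_bp * 1_000_000
```

Density and mean spacing are reciprocal: the reported 362.88 SSRs per Mb corresponds to 1,000,000 / 362.88 ≈ 2756 bp between SSRs, which agrees with the reported average distance of 2.76 Kb.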

Thesis Student Research

Study on computational based approach for genome-wide prediction of AMP and molecular markers in buffalo

Author: Aamir Khan

2018-19

Water buffalo (Bubalus bubalis), belonging to the Bovidae family, is an economically important animal. One of the most important farm animals, it lives in hot and humid regions where climatic conditions favor the prevalence of infectious diseases. Interestingly, buffalo show a lower occurrence of diseases and fewer deleterious effects from them, which indicates a stronger innate immunity against infection. Buffaloes are found to express many AMPs, such as defensins, cathelicidins, and hepcidin, which play an important role in neutralizing invading pathogens. This study provides a faster, improved and species-specific AMP/non-AMP candidate prediction server, BufAMPpred (http://login1.cabgrid.res.in:5030/), based on an ensemble CNN+LSTM deep learning neural network architecture with improved prediction accuracy (98.72% accuracy, 99.79% sensitivity and 98.68% specificity) in comparison to the tools/servers available to date. The server supports the analysis of up to 500 sequences at a time and is linked with BufAMPdb (http://backlin.cabgrid.res.in/buffampdb/), a collection of all candidate AMPs (61711) and non-AMPs (2529971) predicted for buffalo along with their corresponding gene information. We also report the first comprehensive, holistic and user-friendly web genomic resource of buffalo (BuffGR), accessible at http://backlin.cabgrid.res.in/buffgr/, which catalogues 6028881 SNPs and 613403 InDels extracted from a set of 31 buffalo tissues. We found a total of 3727122 SNPs and 634124 InDels distributed in the four breeds of buffalo (Murrah, Bangladesh, Jaffarabadi and Egyptian) with reference to the Mediterranean breed. It also houses 4504691 SSR markers from all the breeds along with 1458 unique circRNAs, 37712 lncRNAs and 938 miRNAs.
This comprehensive web resource can be widely used by buffalo researchers across the globe for marker-trait association, studies of genetic diversity among the different breeds of buffalo, the use of ncRNAs as regulatory molecules, post-transcriptional regulation, and their role in various diseases and stresses. The markers can also be used as biomarkers to address adulteration and traceability. This resource can also be useful in buffalo improvement programs and disease/breed management.

Thesis Student Research

Identification of bacteriophage from the metagenomic data of Ganga and Yamuna Rivers

Author: Soutrik Mukherjee

2020-21

Microbes are important to every aspect of not only human life but all life forms on Earth. Every system in the biosphere is shaped by the almost infinite ability of microbes to transform the world around them. Identifying bacteriophages from various regions of the Ganga and Yamuna rivers is an important task for determining the abundance of different bacteriophage species. Because bacteriophages play a very important role in riverine systems by checking the growth of bacteria, it is important to understand their abundance. Further, very little work has been done on the annotation of bacteriophages identified from the Ganga and Yamuna rivers. Sediment samples from various regions of the Ganga and Yamuna rivers, namely the Balkeshwar-ShivpuriAgra, Koteswar-Ganga, Rasulabad-Ganga, Sahi-Dabad-Ganga, Taj-Gung-Yamuna, Triveni-Sangam-Ganga, Yamuna-Expressway-Agra and Bagwan-Ganga areas, were collected by the ICAR-Central Inland Fisheries Research Institute under the CABIN project. Two approaches were followed for the identification of bacteriophages. The first was binning of the metagenomic contig data with the Metabat2 tool, followed by distinguishing bacteriophage sequences with the machine learning based tool MARVEL. The second was an alignment-based approach using BLASTN, with the contigs of the metagenomic samples as the query and a database built from bacteriophage sequences downloaded from NCBI. With the MARVEL tool, across the 9 datasets, two bins of the Balkeshwar-Ganga contig data were found to contain bacteriophage sequences. Using the bioinformatics software Blast2GO, the unique sequence data (genes, proteins) was functionally annotated automatically and quickly. A BLAST table quantifying the bacteriophage species in the samples was generated.
Aeribacillus phage AP45 (complete genome) was the most abundant phage at all 9 sites of the Ganga and Yamuna rivers, and a gene ontology pie chart describes the biological processes, cellular components and molecular functions.
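The alignment-based quantification can be sketched as a tally over BLASTN tabular output (the standard 12-column format with query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value and bit score). The function name, the identity and e-value cut-offs, and the choice to keep only the best-scoring hit per contig are assumptions of this sketch, not settings from the thesis.

```python
import csv
import io
from collections import Counter


def count_best_hits(blast_tsv, min_identity=90.0, max_evalue=1e-5):
    """Count, per phage subject, the contigs whose best hit is that subject.

    `blast_tsv` is the text of a 12-column tab-separated BLAST table.
    Hits below the assumed identity or above the assumed e-value
    thresholds are ignored.
    """
    best = {}  # contig ID -> (subject ID, bit score)
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        qseqid, sseqid = row[0], row[1]
        pident, evalue, bitscore = float(row[2]), float(row[10]), float(row[11])
        if pident < min_identity or evalue > max_evalue:
            continue
        if qseqid not in best or bitscore > best[qseqid][1]:
            best[qseqid] = (sseqid, bitscore)
    return Counter(subject for subject, _ in best.values())
```

Calling `most_common(1)` on the returned counter gives the most abundant phage for a site under these assumptions.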

Thesis Student Research

Quality control of label-free Proteomics expression data considering missing values and Heterogeneity

Author: Kabilan S

2020-21

Proteins are important biomolecules that perform various physiological tasks such as metabolic catalysis, energy conservation, host defence and signalling. Proteomics is the large-scale study of the expressed proteins in a cell. The LC-MS is an indispensable tool for protein identification and quantification analysis because of its improved coverage, sensitivity, and high throughput. The most popular approach for LC-MS based protein analysis is the label-free bottom-up approach, where the unlabelled proteins are proteolytically digested into peptides by specific proteases such as trypsin and then analysed. The label-free LC-MS proteomics dataset often suffers from the problems of data heterogeneity and missing values. Various normalization and imputation methods are widely used for removing these biases, but there is no standard condition available for selecting the suitable combination of normalization and imputation methods for the proteomics expression dataset. This study aims to develop an approach for finding the suitable combination of normalization and imputation methods for the label-free proteomics expression data based on various quality control measures such as PCV, PEV, and PMAD. The standard benchmark dataset based on a highly complex yeast lysate sample spiked with different levels of a UPS1 standard protein was taken for this study. The three popular normalization methods namely, VSN, LOESS, and RLR and three efficient imputation methods named k-NN, LLS, and SVD methods were chosen for this study. They were paired with each other and a total of nine combinations of these methods were considered. The combination of LLS imputation and LOESS normalization was given as the suitable combination in the developed approach. This combination identified a greater number of significant proteins in differential expression analysis, than other combinations in most cases. 
The performance of the developed approach remained consistent, based on the NRMSE scores, even after missing values were generated artificially in the dataset in three different ways. An R package named ‘lfproQC’ was developed for the proposed approach and will be deposited in the CRAN repository. This package can be used to find the best combination of normalization and imputation methods for any label-free proteomics expression dataset and helps in efficient downstream analysis.
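The NRMSE check described above compares imputed values with the true values that were masked artificially. The abstract does not give the formula used, so the sketch below assumes one common definition, the root of the mean squared error divided by the variance of the true values; the function name is hypothetical and this is not the lfproQC implementation.

```python
import math
from statistics import pvariance


def nrmse(true_values, imputed_values):
    """Normalized root mean squared error over artificially masked entries.

    Assumed definition: sqrt(mean((imputed - true)^2) / var(true)).
    A score of 0 means perfect recovery; a score near 1 is what
    imputing every entry with the overall mean would give.
    """
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - i) ** 2
              for t, i in zip(true_values, imputed_values)) / len(true_values)
    variance = pvariance(true_values)
    if variance == 0:
        raise ValueError("true values have zero variance")
    return math.sqrt(mse / variance)
```

Under this definition, a lower score for a given normalization and imputation pairing indicates that the masked intensities were recovered more faithfully.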