Publications

Thesis Student Research

Quality control of label-free Proteomics expression data considering missing values and Heterogeneity

Author: Kabilan S

2020-21

Proteins are important biomolecules that perform various physiological tasks such as metabolic catalysis, energy conservation, host defence and signalling. Proteomics is the large-scale study of the proteins expressed in a cell. Liquid chromatography-mass spectrometry (LC-MS) is an indispensable tool for protein identification and quantification because of its coverage, sensitivity, and high throughput. The most popular approach for LC-MS-based protein analysis is the label-free bottom-up approach, in which unlabelled proteins are digested into peptides by specific proteases such as trypsin and then analysed. Label-free LC-MS proteomics datasets often suffer from data heterogeneity and missing values. Various normalization and imputation methods are widely used to remove these biases, but there is no standard criterion for selecting a suitable combination of normalization and imputation methods for a given proteomics expression dataset. This study aims to develop an approach for finding a suitable combination for label-free proteomics expression data based on quality control measures: pooled coefficient of variation (PCV), pooled estimate of variance (PEV), and pooled median absolute deviation (PMAD). A standard benchmark dataset, a highly complex yeast lysate sample spiked with different levels of the UPS1 protein standard, was used. Three popular normalization methods (VSN, LOESS, and RLR) and three efficient imputation methods (k-NN, LLS, and SVD) were chosen and paired with each other, giving a total of nine combinations. The developed approach selected LLS imputation with LOESS normalization as the suitable combination. In most cases, this combination identified more significant proteins in differential expression analysis than the other combinations.
The performance of the developed approach, assessed by NRMSE scores, remained consistent even after missing values were generated artificially in the dataset in three different ways. An R package named ‘lfproQC’ was developed for the proposed approach and will be deposited in the CRAN repository. The package can be used to find the best combination of normalization and imputation methods for any label-free proteomics expression dataset and supports efficient downstream analysis.
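
The selection idea described above can be sketched in Python. This is an illustration only, not the lfproQC implementation: the abstract does not give the formulas, so the block assumes one common formulation in which each measure is computed per protein within a sample group and then averaged over proteins and groups. The function names and the rank-sum selection rule are hypothetical.

```python
import numpy as np

def _pooled(data, groups, stat):
    """Apply `stat` per protein within each group, then average over proteins and groups."""
    groups = np.asarray(groups)
    per_group = [np.nanmean(stat(data[:, groups == g])) for g in np.unique(groups)]
    return float(np.mean(per_group))

def _cv(block):
    # Coefficient of variation of each protein across the replicates of one group.
    return np.nanstd(block, axis=1, ddof=1) / np.nanmean(block, axis=1)

def _var(block):
    # Sample variance of each protein across the replicates of one group.
    return np.nanvar(block, axis=1, ddof=1)

def _mad(block):
    # Median absolute deviation of each protein across the replicates of one group.
    med = np.nanmedian(block, axis=1, keepdims=True)
    return np.nanmedian(np.abs(block - med), axis=1)

def qc_scores(data, groups):
    """data: proteins x samples matrix that has already been normalized and imputed."""
    return {
        "PCV": _pooled(data, groups, _cv),
        "PEV": _pooled(data, groups, _var),
        "PMAD": _pooled(data, groups, _mad),
    }

def best_combination(candidates, groups):
    """Hypothetical rule: rank candidates on each measure and pick the lowest rank sum."""
    scores = {name: qc_scores(mat, groups) for name, mat in candidates.items()}
    rank_sum = dict.fromkeys(scores, 0)
    for metric in ("PCV", "PEV", "PMAD"):
        ordered = sorted(scores, key=lambda n: scores[n][metric])
        for rank, name in enumerate(ordered):
            rank_sum[name] += rank
    return min(rank_sum, key=rank_sum.get)

# Toy matrices (2 proteins x 6 samples, two groups of three replicates); not real data.
groups = ["A", "A", "A", "B", "B", "B"]
tight = np.array([[10.0, 10.1, 9.9, 12.0, 12.1, 11.9], [8.0, 8.1, 7.9, 9.0, 9.1, 8.9]])
noisy = np.array([[10.0, 11.0, 9.0, 12.0, 13.0, 11.0], [8.0, 9.0, 7.0, 9.0, 10.0, 8.0]])
print(best_combination({"combo_1": tight, "combo_2": noisy}, groups))
```

Lower values of all three measures indicate less within-group variation, so the sketch prefers the combination that ranks lowest overall. In practice each candidate matrix would come from one of the nine normalization and imputation pairings.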

Thesis Student Research

Development of web-based tool to identify polymorphic microsatellite markers for RAD-seq data

Author: Madhusudhan CM

2020-21

Thesis Student Research

CRISPR/Cas9 off-target prediction in Plants using Deep Learning

Author: Chandini B C

2020-21

Thesis Student Research

Identification and Characterization of lncRNA in Ricebean (Vigna umbellata)

Author: Bibek Saha

2019-20

Ricebean (Vigna umbellata) is a Kharif-season annual legume whose seeds are consumed as a pulse. It is considered a minor legume because it is grown in limited areas as an intercrop with maize and sorghum, mostly in northern India (mainly Uttarakhand) and north-eastern India (mainly Assam). Its seeds contain a good amount of protein and other nutrients. The protein-coding RNAs of the developing seed are largely regulated by non-coding RNAs, specifically long non-coding RNAs (lncRNAs). lncRNAs are a large and diverse class of regulatory non-coding RNA: transcribed molecules longer than 200 bp, with an ORF shorter than 100 bp, that do not encode proteins. They regulate gene expression through DNA methylation and chromatin remodeling, and in some cases they act as microRNA (miRNA) sponges, enhancing the expression of mRNAs targeted by those miRNAs (Tay et al., 2014). lncRNAs are thought to have a wide range of functions in cellular and developmental processes. A lncRNA may lie beside a protein-coding gene, between genes, or overlapping a coding gene. Hardly any work has been reported on the identification of lncRNAs in ricebean. This study aims to identify lncRNAs and annotate their targets in the developing stages of the ricebean seed. A total of 906 novel lncRNAs were identified, of which 82 target 15 miRNAs. Different lncRNAs were observed to share similar miRNA targets. These 15 miRNAs in turn targeted 15 mRNAs. Finally, annotation of the 15 mRNAs showed that they regulate different biological, cellular, and metabolic processes during the developmental stages of the ricebean seed. ‘RbLncDB’, a web resource, was also developed in this study to help future researchers working on the ricebean seed transcriptome.
Keywords: Vigna umbellata; long non-coding RNAs; micro RNA; reference assembly; lncRNA-targeted miRNA.
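
The two size criteria quoted in the abstract (transcript longer than 200 bp, ORF shorter than 100 bp) can be illustrated with a small filter. This is a minimal sketch and not the pipeline used in the thesis: the function names are hypothetical, only the three forward reading frames are scanned, and an ORF is assumed to run from an ATG to the next in-frame stop codon.

```python
STOPS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Length in bp of the longest ATG-to-stop ORF over the three forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                best = max(best, i + 3 - start)  # close it, stop codon included
                start = None
    return best

def is_candidate_lncrna(seq, min_len=200, max_orf=100):
    """Apply the two thresholds from the abstract; both defaults are in bp."""
    return len(seq) > min_len and longest_orf(seq) < max_orf

# Toy sequences, for illustration only.
noncoding = "C" * 250                       # long, no ORF
coding = "ATG" + "GCC" * 100 + "TAA"        # 306 bp with a 306 bp ORF
print(is_candidate_lncrna(noncoding), is_candidate_lncrna(coding))
```

A filter like this only narrows the candidate set; it says nothing about coding potential beyond ORF length.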

Thesis Student Research

Phylogenetic Marker Genes Based Approach for Binning of Metagenomics Data

Author: Asif Ali V K

2019-20

Microbes were traditionally studied as single species in pure culture, which made complex microbial communities very difficult to interpret. Metagenomics makes it possible to investigate microbes in their natural environments, within the complex communities in which they normally live. Sequence binning is one of the important steps of metagenomic data analysis, producing meaningful 'bins' or groups. Of the several grouping techniques available, binning is the most widely used. Binning refers to the classification of DNA sequences into clusters that may represent an individual genome or genomes from taxonomically related microorganisms. It can use any of several clustering techniques, such as K-Means, DBSCAN, spectral clustering, and hierarchical clustering, each of which has its own drawbacks. So far, few efforts have used single-copy phylogenetic marker genes for clustering metagenomic data. Phylogenetic marker genes are universal, single-copy, protein-encoding genes that are rarely subject to horizontal gene transfer (HGT), and they have been used to delineate prokaryotic species accurately and consistently. In this research, a semi-supervised clustering approach is adopted to cluster metagenomic data using marker genes. First, contigs harbouring marker genes are identified by running Prodigal, FetchMG, and USEARCH sequentially. Then K-Means clustering is applied to the metagenomic data after it has been reduced to two dimensions with the BH-TSNE algorithm. Finally, the generated clusters are corrected with the help of spectral clustering, based on the sequences harbouring marker genes.
With tetranucleotide frequency as the initial input feature matrix, K-Means clustering alone generated 8 clusters with a Rand index of 0.973, an F1 score of 0.71, and an overall accuracy of 0.9 for a 10s genome dataset. Cluster correction generated 10 clusters with a Rand index of 0.981, an F1 score of 0.91, and an overall accuracy of 0.95 for the same dataset. In short, cluster correction using sequences harbouring marker genes produced better clustering results.
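
Two ingredients named in this abstract, the tetranucleotide frequency feature and the Rand index, are easy to show concretely. The sketch below is illustrative and not the thesis code: the function names are hypothetical, reverse-complement 4-mers are not merged (some binning tools do merge them), and the Rand index is the plain unadjusted pair-counting version.

```python
from itertools import combinations, product

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]   # all 256 tetranucleotides
INDEX = {k: i for i, k in enumerate(KMERS)}

def tetranucleotide_frequencies(contig):
    """Return a 256-long list of relative 4-mer frequencies for one contig."""
    contig = contig.upper()
    counts = [0] * len(KMERS)
    for i in range(len(contig) - 3):
        idx = INDEX.get(contig[i:i + 4])   # 4-mers containing N or other symbols are skipped
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def rand_index(true_labels, pred_labels):
    """Fraction of contig pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Toy inputs, for illustration only.
features = tetranucleotide_frequencies("ACGTACGT")
print(features[INDEX["ACGT"]])
print(rand_index([0, 0, 1, 1], [0, 1, 1, 1]))
```

Stacking one such 256-value vector per contig gives the feature matrix that would then be reduced to two dimensions and clustered.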

Thesis Student Research

Deep Learning for Predicting Breeding Value using High Throughput Genotyping and Phenotyping

Author: Lal Dhari Patel

2019-20

Accurate estimation of breeding value is of key importance in a crop breeding program. Traditionally, statistical methods have been widely used to predict breeding values from genotypic effects. These methods usually assume that genotypic effects are independently distributed and follow a prior distribution such as the Gaussian. Such assumptions may limit the prediction of breeding values from high-throughput genotyping data, which carry very precise information about genotypes. Harnessing this precise genotypic information also calls for equally precise phenotyping, but conventional phenotyping at that precision is laborious, expensive, and sometimes impossible. To overcome these limitations, the present work proposes the use of deep learning for predicting breeding value, exploiting the full potential of high-throughput genotyping in conjunction with high-throughput phenotyping. A deep learning-based convolutional neural network (CNN) model was trained to predict breeding value from high-throughput genotyping and phenotyping data of a wheat dataset consisting of 184 recombinant inbred lines (RILs), each with 3121 filtered SNPs. Data for six traits under two environments (controlled and drought conditions) were used. The whole dataset was first randomly divided into a training set containing 80% of the data, on which the CNN models were trained, and a testing set containing the remaining 20%. Two parameters were used for testing and evaluating the trained deep learning model.
The trained and tested deep learning model was compared with existing statistical models, namely GBLUP (genomic best linear unbiased prediction), rrBLUP (ridge regression best linear unbiased prediction), and Bayesian LASSO (Bayesian least absolute shrinkage and selection operator). The results show that the deep learning model performs better than the statistical methods considered.
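
To make the evaluation setup concrete, the sketch below shows a random 80/20 split and a ridge-regression baseline in the spirit of rrBLUP. It is not the thesis code and does not reproduce the CNN. The data are synthetic stand-ins with the same dimensions as the abstract (184 lines, 3121 SNPs), the penalty value is arbitrary, and Pearson correlation is used as the evaluation measure only as an assumption, since the abstract does not name the two parameters it used.

```python
import numpy as np

def train_test_split_indices(n, train_fraction=0.8, seed=0):
    """Randomly split row indices into training and testing sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    cut = int(train_fraction * n)
    return order[:cut], order[cut:]

def ridge_fit(X, y, lam=1.0):
    """Ridge regression on centred data.

    Uses the dual form beta = Xc' (Xc Xc' + lam*I)^-1 yc, which equals the usual
    (Xc'Xc + lam*I)^-1 Xc'yc but only solves an n x n system (cheap when SNPs >> lines).
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(X.shape[0]), yc)
    beta = Xc.T @ alpha
    return beta, y_mean - x_mean @ beta

def predict(X, beta, intercept):
    return X @ beta + intercept

# Synthetic stand-in data: 0/1 marker codes and a simulated trait. Not the wheat dataset.
rng = np.random.default_rng(1)
n_lines, n_snps = 184, 3121
X = rng.integers(0, 2, size=(n_lines, n_snps)).astype(float)
effects = rng.normal(0.0, 0.1, n_snps)
y = X @ effects + rng.normal(0.0, 1.0, n_lines)

train, test = train_test_split_indices(n_lines)
beta, intercept = ridge_fit(X[train], y[train], lam=100.0)
r = np.corrcoef(predict(X[test], beta, intercept), y[test])[0, 1]
print(len(train), len(test), round(float(r), 3))
```

With 184 lines, the split gives 147 training and 37 testing lines. A deep learning model would be trained and scored on the same indices so that the comparison with the baseline is like for like.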

Thesis Student Research

Identification and Characterization of bZIP and Dof gene families from developing seeds of Vigna umbellata

Author: Shivdarshan S Jirli

2019-20

Thesis Student Research

Prediction of enzymes involved in bioremediation using aquatic Metagenomes

Author: Chandana V

2019-20

Thesis Student Research

Development of a deep learning based methodology for functional protein classification

Author: Bulbul Ahmed

2016-17

Cereals are staple crops cultivated widely across the world. They are highly nutritious, rich in vitamins, minerals, carbohydrates, fats, oils, proteins, and fibre, but low in essential amino acids such as lysine. Cereal crops belong to the Poaceae family and are widely used in the production of flour, bread, rice, cakes, corn, etc.; other by-products include beverages and wine. Moreover, consumption of these crops reduces the risk of coronary heart disease, diabetes, colon cancer, diverticular disease, etc. India is the third-largest cereal producer after China and the USA, and a 4.9% increase in production could be achieved between the base year 2020 and 2027. Production is strongly affected by biotic and abiotic stresses, which impair crop growth and development and result in crop losses and, in turn, economic losses. Hence, the genes involved need to be understood and studied in order to minimize the effects of biotic and abiotic stresses. Under stress, genes adapt and produce proteins that help tolerate such changes by altering signalling pathways through protein-protein interactions. Finding these proteins is expensive and time-consuming and requires highly experienced personnel. To reduce cost and time, rapid computational classification and prediction of such proteins is required. These proteins are also complex and high-dimensional, which makes them very difficult to study with conventional approaches. This study applied machine learning techniques (support vector machine and random forest) and deep learning (long short-term memory) to develop classification models for abiotic stress (heat, cold, salinity, and drought) protein sequences from the Poaceae family.
In addition, an activation function, the Gaussian Error Linear Unit with Sigmoid function (SiELU), was developed for use in the deep learning model and increased the model's efficiency. Lastly, a web-based tool for predicting stress-associated proteins from the Poaceae family was developed, implementing the proposed long short-term memory methodology with the SiELU activation function and tuned hyper-parameters.
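
The abstract does not define SiELU, so the sketch below does not attempt to reproduce it. For context only, it shows the two standard building blocks that the name refers to: the exact Gaussian Error Linear Unit (GELU) and its widely used sigmoid approximation. How the thesis combines them is not stated here, and the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid_approx(x):
    """Common sigmoid approximation of GELU: x * sigmoid(1.702 * x)."""
    return x * sigmoid(1.702 * x)

# Compare the two on a few inputs.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, round(gelu(x), 4), round(gelu_sigmoid_approx(x), 4))
```

Both functions behave like the identity for large positive inputs and approach zero for large negative inputs, with a smooth transition around zero.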

Thesis Student Research

Development of Big Data Analytics Based Methods for Genome Assembly and Annotation

Author: Amit Kairi

2014-15

The study on “Development of Big Data Analytics Based Methods for Genome Assembly and Annotation” was carried out at the Centre for Agricultural Bioinformatics (CABin), ICAR-Indian Agricultural Statistics Research Institute (IASRI), New Delhi, during 2014-2020. In the present study, genome assembly and annotation procedures were critically reviewed in order to develop new approaches that reduce time complexity while improving output quality. Big Data analytics-based techniques were used to devise new approaches and to compare them with existing algorithms, so as to judge the quality of the resulting genome assembly and annotation. In this chapter, a brief introduction to sequencing techniques, genome assembly procedures with their merits and demerits, annotation, and Big Data is given, along with the motivation and objectives of the study.

Thesis Student Research

Development of Robust Methods for Genomic Selection

Author: Neeraj Budhlakoti

2014-15

Thesis Student Research

Development of Integrated Index for Genomic Selection

Author: Md Asif Khan

2015-16
