Research Thesis
Title
Development of Advanced Learning Based Classification Approach for Fungal Metagenomic Data
Objectives
i) To develop an efficient advanced learning based approach for molecular marker based classification of fungal sequences from metagenomic datasets
ii) To empirically evaluate and compare the performance of the developed approach with existing software
iii) To develop software for the proposed approach
Abstract
Microorganisms are an integral part of the ecosystem. They play beneficial roles such as nutrient mineralization, bioremediation, and organic matter decomposition, and they also cause harm as pathogens. Rapid advancement in NGS technologies has given rise to a new field of study, “Metagenomics”, for understanding microbial community composition and function directly from environmental samples such as the human gut, skin, soil, ocean, and crop rhizosphere. Accurate binning and taxonomic annotation of raw metagenomic reads is an essential step before subsequent functional analysis. Computational approaches, especially machine learning and deep learning algorithms, have been found to classify prokaryotic microorganisms, viz. bacteria and archaea, from metagenomic datasets more efficiently than the reference-based method using BLAST. However, identification of fungal species from metagenomic data is highly challenging due to the complexity of eukaryotic genomes. The Internal Transcribed Spacer (ITS) region is the most widely used DNA marker for the taxonomic annotation of the majority of fungal species. In the present study, a convolutional neural network based approach, CNN_FunBar, has been developed using UNITE+INSDC reference ITS datasets for classifying fungal ITS sequences at all six taxonomic levels, viz., species, genus, family, order, class, and phylum, while varying the convolution kernel size, number of filters, k-mer size, number of unique categories, and category-wise ITS sequence frequencies. The proposed CNN_FunBar models produced more than 93% average accuracy at all taxonomic levels when classifying ITS sequences from balanced datasets with 500 sequences per category and 6-mer frequency features. The species and genus level CNN_FunBar models, viz., Species_Model.h5 and Genus_Model.h5, could identify 62 species and 41 genera from the simulated fungal metagenomic dataset with classification accuracies of 91.93% and 95.16%, respectively.
The comparative study suggested that CNN_FunBar could outperform existing fungal taxonomy prediction tools (funbarRF, Mothur, RDP Classifier, and SINTAX) as well as competing machine learning based algorithms (SVM, KNN, Naive-Bayes, and Random Forest). A web application, CNN_FunBar, has been developed that extracts oligonucleotide frequency features from the input ITS sequences and then classifies them at various taxonomic levels using the proposed CNN_FunBar models. The developed tool is freely available at https://github.com/ritwika1993/CNN_FunBar_ITS.
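As an illustration of the oligonucleotide (k-mer) frequency features mentioned above, the following is a minimal sketch of how a 6-mer frequency vector might be computed for a single ITS sequence. It is not taken from the CNN_FunBar code base: the function name kmer_frequencies, the normalization to relative frequencies, and the skipping of windows that contain ambiguous bases are all assumptions made for this example.

```python
from itertools import product


def kmer_frequencies(sequence, k=6):
    """Return relative k-mer frequencies over the A/C/G/T alphabet.

    Hypothetical helper for illustration; the actual CNN_FunBar
    feature extraction may differ (for example, raw counts).
    """
    sequence = sequence.upper()
    # Fixed, lexicographically ordered list of all 4**k possible k-mers.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k across the sequence.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGTACGTTTGACA", k=6)
    print(len(features))  # one entry per possible 6-mer
```

With k = 6 the vector has 4^6 = 4096 entries, one per possible 6-mer, so every sequence maps to an input of the same length regardless of how long the ITS region is.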
Publications (1)
CNN_FunBar: Advanced Learning Technique for Fungi ITS Region Classification
Ritwika Das, Anil Rai, Dwijesh Chandra Mishra
Academic Details
Experience
M.Sc. in Bioinformatics from PG School, ICAR - Indian Agricultural Research Institute, New Delhi - 110012
Jul 2015 — Jul 2017