Rajarshi Mondal
M.Sc. Bioinformatics
1 yr 9 mo
Duration
Research Thesis
Title
Development of Multiclass Model to Classify the Major Group of Microbes from Metagenomic Data
Objectives
Development of a machine learning multiclass model for the classification of major microbial groups using metagenomic datasets. Development of a user-friendly web tool or R package using the proposed model.
Abstract
Metagenomics analyzes genetic material extracted directly from environmental samples, enabling the study of microbial communities without isolation or cultivation. However, the complexity and diversity of microbial genomes, together with varying abundance levels, make accurate classification and assembly of sequences challenging. Shotgun sequencing generates short reads from mixed microbial genomes, so taxonomic binning is needed for effective profiling and functional analysis. This study proposes a taxonomy-dependent classification framework based on marker genes to improve binning accuracy. Feature extraction and selection were performed on marker sequences, and machine learning models were then trained using 10-fold cross-validation. A two-stage hierarchical classification scheme was designed to address misclassification biases caused by genome size differences among viruses, bacteria, and fungi: the first stage separates viral sequences from the rest, and the second stage classifies the remaining sequences as bacterial or fungal. Results show that the proposed framework reduces misclassification and improves the balance of taxonomic assignments. Logistic Regression was the best-performing model, with an overall accuracy of 85.06%. Class-wise analysis showed strong predictive power for bacteria and fungi but lower performance for viruses. A web server, MultiMetaCC, was implemented to make the model accessible (https://rajmondal.shinyapps.io/MultiMetaCC/). The approach demonstrates robust performance and can be extended by integrating ensemble methods (e.g., Random Forest, XGBoost) and deep learning models (e.g., CNNs, RNNs) to capture complex patterns and relationships. This method lays a foundation for more refined, scalable, and accurate classification in metagenomic studies.
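The two-stage hierarchical scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not state which features were extracted, so normalized k-mer frequencies are used here purely as an assumption, and the names `kmer_features`, `two_stage_predict`, `is_viral`, and `bacteria_or_fungi` are hypothetical. The two stage classifiers are passed in as plain callables, so any trained model (for example a logistic regression) could be plugged in.

```python
from collections import Counter
from itertools import product
from typing import Callable, List

Features = List[float]


def kmer_features(seq: str, k: int = 3) -> Features:
    """Normalized k-mer frequencies over A/C/G/T in a fixed order.

    Assumption: k-mer frequencies stand in for the (unspecified)
    features extracted from marker sequences in the thesis.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing ambiguous bases (e.g. N) are not in vocab and are ignored.
    total = sum(counts[m] for m in vocab)
    return [counts[m] / total if total else 0.0 for m in vocab]


def two_stage_predict(
    features: Features,
    is_viral: Callable[[Features], bool],
    bacteria_or_fungi: Callable[[Features], str],
) -> str:
    """Route a sequence through the two stages.

    Stage 1 decides virus vs. non-virus; stage 2 is consulted only
    for non-viral sequences and returns "bacteria" or "fungi".
    """
    if is_viral(features):
        return "virus"
    return bacteria_or_fungi(features)


if __name__ == "__main__":
    # Hypothetical stand-in classifiers; real ones would be trained models.
    def stage1(f: Features) -> bool:
        return f[0] > 0.5  # toy rule: mostly "AAA" counts as viral

    def stage2(f: Features) -> str:
        return "bacteria" if f[-1] < 0.5 else "fungi"  # toy rule on "TTT"

    print(two_stage_predict(kmer_features("AAAAAAAA"), stage1, stage2))
    print(two_stage_predict(kmer_features("ACGTACGT"), stage1, stage2))
```

Keeping the stages as separate callables mirrors the motivation given in the abstract: the viral decision is made on its own, so a class imbalance or genome size bias affecting viruses does not directly influence the bacterial versus fungal decision.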