Research Thesis
Title
Novel and Efficient Pipeline for Metagenomics Binning
Objectives
i) To develop a novel method for generating features from metagenomic datasets. ii) To evaluate the performance of the proposed features, together with composition features, using various clustering algorithms and in comparison with existing approaches.
Abstract
Metagenomics is the study of microorganisms sampled directly from their environment, and a central task in the field is reconstructing the genomes of the individual organisms present. The task is difficult because many organisms cannot be isolated and cloned under in vitro conditions. Metagenomics is also known as environmental genomics, eco-genomics, or community genomics. Shotgun sequencing produces reads that are fragments of the genomes of many different microorganisms, and reconstructing those fragmented sequences relies heavily on genome assembly. The main hurdle is separating and reassembling the genomes of the different organisms: the abundance of these genomes and the intermingling of their reads make this a formidable problem. To make reconstruction possible, the reads must be sorted into separate bins, each corresponding to a distinct microorganism. Several techniques have emerged for categorizing these intermingled genomes, including binning, boosting, bagging, and stacking; of these, binning is currently the most widely used. In binning, sequences are grouped into operational taxonomic units (OTUs) to support subsequent taxonomic profiling and functional analysis. Binning draws on a variety of clustering methods, such as k-means, k-medoids, Hidden Markov Models (HMMs), and hierarchical clustering, each of which has its own limitations and drawbacks. Existing work, however, does not address motif-based binning.
This thesis presents an approach to metagenomic binning that constructs a frequency table of motifs, or segments, identified by gapped local alignment, with the constraint that the aligned segments must not overlap. K-means, PAM (partitioning around medoids), and DBSCAN clustering are applied to cluster the contigs based on these segments and motifs, and k-means performs best. The Rand index values obtained with this approach are close to 1, which indicates that it is well suited to metagenomic binning. It also performs better than the existing binning tools MaxBin and MetaBAT. The approach leaves considerable scope for extension: more advanced clustering algorithms could replace simple k-means, and composition features such as GC content and tetranucleotide frequency could be added to improve performance further. The approach also highlights mutations and conserved regions, which are essential for understanding evolutionary biology.
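As an illustration of two ideas in the abstract, the following minimal sketch builds a per-contig motif frequency table from non-overlapping occurrences and computes the Rand index between a reference labeling and a predicted binning. It is not the pipeline described in the thesis. Exact string matching is used here as a simplified stand-in for gapped local alignment, and the function names `motif_frequency_table` and `rand_index`, the length normalization, and the toy inputs are all assumptions made for the example.

```python
from itertools import combinations


def motif_frequency_table(contigs, motifs):
    """Return {contig_name: [frequency of each motif]}.

    ASSUMPTION: exact matching replaces gapped local alignment.
    str.count() counts non-overlapping occurrences, scanning left to right,
    which mirrors the "segments must not overlap" constraint.
    Counts are divided by contig length so contigs of different sizes
    are comparable (an illustrative choice, not taken from the thesis).
    """
    table = {}
    for name, seq in contigs.items():
        seq = seq.upper()
        table[name] = [seq.count(motif) / len(seq) for motif in motifs]
    return table


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two labelings agree.

    A pair agrees when both labelings put the two items in the same
    cluster, or both put them in different clusters. A value of 1 means
    the two partitions are identical up to renaming of the clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy contigs and motifs, for demonstration only.
    contigs = {"contig_1": "ACGTACGT", "contig_2": "TTTTGGGG"}
    print(motif_frequency_table(contigs, ["ACG", "TT"]))
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The rows of the resulting table could then be passed as feature vectors to any clustering routine, such as k-means, and the predicted bins scored against known genome labels with `rand_index`.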