Research Thesis
Title
Phylogenetic Marker Genes Based Approach for Binning of Metagenomics Data
Objectives
1. To develop a semi-supervised binning strategy for metagenomics data using phylogenetic marker genes
2. To evaluate the performance of the developed approach
Abstract
The study of microbes has traditionally focused on single species in pure culture, which makes it difficult to interpret the complex communities in which microbes normally live. Metagenomics enables the investigation of microbes in their natural environments. Sequence binning, the most widely used of several grouping techniques, is an important step in metagenomic data analysis that produces meaningful 'bins' or groups. Binning refers to the classification of DNA sequences into clusters that may represent an individual genome or genomes from taxonomically related microorganisms. It can use any of several clustering techniques, such as K-Means, DBSCAN, spectral clustering and hierarchical clustering, but each of these has its own drawbacks. To date, only a few efforts have used single-copy phylogenetic marker genes for the clustering of metagenomic data. Phylogenetic marker genes are protein-coding genes that are universal, present in a single copy and rarely subject to horizontal gene transfer (HGT). They have been used to delineate prokaryotic species accurately and consistently. In this research, a semi-supervised clustering approach is adopted to cluster metagenomic data using marker genes. First, contigs harbouring marker genes are identified by running the Prodigal, FetchMG and USEARCH applications sequentially. Then the K-Means clustering technique is applied to the metagenomic data after it has been reduced to two dimensions using the BH-tSNE (Barnes-Hut t-SNE) algorithm. Finally, the generated clusters are corrected, based on the sequences harbouring marker genes, with the help of spectral clustering.
K-Means clustering alone generated 8 clusters with a Rand index of 0.973, an F1 score of 0.71 and an overall accuracy of 0.9 for a 10s genome dataset, using tetranucleotide frequency as the initial input feature matrix. Cluster correction then produced 10 clusters with a Rand index of 0.981, an F1 score of 0.91 and an overall accuracy of 0.95 for the same dataset. In short, cluster correction using sequences harbouring marker genes produced better clustering results.
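As an illustration of two quantities named above, the following minimal Python sketch computes a tetranucleotide frequency vector for a single contig and the Rand index between two labelings. It is not the implementation used in this work. The function names are hypothetical, the feature vector is assumed to cover all 256 tetranucleotides without merging reverse complements, and the Rand index is assumed to be the plain, unadjusted pair-counting form.

```python
from itertools import combinations, product

# All 256 tetranucleotides in a fixed order, so every contig maps to the
# same feature layout (assumption: reverse complements are not merged).
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}


def tetranucleotide_frequencies(seq):
    """Return a 256-element vector of normalised 4-mer frequencies."""
    seq = seq.upper()
    counts = [0] * len(TETRAMERS)
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:  # windows containing N or other symbols are skipped
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [c / total for c in counts]


def rand_index(true_labels, pred_labels):
    """Fraction of sequence pairs on which two labelings agree.

    A pair agrees when both labelings put the two sequences in the same
    group, or both put them in different groups.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("labelings must have the same length")
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        agree += int(same_true == same_pred)
        pairs += 1
    return agree / pairs if pairs else 1.0
```

Stacking one such vector per contig gives a feature matrix of the kind that could be passed to a dimensionality reduction step and then to K-Means. The Rand index depends only on how sequences are grouped, so renaming the cluster labels does not change its value.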