MAG binning - MOSHPIT documentation

Read mapping

Before we can bin contigs into MAGs, we need to index the contigs obtained in the assembly step and map the original reads to those contigs using that index. The contig binner then uses this read mapping to work out which contigs originated from the same genome and to group them together. Run the actions below to index the contigs and map the reads to the generated index:

Contig indexing

You can speed up this action by taking advantage of parsl parallelization support. We will use the same config as for genome assembly. To run without parallelization, omit the --parallel-config option.

mosh assembly index-contigs \
    --i-contigs contigs.qza \
    --p-threads 2 \
    --p-seed 100 \
    --o-index contigs-index.qza \
    --parallel-config parallel.config.toml \
    --verbose
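The command above expects a parallel.config.toml file, which this page reuses from the genome assembly step and does not show. As a rough sketch only, such a file might look like the one written below. It assumes the parsl config layout used by the QIIME 2 framework; the executor class, label, and thread count are placeholder values, and the key names should be checked against your installed version. The file is written under an example name so that an existing config is not overwritten.

```shell
#!/usr/bin/env bash

# Assumption: key names follow the QIIME 2 parsl config layout.
# The executor class and max_threads value are placeholders, not recommendations.
# Written to an example filename so an existing parallel.config.toml stays untouched.
cat > parallel.config.example.toml <<'EOF'
[parsl]
strategy = "None"

[[parsl.executors]]
class = "ThreadPoolExecutor"
label = "default"
max_threads = 4
EOF

echo "wrote parallel.config.example.toml"
```

If the layout matches your setup, copy the example to parallel.config.toml and adjust the executor settings to your machine.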

Read mapping

You can speed up this action by taking advantage of parsl parallelization support, again using the same config as for genome assembly. To run without parallelization, omit the --parallel-config option.

You can then run the action in the following way:

mosh assembly map-reads \
    --i-index contigs-index.qza \
    --i-reads reads.qza \
    --p-threads 2 \
    --p-seed 100 \
    --o-alignment-maps reads-to-contigs-aln.qza \
    --parallel-config parallel.config.toml \
    --verbose
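Read mapping depends on the outputs of earlier steps: the contig index and the original reads. The snippet below is a small sketch of a pre-flight check you could run before a long mapping job. The check_artifacts function is a hypothetical helper, not part of MOSHPIT; it only tests that the files exist and are non-empty, and does not validate their contents.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of MOSHPIT): succeed only if every given
# file exists and is non-empty; report each problem on stderr.
check_artifacts() {
    local f status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Inputs of the read-mapping step, using the filenames from this tutorial.
if check_artifacts contigs-index.qza reads.qza; then
    echo "inputs present, safe to run map-reads"
else
    echo "create the missing artifacts first" >&2
fi
```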

Contig binning

Finally, we are ready to perform contig binning. This process involves categorizing contigs into distinct bins or groups based on their likely origin from different microbial species or strains within a mixed community. Here, we will use the MetaBAT 2 tool, which uses tetranucleotide frequency together with abundance (coverage) information to assign contigs to individual bins.

mosh annotate bin-contigs-metabat \
    --i-contigs contigs.qza \
    --i-alignment-maps reads-to-contigs-aln.qza \
    --p-num-threads 4 \
    --p-seed 100 \
    --o-mags mags.qza \
    --o-contig-map contig-map.qza \
    --o-unbinned-contigs unbinned-contigs.qza \
    --parallel-config parallel.config.toml \
    --verbose
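To make the tetranucleotide-frequency signal mentioned above more concrete, the sketch below counts overlapping 4-mers in a single sequence. It is a simplified illustration of the idea only, not MetaBAT 2's implementation: it does not merge reverse complements, normalize the counts, or use coverage. The tetra_counts function name is made up for this example.

```shell
#!/usr/bin/env bash

# Illustrative helper: print each 4-mer found in a DNA string with its count.
# Windows containing characters other than A, C, G, T are skipped.
tetra_counts() {
    printf '%s\n' "$1" | awk '
    {
        seq = toupper($0)
        # Slide a window of width 4 across the sequence, one base at a time.
        for (i = 1; i + 3 <= length(seq); i++) {
            kmer = substr(seq, i, 4)
            if (kmer ~ /^[ACGT]+$/) count[kmer]++
        }
    }
    END {
        for (k in count) print k "\t" count[k]
    }' | LC_ALL=C sort
}

tetra_counts "ACGTACGTAC"
```

Contigs from the same genome tend to have similar 4-mer profiles, which is why a binner can combine this kind of composition signal with coverage to group them.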

MAG quality control

Once we have our contigs binned into metagenome-assembled genomes (MAGs), we need to check the quality of those bins. Several tools can be used for this purpose, many of which use single-copy marker genes to estimate the completeness and purity (or contamination) of the recovered genomes. Here, we will use BUSCO, which uses a set of curated orthologous genes to estimate those metrics.

We begin by fetching the required BUSCO database. We know that all species in our samples are bacteria, so we can fetch only this lineage to save space and resources:

mosh annotate fetch-busco-db \
    --p-lineages bacteria_odb12 \
    --o-db busco-db-bacteria.qza \
    --verbose

Next, we use the database we fetched to run BUSCO with our recovered MAGs as input:

You can speed up this action by taking advantage of parsl parallelization support, using the same config as for genome assembly. The first command below runs with parsl parallelization; the second runs without it:

mosh annotate evaluate-busco \
    --i-mags mags.qza \
    --i-db busco-db-bacteria.qza \
    --p-lineage-dataset bacteria_odb12 \
    --p-cpu 2 \
    --o-results busco-results.qza \
    --o-visualization mags.qzv \
    --parallel-config parallel.config.toml \
    --verbose

mosh annotate evaluate-busco \
    --i-mags mags.qza \
    --i-db busco-db-bacteria.qza \
    --p-lineage-dataset bacteria_odb12 \
    --p-cpu 2 \
    --o-results busco-results.qza \
    --o-visualization mags.qzv \
    --verbose

The action above generated a result table and a neat visualization which will allow us to investigate the quality of all our genomes.

Your visualization should look similar to this one.

Now that we have evaluated the quality of our MAGs, we can use this information to keep only the best ones. We do not want to continue with low-quality MAGs (incomplete or highly contaminated), as they may lead to incorrect results in downstream analyses. We will keep the MAGs that are at least 50% complete and have less than 10% contamination (these are considered medium quality according to the MIMAG standard). We can achieve this with the following action:

mosh annotate filter-mags \
    --i-mags mags.qza \
    --m-metadata-file busco-results.qza \
    --p-where "completeness>=50 AND contamination<10" \
    --p-on "mag" \
    --o-filtered-mags mags-filtered.qza \
    --verbose
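As a way to see what the thresholds described above select (at least 50% complete, less than 10% contamination), the sketch below applies the same rule to a small tab-separated table with awk. The table layout, the MAG IDs, and all values are invented purely for illustration; they are not BUSCO output, and filter_medium_quality is a hypothetical helper, not a MOSHPIT command.

```shell
#!/usr/bin/env bash

# Keep the header plus rows with completeness >= 50 and contamination < 10.
# Assumes a tab-separated file with columns: id, completeness, contamination.
filter_medium_quality() {
    awk -F '\t' 'NR == 1 { print; next } ($2 + 0) >= 50 && ($3 + 0) < 10' "$1"
}

# Toy table: IDs and values are invented purely for illustration.
toy_table=$(mktemp)
printf 'id\tcompleteness\tcontamination\n' >  "$toy_table"
printf 'mag-a\t92.5\t1.2\n'                >> "$toy_table"
printf 'mag-b\t48.0\t0.5\n'                >> "$toy_table"
printf 'mag-c\t75.0\t12.3\n'               >> "$toy_table"
printf 'mag-d\t50.0\t9.9\n'                >> "$toy_table"

filter_medium_quality "$toy_table"   # keeps the header, mag-a and mag-d
```

Here mag-b is dropped for low completeness and mag-c for high contamination, while mag-d sits exactly on the completeness threshold and is kept.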

References

Kang, D. D., Li, F., Kirton, E., Thomas, A., Egan, R., An, H., & Wang, Z. (2019). MetaBAT 2: An Adaptive Binning Algorithm for Robust and Efficient Genome Reconstruction from Metagenome Assemblies. PeerJ, 2019(7). 10.7717/peerj.7359
Tegenfeldt, F., Kuznetsov, D., Manni, M., Berkeley, M., Zdobnov, E. M., & Kriventseva, E. V. (2024). OrthoDB and BUSCO update: annotation of orthologs with wider sampling of genomes. Nucleic Acids Research, 53(D1), D516–D522. 10.1093/nar/gkae987
Bowers, R. M., Kyrpides, N. C., Stepanauskas, R., Harmon-Smith, M., Doud, D., Reddy, T. B. K., Schulz, F., Jarett, J., Rivers, A. R., Eloe-Fadrosh, E. A., Tringe, S. G., Ivanova, N. N., Copeland, A., Clum, A., Becraft, E. D., Malmstrom, R. R., Birren, B., Podar, M., Bork, P., … Woyke, T. (2017). Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nature Biotechnology, 35(8), 725–731. 10.1038/nbt.3893