
Recovery of MAGs

Authors
Affiliations
Bokulich Lab
Bokulich Lab

In this part of the tutorial we will go through the steps required to recover metagenome-assembled genomes (MAGs) from metagenomic data. The workflow is divided into several steps, from contig assembly to binning and quality control.

Assemble contigs with MEGAHIT

The first step in recovering MAGs is assembly. There are many genome assemblers available, two of which you can use through the q2-assembly plugin. Here, we will use MEGAHIT. MEGAHIT takes short DNA sequencing reads, constructs a succinct de Bruijn graph, and generates longer contiguous sequences called contigs, which serve as the input for the next steps of our analysis.

  • The --p-presets parameter specifies the preset mode for MEGAHIT. In this case, it is set to “meta-sensitive”, a preset intended for metagenomic data.
  • The --p-num-cpu-threads parameter specifies the number of CPU threads to use during assembly.
  • The --p-min-contig-len parameter specifies the minimum length of contigs to keep.
  • Alternatively, you can use mosh assembly assemble-spades to assemble contigs with SPAdes.
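To give some intuition for the de Bruijn graph idea mentioned above, here is a toy Python sketch. It is not how MEGAHIT is implemented (MEGAHIT uses a far more compact data structure and handles errors, branches, and multiple k-mer sizes); the reads, the value of k, and the function names are all made up for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Link each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while there is exactly one way forward.

    Simplification: only out-degree is checked; real assemblers also
    consider in-degree, coverage, and sequencing errors.
    """
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:  # stop on cycles
            break
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig


# Three overlapping toy reads (hypothetical data).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
print(walk_unambiguous(graph, "ATG"))
```

In this toy case the three reads overlap cleanly, so the walk reconstructs a single contig that is longer than any individual read.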

Contig QC with QUAST

Once the reads are assembled into contigs, we can use QUAST to evaluate the quality of our assembly. There are many metrics that can be used for that purpose but here we will focus on the two most popular metrics: N50 and L50.
In addition to calculating generic statistics like N50 and L50, QUAST will try to identify potential genomes from which the analyzed contigs originated. Alternatively, we can provide it with a set of reference genomes we would like it to run the analysis against using --i-references.

mosh assembly evaluate-quast \
    --i-contigs cache:contigs \
    --p-threads 7 \
    --p-memory-efficient \
    --o-visualization results/contigs.qzv \
    --verbose

Your visualization should look similar to this one.
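To make N50 and L50 concrete: N50 is the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the total assembly length, and L50 is the number of contigs needed to get there. The Python sketch below computes both from a list of contig lengths; the function name and the example lengths are illustrative and are not part of QUAST.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # `length` is the N50, `count` is the L50.
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly with five contigs (total length 300).
n50, l50 = n50_l50([100, 80, 60, 40, 20])
print(f"N50 = {n50}, L50 = {l50}")
```

A higher N50 together with a lower L50 generally indicates a more contiguous assembly, although neither metric says anything about correctness.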

Index contigs

In this step, we generate an index for the assembled contigs. This index is required for mapping reads to the contigs later. Various parameters control the size and structure of the index, as well as resource usage.

Map reads to contigs

Here we map the input paired-end reads to the indexed contigs created in the previous step. We use various alignment settings to ensure optimal mapping, including local alignment mode and sensitivity settings.

Bin contigs with MetaBAT

Binning groups assembled contigs into MAGs. This step uses MetaBAT, which assigns contigs to bins based on their abundance (read coverage) and sequence composition (tetranucleotide frequency), producing MAG files that represent putative genomes.
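One of the sequence features MetaBAT's published method relies on, alongside coverage, is tetranucleotide frequency: contigs from the same genome tend to have similar 4-mer usage. The Python sketch below computes a simple frequency profile for one sequence. It is a simplified illustration only: the function name is hypothetical, and unlike real binners it does not merge reverse-complement 4-mers.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each of the 256 A/C/G/T 4-mers."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # 4-mers containing other characters (e.g. N) are ignored.
    total = sum(counts[k] for k in kmers)
    if total == 0:
        return {k: 0.0 for k in kmers}
    return {k: counts[k] / total for k in kmers}


# Hypothetical toy contig.
profile = tetranucleotide_frequencies("ACGTACGT")
print(profile["ACGT"], profile["CGTA"])
```

Profiles like this can be compared between contigs with a distance measure; contigs with similar profiles and similar coverage are candidates for the same bin.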

mosh annotate bin-contigs-metabat \
    --i-contigs cache:contigs \
    --i-alignment-maps cache:reads_to_contigs \
    --p-num-threads 64 \
    --p-seed 100 \
    --p-verbose \
    --o-mags cache:mags \
    --o-contig-map cache:contig_map \
    --o-unbinned-contigs cache:unbinned_contigs \
    --verbose

This step generated several artifacts:

  • mags: these are our actual MAGs, per sample.
  • contig_map: this is a mapping between MAG IDs and the IDs of the contigs that belong to a given MAG.
  • unbinned_contigs: these are all the contigs that could not be assigned to any particular MAG.

From now on, we will focus on the mags.

Evaluate bins with BUSCO

This step evaluates the completeness and quality of MAGs using the BUSCO tool, which checks for the presence of single-copy orthologs. The evaluation helps ensure the quality of the recovered MAGs.

First we will use mosh annotate fetch-busco-db to download a specific lineage’s BUSCO database. BUSCO databases are precompiled collections of orthologous genes, tailored to specific lineages such as viruses, prokaryotes (bacteria and archaea), eukaryotes or more specific ones like bacteria_odb12.

  • The --p-lineages parameter set to bacteria_odb12 specifies that we want to download the bacterial dataset.

mosh annotate fetch-busco-db \
    --p-lineages bacteria_odb12 \
    --o-db cache:busco_db \
    --verbose

Once the appropriate BUSCO database is fetched, the next step is to evaluate the completeness and quality of the MAGs.

The --p-lineage-dataset bacteria_odb12 parameter specifies the particular lineage dataset to use, in this case, the bacteria_odb12 dataset. This is a standard database for bacterial genomes.

Your visualization should look similar to this one.
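For orientation, BUSCO's headline percentages are derived from how many of the expected marker genes are found as single-copy, duplicated, fragmented, or missing, with "complete" being the sum of single-copy and duplicated. The Python sketch below reproduces that arithmetic; the function name and the counts are hypothetical and do not come from the bacteria_odb12 dataset.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert BUSCO marker counts into percentages of the total marker set."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one marker is required")

    def pct(count):
        return round(100 * count / total, 1)

    return {
        "complete": pct(single + duplicated),
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }


# Hypothetical MAG evaluated against a 200-marker dataset.
print(busco_percentages(single=150, duplicated=10, fragmented=15, missing=25))
```

A high duplicated percentage can hint at contamination or at several strains merged into one bin, so it is worth reading alongside completeness.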

Filter MAGs

This step filters MAGs based on completeness. In this example, we keep only MAGs whose BUSCO completeness is above 50% and discard the rest. Filtering removes the most incomplete bins so that downstream analysis runs on more reliable genomes.

mosh annotate filter-mags \
    --i-mags cache:mags \
    --m-metadata-file cache:busco_results \
    --p-where 'complete>50' \
    --p-no-exclude-ids \
    --p-on mag \
    --o-filtered-mags cache:mags_filtered_50 \
    --verbose
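The --p-where expression follows SQLite WHERE-clause syntax, as QIIME 2 metadata filters generally do. As a rough illustration of how 'complete>50' selects MAGs, the Python sketch below runs the same condition against a small in-memory SQLite table. The table name, MAG IDs, and completeness values are invented for the example, and the real plugin works on the BUSCO results metadata rather than a table like this.

```python
import sqlite3

# Hypothetical BUSCO results: (MAG ID, completeness in percent).
rows = [("mag_a", 97.6), ("mag_b", 50.0), ("mag_c", 12.1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE busco (id TEXT PRIMARY KEY, complete REAL)")
conn.executemany("INSERT INTO busco VALUES (?, ?)", rows)

# Same condition as --p-where 'complete>50'.
kept = [
    row[0]
    for row in conn.execute("SELECT id FROM busco WHERE complete>50 ORDER BY id")
]
conn.close()

print(kept)
```

Note that the comparison is strict, so a MAG with exactly 50% completeness is dropped; use 'complete>=50' if such MAGs should be kept.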