Recovery of MAGs - MOSHPIT documentation

In this part of the tutorial we will go through the steps required to recover metagenome-assembled genomes (MAGs) from metagenomic data. The workflow is divided into several steps, from contig assembly to binning and quality control.

Assemble contigs with MEGAHIT

The first step in recovering MAGs is genome assembly itself. There are many genome assemblers available, two of which you can use through the q2-assembly plugin - here, we will use MEGAHIT. MEGAHIT takes short DNA sequencing reads, constructs a simplified De Bruijn graph, and generates longer contiguous sequences called contigs, providing valuable genetic information for the next steps of our analysis.
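To make the idea of a De Bruijn graph more concrete, here is a toy sketch in Python. It is not how MEGAHIT is implemented (MEGAHIT uses far more compact data structures and handles sequencing errors, branches and reverse complements); the function names, the reads and the value of k are all made up for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy De Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its prefix (k-1)-mer to its suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop on cycles
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Hypothetical overlapping short reads taken from one sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

The three 6 bp reads overlap, so walking the unbranched path through the graph merges them into a single 10 bp contig.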

The --p-presets parameter specifies the preset mode for MEGAHIT. Here it is set to “meta-sensitive”, a preset intended for metagenomic data.
The --p-num-cpu-threads parameter specifies the number of CPU threads to use during assembly.
The --p-min-contig-len parameter specifies the minimum length (in base pairs) of contigs to keep.

With parsl parallelization

You can speed up the assembly by taking advantage of the parsl parallelization support (see here to learn more). The config could look like this (to run the action on an HPC):

assembly.config.toml

[parsl]

[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"

[parsl.executors.provider]
class = "SlurmProvider"
scheduler_options = "#SBATCH --mem-per-cpu=16G"
worker_init = "source ~/.bashrc && conda activate qiime2-moshpit-2025.10"
walltime = "12:00:00"
nodes_per_block = 1
cores_per_node = 8
max_blocks = 14

You can then run the action in the following way:

mosh assembly assemble-megahit \
    --i-reads cache:reads_filtered \
    --p-presets meta-sensitive \
    --p-num-cpu-threads 8 \
    --p-min-contig-len 200 \
    --o-contigs cache:contigs \
    --parallel-config assembly.config.toml \
    --verbose

Without parallelization

mosh assembly assemble-megahit \
    --i-reads cache:reads_filtered \
    --p-presets meta-sensitive \
    --p-num-cpu-threads 8 \
    --p-min-contig-len 200 \
    --o-contigs cache:contigs \
    --verbose

Alternatively, you can use mosh assembly assemble-spades to assemble contigs with SPAdes.

Contig QC with QUAST

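Once the reads are assembled into contigs, we can use QUAST to evaluate the quality of our assembly. Many metrics can be used for that purpose, but here we will focus on two widely used ones: N50 and L50. With contigs sorted from longest to shortest, L50 is the smallest number of contigs whose combined length covers at least half of the total assembly length, and N50 is the length of the shortest contig in that set.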
In addition to calculating generic statistics like N50 and L50, QUAST will try to identify potential genomes from which the analyzed contigs originated. Alternatively, we can provide it with a set of reference genomes we would like it to run the analysis against using --i-references.
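For intuition, N50 and L50 can be computed by hand from a list of contig lengths. The following is a minimal sketch with made-up lengths; QUAST computes these (and many other) statistics for us, so this is only meant to show what the numbers mean.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that pushes the running sum past half of the
        # assembly defines N50 (its length) and L50 (its rank).
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths in base pairs.
lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(lengths)
print(n50, l50)  # 80 2
```

Here the total assembly length is 300 bp, and the two longest contigs (100 + 80 = 180 bp) already cover half of it, so L50 is 2 and N50 is 80.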

mosh assembly evaluate-quast \
    --i-contigs cache:contigs \
    --p-threads 7 \
    --p-memory-efficient \
    --o-visualization results/contigs.qzv \
    --verbose

Your visualization should look similar to this one.

Index contigs

In this step, we generate an index for the assembled contigs. This index is required for mapping reads to the contigs later. Various parameters control the size and structure of the index, as well as resource usage.

With parsl parallelization

You can speed up the indexing by taking advantage of the parsl parallelization support (see here to learn more). The config could look like this (to run the action on an HPC):

indexing.config.toml

[parsl]

[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"

[parsl.executors.provider]
class = "SlurmProvider"
scheduler_options = "#SBATCH --mem-per-cpu=4G"
worker_init = "source ~/.bashrc && conda activate qiime2-moshpit-2025.10"
walltime = "4:00:00"
nodes_per_block = 1
cores_per_node = 8
max_blocks = 14

You can then run the action in the following way:

mosh assembly index-contigs \
    --i-contigs cache:contigs \
    --p-threads 8 \
    --o-index cache:contigs_index \
    --parallel-config indexing.config.toml \
    --verbose

Without parallelization

mosh assembly index-contigs \
    --i-contigs cache:contigs \
    --p-threads 8 \
    --o-index cache:contigs_index \
    --verbose

Map reads to contigs

Here we map the input paired-end reads to the indexed contigs created in the previous step. We use various alignment settings to ensure optimal mapping, including local alignment mode and sensitivity settings.

With parsl parallelization

You can speed up the mapping by taking advantage of the parsl parallelization support (see here to learn more). The config could look like this (to run the action on an HPC):

mapping.config.toml

[parsl]

[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"

[parsl.executors.provider]
class = "SlurmProvider"
scheduler_options = "#SBATCH --mem-per-cpu=16G"
worker_init = "source ~/.bashrc && conda activate qiime2-moshpit-2025.10"
walltime = "12:00:00"
nodes_per_block = 1
cores_per_node = 8
max_blocks = 14

You can then run the action in the following way:

mosh assembly map-reads \
    --i-index cache:contigs_index \
    --i-reads cache:reads_filtered \
    --o-alignment-maps cache:reads_to_contigs \
    --parallel-config mapping.config.toml \
    --verbose

Without parallelization

mosh assembly map-reads \
    --i-index cache:contigs_index \
    --i-reads cache:reads_filtered \
    --o-alignment-maps cache:reads_to_contigs \
    --verbose

Bin contigs with MetaBAT

Binning groups assembled contigs into MAGs. This step uses MetaBAT to assign contigs to bins based on co-abundance and sequence composition (tetranucleotide frequencies), producing MAG files that represent putative genomes.

mosh annotate bin-contigs-metabat \
    --i-contigs cache:contigs \
    --i-alignment-maps cache:reads_to_contigs \
    --p-num-threads 64 \
    --p-seed 100 \
    --p-verbose \
    --o-mags cache:mags \
    --o-contig-map cache:contig_map \
    --o-unbinned-contigs cache:unbinned_contigs \
    --verbose

This step generated several artifacts:

mags: these are our actual MAGs, per sample.
contig_map: this is a mapping between MAG IDs and IDs of contigs that belong to a given MAG.
unbinned_contigs: these are all the contigs that could not be assigned to any particular MAG.

From now on, we will focus on the mags.
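To picture how these three outputs relate to each other, here is a small Python sketch. The IDs and the plain-dictionary layout are hypothetical and only illustrate the relationship; they do not reflect the on-disk format of the artifacts.

```python
# Hypothetical contig map: MAG ID -> IDs of the contigs binned into that MAG.
contig_map = {
    "mag_a": ["contig_1", "contig_4"],
    "mag_b": ["contig_2"],
}

# Hypothetical list of all assembled contigs for the sample.
all_contigs = ["contig_1", "contig_2", "contig_3", "contig_4"]

# Invert the map to look up which MAG a contig belongs to.
contig_to_mag = {
    contig: mag for mag, contigs in contig_map.items() for contig in contigs
}

# Contigs that appear in no MAG are the unbinned ones.
unbinned = [contig for contig in all_contigs if contig not in contig_to_mag]

print(contig_to_mag["contig_4"])  # mag_a
print(unbinned)  # ['contig_3']
```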

Evaluate bins with BUSCO

This step evaluates the completeness and quality of MAGs using the BUSCO tool, which checks for the presence of single-copy orthologs. The evaluation helps ensure the quality of the recovered MAGs.
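BUSCO classifies each expected ortholog as complete (single-copy or duplicated), fragmented, or missing, and reports each class as a percentage of all orthologs searched. The sketch below shows that arithmetic with made-up counts; the function name and the dictionary keys are illustrative and are not the column names of the BUSCO results.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of all orthologs."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one ortholog must be counted")

    def pct(count):
        return round(100 * count / total, 1)

    return {
        # Completeness counts both single-copy and duplicated orthologs.
        "complete": pct(single + duplicated),
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }


# Hypothetical counts for one MAG evaluated against 200 orthologs.
summary = busco_summary(single=150, duplicated=10, fragmented=15, missing=25)
print(summary["complete"])  # 80.0
```

A high percentage of duplicated orthologs can point to contamination, since single-copy genes are expected to appear only once per genome.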

First we will use mosh annotate fetch-busco-db to download a specific lineage’s BUSCO database. BUSCO databases are precompiled collections of orthologous genes, tailored to specific lineages such as viruses, prokaryotes (bacteria and archaea), eukaryotes or more specific ones like bacteria_odb12.

The --p-lineages parameter set to bacteria_odb12 specifies that we want to download the bacterial dataset.

mosh annotate fetch-busco-db \
    --p-lineages bacteria_odb12 \
    --o-db cache:busco_db \
    --verbose

Once the appropriate BUSCO database is fetched, the next step is to evaluate the completeness and quality of the MAGs.

With parsl parallelization

You can speed up this QC step by taking advantage of the parsl parallelization support (see here to learn more). The config could look like this (to run the action on an HPC):

busco.config.toml

[parsl]

[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"

[parsl.executors.provider]
class = "SlurmProvider"
scheduler_options = "#SBATCH --mem-per-cpu=4G"
worker_init = "source ~/.bashrc && conda activate qiime2-moshpit-2025.10"
walltime = "2:00:00"
nodes_per_block = 1
cores_per_node = 8
max_blocks = 14

You can then run the action in the following way:

mosh annotate evaluate-busco \
    --i-mags cache:mags \
    --i-db cache:busco_db \
    --p-lineage-dataset bacteria_odb12 \
    --p-cpu 16 \
    --o-visualization results/mags.qzv \
    --o-results cache:busco_results \
    --parallel-config busco.config.toml \
    --verbose

Without parallelization

mosh annotate evaluate-busco \
    --i-mags cache:mags \
    --i-db cache:busco_db \
    --p-lineage-dataset bacteria_odb12 \
    --p-cpu 16 \
    --o-visualization results/mags.qzv \
    --o-results cache:busco_results \
    --verbose

The --p-lineage-dataset bacteria_odb12 parameter specifies the particular lineage dataset to use, in this case, the bacteria_odb12 dataset. This is a standard database for bacterial genomes.

Your visualization should look similar to this one.

Filter MAGs

This step filters MAGs based on completeness. In this example, we keep only MAGs whose BUSCO completeness is above 50% and discard the rest. Filtering removes the least complete genomes before downstream analysis.

mosh annotate filter-mags \
    --i-mags cache:mags \
    --m-metadata-file cache:busco_results \
    --p-where 'complete>50' \
    --p-no-exclude-ids \
    --p-on mag \
    --o-filtered-mags cache:mags_filtered_50 \
    --verbose
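The effect of the 'complete>50' condition can be pictured with a few lines of Python. This is only a sketch of the selection logic on made-up records; the MAG IDs and values are hypothetical, and the actual filtering is done by the action against the BUSCO results.

```python
# Hypothetical BUSCO results: one record per MAG with its completeness (%).
busco_results = [
    {"mag_id": "mag_a", "complete": 92.3},
    {"mag_id": "mag_b", "complete": 50.0},
    {"mag_id": "mag_c", "complete": 12.5},
]


def filter_by_completeness(records, threshold=50.0):
    """Keep IDs of MAGs whose completeness is strictly above the threshold."""
    return [r["mag_id"] for r in records if r["complete"] > threshold]


kept = filter_by_completeness(busco_results)
print(kept)  # ['mag_a']
```

Note that the comparison is strict: a MAG with exactly 50% completeness is not kept.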
