Genome assembly - MOSHPIT documentation

The first step in recovering metagenome-assembled genomes (MAGs) is genome assembly itself. Many genome assemblers are available, two of which you can use through our MOSHPIT plugin. Here, we will use MEGAHIT. MEGAHIT takes short DNA sequencing reads, constructs a simplified de Bruijn graph, and generates longer contiguous sequences called contigs, which provide the genetic information needed for the next steps of our analysis.
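To make the idea of graph-based assembly more concrete, here is a minimal toy sketch in Python. It is not how MEGAHIT is implemented (MEGAHIT uses a succinct de Bruijn graph and iterates over several k-mer sizes); the function names, the reads, and the choice of k = 4 are all hypothetical and only illustrate how overlapping k-mers can be chained into a contig. The sketch ignores sequencing errors and reverse complements.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Follow unambiguous edges from `start`, appending one base per step."""
    # Count incoming edges so we can stop where two paths merge.
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    contig, node, seen = start, start, {start}
    # Continue only while the current node has exactly one successor.
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if in_degree[nxt] != 1 or nxt in seen:
            break  # branch point or cycle: end the contig here
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads, chosen for illustration only.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
print(contig)
```

Real assemblers additionally remove tips and bubbles caused by sequencing errors, which is part of what "simplified" refers to above.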

The reads generated for this tutorial can be downloaded using the following command:

wget -O reads.qza \
    https://polybox.ethz.ch/index.php/s/rgXpDtCMgRgyeKB/download

You can run the assembly with or without parsl parallelization. Taking advantage of parsl parallelization support can speed up the assembly. The config for local execution could look like this:

parallel.config.toml

[parsl]

[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"

[parsl.executors.provider]
class = "LocalProvider"
max_blocks = 2

You can then run the action in the following way:

mosh assembly assemble-megahit \
    --i-reads reads.qza \
    --p-presets meta-sensitive \
    --p-num-cpu-threads 2 \
    --p-min-contig 500 \
    --o-contigs contigs.qza \
    --parallel-config parallel.config.toml \
    --verbose

Once the reads are assembled into contigs, we can use QUAST to evaluate the quality of our assembly. Many metrics can be used for that purpose, but here we will focus on the two most popular ones: N50 and L50. In addition to calculating generic statistics like N50 and L50, QUAST will try to identify potential genomes from which the analyzed contigs originated. Alternatively, we can provide it with a set of reference genomes to run the analysis against. Since we generated the reads from an “artificial” mock community, we will provide the reference sequences for those genomes, which will save us a bit of work and time.
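As a reminder of what these two metrics mean: if contigs are sorted from longest to shortest, N50 is the length of the contig at which the running total first reaches at least half of the total assembly length, and L50 is the number of contigs needed to get there. The following Python sketch computes both from a list of contig lengths. The function name and the example lengths are hypothetical, and the numbers QUAST reports may differ slightly because QUAST applies its own filtering (for example, the minimum contig length).

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for an iterable of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count


# Hypothetical contig lengths (in bp), for illustration only.
example_lengths = [100, 80, 50, 40, 30]
n50, l50 = n50_l50(example_lengths)
print(f"N50 = {n50}, L50 = {l50}")
```

A higher N50 together with a lower L50 generally indicates a more contiguous assembly, which is why these values are useful when comparing samples.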

First, fetch the reference genomes that QUAST will compare our contigs against:

wget -O reference-genomes.qza \
    https://polybox.ethz.ch/index.php/s/dRdDSZJcxH4LRgk/download

Then, run the following command to assess the quality of contigs assembled in the previous step:

mosh assembly evaluate-quast \
    --i-contigs contigs.qza \
    --i-references reference-genomes.qza \
    --p-threads 4 \
    --p-min-contig 500 \
    --o-ref-genomes ref-genomes.qza \
    --o-results-table quast-results.qza \
    --o-visualization contigs-qc.qzv \
    --verbose

Your visualization should look similar to this one.

There are many things to look at here! Don’t worry, though: it is not our goal to understand all of that information. Let’s focus on the metrics we mentioned above. You will find them on the “QC report” tab, in the colorful table right above the plot. Click on the “Extended report” and use the values you find there to answer the checkpoint questions below.

Solution to Exercise 1

sample4: N50 = ~84797

Solution to Exercise 2

sample1: L50 = ~2758

Solution to Exercise 3

sample1: ~308

Solution to Exercise 4

~77570 bp

Solution to Exercise 5

sample4: it has the best N50 value and comprises a (relatively) small number of large contigs with the fewest mismatches and misassemblies.

References

Li, D., Liu, C. M., Luo, R., Sadakane, K., & Lam, T. W. (2015). MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 31(10), 1674–1676. 10.1093/bioinformatics/btv033
Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072–1075. 10.1093/bioinformatics/btt086
