MAG dereplication - MOSHPIT documentation

Since our samples were generated from the same mock community (i.e., we know they most likely contain the same set of genomes), we can simplify our MAG collection by performing dereplication, similarly to how you do it for 16S amplicon sequences.

We begin by computing hash sketches of every genome using sourmash - you can think of those sketches as tiny representations of our genomes (sourmash compresses a lot of information into a much smaller space):

qiime sourmash compute \
    --i-sequence-file mags-filtered.qza \
    --p-ksizes 105 \
    --p-scaled 100 \
    --o-min-hash-signature min-hash.qza \
    --verbose

Then, we compare all of those sketches (genomes) to one another to generate a matrix of pairwise distances between our MAGs:

qiime sourmash compare \
    --i-min-hash-signature min-hash.qza \
    --p-ksize 105 \
    --o-compare-output min-hash-compare.qza \
    --verbose

Finally, we dereplicate the genomes using the distance matrix and a fixed similarity threshold—the last action will simply choose the most complete genome from all the genomes belonging to the same cluster, given a similarity threshold:

qiime annotate dereplicate-mags \
    --i-mags mags-filtered.qza \
    --i-distance-matrix min-hash-compare.qza \
    --m-metadata-file busco-results.qza \
    --p-metadata-column completeness \
    --p-threshold 0.9 \
    --o-dereplicated-mags mags-derep.qza \
    --o-table table.qza \
    --verbose

References¶

Pierce, N. T., Irber, L., Reiter, T., Brooks, P., & Brown, C. T. (2019). Large-scale sequence comparisons with sourmash. F1000Research, 8, 1006. 10.12688/f1000research.19675.1

Tutorials

MAG binning

Tutorials

Taxonomic classification