Data retrieval - MOSHPIT documentation

The dataset used in this tutorial is available through the NCBI Sequence Read Archive (SRA). To retrieve it, we will use the q2-fondue plugin, which provides programmatic access to sequences and metadata from SRA. We only need to provide a list of accession IDs to download; q2-fondue will take care of the rest.

Download the files containing the accession IDs and the corresponding metadata:

wget -O ids.tsv \
    https://raw.githubusercontent.com/bokulich-lab/moshpit-docs/main/docs/data/ids.tsv

wget -O metadata.tsv \
    https://raw.githubusercontent.com/bokulich-lab/moshpit-docs/main/docs/data/metadata.tsv
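
Before importing, it can be worth confirming that both downloads succeeded. The following is a minimal sketch, not part of the MOSHPIT tooling: `count_rows` is a hypothetical helper, and it assumes each TSV file starts with a single header line followed by one record per line.

```shell
# Hypothetical helper: count non-blank data rows in a TSV file,
# assuming the first line is a header (an assumption about these files).
count_rows() {
    tail -n +2 "$1" | grep -c -v '^[[:space:]]*$' || true
}

# Report the header and row count of each downloaded file, if present.
for f in ids.tsv metadata.tsv; do
    if [ -s "$f" ]; then
        echo "$f: $(count_rows "$f") data rows; header: $(head -n 1 "$f")"
    else
        echo "$f is missing or empty; re-run the download" >&2
    fi
done
```

If a file is reported as missing or empty, the most likely cause is an interrupted or mistyped `wget` command.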

Create a QIIME 2 cache in the current working directory:
```
mosh tools cache-create --cache cache
```

Import the file with the accession IDs into the cache as a QIIME 2 artifact:

mosh tools cache-import \
    --type 'NCBIAccessionIDs' \
    --input-path ids.tsv \
    --cache cache \
    --key ids

Run the get-all action from the fondue plugin:

With parsl parallelization

To use the parsl support built into the fondue plugin, you first need to prepare a parsl config (see here to learn more about parsl parallelization and here to learn about fetching large datasets). To run the action on an HPC cluster, the config could look like this:

fondue.config.toml

[parsl]

[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"

[parsl.executors.provider]
class = "SlurmProvider"
scheduler_options = "#SBATCH --mem-per-cpu=4G --tmp=5GB"
worker_init = "source ~/.bashrc && conda activate qiime2-moshpit-2025.10"
walltime = "6:00:00"
nodes_per_block = 1
cores_per_node = 1
max_blocks = 14
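
The provider settings determine how many cores can be requested from Slurm at once: at most `max_blocks` jobs, each spanning `nodes_per_block` nodes with `cores_per_node` cores. As a rough sketch, the upper bound can be computed from the config file. The `toml_int` and `max_cores` helpers below are hypothetical, and they assume all three keys appear as plain `key = value` lines, as in the example above.

```shell
# Hypothetical helper: read an integer "key = value" setting from a flat
# TOML file (no real TOML parsing; assumes one setting per line).
toml_int() {
    awk -F'=' -v key="$2" '
        { k = $1; gsub(/[[:space:]]/, "", k) }
        k == key { v = $2; gsub(/[[:space:]]/, "", v); print v; exit }
    ' "$1"
}

# Upper bound on cores requested concurrently:
# max_blocks * nodes_per_block * cores_per_node.
max_cores() {
    blocks=$(toml_int "$1" max_blocks)
    nodes=$(toml_int "$1" nodes_per_block)
    cores=$(toml_int "$1" cores_per_node)
    echo $(( blocks * nodes * cores ))
}

if [ -f fondue.config.toml ]; then
    echo "Upper bound on requested cores: $(max_cores fondue.config.toml)"
fi
```

For the config shown above this gives 14 × 1 × 1 = 14 cores, so raising `max_blocks` is the most direct way to scale out.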

You can then run the action in the following way:

mosh fondue get-all \
    --i-accession-ids cache:ids \
    --p-email YOUR.EMAIL@domain.com \
    --p-threads 5 \
    --p-retries 5 \
    --o-paired-reads cache:reads_paired \
    --o-metadata cache:metadata \
    --o-single-reads cache:reads_single \
    --o-failed-runs cache:failed_runs \
    --parallel-config fondue.config.toml \
    --verbose

Without parallelization

mosh fondue get-all \
    --i-accession-ids cache:ids \
    --p-email YOUR.EMAIL@domain.com \
    --p-threads 5 \
    --p-retries 5 \
    --o-paired-reads cache:reads_paired \
    --o-metadata cache:metadata \
    --o-single-reads cache:reads_single \
    --o-failed-runs cache:failed_runs \
    --verbose

References

Ziemski, M., Adamov, A., Kim, L., Flörl, L., & Bokulich, N. A. (2022). Reproducible acquisition, management, and meta-analysis of nucleotide sequence (meta)data using q2-fondue. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac639