
How to fetch large datasets

Datasets originating from shotgun metagenomics experiments can demand substantial storage capacity. Fetching them from online archives like the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA) can therefore take a long time if the download process is not parallelized. The q2-fondue plugin allows QIIME 2 users to easily fetch datasets from the SRA by simply providing a list of accession IDs. Since the 2025.7 release of the plugin, it is also possible to accelerate the download process using the built-in parsl parallelization support.

We are assuming here that you have already imported your accession IDs into an NCBIAccessionIDs artifact (see here for more info). You will first need to construct a parsl config matching your resource requirements:
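If you have not yet prepared the accession ID file, the input is typically a TSV with a single ID column. A minimal sketch; the file name and accessions are illustrative, and the header name "id" follows general QIIME 2 metadata conventions, so check the q2-fondue import docs for the exact format expected by NCBIAccessionIDs:

```shell
# Sketch: write SRA run accessions into a TSV file.
# The header name "id" is an assumption based on QIIME 2 metadata
# conventions; verify it against the q2-fondue documentation.
printf 'id\nSRR000001\nSRR000002\n' > accession-ids.tsv
cat accession-ids.tsv
```

This file would then be imported into an NCBIAccessionIDs artifact (e.g. via qiime tools import) before running get-sequences.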

Scenario 1: you are fetching on a local machine

[parsl]

[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"

[parsl.executors.provider]
class = "LocalProvider"
max_blocks = 5

Scenario 2: you are fetching on an HPC

[parsl]

[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"

[parsl.executors.provider]
class = "SlurmProvider"
worker_init = "source ~/.bashrc && conda activate q2-moshpit-2025.10"
walltime = "6:00:00"
nodes_per_block = 1
cores_per_node = 1
mem_per_node = 4
max_blocks = 5

Both of the above configurations will result in a pool of five workers with one CPU each. You can then run the get-sequences action in the following way:

mosh fondue get-sequences \
    --i-accession-ids <path to the accession ids> \
    --p-email <your e-mail address> \
    --p-threads <count of threads> \
    --parallel-config <path to the parsl config> \
    --output-dir <path to the results directory>