fondue - MOSHPIT documentation

Plugin Overview¶

This is a QIIME 2 plugin for fetching raw sequencing data and its associated metadata from data archives like SRA.

version: 2025.7.0
website: https://github.com/bokulich-lab/q2-fondue
user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
citations: Ziemski et al., 2022

Actions¶

Name	Type	Short Description
get-metadata	method	Fetch sequence-related metadata based on run, study, BioProject, experiment or sample ID.
-get-sequences	method	Fetch sequences based on run ID.
merge-metadata	method	Merge several metadata files into a single metadata object.
combine-seqs	method	Combine sequences from multiple artifacts.
scrape-collection	method	Scrape Zotero collection for run, study, BioProject, experiment and sample IDs, and associated DOI names.
get-ids-from-query	method	Find SRA run accession IDs based on a search query.
get-sequences	pipeline	Fetch sequences based on run ID.
get-all	pipeline	Fetch sequence-related metadata and sequences of all run, study, BioProject, experiment or sample IDs.

Artifact Classes¶

SRAMetadata

SRAFailedIDs

NCBIAccessionIDs

Formats¶

NCBIAccessionIDsFormat

NCBIAccessionIDsDirFmt

fondue get-metadata¶

Fetch sequence-related metadata based on run, study, BioProject, experiment or sample ID using Entrez. All metadata will be collapsed into one table.

Citations¶

Ziemski et al., 2022; Buchmann & Holmes, 2019

Inputs¶

accession_ids: NCBIAccessionIDs | SRAMetadata | SRAFailedIDs: Artifact containing run, study, BioProject, experiment or sample IDs for which the metadata and/or sequences should be fetched. Associated DOI names can be provided in an optional column and are preserved in get-all and get-metadata actions.[required]
linked_doi: NCBIAccessionIDs: Optional table containing linked DOI names that is only used if accession_ids does not contain any DOI names.[optional]

Parameters¶

email: Str: Your e-mail address (required by NCBI).[required]
threads: Int % Range(1, None): Number of threads to be used for parallelization of the data download from NCBI. Not to be confused with the number of parsl workers which can be configured through the parsl configuration file.[default: 1]
log_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR'): Logging level.[default: 'INFO']
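
The type expressions above can be read as simple value constraints. The following is a minimal sketch of that reading in plain Python, not the QIIME 2 implementation: it assumes the QIIME 2 default that a Range start is inclusive and its end exclusive, with None meaning unbounded. The helper names check_int_range and check_choice are hypothetical.

```python
LOG_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def check_int_range(value, start=None, end=None):
    """Mimic `Int % Range(start, end)`: start inclusive, end exclusive, None = unbounded."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected an integer, got {value!r}")
    if start is not None and value < start:
        raise ValueError(f"{value} is below the minimum of {start}")
    if end is not None and value >= end:
        raise ValueError(f"{value} is not below the upper bound of {end}")
    return value


def check_choice(value, choices):
    """Mimic `Str % Choices(...)`: the value must be one of the listed strings."""
    if value not in choices:
        raise ValueError(f"{value!r} is not one of {choices}")
    return value


threads = check_int_range(1, start=1)         # threads: Int % Range(1, None)
retries = check_int_range(0, start=0)         # retries: Int % Range(0, None)
log_level = check_choice("INFO", LOG_LEVELS)  # log_level: Str % Choices(...)
```

Under this reading, threads=0 is rejected while retries=0 is accepted.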

Outputs¶

metadata: SRAMetadata: Table containing metadata for all the requested IDs.[required]
failed_runs: SRAFailedIDs: List of all run IDs for which fetching metadata failed, with their corresponding error messages.[required]
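
On the command line, QIIME 2 exposes inputs as --i- options, parameters as --p- options and outputs as --o- options, with underscores in names replaced by hyphens. As a rough sketch, the names documented above could be turned into a command like this. The helper build_qiime_command, the file names and the e-mail address are placeholders, not part of the plugin.

```python
import shlex


def build_qiime_command(plugin, action, inputs=None, parameters=None, outputs=None):
    """Assemble a `qiime <plugin> <action>` argument list from documented names.

    Boolean parameters are not handled here; the QIIME 2 CLI passes those as flags.
    """
    argv = ["qiime", plugin, action]
    groups = (("--i-", inputs), ("--p-", parameters), ("--o-", outputs))
    for prefix, group in groups:
        for name, value in (group or {}).items():
            # accession_ids -> --i-accession-ids, failed_runs -> --o-failed-runs
            argv += [prefix + name.replace("_", "-"), str(value)]
    return argv


argv = build_qiime_command(
    "fondue",
    "get-metadata",
    inputs={"accession_ids": "ids.qza"},                       # placeholder file
    parameters={"email": "user@example.com", "threads": 4},    # placeholder values
    outputs={"metadata": "metadata.qza", "failed_runs": "failed_metadata.qza"},
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run, or printed and pasted into a shell.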

fondue -get-sequences¶

Fetch sequence data of all run IDs.

Citations¶

Ziemski et al., 2022; Team, n.d.

Parameters¶

accession_id: Str: Run ID to fetch sequences for.[required]
retries: Int % Range(0, None): Number of retries to fetch sequences.[default: 2]
threads: Int % Range(1, None): Number of threads to be used for parallelization of the data download from NCBI. Not to be confused with the number of parsl workers which can be configured through the parsl configuration file.[default: 1]
log_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR'): Logging level.[default: 'INFO']
restricted_access: Bool: Whether fetching the sequences requires a dbGaP repository key.[default: False]

Outputs¶

single_reads: SampleData[SequencesWithQuality]: Artifact containing single-read fastq.gz files for all the requested IDs.[required]
paired_reads: SampleData[PairedEndSequencesWithQuality]: Artifact containing paired-end fastq.gz files for all the requested IDs.[required]
failed_runs: SRAFailedIDs: List of all run IDs for which fetching sequences failed, with their corresponding error messages.[required]

fondue merge-metadata¶

Merge sequence-related metadata from multiple q2-fondue runs and/or projects into a single metadata file.

Citations¶

Ziemski et al., 2022

Inputs¶

metadata: List[SRAMetadata]: Metadata files to be merged together.[required]

Outputs¶

merged_metadata: SRAMetadata: Merged metadata containing all rows and columns (without duplicates).[required]
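
To illustrate what "all rows and columns (without duplicates)" can mean, here is a minimal sketch that merges tables represented as lists of dictionaries. It is not the plugin's implementation: the function merge_tables, the "ID" key column and the sample values are assumptions made for the example.

```python
def merge_tables(tables, id_column="ID"):
    """Union the rows and columns of several tables (each a list of dict rows).

    Rows sharing the same ID are collapsed into one record; cells that a
    table does not provide are filled with None.
    """
    columns = []  # column order of first appearance
    merged = {}   # ID -> combined record
    for table in tables:
        for row in table:
            for column in row:
                if column not in columns:
                    columns.append(column)
            merged.setdefault(row[id_column], {}).update(row)
    return [{column: record.get(column) for column in columns}
            for record in merged.values()]


first = [
    {"ID": "RUN_1", "Platform": "ILLUMINA"},
    {"ID": "RUN_2", "Platform": "ILLUMINA"},
]
second = [
    {"ID": "RUN_2", "Platform": "ILLUMINA", "Organism": "gut metagenome"},
    {"ID": "RUN_3", "Organism": "soil metagenome"},
]
merged = merge_tables([first, second])
for row in merged:
    print(row)
```

RUN_2 appears in both inputs but only once in the result, and the Organism column from the second table is added to every row.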

fondue combine-seqs¶

Combine paired- or single-end sequences from multiple artifacts, for example obtained by re-fetching failed downloads.

Citations¶

Ziemski et al., 2022

Inputs¶

seqs: List[SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]]: Sequence artifacts to be combined together.[required]

Parameters¶

on_duplicates: Str % Choices('error', 'warn'): Preferred behaviour when duplicated sequence IDs are encountered: "warn" displays a warning and continues by combining the deduplicated samples, while "error" raises an error and aborts further execution.[default: 'error']

Outputs¶

combined_seqs: SampleData[SequencesWithQuality¹ | PairedEndSequencesWithQuality²]: Sequences combined from all input artifacts.[required]
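
The two on_duplicates modes can be sketched as follows, using lists of sequence IDs in place of real artifacts. This is an illustration of the documented behaviour under that simplification, not the plugin's code; combine_sample_ids and the IDs are hypothetical.

```python
import warnings


def combine_sample_ids(artifacts, on_duplicates="error"):
    """Combine ID lists, handling duplicates as described for on_duplicates."""
    if on_duplicates not in ("error", "warn"):
        raise ValueError("on_duplicates must be 'error' or 'warn'")
    combined, duplicates, seen = [], [], set()
    for sample_ids in artifacts:
        for sample_id in sample_ids:
            if sample_id in seen:
                duplicates.append(sample_id)
            else:
                seen.add(sample_id)
                combined.append(sample_id)
    if duplicates:
        message = "Duplicate sequence IDs found: " + ", ".join(sorted(set(duplicates)))
        if on_duplicates == "error":
            raise ValueError(message)  # abort further execution
        warnings.warn(message)         # warn, then continue with deduplicated IDs
    return combined


first = ["RUN_1", "RUN_2"]
second = ["RUN_2", "RUN_3"]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    combined = combine_sample_ids([first, second], on_duplicates="warn")
print(combined, len(caught))
```

With the default on_duplicates="error", the same call raises instead of returning a result.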

fondue scrape-collection¶

Scrape attachment files of a Zotero collection for run, study, BioProject, experiment and sample IDs, and associated DOI names.

Citations¶

Ziemski et al., 2022; Hügel et al., 2019

Parameters¶

collection_name: Str: Name of the collection to be scraped.[required]
on_no_dois: Str % Choices('ignore', 'error'): Behavior if no DOIs were found.[default: 'ignore']
log_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR'): Logging level.[default: 'INFO']

Outputs¶

run_ids: NCBIAccessionIDs: Artifact containing all run IDs scraped from a Zotero collection and associated DOI names.[required]
study_ids: NCBIAccessionIDs: Artifact containing all study IDs scraped from a Zotero collection and associated DOI names.[required]
bioproject_ids: NCBIAccessionIDs: Artifact containing all BioProject IDs scraped from a Zotero collection and associated DOI names.[required]
experiment_ids: NCBIAccessionIDs: Artifact containing all experiment IDs scraped from a Zotero collection and associated DOI names.[required]
sample_ids: NCBIAccessionIDs: Artifact containing all sample IDs scraped from a Zotero collection and associated DOI names.[required]
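
The five outputs correspond to five kinds of accession. One way to picture the sorting step is a set of regular expressions over text extracted from an attachment. The patterns below follow the common INSDC prefixes (SRR/ERR/DRR for runs, SRP/ERP/DRP for studies, SRX/ERX/DRX for experiments, SRS/ERS/DRS for samples, PRJ.. for BioProjects); they are an assumption for this sketch and may differ from the patterns the plugin uses. The accession numbers in the example text are made-up placeholders.

```python
import re

# Assumed patterns, one per output of the action.
PATTERNS = {
    "run": re.compile(r"\b[SED]RR\d+\b"),
    "study": re.compile(r"\b[SED]RP\d+\b"),
    "bioproject": re.compile(r"\bPRJ[A-Z]{2}\d+\b"),
    "experiment": re.compile(r"\b[SED]RX\d+\b"),
    "sample": re.compile(r"\b[SED]RS\d+\b"),
}


def find_accession_ids(text):
    """Return the unique accession IDs of each kind found in the text, sorted."""
    return {kind: sorted(set(pattern.findall(text)))
            for kind, pattern in PATTERNS.items()}


text = (
    "Reads are available under BioProject PRJNA000001 "
    "(runs SRR0000001 and SRR0000002, study ERP000001)."
)
found = find_accession_ids(text)
print(found)
```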

fondue get-ids-from-query¶

Find SRA run accession IDs in the BioSample database using a text search query.

Citations¶

Ziemski et al., 2022

Parameters¶

query: Str: Search query to retrieve SRA run IDs from the BioSample database.[required]
email: Str: Your e-mail address (required by NCBI).[required]
threads: Int % Range(1, None): Number of threads to be used for parallelization of the data download from NCBI. Not to be confused with the number of parsl workers which can be configured through the parsl configuration file.[default: 1]
log_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR'): Logging level.[default: 'INFO']

Outputs¶

ids: NCBIAccessionIDs: Artifact containing the SRA run accession IDs found by the search query.[required]

fondue get-sequences¶

Fetch sequence data of all run IDs.

Citations¶

Ziemski et al., 2022; Team, n.d.

Inputs¶

accession_ids: NCBIAccessionIDs | SRAMetadata | SRAFailedIDs: Artifact containing run, study, BioProject, experiment or sample IDs for which the metadata and/or sequences should be fetched. Associated DOI names can be provided in an optional column and are preserved in get-all and get-metadata actions.[required]

Parameters¶

email: Str: Your e-mail address (required by NCBI).[required]
retries: Int % Range(0, None): Number of retries to fetch sequences.[default: 2]
threads: Int % Range(1, None): Number of threads to be used for parallelization of the data download from NCBI. Not to be confused with the number of parsl workers which can be configured through the parsl configuration file.[default: 1]
log_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR'): Logging level.[default: 'INFO']
restricted_access: Bool: Whether fetching the sequences requires a dbGaP repository key.[default: False]

Outputs¶

single_reads: SampleData[SequencesWithQuality]: Artifact containing single-read fastq.gz files for all the requested IDs.[required]
paired_reads: SampleData[PairedEndSequencesWithQuality]: Artifact containing paired-end fastq.gz files for all the requested IDs.[required]
failed_runs: SRAFailedIDs: List of all run IDs for which fetching sequences failed, with their corresponding error messages.[required]
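
The interplay of retries and failed_runs can be sketched as a small retry loop. This is a simplified illustration, not the plugin's download code: it assumes that retries counts additional attempts after the first one, and the names fetch_with_retries and flaky_fetch, as well as the run IDs, are hypothetical.

```python
def fetch_with_retries(run_ids, fetch, retries=2):
    """Try each run ID up to 1 + retries times.

    Returns the fetched results and a mapping of failed run IDs to the
    error message of their last attempt.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    fetched, failed = {}, {}
    for run_id in run_ids:
        for _attempt in range(retries + 1):
            try:
                fetched[run_id] = fetch(run_id)
                failed.pop(run_id, None)  # an earlier failure was recovered
                break
            except Exception as error:
                failed[run_id] = str(error)
    return fetched, failed


attempts = {}


def flaky_fetch(run_id):
    """Stand-in for a download: RUN_B fails once, RUN_C always fails."""
    attempts[run_id] = attempts.get(run_id, 0) + 1
    if run_id == "RUN_B" and attempts[run_id] < 2:
        raise ConnectionError("temporary network error")
    if run_id == "RUN_C":
        raise ValueError("run not found")
    return f"{run_id}.fastq.gz"


fetched, failed = fetch_with_retries(["RUN_A", "RUN_B", "RUN_C"], flaky_fetch, retries=2)
print(fetched)
print(failed)
```

Only RUN_C ends up in the failure mapping, together with its error message, which mirrors the shape of the failed_runs output.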

fondue get-all¶

Pipeline fetching all sequence-related metadata and raw sequences of provided run, study, BioProject, experiment or sample IDs.

Citations¶

Ziemski et al., 2022; Buchmann & Holmes, 2019; Team, n.d.

Inputs¶

accession_ids: NCBIAccessionIDs | SRAMetadata | SRAFailedIDs: Artifact containing run, study, BioProject, experiment or sample IDs for which the metadata and/or sequences should be fetched. Associated DOI names can be provided in an optional column and are preserved in get-all and get-metadata actions.[required]
linked_doi: NCBIAccessionIDs: Optional table containing linked DOI names that is only used if accession_ids does not contain any DOI names.[optional]

Parameters¶

email: Str: Your e-mail address (required by NCBI).[required]
threads: Int % Range(1, None): Number of threads to be used for parallelization of the data download from NCBI. Not to be confused with the number of parsl workers which can be configured through the parsl configuration file.[default: 1]
retries: Int % Range(0, None): Number of retries to fetch sequences.[default: 2]
log_level: Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR'): Logging level.[default: 'INFO']

Outputs¶

metadata: SRAMetadata: Table containing metadata for all the requested IDs.[required]
single_reads: SampleData[SequencesWithQuality]: Artifact containing single-read fastq.gz files for all the requested IDs.[required]
paired_reads: SampleData[PairedEndSequencesWithQuality]: Artifact containing paired-end fastq.gz files for all the requested IDs.[required]
failed_runs: SRAFailedIDs: List of all run IDs for which fetching sequences and/or metadata failed, with their corresponding error messages.[required]
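
Because accession_ids also accepts SRAFailedIDs, the failed_runs output of get-all can be fed back into get-sequences, and the re-fetched reads can then be joined with combine-seqs. The sketch below only assembles the three command lines for such a workflow; the qiime helper, all file names and the e-mail address are placeholders, and repeating --i-seqs once per artifact is assumed to be how the list input is passed on the command line.

```python
def qiime(action, **options):
    """Build a `qiime fondue` argument list.

    Keys carry the QIIME 2 prefix, e.g. i_accession_ids, p_email, o_metadata.
    A list value repeats the option once per element.
    """
    argv = ["qiime", "fondue", action]
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            argv += [flag, str(item)]
    return argv


EMAIL = "user@example.com"  # placeholder

workflow = [
    # 1. Fetch metadata and sequences in one go.
    qiime("get-all", i_accession_ids="ids.qza", p_email=EMAIL,
          o_metadata="metadata.qza", o_single_reads="single.qza",
          o_paired_reads="paired.qza", o_failed_runs="failed.qza"),
    # 2. Retry only the runs that failed.
    qiime("get-sequences", i_accession_ids="failed.qza", p_email=EMAIL,
          o_single_reads="single_refetch.qza", o_paired_reads="paired_refetch.qza",
          o_failed_runs="failed_refetch.qza"),
    # 3. Join the original and re-fetched paired-end reads.
    qiime("combine-seqs", i_seqs=["paired.qza", "paired_refetch.qza"],
          o_combined_seqs="paired_all.qza"),
]
for argv in workflow:
    print(" ".join(argv))
```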

References¶

Ziemski, M., Adamov, A., Kim, L., Flörl, L., & Bokulich, N. A. (2022). Reproducible acquisition, management, and meta-analysis of nucleotide sequence (meta)data using q2-fondue. Bioinformatics. 10.1093/bioinformatics/btac639
Buchmann, J. P., & Holmes, E. C. (2019). Entrezpy: a Python library to dynamically interact with the NCBI Entrez databases. Bioinformatics, 35(21), 4511–4514. 10.1093/bioinformatics/btz385
Team, S. T. D. (n.d.). (2.9.6). https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
Hügel, S., Gerdes, P., Fournier, P., emuzie, Golden, P., jghauser, Frühwirth, S., Takats, S., Orduña, P., Merlin, Hetzner, E., Brodbeck, C., Lyon, A., & Lee, A. (2019). urschrei/pyzotero: Zenodo Release (v1.3.15). Zenodo. 10.5281/zenodo.2917290