annotate

Inputs¶

seqs: SampleData[SequencesWithQuality | PairedEndSequencesWithQuality | JoinedSequencesWithQuality] | SampleData[Contigs] | FeatureData[MAG] | SampleData[MAGs]: Sequences to be classified. Single-/paired-end reads, contigs, or assembled MAGs can be provided.[required]
db: Kraken2DB: Kraken 2 database.[required]

Parameters¶

threads: Int % Range(1, None): Number of threads.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True): Confidence score threshold.[default: 0.0]
minimum_base_quality: Int % Range(0, None): Minimum base quality used in classification. Only applies when reads are used as input.[default: 0]
memory_mapping: Bool: Avoids loading the database into RAM.[default: False]
minimum_hit_groups: Int % Range(1, None): Minimum number of hit groups (overlapping k-mers sharing the same minimizer).[default: 2]
quick: Bool: Quick operation (use first hit or hits).[default: False]
report_minimizer_data: Bool: Include number of read-minimizers per-taxon and unique read-minimizers per-taxon in the report. If this parameter is enabled then merging kraken2 reports with the same sample ID from two or more input artifacts will not be possible.[default: False]

Outputs¶

reports: SampleData[Kraken2Report % Properties('reads')] | SampleData[Kraken2Report % Properties('contigs')] | FeatureData[Kraken2Report % Properties('mags')] | SampleData[Kraken2Report % Properties('mags')]: Reports produced by Kraken2.[required]
outputs: SampleData[Kraken2Output % Properties('reads')] | SampleData[Kraken2Output % Properties('contigs')] | FeatureData[Kraken2Output % Properties('mags')] | SampleData[Kraken2Output % Properties('mags')]: Output files produced by Kraken2.[required]
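
The inputs, parameters, and outputs listed above map onto command-line options through the usual QIIME 2 prefixes (`--i-`, `--p-`, `--o-`), with underscores replaced by dashes. The following is a minimal sketch of assembling such a command in Python. The helper `build_qiime_command`, the action name `classify-kraken2`, and all file names are assumptions made for illustration; they are not taken from this page.

```python
def build_qiime_command(plugin, action, inputs=None, params=None, outputs=None):
    """Assemble an argument list in the `qiime <plugin> <action>` style.

    Hypothetical helper: it only formats option names and does not
    validate them against the plugin.
    """
    cmd = ["qiime", plugin, action]
    groups = (("--i-", inputs), ("--p-", params), ("--o-", outputs))
    for prefix, group in groups:
        for name, value in (group or {}).items():
            # CLI option names use dashes where the Python API uses underscores.
            cmd += [prefix + name.replace("_", "-"), str(value)]
    return cmd


cmd = build_qiime_command(
    "annotate",
    "classify-kraken2",  # assumed action name
    inputs={"seqs": "reads.qza", "db": "kraken2-db.qza"},  # placeholder files
    params={"threads": 4, "confidence": 0.5, "minimum_hit_groups": 2},
    outputs={"reports": "reports.qza", "outputs": "outputs.qza"},
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once the action name and file paths have been checked against an actual installation.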

annotate collate-kraken2-reports¶

Collates kraken2 reports.

Inputs¶

reports: List[SampleData[Kraken2Report % (Properties('reads', 'contigs', 'mags')¹ | Properties('reads', 'contigs')² | Properties('reads', 'mags')³ | Properties('contigs', 'mags')⁴ | Properties('reads')⁵ | Properties('contigs')⁶ | Properties('mags')⁷)]]: <no description>[required]

Outputs¶

collated_reports: SampleData[Kraken2Report % (Properties('reads', 'contigs', 'mags')¹ | Properties('reads', 'contigs')² | Properties('reads', 'mags')³ | Properties('contigs', 'mags')⁴ | Properties('reads')⁵ | Properties('contigs')⁶ | Properties('mags')⁷)]: <no description>[required]

annotate collate-kraken2-outputs¶

Collates kraken2 outputs.

Inputs¶

outputs: List[SampleData[Kraken2Output % (Properties('reads', 'contigs', 'mags')¹ | Properties('reads', 'contigs')² | Properties('reads', 'mags')³ | Properties('contigs', 'mags')⁴ | Properties('reads')⁵ | Properties('contigs')⁶ | Properties('mags')⁷)]]: <no description>[required]

Outputs¶

collated_outputs: SampleData[Kraken2Output % (Properties('reads', 'contigs', 'mags')¹ | Properties('reads', 'contigs')² | Properties('reads', 'mags')³ | Properties('contigs', 'mags')⁴ | Properties('reads')⁵ | Properties('contigs')⁶ | Properties('mags')⁷)]: <no description>[required]

annotate estimate-bracken¶

This method uses Bracken to re-estimate read abundances. Only available on Linux platforms.

Citations¶

Wood et al., 2019; Lu et al., 2017

Inputs¶

kraken2_reports: SampleData[Kraken2Report % Properties('reads')]: Reports produced by Kraken2.[required]
db: BrackenDB: Bracken database.[required]

Parameters¶

threshold: Int % Range(0, None): Bracken: number of reads required PRIOR to abundance estimation to perform re-estimation.[default: 0]
read_len: Int % Range(0, None): Bracken: the ideal length of reads in your sample. For paired-end data (e.g., 2x150) this should be set to the length of the single-end reads (e.g., 150).[default: 100]
level: Str % Choices('D', 'P', 'C', 'O', 'F', 'G', 'S'): Bracken: specifies the taxonomic rank to analyze. Each classification at this specified rank will receive an estimated number of reads belonging to that rank after abundance estimation.[default: 'S']
include_unclassified: Bool: Bracken does not include the unclassified read counts in the feature table. Set this to True to include those regardless.[default: True]

Outputs¶

reports: SampleData[Kraken2Report % Properties('bracken')]: Reports modified by Bracken.[required]
taxonomy: FeatureData[Taxonomy]: <no description>[required]
table: FeatureTable[Frequency]: <no description>[required]

annotate build-kraken-db¶

This method either (1) builds custom Kraken 2 and Bracken databases from the provided DNA sequences, or (2) fetches pre-built versions from an online resource.

Citations¶

Inputs¶

seqs: List[FeatureData[Sequence]]: Sequences to be added to the Kraken 2 database.[optional]

Parameters¶

collection: Str % Choices('viral', 'minusb', 'standard', 'standard8', 'standard16', 'pluspf', 'pluspf8', 'pluspf16', 'pluspfp', 'pluspfp8', 'pluspfp16', 'eupathdb', 'nt', 'corent', 'gtdb', 'greengenes', 'rdp', 'silva132', 'silva138'): Name of the database collection to be fetched. Please check https://benlangmead.github.io/aws-indexes/k2 for the description of the available options.[optional]
threads: Int % Range(1, None): Number of threads. Only applicable when building a custom database.[default: 1]
kmer_len: Int % Range(1, None): K-mer length in bp/aa.[default: 35]
minimizer_len: Int % Range(1, None): Minimizer length in bp/aa.[default: 31]
minimizer_spaces: Int % Range(1, None): Number of characters in minimizer that are ignored in comparisons.[default: 7]
no_masking: Bool: Avoid masking low-complexity sequences prior to building; masking requires dustmasker or segmasker to be installed in PATH.[default: False]
max_db_size: Int % Range(0, None): Maximum number of bytes for Kraken 2 hash table; if the estimator determines more would normally be needed, the reference library will be downsampled to fit.[default: 0]
use_ftp: Bool: Use FTP for downloading instead of RSYNC.[default: False]
load_factor: Float % Range(0, 1): Proportion of the hash table to be populated.[default: 0.7]
fast_build: Bool: Do not require database to be deterministically built when using multiple threads. This is faster, but does introduce variability in minimizer/LCA pairs.[default: False]
read_len: List[Int % Range(1, None)]: Ideal read lengths to be used while building the Bracken database.[optional]

Outputs¶

kraken2_db: Kraken2DB: Kraken2 database.[required]
bracken_db: BrackenDB: Bracken database.[required]

annotate inspect-kraken2-db¶

This method generates a report of identical format to those generated by classify_kraken2, with a slightly different interpretation. Instead of reporting the number of inputs classified to a taxon/clade, the report displays the number of minimizers mapped to each taxon/clade.

Citations¶

Inputs¶

db: Kraken2DB: The Kraken 2 database for which to generate the report.[required]

Parameters¶

threads: Int % Range(1, None): The number of threads to use.[default: 1]

Outputs¶

report: Kraken2DBReport: The report of the supplied database.[required]

annotate dereplicate-mags¶

This method dereplicates MAGs from multiple samples using the distances between them found in the provided distance matrix. For each cluster of similar MAGs, the longest one will be selected as the representative. If metadata is given as input, the MAG with the highest or lowest value in the specified metadata column is chosen instead, depending on the parameter "find_max". If there are MAGs with identical values, the longer one is chosen. For example, an artifact of type BUSCOResults can be passed as metadata, and the dereplication can be done by highest "completeness".

Inputs¶

mags: SampleData[MAGs]: MAGs to be dereplicated.[required]
distance_matrix: DistanceMatrix: Matrix of distances between MAGs.[required]

Parameters¶

threshold: Float % Range(0, 1, inclusive_end=True): Similarity threshold required to consider two bins identical.[default: 0.99]
metadata: Metadata: Metadata table.[optional]
metadata_column: Str: Name of the metadata column used to choose the most representative bins.[default: 'complete']
find_max: Bool: Set to True to choose the bin with the highest value in the metadata column. Set to False to choose the bin with the lowest value.[default: True]

Outputs¶

dereplicated_mags: FeatureData[MAG]: Dereplicated MAGs.[required]
table: FeatureTable[PresenceAbsence]: Mapping between MAGs and samples.[required]
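
A minimal sketch of the length-based selection described above, assuming that similarity is computed as 1 minus the distance and that MAGs are grouped by single linkage (any pair meeting the threshold ends up in the same cluster). The function name, the clustering strategy, and the toy data are assumptions; the plugin's own clustering may differ.

```python
def dereplicate(ids, distances, lengths, threshold=0.99):
    """Group similar MAGs and return the longest MAG of each group."""
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a_idx, a in enumerate(ids):
        for b in ids[a_idx + 1:]:
            # Assumption: similarity = 1 - distance.
            if 1.0 - distances[(a, b)] >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # The longest member represents each cluster.
    return sorted(max(members, key=lambda m: lengths[m])
                  for members in clusters.values())


# Toy data (hypothetical MAG IDs, distances, and lengths in bp).
ids = ["mag_a", "mag_b", "mag_c"]
distances = {
    ("mag_a", "mag_b"): 0.005,
    ("mag_a", "mag_c"): 0.40,
    ("mag_b", "mag_c"): 0.41,
}
lengths = {"mag_a": 2_100_000, "mag_b": 2_400_000, "mag_c": 1_800_000}

representatives = dereplicate(ids, distances, lengths)
print(representatives)
```

Here mag_a and mag_b fall into one cluster and the longer mag_b represents it, while mag_c stays on its own. Selecting by a metadata column instead would amount to swapping the `key` function passed to `max`.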

annotate kraken2-to-features¶

Convert a Kraken 2 report, which is an annotated NCBI taxonomy tree, into generic artifacts for downstream analyses.

Inputs¶

reports: SampleData[Kraken2Report]: Per-sample Kraken 2 reports.[required]

Parameters¶

coverage_threshold: Float % Range(0, 100, inclusive_end=True): The minimum percent coverage required to produce a feature.[default: 0.1]

Outputs¶

table: FeatureTable[PresenceAbsence]: A presence/absence table of selected features. The features are not of even ranks, but will be the most specific rank available.[required]
taxonomy: FeatureData[Taxonomy]: Output taxonomy. Infra-clade ranks are ignored unless they are strain-level. Missing internal ranks are annotated by their next most specific rank, with the exception of k__Bacteria and k__Archaea, which match their domain name.[required]
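
To make the coverage_threshold parameter more concrete, here is a small sketch that parses a Kraken 2 report in its standard six-column, tab-separated layout (percent, clade reads, taxon reads, rank code, taxid, indented name) and keeps the taxa whose percentage meets the threshold. The report content is made up, and applying the threshold to the first column is an assumption about how the action works; the selection of the most specific rank is not reproduced here.

```python
# Toy report (hypothetical counts; intermediate ranks omitted for brevity).
REPORT_LINES = [
    " 10.00\t1000\t1000\tU\t0\tunclassified",
    " 90.00\t9000\t0\tR\t1\troot",
    " 90.00\t9000\t0\tD\t2\t  Bacteria",
    " 89.95\t8995\t8995\tS\t562\t    Escherichia coli",
    "  0.05\t5\t5\tS\t1423\t    Bacillus subtilis",
]


def parse_report(lines):
    """Turn report lines into dictionaries, one per taxon."""
    rows = []
    for line in lines:
        percent, clade, direct, rank, taxid, name = line.split("\t")
        rows.append({
            "percent": float(percent),
            "clade_reads": int(clade),
            "taxon_reads": int(direct),
            "rank": rank,
            "taxid": int(taxid),
            "name": name.strip(),  # indentation encodes tree depth
        })
    return rows


def taxa_above_threshold(rows, coverage_threshold=0.1):
    """Names of classified taxa whose percentage meets the threshold."""
    return [r["name"] for r in rows
            if r["rank"] != "U" and r["percent"] >= coverage_threshold]


rows = parse_report(REPORT_LINES)
kept = taxa_above_threshold(rows)
print(kept)
```

With the default threshold of 0.1 percent, the taxon covering 0.05 percent of the reads is dropped.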

annotate kraken2-to-mag-features¶

Convert a Kraken 2 report, which is an annotated NCBI taxonomy tree, into generic artifacts for downstream analyses.

Inputs¶

reports: FeatureData[Kraken2Report % Properties('mags')]: Per-sample Kraken 2 reports.[required]
outputs: FeatureData[Kraken2Output % Properties('mags')]: Per-sample Kraken 2 output files.[required]

Parameters¶

coverage_threshold: Float % Range(0, 100, inclusive_end=True): The minimum percent coverage required to produce a feature.[default: 0.1]

Outputs¶

taxonomy: FeatureData[Taxonomy]: Output taxonomy. Infra-clade ranks are ignored unless they are strain-level. Missing internal ranks are annotated by their next most specific rank, with the exception of k__Bacteria and k__Archaea, which match their domain name.[required]

annotate build-custom-diamond-db¶

Creates an artifact containing a binary DIAMOND database file (ref_db.dmnd) from a protein reference database file in FASTA format.

Citations¶

Buchfink et al., 2021

Inputs¶

seqs: FeatureData[ProteinSequence]: Protein reference database.[required]
taxonomy: ReferenceDB[NCBITaxonomy]: Reference taxonomy, needed to provide taxonomy features.[optional]

Parameters¶

threads: Int % Range(1, None): Number of CPU threads.[default: 1]
file_buffer_size: Int % Range(1, None): File buffer size in bytes.[default: 67108864]
ignore_warnings: Bool: Ignore warnings.[default: False]
no_parse_seqids: Bool: Print raw seqids without parsing.[default: False]

Outputs¶

db: ReferenceDB[Diamond]: DIAMOND database.[required]

Examples¶

Minimum working example¶

Command Line:

wget -O 'sequences.qza' \
  'https://moshpit--20.org.readthedocs.build/en/20/data/examples/annotate/build-custom-diamond-db/1/sequences.qza'

qiime annotate build-custom-diamond-db \
  --i-seqs sequences.qza \
  --o-db diamond-db.qza

Python API:

from qiime2 import Artifact
from urllib import request
import qiime2.plugins.annotate.actions as annotate_actions

url = 'https://moshpit--20.org.readthedocs.build/en/20/data/examples/annotate/build-custom-diamond-db/1/sequences.qza'
fn = 'sequences.qza'
request.urlretrieve(url, fn)
sequences = Artifact.load(fn)

diamond_db, = annotate_actions.build_custom_diamond_db(
    seqs=sequences,
)

R API:

library(reticulate)

Artifact <- import("qiime2")$Artifact
annotate_actions <- import("qiime2.plugins.annotate.actions")
request <- import("urllib")$request

url <- 'https://moshpit--20.org.readthedocs.build/en/20/data/examples/annotate/build-custom-diamond-db/1/sequences.qza'
fn <- 'sequences.qza'
request$urlretrieve(url, fn)
sequences <- Artifact$load(fn)

action_results <- annotate_actions$build_custom_diamond_db(
    seqs=sequences,
)
diamond_db <- action_results$db

annotate fetch-eggnog-db¶

Downloads the eggNOG reference database using the download_eggnog_data.py script from eggNOG. The script downloads 3 files and stores them in the output artifact. At least 80 GB of storage space is required to run this action.

Citations¶

Buchfink et al., 2021; Huerta-Cepas et al., 2019

Outputs¶

db: ReferenceDB[Eggnog]: eggNOG annotation database.[required]

annotate fetch-diamond-db¶

Downloads Diamond reference database. This action downloads 1 file (ref_db.dmnd). At least 18 GB of storage space is required to run this action.

Citations¶

Outputs¶

db: ReferenceDB[Diamond]: Complete Diamond reference database.[required]

annotate fetch-eggnog-proteins¶

Downloads the eggNOG proteome database. This action downloads 2 files (e5.proteomes.faa and e5.taxid_info.tsv) and creates an artifact with them. At least 18 GB of storage space is required to run this action.

Citations¶

National Center for Biotechnology Information (NCBI), n.d.

Outputs¶

eggnog_proteins: ReferenceDB[EggnogProteinSequences]: eggNOG database of protein sequences and their corresponding taxonomy information.[required]

annotate fetch-ncbi-taxonomy¶

Downloads the NCBI reference taxonomy from the NCBI FTP server. The resulting artifact is required by the build-custom-diamond-db action if one wishes to create a Diamond database with taxonomy features. At least 30 GB of storage space is required to run this action.

Citations¶

Outputs¶

taxonomy: ReferenceDB[NCBITaxonomy]: NCBI reference taxonomy.[required]

annotate build-eggnog-diamond-db¶

Creates a DIAMOND database which contains the protein sequences that belong to the specified taxon.

Citations¶

Buchfink et al., 2021; Huerta-Cepas et al., 2019

Inputs¶

eggnog_proteins: ReferenceDB[EggnogProteinSequences]: eggNOG database of protein sequences and their corresponding taxonomy information (generated through the fetch-eggnog-proteins action).[required]

Parameters¶

taxon: Int % Range(2, 1579337): NCBI taxon ID number.[required]

Outputs¶

db: ReferenceDB[Diamond]: Complete Diamond reference database for the specified taxon.[required]

annotate -eggnog-diamond-search¶

This method performs the steps by which we find our possible target sequences to annotate using the Diamond search functionality from the eggnog emapper.py script.

Citations¶

Buchfink et al., 2021; Huerta-Cepas et al., 2019

Inputs¶

seqs: SampleData[Contigs] | SampleData[MAGs] | FeatureData[MAG]: Sequences to be searched for ortholog hits.[required]
db: ReferenceDB[Diamond]: Diamond database.[required]

Parameters¶

num_cpus: Int: Number of CPUs to utilize. '0' will use all available.[default: 1]
db_in_memory: Bool: Read database into memory. The database can be very large, so this option should only be used on clusters or other machines with enough memory.[default: False]

Outputs¶

eggnog_hits: SampleData[Orthologs]: BLAST6-like table(s) describing the identified orthologs. One table per sample or MAG in the input.[required]
table: FeatureTable[Frequency]: Feature table with counts of orthologs per sample/MAG.[required]
loci: GenomeData[Loci]: Loci of the identified orthologs.[required]

annotate -eggnog-hmmer-search¶

This method performs the steps by which we find our possible target sequences to annotate using the HMMER search functionality from the eggnog emapper.py script.

Citations¶

Buchfink et al., 2021; Huerta-Cepas et al., 2019

Inputs¶

seqs: SampleData[Contigs] | SampleData[MAGs] | FeatureData[MAG]: Sequences to be searched for hits.[required]
idmap: EggnogHmmerIdmap: List of protein families in pressed_hmm_db.[required]
pressed_hmm_db: ProfileHMM[PressedProtein]: Collection of Profile HMMs in binary format and indexed.[required]
seed_alignments: GenomeData[Proteins]: Seed alignments for the protein families in pressed_hmm_db.[required]

Parameters¶

num_cpus: Int: Number of CPUs to utilize per partition. '0' will use all available.[default: 1]
db_in_memory: Bool: Read database into memory. The database can be very large, so this option should only be used on clusters or other machines with enough memory.[default: False]

Outputs¶

eggnog_hits: SampleData[Orthologs]: BLAST6-like table(s) describing the identified orthologs. One table per sample or MAG in the input.[required]
table: FeatureTable[Frequency]: Feature table with counts of orthologs per sample/MAG.[required]
loci: GenomeData[Loci]: Loci of the identified orthologs.[required]

annotate -eggnog-feature-table¶

Create an eggnog table.

Inputs¶

seed_orthologs: SampleData[Orthologs]: Sequence data to be turned into an eggnog feature table.[required]

Outputs¶

table: FeatureTable[Frequency]: <no description>[required]

annotate -eggnog-annotate¶

Apply eggnog mapper to annotate seed orthologs.

Citations¶

Inputs¶

eggnog_hits: SampleData[Orthologs]: <no description>[required]
db: ReferenceDB[Eggnog]: <no description>[required]

Parameters¶

db_in_memory: Bool: Read eggnog database into memory. The eggNOG database is very large (>44GB), so this option should only be used on clusters or other machines with enough memory.[default: False]
num_cpus: Int % Range(0, None): Number of CPUs to utilize. '0' will use all available.[default: 1]

Outputs¶

ortholog_annotations: GenomeData[NOG]: <no description>[required]

annotate collate-busco-results¶

Collates BUSCO results.

Inputs¶

results: List[BUSCOResults]: <no description>[required]

Outputs¶

collated_results: BUSCOResults: <no description>[required]

annotate -evaluate-busco¶

This method uses BUSCO to assess the quality of assembled MAGs and generates a table summarizing the results.

Citations¶

Inputs¶

mags: SampleData[MAGs] | FeatureData[MAG]: MAGs to be analyzed.[required]
db: ReferenceDB[BUSCO]: BUSCO database.[required]

Parameters¶

mode: Str % Choices('genome'): Specify which BUSCO analysis mode to run. Currently only the 'genome' option is supported, for genome assemblies. In the future, modes for transcriptome assemblies and for annotated gene sets (proteins) will be made available.[default: 'genome']
lineage_dataset: Str: Specify the name of the BUSCO lineage to be used. To see all possible options run busco --list-datasets.[optional]
augustus: Bool: Use augustus gene predictor for eukaryote runs.[default: False]
augustus_parameters: Str: Pass additional arguments to Augustus. All arguments should be contained within a single string with no white space, with each argument separated by a comma. Example: '--PARAM1=VALUE1,--PARAM2=VALUE2'.[optional]
augustus_species: Str: Specify a species for Augustus training.[optional]
auto_lineage: Bool: Run auto-lineage to find optimum lineage path.[default: False]
auto_lineage_euk: Bool: Run auto-placement just on eukaryote tree to find optimum lineage path.[default: False]
auto_lineage_prok: Bool: Run auto-lineage just on non-eukaryote trees to find optimum lineage path.[default: False]
cpu: Int % Range(1, None): Specify the number (N=integer) of threads/cores to use.[default: 1]
contig_break: Int % Range(0, None): Number of contiguous Ns to signify a break between contigs. See https://gitlab.com/ezlab/busco/-/issues/691 for a more detailed explanation.[default: 10]
evalue: Float % Range(0, None, inclusive_start=False): E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03.[default: 0.001]
limit: Int % Range(1, 20): How many candidate regions (contig or transcript) to consider per BUSCO.[default: 3]
long: Bool: Optimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms.[default: False]
metaeuk_parameters: Str: Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single string with no white space, with each argument separated by a comma. Example: --PARAM1=VALUE1,--PARAM2=VALUE2.[optional]
metaeuk_rerun_parameters: Str: Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single string with no white space, with each argument separated by a comma. Example: --PARAM1=VALUE1,--PARAM2=VALUE2.[optional]
miniprot: Bool: Use miniprot gene predictor for eukaryote runs.[default: False]
additional_metrics: Bool: Adds completeness and contamination values to the BUSCO report. Check here for documentation: https://github.com/metashot/busco?tab=readme-ov-file#documetation[default: False]

Outputs¶

results: BUSCOResults: BUSCO result table.[required]

annotate predict-genes-prodigal¶

Prodigal (PROkaryotic DYnamic programming Gene-finding ALgorithm) is a gene prediction algorithm designed for improved gene structure prediction, translation initiation site recognition, and reduced false positives in bacterial and archaeal genomes.

Citations¶

Hyatt et al., 2010

Inputs¶

seqs: FeatureData[MAG] | SampleData[MAGs] | SampleData[Contigs]: MAGs or contigs for which one wishes to predict genes.[required]

Parameters¶

translation_table_number: Str % Choices('1', '2', '3', '4', '5', '6', '9', '10', '11', '12', '13', '14', '15', '16', '21', '22', '23', '24', '25'): Translation table to be used to translate genes into sequences of amino acids. See https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for reference.[default: '11']

Outputs¶

loci: GenomeData[Loci]: Gene coordinates files (one per MAG or sample) listing the location of each predicted gene as well as some additional scoring information.[required]
genes: GenomeData[Genes]: Fasta files (one per MAG or sample) with the nucleotide sequences of the predicted genes.[required]
proteins: GenomeData[Proteins]: Fasta files (one per MAG or sample) with the protein translation of the predicted genes.[required]

annotate fetch-kaiju-db¶

This method fetches the latest Kaiju database from Kaiju's web server.

Citations¶

Menzel et al., 2016

Parameters¶

database_type: Str % Choices('nr', 'nr_euk', 'refseq', 'refseq_ref', 'refseq_nr', 'fungi', 'viruses', 'plasmids', 'progenomes', 'rvdb'): Type of database to be downloaded. For more information on available types please see the list on Kaiju's web server: https://bioinformatics-centre.github.io/kaiju/downloads.html[required]

Outputs¶

db: KaijuDB: Kaiju database.[required]

annotate -classify-kaiju¶

This method uses Kaiju to perform taxonomic classification of DNA sequence reads or contigs.

Citations¶

Menzel et al., 2016

Inputs¶

seqs: SampleData[SequencesWithQuality | PairedEndSequencesWithQuality | JoinedSequencesWithQuality] | SampleData[Contigs]: Sequences to be classified.[required]
db: KaijuDB: Kaiju database.[required]

Parameters¶

z: Int % Range(1, None): Number of threads.[default: 1]
a: Str % Choices('greedy', 'mem'): Run mode.[default: 'greedy']
e: Int % Range(1, None): Number of mismatches allowed in Greedy mode.[default: 3]
m: Int % Range(1, None): Minimum match length.[default: 11]
s: Int % Range(1, None): Minimum match score in Greedy mode.[default: 65]
evalue: Float % Range(0, 1): Minimum E-value in Greedy mode.[default: 0.01]
x: Bool: Enable SEG low complexity filter.[default: True]
r: Str % Choices('phylum', 'class', 'order', 'family', 'genus', 'species'): Taxonomic rank.[default: 'species']
c: Float % Range(0, 100): Minimum required number or fraction of reads for the taxon (except viruses) to be reported.[default: 0.0]
exp: Bool: Expand viruses, which are always shown as full taxon path and read counts are not summarized in higher taxonomic levels.[default: False]
u: Bool: Do not count unclassified reads for the total reads when calculating percentages for classified reads.[default: False]

Outputs¶

abundances: FeatureTable[Frequency]: Sequence abundances.[required]
taxonomy: FeatureData[Taxonomy]: Linked taxonomy.[required]

annotate fetch-busco-db¶

Downloads BUSCO database for the specified lineage. Output can be used to run BUSCO with the 'evaluate-busco' action.

Citations¶

Huerta-Cepas et al., 2019; HMMER, 2024

Parameters¶

lineages: List[Str]: Lineages to be included in the database. Can be any valid BUSCO lineage or any of the following: 'all' (for all lineages), 'prokaryota', 'eukaryota', 'virus'.[optional]

Outputs¶

db: ReferenceDB[BUSCO]: BUSCO database for the specified lineages.[required]

annotate get-feature-lengths¶

This method extracts lengths for the provided feature set.

Inputs¶

features: FeatureData[MAG | Sequence] | SampleData[MAGs | Contigs]: Features to get lengths for.[required]

Outputs¶

lengths: FeatureData[SequenceCharacteristics % Properties('length')]: Feature lengths.[required]

annotate filter-derep-mags¶

Filter dereplicated MAGs based on metadata.

Inputs¶

mags: FeatureData[MAG]: Dereplicated MAGs to filter.[required]

Parameters¶

metadata: Metadata: Sample metadata indicating which MAG ids to filter. The optional where parameter may be used to filter ids based on specified conditions in the metadata. The optional exclude_ids parameter may be used to exclude the ids specified in the metadata from the filter.[required]
where: Str: Optional SQLite WHERE clause specifying MAG metadata criteria that must be met to be included in the filtered data. If not provided, all MAGs in metadata that are also in the MAG data will be retained.[optional]
exclude_ids: Bool: Defaults to False. If True, the MAGs selected by the metadata and optional where parameter will be excluded from the filtered data.[default: False]

Outputs¶

filtered_mags: FeatureData[MAG]: <no description>[required]
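
The interplay of metadata, where, and exclude_ids can be sketched with Python's built-in sqlite3 module, since the where parameter is described as an SQLite WHERE clause. The table layout, the column names (completeness, environment), and the helper functions are hypothetical; this is an illustration of the selection logic under those assumptions, not the plugin's code.

```python
import sqlite3


def select_ids(rows, where=None):
    """Return the metadata IDs matching an optional SQLite WHERE clause."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE metadata "
        "(id TEXT PRIMARY KEY, completeness REAL, environment TEXT)"
    )
    con.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)
    query = "SELECT id FROM metadata"
    if where:
        # The clause is appended verbatim, as a user would supply it.
        query += " WHERE " + where
    ids = {row[0] for row in con.execute(query)}
    con.close()
    return ids


def filter_ids(all_ids, selected, exclude_ids=False):
    """Keep (or, with exclude_ids, drop) the selected IDs present in the data."""
    selected = set(selected) & set(all_ids)
    return set(all_ids) - selected if exclude_ids else selected


# Hypothetical MAG metadata: (id, completeness, environment).
metadata_rows = [
    ("mag1", 95.2, "gut"),
    ("mag2", 61.0, "soil"),
    ("mag3", 88.7, "gut"),
]
mag_ids = {"mag1", "mag2", "mag3", "mag4"}  # mag4 has no metadata entry

where = "[environment]='gut' AND [completeness] > 80"
kept = filter_ids(mag_ids, select_ids(metadata_rows, where))
dropped = filter_ids(mag_ids, select_ids(metadata_rows, where), exclude_ids=True)
print(sorted(kept), sorted(dropped))
```

Without a where clause, every ID found in both the metadata and the data is selected, so mag4 would not be retained in this toy setup.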

annotate filter-mags¶

Filter MAGs based on metadata.

Inputs¶

mags: SampleData[MAGs]: MAGs to filter.[required]

Parameters¶

metadata: Metadata: Sample metadata indicating which MAG ids to filter. The optional where parameter may be used to filter ids based on specified conditions in the metadata. The optional exclude_ids parameter may be used to exclude the ids specified in the metadata from the filter.[required]
where: Str: Optional SQLite WHERE clause specifying MAG metadata criteria that must be met to be included in the filtered data. If not provided, all MAGs in metadata that are also in the MAG data will be retained.[optional]
exclude_ids: Bool: Defaults to False. If True, the MAGs selected by the metadata and optional where parameter will be excluded from the filtered data.[default: False]
on: Str % Choices('sample', 'mag'): Whether to filter based on sample or MAG metadata.[default: 'mag']

Outputs¶

filtered_mags: SampleData[MAGs]: <no description>[required]

annotate fetch-eggnog-hmmer-db¶

Downloads Profile HMM database for the specified taxon.

Citations¶

Parameters¶

taxon_id: Int % Range(2, None): Taxon ID number.[required]

Outputs¶

idmap: EggnogHmmerIdmap % Properties('eggnog'): List of protein families in hmm_db.[required]
hmm_db: ProfileHMM[MultipleProtein] % Properties('eggnog'): Collection of Profile HMMs.[required]
pressed_hmm_db: ProfileHMM[PressedProtein] % Properties('eggnog'): Collection of Profile HMMs in binary format and indexed.[required]
seed_alignments: GenomeData[Proteins] % Properties('eggnog'): Seed alignments for the protein families in hmm_db.[required]

annotate estimate-abundance¶

This method estimates MAG/contig abundances by mapping the reads to them and calculating the respective metric values, which are then used as a proxy for the frequency.

Inputs¶

alignment_maps: FeatureData[AlignmentMap] | SampleData[AlignmentMap]: Bowtie2 alignment maps between reads and features for which the abundance should be estimated.[required]
feature_lengths: FeatureData[SequenceCharacteristics % Properties('length')]: Table containing length of every feature (MAG/contig).[required]

Parameters¶

metric: Str % Choices('rpkm') | Str % Choices('tpm'): Metric to be used as a proxy of feature abundance.[default: 'rpkm']
min_mapq: Int % Range(0, 255): Minimum mapping quality.[default: 0]
min_query_len: Int % Range(0, None): Minimum query length.[default: 0]
min_base_quality: Int % Range(0, None): Minimum base quality.[default: 0]
min_read_len: Int % Range(0, None): Minimum read length.[default: 0]
threads: Int % Range(1, None): Number of threads to pass to samtools.[default: 1]

Outputs¶

abundances: FeatureTable[Frequency % (Properties('rpkm')¹ | Properties('tpm')²)]: Feature abundances.[required]
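
For reference, the two metrics can be sketched with their commonly used definitions: RPKM divides the reads mapped to a feature by the feature length in kilobases and by the total mapped reads in millions, while TPM normalizes the per-base read rates so that they sum to one million. The function names and counts below are illustrative, and taking the total as the sum of mapped reads over all features is an assumption.

```python
def rpkm(read_counts, lengths):
    """Reads per kilobase of feature per million mapped reads."""
    total_reads = sum(read_counts.values())
    return {
        feature: count * 1e9 / (lengths[feature] * total_reads)
        for feature, count in read_counts.items()
    }


def tpm(read_counts, lengths):
    """Transcripts per million, applied here to MAGs/contigs."""
    # Reads per base for every feature.
    rates = {f: count / lengths[f] for f, count in read_counts.items()}
    total_rate = sum(rates.values())
    return {f: rate * 1e6 / total_rate for f, rate in rates.items()}


# Hypothetical mapped-read counts and feature lengths (bp).
read_counts = {"mag1": 600, "mag2": 400}
lengths = {"mag1": 3_000_000, "mag2": 1_000_000}

rpkm_values = rpkm(read_counts, lengths)
tpm_values = tpm(read_counts, lengths)
print(rpkm_values)
print(tpm_values)
```

In this example mag1 receives more reads but is three times longer, so mag2 ends up with the higher value under both metrics.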

annotate extract-annotations¶

This method extracts a specific annotation from the table generated by EggNOG and calculates its frequencies across all MAGs.

Inputs¶

ortholog_annotations: GenomeData[NOG]: Ortholog annotations.[required]

Parameters¶

annotation: Str % Choices('cog', 'caz', 'kegg_ko', 'kegg_pathway', 'kegg_reaction', 'kegg_module', 'brite', 'ec'): Annotation to extract.[required]
max_evalue: Float % Range(0, None): <no description>[default: 1.0]
min_score: Float % Range(0, None): <no description>[default: 0.0]

Outputs¶

annotation_frequency: FeatureTable[Frequency]: Feature table with frequency of each annotation.[required]
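
A rough sketch of what counting annotation frequencies per MAG could look like. The record layout (dictionaries with evalue, score, and one annotation column), the comma-separated values, the "-" placeholder for missing annotations, and the inclusive comparisons for max_evalue and min_score are all assumptions made for illustration.

```python
from collections import Counter


def annotation_frequency(annotations, column, max_evalue=1.0, min_score=0.0):
    """Count how often each annotation value occurs in every MAG."""
    table = {}
    for mag_id, hits in annotations.items():
        counts = Counter()
        for hit in hits:
            # Skip hits that fail the e-value or score filter.
            if hit["evalue"] > max_evalue or hit["score"] < min_score:
                continue
            for value in hit[column].split(","):
                if value and value != "-":  # "-" marks a missing annotation
                    counts[value] += 1
        table[mag_id] = dict(counts)
    return table


# Hypothetical annotation records for two MAGs.
annotations = {
    "mag1": [
        {"evalue": 1e-50, "score": 200.0, "kegg_ko": "ko:K00001,ko:K00002"},
        {"evalue": 1e-3, "score": 40.0, "kegg_ko": "ko:K00001"},
        {"evalue": 1e-20, "score": 90.0, "kegg_ko": "-"},
    ],
    "mag2": [
        {"evalue": 1e-30, "score": 150.0, "kegg_ko": "ko:K00002"},
    ],
}

all_hits = annotation_frequency(annotations, "kegg_ko")
strong_hits = annotation_frequency(annotations, "kegg_ko", min_score=50.0)
print(all_hits)
print(strong_hits)
```

Raising min_score to 50 removes the weak second hit of mag1, so ko:K00001 is counted once instead of twice.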

annotate -multiply-tables¶

Calculates the dot product of two feature tables with matching dimensions. If table 1 has shape (M x N) and table 2 has shape (N x P), the resulting table will have shape (M x P). Note that the tables must be identical in the N dimension.

Inputs¶

table1: FeatureTable[Frequency]: First feature table.[required]
table2: FeatureTable[Frequency]: Second feature table with matching dimension.[required]

Outputs¶

result_table: FeatureTable[Frequency]: Feature table with the dot product of the two original tables. The table will have a shape of (M x P) where M is the number of rows from table1 and P is the number of columns from table2.[required]
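
The dot product described for this action can be illustrated in plain Python with nested dictionaries standing in for feature tables. The function below and the sample, MAG, and ortholog names are hypothetical; the sketch only demonstrates the shape rule, with the columns of the first table matching the rows of the second.

```python
def multiply_tables(table1, table2):
    """Dot product of two tables stored as {row: {column: value}}."""
    inner = {key for row in table1.values() for key in row}
    if inner != set(table2):
        raise ValueError("Columns of table1 must match rows of table2.")
    columns = sorted({c for row in table2.values() for c in row})
    return {
        r: {
            c: sum(value * table2[k].get(c, 0) for k, value in row.items())
            for c in columns
        }
        for r, row in table1.items()
    }


# table1: samples x MAGs, table2: MAGs x orthologs (toy values).
table1 = {
    "sample1": {"mag1": 2, "mag2": 1},
    "sample2": {"mag1": 0, "mag2": 3},
}
table2 = {
    "mag1": {"K1": 1, "K2": 0},
    "mag2": {"K1": 2, "K2": 4},
}

result = multiply_tables(table1, table2)
print(result)
```

The result has one row per sample and one column per ortholog, for example 2*1 + 1*2 = 4 for sample1 and K1.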

annotate -multiply-tables-pa¶

Inputs¶

table1: FeatureTable[PresenceAbsence] | FeatureTable[PresenceAbsence] | FeatureTable[PresenceAbsence] | FeatureTable[Frequency] | FeatureTable[RelativeFrequency]: First feature table.[required]
table2: FeatureTable[Frequency] | FeatureTable[RelativeFrequency] | FeatureTable[PresenceAbsence] | FeatureTable[PresenceAbsence] | FeatureTable[PresenceAbsence]: Second feature table with matching dimension.[required]

Outputs¶

result_table: FeatureTable[PresenceAbsence] | FeatureTable[PresenceAbsence] | FeatureTable[PresenceAbsence] | FeatureTable[PresenceAbsence] | FeatureTable[PresenceAbsence]: Feature table with the dot product of the two original tables. The table will have a shape of (M x N) where M is the number of rows from table1 and N is the number of columns from table2.[required]

annotate -multiply-tables-relative¶

Inputs¶

table1: FeatureTable[RelativeFrequency] | FeatureTable[Frequency] | FeatureTable[RelativeFrequency]: First feature table.[required]
table2: FeatureTable[Frequency] | FeatureTable[RelativeFrequency] | FeatureTable[RelativeFrequency]: Second feature table with matching dimension.[required]

Outputs¶

result_table: FeatureTable[PresenceAbsence] | FeatureTable[RelativeFrequency] | FeatureTable[RelativeFrequency]: Feature table with the dot product of the two original tables. The table will have a shape of (M x N) where M is the number of rows from table1 and N is the number of columns from table2.[required]

annotate -filter-kraken2-reports-by-abundance¶

Filters kraken2 reports on a per-taxon basis by relative abundance (relative frequency). Useful for removing suspected spurious classifications.

Inputs¶

reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Report % Properties('contigs', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'contigs')] | SampleData[Kraken2Report % Properties('reads')] | SampleData[Kraken2Report % Properties('contigs')] | SampleData[Kraken2Report % Properties('mags')] | FeatureData[Kraken2Report % Properties('mags')]: The kraken2 reports to filter by relative abundance.[required]

Parameters¶

abundance_threshold: Float % Range(0, 1, inclusive_end=True): A proportion between 0 and 1 representing the minimum relative abundance (by classified read count) that a taxon must have to be retained in the filtered report.[required]
remove_empty: Bool: If True, reports with only unclassified reads remaining will be removed from the filtered data.[default: False]

Outputs¶

filtered_reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Report % Properties('contigs', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'contigs')] | SampleData[Kraken2Report % Properties('reads')] | SampleData[Kraken2Report % Properties('contigs')] | SampleData[Kraken2Report % Properties('mags')] | FeatureData[Kraken2Report % Properties('mags')]: The relative abundance-filtered kraken2 reports.[required]
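
As a rough illustration of what abundance_threshold means, the sketch below filters a flat mapping of taxa to classified read counts. This is a simplification under stated assumptions: real Kraken 2 reports are hierarchical, the `filter_by_abundance` helper is hypothetical, and treating the threshold as inclusive is an assumption rather than documented behavior.

```python
def filter_by_abundance(taxon_reads, abundance_threshold):
    """Keep taxa whose share of classified reads meets the threshold.

    taxon_reads maps a taxon name to its classified read count
    (hypothetical flat structure; real reports are hierarchical).
    """
    if not 0 <= abundance_threshold <= 1:
        raise ValueError("abundance_threshold must be in [0, 1]")
    total = sum(taxon_reads.values())
    if total == 0:
        return {}
    return {
        taxon: reads
        for taxon, reads in taxon_reads.items()
        if reads / total >= abundance_threshold  # inclusive bound is assumed
    }


# Hypothetical counts: 1000 classified reads in total.
report = {"Escherichia coli": 900, "Bacillus subtilis": 95, "Homo sapiens": 5}
filtered = filter_by_abundance(report, abundance_threshold=0.01)
```

Here the taxon holding 0.5% of classified reads falls below the 1% threshold and is dropped, which is the "suspected spurious classification" case the action targets.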

annotate -filter-kraken2-results-by-metadata¶

Filter Kraken2 reports and outputs based on metadata or remove reports with 100% unclassified reads.

Inputs¶

reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Report % Properties('contigs', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'contigs')] | SampleData[Kraken2Report % Properties('reads')] | SampleData[Kraken2Report % Properties('contigs')] | SampleData[Kraken2Report % Properties('mags')] | FeatureData[Kraken2Report % Properties('mags')]: The Kraken reports to filter.[required]
outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Output % Properties('contigs', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'contigs')] | SampleData[Kraken2Output % Properties('reads')] | SampleData[Kraken2Output % Properties('contigs')] | SampleData[Kraken2Output % Properties('mags')] | FeatureData[Kraken2Output % Properties('mags')]: The Kraken outputs to filter.[required]

Parameters¶

metadata: Metadata: Metadata indicating which IDs to filter. The optional where parameter may be used to filter IDs based on specified conditions in the metadata. The optional exclude_ids parameter may be used to exclude the IDs specified in the metadata from the filter.[optional]
where: Str: Optional SQLite WHERE clause specifying metadata criteria that must be met to be included in the filtered data. If not provided, all IDs in metadata that are also in the data will be retained.[optional]
exclude_ids: Bool: If True, the samples selected by the metadata and optional where parameter will be excluded from the filtered data.[default: False]
remove_empty: Bool: If True, reports with only unclassified reads will be removed from the filtered data. Reports containing sequences classified only as root will also be removed.[default: False]

Outputs¶

filtered_reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Report % Properties('contigs', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'contigs')] | SampleData[Kraken2Report % Properties('reads')] | SampleData[Kraken2Report % Properties('contigs')] | SampleData[Kraken2Report % Properties('mags')] | FeatureData[Kraken2Report % Properties('mags')]: <no description>[required]
filtered_outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Output % Properties('contigs', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'contigs')] | SampleData[Kraken2Output % Properties('reads')] | SampleData[Kraken2Output % Properties('contigs')] | SampleData[Kraken2Output % Properties('mags')] | FeatureData[Kraken2Output % Properties('mags')]: <no description>[required]
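
The interaction of metadata, where, and exclude_ids can be sketched with Python's built-in sqlite3 module. The table layout, column names (`body_site`, `depth`), and the `filter_ids` helper are all hypothetical; the sketch only mirrors the semantics described in the parameter list and is not the plugin's code.

```python
import sqlite3


def filter_ids(data_ids, metadata_rows, where=None, exclude_ids=False):
    """Return the sample IDs retained after metadata-based filtering."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE metadata (id TEXT PRIMARY KEY, body_site TEXT, depth INTEGER)"
    )
    con.executemany("INSERT INTO metadata VALUES (?, ?, ?)", metadata_rows)
    query = "SELECT id FROM metadata"
    if where:
        # The clause is interpolated as-is, so it must come from a trusted source.
        query += f" WHERE {where}"
    selected = {row[0] for row in con.execute(query)}
    con.close()
    if exclude_ids:
        return sorted(i for i in data_ids if i not in selected)
    return sorted(i for i in data_ids if i in selected)


# Hypothetical metadata and data IDs; "s4" has no metadata entry.
metadata_rows = [("s1", "gut", 1200), ("s2", "skin", 300), ("s3", "gut", 50)]
data_ids = ["s1", "s2", "s3", "s4"]

kept = filter_ids(data_ids, metadata_rows, where="body_site = 'gut' AND depth > 100")
inverted = filter_ids(
    data_ids, metadata_rows, where="body_site = 'gut' AND depth > 100", exclude_ids=True
)
no_where = filter_ids(data_ids, metadata_rows)
```

Without a where clause, every ID present in both the metadata and the data is retained. With exclude_ids set, the selection is inverted; this sketch assumes that IDs absent from the metadata are then kept.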

annotate -merge-kraken2-results¶

Merge multiple kraken2 reports and outputs such that the results contain a union of the samples represented in the inputs. If sample IDs overlap across the inputs, these reports and outputs will be processed into a single report or output per sample ID.

Inputs¶

reports: List[SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Report % Properties('contigs', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'contigs')] | SampleData[Kraken2Report % Properties('reads')] | SampleData[Kraken2Report % Properties('contigs')] | SampleData[Kraken2Report % Properties('mags')] | FeatureData[Kraken2Report % Properties('mags')]]: The kraken2 reports to merge. Only reports with the same sample ID are merged into one report.[required]
outputs: List[SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Output % Properties('contigs', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'contigs')] | SampleData[Kraken2Output % Properties('reads')] | SampleData[Kraken2Output % Properties('contigs')] | SampleData[Kraken2Output % Properties('mags')] | FeatureData[Kraken2Output % Properties('mags')]]: The kraken2 outputs to merge. Only outputs with the same sample ID are merged into one output.[required]

Outputs¶

merged_reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Report % Properties('contigs', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'contigs')] | SampleData[Kraken2Report % Properties('reads')] | SampleData[Kraken2Report % Properties('contigs')] | SampleData[Kraken2Report % Properties('mags')] | FeatureData[Kraken2Report % Properties('mags')]: The merged kraken2 reports.[required]
merged_outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Output % Properties('contigs', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'contigs')] | SampleData[Kraken2Output % Properties('reads')] | SampleData[Kraken2Output % Properties('contigs')] | SampleData[Kraken2Output % Properties('mags')] | FeatureData[Kraken2Output % Properties('mags')]: The merged kraken2 outputs.[required]
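
The union-by-sample-ID behavior can be sketched as follows for Kraken 2 outputs. The dictionaries standing in for artifacts and the `merge_results` helper are hypothetical, and concatenating the records of overlapping sample IDs is an assumption about how a single output per sample might be produced; merging hierarchical reports would require more than concatenation.

```python
from collections import defaultdict


def merge_results(*artifacts):
    """Union per-sample records; overlapping sample IDs are concatenated."""
    merged = defaultdict(list)
    for artifact in artifacts:
        for sample_id, records in artifact.items():
            merged[sample_id].extend(records)
    return dict(merged)


# Hypothetical inputs: sample "s2" appears in both artifacts.
outputs_a = {"s1": ["C\tread1\t562"], "s2": ["U\tread2\t0"]}
outputs_b = {"s2": ["C\tread3\t1423"], "s3": ["C\tread4\t562"]}

merged = merge_results(outputs_a, outputs_b)
```

The result holds one entry per sample ID across all inputs, and the shared sample carries the records from both sources.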

annotate -align-outputs-with-reports¶

Inputs¶

outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Output % Properties('contigs', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'contigs')] | SampleData[Kraken2Output % Properties('reads')] | SampleData[Kraken2Output % Properties('contigs')] | SampleData[Kraken2Output % Properties('mags')] | FeatureData[Kraken2Output % Properties('mags')]: The kraken2 outputs to align with the filtered reports.[required]
reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Report % Properties('contigs', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'mags')] | SampleData[Kraken2Report % Properties('reads', 'contigs')] | SampleData[Kraken2Report % Properties('reads')] | SampleData[Kraken2Report % Properties('contigs')] | SampleData[Kraken2Report % Properties('mags')] | FeatureData[Kraken2Report % Properties('mags')]: The filtered kraken2 reports.[required]

Outputs¶

aligned_outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')] | SampleData[Kraken2Output % Properties('contigs', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'mags')] | SampleData[Kraken2Output % Properties('reads', 'contigs')] | SampleData[Kraken2Output % Properties('reads')] | SampleData[Kraken2Output % Properties('contigs')] | SampleData[Kraken2Output % Properties('mags')] | FeatureData[Kraken2Output % Properties('mags')]: The report-aligned filtered kraken2 outputs.[required]
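
One way to picture this alignment is to drop output records whose taxon no longer appears in the filtered report. The sketch below assumes the standard tab-separated Kraken 2 layouts (taxon ID in the fifth column of a report line and in the third column of an output line); the helper names and the exact alignment rule are assumptions, not the plugin's implementation.

```python
def report_taxids(report_lines):
    """Collect taxon IDs (5th tab-separated column) from Kraken 2 report lines."""
    return {line.split("\t")[4].strip() for line in report_lines if line.strip()}


def align_output(output_lines, report_lines):
    """Drop output records whose taxon ID (3rd column) is absent from the report."""
    keep = report_taxids(report_lines)
    return [line for line in output_lines if line.split("\t")[2].strip() in keep]


# Hypothetical filtered report: taxon 1423 was removed by an earlier filter.
report_lines = [
    "50.00\t1\t1\tU\t0\tunclassified",
    "50.00\t1\t1\tS\t562\tEscherichia coli",
]
output_lines = [
    "C\tread1\t562\t150\t562:116",
    "C\tread2\t1423\t150\t1423:116",
    "U\tread3\t0\t150\t0:116",
]

aligned = align_output(output_lines, report_lines)
```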

annotate -visualize-busco¶

This method generates a visualization from the BUSCO results table.

Citations¶

Inputs¶

results: BUSCOResults: BUSCO results table.[required]

Outputs¶

visualization: Visualization: <no description>[required]

annotate classify-kraken2¶

Use Kraken 2 to classify provided DNA sequence reads, contigs, or MAGs into taxonomic groups.

Citations¶

Buchfink et al., 2021; Huerta-Cepas et al., 2019

Inputs¶

seqs: List[SampleData[SequencesWithQuality | PairedEndSequencesWithQuality | JoinedSequencesWithQuality]] | List[SampleData[Contigs]] | List[FeatureData[MAG]] | List[SampleData[MAGs]]: Sequences to be classified. Single-/paired-end reads, contigs, or assembled MAGs can be provided.[required]
db: Kraken2DB: Kraken 2 database.[required]

Parameters¶

threads: Int % Range(1, None): Number of threads.[default: 1]
confidence: Float % Range(0, 1, inclusive_end=True): Confidence score threshold.[default: 0.0]
minimum_base_quality: Int % Range(0, None): Minimum base quality used in classification. Only applies when reads are used as input.[default: 0]
memory_mapping: Bool: Avoids loading the database into RAM.[default: False]
minimum_hit_groups: Int % Range(1, None): Minimum number of hit groups (overlapping k-mers sharing the same minimizer).[default: 2]
quick: Bool: Quick operation (use first hit or hits).[default: False]
report_minimizer_data: Bool: Include number of read-minimizers per-taxon and unique read-minimizers per-taxon in the report. If this parameter is enabled then merging kraken2 reports with the same sample ID from two or more input artifacts will not be possible.[default: False]
num_partitions: Int % Range(1, None): The number of partitions to split the contigs into. Defaults to partitioning into individual samples.[optional]

Outputs¶

reports: SampleData[Kraken2Report % Properties('reads')] | SampleData[Kraken2Report % Properties('contigs')] | FeatureData[Kraken2Report % Properties('mags')] | SampleData[Kraken2Report % Properties('mags')]: Reports produced by Kraken2.[required]
outputs: SampleData[Kraken2Output % Properties('reads')] | SampleData[Kraken2Output % Properties('contigs')] | FeatureData[Kraken2Output % Properties('mags')] | SampleData[Kraken2Output % Properties('mags')]: Output files produced by Kraken2.[required]
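
The parameters above correspond closely to options of the kraken2 command-line tool. The sketch below shows one plausible mapping; the `build_kraken2_command` helper and the file names are hypothetical, and whether the plugin assembles its command this way is an assumption. The flags themselves are existing kraken2 options.

```python
def build_kraken2_command(
    db,
    report,
    output,
    inputs,
    *,
    threads=1,
    confidence=0.0,
    minimum_base_quality=0,
    memory_mapping=False,
    minimum_hit_groups=2,
    quick=False,
    report_minimizer_data=False,
):
    """Translate action parameters into a kraken2 argument list."""
    cmd = [
        "kraken2",
        "--db", db,
        "--threads", str(threads),
        "--confidence", str(confidence),
        "--minimum-base-quality", str(minimum_base_quality),
        "--minimum-hit-groups", str(minimum_hit_groups),
        "--report", report,
        "--output", output,
    ]
    # Boolean parameters become bare flags that are present only when enabled.
    if memory_mapping:
        cmd.append("--memory-mapping")
    if quick:
        cmd.append("--quick")
    if report_minimizer_data:
        cmd.append("--report-minimizer-data")
    cmd.extend(inputs)
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_kraken2_command(
    "k2db", "s1.report.txt", "s1.output.txt", ["s1.fastq.gz"],
    threads=4, confidence=0.1, quick=True,
)
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting.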

annotate search-orthologs-diamond¶

Use Diamond and eggNOG to align contig or MAG sequences against the Diamond database.

Citations¶

Inputs¶

seqs: SampleData[Contigs] | SampleData[MAGs] | FeatureData[MAG]: Sequences to be searched for hits using the Diamond database.[required]
db: ReferenceDB[Diamond]: The artifact containing the Diamond database.[required]

Parameters¶

num_cpus: Int: Number of CPUs to utilize. '0' will use all available.[default: 1]
db_in_memory: Bool: Read database into memory. The database can be very large, so this option should only be used on clusters or other machines with enough memory.[default: False]
num_partitions: Int % Range(1, None): The number of partitions to split the contigs into. Defaults to partitioning into individual samples.[optional]

Outputs¶

eggnog_hits: SampleData[Orthologs]: <no description>[required]
table: FeatureTable[Frequency]: <no description>[required]
loci: GenomeData[Loci]: <no description>[required]
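
The num_partitions behavior, including its default of one partition per sample, can be sketched as below. The `partition_ids` helper, the round-robin assignment, and the capping of num_partitions at the number of samples are assumptions made for illustration; the plugin may distribute samples differently.

```python
def partition_ids(sample_ids, num_partitions=None):
    """Split sample IDs into partitions; default is one partition per sample."""
    ids = list(sample_ids)
    if num_partitions is not None and num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if num_partitions is None or num_partitions > len(ids):
        # Assumed behavior: never create more partitions than samples.
        num_partitions = len(ids)
    # Round-robin assignment keeps partition sizes within one of each other.
    return [ids[i::num_partitions] for i in range(num_partitions)]


ids = ["s1", "s2", "s3", "s4", "s5"]
per_sample = partition_ids(ids)
two_parts = partition_ids(ids, num_partitions=2)
```

Each partition can then be processed independently, which is why num_cpus is described per partition for the HMMER-based search.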

annotate search-orthologs-hmmer¶

This method uses HMMER to find possible target sequences to annotate with eggNOG-mapper.

Citations¶

HMMER, 2024; Huerta-Cepas et al., 2019

Inputs¶

seqs: SampleData[Contigs | MAGs] | FeatureData[MAG]: Sequences to be searched for hits.[required]
pressed_hmm_db: ProfileHMM[PressedProtein]: Collection of profile HMMs in binary format and indexed.[required]
idmap: EggnogHmmerIdmap: List of protein families in pressed_hmm_db.[required]
seed_alignments: GenomeData[Proteins]: Seed alignments for the protein families in pressed_hmm_db.[required]

Parameters¶

num_cpus: Int: Number of CPUs to utilize per partition. '0' will use all available.[default: 1]
db_in_memory: Bool: Read database into memory. The database can be very large, so this option should only be used on clusters or other machines with enough memory.[default: False]
num_partitions: Int % Range(1, None): The number of partitions to split the contigs into. Defaults to partitioning into individual samples.[optional]

Outputs¶

eggnog_hits: SampleData[Orthologs]: <no description>[required]
table: FeatureTable[Frequency]: <no description>[required]
loci: GenomeData[Loci]: <no description>[required]

annotate map-eggnog¶

Apply eggNOG-mapper to annotate seed orthologs.

Citations¶

Inputs¶

eggnog_hits: SampleData[Orthologs]: BLAST6-like table(s) describing the identified orthologs.[required]
db: ReferenceDB[Eggnog]: eggNOG annotation database.[required]

Parameters¶

db_in_memory: Bool: Read the eggNOG database into memory. The eggNOG database is very large (>44 GB), so this option should only be used on clusters or other machines with enough memory.[default: False]
num_cpus: Int % Range(0, None): Number of CPUs to utilize. '0' will use all available.[default: 1]
num_partitions: Int % Range(1, None): The number of partitions to split the contigs into. Defaults to partitioning into individual samples.[optional]

Outputs¶

ortholog_annotations: GenomeData[NOG]: Annotated hits.[required]

annotate evaluate-busco¶

This method uses BUSCO to assess the quality of assembled MAGs and generates a table summarizing the results.

Citations¶