annotate
Plugin Overview¶
MOdular SHotgun metagenome Pipelines with Integrated provenance Tracking: QIIME 2 plugin for metagenome analysis with tools for genome binning and functional annotation.
- version: 2025.7.0
- website: https://github.com/bokulich-lab/q2-annotate
- user support:
- Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
Actions¶
| Name | Type | Short Description | 
|---|---|---|
| bin-contigs-metabat | method | Bin contigs into MAGs using MetaBAT 2. | 
| -classify-kraken2 | method | Perform taxonomic classification of reads or MAGs using Kraken 2. | 
| collate-kraken2-reports | method | Collate kraken2 reports. | 
| collate-kraken2-outputs | method | Collate kraken2 outputs. | 
| estimate-bracken | method | Perform read abundance re-estimation using Bracken. | 
| build-kraken-db | method | Build Kraken 2 database. | 
| inspect-kraken2-db | method | Inspect a Kraken 2 database. | 
| dereplicate-mags | method | Dereplicate MAGs from multiple samples. | 
| kraken2-to-features | method | Select downstream features from Kraken 2. | 
| kraken2-to-mag-features | method | Select downstream MAG features from Kraken 2. | 
| build-custom-diamond-db | method | Create a DIAMOND formatted reference database from a FASTA input file. | 
| fetch-eggnog-db | method | Fetch the databases necessary to run the eggnog-annotate action. | 
| fetch-diamond-db | method | Fetch the complete Diamond database necessary to run the eggnog-diamond-search action. | 
| fetch-eggnog-proteins | method | Fetch the databases necessary to run the build-eggnog-diamond-db action. | 
| fetch-ncbi-taxonomy | method | Fetch NCBI reference taxonomy. | 
| build-eggnog-diamond-db | method | Create a DIAMOND formatted reference database for the specified taxon. | 
| -eggnog-diamond-search | method | Run eggNOG search using Diamond aligner. | 
| -eggnog-hmmer-search | method | Run eggNOG search using HMMER aligner. | 
| -eggnog-feature-table | method | Create an eggnog table. | 
| -eggnog-annotate | method | Annotate orthologs against eggNOG database. | 
| collate-busco-results | method | Collate BUSCO results. | 
| -evaluate-busco | method | Evaluate quality of the generated MAGs using BUSCO. | 
| predict-genes-prodigal | method | Predict gene sequences from MAGs or contigs using Prodigal. | 
| fetch-kaiju-db | method | Fetch Kaiju database. | 
| -classify-kaiju | method | Classify sequences using Kaiju. | 
| fetch-busco-db | method | Download BUSCO database. | 
| get-feature-lengths | method | Get feature lengths. | 
| filter-derep-mags | method | Filter dereplicated MAGs. | 
| filter-mags | method | Filter MAGs. | 
| fetch-eggnog-hmmer-db | method | Fetch the taxon specific database necessary to run the eggnog-hmmer-search action. | 
| estimate-abundance | method | Estimate feature (MAG/contig) abundance. | 
| extract-annotations | method | Extract annotation frequencies from all annotations. | 
| -multiply-tables | method | Multiply two feature tables. | 
| -multiply-tables-pa | method | Multiply two feature tables. | 
| -multiply-tables-relative | method | Multiply two feature tables. | 
| -filter-kraken2-reports-by-abundance | method | Filter kraken2 reports by relative abundance. | 
| -filter-kraken2-results-by-metadata | method | Filter Kraken2 reports and outputs. | 
| -merge-kraken2-results | method | Merge kraken2 reports and outputs. | 
| -align-outputs-with-reports | method | Align unfiltered kraken2 outputs with filtered kraken2 reports. | 
| -visualize-busco | visualizer | Visualize BUSCO results. | 
| classify-kraken2 | pipeline | Perform taxonomic classification of reads or MAGs using Kraken 2. | 
| search-orthologs-diamond | pipeline | Run eggNOG search using diamond aligner. | 
| search-orthologs-hmmer | pipeline | Run eggNOG search using HMMER aligner. | 
| map-eggnog | pipeline | Annotate orthologs against eggNOG database. | 
| evaluate-busco | pipeline | Evaluate quality of the generated MAGs using BUSCO. | 
| classify-kaiju | pipeline | Classify sequences using Kaiju. | 
| construct-pangenome-index | pipeline | Construct the human pangenome index. | 
| filter-reads-pangenome | pipeline | Remove contaminating human reads. | 
| multiply-tables | pipeline | Multiply two feature tables. | 
| filter-kraken2-results | pipeline | Filter kraken2 reports and outputs by metadata and abundance. | 
Artifact Classes¶
| BUSCOResults | 
| ReferenceDB[BUSCO] | 
| EggnogHmmerIdmap | 
Formats¶
| BUSCOResultsFormat | 
| BUSCOResultsDirectoryFormat | 
| BuscoDatabaseDirFmt | 
| EggnogHmmerIdmapFileFmt | 
| EggnogHmmerIdmapDirectoryFmt | 
annotate bin-contigs-metabat¶
This method uses MetaBAT 2 to bin provided contigs into MAGs.
Citations¶
Kang et al., 2019; Li et al., 2009; Zenodo, 2023
Inputs¶
- contigs: SampleData[Contigs]
- Contigs to be binned.[required] 
- alignment_maps: SampleData[AlignmentMap]
- Reads-to-contig alignment maps.[required] 
Parameters¶
- min_contig: Int%Range(1500, None)
- Minimum size of a contig for binning.[optional] 
- max_p: Int%Range(1, 100)
- Percentage of "good" contigs considered for binning decided by connection among contigs. The greater, the more sensitive.[optional] 
- min_s: Int%Range(1, 100)
- Minimum score of an edge for binning. The greater, the more specific.[optional] 
- max_edges: Int%Range(1, None)
- Maximum number of edges per node. The greater, the more sensitive.[optional] 
- p_tnf: Int%Range(0, 100)
- TNF probability cutoff for building TNF graph. Use it to skip the preparation step. (0: auto)[optional] 
- no_add: Bool
- Turning off additional binning for lost or small contigs.[optional] 
- min_cv: Int%Range(1, None)
- Minimum mean coverage of a contig in each library for binning.[optional] 
- min_cv_sum: Int%Range(1, None)
- Minimum total effective mean coverage of a contig (sum of depth over minCV) for binning.[optional] 
- min_cls_size: Int%Range(1, None)
- Minimum size of a bin as the output.[optional] 
- num_threads: Int%Range(0, None)
- Number of threads to use (0: use all cores).[optional] 
- seed: Int%Range(0, None)
- For exact reproducibility. (0: use random seed)[optional] 
- debug: Bool
- Debug output.[optional] 
- verbose: Bool
- Verbose output.[optional] 
Outputs¶
- mags: SampleData[MAGs]
- The resulting MAGs.[required] 
- contig_map: FeatureMap[MAGtoContigs]
- Mapping of MAG identifiers to the contig identifiers contained in each MAG.[required] 
- unbinned_contigs: SampleData[Contigs % Properties('unbinned')]
- Contigs that were not binned into any MAG.[required] 
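On the command line, QIIME 2 derives option names from the names listed above: inputs take an --i- prefix, parameters --p-, outputs --o-, and underscores become hyphens. Boolean parameters are passed as --p-name or --p-no-name flags. The following is a minimal sketch of assembling such a call for this action; the helper build_qiime_command and all file names are hypothetical, and the code only builds the argument list without running anything.

```python
def build_qiime_command(plugin, action, inputs=None, params=None, outputs=None):
    """Assemble a `qiime <plugin> <action>` argument list (hypothetical helper)."""
    command = ["qiime", plugin, action]
    groups = (("--i-", inputs), ("--p-", params), ("--o-", outputs))
    for prefix, group in groups:
        for name, value in (group or {}).items():
            flag = name.replace("_", "-")  # underscores become hyphens on the CLI
            if isinstance(value, bool):
                # Boolean parameters are flags: --p-name or --p-no-name.
                command.append(f"{prefix}{flag}" if value else f"{prefix}no-{flag}")
            else:
                command.extend([f"{prefix}{flag}", str(value)])
    return command


# File names below are placeholders, not files shipped with the plugin.
command = build_qiime_command(
    "annotate",
    "bin-contigs-metabat",
    inputs={"contigs": "contigs.qza", "alignment_maps": "maps.qza"},
    params={"min_contig": 2000, "seed": 42, "verbose": True},
    outputs={
        "mags": "mags.qza",
        "contig_map": "contig-map.qza",
        "unbinned_contigs": "unbinned-contigs.qza",
    },
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run in an environment where QIIME 2 and this plugin are installed.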
annotate -classify-kraken2¶
Use Kraken 2 to classify provided DNA sequence reads, contigs, or MAGs into taxonomic groups.
Citations¶
Inputs¶
- seqs: SampleData[SequencesWithQuality | PairedEndSequencesWithQuality | JoinedSequencesWithQuality]|SampleData[Contigs]|FeatureData[MAG]|SampleData[MAGs]
- Sequences to be classified. Single-/paired-end reads, contigs, or assembled MAGs can be provided.[required] 
- db: Kraken2DB
- Kraken 2 database.[required] 
Parameters¶
- threads: Int%Range(1, None)
- Number of threads.[default: - 1]
- confidence: Float%Range(0, 1, inclusive_end=True)
- Confidence score threshold.[default: - 0.0]
- minimum_base_quality: Int%Range(0, None)
- Minimum base quality used in classification. Only applies when reads are used as input.[default: - 0]
- memory_mapping: Bool
- Avoids loading the database into RAM.[default: - False]
- minimum_hit_groups: Int%Range(1, None)
- Minimum number of hit groups (overlapping k-mers sharing the same minimizer).[default: - 2]
- quick: Bool
- Quick operation (use first hit or hits).[default: - False]
- report_minimizer_data: Bool
- Include number of read-minimizers per-taxon and unique read-minimizers per-taxon in the report. If this parameter is enabled then merging kraken2 reports with the same sample ID from two or more input artifacts will not be possible.[default: - False]
Outputs¶
- reports: SampleData[Kraken2Report % Properties('reads')]|SampleData[Kraken2Report % Properties('contigs')]|FeatureData[Kraken2Report % Properties('mags')]|SampleData[Kraken2Report % Properties('mags')]
- Reports produced by Kraken2.[required] 
- outputs: SampleData[Kraken2Output % Properties('reads')]|SampleData[Kraken2Output % Properties('contigs')]|FeatureData[Kraken2Output % Properties('mags')]|SampleData[Kraken2Output % Properties('mags')]
- Output files produced by Kraken2.[required] 
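The type annotations used throughout this reference, such as Int%Range(1, None) or Float%Range(0, 1, inclusive_end=True), describe which values are accepted. In QIIME 2 a Range includes its start and excludes its end unless inclusive_start or inclusive_end say otherwise, and None means unbounded. The sketch below mirrors those semantics in plain Python; the function in_range is a hypothetical illustration, not QIIME 2's own validation code.

```python
def in_range(value, start=None, end=None, inclusive_start=True, inclusive_end=False):
    """Return True if value lies within a QIIME 2 style Range (illustrative only)."""
    if start is not None:
        if value < start or (value == start and not inclusive_start):
            return False
    if end is not None:
        if value > end or (value == end and not inclusive_end):
            return False
    return True


# confidence: Float%Range(0, 1, inclusive_end=True) accepts both 0.0 and 1.0
print(in_range(1.0, 0, 1, inclusive_end=True))

# threads: Int%Range(1, None) has no upper bound but rejects 0
print(in_range(0, 1, None))
```

Read this way, a parameter typed Float%Range(0, 1) without inclusive_end would reject exactly 1.0, which is worth checking before passing boundary values.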
annotate collate-kraken2-reports¶
Collates kraken2 reports.
Inputs¶
- reports: List[SampleData[Kraken2Report % (Properties('reads', 'contigs', 'mags')¹ | Properties('reads', 'contigs')² | Properties('reads', 'mags')³ | Properties('contigs', 'mags')⁴ | Properties('reads')⁵ | Properties('contigs')⁶ | Properties('mags')⁷)]]
- <no description>[required] 
Outputs¶
- collated_reports: SampleData[Kraken2Report % (Properties('reads', 'contigs', 'mags')¹ | Properties('reads', 'contigs')² | Properties('reads', 'mags')³ | Properties('contigs', 'mags')⁴ | Properties('reads')⁵ | Properties('contigs')⁶ | Properties('mags')⁷)]
- <no description>[required] 
annotate collate-kraken2-outputs¶
Collates kraken2 outputs.
Inputs¶
- outputs: List[SampleData[Kraken2Output % (Properties('reads', 'contigs', 'mags')¹ | Properties('reads', 'contigs')² | Properties('reads', 'mags')³ | Properties('contigs', 'mags')⁴ | Properties('reads')⁵ | Properties('contigs')⁶ | Properties('mags')⁷)]]
- <no description>[required] 
Outputs¶
- collated_outputs: SampleData[Kraken2Output % (Properties('reads', 'contigs', 'mags')¹ | Properties('reads', 'contigs')² | Properties('reads', 'mags')³ | Properties('contigs', 'mags')⁴ | Properties('reads')⁵ | Properties('contigs')⁶ | Properties('mags')⁷)]
- <no description>[required] 
annotate estimate-bracken¶
This method uses Bracken to re-estimate read abundances. Only available on Linux platforms.
Citations¶
Inputs¶
- kraken2_reports: SampleData[Kraken2Report % Properties('reads')]
- Reports produced by Kraken2.[required] 
- db: BrackenDB
- Bracken database.[required] 
Parameters¶
- threshold: Int%Range(0, None)
- Bracken: number of reads required PRIOR to abundance estimation to perform re-estimation.[default: - 0]
- read_len: Int%Range(0, None)
- Bracken: the ideal length of reads in your sample. For paired end data (e.g., 2x150) this should be set to the length of the single-end reads (e.g., 150).[default: - 100]
- level: Str%Choices('D', 'P', 'C', 'O', 'F', 'G', 'S')
- Bracken: specifies the taxonomic rank to analyze. Each classification at this specified rank will receive an estimated number of reads belonging to that rank after abundance estimation.[default: - 'S']
- include_unclassified: Bool
- Bracken does not include the unclassified read counts in the feature table. Set this to True to include those regardless.[default: - True]
Outputs¶
- reports: SampleData[Kraken2Report % Properties('bracken')]
- Reports modified by Bracken.[required] 
- taxonomy: FeatureData[Taxonomy]
- <no description>[required] 
- table: FeatureTable[Frequency]
- <no description>[required] 
annotate build-kraken-db¶
This method builds Kraken 2 and Bracken databases either (1) from provided DNA sequences to build a custom database, or (2) simply fetches pre-built versions from an online resource.
Citations¶
Wood et al., 2019; Lu et al., 2017
Inputs¶
- seqs: List[FeatureData[Sequence]]
- Sequences to be added to the Kraken 2 database.[optional] 
Parameters¶
- collection: Str%Choices('viral', 'minusb', 'standard', 'standard8', 'standard16', 'pluspf', 'pluspf8', 'pluspf16', 'pluspfp', 'pluspfp8', 'pluspfp16', 'eupathdb', 'nt', 'corent', 'gtdb', 'greengenes', 'rdp', 'silva132', 'silva138')
- Name of the database collection to be fetched. Please check https://benlangmead.github.io/aws-indexes/k2 for the description of the available options.[optional] 
- threads: Int%Range(1, None)
- Number of threads. Only applicable when building a custom database.[default: - 1]
- kmer_len: Int%Range(1, None)
- K-mer length in bp/aa.[default: - 35]
- minimizer_len: Int%Range(1, None)
- Minimizer length in bp/aa.[default: - 31]
- minimizer_spaces: Int%Range(1, None)
- Number of characters in minimizer that are ignored in comparisons.[default: - 7]
- no_masking: Bool
- Avoid masking low-complexity sequences prior to building; masking requires dustmasker or segmasker to be installed in PATH.[default: - False]
- max_db_size: Int%Range(0, None)
- Maximum number of bytes for Kraken 2 hash table; if the estimator determines more would normally be needed, the reference library will be downsampled to fit.[default: - 0]
- use_ftp: Bool
- Use FTP for downloading instead of RSYNC.[default: - False]
- load_factor: Float%Range(0, 1)
- Proportion of the hash table to be populated.[default: - 0.7]
- fast_build: Bool
- Do not require database to be deterministically built when using multiple threads. This is faster, but does introduce variability in minimizer/LCA pairs.[default: - False]
- read_len: List[Int%Range(1, None)]
- Ideal read lengths to be used while building the Bracken database.[optional] 
Outputs¶
annotate inspect-kraken2-db¶
This method generates a report of identical format to those generated by classify_kraken2, with a slightly different interpretation. Instead of reporting the number of inputs classified to a taxon/clade, the report displays the number of minimizers mapped to each taxon/clade.
Citations¶
Inputs¶
- db: Kraken2DB
- The Kraken 2 database for which to generate the report.[required] 
Parameters¶
Outputs¶
- report: Kraken2DBReport
- The report of the supplied database.[required] 
annotate dereplicate-mags¶
This method dereplicates MAGs from multiple samples using distances between them found in the provided distance matrix. For each cluster of similar MAGs, the longest one will be selected as the representative. If metadata is given as input, the MAG with the highest or lowest value in the specified metadata column is chosen, depending on the parameter "find-max". If there are MAGs with identical values, the longer one is chosen. For example, an artifact of type BUSCOResults can be passed as metadata, and the dereplication can be done by highest "completeness".
Inputs¶
- mags: SampleData[MAGs]
- MAGs to be dereplicated.[required] 
- distance_matrix: DistanceMatrix
- Matrix of distances between MAGs.[required] 
Parameters¶
- threshold: Float%Range(0, 1, inclusive_end=True)
- Similarity threshold required to consider two bins identical.[default: - 0.99]
- metadata: Metadata
- Metadata table.[optional] 
- metadata_column: Str
- Name of the metadata column used to choose the most representative bins.[default: - 'complete']
- find_max: Bool
- Set to True to choose the bin with the highest value in the metadata column. Set to False to choose the bin with the lowest value.[default: - True]
Outputs¶
- dereplicated_mags: FeatureData[MAG]
- Dereplicated MAGs.[required] 
- table: FeatureTable[PresenceAbsence]
- Mapping between MAGs and samples.[required] 
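The selection rule described for this action (longest MAG by default; otherwise highest or lowest metadata value, with ties going to the longer MAG) can be sketched as follows. This is only an illustration of the rule as stated, not the plugin's code: the function name, the MAG identifiers, and the assumption that clusters have already been formed from the distance matrix are all hypothetical.

```python
def pick_representative(cluster, lengths, metadata_values=None, find_max=True):
    """Return the ID of the MAG chosen to represent one cluster (illustrative)."""
    if metadata_values is None:
        # No metadata given: the longest MAG in the cluster wins.
        return max(cluster, key=lambda mag_id: lengths[mag_id])
    sign = 1 if find_max else -1
    # Rank by the (possibly negated) metadata value first, then by length,
    # so that ties on the metadata value go to the longer MAG.
    return max(
        cluster,
        key=lambda mag_id: (sign * metadata_values[mag_id], lengths[mag_id]),
    )


# Hypothetical cluster of three similar MAGs.
cluster = ["mag1", "mag2", "mag3"]
lengths = {"mag1": 2_100_000, "mag2": 2_400_000, "mag3": 1_900_000}
completeness = {"mag1": 98.5, "mag2": 91.0, "mag3": 98.5}

print(pick_representative(cluster, lengths))
print(pick_representative(cluster, lengths, completeness, find_max=True))
print(pick_representative(cluster, lengths, completeness, find_max=False))
```

In this toy cluster, mag1 and mag3 tie on completeness, so the length decides between them when find_max is True.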
annotate kraken2-to-features¶
Convert a Kraken 2 report, which is an annotated NCBI taxonomy tree, into generic artifacts for downstream analyses.
Inputs¶
- reports: SampleData[Kraken2Report]
- Per-sample Kraken 2 reports.[required] 
Parameters¶
- coverage_threshold: Float%Range(0, 100, inclusive_end=True)
- The minimum percent coverage required to produce a feature.[default: - 0.1]
Outputs¶
- table: FeatureTable[PresenceAbsence]
- A presence/absence table of selected features. The features are not of even ranks, but will be the most specific rank available.[required] 
- taxonomy: FeatureData[Taxonomy]
- Output taxonomy. Infra-clade ranks are ignored unless they are strain-level. Missing internal ranks are annotated by their next most specific rank, with the exception of k__Bacteria and k__Archaea, which match their domain name.[required] 
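To make the effect of coverage_threshold more concrete, the sketch below parses a small Kraken 2 report and keeps the most specific taxa whose clade percentage meets the threshold. It assumes the standard six-column, tab-separated report (percentage, clade count, direct count, rank code, taxid, name indented by two spaces per level) and assumes the threshold is compared against the first column. The function names and the toy report are hypothetical, and this is one plausible reading of "most specific rank available", not the plugin's implementation.

```python
REPORT = (
    "100.00\t2000\t0\tR\t1\troot\n"
    "100.00\t2000\t0\tD\t2\t  Bacteria\n"
    " 95.00\t1900\t0\tG\t561\t    Escherichia\n"
    " 94.95\t1899\t1899\tS\t562\t      Escherichia coli\n"
    "  0.05\t1\t1\tS\t564\t      Escherichia fergusonii\n"
    "  5.00\t100\t99\tG\t1386\t    Bacillus\n"
    "  0.05\t1\t1\tS\t1423\t      Bacillus subtilis\n"
)


def parse_report(text):
    """Parse six-column Kraken 2 report lines into dictionaries."""
    nodes = []
    for line in text.splitlines():
        if not line.strip():
            continue
        percent, _clade, _direct, rank, taxid, name = line.split("\t")
        # Depth in the taxonomy tree is encoded as two spaces per level.
        depth = (len(name) - len(name.lstrip(" "))) // 2
        nodes.append({"percent": float(percent), "rank": rank,
                      "taxid": int(taxid), "name": name.strip(), "depth": depth})
    return nodes


def most_specific_features(nodes, coverage_threshold=0.1):
    """Keep taxa at or above the threshold that have no retained descendant."""
    kept = [n for n in nodes if n["percent"] >= coverage_threshold]
    features = []
    for i, node in enumerate(kept):
        following = kept[i + 1] if i + 1 < len(kept) else None
        # The report is in depth-first order and clade percentages are
        # cumulative, so a retained node has a retained descendant exactly
        # when the next retained node sits deeper in the tree.
        if following is None or following["depth"] <= node["depth"]:
            features.append(node["name"])
    return features


nodes = parse_report(REPORT)
print(most_specific_features(nodes, coverage_threshold=0.1))
```

With the default threshold of 0.1, the low-abundance species fall away and Bacillus is reported at genus level, which illustrates why the resulting features are not of even ranks.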
annotate kraken2-to-mag-features¶
Convert a Kraken 2 report, which is an annotated NCBI taxonomy tree, into generic artifacts for downstream analyses.
Inputs¶
- reports: FeatureData[Kraken2Report % Properties('mags')]
- Per-sample Kraken 2 reports.[required] 
- outputs: FeatureData[Kraken2Output % Properties('mags')]
- Per-sample Kraken 2 output files.[required] 
Parameters¶
- coverage_threshold: Float%Range(0, 100, inclusive_end=True)
- The minimum percent coverage required to produce a feature.[default: - 0.1]
Outputs¶
- taxonomy: FeatureData[Taxonomy]
- Output taxonomy. Infra-clade ranks are ignored unless they are strain-level. Missing internal ranks are annotated by their next most specific rank, with the exception of k__Bacteria and k__Archaea, which match their domain name.[required] 
annotate build-custom-diamond-db¶
Creates an artifact containing a binary DIAMOND database file (ref_db.dmnd) from a protein reference database file in FASTA format.
Citations¶
Inputs¶
- seqs: FeatureData[ProteinSequence]
- Protein reference database.[required] 
- taxonomy: ReferenceDB[NCBITaxonomy]
- Reference taxonomy, needed to provide taxonomy features.[optional] 
Parameters¶
- threads: Int%Range(1, None)
- Number of CPU threads.[default: - 1]
- file_buffer_size: Int%Range(1, None)
- File buffer size in bytes.[default: - 67108864]
- ignore_warnings: Bool
- Ignore warnings.[default: - False]
- no_parse_seqids: Bool
- Print raw seqids without parsing.[default: - False]
Outputs¶
- db: ReferenceDB[Diamond]
- DIAMOND database.[required] 
Examples¶
Minimum working example¶
wget -O 'sequences.qza' \
  'https://moshpit--20.org.readthedocs.build/en/20/data/examples/annotate/build-custom-diamond-db/1/sequences.qza'

qiime annotate build-custom-diamond-db \
  --i-seqs sequences.qza \
  --o-db diamond-db.qza

from qiime2 import Artifact
from urllib import request
import qiime2.plugins.annotate.actions as annotate_actions

url = 'https://moshpit--20.org.readthedocs.build/en/20/data/examples/annotate/build-custom-diamond-db/1/sequences.qza'
fn = 'sequences.qza'
request.urlretrieve(url, fn)
sequences = Artifact.load(fn)

diamond_db, = annotate_actions.build_custom_diamond_db(
    seqs=sequences,
)

library(reticulate)

Artifact <- import("qiime2")$Artifact
annotate_actions <- import("qiime2.plugins.annotate.actions")
request <- import("urllib")$request

url <- 'https://moshpit--20.org.readthedocs.build/en/20/data/examples/annotate/build-custom-diamond-db/1/sequences.qza'
fn <- 'sequences.qza'
request$urlretrieve(url, fn)
sequences <- Artifact$load(fn)

action_results <- annotate_actions$build_custom_diamond_db(
    seqs=sequences,
)
diamond_db <- action_results$db

from q2_annotate._examples import diamond_makedb

diamond_makedb(use)
annotate fetch-eggnog-db¶
Downloads EggNOG reference database using the download_eggnog_data.py script from eggNOG. Here, this script downloads 3 files and stores them in the output artifact. At least 80 GB of storage space is required to run this action.
Citations¶
Outputs¶
- db: ReferenceDB[Eggnog]
- eggNOG annotation database.[required] 
annotate fetch-diamond-db¶
Downloads Diamond reference database. This action downloads 1 file (ref_db.dmnd). At least 18 GB of storage space is required to run this action.
Citations¶
Buchfink et al., 2021; Huerta-Cepas et al., 2019
Outputs¶
- db: ReferenceDB[Diamond]
- Complete Diamond reference database.[required] 
annotate fetch-eggnog-proteins¶
Downloads eggNOG proteome database. This script downloads 2 files (e5.proteomes.faa and e5.taxid_info.tsv) and creates an artifact with them. At least 18 GB of storage space is required to run this action.
Citations¶
Outputs¶
- eggnog_proteins: ReferenceDB[EggnogProteinSequences]
- eggNOG database of protein sequences and their corresponding taxonomy information.[required] 
annotate fetch-ncbi-taxonomy¶
Downloads NCBI reference taxonomy from the NCBI FTP server. The resulting artifact is required by the build-custom-diamond-db action if one wishes to create a Diamond database with taxonomy features. At least 30 GB of storage space is required to run this action.
Citations¶
National Center for Biotechnology Information (NCBI), n.d.
Outputs¶
- taxonomy: ReferenceDB[NCBITaxonomy]
- NCBI reference taxonomy.[required] 
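Each of the fetch actions above states a minimum amount of free storage. A quick pre-flight check before starting a long download might look like the sketch below. The helper has_enough_space is hypothetical, 1 GB is taken as 10**9 bytes, and which directory actually needs the space (the output location, the temporary directory, or both) is an assumption to verify for your setup.

```python
import shutil

# Minimum free space per action in GB, as stated in the action descriptions.
REQUIRED_GB = {
    "fetch-eggnog-db": 80,
    "fetch-diamond-db": 18,
    "fetch-eggnog-proteins": 18,
    "fetch-ncbi-taxonomy": 30,
}


def has_enough_space(action, path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb[action] * 10**9


for action in REQUIRED_GB:
    status = "ok" if has_enough_space(action) else "insufficient free space"
    print(f"{action}: {status}")
```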
annotate build-eggnog-diamond-db¶
Creates a DIAMOND database which contains the protein sequences that belong to the specified taxon.
Citations¶
Buchfink et al., 2021; Huerta-Cepas et al., 2019
Inputs¶
- eggnog_proteins: ReferenceDB[EggnogProteinSequences]
- eggNOG database of protein sequences and their corresponding taxonomy information (generated through the fetch-eggnog-proteins action).[required]
Parameters¶
Outputs¶
- Complete Diamond reference database for the specified taxon.[required] 
annotate -eggnog-diamond-search¶
This method performs the steps by which we find our possible target sequences to annotate using the Diamond search functionality from the eggnog emapper.py script.
Citations¶
Buchfink et al., 2021; Huerta-Cepas et al., 2019
Inputs¶
- seqs: SampleData[Contigs]|SampleData[MAGs]|FeatureData[MAG]
- Sequences to be searched for ortholog hits.[required] 
- db: ReferenceDB[Diamond]
- Diamond database.[required] 
Parameters¶
- num_cpus: Int
- Number of CPUs to utilize. '0' will use all available.[default: - 1]
- db_in_memory: Bool
- Read database into memory. The database can be very large, so this option should only be used on clusters or other machines with enough memory.[default: - False]
Outputs¶
- eggnog_hits: SampleData[Orthologs]
- BLAST6-like table(s) describing the identified orthologs. One table per sample or MAG in the input.[required] 
- table: FeatureTable[Frequency]
- Feature table with counts of orthologs per sample/MAG.[required] 
- loci: GenomeData[Loci]
- Loci of the identified orthologs.[required] 
annotate -eggnog-hmmer-search¶
This method performs the steps by which we find our possible target sequences to annotate using the HMMER search functionality from the eggnog emapper.py script.
Citations¶
Buchfink et al., 2021; Huerta-Cepas et al., 2019
Inputs¶
- seqs: SampleData[Contigs]|SampleData[MAGs]|FeatureData[MAG]
- Sequences to be searched for hits.[required] 
- idmap: EggnogHmmerIdmap
- List of protein families in pressed_hmm_db.[required]
- pressed_hmm_db: ProfileHMM[PressedProtein]
- Collection of Profile HMMs in binary format and indexed.[required] 
- seed_alignments: GenomeData[Proteins]
- Seed alignments for the protein families in pressed_hmm_db.[required]
Parameters¶
- num_cpus: Int
- Number of CPUs to utilize per partition. '0' will use all available.[default: - 1]
- db_in_memory: Bool
- Read database into memory. The database can be very large, so this option should only be used on clusters or other machines with enough memory.[default: - False]
Outputs¶
- eggnog_hits: SampleData[Orthologs]
- BLAST6-like table(s) describing the identified orthologs. One table per sample or MAG in the input.[required] 
- table: FeatureTable[Frequency]
- Feature table with counts of orthologs per sample/MAG.[required] 
- loci: GenomeData[Loci]
- Loci of the identified orthologs.[required] 
annotate -eggnog-feature-table¶
Create an eggnog table.
Inputs¶
- seed_orthologs: SampleData[Orthologs]
- Sequence data to be turned into an eggnog feature table.[required] 
Outputs¶
- table: FeatureTable[Frequency]
- <no description>[required] 
annotate -eggnog-annotate¶
Apply eggnog mapper to annotate seed orthologs.
Citations¶
Inputs¶
- eggnog_hits: SampleData[Orthologs]
- <no description>[required] 
- db: ReferenceDB[Eggnog]
- <no description>[required] 
Parameters¶
- db_in_memory: Bool
- Read eggnog database into memory. The eggNOG database is very large (>44GB), so this option should only be used on clusters or other machines with enough memory.[default: - False]
- num_cpus: Int%Range(0, None)
- Number of CPUs to utilize. '0' will use all available.[default: - 1]
Outputs¶
- ortholog_annotations: GenomeData[NOG]
- <no description>[required] 
annotate collate-busco-results¶
Collates BUSCO results.
Inputs¶
- results: List[BUSCOResults]
- <no description>[required] 
Outputs¶
- collated_results: BUSCOResults
- <no description>[required] 
annotate -evaluate-busco¶
This method uses BUSCO to assess the quality of assembled MAGs and generates a table summarizing the results.
Citations¶
Inputs¶
- mags: SampleData[MAGs]|FeatureData[MAG]
- MAGs to be analyzed.[required] 
- db: ReferenceDB[BUSCO]
- BUSCO database.[required] 
Parameters¶
- mode: Str%Choices('genome')
- Specify which BUSCO analysis mode to run. Currently only the 'genome' option is supported, for genome assemblies. In the future modes for transcriptome assemblies and for annotated gene sets (proteins) will be made available.[default: - 'genome']
- lineage_dataset: Str
- Specify the name of the BUSCO lineage to be used. To see all possible options run busco --list-datasets.[optional]
- augustus: Bool
- Use augustus gene predictor for eukaryote runs.[default: - False]
- augustus_parameters: Str
- Pass additional arguments to Augustus. All arguments should be contained within a single string with no white space, with each argument separated by a comma. Example: '--PARAM1=VALUE1,--PARAM2=VALUE2'.[optional] 
- augustus_species: Str
- Specify a species for Augustus training.[optional] 
- auto_lineage: Bool
- Run auto-lineage to find optimum lineage path.[default: - False]
- auto_lineage_euk: Bool
- Run auto-placement just on eukaryote tree to find optimum lineage path.[default: - False]
- auto_lineage_prok: Bool
- Run auto-lineage just on non-eukaryote trees to find optimum lineage path.[default: - False]
- cpu: Int%Range(1, None)
- Specify the number (N=integer) of threads/cores to use.[default: - 1]
- contig_break: Int%Range(0, None)
- Number of contiguous Ns to signify a break between contigs. See https://gitlab.com/ezlab/busco/-/issues/691 for a more detailed explanation.[default: - 10]
- evalue: Float%Range(0, None, inclusive_start=False)
- E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03.[default: - 0.001]
- limit: Int%Range(1, 20)
- How many candidate regions (contig or transcript) to consider per BUSCO.[default: - 3]
- long: Bool
- Optimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms.[default: - False]
- metaeuk_parameters: Str
- Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single string with no white space, with each argument separated by a comma. Example: '--PARAM1=VALUE1,--PARAM2=VALUE2'.[optional]
- metaeuk_rerun_parameters: Str
- Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single string with no white space, with each argument separated by a comma. Example: '--PARAM1=VALUE1,--PARAM2=VALUE2'.[optional]
- miniprot: Bool
- Use miniprot gene predictor for eukaryote runs.[default: - False]
- additional_metrics: Bool
- Adds completeness and contamination values to the BUSCO report. Check here for documentation: https://github.com/metashot/busco?tab=readme-ov-file#documetation [default: - False]
Outputs¶
- results: BUSCOResults
- BUSCO result table.[required] 
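The augustus_parameters, metaeuk_parameters, and metaeuk_rerun_parameters values above all expect one string without white space, with arguments separated by commas. A small helper can produce that format from a dictionary; format_tool_parameters and the PARAM/VALUE names are placeholders for illustration, not part of the plugin.

```python
def format_tool_parameters(parameters):
    """Join options into a single comma-separated string without white space."""
    parts = []
    for name, value in parameters.items():
        part = f"--{name}={value}"
        # White space or commas inside an argument would break the format.
        if any(ch.isspace() for ch in part) or "," in part:
            raise ValueError(f"white space and commas are not allowed: {part!r}")
        parts.append(part)
    return ",".join(parts)


print(format_tool_parameters({"PARAM1": "VALUE1", "PARAM2": "VALUE2"}))
```

The printed string matches the example format given in the parameter descriptions and can be passed as the value of the corresponding parameter.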
annotate predict-genes-prodigal¶
Predicts genes using Prodigal (PROkaryotic DYnamic programming Gene-finding ALgorithm), a gene prediction algorithm designed for improved gene structure prediction, translation initiation site recognition, and reduced false positives in bacterial and archaeal genomes.
Citations¶
Inputs¶
- seqs: FeatureData[MAG]|SampleData[MAGs]|SampleData[Contigs]
- MAGs or contigs for which one wishes to predict genes.[required] 
Parameters¶
- translation_table_number: Str%Choices('1', '2', '3', '4', '5', '6', '9', '10', '11', '12', '13', '14', '15', '16', '21', '22', '23', '24', '25')
- Translation table to be used to translate genes into sequences of amino acids. See https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for reference.[default: - '11']
Outputs¶
- loci: GenomeData[Loci]
- Gene coordinates files (one per MAG or sample) listing the location of each predicted gene as well as some additional scoring information.[required] 
- genes: GenomeData[Genes]
- Fasta files (one per MAG or sample) with the nucleotide sequences of the predicted genes.[required] 
- proteins: GenomeData[Proteins]
- Fasta files (one per MAG or sample) with the protein translation of the predicted genes.[required] 
annotate fetch-kaiju-db¶
This method fetches the latest Kaiju database from Kaiju's web server.
Citations¶
Parameters¶
- database_type: Str%Choices('nr', 'nr_euk', 'refseq', 'refseq_ref', 'refseq_nr', 'fungi', 'viruses', 'plasmids', 'progenomes', 'rvdb')
- Type of database to be downloaded. For more information on available types please see the list on Kaiju's web server: https://bioinformatics-centre.github.io/kaiju/downloads.html [required] 
Outputs¶
- db: KaijuDB
- Kaiju database.[required] 
annotate -classify-kaiju¶
This method uses Kaiju to perform taxonomic classification of DNA sequence reads or contigs.
Citations¶
Inputs¶
- seqs: SampleData[SequencesWithQuality | PairedEndSequencesWithQuality | JoinedSequencesWithQuality]|SampleData[Contigs]
- Sequences to be classified.[required] 
- db: KaijuDB
- Kaiju database.[required] 
Parameters¶
- z: Int%Range(1, None)
- Number of threads.[default: - 1]
- a: Str%Choices('greedy', 'mem')
- Run mode.[default: - 'greedy']
- e: Int%Range(1, None)
- Number of mismatches allowed in Greedy mode.[default: - 3]
- m: Int%Range(1, None)
- Minimum match length.[default: - 11]
- s: Int%Range(1, None)
- Minimum match score in Greedy mode.[default: - 65]
- evalue: Float%Range(0, 1)
- Minimum E-value in Greedy mode.[default: - 0.01]
- x: Bool
- Enable SEG low complexity filter.[default: - True]
- r: Str%Choices('phylum', 'class', 'order', 'family', 'genus', 'species')
- Taxonomic rank.[default: - 'species']
- c: Float%Range(0, 100)
- Minimum required number or fraction of reads for the taxon (except viruses) to be reported.[default: - 0.0]
- exp: Bool
- Expand viruses, which are always shown as full taxon path and read counts are not summarized in higher taxonomic levels.[default: - False]
- u: Bool
- Do not count unclassified reads for the total reads when calculating percentages for classified reads.[default: - False]
Outputs¶
- abundances: FeatureTable[Frequency]
- Sequence abundances.[required] 
- taxonomy: FeatureData[Taxonomy]
- Linked taxonomy.[required] 
annotate fetch-busco-db¶
Downloads BUSCO database for the specified lineage. Output can be used to run BUSCO with the 'evaluate-busco' action.
Citations¶
Parameters¶
- lineages: List[Str]
- Lineages to be included in the database. Can be any valid BUSCO lineage or any of the following: 'all' (for all lineages), 'prokaryota', 'eukaryota', 'virus'.[optional] 
Outputs¶
- db: ReferenceDB[BUSCO]
- BUSCO database for the specified lineages.[required] 
annotate get-feature-lengths¶
This method extracts lengths for the provided feature set.
Inputs¶
- features: FeatureData[MAG | Sequence]|SampleData[MAGs | Contigs]
- Features to get lengths for.[required] 
Outputs¶
- lengths: FeatureData[SequenceCharacteristics % Properties('length')]
- Feature lengths.[required] 
annotate filter-derep-mags¶
Filter dereplicated MAGs based on metadata.
Inputs¶
- mags: FeatureData[MAG]
- Dereplicated MAGs to filter.[required] 
Parameters¶
- metadata: Metadata
- Sample metadata indicating which MAG ids to filter. The optional where parameter may be used to filter ids based on specified conditions in the metadata. The optional exclude_ids parameter may be used to exclude the ids specified in the metadata from the filter.[required]
- where: Str
- Optional SQLite WHERE clause specifying MAG metadata criteria that must be met to be included in the filtered data. If not provided, all MAGs in metadata that are also in the MAG data will be retained.[optional]
- exclude_ids: Bool
- Defaults to False. If True, the MAGs selected by the metadata and optional where parameter will be excluded from the filtered data.[default: - False]
Outputs¶
- filtered_mags: FeatureData[MAG]
- <no description>[required] 
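The interplay of the metadata, where, and exclude_ids parameters can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the function `filter_ids`, the `md` table, and the `completeness` column are hypothetical, and the behavior for IDs missing from the metadata when exclude_ids is True (they are retained here) is an assumption rather than something the text above specifies.

```python
import sqlite3


def filter_ids(data_ids, metadata, where=None, exclude_ids=False):
    """Return the IDs from data_ids that survive a metadata-based filter.

    metadata maps each ID to a dict of column values; where is an optional
    SQLite WHERE clause evaluated against those columns.
    """
    columns = sorted({col for row in metadata.values() for col in row})
    conn = sqlite3.connect(":memory:")
    col_defs = ['"id" TEXT PRIMARY KEY'] + [f'"{c}"' for c in columns]
    conn.execute(f"CREATE TABLE md ({', '.join(col_defs)})")
    placeholders = ", ".join("?" for _ in col_defs)
    for id_, row in metadata.items():
        values = [id_] + [row.get(c) for c in columns]
        conn.execute(f"INSERT INTO md VALUES ({placeholders})", values)
    # The clause is interpolated directly: acceptable for a sketch only.
    query = 'SELECT "id" FROM md' + (f" WHERE {where}" if where else "")
    selected = {r[0] for r in conn.execute(query)}
    conn.close()
    if exclude_ids:
        return [i for i in data_ids if i not in selected]
    return [i for i in data_ids if i in selected]


metadata = {
    "mag1": {"completeness": 95.0},
    "mag2": {"completeness": 40.0},
    "mag3": {"completeness": 88.0},
}
data_ids = ["mag1", "mag2", "mag4"]

print(filter_ids(data_ids, metadata))                             # ['mag1', 'mag2']
print(filter_ids(data_ids, metadata, where="completeness > 80"))  # ['mag1']
print(filter_ids(data_ids, metadata, where="completeness > 80",
                 exclude_ids=True))                               # ['mag2', 'mag4']
```

Without a WHERE clause, every ID present in both the metadata and the data is kept, which mirrors the description of the where parameter above.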
annotate filter-mags¶
Filter MAGs based on metadata.
Inputs¶
- mags: SampleData[MAGs]
- MAGs to filter.[required] 
Parameters¶
- metadata: Metadata
- Sample metadata indicating which MAG ids to filter. The optional 'where' parameter may be used to filter ids based on specified conditions in the metadata. The optional 'exclude_ids' parameter may be used to exclude the ids specified in the metadata from the filter.[required]
- where: Str
- Optional SQLite WHERE clause specifying MAG metadata criteria that must be met to be included in the filtered data. If not provided, all MAGs in 'metadata' that are also in the MAG data will be retained.[optional]
- exclude_ids: Bool
- Defaults to False. If True, the MAGs selected by the 'metadata' and optional 'where' parameter will be excluded from the filtered data.[default: False]
- on: Str%Choices('sample', 'mag')
- Whether to filter based on sample or MAG metadata.[default: 'mag']
Outputs¶
- filtered_mags: SampleData[MAGs]
- <no description>[required] 
annotate fetch-eggnog-hmmer-db¶
Downloads Profile HMM database for the specified taxon.
Citations¶
Huerta-Cepas et al., 2019; HMMER, 2024
Parameters¶
Outputs¶
- idmap: EggnogHmmerIdmap%Properties('eggnog')
- List of protein families in 'hmm_db'.[required]
- hmm_db: ProfileHMM[MultipleProtein]%Properties('eggnog')
- Collection of Profile HMMs.[required]
- pressed_hmm_db: ProfileHMM[PressedProtein]%Properties('eggnog')
- Collection of Profile HMMs in binary format and indexed.[required]
- seed_alignments: GenomeData[Proteins]%Properties('eggnog')
- Seed alignments for the protein families in 'hmm_db'.[required]
annotate estimate-abundance¶
This method estimates MAG/contig abundances by mapping the reads to them and calculating the respective metric values, which are then used as a proxy for the frequency.
Inputs¶
- alignment_maps: FeatureData[AlignmentMap]|SampleData[AlignmentMap]
- Bowtie2 alignment maps between reads and features for which the abundance should be estimated.[required] 
- feature_lengths: FeatureData[SequenceCharacteristics % Properties('length')]
- Table containing length of every feature (MAG/contig).[required] 
Parameters¶
- metric: Str%Choices('rpkm')|Str%Choices('tpm')
- Metric to be used as a proxy of feature abundance.[default: 'rpkm']
- min_mapq: Int%Range(0, 255)
- Minimum mapping quality.[default: 0]
- min_query_len: Int%Range(0, None)
- Minimum query length.[default: 0]
- min_base_quality: Int%Range(0, None)
- Minimum base quality.[default: 0]
- min_read_len: Int%Range(0, None)
- Minimum read length.[default: 0]
- threads: Int%Range(1, None)
- Number of threads to pass to samtools.[default: 1]
Outputs¶
- abundances: FeatureTable[Frequency % (Properties('rpkm')¹ | Properties('tpm')²)]
- Feature abundances.[required] 
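For orientation, the two metrics can be written out using their commonly used definitions: RPKM as reads per kilobase of feature per million mapped reads, and TPM as length-normalized read rates rescaled to sum to one million. The sketch below assumes those textbook formulas; the function names are hypothetical, and the plugin's exact computation (for example, how mapped reads are counted after the quality filters) is not described above.

```python
def rpkm(read_counts, lengths):
    """Reads per kilobase of feature per million mapped reads."""
    total_reads = sum(read_counts.values())
    if total_reads == 0:
        return {f: 0.0 for f in read_counts}
    return {
        f: n * 1e9 / (lengths[f] * total_reads)
        for f, n in read_counts.items()
    }


def tpm(read_counts, lengths):
    """Length-normalized read rates, rescaled to sum to one million."""
    rates = {f: n / lengths[f] for f, n in read_counts.items()}
    total_rate = sum(rates.values())
    if total_rate == 0:
        return {f: 0.0 for f in read_counts}
    return {f: r / total_rate * 1e6 for f, r in rates.items()}


# Hypothetical mapped-read counts and feature lengths (in base pairs).
read_counts = {"mag_a": 300, "mag_b": 100}
lengths = {"mag_a": 3000, "mag_b": 1000}

print(rpkm(read_counts, lengths))
print(tpm(read_counts, lengths))
```

In this example both features have the same read density (0.1 reads per base), so both metrics assign them equal abundance even though their raw read counts differ.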
annotate extract-annotations¶
This method extracts a specific annotation from the table generated by EggNOG and calculates its frequencies across all MAGs.
Inputs¶
- ortholog_annotations: GenomeData[NOG]
- Ortholog annotations.[required] 
Parameters¶
- annotation: Str%Choices('cog', 'caz', 'kegg_ko', 'kegg_pathway', 'kegg_reaction', 'kegg_module', 'brite', 'ec')
- Annotation to extract.[required] 
- max_evalue: Float%Range(0, None)
- <no description>[default: 1.0]
- min_score: Float%Range(0, None)
- <no description>[default: 0.0]
Outputs¶
- annotation_frequency: FeatureTable[Frequency]
- Feature table with frequency of each annotation.[required] 
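The max_evalue and min_score parameters have no description above, so the following is only a guess at how such a frequency table might be assembled: hits are filtered by e-value and score, and each annotation term is then counted per MAG. The record layout, the function name `annotation_frequencies`, the comma-separated term lists, and the "-" placeholder for missing annotations are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def annotation_frequencies(hits, annotation, max_evalue=1.0, min_score=0.0):
    """Count annotation terms per MAG among hits that pass both filters."""
    table = defaultdict(Counter)
    for hit in hits:
        if hit["evalue"] > max_evalue or hit["score"] < min_score:
            continue
        value = hit.get(annotation, "-")
        if value in ("", "-"):
            # Assumed placeholder for "no annotation of this kind".
            continue
        for term in value.split(","):
            table[hit["mag"]][term] += 1
    return {mag: dict(counts) for mag, counts in table.items()}


# Hypothetical annotated hits.
hits = [
    {"mag": "mag1", "evalue": 1e-30, "score": 200.0, "kegg_ko": "ko:K00001,ko:K00002"},
    {"mag": "mag1", "evalue": 1e-5, "score": 50.0, "kegg_ko": "ko:K00001"},
    {"mag": "mag2", "evalue": 0.5, "score": 10.0, "kegg_ko": "-"},
    {"mag": "mag2", "evalue": 1e-10, "score": 20.0, "kegg_ko": "ko:K00002"},
]

print(annotation_frequencies(hits, "kegg_ko"))
# {'mag1': {'ko:K00001': 2, 'ko:K00002': 1}, 'mag2': {'ko:K00002': 1}}
print(annotation_frequencies(hits, "kegg_ko", min_score=30.0))
# {'mag1': {'ko:K00001': 2, 'ko:K00002': 1}}
```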
annotate -multiply-tables¶
Calculates the dot product of two feature tables with matching dimensions. If table 1 has shape (M x N) and table 2 has shape (N x P), the resulting table will have shape (M x P). Note that the tables must be identical in the N dimension.
Inputs¶
- table1: FeatureTable[Frequency]
- First feature table.[required] 
- table2: FeatureTable[Frequency]
- Second feature table with matching dimension.[required] 
Outputs¶
- result_table: FeatureTable[Frequency]
- Feature table with the dot product of the two original tables. The table will have a shape of (M x P) where M is the number of rows from table1 and P is the number of columns from table2.[required]
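The shape rule can be checked with a small pure-Python matrix product. In this sketch the tables are nested dictionaries, the sample, MAG, and KO identifiers are made up, and the function name `multiply_tables` is hypothetical; it only illustrates the arithmetic, not the plugin's implementation.

```python
def multiply_tables(table1, table2):
    """Multiply an (M x N) table by an (N x P) table.

    table1 maps row ID -> {inner ID: value}; table2 maps inner ID ->
    {column ID: value}. The inner IDs must be identical in both tables.
    """
    inner_ids = {k for row in table1.values() for k in row}
    if inner_ids != set(table2):
        raise ValueError("Tables do not match in the shared (N) dimension.")
    col_ids = sorted({c for row in table2.values() for c in row})
    return {
        row_id: {
            c: sum(v * table2[k].get(c, 0) for k, v in row.items())
            for c in col_ids
        }
        for row_id, row in table1.items()
    }


# 2 samples x 2 MAGs, multiplied by 2 MAGs x 3 annotations.
table1 = {"s1": {"magA": 2, "magB": 1}, "s2": {"magA": 0, "magB": 3}}
table2 = {
    "magA": {"K1": 1, "K2": 0, "K3": 2},
    "magB": {"K1": 0, "K2": 1, "K3": 1},
}

print(multiply_tables(table1, table2))
# {'s1': {'K1': 2, 'K2': 1, 'K3': 5}, 's2': {'K1': 0, 'K2': 3, 'K3': 3}}
```

The result has one row per row of table1 and one column per column of table2, that is, 2 x 3 in this example.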
annotate -multiply-tables-pa¶
Calculates the dot product of two feature tables with matching dimensions. If table 1 has shape (M x N) and table 2 has shape (N x P), the resulting table will have shape (M x P). Note that the tables must be identical in the N dimension.
Inputs¶
- table1: FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[Frequency]|FeatureTable[RelativeFrequency]
- First feature table.[required] 
- table2: FeatureTable[Frequency]|FeatureTable[RelativeFrequency]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]
- Second feature table with matching dimension.[required] 
Outputs¶
- result_table: FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]
- Feature table with the dot product of the two original tables. The table will have a shape of (M x P) where M is the number of rows from table1 and P is the number of columns from table2.[required]
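The type signature above shows that any product involving a presence/absence table yields a presence/absence table. One plausible reading, assumed here and not confirmed by the text, is that an output cell is 1 whenever the corresponding dot product is nonzero. For non-negative tables this is equivalent to asking whether at least one shared feature is present in both operands, as in this sketch (the function name is hypothetical):

```python
def multiply_presence_absence(a, b):
    """Multiply an (M x N) by an (N x P) table and binarize the result.

    Both tables are lists of rows with non-negative values. A cell is 1 if
    at least one inner index is nonzero in both the row and the column.
    """
    n = len(b)
    if any(len(row) != n for row in a):
        raise ValueError("Tables do not match in the shared (N) dimension.")
    n_cols = len(b[0]) if b else 0
    return [
        [
            int(any(row[k] != 0 and b[k][j] != 0 for k in range(n)))
            for j in range(n_cols)
        ]
        for row in a
    ]


a = [[1, 0],
     [0, 0]]
b = [[0, 3],
     [5, 0]]
print(multiply_presence_absence(a, b))  # [[0, 1], [0, 0]]
```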
annotate -multiply-tables-relative¶
Calculates the dot product of two feature tables with matching dimensions. If table 1 has shape (M x N) and table 2 has shape (N x P), the resulting table will have shape (M x P). Note that the tables must be identical in the N dimension.
Inputs¶
- table1: FeatureTable[RelativeFrequency]|FeatureTable[Frequency]|FeatureTable[RelativeFrequency]
- First feature table.[required] 
- table2: FeatureTable[Frequency]|FeatureTable[RelativeFrequency]|FeatureTable[RelativeFrequency]
- Second feature table with matching dimension.[required] 
Outputs¶
- result_table: FeatureTable[PresenceAbsence]|FeatureTable[RelativeFrequency]|FeatureTable[RelativeFrequency]
- Feature table with the dot product of the two original tables. The table will have a shape of (M x P) where M is the number of rows from table1 and P is the number of columns from table2.[required]
annotate -filter-kraken2-reports-by-abundance¶
Filters kraken2 reports on a per-taxon basis by relative abundance (relative frequency). Useful for removing suspected spurious classifications.
Inputs¶
- reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Report % Properties('contigs', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'contigs')]|SampleData[Kraken2Report % Properties('reads')]|SampleData[Kraken2Report % Properties('contigs')]|SampleData[Kraken2Report % Properties('mags')]|FeatureData[Kraken2Report % Properties('mags')]
- The kraken2 reports to filter by relative abundance.[required] 
Parameters¶
- abundance_threshold: Float%Range(0, 1, inclusive_end=True)
- A proportion between 0 and 1 representing the minimum relative abundance (by classified read count) that a taxon must have to be retained in the filtered report.[required] 
- remove_empty: Bool
- If True, reports with only unclassified reads remaining will be removed from the filtered data.[default: False]
Outputs¶
- filtered_reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Report % Properties('contigs', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'contigs')]|SampleData[Kraken2Report % Properties('reads')]|SampleData[Kraken2Report % Properties('contigs')]|SampleData[Kraken2Report % Properties('mags')]|FeatureData[Kraken2Report % Properties('mags')]
- The relative abundance-filtered kraken2 reports[required] 
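The thresholding idea can be shown on a flat set of per-taxon read counts. This is a deliberately simplified sketch: it ignores the taxonomic hierarchy of a real kraken2 report, the function name and taxa are hypothetical, and treating the threshold as inclusive (a taxon exactly at the threshold is kept) is an assumption based on the word "minimum" in the parameter description.

```python
def filter_taxa(classified_counts, abundance_threshold):
    """Keep taxa whose share of classified reads meets the threshold."""
    total = sum(classified_counts.values())
    if total == 0:
        return {}
    return {
        taxon: n
        for taxon, n in classified_counts.items()
        if n / total >= abundance_threshold
    }


# Hypothetical classified read counts for one sample.
counts = {"Escherichia coli": 950, "Bacillus subtilis": 45, "Spurious taxon": 5}

print(filter_taxa(counts, 0.01))
# {'Escherichia coli': 950, 'Bacillus subtilis': 45}
```

Here the third taxon accounts for 0.5% of classified reads and is dropped at a threshold of 0.01 (1%).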
annotate -filter-kraken2-results-by-metadata¶
Filter Kraken2 reports and outputs based on metadata or remove reports with 100% unclassified reads.
Inputs¶
- reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Report % Properties('contigs', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'contigs')]|SampleData[Kraken2Report % Properties('reads')]|SampleData[Kraken2Report % Properties('contigs')]|SampleData[Kraken2Report % Properties('mags')]|FeatureData[Kraken2Report % Properties('mags')]
- The Kraken reports to filter.[required] 
- outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Output % Properties('contigs', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'contigs')]|SampleData[Kraken2Output % Properties('reads')]|SampleData[Kraken2Output % Properties('contigs')]|SampleData[Kraken2Output % Properties('mags')]|FeatureData[Kraken2Output % Properties('mags')]
- The Kraken outputs to filter.[required] 
Parameters¶
- metadata: Metadata
- Metadata indicating which IDs to filter. The optional 'where' parameter may be used to filter IDs based on specified conditions in the metadata. The optional 'exclude_ids' parameter may be used to exclude the IDs specified in the metadata from the filter.[optional]
- where: Str
- Optional SQLite WHERE clause specifying metadata criteria that must be met to be included in the filtered data. If not provided, all IDs in 'metadata' that are also in the data will be retained.[optional]
- exclude_ids: Bool
- If True, the samples selected by the 'metadata' and optional 'where' parameter will be excluded from the filtered data.[default: False]
- remove_empty: Bool
- If True, reports with only unclassified reads will be removed from the filtered data. Reports containing sequences classified only as root will also be removed.[default: False]
Outputs¶
- filtered_reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Report % Properties('contigs', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'contigs')]|SampleData[Kraken2Report % Properties('reads')]|SampleData[Kraken2Report % Properties('contigs')]|SampleData[Kraken2Report % Properties('mags')]|FeatureData[Kraken2Report % Properties('mags')]
- <no description>[required] 
- filtered_outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Output % Properties('contigs', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'contigs')]|SampleData[Kraken2Output % Properties('reads')]|SampleData[Kraken2Output % Properties('contigs')]|SampleData[Kraken2Output % Properties('mags')]|FeatureData[Kraken2Output % Properties('mags')]
- <no description>[required] 
annotate -merge-kraken2-results¶
Merge multiple kraken2 reports and outputs such that the results contain a union of the samples represented in the inputs. If sample IDs overlap across the inputs, these reports and outputs will be processed into a single report or output per sample ID.
Inputs¶
- reports: List[SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Report % Properties('contigs', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'contigs')]|SampleData[Kraken2Report % Properties('reads')]|SampleData[Kraken2Report % Properties('contigs')]|SampleData[Kraken2Report % Properties('mags')]|FeatureData[Kraken2Report % Properties('mags')]]
- The kraken2 reports to merge. Only reports with the same sample ID are merged into one report.[required] 
- outputs: List[SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Output % Properties('contigs', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'contigs')]|SampleData[Kraken2Output % Properties('reads')]|SampleData[Kraken2Output % Properties('contigs')]|SampleData[Kraken2Output % Properties('mags')]|FeatureData[Kraken2Output % Properties('mags')]]
- The kraken2 outputs to merge. Only outputs with the same sample ID are merged into one output.[required] 
Outputs¶
- merged_reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Report % Properties('contigs', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'contigs')]|SampleData[Kraken2Report % Properties('reads')]|SampleData[Kraken2Report % Properties('contigs')]|SampleData[Kraken2Report % Properties('mags')]|FeatureData[Kraken2Report % Properties('mags')]
- The merged kraken2 reports.[required] 
- merged_outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Output % Properties('contigs', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'contigs')]|SampleData[Kraken2Output % Properties('reads')]|SampleData[Kraken2Output % Properties('contigs')]|SampleData[Kraken2Output % Properties('mags')]|FeatureData[Kraken2Output % Properties('mags')]
- The merged kraken2 outputs.[required] 
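The union behavior can be pictured with per-sample dictionaries of taxon counts. The text above only says that overlapping sample IDs are "processed into a single report"; summing the per-taxon counts, as done below, is an assumption, and the function name, sample IDs, and taxon IDs are hypothetical.

```python
from collections import Counter


def merge_reports(artifacts):
    """Merge several {sample_id: {taxon_id: count}} collections.

    The result covers the union of all sample IDs; for IDs that occur in
    more than one input, per-taxon counts are summed (an assumption).
    """
    merged = {}
    for artifact in artifacts:
        for sample_id, report in artifact.items():
            merged.setdefault(sample_id, Counter()).update(report)
    return {sample_id: dict(counts) for sample_id, counts in merged.items()}


run1 = {"s1": {"562": 10, "1423": 2}}
run2 = {"s1": {"562": 5}, "s2": {"1280": 7}}

print(merge_reports([run1, run2]))
# {'s1': {'562': 15, '1423': 2}, 's2': {'1280': 7}}
```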
annotate -align-outputs-with-reports¶
Inputs¶
- outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Output % Properties('contigs', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'contigs')]|SampleData[Kraken2Output % Properties('reads')]|SampleData[Kraken2Output % Properties('contigs')]|SampleData[Kraken2Output % Properties('mags')]|FeatureData[Kraken2Output % Properties('mags')]
- The kraken2 outputs to align with the filtered reports.[required] 
- reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Report % Properties('contigs', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'contigs')]|SampleData[Kraken2Report % Properties('reads')]|SampleData[Kraken2Report % Properties('contigs')]|SampleData[Kraken2Report % Properties('mags')]|FeatureData[Kraken2Report % Properties('mags')]
- The filtered kraken2 reports.[required] 
Outputs¶
- aligned_outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Output % Properties('contigs', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'contigs')]|SampleData[Kraken2Output % Properties('reads')]|SampleData[Kraken2Output % Properties('contigs')]|SampleData[Kraken2Output % Properties('mags')]|FeatureData[Kraken2Output % Properties('mags')]
- The report-aligned filtered kraken2 outputs.[required] 
annotate -visualize-busco¶
This method generates a visualization from the BUSCO results table.
Citations¶
Inputs¶
- results: BUSCOResults
- BUSCO results table.[required] 
Outputs¶
- visualization: Visualization
- <no description>[required] 
annotate classify-kraken2¶
Use Kraken 2 to classify provided DNA sequence reads, contigs, or MAGs into taxonomic groups.
Citations¶
Inputs¶
- seqs: List[SampleData[SequencesWithQuality | PairedEndSequencesWithQuality | JoinedSequencesWithQuality]]|List[SampleData[Contigs]]|List[FeatureData[MAG]]|List[SampleData[MAGs]]
- Sequences to be classified. Single-/paired-end reads, contigs, or assembled MAGs can be provided.[required]
- db: Kraken2DB
- Kraken 2 database.[required] 
Parameters¶
- threads: Int%Range(1, None)
- Number of threads.[default: 1]
- confidence: Float%Range(0, 1, inclusive_end=True)
- Confidence score threshold.[default: 0.0]
- minimum_base_quality: Int%Range(0, None)
- Minimum base quality used in classification. Only applies when reads are used as input.[default: 0]
- memory_mapping: Bool
- Avoids loading the database into RAM.[default: False]
- minimum_hit_groups: Int%Range(1, None)
- Minimum number of hit groups (overlapping k-mers sharing the same minimizer).[default: 2]
- quick: Bool
- Quick operation (use first hit or hits).[default: False]
- report_minimizer_data: Bool
- Include number of read-minimizers per-taxon and unique read-minimizers per-taxon in the report. If this parameter is enabled then merging kraken2 reports with the same sample ID from two or more input artifacts will not be possible.[default: False]
- num_partitions: Int%Range(1, None)
- The number of partitions to split the contigs into. Defaults to partitioning into individual samples.[optional] 
Outputs¶
- reports: SampleData[Kraken2Report % Properties('reads')]|SampleData[Kraken2Report % Properties('contigs')]|FeatureData[Kraken2Report % Properties('mags')]|SampleData[Kraken2Report % Properties('mags')]
- Reports produced by Kraken2.[required] 
- outputs: SampleData[Kraken2Output % Properties('reads')]|SampleData[Kraken2Output % Properties('contigs')]|FeatureData[Kraken2Output % Properties('mags')]|SampleData[Kraken2Output % Properties('mags')]
- Output files produced by Kraken2.[required] 
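The num_partitions parameter controls how the input is split for parallel processing, with one partition per sample as the default. The sketch below shows one way such a split could look; the round-robin assignment and the function name `partition_samples` are assumptions for illustration, and the framework's actual partitioning strategy may differ.

```python
def partition_samples(sample_ids, num_partitions=None):
    """Split sample IDs into partitions.

    With num_partitions unset (or larger than the number of samples),
    every sample gets its own partition.
    """
    ids = list(sample_ids)
    if num_partitions is None or num_partitions > len(ids):
        num_partitions = len(ids)
    partitions = [[] for _ in range(num_partitions)]
    for i, sample_id in enumerate(ids):
        # Round-robin assignment keeps partition sizes balanced.
        partitions[i % num_partitions].append(sample_id)
    return partitions


samples = ["s1", "s2", "s3", "s4", "s5"]
print(partition_samples(samples))     # [['s1'], ['s2'], ['s3'], ['s4'], ['s5']]
print(partition_samples(samples, 2))  # [['s1', 's3', 's5'], ['s2', 's4']]
```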
annotate search-orthologs-diamond¶
Use Diamond and eggNOG to align contig or MAG sequences against the Diamond database.
Citations¶
Buchfink et al., 2021; Huerta-Cepas et al., 2019
Inputs¶
- seqs: SampleData[Contigs]|SampleData[MAGs]|FeatureData[MAG]
- Sequences to be searched for hits using the Diamond Database[required] 
- db: ReferenceDB[Diamond]
- The filepath to an artifact containing the Diamond database[required] 
Parameters¶
- num_cpus: Int
- Number of CPUs to utilize. '0' will use all available.[default: 1]
- db_in_memory: Bool
- Read database into memory. The database can be very large, so this option should only be used on clusters or other machines with enough memory.[default: False]
- num_partitions: Int%Range(1, None)
- The number of partitions to split the contigs into. Defaults to partitioning into individual samples.[optional] 
Outputs¶
- eggnog_hits: SampleData[Orthologs]
- <no description>[required] 
- table: FeatureTable[Frequency]
- <no description>[required] 
- loci: GenomeData[Loci]
- <no description>[required] 
annotate search-orthologs-hmmer¶
This method uses HMMER to find possible target sequences to annotate with eggNOG-mapper.
Citations¶
HMMER, 2024; Huerta-Cepas et al., 2019
Inputs¶
- seqs: SampleData[Contigs | MAGs]|FeatureData[MAG]
- Sequences to be searched for hits.[required] 
- pressed_hmm_db: ProfileHMM[PressedProtein]
- Collection of profile HMMs in binary format and indexed.[required] 
- idmap: EggnogHmmerIdmap
- List of protein families in 'pressed_hmm_db'.[required]
- seed_alignments: GenomeData[Proteins]
- Seed alignments for the protein families in 'pressed_hmm_db'.[required]
Parameters¶
- num_cpus: Int
- Number of CPUs to utilize per partition. '0' will use all available.[default: 1]
- db_in_memory: Bool
- Read database into memory. The database can be very large, so this option should only be used on clusters or other machines with enough memory.[default: False]
- num_partitions: Int%Range(1, None)
- The number of partitions to split the contigs into. Defaults to partitioning into individual samples.[optional] 
Outputs¶
- eggnog_hits: SampleData[Orthologs]
- <no description>[required] 
- table: FeatureTable[Frequency]
- <no description>[required] 
- loci: GenomeData[Loci]
- <no description>[required] 
annotate map-eggnog¶
Apply eggnog mapper to annotate seed orthologs.
Citations¶
Inputs¶
- eggnog_hits: SampleData[Orthologs]
- BLAST6-like table(s) describing the identified orthologs.[required] 
- db: ReferenceDB[Eggnog]
- eggNOG annotation database.[required] 
Parameters¶
- db_in_memory: Bool
- Read eggnog database into memory. The eggnog database is very large (>44GB), so this option should only be used on clusters or other machines with enough memory.[default: False]
- num_cpus: Int%Range(0, None)
- Number of CPUs to utilize. '0' will use all available.[default: 1]
- num_partitions: Int%Range(1, None)
- The number of partitions to split the contigs into. Defaults to partitioning into individual samples.[optional] 
Outputs¶
- ortholog_annotations: GenomeData[NOG]
- Annotated hits.[required] 
annotate evaluate-busco¶
This method uses BUSCO to assess the quality of assembled MAGs and generates a table summarizing the results.
Citations¶
Inputs¶
- mags: SampleData[MAGs]|FeatureData[MAG]
- MAGs to be analyzed.[required] 
- db: ReferenceDB[BUSCO]
- BUSCO database.[required] 
Parameters¶
- mode: Str%Choices('genome')
- Specify which BUSCO analysis mode to run. Currently only the 'genome' option is supported, for genome assemblies. In the future, modes for transcriptome assemblies and for annotated gene sets (proteins) will be made available.[default: 'genome']
- lineage_dataset: Str
- Specify the name of the BUSCO lineage to be used. To see all possible options, run 'busco --list-datasets'.[optional]
- augustus: Bool
- Use augustus gene predictor for eukaryote runs.[default: False]
- augustus_parameters: Str
- Pass additional arguments to Augustus. All arguments should be contained within a single string with no white space, with each argument separated by a comma. Example: '--PARAM1=VALUE1,--PARAM2=VALUE2'.[optional]
- augustus_species: Str
- Specify a species for Augustus training.[optional]
- auto_lineage: Bool
- Run auto-lineage to find optimum lineage path.[default: False]
- auto_lineage_euk: Bool
- Run auto-placement just on eukaryote tree to find optimum lineage path.[default: False]
- auto_lineage_prok: Bool
- Run auto-lineage just on non-eukaryote trees to find optimum lineage path.[default: False]
- cpu: Int%Range(1, None)
- Specify the number (N=integer) of threads/cores to use.[default: 1]
- contig_break: Int%Range(0, None)
- Number of contiguous Ns to signify a break between contigs. See https://gitlab.com/ezlab/busco/-/issues/691 for a more detailed explanation.[default: 10]
- evalue: Float%Range(0, None, inclusive_start=False)
- E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03.[default: 0.001]
- limit: Int%Range(1, 20)
- How many candidate regions (contig or transcript) to consider per BUSCO.[default: 3]
- long: Bool
- Optimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms.[default: False]
- metaeuk_parameters: Str
- Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single string with no white space, with each argument separated by a comma. Example: '--PARAM1=VALUE1,--PARAM2=VALUE2'.[optional]
- metaeuk_rerun_parameters: Str
- Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single string with no white space, with each argument separated by a comma. Example: '--PARAM1=VALUE1,--PARAM2=VALUE2'.[optional]
- miniprot: Bool
- Use miniprot gene predictor for eukaryote runs.[default: False]
- additional_metrics: Bool
- Adds completeness and contamination values to the BUSCO report. Check here for documentation: https://github.com/metashot/busco?tab=readme-ov-file#documetation[default: True]
- num_partitions: Int%Range(1, None)
- The number of partitions to split the contigs into. Defaults to partitioning into individual samples.[optional] 
Outputs¶
- results: BUSCOResults
- BUSCO result table.[required] 
- visualization: Visualization
- Visualization of the BUSCO results.[required] 
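The augustus_parameters, metaeuk_parameters, and metaeuk_rerun_parameters values all share the same format: one string, no white space, arguments separated by commas. A small helper can build and validate such a string. The helper name `join_tool_arguments` is hypothetical, and PARAM1/PARAM2 are the placeholders from the parameter descriptions above, not real Augustus or Metaeuk options.

```python
def join_tool_arguments(arguments):
    """Build a '--NAME=VALUE,--NAME=VALUE' string from a dict.

    Raises ValueError if any argument contains white space or a comma,
    since either would break the single-string, comma-separated format.
    """
    parts = []
    for name, value in arguments.items():
        part = f"--{name}={value}"
        if "," in part or any(ch.isspace() for ch in part):
            raise ValueError(f"Argument not allowed in this format: {part!r}")
        parts.append(part)
    return ",".join(parts)


print(join_tool_arguments({"PARAM1": "VALUE1", "PARAM2": "VALUE2"}))
# --PARAM1=VALUE1,--PARAM2=VALUE2
```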
annotate classify-kaiju¶
This method uses Kaiju to perform taxonomic classification.
Citations¶
Inputs¶
- seqs: SampleData[SequencesWithQuality | PairedEndSequencesWithQuality | JoinedSequencesWithQuality]|SampleData[Contigs]
- Sequences to be classified.[required] 
- db: KaijuDB
- Kaiju database.[required] 
Parameters¶
- z: Int%Range(1, None)
- Number of threads.[default: 1]
- a: Str%Choices('greedy', 'mem')
- Run mode.[default: 'greedy']
- e: Int%Range(1, None)
- Number of mismatches allowed in Greedy mode.[default: 3]
- m: Int%Range(1, None)
- Minimum match length.[default: 11]
- s: Int%Range(1, None)
- Minimum match score in Greedy mode.[default: 65]
- evalue: Float%Range(0, 1)
- Minimum E-value in Greedy mode.[default: 0.01]
- x: Bool
- Enable SEG low complexity filter.[default: True]
- r: Str%Choices('phylum', 'class', 'order', 'family', 'genus', 'species')
- Taxonomic rank.[default: 'species']
- c: Float%Range(0, 100)
- Minimum required number or fraction of reads for the taxon (except viruses) to be reported.[default: 0.0]
- exp: Bool
- Expand viruses, which are always shown as full taxon path and read counts are not summarized in higher taxonomic levels.[default: False]
- u: Bool
- Do not count unclassified reads for the total reads when calculating percentages for classified reads.[default: False]
- num_partitions: Int%Range(1, None)
- The number of partitions to split the contigs into. Defaults to partitioning into individual samples.[optional] 
Outputs¶
- abundances: FeatureTable[Frequency]
- Sequence abundances.[required] 
- taxonomy: FeatureData[Taxonomy]
- Linked taxonomy.[required] 
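The effect of the u parameter on reported percentages comes down to which denominator is used. The sketch below illustrates that arithmetic only; the function name and the read counts are hypothetical, and it does not reproduce Kaiju's own table-generation code.

```python
def classified_percentages(classified_counts, unclassified, exclude_unclassified=False):
    """Percentage of reads per taxon, with or without unclassified reads
    counted in the total."""
    total = sum(classified_counts.values())
    if not exclude_unclassified:
        total += unclassified
    if total == 0:
        return {taxon: 0.0 for taxon in classified_counts}
    return {taxon: 100.0 * n / total for taxon, n in classified_counts.items()}


# Hypothetical counts: 80 classified reads and 20 unclassified reads.
counts = {"taxon_A": 60, "taxon_B": 20}

print(classified_percentages(counts, unclassified=20))
# {'taxon_A': 60.0, 'taxon_B': 20.0}
print(classified_percentages(counts, unclassified=20, exclude_unclassified=True))
# {'taxon_A': 75.0, 'taxon_B': 25.0}
```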
annotate construct-pangenome-index¶
This method generates a Bowtie2 index for the combined human GRCh38 reference genome and the draft human pangenome.
Parameters¶
- threads: Threads
- Number of threads to use when building the index.[default: 1]
Outputs¶
- index: Bowtie2Index
- Generated combined human reference index.[required] 
annotate filter-reads-pangenome¶
Generates a Bowtie2 index for the combined human GRCh38 reference genome and the draft human pangenome, and uses that index to remove the contaminating human reads from the reads provided as input.
Inputs¶
- reads: SampleData[SequencesWithQuality]|SampleData[PairedEndSequencesWithQuality]
- Reads to be filtered against the human genome.[required] 
- index: Bowtie2Index
- Bowtie2 index of the reference human genome. If not provided, an index combined from the reference GRCh38 human genome and the human pangenome will be generated.[optional] 
Parameters¶
- threads: Threads
- Number of threads to use for indexing and read filtering.[default: 1]
- mode: Str%Choices('local', 'global')
- Bowtie2 alignment settings. See bowtie2 manual for more details.[default: 'local']
- sensitivity: Str%Choices('very-fast', 'fast', 'sensitive', 'very-sensitive')
- Bowtie2 alignment sensitivity. See bowtie2 manual for details.[default: 'sensitive']
- ref_gap_open_penalty: Int%Range(1, None)
- Reference gap open penalty.[default: 5]
- ref_gap_ext_penalty: Int%Range(1, None)
- Reference gap extend penalty.[default: 3]
Outputs¶
- filtered_reads: SampleData[SequencesWithQuality]|SampleData[PairedEndSequencesWithQuality]
- Original reads without the contaminating human reads.[required] 
- reference_index: Bowtie2Index
- Generated combined human reference index. If an index was provided as an input, it will be returned here instead.[required] 
annotate multiply-tables¶
Calculates the dot product of two feature tables with matching dimensions. If table 1 has shape (M x N) and table 2 has shape (N x P), the resulting table will have shape (M x P). Note that the tables must be identical in the N dimension.
Inputs¶
- table1: FeatureTable[Frequency]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[Frequency]|FeatureTable[RelativeFrequency]|FeatureTable[Frequency]|FeatureTable[RelativeFrequency]|FeatureTable[RelativeFrequency]
- First feature table.[required] 
- table2: FeatureTable[Frequency]|FeatureTable[Frequency]|FeatureTable[RelativeFrequency]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[RelativeFrequency]|FeatureTable[Frequency]|FeatureTable[RelativeFrequency]
- Second feature table with matching dimension.[required] 
Outputs¶
- result_table: FeatureTable[Frequency]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[PresenceAbsence]|FeatureTable[RelativeFrequency]|FeatureTable[RelativeFrequency]|FeatureTable[RelativeFrequency]
- Feature table with the dot product of the two original tables. The table will have the shape of (M x P) where M is the number of rows from table1 and P is the number of columns from table2.[required]
annotate filter-kraken2-results¶
Filter kraken2 reports and outputs by sample metadata, and/or filter classified taxa by relative abundance.
Inputs¶
- reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Report % Properties('contigs', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'contigs')]|SampleData[Kraken2Report % Properties('reads')]|SampleData[Kraken2Report % Properties('contigs')]|SampleData[Kraken2Report % Properties('mags')]|FeatureData[Kraken2Report % Properties('mags')]
- The kraken2 reports to filter.[required] 
- outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Output % Properties('contigs', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'contigs')]|SampleData[Kraken2Output % Properties('reads')]|SampleData[Kraken2Output % Properties('contigs')]|SampleData[Kraken2Output % Properties('mags')]|FeatureData[Kraken2Output % Properties('mags')]
- The kraken2 outputs to filter.[required] 
Parameters¶
- metadata: Metadata
- Metadata indicating which IDs to filter. The optional 'where' parameter may be used to filter IDs based on specified conditions in the metadata. The optional 'exclude_ids' parameter may be used to exclude the IDs specified in the metadata from the filter.[optional]
- where: Str
- Optional SQLite WHERE clause specifying metadata criteria that must be met to be included in the filtered data. If not provided, all IDs in 'metadata' that are also in the data will be retained.[optional]
- exclude_ids: Bool
- If True, the samples selected by the 'metadata' and optional 'where' parameter will be excluded from the filtered data.[default: False]
- remove_empty: Bool
- If True, reports with only unclassified reads will be removed from the filtered data. Reports containing sequences classified only as root will also be removed.[default: False]
- abundance_threshold: Float%Range(0, 1, inclusive_end=True)
- A proportion between 0 and 1 representing the minimum relative abundance (by classified read count) that a taxon must have to be retained in the filtered report. If a taxon is filtered from the report, its associated read counts are removed entirely from the report (i.e., the subtraction of those counts is propagated to parent taxonomic groupings).[optional] 
Outputs¶
- filtered_reports: SampleData[Kraken2Report % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Report % Properties('contigs', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'mags')]|SampleData[Kraken2Report % Properties('reads', 'contigs')]|SampleData[Kraken2Report % Properties('reads')]|SampleData[Kraken2Report % Properties('contigs')]|SampleData[Kraken2Report % Properties('mags')]|FeatureData[Kraken2Report % Properties('mags')]
- The filtered kraken2 reports.[required] 
- filtered_outputs: SampleData[Kraken2Output % Properties('reads', 'contigs', 'mags')]|SampleData[Kraken2Output % Properties('contigs', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'mags')]|SampleData[Kraken2Output % Properties('reads', 'contigs')]|SampleData[Kraken2Output % Properties('reads')]|SampleData[Kraken2Output % Properties('contigs')]|SampleData[Kraken2Output % Properties('mags')]|FeatureData[Kraken2Output % Properties('mags')]
- The filtered kraken2 outputs.[required] 
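The abundance_threshold description notes that removing a taxon also subtracts its reads from its parent groupings. A minimal sketch of that propagation on a toy taxonomy follows. It rests on several assumptions: reads are stored as counts assigned directly to each taxon, a taxon is judged by its clade total relative to all classified reads, the threshold is inclusive, and filtering is done in a single pass. The function names and the toy taxonomy are hypothetical.

```python
def clade_counts(direct_counts, parent):
    """Add each taxon's directly assigned reads to all of its ancestors."""
    totals = dict(direct_counts)
    for taxon, n in direct_counts.items():
        node = parent.get(taxon)
        while node is not None:
            totals[node] = totals.get(node, 0) + n
            node = parent.get(node)
    return totals


def filter_report(direct_counts, parent, abundance_threshold):
    """Drop taxa whose clade share is below the threshold, then recompute
    clade totals so the removed reads no longer count toward parents."""
    total = sum(direct_counts.values())
    if total == 0:
        return {}
    clades = clade_counts(direct_counts, parent)
    kept = {
        taxon: n
        for taxon, n in direct_counts.items()
        if clades[taxon] / total >= abundance_threshold
    }
    return clade_counts(kept, parent)


# Toy taxonomy: two genera under one family (the family has no parent here).
parent = {
    "Escherichia": "Enterobacteriaceae",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": None,
}
direct = {"Enterobacteriaceae": 10, "Escherichia": 985, "Salmonella": 5}

print(filter_report(direct, parent, 0.01))
# {'Enterobacteriaceae': 995, 'Escherichia': 985}
```

Salmonella holds 0.5% of the classified reads and is removed; the family total drops from 1000 to 995 accordingly.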
- Kang, D. D., Li, F., Kirton, E., Thomas, A., Egan, R., An, H., & Wang, Z. (2019). MetaBAT 2: An Adaptive Binning Algorithm for Robust and Efficient Genome Reconstruction from Metagenome Assemblies. PeerJ, 2019(7). 10.7717/peerj.7359
- Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & Subgroup, 1000 Genome Project Data Processing. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. 10.1093/bioinformatics/btp352
- biocore/scikit-bio: scikit-bio 0.5.9: Maintenance release (0.5.9). (2023). Zenodo. 10.5281/zenodo.8209901
- Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 257. 10.1186/s13059-019-1891-0
- Lu, J., Breitwieser, F. P., Thielen, P., & Salzberg, S. L. (2017). Bracken: Estimating Species Abundance in Metagenomics Data. PeerJ Computer Science, 3, e104. 10.7717/peerj-cs.104
- Buchfink, B., Reuter, K., & Drost, H.-G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods, 18(4), 366–368. 10.1038/s41592-021-01101-x
- Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S. K., Cook, H., Mende, D. R., Letunic, I., Rattei, T., Jensen, L. J., von Mering, C., & Bork, P. (2019). eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research, 47(D1), D309–D314. 10.1093/nar/gky1085
- Buchfink, B., Reuter, K., & Drost, H.-G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods, 18(4), 366–368. 10.1038/s41592-021-01101-x
- Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S. K., Cook, H., Mende, D. R., Letunic, I., Rattei, T., Jensen, L. J., von Mering, C., & Bork, P. (2019). eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research, 47(D1), D309–D314. 10.1093/nar/gky1085
- Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S. K., Cook, H., Mende, D. R., Letunic, I., Rattei, T., Jensen, L. J., von Mering, C., & Bork, P. (2019). eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research, 47(D1), D309–D314. 10.1093/nar/gky1085
- National Center for Biotechnology Information (NCBI). (n.d.). https://www.ncbi.nlm.nih.gov/