Creating Pipelines#

pipemake pipelines operate using a combination of two file types: Snakemake files and pipeline configuration files.

Snakemake files (Modules)#

Snakemake files are used to define pipemake Modules. In general, Modules follow the same structure and nomenclature as typical Snakemake files. However, pipemake Modules are focused on being reusable. This is achieved by following a few key principles:

  • Limiting a Module to a collection of rules used to perform a particular task (align reads, call variants, annotate a genome, etc.)

  • Consistent input and output usage

  • Reading configurable terms (e.g. samples, wildcards, parameters) from the config

  • Using singularity containers to ensure a consistent software environment

By following these principles, pipemake Modules may be easily reused across multiple pipelines. The following two examples show a Module that aligns RNAseq reads to a reference genome using BWA and outputs the results in BAM format.

In this first example, the Module only requires samples to be defined in the configuration file.

rule all:
    input:
        expand("reSEQ/BAM/Aligned/{sample}.bam", sample=config['samples'])

rule index_reference:
    input:
        "Assembly/assembly.fa"
    output:
        "Assembly/assembly.fa.bwt"
    singularity:
        "docker://quay.io/biocontainers/bwa:0.7.8"
    shell:
        "bwa index {input}"

rule align_reads:
    input:
        reads="reSEQ/FASTQ/{sample}_R1.fastq.gz",
        ref="Assembly/assembly.fa",
        index="Assembly/assembly.fa.bwt"
    output:
        "reSEQ/BAM/Aligned/{sample}.bam"
    singularity:
        "docker://quay.io/biocontainers/bwa:0.7.8"
    shell:
        "bwa mem -t 8 {input.ref} {input.reads} | samtools view -bS - > {output}"

In this next example, the Module uses additional configurable terms to define the species and assembly_version. While additional configurable terms do require additional command-line arguments, they allow for greater flexibility and easier reporting.

rule all:
    input:
        expand("reSEQ/BAM/Aligned/{sample}.bam", sample=config['samples'])

rule index_reference:
    input:
        f"Assembly/{config['species']}_{config['assembly_version']}.fa"
    output:
        f"Assembly/{config['species']}_{config['assembly_version']}.fa.bwt"
    singularity:
        "docker://quay.io/biocontainers/bwa:0.7.8"
    shell:
        "bwa index {input}"

rule align_reads:
    input:
        reads="reSEQ/FASTQ/{sample}_R1.fastq.gz",
        ref=f"Assembly/{config['species']}_{config['assembly_version']}.fa",
        index=f"Assembly/{config['species']}_{config['assembly_version']}.fa.bwt"
    output:
        "reSEQ/BAM/Aligned/{sample}.bam"
    singularity:
        "docker://quay.io/biocontainers/bwa:0.7.8"
    shell:
        "bwa mem -t 8 {input.ref} {input.reads} | samtools view -bS - > {output}"

By consistently using configurable terms (or standardized filenames), it is possible to easily connect multiple Modules together to form pipelines. For example, the output of the align_reads rule may be used as the input for another Module to sort the BAM files:

rule all:
    input:
        expand("reSEQ/BAM/Sorted/{sample}.bam", sample=config['samples'])

rule sort_reads:
    input:
        "reSEQ/BAM/Aligned/{sample}.bam",
    output:
        "reSEQ/BAM/Sorted/{sample}.bam",
    singularity:
        "docker://quay.io/biocontainers/samtools:1.9"
    shell:
        "samtools sort {input} -o {output}"

Note

pipemake is designed to detect configurable terms and will ensure the terms are properly assigned in the configuration file. Configurable terms may also be grouped together if desired, e.g. config['assembly']['species'] and config['assembly']['assembly_version'].
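
To illustrate the grouping mentioned in this note, the following is a minimal Python sketch comparing a flat config with a grouped one. The values Sp1 and v1 are made up for illustration; the point is only that both layouts can yield the same standardized path.

```python
# Two hypothetical config layouts: flat terms vs. terms grouped under 'assembly'.
flat_config = {"species": "Sp1", "assembly_version": "v1"}
grouped_config = {"assembly": {"species": "Sp1", "assembly_version": "v1"}}

# Build the reference path from each layout.
flat_path = f"Assembly/{flat_config['species']}_{flat_config['assembly_version']}.fa"
grouped_path = (
    f"Assembly/{grouped_config['assembly']['species']}"
    f"_{grouped_config['assembly']['assembly_version']}.fa"
)

print(flat_path)
print(grouped_path)
```

Grouping related terms keeps the config tidy at the cost of slightly longer lookups inside a Module.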

Pipeline configuration files#

pipemake uses YAML-formatted files to define Pipelines. These files are used to define the following aspects of a pipeline:

  • The Pipeline name and version

  • Command-line arguments (description, input files, configurable terms, pipeline parameters, etc.)

  • Steps needed to standardize the input files for the Pipeline

  • And lastly, the Modules and Links required for the Pipeline

The following is an example of a Pipeline configuration file:

pipeline: rnaseq-counts-star
version: 1.0
parser:
  help: Count RNAseq reads within a genome assembly using STAR and featureCounts
  arg-groups:
    basic:
      mutually-exclusive-groups:
        input-parser:
          required: True
      wildcards-args:
        rnaseq-standardized-wildcard: "{sample}_{read}.fq.gz"
      args:
        rnaseq-wildcard:
          help: "Wildcard statement to represent RNAseq FASTQs"
          type: str
          mutually-exclusive: 'input-parser'
          wildcards: rnaseq-standardized-wildcard
        rnaseq-table:
          help: "Table with sample and FASTQs filenames"
          type: str
          action: confirmFile
          mutually-exclusive: 'input-parser'
          wildcards: rnaseq-standardized-wildcard
        rnaseq-copy-method:
          help: "Specifies if RNAseq FASTQs should be copied or symbolically linked."
          choices:
            - 'symbolic_link'
            - 'copy'
          default: 'symbolic_link'
        assembly-fasta:
          help: "Assembly fasta"
          type: str
          required: True
          action: confirmFile
        assembly-gtf:
          help: "Assembly GTF"
          type: str
          required: True
          action: confirmFile
        read-len:
          help: "Read Length"
          type: int
          required: True
        assembly-version:
          help: "Assembly Version"
          type: str
          default:
            str: "v"
            suffix:
              - function: jobRandomString
        species:
          help: "Species name"
          type: str
          default:
            str: "Sp"
            suffix:
              - function: jobRandomString
setup:
  rnaseq_input:
    methods:
      wildcard-str: "{rnaseq-wildcard}"
      table-file: "{rnaseq-table}"
    args:
      standardized_filename: "RNAseq/FASTQ/{rnaseq-standardized-wildcard}"
      copy_method: '{rnaseq-copy-method}'
      gzipped: True
      sample_keywords:
        'samples'

  assembly_input:
    methods:
      file-str: "{assembly-fasta}"
    args:
      standardized_filename: "Assembly/{species}_{assembly_version}.fa"
      gzipped: False

  gtf_input:
    methods:
      file-str: "{assembly-gtf}"
    args:
      standardized_filename: "Assembly/{species}_{assembly_version}.gtf"
      gzipped: False

snakemake:
  modules:
    - fastq_trim_fastp
    - rna_seq_2pass_star
    - rna_seq_sort
    - rna_seq_feature_counts
  links:
    - input: fastp_single_end
      output: star_single_end_p1
    - input: fastp_pair_end
      output: star_pair_end_p1
      file_mappings:
      - input: r1_reads
        output: r1_reads
      - input: r2_reads
        output: r2_reads

Pipeline configuration guide#

A pipeline configuration file begins with the pipeline keyword, which is used to define the name of the pipeline. As this name is used to identify a pipeline within pipemake, it must be unique. Next is the version keyword, which is used to define the version of the pipeline and is included to track changes to the pipeline over time.

The configuration file then consists of the following required sections: parser, setup, and snakemake.

pipeline: rnaseq-counts-star
version: 1.0
parser:
  ...
setup:
  ...
snakemake:
  ...
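
As a rough illustration of this layout, the sketch below checks a loaded configuration (represented here as a plain Python dict) for the keywords described above. The helper missing_keys is hypothetical and is not part of pipemake, which may validate its files differently.

```python
# Top-level keywords described in this guide.
REQUIRED_KEYS = ("pipeline", "version", "parser", "setup", "snakemake")


def missing_keys(pipeline_config):
    """Return the required top-level keys absent from a loaded config."""
    return [key for key in REQUIRED_KEYS if key not in pipeline_config]


# A config as it might look after loading the YAML file into a dict;
# the snakemake section has been left out on purpose.
example = {"pipeline": "rnaseq-counts-star", "version": 1.0, "parser": {}, "setup": {}}
print(missing_keys(example))
```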

parser:#

The parser section is used to create the command-line interface for a pipeline. It is divided into the following sub-sections: help and arg-groups.

help:#

The help sub-section is used to define the description of the pipeline, which is displayed when pipemake is run with the --help flag.

parser:
  help: Count RNAseq reads within a genome assembly using STAR and featureCounts

arg-groups:#

The arg-groups sub-section is used by pipemake to define command-line argument groups. The basic group is reserved by pipemake: arguments within this group are automatically sorted into required or optional based on their required keyword. Users may place all arguments within the basic group, or define additional arg-groups to organize related arguments, such as parameters for a particular software package.

parser:
  arg-groups:
    basic:
      mutually-exclusive-groups:
        input-parser:
          required: True
      wildcards-args:
        rnaseq-standardized-wildcard: "{sample}_{read}.fq.gz"
      args:
        rnaseq-wildcard:
          help: "Wildcard statement to represent RNAseq FASTQs"
          type: str
          mutually-exclusive: 'input-parser'
          wildcards: rnaseq-standardized-wildcard
        rnaseq-table:
          help: "Table with sample and FASTQs filenames"
          type: str
          action: confirmFile
          mutually-exclusive: 'input-parser'
          wildcards: rnaseq-standardized-wildcard
        rnaseq-copy-method:
          help: "Specifies if RNAseq FASTQs should be copied or symbolically linked."
          choices:
            - 'symbolic_link'
            - 'copy'
          default: 'symbolic_link'
        assembly-fasta:
          help: "Assembly fasta"
          type: str
          required: True
          action: confirmFile
        assembly-gtf:
          help: "Assembly GTF"
          type: str
          required: True
          action: confirmFile
        read-len:
          help: "Read Length"
          type: int
          required: True
        assembly-version:
          help: "Assembly Version"
          type: str
          default:
            str: "v"
            suffix:
              - function: jobRandomString
        species:
          help: "Species name"
          type: str
          default:
            str: "Sp"
            suffix:
              - function: jobRandomString

mutually-exclusive-groups:#

Each arg-group may use the mutually-exclusive-groups keyword to define groups of mutually exclusive arguments, ensuring that only one argument within a group may be used at a time. This is useful when a pipeline accepts different types of input, such as a wildcard statement or a table of input files. To create a mutually-exclusive-group, a user is only required to name the group.

parser:
  arg-groups:
    basic:
      mutually-exclusive-groups:
        input-parser:
          required: True

In this example, pipemake will create a single mutually-exclusive-group called input-parser. Currently, mutually-exclusive-groups supports the following keywords:

  • required: Defines if the mutually-exclusive-group is required (default is False)

Note

Please note that if a mutually-exclusive-group is placed within the basic group, its required keyword determines whether its arguments are listed as required or optional.

Attention

At present, pipemake requires the names of mutually-exclusive-groups to be unique across all arg-groups.
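
Since arguments are parsed using argparse, the behaviour of a required mutually-exclusive-group can be sketched with argparse directly. This is an illustration of the concept under that assumption, not pipemake's actual parser code; the program name and example value are placeholders.

```python
import argparse

parser = argparse.ArgumentParser(prog="rnaseq-counts-star")

# required=True mirrors 'required: True' on the input-parser group:
# exactly one of the two arguments below must be given.
input_parser = parser.add_mutually_exclusive_group(required=True)
input_parser.add_argument("--rnaseq-wildcard", type=str)
input_parser.add_argument("--rnaseq-table", type=str)

args = parser.parse_args(["--rnaseq-wildcard", "reads/{sample}_{read}.fq.gz"])
print(args.rnaseq_wildcard)
```

With this setup, argparse exits with an error if both arguments are supplied or if neither is.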

wildcards-args:#

The wildcards-args keyword is used to define a wildcard statement that may then be used by multiple arguments within the arg-groups. This is useful when a pipeline supports multiple input methods that should be standardized to a common naming convention.

parser:
  arg-groups:
    basic:
      wildcards-args:
        rnaseq-standardized-wildcard: "{sample}_{read}.fq.gz"

In this example, we defined a wildcard statement called rnaseq-standardized-wildcard. This wildcard statement is used to standardize the naming of RNAseq FASTQ files. The wildcard statement is defined as "{sample}_{read}.fq.gz", where {sample} represents the sample name and {read} represents the read type (e.g. R1, R2).
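
To show how such a wildcard statement relates to real filenames, here is a minimal Python sketch that converts the statement into a regular expression and extracts the wildcard values. The helper wildcard_to_regex and the filename liver1_R1.fq.gz are hypothetical, and the sketch assumes wildcard values contain no underscores or slashes; pipemake's own matching may differ.

```python
import re


def wildcard_to_regex(statement):
    """Turn '{name}' placeholders into named regex groups; escape the rest."""
    parts = re.split(r"\{(\w+)\}", statement)
    pattern = ""
    for index, part in enumerate(parts):
        # re.split with one capture group alternates literal text and names.
        if index % 2 == 0:
            pattern += re.escape(part)
        else:
            # Simplifying assumption: values contain no '_' or '/'.
            pattern += f"(?P<{part}>[^_/]+)"
    return re.compile(f"^{pattern}$")


regex = wildcard_to_regex("{sample}_{read}.fq.gz")
match = regex.match("liver1_R1.fq.gz")
print(match.groupdict())
```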

args:#

Each arg-group requires an args section that defines its command-line arguments. Each argument must have the following keywords:

  • help: A description of the argument

  • type: The type of the argument

And the following optional keywords are also supported:

  • required: If the argument is required (default is False)

  • choices: A list of choices for the argument

  • mutually-exclusive: Adds the argument to the specified mutually-exclusive-group

  • wildcards: Requires the argument to use the wildcards given in the specified wildcards-args

  • action: An action to perform on the argument (see below for supported actions)

  • default: The default value of the argument (see below for additional options)

Note

Arguments are parsed using argparse, so support for additional argparse options may be added in the future.
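
Because arguments are parsed using argparse, the keywords above map closely onto argparse options. The sketch below builds a parser from a trimmed copy of the args section written as a Python dict. The TYPES lookup table and the loop are assumptions made for illustration and do not reflect pipemake's internals.

```python
import argparse

# A trimmed copy of the args section, written as a Python dict.
args_spec = {
    "read-len": {"help": "Read Length", "type": "int", "required": True},
    "rnaseq-copy-method": {
        "help": "Specifies if RNAseq FASTQs should be copied or symbolically linked.",
        "choices": ["symbolic_link", "copy"],
        "default": "symbolic_link",
    },
}

# Map type names from the configuration file to Python callables.
TYPES = {"str": str, "int": int}

parser = argparse.ArgumentParser()
for name, spec in args_spec.items():
    options = dict(spec)
    if "type" in options:
        options["type"] = TYPES[options["type"]]
    parser.add_argument(f"--{name}", **options)

args = parser.parse_args(["--read-len", "100"])
print(args.read_len, args.rnaseq_copy_method)
```

Note that argparse converts hyphens in argument names to underscores, so --read-len is read back as args.read_len.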

action:#

Beyond built-in actions, pipemake also supports the following actions:

  • confirmFile: Require the given string to be a file. If the file does not exist, an error will be raised.

  • confirmDir: Require the given string to be a directory. If the directory does not exist, an error will be raised.

Note

Additional actions may be added in the future, and pipemake may be updated to allow for custom actions.
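
For readers unfamiliar with argparse actions, the following sketch shows how a file-checking action along the lines of confirmFile could be written. The class ConfirmFile is a hypothetical stand-in, not pipemake's implementation, and the temporary file only exists to make the example runnable.

```python
import argparse
import os
import tempfile


class ConfirmFile(argparse.Action):
    """Hypothetical stand-in for confirmFile: reject paths that are not files."""

    def __call__(self, parser, namespace, values, option_string=None):
        if not os.path.isfile(values):
            # parser.error prints a usage message and exits.
            parser.error(f"{option_string}: no such file: {values}")
        setattr(namespace, self.dest, values)


parser = argparse.ArgumentParser()
parser.add_argument("--assembly-fasta", type=str, action=ConfirmFile)

# Use a temporary file so that the check succeeds.
with tempfile.NamedTemporaryFile(suffix=".fa") as handle:
    args = parser.parse_args(["--assembly-fasta", handle.name])
    print(os.path.isfile(args.assembly_fasta))
```

A directory check in the style of confirmDir would follow the same pattern with os.path.isdir.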

default:#

The default keyword may be used to define the default value of an argument. In general, the default value should have the type given by the type keyword. However, it’s also possible to define more complex default values.

parser:
  arg-groups:
    basic:
      args:
        assembly-version:
          help: "Assembly Version"
          type: str
          default:
            str: "v"
            suffix:
              - function: jobRandomString

In the above example, the assembly-version argument has a default value of v followed by a random string. This is achieved by using the suffix keyword. The suffix keyword allows for a list of values to be concatenated to the default value. These values may be either strings or one of the following functions: jobRandomString or jobTimeStamp.
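
The way a suffix list is resolved can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the helper names, the length of the random string, and the timestamp format are not taken from pipemake.

```python
import random
import string
from datetime import datetime


def job_random_string(length=8):
    """Hypothetical jobRandomString: random letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def job_time_stamp():
    """Hypothetical jobTimeStamp: the current time as a compact string."""
    return datetime.now().strftime("%Y%m%d%H%M%S")


SUFFIX_FUNCTIONS = {"jobRandomString": job_random_string, "jobTimeStamp": job_time_stamp}


def resolve_default(default):
    """Return plain defaults unchanged; build 'str' plus suffixes for dict defaults."""
    if not isinstance(default, dict):
        return default
    value = default["str"]
    for item in default.get("suffix", []):
        if isinstance(item, dict):
            value += SUFFIX_FUNCTIONS[item["function"]]()
        else:
            value += str(item)
    return value


result = resolve_default({"str": "v", "suffix": [{"function": "jobRandomString"}]})
print(result)
```

Generating part of the default at run time means two jobs started without an explicit assembly-version are unlikely to collide on the same standardized filename.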

setup:#

The setup section is used to define arguments that require input standardization. Each standardization argument includes the following keywords: methods, args, and optionally snakefiles.

setup:
  rnaseq_input:
    methods:
      wildcard-str: "{rnaseq-wildcard}"
      table-file: "{rnaseq-table}"
    args:
      standardized_filename: "RNAseq/FASTQ/{rnaseq-standardized-wildcard}"
      copy_method: '{rnaseq-copy-method}'
      gzipped: True
      sample_keywords:
        'samples'

In the above example, the setup section includes a single argument called rnaseq_input, which includes two methods to standardize input files: wildcard-str and table-file.

Standardization arguments are defined by the following keywords:

  • methods: The supported methods and their associated command-line argument

    • wildcard-str: Standardize input file(s) using the specified wildcard statement - "{rnaseq-wildcard}"

    • table-file: Standardize input files within the specified table file - "{rnaseq-table}"

    • file-str: Standardize the specified file - "{assembly-fasta}"

    • dir-str: Standardize the specified directory - e.g. "{index-dir}"

  • args: Contains the command-line arguments needed for the methods to standardize the input file(s)

    • standardized_filename or standardized_directory: The standardized filename(s) or directory. This may be a string with or without wildcard statements, and should result in the filename(s) expected by a Snakemake module

    • copy_method: The method used to copy (copy) or symbolically link (symbolic_link) the input file(s)

    • gzipped: Whether the standardized file(s) should be gzipped (True) or not (False), or should keep the gzipped status of the input file(s) (None)

    • sample_keywords: A list of keywords that should be treated as samples (optional)

  • snakefiles: An optional list of Snakemake modules to include if the current standardization argument is used
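
Conceptually, a file-str standardization places a user-supplied file at its standardized filename, either by copying or by symbolic linking. The sketch below illustrates that idea under simplifying assumptions: the helper standardize_file, the source filename, and the values Sp1 and v1 are hypothetical, and gzip handling is ignored.

```python
import os
import shutil
import tempfile


def standardize_file(source, standardized_filename, copy_method="symbolic_link"):
    """Place 'source' at 'standardized_filename' by copying or symlinking."""
    os.makedirs(os.path.dirname(standardized_filename), exist_ok=True)
    if copy_method == "copy":
        shutil.copy(source, standardized_filename)
    elif copy_method == "symbolic_link":
        os.symlink(os.path.abspath(source), standardized_filename)
    else:
        raise ValueError(f"Unknown copy_method: {copy_method}")
    return standardized_filename


# Demonstration inside a temporary working directory.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "my_genome.fasta")
with open(source, "w") as handle:
    handle.write(">chr1\nACGT\n")

# The target follows the pattern Assembly/{species}_{assembly_version}.fa.
target = os.path.join(workdir, "Assembly", "Sp1_v1.fa")
standardize_file(source, target, copy_method="copy")
print(os.path.exists(target))
```

Symbolic links avoid duplicating large input files, while copies keep the job directory self-contained.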

snakemake:#

The snakemake section is used to define the Snakemake modules used within the pipeline. This section includes the following sub-sections: modules and links.

snakemake:
  modules:
    - fastq_trim_fastp
    - rna_seq_2pass_star
    - rna_seq_sort
    - rna_seq_feature_counts
  links:
    - input: fastp_single_end
      output: star_single_end_p1
    - input: fastp_pair_end
      output: star_pair_end_p1
      file_mappings:
      - input: r1_reads
        output: r1_reads
      - input: r2_reads
        output: r2_reads

The modules sub-section is used to define the Snakemake modules used within the pipeline. Each module is defined by the name of the Snakemake file (e.g. rna_seq_2pass_star.smk); inclusion of the smk extension is optional.

The links sub-section is used to define the links between Snakemake rules. You can think of links as a way to connect two rules that normally would not be connected. This is useful when rules have inconsistent filenames. Let’s examine a single link:

snakemake:
  links:
    - input: fastp_pair_end
      output: star_pair_end_p1
      file_mappings:
      - input: r1_reads
        output: r1_reads
      - input: r2_reads
        output: r2_reads

In this link, the input keyword indicates the input rule for the link (fastp_pair_end), whereas the output keyword indicates the output rule for the link (star_pair_end_p1). The link then connects the output of the fastp_pair_end rule to the input of the star_pair_end_p1 rule. The file_mappings keyword is used to define the mapping of input and output files between the two rules. This is useful when the keywords used by the rules differ.
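
One way to picture file_mappings is as a renaming step between the named outputs of one rule and the named inputs of another. The sketch below assumes that each mapping's input refers to a keyword of the link's input rule and its output to a keyword of the link's output rule. The helper apply_link and the file paths are hypothetical, and pipemake's actual linking logic may work differently.

```python
def apply_link(upstream_outputs, file_mappings):
    """Map named outputs of the link's input rule to input names of its output rule."""
    return {
        mapping["output"]: upstream_outputs[mapping["input"]]
        for mapping in file_mappings
    }


# Hypothetical named outputs of the fastp_pair_end rule.
fastp_outputs = {
    "r1_reads": "RNAseq/FASTQ/Trimmed/{sample}_R1.fq.gz",
    "r2_reads": "RNAseq/FASTQ/Trimmed/{sample}_R2.fq.gz",
}

# The file_mappings entries from the link above, as Python dicts.
file_mappings = [
    {"input": "r1_reads", "output": "r1_reads"},
    {"input": "r2_reads", "output": "r2_reads"},
]

star_inputs = apply_link(fastp_outputs, file_mappings)
print(star_inputs)
```

When the two rules already share keywords, as here, the mapping is an identity; it matters most when the keywords differ.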