High-throughput computing and workflows

All material (C) 2022-2023 by CSC - IT Center for Science Ltd. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, http://creativecommons.org/licenses/by-sa/4.0/

High-throughput Computing in an HPC Environment

  • Our scientific computational problems need:
    • Faster computation
    • Scalability
    • Parallel processing
    • Improved accuracy
  • High-throughput computing (HTC) is about running many independent tasks that together require a large amount of computing power (i.e., spread across many different computers)

Good Practices for High-throughput Computing

  • Are you using a conda environment for your application?
    • A conda environment can easily contain several thousand files
    • Every time your application starts, all of these files are accessed, which adds overhead on the Lustre file system
    • Loading such an environment in high-throughput workflows therefore takes a lot of time
  • Solution:
    • Containerise your application; from the file system's point of view, the whole environment is then a single file
    • e.g., use the Tykky container wrapper (see the sketch below)
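
A minimal sketch of containerising an existing conda environment with Tykky (the environment file env.yml and the installation path are placeholders; check the CSC documentation for the exact options):

    # On Puhti/Mahti: build a containerised environment from a conda YAML file
    module load tykky
    mkdir -p /projappl/<project>/conda-env            # placeholder install path
    conda-containerize new --prefix /projappl/<project>/conda-env env.yml

    # Add the generated wrapper scripts to PATH and call the tools as usual
    export PATH="/projappl/<project>/conda-env/bin:$PATH"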

Good Practices for High-throughput Computing

  • Does your application perform a lot of I/O operations?
    • Pay special attention to where these operations are performed
    • Note that the Lustre file system is designed for efficient parallel I/O of large files
    • Intensive I/O operations risk degrading the file system performance for all users
  • Solution:
    • Use the fast local NVMe disk on Puhti and Mahti GPU nodes (see the sketch below)
    • Use the ramdisk (/dev/shm) on Mahti CPU nodes
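
A minimal sketch of requesting local NVMe disk in a Slurm batch job (partition, project, disk size and application names are placeholders; on CSC systems the local disk path is exposed via $LOCAL_SCRATCH):

    #!/bin/bash
    #SBATCH --partition=small          # placeholder partition
    #SBATCH --account=<project>        # placeholder project
    #SBATCH --time=01:00:00
    #SBATCH --gres=nvme:100            # request 100 GB of local NVMe disk

    # Do the I/O-heavy work on the fast local disk instead of Lustre
    cp input.tar "$LOCAL_SCRATCH"
    cd "$LOCAL_SCRATCH"
    tar xf input.tar
    ./analysis input/                  # placeholder application

    # Copy only the final results back to Lustre
    cp -r results /scratch/<project>/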

Good Practices for High-throughput Computing

  • If you have only a moderate number of jobs, an easy solution is to use array jobs
  • Does your workflow involve running a large number of (short) batch jobs?
    • This poses problems for batch job schedulers such as Slurm used in HPC systems
    • Short jobs also suffer from scheduling overhead
  • Solution:
    • Execute the tasks with as few invocations of sbatch and srun as possible
    • Use built-in options for farming-type workloads
    • Use external tools such as HyperQueue or GNU Parallel (see the sketch below)
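
A minimal sketch of packing many short tasks into a single Slurm allocation with GNU Parallel (partition, project, module and script names are placeholders and may differ per system):

    #!/bin/bash
    #SBATCH --partition=small          # placeholder partition
    #SBATCH --account=<project>        # placeholder project
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=40
    #SBATCH --time=02:00:00

    module load parallel               # module name may differ per system

    # Run one short task per input file, at most $SLURM_CPUS_PER_TASK at a time,
    # all inside a single Slurm job instead of thousands of separate jobs
    find input/ -name '*.dat' | \
        parallel -j "$SLURM_CPUS_PER_TASK" ./process_one.sh {}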

How About Complex Workflows in Scientific Computing?

How to Run and Manage Complex Workflows?

  • If running your jobs gets more complex, e.g. requiring dependencies between subtasks, use workflow tools
  • Some examples of workflow managers: FireWorks, Nextflow, Snakemake, KNIME, BioBB, …

Introduction to Nextflow

  • A tool for managing scientific workflows
  • Written in Groovy, an extension of the Java programming language
  • Follows the dataflow programming model
    • Processes communicate through dataflow variables (channels)
  • Documentation: Nextflow homepage

Core Features of Nextflow

  • Reproducibility
  • Portability
  • Parallelisation (implicit)
  • Continuous checkpoints
  • Easy prototyping

Getting started with Nextflow at CSC

  • Requirements:
    • Runs on any Linux platform or macOS
    • At least Java 8
  • Nextflow installation:
    • As a module on Puhti/Mahti: module load nextflow (see the quick check below)
    • Own installation: curl -s https://get.nextflow.io | bash; mv nextflow ~/bin
  • Supported software stacks:
    • Local installations as modules
    • Docker engine (not possible on CSC supercomputers)
    • Singularity/Apptainer
    • Conda (not recommended)
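
A quick check that the installation works, using Nextflow's public hello pipeline (network access is needed to pull the pipeline on first use):

    module load nextflow
    nextflow -version
    nextflow run nextflow-io/hello       # classic "hello world" test pipeline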

Start Nextflow as a (Normal) Batch Job
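
A minimal sketch of such a batch script (partition, project and workflow file names are placeholders):

    #!/bin/bash
    #SBATCH --partition=small          # placeholder partition
    #SBATCH --account=<project>        # placeholder project
    #SBATCH --cpus-per-task=4
    #SBATCH --time=01:00:00

    module load nextflow

    # Run the whole workflow with the local executor inside this allocation,
    # so individual tasks do not become separate Slurm jobs
    nextflow run main.nf               # main.nf is a placeholder workflow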

Running Nextflow using HyperQueue Executor

  • Good documentation is available in the CSC Docs pages; a simplified sketch is shown below
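
A simplified, hedged sketch of the idea (flags and resource numbers are illustrative; follow the CSC Docs example for a complete script): a HyperQueue server and worker are started inside the Slurm allocation, and Nextflow submits its tasks to HyperQueue instead of Slurm.

    #!/bin/bash
    #SBATCH --partition=small          # placeholder partition
    #SBATCH --account=<project>        # placeholder project
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=40
    #SBATCH --time=04:00:00

    module load nextflow hyperqueue

    # Start a HyperQueue server and a worker inside this allocation
    hq server start &
    sleep 5                            # crude wait for the server to come up
    hq worker start --cpus="$SLURM_CPUS_PER_TASK" &

    # Tell Nextflow to use the (experimental) HyperQueue executor; this can
    # also be set in nextflow.config as: process.executor = 'hq'
    nextflow run main.nf -process.executor hq    # main.nf is a placeholder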

Good Practices for Running Nextflow Pipelines at CSC (1/2)

  • Use version control and pinned tool versions for reproducibility
  • Use containers for easy portability
  • Set the Singularity/Apptainer cache directory to a scratch folder, to avoid filling up your home directory (see the sketch below)
  • Avoid big databases on Lustre, and avoid placing databases inside a container
  • Try to clean up the temporary files created by the workflow when possible (a big issue for Nextflow, which keeps all intermediate files in its work directory)
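
A sketch of the cache-directory and cleanup settings (paths are placeholders; the environment variable names follow Apptainer/Singularity and Nextflow conventions):

    # Keep container caches on scratch instead of the small home directory
    export APPTAINER_CACHEDIR=/scratch/<project>/apptainer_cache
    export SINGULARITY_CACHEDIR="$APPTAINER_CACHEDIR"
    export NXF_SINGULARITY_CACHEDIR=/scratch/<project>/nxf_containers

    # Remove intermediate files of finished runs from Nextflow's work directory
    nextflow clean -f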

Good Practices for Running Nextflow Pipelines at CSC (2/2)

  • Avoid Nextflow's built-in Slurm executor (it submits every task as a separate Slurm job)
  • Bind-mount a writable file system into the container to avoid permission errors when working with containers (see the sketch below)
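
A minimal sketch of bind-mounting a writable scratch directory into the containers started by a workflow (path, workflow and image names are placeholders):

    # Bind the writable scratch area into every container started by the workflow
    export APPTAINER_BIND=/scratch/<project>
    export SINGULARITY_BIND="$APPTAINER_BIND"

    nextflow run main.nf -with-singularity image.sif    # placeholders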