High-throughput computing and workflows

All material (C) 2022-2023 by CSC - IT Center for Science Ltd. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, http://creativecommons.org/licenses/by-sa/4.0/

High-throughput Computing in an HPC Environment

  • Our scientific computational problems need:
    • Faster computation
    • Scalability
    • Parallel processing
    • Improved accuracy
  • High-throughput computing (HTC) is about running many independent tasks that together require a large amount of computing power (i.e., spread across many different computers)

Good Practices for High-throughput Computing

  • Are you using a conda environment for your application?
    • A conda environment can easily contain several thousand files
    • Every time your application starts, all of these files are accessed, which adds overhead on the Lustre file system
    • Loading such an environment in high-throughput workflows therefore takes a lot of time
  • Solution:
    • Containerise your application; from the file system's point of view, the whole environment is then a single file
    • e.g., use the Tykky container wrapper (see the sketch below)
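
A minimal sketch of containerising an existing conda environment with Tykky (the environment file env.yml and the installation path are placeholders; check the CSC documentation for the exact options):

    # On Puhti/Mahti: build a containerised environment from a conda YAML file
    module load tykky
    mkdir -p /projappl/<project>/conda-env            # placeholder install path
    conda-containerize new --prefix /projappl/<project>/conda-env env.yml

    # Add the generated wrapper scripts to PATH and call the tools as usual
    export PATH="/projappl/<project>/conda-env/bin:$PATH"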

Good Practices for High-throughput Computing

  • Does your application perform a lot of I/O operations?
    • Pay special attention to where these operations are performed
    • Note that the Lustre file system is designed for efficient parallel I/O of large files
    • Intensive I/O operations risk degrading the file system performance for all users
  • Solution:
    • Use the fast local NVMe disk on Puhti and Mahti GPU nodes (see the sketch below)
    • Use the ramdisk (/dev/shm) on Mahti CPU nodes
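
A minimal sketch of requesting local NVMe disk in a Slurm batch job (partition, project, disk size and application names are placeholders; on CSC systems the local disk path is exposed via $LOCAL_SCRATCH):

    #!/bin/bash
    #SBATCH --partition=small          # placeholder partition
    #SBATCH --account=<project>        # placeholder project
    #SBATCH --time=01:00:00
    #SBATCH --gres=nvme:100            # request 100 GB of local NVMe disk

    # Do the I/O-heavy work on the fast local disk instead of Lustre
    cp input.tar "$LOCAL_SCRATCH"
    cd "$LOCAL_SCRATCH"
    tar xf input.tar
    ./analysis input/                  # placeholder application

    # Copy only the final results back to Lustre
    cp -r results /scratch/<project>/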

Good Practices for High-throughput Computing

  • If you have only a moderate number of jobs, an easy solution is to use array jobs
  • Does your workflow involve running a large number of (short) batch jobs?
    • This poses problems for batch job schedulers such as Slurm used in HPC systems
    • Short jobs also suffer from scheduling overhead
  • Solution:
    • Execute the tasks with as few invocations of sbatch and srun as possible
    • Use built-in options for farming-type workloads
    • Use external tools such as HyperQueue or GNU Parallel (see the sketch below)
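
A minimal sketch of packing many short tasks into a single Slurm allocation with GNU Parallel (partition, project, module and script names are placeholders and may differ per system):

    #!/bin/bash
    #SBATCH --partition=small          # placeholder partition
    #SBATCH --account=<project>        # placeholder project
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=40
    #SBATCH --time=02:00:00

    module load parallel               # module name may differ per system

    # Run one short task per input file, at most $SLURM_CPUS_PER_TASK at a time,
    # all inside a single Slurm job instead of thousands of separate jobs
    find input/ -name '*.dat' | \
        parallel -j "$SLURM_CPUS_PER_TASK" ./process_one.sh {}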

How About Complex Workflows in Scientific Computing?

How to Run and Manage Complex Workflows?

  • If running your jobs gets more complex, e.g. requiring dependencies between subtasks, use workflow tools
  • Some examples of workflow managers: FireWorks, Nextflow, Snakemake, KNIME, BioBB, …

Introduction to Nextflow

  • A tool for managing scientific workflows
  • Written in Groovy, an extension of the Java programming language
  • Follows the dataflow programming model
    • Processes communicate through dataflow variables (channels)
  • Documentation: Nextflow homepage

Core Features of Nextflow

  • Reproducibility
  • Portability
  • Parallelisation (implicit)
  • Continuous checkpoints
  • Easy prototyping

Getting started with Nextflow at CSC

  • Requirements:
    • Runs on any Linux platform or macOS
    • At least Java 8
  • Nextflow installation:
    • As a module on Puhti/Mahti: module load nextflow (see the quick check below)
    • Own installation: curl -s https://get.nextflow.io | bash; mv nextflow ~/bin
  • Supported software stacks:
    • Local installations as modules
    • Docker engine (not possible on CSC supercomputers)
    • Singularity/Apptainer
    • Conda (not recommended)
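
A quick check that the installation works, using Nextflow's public hello pipeline (network access is needed to pull the pipeline on first use):

    module load nextflow
    nextflow -version
    nextflow run nextflow-io/hello       # classic "hello world" test pipeline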

Start Nextflow as a (Normal) Batch Job
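
A minimal sketch of such a batch script (partition, project and workflow file names are placeholders):

    #!/bin/bash
    #SBATCH --partition=small          # placeholder partition
    #SBATCH --account=<project>        # placeholder project
    #SBATCH --cpus-per-task=4
    #SBATCH --time=01:00:00

    module load nextflow

    # Run the whole workflow with the local executor inside this allocation,
    # so individual tasks do not become separate Slurm jobs
    nextflow run main.nf               # main.nf is a placeholder workflow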

Running Nextflow using HyperQueue Executor

  • Good documentation is available in the CSC Docs pages; a simplified sketch is shown below
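
A simplified, hedged sketch of the idea (flags and resource numbers are illustrative; follow the CSC Docs example for a complete script): a HyperQueue server and worker are started inside the Slurm allocation, and Nextflow submits its tasks to HyperQueue instead of Slurm.

    #!/bin/bash
    #SBATCH --partition=small          # placeholder partition
    #SBATCH --account=<project>        # placeholder project
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=40
    #SBATCH --time=04:00:00

    module load nextflow hyperqueue

    # Start a HyperQueue server and a worker inside this allocation
    hq server start &
    sleep 5                            # crude wait for the server to come up
    hq worker start --cpus="$SLURM_CPUS_PER_TASK" &

    # Tell Nextflow to use the (experimental) HyperQueue executor; this can
    # also be set in nextflow.config as: process.executor = 'hq'
    nextflow run main.nf -process.executor hq    # main.nf is a placeholder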

Good Practices for Running Nextflow Pipelines at CSC (1/2)

  • Use version control and pinned tool versions for reproducibility
  • Use containers for easy portability
  • Set the Singularity/Apptainer cache directory to a scratch folder, to avoid filling up your home directory (see the sketch below)
  • Avoid big databases on Lustre, and avoid placing databases inside a container
  • Try to clean up the temporary files created by the workflow when possible (a big issue for Nextflow, which keeps all intermediate files in its work directory)
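
A sketch of the cache-directory and cleanup settings (paths are placeholders; the environment variable names follow Apptainer/Singularity and Nextflow conventions):

    # Keep container caches on scratch instead of the small home directory
    export APPTAINER_CACHEDIR=/scratch/<project>/apptainer_cache
    export SINGULARITY_CACHEDIR="$APPTAINER_CACHEDIR"
    export NXF_SINGULARITY_CACHEDIR=/scratch/<project>/nxf_containers

    # Remove intermediate files of finished runs from Nextflow's work directory
    nextflow clean -f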

Good Practices for Running Nextflow Pipelines at CSC (2/2)

  • Avoid Nextflow's built-in Slurm executor (it submits every task as a separate Slurm job)
  • Bind-mount a writable file system into the container to avoid permission errors when working with containers (see the sketch below)
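
A minimal sketch of bind-mounting a writable scratch directory into the containers started by a workflow (path, workflow and image names are placeholders):

    # Bind the writable scratch area into every container started by the workflow
    export APPTAINER_BIND=/scratch/<project>
    export SINGULARITY_BIND="$APPTAINER_BIND"

    nextflow run main.nf -with-singularity image.sif    # placeholders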