Workflows workshop

CSC / Sami Ilvonen, Henrik Nortamo, 2022-12-13

Introduction

  • Why is the supercomputer slower than my laptop?
    • You should understand how your tools work and how to use them.
    • Some things are much easier to do at a smaller scale.
    • When systems become larger, you have to operate your tools differently.

Slurm

Slurm and process management

  • CSC’s clusters use Slurm as the resource manager for batch jobs.
  • Slurm keeps track of pending and running jobs at a very detailed level (CPU cores, memory limits, local storage, GPUs).
  • When a job starts, Slurm creates a process management interface that contains the information about the resources and their location on the cluster (which CPUs and GPUs on which nodes).
    • MPI libraries use this interface to set up the needed data structures to communicate between different MPI ranks.
    • The same information is also used by Slurm for job lifetime management.
    • This step requires quite a lot of communication and synchronization for large MPI jobs.

Batch job scheduling

  • We use fair share scheduling and resource limits, which ensure that a single user can't fill the whole cluster.
    • The downside is that fair share is tricky to tune and also very resource intensive.
    • Scheduling algorithms are heuristic and there's a practical limit on how many jobs can be scheduled -> a limit on the number of jobs a single user can submit.

Resource tracking and accounting

  • We have to keep track of resource usage and report it to the parties that funded the procurement.
    • This reporting is done on several different levels, for example by branch of science, by university, etc.
  • Basic information about each Slurm job is recorded in an internal Slurm job accounting database.
    • Each job step is recorded as well (see the sketch below)!
  • A large number of short jobs generates lots of accounting data, and scheduling can use more resources than the actual jobs!
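
As a rough illustration of how job steps show up in accounting, the sketch below lists the records of a single job with the standard sacct command; every srun step gets its own row. The job ID is a placeholder.

    # Sketch: list the accounting records of one job; each srun step
    # appears as its own row in the output. The job ID is a placeholder.
    import subprocess

    jobid = "123456"
    result = subprocess.run(
        ["sacct", "-j", jobid, "--format=JobID,JobName,Elapsed,State"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)  # one line per step (batch, extern, srun steps, ...)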

Things to avoid

  • Wrong partition or resource allocation.
    • Do not use --mem=0
  • A large number of (small) jobs.
    • A large number of job steps (srun commands).
  • Polling Slurm commands (squeue, sinfo, etc.) in tight loops (see the sketch below).
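
If a workflow tool needs to check job status, it should do so rarely and only for its own jobs. A minimal sketch of a gentler polling loop; the five-minute interval is only illustrative, not an official recommendation.

    # Sketch: check the state of your own jobs at a modest interval
    # instead of calling squeue in a tight loop. The interval is
    # illustrative, not an official recommendation.
    import getpass
    import subprocess
    import time

    while True:
        out = subprocess.run(
            ["squeue", "-u", getpass.getuser(), "--noheader"],
            capture_output=True, text=True,
        ).stdout
        if not out.strip():   # no pending or running jobs left
            break
        time.sleep(300)       # wait several minutes between checks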

Lustre

Parallel file systems

  • A parallel file system (PFS) provides a common file system area that can be accessed from all nodes in a cluster using the same interfaces and tools as local file systems.
  • Without a PFS, users would always have to copy all needed data to the compute nodes before runs.
    • The results also wouldn't be visible outside the compute node.
  • CSC uses the Lustre parallel file system on both Puhti and Mahti.

Lustre parallel file system

  • One or more metadata servers (MDS) with metadata targets (MDT) that store the file system metadata.
  • One or more object storage servers (OSS) with object storage targets (OST) that store the actual file system contents.

Accessing a file

  1. The client sends a metadata request to the MDS
  2. The MDS responds with the metadata
  3. The client requests the data from the OSS
  4. The OSS responds with the data

Problematic IO patterns

  • Accessing lots of small files, or a large number of files in a single directory.
  • Large files on a single OST (that is, no striping).
  • Opening and closing a file at a rapid pace (see the sketch after this list).
  • Accessing the same files from a large number of nodes.
  • Using file locks for synchronization.
    • Writing to the same file from several nodes.
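
For example, re-opening the same file inside a loop sends a new metadata request on every iteration, whereas opening it once does that work a single time. A minimal sketch:

    # Sketch: avoid rapid open/close cycles on Lustre.

    # Problematic: every iteration re-opens the file -> a metadata
    # round trip per iteration.
    for i in range(100_000):
        with open("results.txt", "a") as f:
            f.write(f"{i}\n")

    # Better: open once, write many times, close once.
    with open("results.txt", "a") as f:
        for i in range(100_000):
            f.write(f"{i}\n")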

Issues with Python (why not conda)

  • The Python interpreter loads every imported source code file during start-up.
    • Large libraries may consist of tens of thousands of small files (see the sketch below).
  • Conda is a Python package manager and environment management system that creates a huge number of files.
    • Activating a conda environment triggers a large number of file operations and can be very slow on a network file system.
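
To get a feel for the scale, the sketch below counts the files under an environment directory; the path is a placeholder for wherever your own conda or Python environment lives.

    # Sketch: count the files in a (conda) environment directory.
    # The path is a placeholder; point it at your own environment.
    import os

    env_path = "/path/to/conda/env"
    n_files = sum(len(files) for _, _, files in os.walk(env_path))
    print(f"{env_path}: {n_files} files")  # often tens of thousands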

Guidelines

Why don't we give exact numbers?

  • Our documentation and guidelines say that users shouldn't submit too many jobs, but how many is too many?
  • Unfortunately, it's impossible to give exact numbers because both Slurm and Lustre are shared resources.
    • It's possible to give better limits for the global usage of the system.
    • When the total system load is low, it may be OK to run something that would be problematic when the system is full.

How many jobs/steps is too many?

  • SHOULD BE OK to run tens of jobs/steps
  • PAY ATTENTION if you run hundreds of jobs/steps
  • DON’T RUN several thousands of jobs

How many file operations is too many?

  • SHOULD BE OK to access hundreds of files
  • PAY ATTENTION if you need several thousands of files
  • DON’T USE hundreds of thousands of files


Note that these guideline numbers cover all operations across all of your jobs.

Remedies

Content

All topics are equally relevant whether you already have an existing workflow or are just in the planning phase.


  1. Understanding your workflow
  2. Dealing with a large number of small files
  3. Dealing with a large number of tasks
  4. Some notes on containers within the context of workflows

Understanding the workflow

  • What your workflow looks like
    • How many tasks
    • What kind of dependencies exist between tasks -> maximum width (see the sketch after this list)
    • Conditionals?
    • Are there dynamic components?
  • How your tasks behave
    • Resource requirements and scalability
    • Duration
    • IO behavior
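
To make "maximum width" concrete, the sketch below represents a workflow's dependencies as a DAG and computes how many tasks could run at the same time, level by level. The task names and dependencies are made up for illustration.

    # Sketch: compute the maximum width of a workflow DAG, i.e. the
    # largest number of tasks that could run in parallel. The tasks and
    # dependencies are made-up examples.
    from graphlib import TopologicalSorter

    deps = {
        "preprocess": [],
        "sim_a": ["preprocess"],
        "sim_b": ["preprocess"],
        "sim_c": ["preprocess"],
        "collect": ["sim_a", "sim_b", "sim_c"],
    }

    ts = TopologicalSorter(deps)
    ts.prepare()
    max_width = 0
    while ts.is_active():
        ready = ts.get_ready()            # tasks whose dependencies are done
        max_width = max(max_width, len(ready))
        ts.done(*ready)                   # assume the whole level finishes
    print(f"maximum width: {max_width}")  # -> 3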

Understanding the workflow

  • Additional requirements
    • Error handling
    • Checkpointing (also task level)
    • Abstraction levels
    • Security
  • What your tool is doing (+ available features)
    • How it polls / uses Slurm
    • What kind of extra / bookkeeping files are created

Understanding your workflow

Why we need to understand our workflow


  • Tools do a better job at scheduling -> shorter time to solution
  • Reasoning about scalability and system impact
  • Easier for CSC to provide support

I have lots of small files

  • Check the tool that you are using
    • There may be different options for data storage.
  • Tar/untar and compress your datasets.
  • Use local disk (NVMe on Puhti, ramdisk on Mahti); see the sketch after this list.
  • Remove intermediate files if possible.
  • Use squashfs for read-only datasets and containers.
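
For example, a dataset packed into a single tar file can be copied to the node-local disk and extracted there, so the parallel file system only sees one large file. A minimal sketch, assuming the LOCAL_SCRATCH environment variable points to the node-local storage (check your system's documentation); the dataset path is a placeholder.

    # Sketch: extract a tarred dataset to node-local storage instead of
    # unpacking tens of thousands of small files on Lustre.
    # LOCAL_SCRATCH is assumed to point to the local disk and
    # dataset.tar.gz is a placeholder path.
    import os
    import tarfile

    local = os.environ["LOCAL_SCRATCH"]
    with tarfile.open("dataset.tar.gz", "r:gz") as tar:
        tar.extractall(path=local)

    # ... run the analysis against the files under `local`, then pack
    # the results into a single archive before copying them back.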

I have lots of small tasks for Slurm

  • Check the tool that you are using
    • There may already be support for running multiple tasks in a single job (CP2K farming, Gromacs multidir, etc.)
  • Regroup your tasks and execute a larger group of tasks in a single job/step (see the sketch after this list).
    • Manual or automatic (if the feature is present in your tool)
    • Horizontal and vertical packing
    • Trade-offs (redundancy, parallelism, utilization)
  • Run a larger job and use another scheduler inside it (HyperQueue, Flux).
    • Integrations for Nextflow and Snakemake already exist
    • CSC has some tools for farming-type jobs
    • It's not all or nothing
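
As one way to regroup tasks manually, the sketch below runs many small, independent tasks inside a single job step with a process pool, instead of submitting one job or srun step per task. The run_task function and the number of tasks are placeholders.

    # Sketch: pack many small independent tasks into one job step with
    # a process pool instead of one Slurm job/step per task. run_task
    # and the task list are placeholders for your own work items.
    import os
    from concurrent.futures import ProcessPoolExecutor

    def run_task(i):
        return i * i  # placeholder for the real work of task i

    if __name__ == "__main__":
        workers = len(os.sched_getaffinity(0))  # cores available to this step
        with ProcessPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(run_task, range(1000)))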

Containers and workflows

  • Tasks might use applications that are packaged in containers
  • Large or complicated dependencies might make containers necessary for:
    • Workflow engine
    • Code run directly by the workflow engine
    • Additional scheduler
  1. Containers cannot be nested*
  2. There is no way to escape the container**
  3. Launching a Slurm job or using the system ssh will start the process outside the container on a remote node

Containers and workflows

Different configurations

  • Workflow engine outside a container
  • Workflow engine in the same container as the applications
  • Having the workflow engine in a separate container
    • Depending on the workflow engine, this might be hard