Brief introduction to HPC environments
Notes on vocabulary
Roughly you can think of
- computer ~= node
- processor ~= socket
- core ~= CPU
Cluster systems
- Login nodes are used to set up jobs (and to launch them)
- Jobs are run in the compute nodes
- A batch job system (aka scheduler) is used to run and manage the jobs
- On CSC machines we use Slurm
- Other common systems include SGE and Torque/PBS
- Syntax is different, but basic operation is similar
Available HPC resources
- Puhti is the general purpose supercomputer ☑️
- Mahti is the massively parallel flagship supercomputer
- Pouta provides cloud resources via OpenStack (IaaS)
- Rahti provides containers via OKD (PaaS)
- Allas provides object storage for all services
Which supercomputer to use?
- What kind of resources can your application use?
    - Can it use more than one core?
    - How much memory will it need?
    - Can it use a GPU or NVMe disk?
    - What takes long (what is the time-limiting part) in your job?
- See what kind of resources are available
    - Is my code already installed?
    - Max. runtime, partitions (queues), provisioning policy (per core / per node / other)
    - Each system is different, so check the documentation
Quick and dirty comparison of Puhti and Mahti
|                                      | Puhti     | Mahti          |
| ------------------------------------ | --------- | -------------- |
| Number of preinstalled applications  | 123+      | 16+            |
| Cores per node                       | 40        | 128            |
| Job size (min-max cores)             | 1-1040    | 128-25600      |
| Memory per node (GiB)                | 192-1536  | 256            |
| GPU cards (NVIDIA)                   | 120 x V100 | 96 x A100     |
| Fast node local disk (NVMe)          | 120       | (24 GPU nodes) |
In short: Mahti is for much larger parallel jobs; prepare to install and optimize your code. (Still, a Puhti node is > 10x your laptop.)
Disk areas in CSC HPC environment
In this section, you will learn how to manage the different disk areas in the HPC environment at CSC
Overview of disk areas
- Main disk areas and their specific uses in Puhti/Mahti
- Moving data between supercomputers
- Understanding quotas (both usable space and number of files) for different disk areas
- Additional fast disk areas
Disk and storage overview
Main disk areas in Puhti/Mahti
- Home directory (`$HOME`)
    - Other users cannot access your home directory
- ProjAppl directory (`/projappl/project_name`)
    - Shared with project members
    - Possible to limit access (`chmod g-rw`) in subfolders, as sketched after this list
- Scratch directory (`/scratch/project_name`)
    - Shared with project members
    - Files older than 90 days will be automatically removed
- These directories reside on the Lustre parallel file system
- Default quotas and more info in the disk areas section
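For example, a minimal sketch of limiting access to a ProjAppl subfolder; the folder name is hypothetical, and `chmod g-rw` removes the project group's read and write permissions:

```bash
# Go to the project's ProjAppl directory (project_name is a placeholder)
cd /projappl/project_name

# Create a subfolder and remove the group's read/write permissions
# so other project members can no longer access its contents
mkdir -p my_private_tools
chmod g-rw my_private_tools

# Verify the resulting permissions
ls -ld my_private_tools
```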
Moving data between and to/from supercomputers
Displaying current status of disk areas
- Use the `csc-workspaces` command to display your available projects and quotas
Disk and storage overview (revisited)
Additional fast local disk areas
- `$TMPDIR` on login nodes
    - Each login node has 2900 GiB of fast local storage in `$TMPDIR`
    - The local storage is meant for temporary files and is cleaned frequently
- NVMe on part of the compute nodes in Puhti
    - Interactive batch job nodes, IO nodes and GPU nodes have fast local storage (NVMe) available as `$LOCAL_SCRATCH`
    - You must copy data in and out during your batch job; the NVMe disk is accessible only during your job allocation (see the sketch after this list)
    - If your job reads or writes a lot of small files, using this can give a 10x performance boost
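A minimal sketch of using the node-local NVMe disk in a Puhti batch job; the project name, data paths, program and requested disk size are placeholders, and the `--gres=nvme:<GB>` request follows the CSC documentation (check the docs for the current syntax and partitions):

```bash
#!/bin/bash
#SBATCH --job-name=nvme_example
#SBATCH --account=project_20001234   # placeholder project
#SBATCH --partition=small
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=nvme:100              # request 100 GB of node-local NVMe disk

# Copy the input data to the fast local disk at the start of the job
cp -r /scratch/project_20001234/my_input "$LOCAL_SCRATCH"/

# Run the actual computation against the local copy (placeholder command)
srun my_program --input "$LOCAL_SCRATCH"/my_input --output "$LOCAL_SCRATCH"/results

# Copy the results back before the job ends; $LOCAL_SCRATCH is wiped afterwards
cp -r "$LOCAL_SCRATCH"/results /scratch/project_20001234/
```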
What are the different disk areas for?
- Allas - for data which is not actively used
- HOME - small, so only for the most important (small) files; personal access only
- scratch - main working area, can be used to share with project members
- projappl - not cleaned up, e.g. for shared binaries
- Login node local tmp - compiling, temporary, fast IO
- NVMe - fast IO in batch jobs
Some best practice tips
- Don’t put databases on Lustre (projappl, scratch, home)
- use other CSC services like kaivos or mongoDB in cPouta
- Don’t create a lot of files in one folder
- Don’t create a lot of files overall (if you’re creating tens of thousands of files, you should probably rethink the workflow)
- Take backups of important files. Data on CSC disks is not backed up even if systems are fault tolerant.
- When working with a large number of small files, consider using the fast local disks
- Best practice performance tips for using Lustre
The batch job system in CSC’s HPC environment
What is a batch job? 1/2
- On a laptop we are used to starting a program (job) by clicking an icon, and the job starts instantly
- If we start many jobs at the same time, we occasionally run into problems such as running out of memory
- In an HPC environment the computer is shared among hundreds of other users who all have different resource needs
- HPC batch jobs include an estimate (requirement) on how much resources they are expected to use
What is a batch job? 2/2
- A batch job consists of two parts: A resource request and the actual computing step
- A job is not started directly, but is sent into a queue
- Depending on the requested resources and load, the job may need to wait to get started
- At CSC (and HPC systems in general) all heavy computing must be done via batch jobs (see Usage policy)
What is a batch job system?
- A resource management system that keeps track of all batch jobs that use, or would like to use the computing resources
- Aims to share the resources in an efficient and fair way
- Optimizes resource usage by filling the compute node with most suitable jobs
Queueing and fair share of resources
- A job is queued and starts when the requested resources become available
- The order in which the queued jobs start depends on their priority and available resources
- At CSC the priority is configured to use “fair share”
- The initial priority of a job decreases if the user has recently run lots of jobs
- Over time (while queueing) its priority increases and eventually it will run
- Some queues have a lower priority (like longrun – use shorter if you can!)
- See our main documentation on Getting started with running jobs section in docs.csc.fi
Schema on how the batch job scheduler works
The batch job system in CSC’s HPC environment
- CSC uses a batch job system (Slurm) to manage jobs
- Slurm is used to control how the overall computing resources are shared among all projects in an efficient and fair way
- Slurm controls how a single job request gets resources, like:
- computing time
- number of cores
- amount of memory
- other resources like gpu, local disk, etc.
Example serial batch job script for Puhti
- A batch job is a shell script (bash) that consists of two parts: a resource request flagged with `#SBATCH` and the actual computing step(s)
```bash
#!/bin/bash
#SBATCH --job-name=print_hostname    # Defines the job name shown in the queue.
#SBATCH --time=00:01:00              # Defines the max time the job can run.
#SBATCH --partition=test             # Defines the queue in which to run the job.
#SBATCH --ntasks=1                   # Defines the number of tasks.
#SBATCH --cpus-per-task=1            # Number of cores is ntasks * cpus-per-task.
#SBATCH --account=project_20001234   # Defines the billing project. Mandatory field.

srun echo "Hello $USER! You are on node $HOSTNAME"
```
- The options have been described in Create batch jobs for Puhti
- The actual program is launched using the `srun` command
- The content above could be copied into a file such as `simple_serial.bash` and put into the queue with the command `sbatch simple_serial.bash`
Use an application specific batch script template
- The application list in docs contains example scripts for some software
- Use these as the starting point for your own scripts
- They have been tested and optimized (although for minimal resources) for that application
- Consult the manual or other examples to adapt to your own needs
Submitting, cancelling and stats of batch jobs
- The job script file is submitted with the command: `sbatch simple_serial.bash`
- List all your jobs that are queuing/running: `squeue -u $USER`
- Detailed info of a queuing/running job: `scontrol show job <jobid>`
- A job can be cancelled with the command: `scancel <jobid>`
- Display the used resources of a completed job: `seff <jobid>`
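As a concrete sketch of this cycle (the script name and job ID are placeholders; `sbatch`, `squeue`, `scontrol` and `scancel` are standard Slurm commands, and `seff` is the job efficiency tool available on CSC systems):

```bash
# Submit the job script and note the job ID that Slurm prints
sbatch simple_serial.bash

# Check your own queuing and running jobs
squeue -u $USER

# Inspect a specific job in detail while it is queuing or running
scontrol show job 1234567

# Cancel the job if it is no longer needed
scancel 1234567

# After the job has finished, check how much of the requested
# resources (CPU time, memory) it actually used
seff 1234567
```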
Available batch job partitions
- The available batch job partitions are listed in docs.csc.fi
- In order to use the resources in an efficient way, it is important to estimate the request as accurately as possible
- By avoiding excessive “just-in-case” requests, the job will start earlier
Different types of HPC jobs
- Typically an HPC job can be classified as serial, parallel or GPU, depending on the main requested resource
- The following slides will provide you with an overview of different job types
- A serial job is the simplest type of job, whereas parallel and GPU jobs may require some advanced methods to fully utilise their capacity
- If you use already installed software, check whether it needs resources as a serial, parallel or GPU job
HPC serial jobs
Serial software can only use one core, so don’t reserve more!
Why could your serial job benefit from being executed using CSC’s resources instead of on your own computer?
- Part of a larger workflow
- Avoid data transfer between CSC and your own computer
- Data sharing among other project members
- CSC’s software licensing
- It was already installed
- Memory and/or disk demands
Running multiple serial jobs
- You can utilize parallel resources for running multiple serial jobs at the same time (e.g. as an array job, sketched after this list)
- Pure serial resources are only available in Puhti, but GREASY jobs can make serial jobs suitable for Mahti, too
- But, for this to be an efficient use of resources, your workflow needs to fill (at least) one Mahti node and keep the CPUs busy for the whole duration of the job
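One common way to run many independent serial tasks at once on Puhti (alongside GREASY) is a Slurm array job; a minimal sketch, where the input/output naming scheme and the program are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --account=project_20001234   # placeholder project
#SBATCH --partition=small
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --array=1-50                 # run 50 independent serial tasks

# Each array task gets its own SLURM_ARRAY_TASK_ID (1..50) and
# processes its own input file (placeholder naming scheme)
srun my_program --input data_${SLURM_ARRAY_TASK_ID}.txt \
                --output result_${SLURM_ARRAY_TASK_ID}.txt
```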
HPC parallel jobs
- A parallel job distributes the calculation over several cores in order to achieve a shorter wall time (and/or a larger allocatable memory)
- There are two major parallelization schemes: OpenMP and MPI
- Note that, depending on the parallelization scheme, there is a slight difference in how the resource reservation is done (see the excerpts after this list)
- Batch job script how-to create and examples for Puhti
- Batch job script how-to create and examples for Mahti
- The best starting point: Software specific batch scripts in docs
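To illustrate that difference, below are two hedged excerpts (not tested templates; partition names, core counts and program names are placeholders, and the software-specific scripts in the docs remain the best starting point). An OpenMP program uses threads within one node, so cores are requested with `--cpus-per-task`; an MPI program uses separate tasks, requested with `--ntasks`.

```bash
## Excerpt from an OpenMP batch script (threads within one node):
## one task, several cores for that task
#SBATCH --partition=small
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # thread count follows the reservation
srun my_openmp_program

## Excerpt from a separate MPI batch script (independent tasks,
## possibly spread across nodes): many tasks, one core each
#SBATCH --partition=large
#SBATCH --ntasks=80
srun my_mpi_program
```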
HPC GPU jobs
- A graphics processing unit (GPU, a video card) is capable of doing certain types of simultaneous calculations very efficiently
- In order to take advantage of this power, a computer program must be adapted to how a GPU handles data
- CSC’s GPU resources are relatively scarce and hence should be used with particular care
- A GPU uses 60 times more billing units than a single CPU core - see above for performance requirements
- In practice, 1-10 CPU cores (but not more) should be allocated per GPU on Puhti
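A minimal sketch of a Puhti GPU request following the `--gres=gpu:v100:<number>` convention in the CSC documentation; the project, runtime and the program itself are placeholders, so check the docs for the exact partition and options:

```bash
#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --account=project_20001234   # placeholder project
#SBATCH --partition=gpu              # Puhti GPU partition
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10           # at most ~10 CPU cores per GPU
#SBATCH --gres=gpu:v100:1            # request one V100 GPU

srun my_gpu_program
```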
Interactive jobs
- When you login to CSC’s supercomputers, you end up in one of the login nodes of the computer
- If you have a heavier job that still requires interactive response (e.g. a graphical user interface)
    - Allocate the resources via the interactive partition
    - This way your work is performed on a compute node, not on the login node (see the sketch below)
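On Puhti one way to do this is the `sinteractive` wrapper described in the CSC docs; a minimal sketch, where the project and the requested resources are placeholders (check `sinteractive --help` or the docs for the current options):

```bash
# Request an interactive session on a compute node
# (placeholder project, two hours, 8 GB memory, 100 GB local disk)
sinteractive --account project_20001234 --time 02:00:00 --mem 8000 --tmp 100

# Commands you run after this execute on the allocated compute node,
# not on the shared login node
```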