Brief introduction to HPC environments
Notes on vocabulary
Roughly you can think of
- computer ~= node
- processor ~= socket
- core ~= CPU
Cluster systems
- Login nodes are used to set up jobs (and to launch them)
- Jobs are run in the compute nodes
- A batch job system (aka scheduler) is used to run and manage the jobs
- On CSC machines we use Slurm
- Other common systems include SGE and Torque/PBS
- Syntax is different, but basic operation is similar
Available HPC resources
- Puhti is the general purpose supercomputer ☑️
- Mahti is the massively parallel flagship supercomputer
- Pouta provides cloud resources via OpenStack (IaaS)
- Rahti provides containers via OKD (PaaS)
- Allas provides object storage for all services
Which supercomputer to use?
- What kind of resources can your application use?
    - Can it use more than one core?
    - How much memory will it need?
    - Can it use a GPU or NVMe disk?
    - What takes long (what is the time-limiting part) in your job?
- See what kind of resources are available
    - Is my code already installed?
    - Max. runtime, partitions (queues), provisioning policy (per core / per node / other)
    - Each system is different, so check the documentation
Quick and dirty comparison of Puhti and Mahti
|                                      | Puhti     | Mahti          |
| ------------------------------------ | --------- | -------------- |
| Number of preinstalled applications  | 123+      | 16+            |
| Cores per node                       | 40        | 128            |
| Job size (min-max cores)             | 1-1040    | 128-25600      |
| Memory per node (GiB)                | 192-1536  | 256            |
| GPU cards (NVIDIA)                   | 120 x V100 | 96 x A100     |
| Fast node local disk (NVMe)          | 120       | (24 GPU nodes) |
In short: Mahti is for much larger parallel jobs; prepare to install and optimize your code. (Still, a Puhti node is > 10x your laptop.)
Disk areas in CSC HPC environment
In this section, you will learn how to manage the different disk areas in the HPC environment at CSC
Overview of disk areas
- Main disk areas and their specific uses in Puhti/Mahti
- Moving data between supercomputers
- Understanding quotas (both usable space and number of files) for different disk areas
- Additional fast disk areas
Disk and storage overview
Main disk areas in Puhti/Mahti
- Home directory (`$HOME`)
    - Other users cannot access your home directory
- ProjAppl directory (`/projappl/project_name`)
    - Shared with project members
    - Possible to limit access (`chmod g-rw`) in subfolders, as sketched after this list
- Scratch directory (`/scratch/project_name`)
    - Shared with project members
    - Files older than 90 days will be automatically removed
- These directories reside on the Lustre parallel file system
- Default quotas and more info in the disk areas section
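For example, a minimal sketch of limiting access to a ProjAppl subfolder; the folder name is hypothetical, and `chmod g-rw` removes the project group's read and write permissions:

```bash
# Go to the project's ProjAppl directory (project_name is a placeholder)
cd /projappl/project_name

# Create a subfolder and remove the group's read/write permissions
# so other project members can no longer access its contents
mkdir -p my_private_tools
chmod g-rw my_private_tools

# Verify the resulting permissions
ls -ld my_private_tools
```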
Moving data between and to/from supercomputers
Displaying current status of disk areas
- Use the `csc-workspaces` command to display your available projects and quotas
Disk and storage overview (revisited)
Additional fast local disk areas
- `$TMPDIR` on login nodes
    - Each login node has 2900 GiB of fast local storage in `$TMPDIR`
    - The local storage is meant for temporary files and is cleaned frequently
- NVMe on part of the compute nodes in Puhti
    - Interactive batch job nodes, IO nodes and GPU nodes have fast local storage (NVMe) available as `$LOCAL_SCRATCH`
    - You must copy data in and out during your batch job; the NVMe disk is accessible only during your job allocation (see the sketch after this list)
    - If your job reads or writes a lot of small files, using this can give a 10x performance boost
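A minimal sketch of using the node-local NVMe disk in a Puhti batch job; the project name, data paths, program and requested disk size are placeholders, and the `--gres=nvme:<GB>` request follows the CSC documentation (check the docs for the current syntax and partitions):

```bash
#!/bin/bash
#SBATCH --job-name=nvme_example
#SBATCH --account=project_20001234   # placeholder project
#SBATCH --partition=small
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=nvme:100              # request 100 GB of node-local NVMe disk

# Copy the input data to the fast local disk at the start of the job
cp -r /scratch/project_20001234/my_input "$LOCAL_SCRATCH"/

# Run the actual computation against the local copy (placeholder command)
srun my_program --input "$LOCAL_SCRATCH"/my_input --output "$LOCAL_SCRATCH"/results

# Copy the results back before the job ends; $LOCAL_SCRATCH is wiped afterwards
cp -r "$LOCAL_SCRATCH"/results /scratch/project_20001234/
```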
What are the different disk areas for?
- Allas - for data which is not actively used
- HOME - small, so only for the most important (small) files; personal access only
- scratch - main working area, can be used to share with project members
- projappl - not cleaned up, e.g. for shared binaries
- Login node local tmp - compiling, temporary, fast IO
- NVMe - fast IO in batch jobs
Some best practice tips
- Don’t put databases on Lustre (projappl, scratch, home)
- use other CSC services like kaivos or mongoDB in cPouta
- Don’t create a lot of files in one folder
- Don’t create a lot of files overall (if you’re creating tens of thousands of files, you should probably rethink the workflow)
- Take backups of important files. Data on CSC disks is not backed up even if systems are fault tolerant.
- When working with a large number of small files, consider using the fast local disks
- Best practice performance tips for using Lustre
The batch job system in CSC’s HPC environment
What is a batch job? 1/2
- On a laptop we are used to starting a program (job) by clicking an icon, and the job starts instantly
- If we start many jobs at the same time, we occasionally run into problems such as running out of memory
- In an HPC environment the computer is shared among hundreds of other users who all have different resource needs
- HPC batch jobs include an estimate (requirement) on how much resources they are expected to use
What is a batch job? 2/2
- A batch job consists of two parts: A resource request and the actual computing step
- A job is not started directly, but is sent into a queue
- Depending on the requested resources and load, the job may need to wait to get started
- At CSC (and HPC systems in general) all heavy computing must be done via batch jobs (see Usage policy)
What is a batch job system?
- A resource management system that keeps track of all batch jobs that use, or would like to use the computing resources
- Aims to share the resources in an efficient and fair way
- Optimizes resource usage by filling the compute node with most suitable jobs
Queueing and fair share of resources
- A job is queued and starts when the requested resources become available
- The order in which the queued jobs start depends on their priority and available resources
- At CSC the priority is configured to use “fair share”
- The initial priority of a job decreases if the user has recently run lots of jobs
- Over time (while queueing) its priority increases and eventually it will run
- Some queues have a lower priority (like longrun – use shorter if you can!)
- See our main documentation on Getting started with running jobs section in docs.csc.fi
Schema on how the batch job scheduler works
The batch job system in CSC’s HPC environment
- CSC uses a batch job system (Slurm) to manage jobs
- Slurm is used to control how the overall computing resources are shared among all projects in an efficient and fair way
- Slurm controls how a single job request gets resources, like:
- computing time
- number of cores
- amount of memory
- other resources like gpu, local disk, etc.
Example serial batch job script for Puhti
- A batch job is a shell script (bash) that consists of two parts: a resource request flagged with `#SBATCH` and the actual computing step(s)
```bash
#!/bin/bash
#SBATCH --job-name=print_hostname    # Defines the job name shown in the queue.
#SBATCH --time=00:01:00              # Defines the max time the job can run.
#SBATCH --partition=test             # Defines the queue in which to run the job.
#SBATCH --ntasks=1                   # Defines the number of tasks.
#SBATCH --cpus-per-task=1            # Number of cores is ntasks * cpus-per-task.
#SBATCH --account=project_20001234   # Defines the billing project. Mandatory field.

srun echo "Hello $USER! You are on node $HOSTNAME"
```
- The options have been described in Create batch jobs for Puhti
- The actual program is launched using the `srun` command
- The content above could be copied into a file such as `simple_serial.bash` and put into the queue with the command `sbatch simple_serial.bash`
Use an application specific batch script template
- The application list in docs contains example scripts for some software
- Use these as the starting point for your own scripts
- They have been tested and optimized (although for minimal resources) for that application
- Consult the manual or other examples to adapt to your own needs
Submitting, cancelling and stats of batch jobs
- The job script file is submitted with the command: `sbatch simple_serial.bash`
- List all your jobs that are queuing/running: `squeue -u $USER`
- Detailed info of a queuing/running job: `scontrol show job <jobid>`
- A job can be cancelled with the command: `scancel <jobid>`
- Display the used resources of a completed job: `seff <jobid>`
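As a concrete sketch of this cycle (the script name and job ID are placeholders; `sbatch`, `squeue`, `scontrol` and `scancel` are standard Slurm commands, and `seff` is the job efficiency tool available on CSC systems):

```bash
# Submit the job script and note the job ID that Slurm prints
sbatch simple_serial.bash

# Check your own queuing and running jobs
squeue -u $USER

# Inspect a specific job in detail while it is queuing or running
scontrol show job 1234567

# Cancel the job if it is no longer needed
scancel 1234567

# After the job has finished, check how much of the requested
# resources (CPU time, memory) it actually used
seff 1234567
```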
Available batch job partitions
- The available batch job partitions are listed in docs.csc.fi
- In order to use the resources in an efficient way, it is important to estimate the request as accurately as possible
- By avoiding excessive “just-in-case” requests, the job will start earlier
Different types of HPC jobs
- Typically an HPC job can be classified as serial, parallel or GPU, depending on the main requested resource
- The following slides will provide you with an overview of different job types
- A serial job is the simplest type of job, whereas parallel and GPU jobs may require some advanced methods to fully utilise their capacity
- If you use already installed software, check whether it needs resources as a serial, parallel or GPU job
HPC serial jobs
Serial software can only use one core, so don’t reserve more!
Why could your serial job benefit from being executed using CSC’s resources instead of on your own computer?
- Part of a larger workflow
- Avoid data transfer between CSC and your own computer
- Data sharing among other project members
- CSC’s software licensing
- It was already installed
- Memory and/or disk demands
Running multiple serial jobs
- You can utilize parallel resources for running multiple serial jobs at the same time (e.g. as an array job, sketched after this list)
- Pure serial resources are only available in Puhti, but GREASY jobs can make serial jobs suitable for Mahti, too
- But, for this to be an efficient use of resources, your workflow needs to fill (at least) one Mahti node and keep the CPUs busy for the whole duration of the job
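One common way to run many independent serial tasks at once on Puhti (alongside GREASY) is a Slurm array job; a minimal sketch, where the input/output naming scheme and the program are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --account=project_20001234   # placeholder project
#SBATCH --partition=small
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --array=1-50                 # run 50 independent serial tasks

# Each array task gets its own SLURM_ARRAY_TASK_ID (1..50) and
# processes its own input file (placeholder naming scheme)
srun my_program --input data_${SLURM_ARRAY_TASK_ID}.txt \
                --output result_${SLURM_ARRAY_TASK_ID}.txt
```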
HPC parallel jobs
- A parallel job distributes the calculation over several cores in order to achieve a shorter wall time (and/or a larger allocatable memory)
- There are two major parallelization schemes: OpenMP and MPI
- Note that, depending on the parallelization scheme, there is a slight difference in how the resource reservation is done (see the excerpts after this list)
- Batch job script how-to create and examples for Puhti
- Batch job script how-to create and examples for Mahti
- The best starting point: Software specific batch scripts in docs
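To illustrate that difference, below are two hedged excerpts (not tested templates; partition names, core counts and program names are placeholders, and the software-specific scripts in the docs remain the best starting point). An OpenMP program uses threads within one node, so cores are requested with `--cpus-per-task`; an MPI program uses separate tasks, requested with `--ntasks`.

```bash
## Excerpt from an OpenMP batch script (threads within one node):
## one task, several cores for that task
#SBATCH --partition=small
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # thread count follows the reservation
srun my_openmp_program

## Excerpt from a separate MPI batch script (independent tasks,
## possibly spread across nodes): many tasks, one core each
#SBATCH --partition=large
#SBATCH --ntasks=80
srun my_mpi_program
```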
HPC GPU jobs
- A graphics processing unit (GPU, a video card) is capable of doing certain types of simultaneous calculations very efficiently
- In order to take advantage of this power, a computer program must be adapted to how a GPU handles data
- CSC’s GPU resources are relatively scarce and hence should be used with particular care
- A GPU uses 60 times more billing units than a single CPU core - see above for performance requirements
- In practice, 1-10 CPU cores (but not more) should be allocated per GPU on Puhti
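A minimal sketch of a Puhti GPU request following the `--gres=gpu:v100:<number>` convention in the CSC documentation; the project, runtime and the program itself are placeholders, so check the docs for the exact partition and options:

```bash
#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --account=project_20001234   # placeholder project
#SBATCH --partition=gpu              # Puhti GPU partition
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10           # at most ~10 CPU cores per GPU
#SBATCH --gres=gpu:v100:1            # request one V100 GPU

srun my_gpu_program
```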
Interactive jobs
- When you login to CSC’s supercomputers, you end up in one of the login nodes of the computer
- If you have a heavier job that still requires interactive response (e.g. a graphical user interface)
    - Allocate the resources via the interactive partition
    - This way your work is performed on a compute node, not on the login node (see the sketch below)
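On Puhti one way to do this is the `sinteractive` wrapper described in the CSC docs; a minimal sketch, where the project and the requested resources are placeholders (check `sinteractive --help` or the docs for the current options):

```bash
# Request an interactive session on a compute node
# (placeholder project, two hours, 8 GB memory, 100 GB local disk)
sinteractive --account project_20001234 --time 02:00:00 --mem 8000 --tmp 100

# Commands you run after this execute on the allocated compute node,
# not on the shared login node
```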