Brief introduction to HPC environments

All material (C) 2020-2021 by CSC – IT Center for Science Ltd. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, http://creativecommons.org/licenses/by-sa/4.0/

Notes on vocabulary

Roughly you can think of

  • computer ~= node
  • processor ~= socket
  • core ~= CPU

Cluster systems

  • Login nodes are used to set up jobs (and to launch them)
  • Jobs run on the compute nodes
  • A batch job system (aka scheduler) is used to run and manage the jobs
    • On CSC machines we use Slurm
    • Other common systems include SGE and Torque/PBS
    • Syntax is different, but basic operation is similar

Available HPC resources

  • Puhti is the general purpose supercomputer
  • Mahti is the massively parallel flagship supercomputer
  • Pouta provides cloud resources via OpenStack (IaaS)
  • Rahti provides containers via OKD (PaaS)
  • Allas provides object storage for all services

Which supercomputer to use?

  • What kind of resources can your application use?
    • Can it use more than one core?
    • How much memory will it need?
    • Can it use GPUs or fast local disk (NVMe)?
    • What is the time-limiting part of your job?
  • See what kind of resources are available
    • Is your code already installed?
    • Max. runtime, partitions (queues), provisioning policy (per core/per node/other)
    • Each system is different, so check the documentation

Quick and dirty comparison of Puhti and Mahti

                                      Puhti        Mahti
Number of preinstalled applications   123+         16+
Cores per node                        40           128
Job size (min-max, cores)             1-1040       128-25600
Memory per node (GiB)                 192-1536     256
GPU cards (NVIDIA)                    120 x V100   96 x A100
Fast node local disk (NVMe)           120 nodes    24 GPU nodes

In short: Mahti is for much larger parallel jobs; be prepared to install and optimize your code. (Still, a single Puhti node is >10x your laptop.)

Disk areas in CSC HPC environment

In this section, you will learn how to manage the different disk areas in the HPC environment at CSC

Overview of disk areas

  • Main disk areas and their specific uses in Puhti/Mahti
  • Moving data between supercomputers
  • Understanding quotas (both usable space and number of files) for different disk areas
  • Additional fast disk areas

Disk and storage overview

Main disk areas in Puhti/Mahti

  • Home directory ($HOME)
    • Other users cannot access your home directory
  • ProjAppl directory (/projappl/project_name)
    • Shared with project members
    • Possible to limit access (chmod g-rw) in subfolders
  • Scratch directory (/scratch/project_name)
    • Shared with project members
    • Files older than 90 days will be automatically removed
  • These directories reside on Lustre parallel file system
  • Default quotas and more info are given in the disk areas section of the documentation

Moving data between and to/from supercomputers
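
  • Puhti and Mahti have separate file systems, so data must be copied over the network (e.g. with rsync or scp), or shared via Allas
  • A minimal sketch of copying a directory from Puhti to Mahti with rsync, run on Puhti (the project number, username and paths are placeholders you must replace):

rsync -avz /scratch/project_20001234/results/ \
    yourusername@mahti.csc.fi:/scratch/project_20001234/results/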

Displaying current status of disk areas

  • Use the csc-workspaces command to display your available projects and quotas

Disk and storage overview (revisited)

Additional fast local disk areas

  • $TMPDIR on login nodes
    • Each login node has 2900 GiB of fast local storage, $TMPDIR
    • The local storage is meant for temporary use and is cleaned frequently
  • NVMe on part of the compute nodes in Puhti
    • Interactive batch job nodes, IO nodes and GPU nodes have fast local storage (NVMe) available as $LOCAL_SCRATCH
    • You must copy data in and out during your batch job: the NVMe disk is accessible only during your job allocation (see the sketch below)
    • If your job reads or writes a lot of small files, using this disk can give a 10x performance boost
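
A minimal sketch of using the NVMe disk in a Puhti batch job; local disk is requested with --gres=nvme:<size in GB> and the path is available as $LOCAL_SCRATCH (the project, archive and program names are placeholders):

#!/bin/bash
#SBATCH --account=project_20001234    # Billing project (placeholder).
#SBATCH --partition=small
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --gres=nvme:100               # Request 100 GB of fast local NVMe disk.

# Unpack the (hypothetical) input archive onto the fast local disk.
tar xf /scratch/project_20001234/input.tar -C $LOCAL_SCRATCH

# Run the computation against the local copy; myprog is a placeholder
# that reads the input directory and writes an output directory.
srun myprog $LOCAL_SCRATCH/input $LOCAL_SCRATCH/output

# Copy the results back to /scratch before the allocation ends.
tar cf /scratch/project_20001234/output.tar -C $LOCAL_SCRATCH output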

What are the different disk areas for?

  • Allas - for data that is not actively used
  • HOME - small, thus only for the most important (small) files; personal access only
  • scratch - main working area, shared with project members
  • projappl - not cleaned up, e.g. for shared binaries
  • Login node local tmp - compiling, temporary files, fast IO
  • NVMe - fast IO in batch jobs

Some best practice tips

  • Don’t put databases on Lustre (projappl, scratch, home)
    • Use other CSC services like kaivos or MongoDB in cPouta instead
  • Don’t create a lot of files in one folder
  • Don’t create a lot of files overall (if you’re creating tens of thousands of files, you should probably rethink your workflow)
  • Take backups of important files: data on CSC disks is not backed up, even though the systems are fault tolerant
  • When working with a large number of small files, consider using the fast local disks
  • See the best practice performance tips for using Lustre in the documentation

The batch job system in CSC’s HPC environment

What is a batch job? 1/2

  • On a laptop we are used to starting a program (job) by clicking an icon, and the job starts instantly
  • If we start many jobs at the same time, we occasionally run into problems like running out of memory
  • In an HPC environment the computer is shared among hundreds of other users, all with different resource needs
  • HPC batch jobs include an estimate (requirement) of how many resources they are expected to use

What is a batch job? 2/2

  • A batch job consists of two parts: a resource request and the actual computing step
  • A job is not started directly, but is sent into a queue
  • Depending on the requested resources and load, the job may need to wait to get started
  • At CSC (and HPC systems in general) all heavy computing must be done via batch jobs (see Usage policy)

What is a batch job system?

  • A resource management system that keeps track of all batch jobs that use, or would like to use, the computing resources
  • Aims to share the resources in an efficient and fair way
  • Optimizes resource usage by filling compute nodes with the most suitable jobs

Queueing and fair share of resources

  • A job is queued and starts when the requested resources become available
  • The order in which the queued jobs start depends on their priority and available resources
  • At CSC the priority is configured to use “fair share”
    • The initial priority of a job decreases if the user has recently run lots of jobs
    • Over time (while queueing) its priority increases and eventually it will run
    • Some queues have a lower priority (like longrun – use shorter if you can!)
  • See the Getting started with running jobs section of our main documentation at docs.csc.fi

Schematic of how the batch job scheduler works

The batch job system in CSC’s HPC environment

  • CSC uses a batch job system (Slurm) to manage jobs
  • Slurm is used to control how the overall computing resources are shared among all projects in an efficient and fair way
  • Slurm controls how a single job request gets resources, like:
    • computing time
    • number of cores
    • amount of memory
    • other resources like GPUs, local disk, etc.

Example serial batch job script for Puhti

  • A batch job is a shell script (bash) that consists of two parts:
    • A resource request flagged with #SBATCH and the actual computing step(s)
#!/bin/bash
#SBATCH --job-name=print_hostname     # Defines the job name shown in the queue.
#SBATCH --time=00:01:00               # Defines the max time the job can run.
#SBATCH --partition=test              # Defines the queue in which to run the job.
#SBATCH --ntasks=1                    # Defines the number of tasks.
#SBATCH --cpus-per-task=1             # Number of cores is ntasks * cpus-per-task.
#SBATCH --account=project_20001234    # Defines the billing project. Mandatory field.

srun echo "Hello $USER! You are on node $HOSTNAME"
  • The options have been described in Create batch jobs for Puhti
    • The actual program is launched using the srun command
    • The content above could be copied into a file like simple_serial.bash and put into the queue with the command sbatch simple_serial.bash

Use an application specific batch script template

  • The application list in docs contains example scripts for some software
  • Use these as the starting point for your own scripts
  • They have been tested and optimized (although for minimal resources) for that application
    • Consult the manual or other examples to adapt to your own needs

Submitting, cancelling and stats of batch jobs

  • The job script file is submitted with the command:
    • sbatch example_job.sh
  • List all your jobs that are queuing/running:
    • squeue -u $USER
  • Detailed info of a queuing/running job:
    • scontrol show job <jobid>
  • A job can be deleted using the command:
    • scancel <jobid>
  • Display the used resources of a completed job:
    • seff <jobid>
  • A minimal example session using these commands is sketched below
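
A minimal example session, assuming the serial script from earlier was saved as simple_serial.bash (the job ID 123456 is illustrative):

sbatch simple_serial.bash    # Slurm replies: Submitted batch job 123456
squeue -u $USER              # The job is listed as PD (pending) or R (running)
seff 123456                  # After the job finishes: CPU and memory efficiency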

Available batch job partitions

  • The available batch job partitions are listed in docs.csc.fi
  • In order to use the resources efficiently, it is important to estimate your request as accurately as possible
  • By avoiding excessive “just-in-case” requests, your job will start earlier

Different type of HPC jobs

  • Typically an HPC job can be classified as serial, parallel or GPU, depending on the main requested resource
  • The following slides will give you an overview of the different job types
  • A serial job is the simplest type of job, whereas parallel and GPU jobs may require some advanced methods to fully utilise their capacity
  • If you use already installed software, check whether it needs resources as a serial, parallel or GPU job

HPC serial jobs

  • A serial program can only use one core, so don’t reserve more!

  • Why could your serial job benefit from being executed using CSC’s resources instead of on your own computer?

    • Part of a larger workflow
    • Avoid data transfer between CSC and your own computer
    • Data sharing among other project members
    • CSC’s software licensing
    • It was already installed
    • Memory and/or disk demands

Running multiple serial jobs

  • You can utilize parallel resources to run multiple serial jobs at the same time, e.g. as an array job (see the sketch below)
  • Pure serial resources are only available in Puhti, but GREASY jobs can make serial workloads suitable for Mahti, too
    • However, your workflow needs to fill (at least) one Mahti node and keep the CPUs busy for the duration of the job to be an efficient use of resources
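
A minimal sketch of running many serial jobs at once as a Slurm array job on Puhti (the program name, input files and project are placeholders):

#!/bin/bash
#SBATCH --account=project_20001234    # Billing project (placeholder).
#SBATCH --partition=small
#SBATCH --time=00:30:00
#SBATCH --ntasks=1                    # Each array task is a one-core serial job.
#SBATCH --array=1-50                  # Run 50 copies, numbered 1..50.

# Each task processes its own (hypothetical) input file.
srun myprog input_${SLURM_ARRAY_TASK_ID}.dat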

HPC parallel jobs

  • A parallel job distributes the calculation over several cores in order to achieve a shorter wall time (and/or a larger allocatable memory)
  • There are two major parallelization schemes: OpenMP and MPI
    • Note that, depending on the parallelization scheme, there is a slight difference in how the resource reservation is done (see the sketch after this list)
  • How to create batch job scripts, with examples, for Puhti
  • How to create batch job scripts, with examples, for Mahti
  • The best starting point: software-specific batch scripts in docs
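
A minimal sketch of how the reservation differs between the two schemes (illustrative #SBATCH fragments, not complete scripts; the program names are placeholders):

# MPI: many tasks, one core each; srun starts one process per task
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1
srun ./my_mpi_prog

# OpenMP: one task with many cores; all threads stay within one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_openmp_prog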

HPC GPU jobs

  • A graphics processing unit (GPU, a video card) can perform certain types of simultaneous calculations very efficiently
  • In order to take advantage of this power, a computer program must be adapted to how a GPU handles data
  • CSC’s GPU resources are relatively scarce and hence should be used with particular care (see the sketch after this list)
    • A GPU uses 60 times more billing units than a single CPU core - see the performance requirements above
    • In practice, 1-10 CPU cores (but not more) should be allocated per GPU on Puhti
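
A minimal sketch of requesting one GPU on Puhti (the project and program names are placeholders; on Puhti the GPU type in the request is v100):

#!/bin/bash
#SBATCH --account=project_20001234    # Billing project (placeholder).
#SBATCH --partition=gpu
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10            # At most ~10 cores per GPU on Puhti.
#SBATCH --gres=gpu:v100:1             # Request one V100 GPU.

srun my_gpu_prog                      # Hypothetical GPU-enabled program.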

Interactive jobs

  • When you log in to CSC’s supercomputers, you end up on one of the login nodes of the computer
  • If you have a heavier job that still requires interactive response (e.g. a graphical user interface)
    • Allocate the resources via the interactive partition, as sketched below
    • This way your work is performed on a compute node, not on the login node
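
A minimal sketch of starting an interactive session on Puhti with CSC’s sinteractive helper (the project number is a placeholder; --mem is given in MB and --tmp requests local NVMe disk in GB):

sinteractive --account project_20001234 --time 02:00:00 --mem 8000 --tmp 100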