Not fast enough? How HPC can help.

All materials (c) 2020-2023 by CSC – IT Center for Science Ltd. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, http://creativecommons.org/licenses/by-sa/4.0/

The purpose of large computers

  • Typically, large computers like those at CSC are not much faster than personal ones – they are simply bigger
    • For fast computation, they utilize parallelism (and typically have special disk, memory and network solutions, too)
  • Parallelism simplified:
    • You use hundreds of ordinary computers simultaneously to solve a single problem

First steps for fast jobs (1/2)

  • Spend a little time to investigate:
    • Which of the available software packages would be best suited to the kind of problem you have?
  • Consider:
    • The software that solves your problem fastest might not always be the best choice
      • Issues like ease of use and compute power/memory/disk demands are also highly relevant
    • Quite often it is useful to start simple and gradually use more complex approaches if needed

First steps for fast jobs (2/2)

  • When you’ve found the software you want to use, check if it is available at CSC as a pre-installed optimized version
    • Familiarize yourself with the software manual, if available
  • If the software you need is distributed through Conda, the environment has to be containerized
    • Containerizing greatly improves startup performance and can be done easily with the Tykky wrapper
  • If you can’t find suitable software, consider writing your own code

Optimize the performance of your own code

Running your software

  • It is not only how your software is written and compiled that affects performance
  • The way the software is run also matters

HPC parallel jobs

Running in parallel

What is MPI?

  • MPI (Message Passing Interface) is a widely used standard for writing software that runs in parallel
  • MPI utilizes parallel processes that do not share memory
    • To exchange information, the processes pass data as messages to each other
    • Communication can be a performance bottleneck
  • MPI is required when running on multiple nodes
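
As a concrete illustration, below is a minimal MPI program in C in which rank 0 passes one integer to rank 1 as a message. This is only a sketch: compile it with an MPI compiler wrapper (e.g. mpicc) and run it with at least two processes, e.g. with srun or mpirun.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, value;

        MPI_Init(&argc, &argv);                 /* start the MPI runtime     */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* id of this process        */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        if (rank == 0) {
            value = 42;
            /* processes do not share memory, so data is sent as a message */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank %d of %d received %d\n", rank, size, value);
        }

        MPI_Finalize();
        return 0;
    }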

What is OpenMP?

  • OpenMP (Open Multi-Processing) is a standard for shared-memory parallelism: the work is divided among threads that all see the same memory
    • Threads therefore do not need to send messages to each other
  • OpenMP is easier for beginners, but problems can quickly arise from so-called race conditions
    • These occur when different threads read and update the same data without proper synchronization
  • OpenMP is restricted to a single node
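
A minimal OpenMP sketch in C: the loop iterations are divided among threads that share the same array, and the reduction clause prevents a race condition on the shared variable sum. Compile with OpenMP enabled (e.g. the -fopenmp flag of GCC).

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;

        /* each thread handles its own share of the iterations;
           all threads see the same array because memory is shared */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        /* without reduction(+:sum), the threads would update sum
           simultaneously -> a race condition and a wrong result */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
        return 0;
    }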

Self study materials for OpenMP and MPI

  • There are many tutorials available online
    • A simple web search for e.g. “MPI tutorial” will get you started
  • Check the documented exercise material and model answers from the CSC course “Introduction to Parallel Programming”

Task farming – running multiple independent jobs simultaneously

  • Task farming == running many similar independent jobs simultaneously
  • If subtasks are few (<100), an easy solution is array jobs (see the sketch after this list)
    • Individual tasks should run >30 minutes. Otherwise, you’re generating too much overhead → consider another solution
    • Array jobs create job steps, and with thousands of tasks the Slurm database will get overloaded → consider another solution
  • If running your jobs gets more complex, requiring e.g. dependencies between subtasks, workflow tools can be used
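
Inside an array job, each task can read its own index from the SLURM_ARRAY_TASK_ID environment variable that Slurm sets and use it to pick its input. A minimal sketch in C follows; the input file naming scheme is hypothetical.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Slurm sets this variable for every task of an array job */
        const char *id = getenv("SLURM_ARRAY_TASK_ID");
        if (id == NULL) {
            fprintf(stderr, "Not running inside a Slurm array job\n");
            return 1;
        }

        /* hypothetical naming scheme: input_1.dat, input_2.dat, ... */
        char filename[64];
        snprintf(filename, sizeof(filename), "input_%s.dat", id);
        printf("Task %s processes %s\n", id, filename);

        /* ... read the input file and do the actual work here ... */
        return 0;
    }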

Task farming 2.0

  • Before opting for a workflow manager, check if the code you run has built-in high-throughput features
    • Many chemistry codes (CP2K, GROMACS, Amber, etc.) provide built-in methods for efficient task farming
    • So do Python and R, if you write your own code
  • Task farming can be combined with e.g. OpenMP to accelerate sub-jobs
    • HyperQueue is the best option for sub-node task scheduling (non-MPI)
  • Finally, MPI itself can be used to run several jobs in parallel (see the sketch after this list)
    • Three levels of parallelism: this requires skill and time to set up
    • Always test before scaling up – a small mistake can result in lots of wasted resources!
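
To illustrate the MPI level, a common pattern is to distribute independent tasks over the MPI ranks in a static round-robin fashion, so that rank r runs tasks r, r+size, r+2*size, and so on. The sketch below assumes a hypothetical run_task routine standing in for your own sub-job.

    #include <mpi.h>
    #include <stdio.h>

    /* hypothetical placeholder for one independent sub-job */
    static void run_task(int task_id)
    {
        printf("running task %d\n", task_id);
    }

    int main(int argc, char **argv)
    {
        int rank, size, ntasks = 100;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* static round-robin distribution of the independent tasks */
        for (int t = rank; t < ntasks; t += size)
            run_task(t);

        MPI_Finalize();
        return 0;
    }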

Things to consider in task farming

  • In a big allocation, each computing core should have work to do
    • If the separate tasks are different, some might finish before the others, leaving some cores idle → waste of resources
    • Try combining many small jobs into fewer, bigger ones
  • As always, estimate as accurately as possible the memory and run time the separate tasks will need

GPUs can speed up jobs

  • GPUs, or Graphics Processing Units, are extremely powerful processors developed for graphics and gaming
  • They can be used for science, but are often challenging to program
    • Not all algorithms can use the full power of GPUs
  • Check from the manual whether the software can utilize GPUs; don’t request GPUs if you’re unsure
  • Does your code run on AMD GPUs? LUMI has a massive GPU capacity!

Tricks of the trade 1/4

  • Using the fastest computers available is a reasonable way to seek the best performance, but it is not the only thing that matters
  • Different codes may give very different performance for a given use case
  • Before launching massive simulations, look for the most efficient algorithms to get the job done

Tricks of the trade 2/4

  • Well-known boosters are:
    • Enhanced sampling methods vs. brute force molecular dynamics
    • Machine learning methods
      • E.g. Bayesian optimization structure search (BOSS, potential energy maps)
    • Start with coarser models and gradually increase precision (if needed)
      • E.g. pre-optimize molecular geometries using a small basis set
    • When starting a new project, begin with small/fast tests before scaling up
      • Don’t submit large jobs before knowing that the setup works as intended
    • When using separate runs to scan a parameter space, start with a coarse scan, and improve resolution where needed
      • Be mindful of the number of jobs/job steps, use meta-schedulers if needed
    • Try to use or implement checkpoints/restarts in your software, and check results between restarts
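
A minimal sketch of the checkpoint/restart idea in C: the program writes its progress to a file at regular intervals and, when restarted, resumes from the last saved step instead of starting over. The file name, the interval and the state variable are hypothetical.

    #include <stdio.h>

    int main(void)
    {
        int start = 0, nsteps = 1000000;
        double state = 0.0;

        /* if a checkpoint exists, resume from it instead of starting over */
        FILE *fp = fopen("checkpoint.dat", "r");
        if (fp != NULL) {
            fscanf(fp, "%d %lf", &start, &state);
            fclose(fp);
        }

        for (int step = start; step < nsteps; step++) {
            state += 1.0e-6;              /* stand-in for the real computation */

            if (step % 10000 == 0) {      /* write a checkpoint periodically */
                fp = fopen("checkpoint.dat", "w");
                fprintf(fp, "%d %f\n", step + 1, state);
                fclose(fp);
            }
        }

        printf("final state = %f\n", state);
        return 0;
    }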

Tricks of the trade 3/4

  • Try to start formulating your scientific results as soon as you have a minimal amount of computational data
    • This helps to clarify what you still need to compute, which computations would be redundant, and what data you need to store
  • Reserving more memory and/or more compute cores does not necessarily mean faster computations
    • Check with seff, sacct and software-specific log files whether the memory was actually used and whether the job ran faster
    • Testing for the optimal number of cores and amount of memory is advised before launching massive computations

Tricks of the trade 4/4

  • If possible, running the same job on a laptop may be useful for comparison
  • Avoid unnecessary reads and writes of data and containerize Conda environments to improve I/O performance
    • Read and write in big chunks and avoid reading/writing lots of small files
  • Avoid very short jobs to minimize queuing and scheduling overhead
    • There’s a time overhead in setting up a batch job, aim for >30 minute jobs
    • Don’t run too many/short job steps – they will bloat Slurm accounting
  • Don’t run too long jobs without a restart option
    • Increased risk of something going wrong, resulting in lost time/results