Mini-intro to the possibilities of using R on CSC’s supercomputer Puhti
Samantha Wittke, CSC (Geoinformatics specialist) 
CSC - IT Center for Science
- Non-profit company producing IT services for research and higher education
- Owned by the Ministry of Education and Culture (70%) and higher education institutions (30%)
- Headquarters in Keilaniemi, Espoo
- Side offices and supercomputers in Kajaani
CSC services
research.csc.fi/en/service-catalog
Compute & Analyze
Web services, virtual machines in the cloud: cPouta / ePouta / Rahti
Heavy computations on the supercomputer: Puhti / Mahti / LUMI
Teaching and collaborating: CSC Notebooks
Store, Share & Publish Data
Project lifetime data storage: Allas
Share and publish data: Fairdata
Share and publish geospatial data: Paituli
Working with privacy related data: Sensitive Data (SD) services
Why use CSC services?
- CSC specialist support
- “Outsource” heavy/specialized computations
- Free of charge for open science at Finnish universities and research institutes
Supercomputer
Main differences from your own computer:
- Not faster, but bigger
- For speed-up: parallelism
- Memory and CPU(/GPU) availability (application needs to make use of this!)
- Non-interactive for heavy computations
- Resource knowledge
Possibilities - supercomputer
- Use more memory/CPU/GPU than your own computer has available
→ analyse large files, train machine learning models
- Speed up so-called embarrassingly parallel analyses (many identical, but separate tasks)
→ doing the same thing to multiple map tiles / data chunks
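The embarrassingly parallel pattern above can be sketched in a few lines of R. This is a minimal illustration, not a Puhti-specific recipe: it uses the `parallel` package that ships with base R, and `process_tile()`, the tile IDs and the fake pixel values are hypothetical placeholders for real per-tile work.

```r
library(parallel)  # ships with base R, no extra installation needed

# Hypothetical stand-in for real per-tile work
# (e.g. reading one raster tile and summarising it).
process_tile <- function(tile_id) {
  set.seed(tile_id)        # reproducible fake data per tile
  values <- runif(1000)    # placeholder for the tile's pixel values
  data.frame(tile = tile_id, mean_value = mean(values))
}

tile_ids <- 1:8            # illustrative tile identifiers

# Start two worker processes; adjust to the number of cores you reserved.
cl <- makeCluster(2)
results <- parLapply(cl, tile_ids, process_tile)
stopCluster(cl)

# Each task returned one row; combine them into a single table.
summary_table <- do.call(rbind, results)
print(summary_table)
```

Because every tile is processed independently, the same function can be run on one core with `lapply()` first and moved to several cores only once it works.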
Puhti supercomputer - Basics
Puhti supercomputer - Applications
- CloudCompare
- FORCE
- GDAL/OGR
- GRASS GIS
- LAStools
- MATLAB
- OpenDroneMap
- Orfeo Toolbox
- PCL
- PDAL
- Python geospatial packages: geoconda
- QGIS
- R geospatial packages: r-env
- SAGA GIS
- SNAP, Sen2cor, sen2mosaic
- WhiteboxTools
- Zonation
- Deep learning: PyTorch, TensorFlow
Something missing? Ask us :) servicedesk@csc.fi
Puhti supercomputer - Data availability
- Large commonly used geospatial datasets with open license
- Removes transfer bottleneck
- Located at: /appl/data/geo/
- All Puhti users have read access
- ~13 TB of datasets available:
- Paituli data
- SYKE open datasets
- LUKE Multi-source national forest inventory
- Virtual rasters for NLS DEMs
- Sentinel and Landsat mosaics
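A quick way to see what is available is to list the shared directory from R. The sketch below only uses the path given on this slide and makes no assumption about the folder names inside it; on any machine other than Puhti it simply reports that the directory is missing.

```r
geo_dir <- "/appl/data/geo/"   # path taken from the slide above

if (dir.exists(geo_dir)) {
  # Top-level entries only; the actual folder names depend on what CSC provides.
  datasets <- list.files(geo_dir)
  print(datasets)
} else {
  datasets <- character(0)
  message("Directory not found: this listing only works on Puhti.")
}
```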
Running your own R script in Puhti
- Get a CSC user account.
- Log in to the Puhti web interface (www.puhti.csc.fi).
- Move your data and scripts to Puhti.
- Open RStudio.
- Check R package availability.
- Fix paths of your input/output files.
- Test your script with some test data.
…
Make use of the power of Puhti
…
- Write a batch job script.
- Run your scripts with all data as a batch job (or interactively).
- Make use of several cores using the future package in your R code, if needed.
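Some of the steps above (checking packages, fixing paths, finding out how many cores are available) can be done at the top of the R script itself. The following is a sketch under assumptions: the package names are just examples, `MY_PROJECT_DIR` is a hypothetical environment variable, and `SLURM_CPUS_PER_TASK` is assumed to be set by the Slurm batch system when the job reserves several cores per task.

```r
# 1. Check that the packages the script needs are installed (example names).
needed <- c("sf", "terra", "future")
available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(available)   # FALSE marks a package that is missing in this environment

# 2. Build all paths from one base directory instead of hard-coding laptop paths.
#    MY_PROJECT_DIR is a hypothetical variable; tempdir() is only a fallback
#    so that the sketch runs anywhere.
project_dir <- Sys.getenv("MY_PROJECT_DIR", unset = tempdir())
input_file  <- file.path(project_dir, "input", "test_data.csv")
output_dir  <- file.path(project_dir, "output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# 3. Use the number of cores reserved for the job (assumption: Slurm sets
#    SLURM_CPUS_PER_TASK), falling back to 1 outside a batch job.
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
if (is.na(n_cores)) n_cores <- 1L
message("Using ", n_cores, " core(s); writing results to ", output_dir)
```

Keeping paths and core counts in variables like this means the same script can be tested with a small dataset in RStudio and later run unchanged as a batch job.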
“My R code runs slow, what can be done?”
- Try to understand which part of the code takes time and why
- Use system.time() or the tictoc package
- Different R packages may provide the same functionality but be implemented differently (i.e. run faster/slower)
- e.g. prefer sf over sp and terra over raster.
- Always be suspicious of for-loops!
- Consider parallelization
- Understand that the number of cores != the speed-up multiplier
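Two of the points above can be tried out directly in base R. The first part times a for-loop against its vectorised equivalent with `system.time()`; the second part uses Amdahl's law, an idealised model that is not part of the original slides, to illustrate why doubling the cores rarely doubles the speed. Function names and numbers are illustrative only.

```r
# Part 1: time a for-loop against the vectorised version of the same task.
x <- runif(1e6)

squares_loop <- function(x) {
  out <- numeric(length(x))        # preallocate the result
  for (i in seq_along(x)) {
    out[i] <- x[i]^2               # one element at a time
  }
  out
}

squares_vec <- function(x) x^2     # whole vector in one call

t_loop <- system.time(res_loop <- squares_loop(x))
t_vec  <- system.time(res_vec  <- squares_vec(x))
print(t_loop["elapsed"])
print(t_vec["elapsed"])
stopifnot(isTRUE(all.equal(res_loop, res_vec)))   # same result either way

# Part 2: Amdahl's law (idealised model, ignores communication overhead).
# parallel_fraction = share of the runtime that can run in parallel.
amdahl_speedup <- function(parallel_fraction, n_cores) {
  1 / ((1 - parallel_fraction) + parallel_fraction / n_cores)
}

cores <- c(1, 2, 4, 8, 16)
print(data.frame(cores = cores, speedup = amdahl_speedup(0.9, cores)))
```

Under this model, a script in which 90% of the runtime can be parallelised gets a speed-up of about 6.4 on 16 cores, not 16. Real speed-ups are usually lower still because of data transfer and start-up overhead.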
Parallelization locally and on the supercomputer
Within R:
- use the future package
- or snow, foreach, Rmpi, …
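As a minimal sketch of the first option, the example below uses only the future package itself. The number of workers, the chunk IDs and the summing task are illustrative placeholders; in a real job the worker count would normally match the cores that were reserved.

```r
library(future)

# Run futures in two background R sessions (illustrative worker count).
plan(multisession, workers = 2)

chunk_ids <- 1:4

# Create one future per chunk; each is evaluated in a worker session.
# The summing task is a hypothetical stand-in for real per-chunk work.
futures <- lapply(chunk_ids, function(id) {
  future({
    as.numeric(sum(seq_len(id * 1000)))
  })
})

# value() blocks until the corresponding future has finished.
results <- vapply(futures, value, numeric(1))

plan(sequential)   # shut the background workers down again
print(results)
```

The same code runs sequentially if the `plan()` line is changed to `plan(sequential)`, which makes it easy to test on a laptop before moving to the supercomputer.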
Outside R:
How we can help
→ servicedesk@csc.fi
→ CSC as project partner / subcontractor
Summary - Why use a supercomputer?
⌛ Resource needs (time, memory, storage, GPU)
👾 “Outsource” heavy computations, keep own computer free
🏘 Prebuilt environments, application availability
📊 Run many experiments at the same time
🌐 Data availability
👥 Collaboration possibility
❓ CSC specialist support
💸 Free of charge for open science at Finnish universities and research institutes.