Tips, tricks, and being a good HPC citizen - High Performance Computing

Tips, tricks, and being a good HPC citizen

Being a good HPC citizen

Number of jobs

Please do not flood the queues with a large number of jobs at the same time.

The FairShare algorithm will generally ensure that every user and group gets the appropriate share of the computer by reducing the priority of jobs belonging to users/groups who have used more than their share in the recent past. It does, however, take time to adjust, so submitting a large number of jobs at once may still block the cluster for other users. One way of preventing your jobs from blocking too many nodes is to queue them up behind each other by creating a dependency. If you add the following line to the SLURM directives:

#SBATCH -d afterany:<previous-job>

this job will not start until the job with the ID <previous-job> has finished. You can also add the dependency after the job has been submitted (only if it has not started yet) with the scontrol command:

scontrol update jobid=<job-to-be-delayed> dependency=afterany:<previous-job>

In both cases, replace <previous-job> and <job-to-be-delayed> with the corresponding job ID (the number in the first column shown by squeue).
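When several jobs are submitted together, the chain can also be built at submission time by capturing each job ID and passing it to the next submission. The following is a minimal sketch rather than a site-provided tool: submit_chain is a hypothetical helper name and the script names in the usage comment are placeholders. It assumes sbatch's --parsable option, which prints only the job ID (optionally followed by a semicolon and the cluster name), and uses --dependency, the long form of -d.

```shell
# submit_chain SCRIPT...: submit the given job scripts so that each one
# waits for the previous one to finish (afterany = regardless of exit status).
# submit_chain is a hypothetical helper, not part of Slurm.
submit_chain() {
    local prev="" jobid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # The first job has nothing to wait for.
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency=afterany:"$prev" "$script") || return 1
        fi
        # --parsable may print "jobid;cluster"; keep only the job ID.
        jobid=${jobid%%;*}
        echo "$script -> $jobid"
        prev=$jobid
    done
}

# Example (placeholder file names):
# submit_chain step1.sh step2.sh step3.sh
```

Only one job of the chain is eligible to run at any time, so the chain never occupies more resources than a single job requests.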

Array jobs

If you need to run a large number of jobs that differ only in an easily scriptable way (such as the name of the input file), you can submit them as a job array. A good introduction to the syntax can be found .

If the number of jobs in the array is very large, please restrict the number that can run concurrently, ideally to a small number of nodes as in the following example:

#SBATCH --array=0-999%168

This will run 1000 jobs in total, but only up to 168 calculations at one time (which is one node if they are single-core jobs).
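Inside each array task, Slurm sets the environment variable SLURM_ARRAY_TASK_ID to that task's index, which is how the script can select its own input. A minimal sketch of such a job script follows; the file naming scheme (input_0000.dat and so on) and the program name my_program are assumptions made for illustration only.

```shell
#!/bin/bash
#SBATCH --array=0-999%168
#SBATCH --ntasks=1

# Slurm sets SLURM_ARRAY_TASK_ID to this task's index (0 to 999 here).
# The fallback value 0 only exists so the script can be tried outside Slurm.
task_id=${SLURM_ARRAY_TASK_ID:-0}

# Hypothetical naming scheme: input_0000.dat ... input_0999.dat
input=$(printf 'input_%04d.dat' "$task_id")
output=$(printf 'output_%04d.log' "$task_id")

echo "Task $task_id: $input -> $output"
# ./my_program "$input" > "$output"   # my_program is a placeholder
```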

For calculations that run only for a very short time, it may also make sense to group them into tasks that run multiple calculations in the same script, to reduce the overhead involved in starting and finishing jobs.
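One possible way to group short calculations is to let each array task work through a contiguous block of them. The sketch below assumes 1000 calculations handled 50 per task, which gives 20 array tasks; these numbers, the helper name chunk_bounds, and the program name my_program are all illustrative assumptions.

```shell
#!/bin/bash
#SBATCH --array=0-19
#SBATCH --ntasks=1

# chunk_bounds TASK_ID PER_TASK TOTAL
# Prints the first and last calculation index (0-based) for one array task.
# chunk_bounds is a hypothetical helper, not part of Slurm.
chunk_bounds() {
    local first=$(( $1 * $2 ))
    local last=$(( first + $2 - 1 ))
    # Clamp the final block when TOTAL is not a multiple of PER_TASK.
    if [ "$last" -ge "$3" ]; then
        last=$(( $3 - 1 ))
    fi
    echo "$first $last"
}

# The fallback value 0 only exists so the script can be tried outside Slurm.
task_id=${SLURM_ARRAY_TASK_ID:-0}
bounds=$(chunk_bounds "$task_id" 50 1000)
first=${bounds% *}
last=${bounds#* }

i=$first
while [ "$i" -le "$last" ]; do
    # ./my_program "input_$i.dat"   # my_program is a placeholder
    i=$(( i + 1 ))
done
echo "Task $task_id covers calculations $first to $last"
```

With these values the scheduler starts and finishes 20 jobs instead of 1000, while the same 1000 calculations are still carried out.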

Disk usage

Home directories

The home directories are backed up, but our backup capacity is limited. We therefore ask that only essential and irreplaceable data, such as source code or input files, be stored there. Quotas are currently not enforced, but if your data exceeds 100 GB, it will not be backed up.
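To see how close a home directory is to the 100 GB backup limit, its size can be checked with du. The following is a rough sketch only: the helper names usage_kib and over_limit are hypothetical, and it assumes the limit is counted in units of 1024^3 bytes, which may differ from how the backup system measures it.

```shell
# usage_kib DIR: print the disk usage of DIR in units of 1024 bytes.
usage_kib() {
    du -sk "$1" | awk '{print $1}'
}

# over_limit DIR LIMIT: succeed if DIR uses more than LIMIT * 1024^3 bytes.
# Both helpers are hypothetical names used for illustration.
over_limit() {
    [ "$(usage_kib "$1")" -gt $(( $2 * 1024 * 1024 )) ]
}

# Example:
# if over_limit "$HOME" 100; then
#     echo "Home directory exceeds the backup limit"
# fi
```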

Scratch directories

Calculations creating large amounts of data should be run in the scratch filesystem. You will have a directory at /sharedscratch/<user>. This is not backed up, and it is the user's responsibility to secure their data elsewhere. Hypatia is not meant for long-term storage, so please regularly delete data that is not needed on the cluster anymore.
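Finding candidates for deletion can be done with find. The sketch below only lists files and deletes nothing, so the result can be reviewed first; list_stale is a hypothetical helper name and the 90-day threshold in the usage comment is an arbitrary assumption, not a site policy.

```shell
# list_stale DIR DAYS: print regular files under DIR that were last modified
# more than DAYS days ago. Nothing is deleted.
# list_stale is a hypothetical helper used for illustration.
list_stale() {
    find "$1" -type f -mtime +"$2" -print
}

# Example (threshold chosen arbitrarily):
# list_stale "/sharedscratch/$USER" 90
```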
