The Slurm Job Scheduler
How to Run a Job
See the Hyak Documentation page for complete information on scheduling jobs in Hyak.
Klone compute resources are organized by account and partition. An account is similar to the ‘group’ you have joined, and a partition represents a specific type of compute resource (for instance, CPU or GPU).
To see the resources that you have access to, you can issue the hyakalloc command from the command prompt. You will see a table that looks like this:
~ $ hyakalloc
Account resources available to user: shrike
╭─────────┬───────────┬──────┬────────┬──────┬───────╮
│ Account │ Partition │ CPUs │ Memory │ GPUs │ │
├─────────┼───────────┼──────┼────────┼──────┼───────┤
│ coenv │ compute │ 40 │ 175G │ 0 │ TOTAL │
│ │ │ 0 │ 0G │ 0 │ USED │
│ │ │ 40 │ 175G │ 0 │ FREE │
├─────────┼───────────┼──────┼────────┼──────┼───────┤
│ coenv │ cpu-g2 │ 672 │ 5223G │ 0 │ TOTAL │
│ │ │ 202 │ 1090G │ 0 │ USED │
│ │ │ 470 │ 4133G │ 0 │ FREE │
├─────────┼───────────┼──────┼────────┼──────┼───────┤
│ coenv │ gpu-l40s │ 32 │ 364G │ 2 │ TOTAL │
│ │ │ 0 │ 0G │ 0 │ USED │
│ │ │ 32 │ 364G │ 2 │ FREE │
╰─────────┴───────────┴──────┴────────┴──────┴───────╯
Checkpoint Resources
╭──────────────────┬───────────────┬──────────────╮
│ │ CPUs │ GPUs │
├──────────────────┼───────────────┼──────────────┤
│ Idle: │ 2167 │ 194 │
╰──────────────────┴───────────────┴──────────────╯
Checkpoint is currently limited to 11068 jobs
~ $
This user has access to the resources in the ‘coenv’ account/group. The coenv group has nodes in the ‘compute,’ ‘cpu-g2,’ and ‘gpu-l40s’ partitions. Run the sinfo command to see a complete list of all partitions.
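For example, the following commands list each partition with its time limit, node count, and node state, or restrict the listing to a single partition. The format string uses standard Slurm options and is shown only as an illustration:
sinfo -o "%P %l %D %t"    # partition, time limit, node count, node state
sinfo -p compute          # show only the compute partition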
Klone uses the Slurm scheduler to manage access to the compute resources. Slurm jobs are submitted from the login nodes and run on the compute nodes. In general, you will submit ‘batch’ (non-interactive) jobs directly to the compute nodes, but you can also launch ‘interactive’ jobs, which give you access to the command line on a compute node so you can interact with your compute processes. You may also create ‘recurring’ jobs that run on a predefined schedule. When a batch job completes, Slurm can send you a notification email if you request one with the --mail-type and --mail-user options shown in the batch script below.
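Recurring jobs are typically managed with Slurm’s scrontab command (scrontab -e opens your entries for editing). The entry below is only a sketch; the account, partition, time limit, schedule, and script path are placeholders to be replaced with your own values:
#SCRON --account=coenv
#SCRON --partition=compute
#SCRON --time=00:30:00
# Run a (placeholder) script every day at 02:00.
0 2 * * * /path/to/my_nightly_job.sh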
Interactive Jobs
To run an interactive job from the login node, use the salloc command:
salloc -A coenv -p compute -N 1 -c 4 --mem=10G --time=1:00:00
This command will allocate a single compute node (-N 1) with 4 processor cores (-c 4) and 10 gigabytes of memory (--mem=10G) for one hour (--time=1:00:00). Once the allocation is granted, you will find yourself in a command line shell on the allocated compute node. You may use this shell to run your compute processes.
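For example, a short interactive session might look like the following; my_analysis.py is a placeholder for your own program, and you may need to load a module before running it:
salloc -A coenv -p compute -N 1 -c 4 --mem=10G --time=1:00:00
hostname                  # confirm which compute node you are on
python my_analysis.py     # run your work (placeholder program)
exit                      # release the allocation when you are done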
Batch Jobs
Batch jobs are submitted from the login node using the sbatch command, which executes a job script that you provide. An example script to set up a batch job looks like this:
#!/bin/bash
#SBATCH --job-name=<name>
#SBATCH --mail-type=<status>
#SBATCH --mail-user=<email>
#SBATCH --account=coenv
#SBATCH --partition=compute
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks-per-node=<cores_per_node>
#SBATCH --mem=<size[unit]>
#SBATCH --gpus=<type:quantity>
#SBATCH --time=<time> # Max runtime in DD-HH:MM:SS format.
#SBATCH --chdir=<working directory>
#SBATCH --export=all
#SBATCH --output=<file> # where STDOUT goes
#SBATCH --error=<file> # where STDERR goes
# Modules to use (optional).
<e.g., module load apptainer>
# Your programs to run.
<my_programs>
The fields in angle brackets should be replaced with your own values. Save the script with a name like mybatch.slurm, then submit the batch job with the command sbatch mybatch.slurm.
Note that there are two lines in the script that were not necessary under the previous cluster, mox:
#SBATCH --account=coenv
#SBATCH --partition=compute
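As a concrete sketch, a filled-in script might look like the following; the job name, email address, resource requests, and program are placeholder values, not recommendations:
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_netid@uw.edu
#SBATCH --account=coenv
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=10G
#SBATCH --time=0-02:00:00   # two hours, in DD-HH:MM:SS format
#SBATCH --output=my_job_%j.out
#SBATCH --error=my_job_%j.err
# Modules to use (optional).
module load apptainer
# Your program to run (placeholder).
./my_program
After saving the script, submit it and monitor it from the login node:
sbatch mybatch.slurm      # submit the job; sbatch prints the job ID
squeue -u $USER           # list your queued and running jobs
scancel <jobid>           # cancel a job, using the ID printed by sbatch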
Modules
Modules give Hyak an easy ‘plug-in’ way to make additional software packages available. To see which modules are currently available, run the command module avail from an interactive shell on a compute node. Note that this command will not work on the login node and will instead give you a warning message.
The Research Computing group maintains a large variety of useful modules, such as compilers (gcc, g++, gfortran), programming and data languages (R, Python), and libraries. Additional modules are also supplied by the community at large, but these may not have full support available. In the module avail listing, a (D) marks the default version of a module, which is the version loaded when you do not specify one.
To load a module, use the command module load <modulename>; to unload a module, use module unload <modulename>. These commands can be included at the beginning of your batch scripts to ensure that any required software is available when your batch job runs.
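For example (the R module name is only a guess at what module avail might show; check the listing for the exact name and version on your system):
module avail R            # list available modules whose names match R
module load R             # load the module into your environment
module list               # show the modules currently loaded
module unload R           # unload it when you no longer need it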