The Slurm Job Scheduler
How to Run a Job
See the Hyak Documentation page for complete information on scheduling jobs in Hyak.
Klone compute resources are organized by Account and Partition. An account is similar to the ‘group’ you have joined and a Partition represents a specific type of compute resource (for instance CPU or GPU).
To see the resources that you have access to, you can issue the hyakalloc command from the command prompt. You will see a table that looks like this:
~ $ hyakalloc
Account resources available to user: shrike
╭─────────┬───────────┬──────┬────────┬──────┬───────╮
│ Account │ Partition │ CPUs │ Memory │ GPUs │ │
├─────────┼───────────┼──────┼────────┼──────┼───────┤
│ coenv │ compute │ 40 │ 175G │ 0 │ TOTAL │
│ │ │ 0 │ 0G │ 0 │ USED │
│ │ │ 40 │ 175G │ 0 │ FREE │
├─────────┼───────────┼──────┼────────┼──────┼───────┤
│ coenv │ cpu-g2 │ 672 │ 5223G │ 0 │ TOTAL │
│ │ │ 202 │ 1090G │ 0 │ USED │
│ │ │ 470 │ 4133G │ 0 │ FREE │
├─────────┼───────────┼──────┼────────┼──────┼───────┤
│ coenv │ gpu-l40s │ 32 │ 364G │ 2 │ TOTAL │
│ │ │ 0 │ 0G │ 0 │ USED │
│ │ │ 32 │ 364G │ 2 │ FREE │
╰─────────┴───────────┴──────┴────────┴──────┴───────╯
Checkpoint Resources
╭──────────────────┬───────────────┬──────────────╮
│ │ CPUs │ GPUs │
├──────────────────┼───────────────┼──────────────┤
│ Idle: │ 2167 │ 194 │
╰──────────────────┴───────────────┴──────────────╯
Checkpoint is currently limited to 11068 jobs
~ $
This user has access to the resources in the ‘coenv’ account/group. The coenv group has nodes in the ‘compute’, ‘cpu-g2’, and ‘gpu-l40s’ partitions. Run the sinfo command to see a complete list of all partitions.
Klone uses the Slurm scheduler to manage access to the compute resources. Slurm jobs are submitted from the login nodes and run on the compute nodes. In general you will submit ‘batch’ (non-interactive) jobs directly to the compute nodes, but you can also launch ‘interactive’ jobs, which give you a command-line shell on a compute node so you can interact with your compute processes, and ‘recurring’ jobs that run on a predefined schedule. When a batch job completes, Hyak can send you a notification email.
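Recurring jobs are defined with Slurm's scrontab facility, which uses crontab-style schedule lines plus #SCRON directives for job options. A minimal sketch, assuming the account and partition shown above and a hypothetical script path:

```
# Edit your Slurm crontab with: scrontab -e
#SCRON --account=coenv
#SCRON --partition=compute
#SCRON --time=00:30:00
# Run the (hypothetical) script every day at 02:00.
0 2 * * * /gscratch/coenv/shrike/daily_job.sh
```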
Interactive Jobs
To run an interactive job from the login node, use the salloc command:
salloc -A coenv -p compute -N 1 -c 4 --mem=10G --time=1:00:00
This command allocates a single compute node (-N 1) with 4 processor cores (-c 4) and 10 gigabytes of memory (--mem=10G) for one hour (--time=1:00:00). When the allocation is granted, you will find yourself in a command-line shell on the allocated compute node, which you may use to run your compute processes.
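Inside the interactive shell, Slurm sets environment variables you can use to confirm the allocation. An illustrative session might look like this (the node name, job ID, and program name are hypothetical):

```
~ $ salloc -A coenv -p compute -N 1 -c 4 --mem=10G --time=1:00:00
salloc: Granted job allocation 31501234
shrike@n3000 ~ $ echo $SLURM_CPUS_ON_NODE
4
shrike@n3000 ~ $ ./my_program        # hypothetical compute process
shrike@n3000 ~ $ exit                # release the allocation
salloc: Relinquishing job allocation 31501234
```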
Batch Jobs
Batch jobs are run from the login node using the sbatch command, which executes a job script that you must provide. An example script to set up a batch job looks like this:
#!/bin/bash
#SBATCH --job-name=<name>
#SBATCH --mail-type=<status>
#SBATCH --mail-user=<email>
#SBATCH --account=coenv
#SBATCH --partition=compute
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks-per-node=<cores_per_node>
#SBATCH --mem=<size[unit]>
#SBATCH --gpus=<type:quantity>
#SBATCH --time=<time> # Max runtime in DD-HH:MM:SS format.
#SBATCH --chdir=<working directory>
#SBATCH --export=all
#SBATCH --output=<file> # where STDOUT goes
#SBATCH --error=<file> # where STDERR goes
# Modules to use (optional).
<e.g., module load apptainer>
# Your programs to run.
<my_programs>
Fields in angle brackets should be replaced with your own values. Save the script with a name like mybatch.slurm; you can then submit the batch job with the command sbatch mybatch.slurm
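For reference, a filled-in version of the template might look like this (the job name, email address, resource sizes, and program path are hypothetical examples, not recommendations):

```
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=shrike@uw.edu
#SBATCH --account=coenv
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=10G
#SBATCH --time=02:00:00
#SBATCH --chdir=/gscratch/coenv/shrike
#SBATCH --export=all
#SBATCH --output=my_analysis_%j.out   # %j expands to the job ID
#SBATCH --error=my_analysis_%j.err

# Load any required modules, then run the program.
module load apptainer
./my_program
```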
Note that there are two lines in the script that were not necessary under the previous cluster, mox:
#SBATCH --account=coenv
#SBATCH --partition=compute
Monitoring Jobs and Resource Usage
The Hyak coenv group is a shared resource, which means it is important to use compute resources responsibly. Hyak provides several commands that help users monitor the resources they and others are using.
The squeue Command
squeue can be used to view a list of all jobs that are running or queued. To see jobs running on the cpu-g2 partition under the ‘coenv’ account (group), use this command:
~ $ squeue -A coenv -p cpu-g2
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
31493454 cpu-g2 sys/dash ccchien R 1:53:30 1 n3481
...
Leaving off the -p option will show all jobs under the coenv account. To see only your jobs, use the ‘-u’ option with your username:
~ $ squeue -u shrike
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
squeue can also display other information about a job, such as the number of CPUs and the amount of memory requested. To view additional information, use the ‘-o’ option to specify the output format. For example, this command shows the username, account, job ID, status, and the CPUs and memory requested for everyone in the coenv group:
~ $ squeue -A coenv -o '%u %a %i %T %C %m'
USER ACCOUNT JOBID STATE CPUS MIN_MEMORY
ccchien coenv 31493454 RUNNING 32 128G
auroral coenv 31500909 RUNNING 192 0
ccchien coenv 31500256 RUNNING 64 256G
ccchien coenv 31500761 RUNNING 64 256G
jproef coenv 31491979 RUNNING 32 500G
qgoestch coenv 31495409 RUNNING 16 64G
For a full list of available format fields, see https://slurm.schedmd.com/squeue.html
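Formatted squeue output is easy to post-process with standard tools. As a sketch, the following sums the requested CPUs per user; it is demonstrated on captured sample text so it runs anywhere, but on Klone you would pipe `squeue -A coenv -h -o '%u %C'` (the -h flag suppresses the header line) straight into awk:

```shell
# Sample of `squeue -A coenv -h -o '%u %C'` output (user, requested CPUs).
sample='ccchien 32
auroral 192
ccchien 64
jproef 32'

# Sum CPUs per user and print the totals, sorted by username.
printf '%s\n' "$sample" | awk '{cpus[$1] += $2} END {for (u in cpus) print u, cpus[u]}' | sort
# prints:
# auroral 192
# ccchien 96
# jproef 32
```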
Modules
Modules provide an easy ‘plug-in’ method for making additional software packages available on Hyak. To see which modules are currently available, use the command module avail from an interactive shell on a compute node. Note that this command will not work on a login node; it will instead give you a warning message.
The Research Computing group maintains a large variety of useful modules, such as compilers (gcc, g++, gfortran), programming and data languages (R, Python), and libraries. Additional modules are supplied by the community at large, but these may not be fully supported. The default version of a module is marked with a (D) in the module avail output.
To load a module, use the command module load <modulename>; to unload one, use module unload <modulename>.
These commands can be included at the beginning of your batch scripts to ensure that any required software is available when your batch job runs.