The Slurm Job Scheduler
How to Run a Job
See the Hyak Documentation page for complete information on scheduling jobs in Hyak.
Klone compute resources are organized by account and partition. An account is similar to the ‘group’ you have joined, and a partition represents a specific type of compute resource (for instance, CPU or GPU).
To see the resources that you have access to, you can issue the hyakalloc command from the command prompt. You will see a table that looks like this:
~ $ hyakalloc
Account resources available to user: shrike
╭─────────┬───────────┬──────┬────────┬──────┬───────╮
│ Account │ Partition │ CPUs │ Memory │ GPUs │ │
├─────────┼───────────┼──────┼────────┼──────┼───────┤
│ coenv │ compute │ 40 │ 175G │ 0 │ TOTAL │
│ │ │ 0 │ 0G │ 0 │ USED │
│ │ │ 40 │ 175G │ 0 │ FREE │
├─────────┼───────────┼──────┼────────┼──────┼───────┤
│ coenv │ cpu-g2 │ 672 │ 5223G │ 0 │ TOTAL │
│ │ │ 202 │ 1090G │ 0 │ USED │
│ │ │ 470 │ 4133G │ 0 │ FREE │
├─────────┼───────────┼──────┼────────┼──────┼───────┤
│ coenv │ gpu-l40s │ 32 │ 364G │ 2 │ TOTAL │
│ │ │ 0 │ 0G │ 0 │ USED │
│ │ │ 32 │ 364G │ 2 │ FREE │
╰─────────┴───────────┴──────┴────────┴──────┴───────╯
Checkpoint Resources
╭──────────────────┬───────────────┬──────────────╮
│ │ CPUs │ GPUs │
├──────────────────┼───────────────┼──────────────┤
│ Idle: │ 2167 │ 194 │
╰──────────────────┴───────────────┴──────────────╯
Checkpoint is currently limited to 11068 jobs
~ $
This user has access to the resources in the ‘coenv’ account/group. The coenv group has nodes in the ‘compute,’ ‘cpu-g2,’ and ‘gpu-l40s’ partitions. Run the sinfo command to see a complete list of all partitions.
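For example, the following commands list each partition with its time limit, node count, and node state, or restrict the listing to a single partition. The format string uses standard Slurm options and is shown only as an illustration:
sinfo -o "%P %l %D %t"    # partition, time limit, node count, node state
sinfo -p compute          # show only the compute partition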
Klone uses the Slurm scheduler to manage access to the compute resources. Slurm jobs are submitted from the login nodes and run on the compute nodes. In general, you will submit ‘batch’ (non-interactive) jobs directly to the compute nodes, but you can also launch ‘interactive’ jobs, which give you access to the command line on a compute node so you can interact with your compute processes. You may also create ‘recurring’ jobs that run on a predefined schedule. When a batch job completes, Slurm can send you a notification email if you request one with the --mail-type and --mail-user options shown in the batch script below.
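Recurring jobs are typically managed with Slurm’s scrontab command (scrontab -e opens your entries for editing). The entry below is only a sketch; the account, partition, time limit, schedule, and script path are placeholders to be replaced with your own values:
#SCRON --account=coenv
#SCRON --partition=compute
#SCRON --time=00:30:00
# Run a (placeholder) script every day at 02:00.
0 2 * * * /path/to/my_nightly_job.sh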
Interactive Jobs
To run an interactive job from the login node, use the salloc command:
salloc -A coenv -p compute -N 1 -c 4 --mem=10G --time=1:00:00
This command will allocate a single compute node (-N 1) with 4 processor cores (-c 4) and 10 gigabytes of memory (--mem=10G) for one hour (--time=1:00:00). Once the allocation is granted, you will find yourself in a command line shell on the allocated compute node. You may use this shell to run your compute processes.
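For example, a short interactive session might look like the following; my_analysis.py is a placeholder for your own program, and you may need to load a module before running it:
salloc -A coenv -p compute -N 1 -c 4 --mem=10G --time=1:00:00
hostname                  # confirm which compute node you are on
python my_analysis.py     # run your work (placeholder program)
exit                      # release the allocation when you are done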
Batch Jobs
Batch jobs are submitted from the login node using the sbatch command, which executes a job script that you provide. An example script to set up a batch job looks like this:
#!/bin/bash
#SBATCH --job-name=<name>
#SBATCH --mail-type=<status>
#SBATCH --mail-user=<email>
#SBATCH --account=coenv
#SBATCH --partition=compute
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks-per-node=<cores_per_node>
#SBATCH --mem=<size[unit]>
#SBATCH --gpus=<type:quantity>
#SBATCH --time=<time> # Max runtime in DD-HH:MM:SS format.
#SBATCH --chdir=<working directory>
#SBATCH --export=all
#SBATCH --output=<file> # where STDOUT goes
#SBATCH --error=<file> # where STDERR goes
# Modules to use (optional).
<e.g., module load apptainer>
# Your programs to run.
<my_programs>
The fields in angle brackets should be replaced with your own values. Save the script with a name like mybatch.slurm, then submit the batch job with the command sbatch mybatch.slurm.
Note that there are two lines in the script that were not necessary under the previous cluster, mox:
#SBATCH --account=coenv
#SBATCH --partition=compute
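As a concrete sketch, a filled-in script might look like the following; the job name, email address, resource requests, and program are placeholder values, not recommendations:
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_netid@uw.edu
#SBATCH --account=coenv
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=10G
#SBATCH --time=0-02:00:00   # two hours, in DD-HH:MM:SS format
#SBATCH --output=my_job_%j.out
#SBATCH --error=my_job_%j.err
# Modules to use (optional).
module load apptainer
# Your program to run (placeholder).
./my_program
After saving the script, submit it and monitor it from the login node:
sbatch mybatch.slurm      # submit the job; sbatch prints the job ID
squeue -u $USER           # list your queued and running jobs
scancel <jobid>           # cancel a job, using the ID printed by sbatch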
Modules
Modules give Hyak an easy ‘plug-in’ way to make additional software packages available. To see which modules are currently available, run the command module avail from an interactive shell on a compute node. Note that this command will not work on the login node and will instead give you a warning message.
The Research Computing group maintains a large variety of useful modules, such as compilers (gcc, g++, gfortran), programming and data languages (R, Python), and libraries. Additional modules are also supplied by the community at large, but these may not have full support available. In the module avail listing, a (D) marks the default version of a module, which is the version loaded when you do not specify one.
To load a module, use the command module load <modulename>; to unload a module, use module unload <modulename>. These commands can be included at the beginning of your batch scripts to ensure that any required software is available when your batch job runs.
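For example (the R module name is only a guess at what module avail might show; check the listing for the exact name and version on your system):
module avail R            # list available modules whose names match R
module load R             # load the module into your environment
module list               # show the modules currently loaded
module unload R           # unload it when you no longer need it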