
Running Jobs

Slurm is the utility used for batch processing support, so all jobs must be run through it. This section provides information for getting started with job execution at the cluster.

IMPORTANT

All jobs requesting 48 or more cores will automatically use all requested nodes in exclusive mode. For example, if you ask for 49 cores, you will receive two complete nodes (48 cores * 2 = 96 cores), and the consumed runtime of these 96 cores will be reflected in your budget.
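For instance, if that 49-core job then runs for 10 hours, it will consume 96 cores * 10 h = 960 core hours from the budget, not 490.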

Queues

Several queues are available on the machines, and users may have access to different ones. Each queue has its own limits on the number of cores and the duration of jobs.

You can check at any time which queues you have access to, and their limits, by using:

bsc_queues
info

In addition, special queues are available upon request for longer/larger executions; they require proof of scalability and application performance. To apply for access to these special queues, please get in touch with us.

The standard configuration and limits of the queues are the following:

Queue           Maximum number of nodes (cores)   Maximum wallclock
Debug           16 (768)                          2 h
Interactive     (max 4 cores)                     2 h
BSC             50 (2400)                         48 h
RES Class A     200 (9600)                        72 h
RES Class B     200 (9600)                        48 h
RES Class C     21 (1008)                         24 h

Submitting jobs

A job is the execution unit for Slurm. A job is defined by a text file containing a set of directives describing the job's requirements and the commands to execute.

The method for submitting jobs is to use the Slurm sbatch command directly.

info

For more information:

man sbatch
man srun
man salloc
IMPORTANT
  • The maximum number of queued jobs (running or not) is 366.
  • Bear in mind there are execution limitations on the number of nodes and cores that can be used simultaneously by a group to ensure the proper scheduling of jobs.

SBATCH commands

These are the basic commands for submitting and managing jobs:

  • Submit a job script to the queue system (see Job directives):

    sbatch <job_script>
  • Show all the submitted jobs:

    squeue
  • Remove a job from the queue system, canceling the execution of the processes (if they were still running):

    scancel <job_id>
  • To set up X11 forwarding in an srun allocation (so that you will be able to execute graphical commands):

    srun --x11
    REMARK

    You will get a graphical window as long as you don't close the current terminal.

  • Also, X11 forwarding can be set through interactive sessions:

    salloc -J interactive --x11
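
Putting these commands together, a typical minimal workflow looks like this (the script name and job ID shown are placeholders):

sbatch job.cmd        # submit the job script; Slurm replies with the assigned job ID
squeue                # check the state of your submitted jobs
scancel 36            # cancel job 36 if it is no longer needed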

Interactive Sessions

Allocation of an interactive session has to be done through Slurm:

salloc [ OPTIONS ] 
note

Some of the parameters you can use with salloc are the following (see also Job directives):

-p, --partition=<name>
-q, --qos=<name>
-t, --time=<time>
-n, --ntasks=<number>
-c, --cpus-per-task=<number>
-J, --job-name=<name>
--exclusive
Examples

Interactive session for 10 minutes, 1 task, 4 CPUs (cores) per task, in the 'interactive' partition:

salloc -t 00:10:00 -n 1 -c 4 --partition=interactive -J myjob

Interactive session on a compute node ('main' partition) in exclusive mode:

salloc --exclusive
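
Once the allocation is granted, salloc typically opens a new shell from which you can launch work onto the allocated resources with srun; exiting that shell releases the allocation. A minimal sketch (the binary name is a placeholder):

salloc -t 00:30:00 -n 1 -c 4 --partition=interactive
srun ./my_binary      # runs on the allocated resources
exit                  # releases the allocation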

Job directives

A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script and have to conform to the sbatch syntax.

sbatch syntax is of the form:

#SBATCH --directive=value

Additionally, the job script may contain a set of commands to execute. If not, an external script may be provided with the 'executable' directive.

Here you may find the most common directives:

  • Request the queue for the job:

    #SBATCH --qos=debug
    REMARKS
    • Slurm will use the user's default queue if it is not specified.
    • The queue 'debug' is only intended for small tests.
  • Set the limit of wall clock time:

    #SBATCH --time=DD-HH:MM:SS
    caution

    This is a mandatory field; you must set it to a value greater than the real execution time of your application and smaller than the time limit granted to the user. Note that your job will be killed once this time has passed.

  • Set the working directory of your job (i.e. where the job will run):

    #SBATCH --chdir=pathname

    Or:

    #SBATCH -D pathname
    caution

    If not specified, it is the current working directory at the time the job was submitted.

  • Set the name of the file to collect the standard output (stdout) of the job:

    #SBATCH --output=file
  • Set the name of the file to collect the standard error output (stderr) of the job:

    #SBATCH --error=file
  • Request exclusive use of a compute node, without sharing the resources with other users:

    #SBATCH --exclusive
    REMARK

    This only applies to jobs requesting less than one node (48 cores). All jobs with >= 48 cores will automatically use all requested nodes in exclusive mode.

  • Set the number of requested nodes:

    #SBATCH --nodes=number

    Or:

    #SBATCH -N number
    REMARK

    Please keep in mind that this parameter will enforce the exclusivity of the nodes in case you request more than one.

    Example 1

    For example, let's say that we specify just one node with this parameter, but we only want to use two tasks that use only one core each:

    #SBATCH -N 1
    #SBATCH -n 2
    #SBATCH -c 1

    This will only request the resources we actually need, which are just two cores. The remaining resources of the node will be left available for other users.

    Example 2

    If we request more than one node (and we leave the other parameters untouched) like this:

    #SBATCH -N 2
    #SBATCH -n 2
    #SBATCH -c 1

    It will request the exclusivity of both nodes. This means that the full resources of both nodes will be requested and they won't be shared with other users. In MN4, each node has 48 cores, so this example will request 96 cores.

    IMPORTANT

    It's important to be aware of this behavior, as it will charge the CPU time of those 96 cores to your computation time budget (in case you have one), even if you specified the use of only two cores.

  • Set the number of processes to start:

    #SBATCH --ntasks=number
  • Optionally, you can specify how many threads each process would open with the directive:

    #SBATCH --cpus-per-task=number
    info

    The number of cores assigned to the job will be the total_tasks number * cpus_per_task number.

  • Set the number of tasks assigned to a node:

    #SBATCH --tasks-per-node=number
  • Set the number of tasks assigned to a socket:

    #SBATCH --ntasks-per-socket=number
  • Select which configuration to run your job with, for example to run the job on a 'HighMem' node with 7928 MB per core:

    #SBATCH --constraint=highmem
    info

    Without this directive, jobs will be sent to standard nodes, which have 1880 MB of RAM per core. There is only a limited number of high-memory nodes available: 216 nodes (10368 cores) out of 3456 nodes (165888 cores) in total. Therefore, when requesting these nodes you can expect significantly longer queueing times to fulfil the resource request before your job can start.

    info

    The accounting for one core hour is the same in standard and highmem nodes, i.e. 1 core hour per core per hour is budgeted. For faster turnaround times in the queues you can also use standard nodes and run fewer processes per node. For this you will need to request more cores per task, as every core requested comes with its 2 GB of RAM. You can do this by specifying the flag #SBATCH --cpus-per-task=number; your budget will be charged for all cores requested (see the sketch after this list of directives).

  • Set the reservation name where you will allocate your jobs (assuming that your account has access to that reservation):

    #SBATCH --reservation=reservation_name
    REMARK

    Node reservations can sometimes be granted for executions where only a specific set of accounts can run jobs. This is useful for courses, for example.

  • These two directives are presented together because they need to be used at the same time. They will enable e-mail notifications that are triggered when a job starts its execution (begin), ends its execution (end) or both (all):

    #SBATCH --mail-type=[begin|end|all|none]
    #SBATCH --mail-user=<your_email>

    Example (notified at the end of the job execution):

    #SBATCH --mail-type=end
    #SBATCH --mail-user=dannydevito@bsc.es
    REMARK

    The "none" option doesn't trigger any e-mail; it is the same as not setting the directives at all. The only requirement is that the e-mail specified is valid and is the same one that you use for the HPC User Portal (what is the HPC User Portal, you ask? Excellent question, check it out here).

  • By default, Slurm tries to schedule a job using the minimum number of switches. However, you can request a maximum number of switches for your job:

    #SBATCH --switches=number@timeout
    REMARK

    Slurm will try to schedule the job for timeout minutes. If it is not possible to allocate the requested number of switches (each rack has 3 switches, and every switch is connected to 24 nodes) within timeout minutes, Slurm will schedule the job normally.

  • Submit a job array, multiple jobs to be executed with identical parameters:

    #SBATCH --array=<indexes>
    REMARK

    The indexes specification identifies what array index values should be used. Multiple values may be specified using a comma separated list and/or a range of values with a "-" separator.

    info

    Job arrays will have two additional environment variables set. SLURM_ARRAY_JOB_ID will be set to the first job ID of the array. SLURM_ARRAY_TASK_ID will be set to the job array index value (see also the job array example at the end of the sbatch examples section below).

    For example:

    sbatch --array=1-3 job.cmd
    Submitted batch job 36

    This will generate a job array containing three jobs, and the environment variables will be set as follows:

    # Job 1
    SLURM_JOB_ID=36
    SLURM_ARRAY_JOB_ID=36
    SLURM_ARRAY_TASK_ID=1

    # Job 2
    SLURM_JOB_ID=37
    SLURM_ARRAY_JOB_ID=36
    SLURM_ARRAY_TASK_ID=2

    # Job 3
    SLURM_JOB_ID=38
    SLURM_ARRAY_JOB_ID=36
    SLURM_ARRAY_TASK_ID=3
  • Request that job steps initiated by srun commands inside this sbatch script be run at some requested frequency, if possible, on the cores selected for the step on the compute node(s):

    #SBATCH --cpu-freq=<number>
    info

    Available frequency steps: 2.10 GHz, 2.00 GHz, 1.90 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.50 GHz, 1.40 GHz, 1.30 GHz, 1.20 GHz, 1.10 GHz, 1 GHz.

    REMARK

    The value is given in kHz, so you will need to specify values from 2100000 down to 1000000.

    info

    By default, both the Turbo Boost and Speed Step technologies are activated in MareNostrum4.
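
As an illustration of the memory strategy mentioned in the notes to --constraint=highmem above, here is a minimal sketch of a job that runs on standard nodes with fewer processes per node, reserving 2 cores (and therefore roughly 2 x 1880 MB of RAM) per task. The binary name and figures are placeholders:

#!/bin/bash
#SBATCH --job-name=more_mem_per_task
#SBATCH --output=mem_%j.out
#SBATCH --error=mem_%j.err
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=2      # 2 cores (and their memory) per task
#SBATCH --tasks-per-node=24    # half the usual 48 tasks per standard node
#SBATCH --time=00:30:00

srun ./mpi_binary

This requests 48 tasks * 2 cores = 96 cores, i.e. two full standard nodes with 24 tasks per node instead of 48. Keep in mind that the budget will be charged for all 96 cores requested.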

Some useful Slurm environment variables

Variable         Meaning
SLURM_JOBID      Specifies the job ID of the executing job
SLURM_NPROCS     Specifies the total number of processes in the job
SLURM_NNODES     The actual number of nodes assigned to run your job
SLURM_PROCID     Specifies the MPI rank (or relative process ID) of the current process; the range is from 0 to SLURM_NPROCS-1
SLURM_NODEID     Specifies the relative node ID of the current job; the range is from 0 to SLURM_NNODES-1
SLURM_LOCALID    Specifies the node-local task ID for the process within a job
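
A minimal sketch that prints these variables for every task of a small job (the job parameters are arbitrary):

#!/bin/bash
#SBATCH --job-name=env_test
#SBATCH --output=env_%j.out
#SBATCH --error=env_%j.err
#SBATCH --ntasks=4
#SBATCH --time=00:01:00

# Each task reports its global rank, node ID and node-local task ID
srun bash -c 'echo "job=$SLURM_JOBID rank=$SLURM_PROCID node=$SLURM_NODEID local=$SLURM_LOCALID ($SLURM_NPROCS tasks on $SLURM_NNODES nodes)"'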

Examples

sbatch examples

Example for a sequential job:

#!/bin/bash
#SBATCH --job-name="test_serial"
#SBATCH --chdir=.
#SBATCH --output=serial_%j.out
#SBATCH --error=serial_%j.err
#SBATCH --ntasks=1
#SBATCH --time=00:02:00

./serial_binary > serial.out

Examples for a parallel job:

  • Running a pure OpenMP job on one MN4 node using 48 cores on the debug queue:

    #!/bin/bash
    #SBATCH --job-name=omp
    #SBATCH --chdir=.
    #SBATCH --output=omp_%j.out
    #SBATCH --error=omp_%j.err
    #SBATCH --cpus-per-task=48
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00
    #SBATCH --qos=debug

    ./openmp_binary
  • Running a pure MPI job on two MN4 nodes:

    #!/bin/bash
    #SBATCH --job-name=mpi
    #SBATCH --output=mpi_%j.out
    #SBATCH --error=mpi_%j.err
    #SBATCH --ntasks=96

    srun ./mpi_binary
  • Running a hybrid MPI+OpenMP job on two MN4 nodes with 24 MPI tasks (12 per node), each using 4 cores via OpenMP:

    #!/bin/bash
    #SBATCH --job-name=test_parallel
    #SBATCH --chdir=.
    #SBATCH --output=mpi_%j.out
    #SBATCH --error=mpi_%j.err
    #SBATCH --ntasks=24
    #SBATCH --cpus-per-task=4
    #SBATCH --tasks-per-node=12
    #SBATCH --time=00:02:00

    srun ./parallel_binary > parallel.output
  • Running on four high memory MN4 nodes with 1 task per node, each using 48 cores:

    #!/bin/bash
    #SBATCH --job-name=test_parallel
    #SBATCH --chdir=.
    #SBATCH --output=mpi_%j.out
    #SBATCH --error=mpi_%j.err
    #SBATCH --ntasks=4
    #SBATCH --cpus-per-task=48
    #SBATCH --tasks-per-node=1
    #SBATCH --time=00:02:00
    #SBATCH --constraint=highmem

    srun ./parallel_binary > parallel.output
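
Example for a job array, building on the --array directive described above (the binary and input file names are purely illustrative):

#!/bin/bash
#SBATCH --job-name=test_array
#SBATCH --chdir=.
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --array=1-3

# Each array task picks its own input file based on its index
./serial_binary input_${SLURM_ARRAY_TASK_ID}.dat > output_${SLURM_ARRAY_TASK_ID}.out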

Interpreting job status and reason codes

When using squeue, Slurm will report back the status of your launched jobs. If they are still waiting to enter execution, the state will be accompanied by the reason they are waiting. Slurm uses codes to display this information, so in this section we cover the meaning of the most relevant ones.
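
For instance, you can list your own jobs with their state (ST column) and reason (shown in the NODELIST(REASON) column), or inspect one specific job (the job ID is a placeholder):

squeue -u $USER
squeue -j 36 -l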

Job state codes

This list contains the usual state codes for jobs that have been submitted:

  • COMPLETED (CD): The job has completed the execution.
  • COMPLETING (CG): The job is finishing, but some processes are still active.
  • FAILED (F): The job terminated with a non-zero exit code.
  • PENDING (PD): The job is waiting for resource allocation. The most common state after running "sbatch", it will run eventually.
  • PREEMPTED (PR): The job was terminated because of preemption by another job.
  • RUNNING (R): The job is allocated and running.
  • SUSPENDED (S): A running job has been stopped with its cores released to other jobs.
  • STOPPED (ST): A running job has been stopped with its cores retained.

Job reason codes

This list contains the most common reason codes of the jobs that have been submitted and are still not in the running state:

  • Priority: One or more higher-priority jobs are queued ahead of yours. Your job will eventually run.
  • Dependency: This job is waiting for a dependent job to complete and will run afterwards.
  • Resources: The job is waiting for resources to become available and will eventually run.
  • InvalidAccount: The job's account is invalid. Cancel the job and resubmit with a correct account.
  • InvalidQoS: The job's QoS is invalid. Cancel the job and resubmit with a correct QoS.
  • QOSGrpCpuLimit: All CPUs assigned to your job's specified QoS are in use; the job will run eventually.
  • QOSGrpMaxJobsLimit: The maximum number of jobs for your job's QoS has been met; the job will run eventually.
  • QOSGrpNodeLimit: All nodes assigned to your job's specified QoS are in use; the job will run eventually.
  • PartitionCpuLimit: All CPUs assigned to your job's specified partition are in use; the job will run eventually.
  • PartitionMaxJobsLimit: The maximum number of jobs for your job's partition has been met; the job will run eventually.
  • PartitionNodeLimit: All nodes assigned to your job's specified partition are in use; the job will run eventually.
  • AssociationCpuLimit: All CPUs assigned to your job's specified association are in use; the job will run eventually.
  • AssociationMaxJobsLimit: The maximum number of jobs for your job's association has been met; the job will run eventually.
  • AssociationNodeLimit: All nodes assigned to your job's specified association are in use; the job will run eventually.

Resource usage and job priorities

Projects are assigned a certain amount of compute hours or core hours that are available to use. One core hour is the computing time of one core during one hour; that is, a full node with 48 cores running a job for one hour will use up 48 core hours from the assigned budget. The accounting is based solely on the amount of compute hours used.
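For example, a job that occupies two full nodes (96 cores) for 6 hours consumes 96 cores * 6 h = 576 core hours from the assigned budget.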

The priority of a job, and therefore its scheduling in the queues, is determined by a multitude of factors. The most important and influential ones are the fair share between groups, the waiting time in the queues and the job size. MareNostrum is a system meant for, and favouring, large executions, so jobs using more cores have a higher priority. The time spent waiting in the queues is taken into account as well: jobs gain more and more priority the longer they wait. Finally, our queue system implements a fairshare policy between groups: groups that have not run many jobs or consumed many compute hours get a higher priority than groups with a high usage. This allows everyone their fair share of compute time and the option to run jobs without one group or another being favoured. You can review your current fair share score using the command:

sshare -la

Notifications

It is currently not possible to be notified about the status of jobs via e-mail. To check whether your jobs are running or have finished, you will need to connect to the system and verify their status manually. Automatic notifications are planned for the future.