Running Jobs

Slurm is the utility used for batch processing support, so all jobs must be run through it. This section provides the information needed to get started with job execution on the cluster.

IMPORTANT

All jobs requesting more cores than a full node can offer will automatically use all requested nodes in exclusive mode. For example, if you ask for n+1 cores and nodes have n cores each, you will receive two complete nodes (2 x n cores), and the runtime consumed by all 2n cores will be charged to your budget.
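
As an illustration, here is a minimal sketch assuming nodes with 16 cores each (the actual core count depends on the node type):

#SBATCH --ntasks=17   # one task more than a 16-core node can offer: two full
                      # nodes (32 cores) are allocated exclusively and billed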

Queues

Several queues are present in the machines, and different users may have access to different queues. Queues have different limits regarding the number of cores and the duration of the jobs.

You can check at any time the queues you have access to, and their limits, by using:

bsc_queues
info

In addition, special queues are available upon request for longer/larger executions; they require proof of scalability and application performance. To apply for access to these special queues, please get in touch with us.

Submitting jobs

A job is the execution unit for Slurm. A job is defined by a text file containing a set of directives describing the job's requirements, and the commands to execute.

There are two supported methods for submitting jobs:

  • The first one is to use mnsubmit, a wrapper maintained by the Operations Team at BSC that provides a standard syntax regardless of the underlying batch system.

  • The other one is to use the Slurm sbatch directives directly.

    info

    For more information:

    man sbatch
    man srun
    man salloc
IMPORTANT
  • The maximum number of queued jobs (running or not) is 366.
  • Bear in mind that, to ensure proper scheduling of jobs, there are limits on the number of nodes and cores that a group can use simultaneously.
  • Since MinoTauro is a cluster where more than 90% of the computing power comes from the GPUs, jobs that do not use them will have a lower priority than those that are GPU-ready.

Slurm wrapper commands

These are the primary directives to submit jobs with mnsubmit:

  • Submit a job script to the queue system (see Job directives):

    mnsubmit <job_script>
  • Show all the submitted jobs:

    mnq
  • Remove a job from the queue system, canceling the execution of the processes (if they were still running):

    mncancel <job_id>
  • Allocate an interactive session in the "debug" partition:

    mnsh
    info

    You may add -c <ncpus> to allocate n CPUs and/or -g to reserve a GPU.
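
    For example, to open an interactive session with 4 CPUs and one GPU (the values are only illustrative):

    mnsh -c 4 -g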

SBATCH commands

These are the basic directives to submit jobs with sbatch:

  • Submit a job script to the queue system (see Job directives):

    sbatch <job_script>
  • Show all the submitted jobs:

    squeue
  • Remove a job from the queue system, canceling the execution of the processes (if they were still running):

    scancel <job_id>
  • Allocate an interactive session in the "debug" partition:

    salloc --partition=interactive

    Or:

    salloc -p interactive
    info

    You may add -c <ncpus> to allocate n CPUs and/or --gres=gpu:<ngpus> to reserve n GPUs.
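
    For example, an interactive session with 4 CPUs and one GPU could be requested as follows (the values are only illustrative):

    salloc -p interactive -c 4 --gres=gpu:1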

Job directives

A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script and have to conform to either the mnsubmit or the sbatch syntaxes. Using mnsubmit syntax with sbatch or the other way around will result in failure.

mnsubmit syntax is of the form:

# @ directive = value

while sbatch is of the form:

#SBATCH --directive=value

Additionally, the job script may contain a set of commands to execute. If not, an external script may be provided with the 'executable' directive.

Here you may find the most common directives for both syntaxes:

  • Request the debug partition, which is only intended for small tests:

    # mnsubmit
    # @ partition = debug
    # @ class = debug
    # sbatch
    #SBATCH --partition=debug
    #SBATCH --qos=debug
  • Set the limit of wall clock time:

    # mnsubmit
    # @ wall_clock_limit = HH:MM:SS
    # sbatch
    #SBATCH --time=HH:MM:SS
    caution

    This is a mandatory field; set it to a value greater than the real execution time of your application and smaller than the time limit granted to the user. Note that your job will be killed once this time has elapsed.

  • Set the working directory of your job (i.e. where the job will run):

    # mnsubmit
    # @ initialdir = pathname
    # sbatch
    #SBATCH --workdir=pathname
    caution

    If not specified, it is the current working directory at the time the job was submitted.

  • Set the name of the file to collect the standard output (stdout) of the job:

    # mnsubmit
    # @ output = file
    # sbatch
    #SBATCH --output=file
  • Set the name of the file to collect the standard error output (stderr) of the job:

    # mnsubmit
    # @ error = file
    # sbatch
    #SBATCH --error=file
  • Set the number of processes to start:

    # mnsubmit
    # @ total_tasks = number
    # sbatch
    #SBATCH --ntasks=number
  • Optionally, you can specify how many threads each process will open with the following directive:

    # mnsubmit
    # @ cpus_per_task = number
    # sbatch
    #SBATCH --cpus-per-task=number
    info

    The number of CPUs assigned to the job will be total_tasks * cpus_per_task. For example, total_tasks = 4 and cpus_per_task = 4 request 16 CPUs.

  • Set the number of tasks assigned to a node:

    # mnsubmit
    # @ tasks_per_node = number
    # sbatch
    #SBATCH --ntasks-per-node=number
  • Set the number of GPU cards assigned to each node of the job:

    # mnsubmit
    # @ gpus_per_node = number
    # sbatch
    #SBATCH --gres=gpu:number
    REMARK
    • This number can be [1-4] on k80 configurations.
    • In order to allocate all the GPU cards in a node, you must allocate all the cores of the node.
    • You must not request GPUs if your job does not use them.
    caution
    • When submitting jobs with a single GPU, using a number of CPUs greater than 8 will fail:

      sbatch: error: Batch job submission failed: Requested node configuration is not available
    • If you wish to use more than 8 CPUs, make sure to ask for at least 2 GPUs: --gres=gpu:2

  • Set the reservation name where your jobs will be allocated (assuming that your account has access to that reservation):

    # sbatch
    #SBATCH --reservation=reservation_name
    REMARK

    Sometimes, node reservations can be granted for executions where only a set of accounts can run jobs. Useful for courses.

  • These two directives are presented as a set because they need to be used at the same time. They enable e-mail notifications that are triggered when a job starts its execution (begin), ends its execution (end), or both (all):

    #SBATCH --mail-type=[begin|end|all|none]
    #SBATCH --mail-user=<your_email>

    Example (notified at the end of the job execution):

    #SBATCH --mail-type=end
    #SBATCH --mail-user=dannydevito@bsc.es
    REMARK

    The "none" option doesn't trigger any e-mail, it is the same as not putting the directives. The only requisite is that the e-mail specified is valid and also the same one that you use for the HPC User Portal (what is the HPC User Portal, you ask? Excellent question, check it out here!

  • By default, Slurm schedules a job so that it uses the minimum number of switches. However, you can request a maximum number of switches for your job:

    # mnsubmit
    # @ switches = "number@timeout"
    # sbatch
    #SBATCH --switches=number@timeout
    REMARK

    Slurm will try to schedule the job with the requested number of switches for timeout minutes. If the requested number of switches cannot be satisfied after timeout minutes, Slurm will schedule the job with its default behaviour.

  • By default, only the NVIDIA OpenCL driver is loaded, so OpenCL runs on the GPU device. In order to run OpenCL on the CPU device, this directive must be added to the job script:

    # mnsubmit
    # @ intel_opencl = 1
    caution

    As this feature introduces significant changes to the operating system setup, it is mandatory to allocate full nodes in order to use it.

Some useful Slurm environment variables

Variable          Meaning
SLURM_JOBID       Specifies the job ID of the executing job
SLURM_NPROCS      Specifies the total number of processes in the job
SLURM_NNODES      Is the actual number of nodes assigned to run your job
SLURM_PROCID      Specifies the MPI rank (or relative process ID) for the current process. The range is from 0 to (SLURM_NPROCS-1)
SLURM_NODEID      Specifies the relative node ID of the current job. The range is from 0 to (SLURM_NNODES-1)
SLURM_LOCALID     Specifies the node-local task ID for the process within a job
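
As a minimal sketch of how these variables can be used inside a job script (the job name and values are only illustrative):

#!/bin/bash
#SBATCH --job-name=env_check
#SBATCH --ntasks=4
#SBATCH --time=00:01:00

# Variables describing the whole job are available to the batch script itself
echo "Job ${SLURM_JOBID} uses ${SLURM_NNODES} node(s) and ${SLURM_NPROCS} process(es)"
# Per-task variables are set for each task launched with srun
srun bash -c 'echo "task ${SLURM_PROCID}/${SLURM_NPROCS} on node ${SLURM_NODEID}, local id ${SLURM_LOCALID}"'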

Examples

mnsubmit examples

Example for a sequential job:

#!/bin/bash
# @ job_name= test_serial
# @ initialdir= .
# @ output= serial_%j.out
# @ error= serial_%j.err
# @ total_tasks= 1
# @ wall_clock_limit = 00:02:00

./serial_binary > serial.out

Examples for a parallel job:

#!/bin/bash
# @ job_name= test_parallel
# @ initialdir= .
# @ output= mpi_%j.out
# @ error= mpi_%j.err
# @ total_tasks= 16
# @ gpus_per_node= 4
# @ cpus_per_task= 1
# @ wall_clock_limit = 00:02:00

srun ./parallel_binary > parallel.output

sbatch examples

Example for a sequential job:

#!/bin/bash
#SBATCH --job-name="test_serial"
#SBATCH --workdir=.
#SBATCH --output=serial_%j.out
#SBATCH --error=serial_%j.err
#SBATCH --ntasks=1
#SBATCH --time=00:02:00

./serial_binary > serial.out

Examples for a parallel job:

#!/bin/bash
#SBATCH --job-name=test_parallel
#SBATCH --workdir=.
#SBATCH --output=mpi_%j.out
#SBATCH --error=mpi_%j.err
#SBATCH --ntasks=16
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=1
#SBATCH --time=00:02:00

srun ./parallel_binary > parallel.output
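
As a further sketch, the same directives can combine MPI tasks and threads per task for a hybrid run (the binary name and values are only illustrative; adjust them to your application):

#!/bin/bash
#SBATCH --job-name=test_hybrid
#SBATCH --workdir=.
#SBATCH --output=hybrid_%j.out
#SBATCH --error=hybrid_%j.err
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --time=00:02:00

# Give each MPI task as many OpenMP threads as CPUs allocated to it
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./hybrid_binary > hybrid.output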

Interpreting job status and reason codes

When using squeue, Slurm reports back the status of your submitted jobs. If a job is still waiting to enter execution, its state is followed by the reason why it is pending. Slurm uses codes to display this information; this section covers the meaning of the most relevant ones.
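
For instance, the following squeue invocation (using standard output-format options) lists your own jobs together with their state and, for pending jobs, the reason code:

squeue -u $USER -o "%.10i %.12j %.8T %.20R"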

Job state codes

This list contains the usual state codes for jobs that have been submitted:

  • COMPLETED (CD): The job has completed its execution.
  • COMPLETING (CG): The job is finishing, but some processes are still active.
  • FAILED (F): The job terminated with a non-zero exit code.
  • PENDING (PD): The job is waiting for resource allocation. This is the most common state after running "sbatch"; the job will run eventually.
  • PREEMPTED (PR): The job was terminated because of preemption by another job.
  • RUNNING (R): The job is allocated and running.
  • SUSPENDED (S): A running job has been stopped with its cores released to other jobs.
  • STOPPED (ST): A running job has been stopped with its cores retained.

Job reason codes

This list contains the most common reason codes for jobs that have been submitted and are not yet in the running state:

  • Priority: One or more higher-priority jobs are queued ahead of yours. Your job will eventually run.
  • Dependency: This job is waiting for a dependent job to complete and will run afterwards.
  • Resources: The job is waiting for resources to become available and will eventually run.
  • InvalidAccount: The job’s account is invalid. Cancel the job and resubmit with a correct account.
  • InvalidQOS: The job’s QoS is invalid. Cancel the job and resubmit with a correct QoS.
  • QOSGrpCpuLimit: All CPUs assigned to your job’s specified QoS are in use; job will run eventually.
  • QOSGrpMaxJobsLimit: Maximum number of jobs for your job’s QoS has been met; job will run eventually.
  • QOSGrpNodeLimit: All nodes assigned to your job’s specified QoS are in use; job will run eventually.
  • PartitionCpuLimit: All CPUs assigned to your job’s specified partition are in use; job will run eventually.
  • PartitionMaxJobsLimit: Maximum number of jobs for your job’s partition has been met; job will run eventually.
  • PartitionNodeLimit: All nodes assigned to your job’s specified partition are in use; job will run eventually.
  • AssociationCpuLimit: All CPUs assigned to your job’s specified association are in use; job will run eventually.
  • AssociationMaxJobsLimit: Maximum number of jobs for your job’s association has been met; job will run eventually.
  • AssociationNodeLimit: All nodes assigned to your job’s specified association are in use; job will run eventually.
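
If the reason shown by squeue is not informative enough, the standard scontrol command displays the full details of a job, including its state and pending reason:

scontrol show job <job_id>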