Slurm is the utility used for batch processing support, so all jobs must be run through it. This section provides information for getting started with job execution on the cluster.
All jobs requesting more cores than a full node can offer will automatically use all requested nodes in exclusive mode. For example, if you request 129 cores on a 128-core node, you will receive two complete nodes (128 cores * 2 = 256 cores), and the consumed runtime of these 256 cores will be charged to your budget.
The maximum number of queued jobs (running or not) is 366. You can check this limit with bsc_queues.
There are several queues present in the machines, and different users may have access to different queues. Each queue has its own limits on the number of cores and the duration of jobs. You can check at any time all the queues you have access to, and their limits, using:
bsc_queues
The standard configuration and limits of the queues are the following:
|Queue||Maximum number of nodes (cores)||Maximum wallclock|
|debug||4 (512)||2 h|
|normal||8 (1024)||3 days|
|interactive||1 (128)||2 h|
|xlarge||16 (2048)||72 h|
|xlong||8 (1024)||10 days|
For longer and/or larger executions, special queues (xlarge and xlong) are available upon request; access requires proof of scalability and application performance. To request access to these special queues, please contact us.
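The standard limits in the table above can be turned into a quick check before submitting. The following is only a sketch: `queue_allows` is a hypothetical helper, not a command provided by the cluster, and the limits granted to your account (as reported by bsc_queues) may differ from the standard ones.

```shell
# Illustrative pre-submission check against the standard limits listed above.
# queue_allows is a hypothetical helper, not a cluster-provided command.
queue_allows() {
    # $1 = queue name, $2 = nodes requested, $3 = wallclock in hours
    case "$1" in
        debug)       max_nodes=4;  max_hours=2   ;;
        normal)      max_nodes=8;  max_hours=72  ;;   # 3 days
        interactive) max_nodes=1;  max_hours=2   ;;
        xlarge)      max_nodes=16; max_hours=72  ;;
        xlong)       max_nodes=8;  max_hours=240 ;;   # 10 days
        *)           return 2 ;;                      # unknown queue
    esac
    [ "$2" -le "$max_nodes" ] && [ "$3" -le "$max_hours" ]
}

queue_allows debug 4 2 && echo "fits in debug"
queue_allows normal 16 24 || echo "too large for normal"
```

The function returns 0 when the request fits, 1 when it exceeds a limit, and 2 for a queue name that is not in the table.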
One of the main appeals of this cluster is the inclusion of two specialized nodes: one node for AI training and another one for AI inference. In order to request one of those nodes, you will have to add some specific options to your jobscript. Please note that the specialized resources inside those nodes are not managed through "#SBATCH --gres" parameters, so you should allocate the full node.
To request one of those nodes, you should add one of these directives in your jobscript:
The first constraint is used to define that you specifically need the AI Training node. The second one is used when the AI Inference node is needed.
It is highly recommended to request the full node, either by explicitly requesting all the cores or by using the "#SBATCH --exclusive" parameter.
Keep in mind that there is only one node of each type in the whole cluster!
The method for submitting jobs is to use the SLURM sbatch directives directly.
A job is the execution unit for SLURM. A job is defined by a text file containing a set of directives describing the job's requirements, and the commands to execute.
In order to ensure the proper scheduling of jobs, there are limits on the number of nodes and cores that a group can use at the same time. You may check those limits using the 'bsc_queues' command. If you need to run an execution bigger than the limits already granted, you may contact us.
The AI training node "huatrain1" doesn't use an InfiniBand connection. If you are running a multi-node MPI job intended for the general purpose nodes, performance will be degraded if that particular node is assigned to your job.
If you don't intend to use the AI training node, you must explicitly exclude it in your jobscript using the following job directive:
#SBATCH --exclude=huatrain1
These are the basic commands to submit and manage jobs:
sbatch <job_script>
submits a “job script” to the queue system (see Job directives).
squeue
shows all the submitted jobs.
scancel <job_id>
removes the job from the queue system, canceling the execution of its processes if they are still running.
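The submit, check and cancel cycle can be scripted by capturing the job ID that sbatch prints on submission ("Submitted batch job 36" in the examples in this guide). A minimal sketch follows; `job_id_from_sbatch` is a hypothetical helper, and a sample output line stands in for a real submission so that the snippet runs anywhere.

```shell
# Extract the numeric job ID from sbatch's confirmation line,
# e.g. "Submitted batch job 36".
job_id_from_sbatch() {
    echo "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# On the cluster this would be: out=$(sbatch job.cmd)
out="Submitted batch job 36"     # sample output, used so the sketch runs anywhere
jobid=$(job_id_from_sbatch "$out")

echo "squeue -j $jobid"    # command that shows only this job
echo "scancel $jobid"      # command that cancels it
```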
For an allocating command (salloc in the example below), if the --x11 flag is set the job will be handled as graphical (X11 forwarding is set up on the allocation) and you will be able to execute graphical commands. As long as you do not close the current terminal, you will get a graphical window.
salloc -J interactive --x11
Also, X11 forwarding can be set through interactive sessions.
Allocation of an interactive session in the interactive partition has to be done through SLURM:
salloc -p interactive
A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script and have to conform to the sbatch syntax.
sbatch syntax is of the form:
#SBATCH --directive=value
Additionally, the job script may contain a set of commands to execute. If not, an external script may be provided with the 'executable' directive. Here you may find the most common directives:
To request the queue for the job. If it is not specified, Slurm will use the user's default queue. The debug queue is only intended for small tests.
#SBATCH --time=DD-HH:MM:SS
The limit of wall clock time. This is a mandatory field: you must set it to a value greater than the real execution time of your application and smaller than the time limits granted to the user. Notice that your job will be killed once this time has elapsed.
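Slurm accepts wall clock limits in the form days-hours:minutes:seconds, among others. The sketch below converts an estimated run time in seconds into that form after adding a safety margin; `to_slurm_time` is a hypothetical helper, and the 20% margin is an arbitrary choice for illustration.

```shell
# Hypothetical helper: turn a number of seconds into Slurm's
# days-hours:minutes:seconds time format.
to_slurm_time() {
    printf '%d-%02d:%02d:%02d\n' \
        $(( $1 / 86400 )) $(( $1 % 86400 / 3600 )) \
        $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# Estimated run time of 5400 s plus a 20% safety margin (6480 s).
estimate=5400
to_slurm_time $(( estimate * 120 / 100 ))   # prints 0-01:48:00
```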
#SBATCH -D pathname
The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.
#SBATCH --error=file
The name of the file to collect the standard error output (stderr) of the job.
#SBATCH --output=file
The name of the file to collect the standard output (stdout) of the job.
#SBATCH -N number
The number of requested nodes.
Please keep in mind that this parameter will enforce the exclusivity of the nodes in case you request more than one.
For example, let's say that we specify just one node with this parameter, but we only want to use two tasks that use only one core each:
#SBATCH -N 1
#SBATCH -n 2
#SBATCH -c 1
This will request only the resources actually needed, which are just two cores. The remaining resources of the node will be left available for other users. If we request more than one node (and we leave the other parameters untouched) like this:
#SBATCH -N 2
#SBATCH -n 2
#SBATCH -c 1
It will request the exclusivity of both nodes. This means that it will request the full resources of both nodes, and they won't be shared with other users. It's important to be aware of this behavior, as it will charge the CPU time of all those resources to your computation time budget (in case you have one), even if you specified the use of only two cores. With this being said, we can continue with the description of the remaining SLURM parameters.
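The difference between the two requests above can be expressed as a small calculation. This is a sketch of the rule as described in this section, assuming 128-core general purpose nodes; `billed_cores` is a hypothetical helper, not part of Slurm.

```shell
# Hypothetical helper mirroring the rule described above (128-core nodes assumed).
# $1 = nodes (-N), $2 = tasks (-n), $3 = cores per task (-c)
billed_cores() {
    if [ "$1" -gt 1 ]; then
        echo $(( $1 * 128 ))     # multi-node request: whole nodes are billed
    else
        echo $(( $2 * $3 ))      # single node: only the requested cores
    fi
}

billed_cores 1 2 1   # prints 2
billed_cores 2 2 1   # prints 256
```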
#SBATCH --ntasks=number
The number of processes to start.
Optionally, you can specify how many threads each process will open with the directive:
#SBATCH --cpus-per-task=number
The number of cores assigned to the job will be total_tasks * cpus_per_task.
#SBATCH --ntasks-per-node=number
The number of tasks assigned to a node.
#SBATCH --ntasks-per-socket=number
The number of tasks assigned to a socket.
#SBATCH --reservation=reservation_name
The reservation where your jobs will be allocated (assuming that your account has access to that reservation). On some occasions, node reservations can be granted for executions where only a set of accounts can run jobs. This is useful for courses.
#SBATCH --array=<indexes>
Submit a job array: multiple jobs to be executed with identical parameters. The indexes specification identifies which array index values should be used. Multiple values may be specified using a comma-separated list and/or a range of values with a "-" separator. Job arrays will have two additional environment variables set. SLURM_ARRAY_JOB_ID will be set to the first job ID of the array. SLURM_ARRAY_TASK_ID will be set to the job array index value. For example:
sbatch --array=1-3 job.cmd
Submitted batch job 36
Will generate a job array containing three jobs and then the environment variables will be set as follows:
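# Job 1
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=1
# Job 2
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=2
# Job 3
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=3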
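Inside the job script, the array index is typically used to select a different input for each job of the array. A minimal sketch of such a script body follows; the `input_<index>.dat` naming scheme and the `input_for_task` helper are assumptions made for illustration, and the `:-` defaults exist only so the snippet also runs outside Slurm.

```shell
# Sketch of the body of an array job script (e.g. job.cmd).
# input_<index>.dat is a hypothetical naming scheme for per-task inputs.
input_for_task() {
    echo "input_$1.dat"
}

# The :- defaults only apply when the script runs outside Slurm.
task_id=${SLURM_ARRAY_TASK_ID:-1}
array_id=${SLURM_ARRAY_JOB_ID:-0}

echo "array job ${array_id}, task ${task_id}: processing $(input_for_task "$task_id")"
```

Submitted with `sbatch --array=1-3 job.cmd`, each of the three jobs would pick up its own input file.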
#SBATCH --exclusive
To request exclusive use of a compute node without sharing its resources with other users. This only applies to jobs requesting less than one full node. All jobs requesting more cores than a full node will automatically use all requested nodes in exclusive mode.
|Variable||Meaning|
|SLURM_JOBID||Specifies the job ID of the executing job|
|SLURM_NPROCS||Specifies the total number of processes in the job|
|SLURM_NNODES||Is the actual number of nodes assigned to run your job|
|SLURM_PROCID||Specifies the MPI rank (or relative process ID) for the current process. The range is from 0-(SLURM_NPROCS-1)|
|SLURM_NODEID||Specifies relative node ID of the current job. The range is from 0-(SLURM_NNODES-1)|
|SLURM_LOCALID||Specifies the node-local task ID for the process within a job|
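These variables can be read from within the job script or from the processes it launches. As a sketch, the hypothetical function below prints a one-line summary from the first task only (SLURM_PROCID equal to 0); the `:-` defaults are there only so it can also be tried outside Slurm.

```shell
# Print a short job summary, but only from the first task (rank 0).
# Defaults after :- only apply when running outside Slurm.
job_summary() {
    if [ "${SLURM_PROCID:-0}" -eq 0 ]; then
        echo "job ${SLURM_JOBID:-unknown}: ${SLURM_NPROCS:-1} task(s) on ${SLURM_NNODES:-1} node(s)"
    fi
}

job_summary
```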
For more information:
Example for a sequential job:
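A minimal sketch of such a job script, written to ptest.cmd from the shell. The job name, output file names, time limit and the binary `./serial_binary` are placeholders chosen for illustration, not site defaults; `%j` in the file names is replaced by Slurm with the job ID.

```shell
# Write a minimal sequential job script to ptest.cmd.
# Names, time limit and ./serial_binary are illustrative placeholders.
cat > ptest.cmd <<'EOF'
#!/bin/bash
#SBATCH --job-name=test_serial
#SBATCH -D .
#SBATCH --output=serial_%j.out
#SBATCH --error=serial_%j.err
#SBATCH --ntasks=1
#SBATCH --time=00:02:00

./serial_binary > serial.out
EOF
```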
The job would be submitted using:
> sbatch ptest.cmd
Examples for a parallel job:
- Running a pure OpenMP job on one node using 128 cores on the debug queue:
- Running on two nodes using a pure MPI job:
- Running a hybrid MPI+OpenMP job on two general purpose nodes with 16 MPI tasks (8 per node), each using 16 cores via OpenMP:
srun ./parallel_binary > parallel.output
- Running on an AI Training node with 1 task per node, which uses 192 cores:
srun ./parallel_binary > parallel.output
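For the hybrid MPI+OpenMP case in the list above, the directive values follow from the layout: 2 nodes with 8 tasks each give 16 tasks, and 8 tasks of 16 cores fill a 128-core node. The sketch below derives those values and writes a job script with them. The file name hybrid.cmd, the job name and the time limit are placeholders, and setting OMP_NUM_THREADS to the cores per task is a common convention rather than something this guide prescribes.

```shell
# Layout from the hybrid example above; 128 cores per general purpose node assumed.
nodes=2
tasks_per_node=8
cpus_per_task=16
ntasks=$(( nodes * tasks_per_node ))                   # 16 MPI tasks in total
cores_per_node=$(( tasks_per_node * cpus_per_task ))   # 128, i.e. a full node

# Write the job script; job name, time limit and binary are placeholders.
cat > hybrid.cmd <<EOF
#!/bin/bash
#SBATCH --job-name=test_hybrid
#SBATCH -N ${nodes}
#SBATCH --ntasks=${ntasks}
#SBATCH --ntasks-per-node=${tasks_per_node}
#SBATCH --cpus-per-task=${cpus_per_task}
#SBATCH --exclude=huatrain1
#SBATCH --time=00:10:00

export OMP_NUM_THREADS=${cpus_per_task}
srun ./parallel_binary > parallel.output
EOF

echo "cores used per node: ${cores_per_node}"
```

The `--exclude=huatrain1` line keeps the AI training node out of the allocation, since this job is meant for the general purpose nodes.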
Interpreting job status and reason codes
When using squeue, Slurm will report back the status of your launched jobs. If a job is still waiting to enter execution, its status is followed by the reason. Slurm uses codes to display this information, so this section covers the meaning of the most relevant ones.
Job state codes
This list contains the usual state codes for jobs that have been submitted:
- COMPLETED (CD): The job has completed the execution.
- COMPLETING (CG): The job is finishing, but some processes are still active.
- FAILED (F): The job terminated with a non-zero exit code.
- PENDING (PD): The job is waiting for resource allocation. This is the most common state right after running "sbatch"; the job will run eventually.
- PREEMPTED (PR): The job was terminated because of preemption by another job.
- RUNNING (R): The job is allocated and running.
- SUSPENDED (S): A running job has been stopped with its cores released to other jobs.
- STOPPED (ST): A running job has been stopped with its cores retained.
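When post-processing squeue output in scripts, the compact codes listed above can be mapped back to their full names. A minimal sketch, limited to the codes in this list; `state_name` is a hypothetical helper.

```shell
# Map the compact state codes listed above to their full names.
state_name() {
    case "$1" in
        CD) echo COMPLETED ;;
        CG) echo COMPLETING ;;
        F)  echo FAILED ;;
        PD) echo PENDING ;;
        PR) echo PREEMPTED ;;
        R)  echo RUNNING ;;
        S)  echo SUSPENDED ;;
        ST) echo STOPPED ;;
        *)  echo UNKNOWN ;;    # code not covered by this list
    esac
}

state_name PD   # prints PENDING
```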
Job reason codes
This list contains the most common reason codes of the jobs that have been submitted and are still not in the running state:
- Priority: One or more higher priority jobs are in the queue ahead of yours. Your job will eventually run.
- Dependency: This job is waiting for a job it depends on to complete and will run afterwards.
- Resources: The job is waiting for resources to become available and will eventually run.
- InvalidAccount: The job’s account is invalid. Cancel the job and resubmit with the correct account.
- InvalidQOS: The job’s QoS is invalid. Cancel the job and resubmit with the correct QoS.
- QOSGrpCpuLimit: All CPUs assigned to your job’s specified QoS are in use; job will run eventually.
- QOSGrpMaxJobsLimit: The maximum number of jobs for your job’s QoS has been reached; job will run eventually.
- QOSGrpNodeLimit: All nodes assigned to your job’s specified QoS are in use; job will run eventually.
- PartitionCpuLimit: All CPUs assigned to your job’s specified partition are in use; job will run eventually.
- PartitionMaxJobsLimit: The maximum number of jobs for your job’s partition has been reached; job will run eventually.
- PartitionNodeLimit: All nodes assigned to your job’s specified partition are in use; job will run eventually.
- AssociationCpuLimit: All CPUs assigned to your job’s specified association are in use; job will run eventually.
- AssociationMaxJobsLimit: The maximum number of jobs for your job’s association has been reached; job will run eventually.
- AssociationNodeLimit: All nodes assigned to your job’s specified association are in use; job will run eventually.
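The reasons above fall into two groups: those that clear on their own, and those that need the job to be cancelled and resubmitted. A sketch of that split, covering only the codes in this list; `reason_action` is a hypothetical helper, and the invalid QoS reason is matched with Slurm's spelling, InvalidQOS.

```shell
# Suggest what to do for a pending job, based on the reason codes listed above.
reason_action() {
    case "$1" in
        InvalidAccount|InvalidQOS)
            echo "cancel and resubmit" ;;
        Priority|Dependency|Resources)
            echo "wait" ;;
        QOSGrp*Limit|Partition*Limit|Association*Limit)
            echo "wait" ;;     # limits that free up as other jobs finish
        *)
            echo "unknown" ;;  # reason not covered by this list
    esac
}

reason_action QOSGrpCpuLimit    # prints wait
reason_action InvalidAccount    # prints cancel and resubmit
```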
Resource usage and job priorities
Projects are assigned a certain amount of compute hours, or core hours, that are available to use. One core hour is the computing time of one core for one hour. That is, a full node with 128 cores running a job for one hour will use up 128 core hours from the assigned budget. The accounting is based solely on the amount of compute hours used.
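The accounting rule can be written as a one-line calculation. This sketch assumes jobs that hold whole 128-core nodes; `core_hours` is a hypothetical helper, not an accounting tool provided by the cluster.

```shell
# Core hours consumed by a job holding whole nodes.
# $1 = nodes used exclusively, $2 = wallclock hours; 128 cores per node assumed.
core_hours() {
    echo $(( $1 * 128 * $2 ))
}

core_hours 1 1    # prints 128
core_hours 2 3    # prints 768
```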
The priority of a job, and therefore its scheduling in the queues, is determined by a multitude of factors. The most influential ones are the fairshare between groups, the waiting time in the queues and the job size. Our Slurm systems generally favor large executions, so jobs using more cores have a higher priority. Waiting time is taken into account as well: jobs gain priority the longer they wait. Finally, our queue system implements a fairshare policy between groups. Groups that have not run many jobs or consumed many compute hours get a higher priority for their jobs than groups with high usage. This allows everyone their fair share of compute time and the option to run jobs without one group or another being favoured. You can review your current fair share score using the command: