
CPU Affinity

Introduction to CPU Affinity and Distribution in SLURM

Optimizing CPU usage is crucial for maximum performance and efficiency in high-performance computing environments. One way to do this is by using the CPU affinity and distribution flags in SLURM.

CPU affinity refers to binding a process (a Slurm task, requested with --ntasks) to a specific core or set of cores. By binding a process to specific cores, users can improve performance by reducing communication latency and cache misses and by improving CPU utilization, among other benefits.
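At the operating-system level, the affinity of a process is simply the set of cores it is allowed to run on. As a quick, minimal illustration (not part of the examples below), the current mask can be inspected from Python on Linux:

 import os

 # Set of core IDs this process is currently allowed to run on (Linux-only call)
 print(sorted(os.sched_getaffinity(0)))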

Distribution, on the other hand, refers to how the threads of each process (the CPUs requested with --cpus-per-task in Slurm) are spread across cores, sockets and nodes. Distributing work across multiple cores using threads can improve performance by increasing parallelism and reducing wait times.

In SLURM, the flags for controlling CPU affinity and distribution are the following:

  • --cpu-bind: Users can bind their processes to specific cores or core sets.
  • --distribution: Users can specify how their threads should be distributed across cores, sockets or nodes.
danger

CPU binding is honoured only when full nodes are allocated or the --exclusive flag is set

Cpu-bind

The --cpu-bind option in SLURM lets users bind their processes to specific cores or core sets. By default, SLURM does not bind processes to any specific cores, but with the --cpu-bind option, users can choose from a variety of bind options, including:

  • none: No CPU binding is specified, and the processes can run on any available core (Default option).
  • cores: The processes are bound to physical cores, i.e. to all the hardware threads of each core they are allocated.
  • ldoms: The processes are bound to all the cores of a NUMA domain.
  • threads: The processes are bound to hardware threads.
  • rank: Each process is bound according to its task rank on the node: the lowest-ranked task on the node is bound to core 0, the next to core 1, and so on (e.g. following MPI rank order).
  • rank_ldom: The processes are bound to the NUMA domain corresponding to their task rank on the node (e.g. MPI rank).
  • sockets: The processes are bound to all the cores of a socket.
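As with --distribution below, the binding type is passed directly to srun, optionally prefixed with verbose to make Slurm report the masks it applies (only the types described above are listed; see the srun manual for the full set):

 --cpu-bind=[{quiet|verbose},]{none|cores|ldoms|threads|rank|rank_ldom|sockets|...}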
note

In the case of Marenostrum 4, each node of the machine has two sockets, each with a CPU that has twenty-four cores. That means that each node has forty-eight cores.

Marenostrum 4 has Hyper-Threading deactivated, so only one thread can run on each physical core at a time. If a binding places multiple processes or threads on the same physical core, they have to share it, running concurrently rather than in parallel. Since each core exposes a single hardware thread, binding on cores or on threads will always result in the same affinity mask.

In our case, each NUMA domain contains all the cores of one CPU (socket), so binding on sockets or on ldoms will always result in the same affinity mask.

Cpu-bind Examples
We allocated two nodes with the command 'salloc -N2 --exclusive'. Remember that Slurm only honours binding when full nodes are allocated. This allocation can also be specified with the appropriate #SBATCH pragmas inside a job script.
We want to run a job on 2 nodes, with 1 task per node and 5 CPUs per task, using the default binding. We will execute a Python script that starts five threads, each outputting CoreId : [CPU_MASK] : NodeId. We will run the script with "srun -l -n 2 --cpus-per-task=5 python threadsleep.py". The -l flag prepends the task's Id to each output line, so the result is PythonProcessId : CoreId : [CPU_MASK] : NodeId. We did not specify any binding, so Slurm will use the default, none.
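The script itself is not reproduced on this page; a minimal sketch of what it could look like is shown below. The use of SLURM_CPUS_PER_TASK, SLURM_NODEID and the glibc sched_getcpu call are our assumptions for illustration, not the actual contents of threadsleep.py.

 import ctypes
 import os
 import threading
 import time

 libc = ctypes.CDLL("libc.so.6", use_errno=True)   # glibc, Linux only

 def report():
     core = libc.sched_getcpu()                    # core this thread is running on right now
     mask = sorted(os.sched_getaffinity(0))        # affinity mask of the whole process
     node = os.environ.get("SLURM_NODEID", "?")    # node index within the allocation
     print(f"{core} : {mask} : {node}", flush=True)
     time.sleep(30)                                # keep the thread alive for a while

 # Start one thread per CPU given to this task (--cpus-per-task)
 n = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
 threads = [threading.Thread(target=report) for _ in range(n)]
 for t in threads:
     t.start()
 for t in threads:
     t.join()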

Here we can see that each Python process runs on a different node, and each process creates five threads, each with a CPU_MASK that allows it to execute on any core of its node. With this CPU_MASK, the threads can migrate between cores during execution, which can hurt performance.

PythonId  CoreId  CPU_MASK                                                               NodeId
0         7       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
0         12      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
0         16      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
0         27      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
0         29      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
1         3       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  1
1         4       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  1
1         23      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  1
1         32      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  1
1         35      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  1
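For comparison, the same test can be repeated with an explicit binding, so that each task is confined to the cores it was allocated (we do not reproduce the output here):

 srun -l -n 2 --cpus-per-task=5 --cpu-bind=cores python threadsleep.py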

Distribution

The --distribution flag in SLURM lets users specify how processes and threads should be distributed across nodes and sockets. The following distribution options are available:

  • block: The threads are block distributed, with each node/socket receiving a contiguous block of threads (Default at node level).
  • cyclic: The threads are distributed round-robin across the allocated nodes/sockets (Default at socket level).
  • fcyclic: The threads are distributed round-robin across consecutive sockets.
  • plane: The threads are distributed in batches of size SLURM_DIST_PLANESIZE across the allocated nodes.
  • arbitrary: The processes are distributed across the allocated nodes as specified in the file pointed to by SLURM_HOSTFILE=path/to/file.
info

The SLURM_HOSTFILE must contain one node per line: line N (counting from 0) names the node on which the task with ProcessId = N should run. If a Slurm job has 12 processes on two nodes, the hostfile has 12 lines, each naming the node that the corresponding process should be placed on.
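As a hypothetical illustration, for an allocation with two nodes called node001 and node002 and four tasks, the following hostfile would place tasks 0, 1 and 3 on node001 and task 2 on node002 (the node names are placeholders):

 node001
 node001
 node002
 node001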

The usage is the following, with * being the default:

 --distribution={*|block|cyclic|arbitrary|plane}[:{*|block|cyclic|fcyclic}]

The first field controls the distribution across nodes, and the second one, after the ":", controls the distribution at socket level. Note that plane and arbitrary can only be used for the node-level distribution, and fcyclic only at socket level.
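For instance (with ./app as a placeholder application), the following distributes tasks round-robin across the allocated nodes and block-wise across the sockets within each node:

 srun -n 4 --distribution=cyclic:block ./app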

Distribution Examples
With the same Python script as before, we run without any binding, on 2 nodes, with 3 processes in total and 4 threads per process, changing only the distribution to check the behaviour (srun -l -n 3 --cpus-per-task=4 --distribution=*:* python threadsleep.py). We choose three tasks to visualize the distribution better.

This mask behaviour is expected because Block:Cyclic distribution is the default, so it is the same test as the first example on cpu-bind but with a different number of processes and threads.

PythonId  CoreId  CPU_MASK                                                               NodeId
0         16      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
0         17      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
0         19      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
0         31      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
1         18      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
1         20      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
1         29      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
1         32      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  0
2         7       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  1
2         32      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  1
2         33      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  1
2         40      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47]  1

Let us mix things up!

Slurm allows the user to combine --distribution and --cpu-bind in the same command. To keep this documentation concise and to the point, we do not go through the details and nuances of every possible permutation; finding which configuration suits a job best is left as an exercise for the reader.
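As a starting point, both mechanisms can simply be given on the same srun line, for example (reusing the test script from the previous sections):

 srun -l -n 4 --cpus-per-task=4 --cpu-bind=cores --distribution=cyclic:cyclic python threadsleep.py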

Still with unanswered questions?

For more detailed documentation, please refer to the official SLURM documentation (distribution, cpu-bind, multicore-multithreaded architectures). Alternatively, you can contact our Support team at:

  • support AT bsc DOT es