CPU Affinity
Introduction to CPU Affinity and Distribution in SLURM
Optimizing CPU usage is crucial for maximum performance and efficiency in high-performance computing environments. One way to do this is by using the CPU affinity and distribution flags in SLURM.
CPU affinity refers to binding a process (in Slurm, a task requested with --ntasks) to a specific core or set of cores. By binding a process to specific cores, users can improve performance by reducing communication latency and cache misses and by improving CPU utilization, among other benefits.
Distribution, on the other hand, refers to how the threads of each process (in Slurm, requested with --cpus-per-task) are spread across multiple CPUs or nodes. Distributing work across multiple cores using threads can improve performance by increasing parallelism and reducing wait times.
In SLURM, the flags for controlling CPU affinity and distribution are the following:
- --cpu-bind: Users can bind their processes to specific cores or sets of cores.
- --distribution: Users can specify how their threads should be distributed across cores, sockets or nodes.
CPU binding is honoured only when full nodes are allocated or the --exclusive flag is set.
Cpu-bind
The --cpu-bind option in SLURM lets users bind their processes to specific cores or sets of cores. By default, SLURM does not bind processes to any specific cores, but with the --cpu-bind option, users can choose from a variety of binding options, including:
- none: No CPU binding is applied, and the processes can run on any available core (default option).
- cores: The processes are bound to all the cores of a CPU.
- ldoms: The processes are bound to all the cores of a NUMA domain.
- threads: The processes are bound to hardware threads.
- rank: Each process is bound to the core matching its rank within the node (e.g. its MPI rank).
- rank_ldom: Each process is bound to the NUMA domain matching its rank within the node (e.g. its MPI rank).
- sockets: The processes are bound to all the cores of a socket.
In the case of Marenostrum 4, each node of the machine has two sockets, each holding a CPU with twenty-four cores, so every node provides forty-eight cores in total.
Marenostrum 4 has Hyper-Threading deactivated, so only one thread can run on each physical core at a time. If a binding places multiple processes or threads on the same physical core, they have to share it, time-slicing instead of running in parallel. It also means that binding on cores or on threads will always result in the same affinity mask.
In our case, the NUMA domains contain all the cores of each CPU, so binding on cores or on ldoms will always result in the same affinity mask as well.
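The examples that follow use a small script called threadsleep.py. The original script is not reproduced here, but a minimal sketch along the following lines (the thread count, sleep time and output format are assumptions matching the tables below) would let you observe each thread's affinity mask:

```python
import os
import threading
import time

def report(tid):
    # The process affinity mask applies to each of its threads;
    # sched_getaffinity is a Linux-only call
    mask = sorted(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else []
    print(f"ThreadId={tid} CPU_MASK={mask}")
    time.sleep(0.1)  # keep the thread alive long enough to be observed

# Spawn five threads, matching --cpus-per-task=5 in the examples below
threads = [threading.Thread(target=report, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Running it under srun with different --cpu-bind values changes the mask each thread reports.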
Cpu-bind Examples
- None[Default]
- Cores
- Sockets
- Rank
Here we can see that each Python process is on a different node, and each process creates five threads, each with a CPU_MASK that allows it to execute on any of the forty-eight cores. With this CPU_MASK, the threads can migrate between cores, which can hurt performance.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 7 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 12 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 16 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 27 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 29 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
1 | 3 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
1 | 4 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
1 | 23 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
1 | 32 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
1 | 35 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
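A command along these lines could produce the layout above (the node and task counts are inferred from the table; the exact original command is not shown in this page):

```shell
# Two tasks on two nodes, five threads each; no binding (the default)
srun -N 2 -n 2 --cpus-per-task=5 --cpu-bind=none python threadsleep.py
```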
Here we can see that each Python process is on a different node, and each process creates five threads, each with a CPU_MASK that restricts execution to the first five cores of its node.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 2, 3, 4] | 0 |
0 | 1 | [0, 1, 2, 3, 4] | 0 |
0 | 2 | [0, 1, 2, 3, 4] | 0 |
0 | 3 | [0, 1, 2, 3, 4] | 0 |
0 | 4 | [0, 1, 2, 3, 4] | 0 |
1 | 0 | [0, 1, 2, 3, 4] | 1 |
1 | 1 | [0, 1, 2, 3, 4] | 1 |
1 | 2 | [0, 1, 2, 3, 4] | 1 |
1 | 3 | [0, 1, 2, 3, 4] | 1 |
1 | 4 | [0, 1, 2, 3, 4] | 1 |
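This layout corresponds to a command along these lines (flags inferred from the table, not the verbatim original):

```shell
# Same job, but each task is now bound to the five cores it was allocated
srun -N 2 -n 2 --cpus-per-task=5 --cpu-bind=cores python threadsleep.py
```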
If we execute the previous example with --cpu-bind=sockets, each process is bound to the first socket of its node (24 cores), and each of the five threads in the process can run on any core of that socket. If a single node is used, the first process is bound to the first socket and the second process to the second socket. SLURM ensures that no two processes in the running state are pinned to the same core.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 1 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 0 |
0 | 2 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 0 |
0 | 3 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 0 |
0 | 4 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 0 |
0 | 5 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 0 |
1 | 11 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 1 |
1 | 12 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 1 |
1 | 13 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 1 |
1 | 14 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 1 |
1 | 15 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 1 |
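The socket-bound run corresponds to a command of roughly this shape (an illustration, not the verbatim original):

```shell
# Each task's mask covers the whole 24-core socket it lands on
srun -N 2 -n 2 --cpus-per-task=5 --cpu-bind=sockets python threadsleep.py
```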
For our last example, we change the binding to rank. Rank binding sets the CPU_MASK to the core whose id equals the rank of the Python process relative to the node where it runs, so all the threads of a process run on that single core. That is, Python process 0 is pinned to core 0 of the first node, Python process 1 is pinned to core 0 of the second node, and all the threads within each process also run on that core. If the same job runs on a single node, the first process and its threads run on core 0, while the second process and its threads run on core 1.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0] | 0 |
0 | 0 | [0] | 0 |
0 | 0 | [0] | 0 |
0 | 0 | [0] | 0 |
0 | 0 | [0] | 0 |
1 | 0 | [0] | 1 |
1 | 0 | [0] | 1 |
1 | 0 | [0] | 1 |
1 | 0 | [0] | 1 |
1 | 0 | [0] | 1 |
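A command along these lines would reproduce the single-core pinning above (flags inferred from the table):

```shell
# Each task, and all of its threads, is pinned to the one core matching its rank on the node
srun -N 2 -n 2 --cpus-per-task=5 --cpu-bind=rank python threadsleep.py
```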
Distribution
The --distribution flag in SLURM lets users specify how processes and threads should be distributed across nodes and sockets. The following distribution options are available:
- block: The threads are distributed in blocks, with each node/socket receiving a contiguous block of threads (default at node level).
- cyclic: The threads are distributed round-robin across the available nodes/sockets (default at socket level).
- fcyclic: The threads are distributed round-robin across consecutive sockets.
- plane: The threads are distributed in batches of SLURM_DIST_PLANESIZE across the allocated nodes.
- arbitrary: The processes are distributed arbitrarily across the allocated nodes, as specified in the file given by SLURM_HOSTFILE=path/to/file.
The format of the SLURM_HOSTFILE is one NodeId per line: line number N (counting from 0) names the node to which the process with ProcessId = N should be bound. If a Slurm job has 12 processes on two nodes, the hostfile will have 12 lines, each containing a NodeId according to where the user wants each process to go.
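As an illustration (the node names are hypothetical), a hostfile sending processes 0 and 2 to the first node and process 1 to the second could be built and used like this:

```shell
# Hypothetical hostfile: line N holds the node for the process with id N
cat > hostfile.txt <<'EOF'
node0
node1
node0
EOF

SLURM_HOSTFILE=hostfile.txt srun -n 3 --distribution=arbitrary python threadsleep.py
```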
The usage is the following, with * being the default:
--distribution={*|block|cyclic|arbitrary|plane}[:{*|block|cyclic|fcyclic}]
The first field refers to the node-level distribution, and the second one, after the ":", sets the distribution at socket level. Note that plane and arbitrary can only be used to control node distribution, and fcyclic only applies at socket level.
Distribution Examples
- Block:Cyclic[Default]
- Block:Cyclic and binding
- Block:block
- Block:fcyclic
- Cyclic:Block
- Plane:block
This mask behaviour is expected because Block:Cyclic is the default distribution, so this is the same test as the first cpu-bind example, only with a different number of processes and threads.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 16 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 17 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 19 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 31 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
1 | 18 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
1 | 20 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
1 | 29 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
1 | 32 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
2 | 7 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
2 | 32 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
2 | 33 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
2 | 40 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
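Based on the description above, a command of roughly this shape (an illustration assuming a two-node allocation, not the verbatim original) would produce this layout:

```shell
# Default distribution (block across nodes, cyclic across sockets), no binding
srun -n 3 --cpus-per-task=4 --distribution=block:cyclic python threadsleep.py
```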
We can see that process one now binds to the second socket of the first node, while processes zero and two bind to the first socket of their respective nodes. Moreover, now that we have set --cpu-bind=cores, each thread can only execute on four consecutive cores.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 2, 3] | 0 |
0 | 1 | [0, 1, 2, 3] | 0 |
0 | 2 | [0, 1, 2, 3] | 0 |
0 | 3 | [0, 1, 2, 3] | 0 |
1 | 24 | [24, 25, 26, 27] | 0 |
1 | 25 | [24, 25, 26, 27] | 0 |
1 | 26 | [24, 25, 26, 27] | 0 |
1 | 27 | [24, 25, 26, 27] | 0 |
2 | 0 | [0, 1, 2, 3] | 1 |
2 | 1 | [0, 1, 2, 3] | 1 |
2 | 2 | [0, 1, 2, 3] | 1 |
2 | 3 | [0, 1, 2, 3] | 1 |
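This run adds core binding to the default distribution; a command along these lines (an illustration) would reproduce it:

```shell
# Same distribution, but each task is now bound to its four allocated cores
srun -n 3 --cpus-per-task=4 --distribution=block:cyclic --cpu-bind=cores python threadsleep.py
```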
Now that we distribute in blocks at both node and socket levels, Slurm places processes 0 and 1 on node 0 and process 2 on node 1, all on the first socket. Changing one small option alters the result considerably.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 2, 3] | 0 |
0 | 1 | [0, 1, 2, 3] | 0 |
0 | 2 | [0, 1, 2, 3] | 0 |
0 | 3 | [0, 1, 2, 3] | 0 |
1 | 4 | [4, 5, 6, 7] | 0 |
1 | 5 | [4, 5, 6, 7] | 0 |
1 | 6 | [4, 5, 6, 7] | 0 |
1 | 7 | [4, 5, 6, 7] | 0 |
2 | 0 | [0, 1, 2, 3] | 1 |
2 | 1 | [0, 1, 2, 3] | 1 |
2 | 2 | [0, 1, 2, 3] | 1 |
2 | 3 | [0, 1, 2, 3] | 1 |
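The block:block layout above corresponds to a command of roughly this shape (an illustration):

```shell
# Block at both levels: consecutive tasks fill the first socket before the second is used
srun -n 3 --cpus-per-task=4 --distribution=block:block --cpu-bind=cores python threadsleep.py
```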
The last socket-level distribution method, fcyclic, divides each process's threads equally between the sockets, dealing its cores round-robin across them.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 24, 25] | 0 |
0 | 1 | [0, 1, 24, 25] | 0 |
0 | 24 | [0, 1, 24, 25] | 0 |
0 | 25 | [0, 1, 24, 25] | 0 |
1 | 2 | [2, 3, 26, 27] | 0 |
1 | 3 | [2, 3, 26, 27] | 0 |
1 | 26 | [2, 3, 26, 27] | 0 |
1 | 27 | [2, 3, 26, 27] | 0 |
2 | 0 | [0, 1, 24, 25] | 1 |
2 | 1 | [0, 1, 24, 25] | 1 |
2 | 24 | [0, 1, 24, 25] | 1 |
2 | 25 | [0, 1, 24, 25] | 1 |
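A command along these lines (an illustration) would produce the split masks above:

```shell
# fcyclic: each task's cores are dealt round-robin across the two sockets
srun -n 3 --cpus-per-task=4 --distribution=block:fcyclic --cpu-bind=cores python threadsleep.py
```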
With these options, Slurm applies round-robin to the processes across the nodes and then places each process's threads in a block. With two nodes, odd processes go to node 1 and even processes to node 0.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 2, 3] | 0 |
0 | 1 | [0, 1, 2, 3] | 0 |
0 | 2 | [0, 1, 2, 3] | 0 |
0 | 3 | [0, 1, 2, 3] | 0 |
1 | 0 | [0, 1, 2, 3] | 1 |
1 | 1 | [0, 1, 2, 3] | 1 |
1 | 2 | [0, 1, 2, 3] | 1 |
1 | 3 | [0, 1, 2, 3] | 1 |
2 | 4 | [4, 5, 6, 7] | 0 |
2 | 5 | [4, 5, 6, 7] | 0 |
2 | 6 | [4, 5, 6, 7] | 0 |
2 | 7 | [4, 5, 6, 7] | 0 |
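The cyclic node distribution above corresponds to a command of roughly this shape (an illustration):

```shell
# Cyclic across nodes: consecutive tasks go to alternating nodes
srun -n 3 --cpus-per-task=4 --distribution=cyclic:block --cpu-bind=cores python threadsleep.py
```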
The plane example was run with:
SLURM_DIST_PLANESIZE=1 srun -l -n 3 --cpus-per-task=4 --distribution=plane:cyclic --cpu-bind=cores python threadsleep.py
In the previous examples with block node distribution, processes 0 and 1 went to node 0 and process 2 went to node 1; with a plane size of 1, processes 0 and 2 go to node 0 and process 1 to node 1.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 2, 3] | 0 |
0 | 1 | [0, 1, 2, 3] | 0 |
0 | 2 | [0, 1, 2, 3] | 0 |
0 | 3 | [0, 1, 2, 3] | 0 |
1 | 0 | [0, 1, 2, 3] | 1 |
1 | 1 | [0, 1, 2, 3] | 1 |
1 | 2 | [0, 1, 2, 3] | 1 |
1 | 3 | [0, 1, 2, 3] | 1 |
2 | 4 | [4, 5, 6, 7] | 0 |
2 | 5 | [4, 5, 6, 7] | 0 |
2 | 6 | [4, 5, 6, 7] | 0 |
2 | 7 | [4, 5, 6, 7] | 0 |
Let us mix things up!
We have already seen that Slurm allows the user to combine distribution and CPU binding. To keep this documentation concise and to the point, we do not cover the details and nuances of every possible permutation; finding the configuration that best suits your jobs is left as an exercise for the reader.
For more detailed documentation, please refer to SLURM official wiki (distribution, cpu-bind, multicore-multithreaded architectures). Alternatively, you can contact our Support team at:
- support AT bsc DOT es