CPU Affinity
Introduction to CPU Affinity and Distribution in SLURM
Optimizing CPU usage is crucial for maximum performance and efficiency in high-performance computing environments. One way to do this is by using the CPU affinity and distribution flags in SLURM.
CPU affinity refers to binding a process (in Slurm, a task requested with --ntasks) to a specific core or set of cores. By binding a process to specific cores, users can improve performance by reducing communication latency and cache misses and by improving CPU utilization, among other benefits.
Distribution, on the other hand, refers to how the threads of each process (in Slurm, requested with --cpus-per-task) are spread across multiple CPUs or nodes. Distributing work across multiple cores using threads can improve performance by increasing parallelism and reducing wait times.
In SLURM, the flags for controlling CPU affinity and distribution are the following:
- --cpu-bind: Users can bind their processes to specific cores or sets of cores.
- --distribution: Users can specify how their threads should be distributed across cores, sockets or nodes.
CPU binding is honoured only when full nodes are allocated or the --exclusive flag is set.
Cpu-bind
The --cpu-bind option in SLURM lets users bind their processes to specific cores or sets of cores. By default, SLURM does not bind processes to any specific cores, but with the --cpu-bind option, users can choose from a variety of binding options, including:
- none: No CPU binding is applied, and the processes can run on any available core (default option).
- cores: The processes are bound to all the cores of a CPU.
- ldoms: The processes are bound to all the cores of a NUMA domain.
- threads: The processes are bound to hardware threads.
- rank: Each process is bound to the core matching its rank within the node (e.g. its MPI rank).
- rank_ldom: Each process is bound to the NUMA domain matching its rank within the node (e.g. its MPI rank).
- sockets: The processes are bound to all the cores of a socket.
In the case of Marenostrum 4, each node of the machine has two sockets, each holding a CPU with twenty-four cores, so every node provides forty-eight cores in total.
Marenostrum 4 has Hyper-Threading deactivated, so only one thread can run on each physical core at a time. If a binding places multiple processes or threads on the same physical core, they have to share it, time-slicing instead of running in parallel. It also means that binding on cores or on threads will always result in the same affinity mask.
In our case, the NUMA domains contain all the cores of each CPU, so binding on cores or on ldoms will always result in the same affinity mask as well.
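The examples that follow use a small script called threadsleep.py. The original script is not reproduced here, but a minimal sketch along the following lines (the thread count, sleep time and output format are assumptions matching the tables below) would let you observe each thread's affinity mask:

```python
import os
import threading
import time

def report(tid):
    # The process affinity mask applies to each of its threads;
    # sched_getaffinity is a Linux-only call
    mask = sorted(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else []
    print(f"ThreadId={tid} CPU_MASK={mask}")
    time.sleep(0.1)  # keep the thread alive long enough to be observed

# Spawn five threads, matching --cpus-per-task=5 in the examples below
threads = [threading.Thread(target=report, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Running it under srun with different --cpu-bind values changes the mask each thread reports.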
Cpu-bind Examples
- None[Default]
- Cores
- Sockets
- Rank
Here we can see that each Python process is on a different node, and each process creates five threads, each with a CPU_MASK that allows it to execute on any of the forty-eight cores. With this CPU_MASK, the threads can migrate between cores, which can hurt performance.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 7 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 12 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 16 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 27 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 29 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
1 | 3 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
1 | 4 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
1 | 23 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
1 | 32 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
1 | 35 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
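A command along these lines could produce the layout above (the node and task counts are inferred from the table; the exact original command is not shown in this page):

```shell
# Two tasks on two nodes, five threads each; no binding (the default)
srun -N 2 -n 2 --cpus-per-task=5 --cpu-bind=none python threadsleep.py
```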
Here we can see that each Python process is on a different node, and each process creates five threads, each with a CPU_MASK that restricts execution to the first five cores of its node.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 2, 3, 4] | 0 |
0 | 1 | [0, 1, 2, 3, 4] | 0 |
0 | 2 | [0, 1, 2, 3, 4] | 0 |
0 | 3 | [0, 1, 2, 3, 4] | 0 |
0 | 4 | [0, 1, 2, 3, 4] | 0 |
1 | 0 | [0, 1, 2, 3, 4] | 1 |
1 | 1 | [0, 1, 2, 3, 4] | 1 |
1 | 2 | [0, 1, 2, 3, 4] | 1 |
1 | 3 | [0, 1, 2, 3, 4] | 1 |
1 | 4 | [0, 1, 2, 3, 4] | 1 |
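This layout corresponds to a command along these lines (flags inferred from the table, not the verbatim original):

```shell
# Same job, but each task is now bound to the five cores it was allocated
srun -N 2 -n 2 --cpus-per-task=5 --cpu-bind=cores python threadsleep.py
```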
If we execute the previous example with --cpu-bind=sockets, each process is bound to the first socket of its node (24 cores), and each of the five threads in the process can run on any core of that socket. If a single node is used, the first process is bound to the first socket and the second process to the second socket. SLURM ensures that no two processes in the running state are pinned to the same core.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 1 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 0 |
0 | 2 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 0 |
0 | 3 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 0 |
0 | 4 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 0 |
0 | 5 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 0 |
1 | 11 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 1 |
1 | 12 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 1 |
1 | 13 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 1 |
1 | 14 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 1 |
1 | 15 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | 1 |
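The socket-bound run corresponds to a command of roughly this shape (an illustration, not the verbatim original):

```shell
# Each task's mask covers the whole 24-core socket it lands on
srun -N 2 -n 2 --cpus-per-task=5 --cpu-bind=sockets python threadsleep.py
```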
For our last example, we change the binding to rank. Rank binding sets the CPU_MASK to the core whose id equals the rank of the Python process relative to the node where it runs, so all the threads of a process run on that single core. That is, Python process 0 is pinned to core 0 of the first node, Python process 1 is pinned to core 0 of the second node, and all the threads within each process also run on that core. If the same job runs on a single node, the first process and its threads run on core 0, while the second process and its threads run on core 1.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0] | 0 |
0 | 0 | [0] | 0 |
0 | 0 | [0] | 0 |
0 | 0 | [0] | 0 |
0 | 0 | [0] | 0 |
1 | 0 | [0] | 1 |
1 | 0 | [0] | 1 |
1 | 0 | [0] | 1 |
1 | 0 | [0] | 1 |
1 | 0 | [0] | 1 |
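A command along these lines would reproduce the single-core pinning above (flags inferred from the table):

```shell
# Each task, and all of its threads, is pinned to the one core matching its rank on the node
srun -N 2 -n 2 --cpus-per-task=5 --cpu-bind=rank python threadsleep.py
```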
Distribution
The --distribution flag in SLURM lets users specify how processes and threads should be distributed across nodes and sockets. The following distribution options are available:
- block: The threads are distributed in blocks, with each node/socket receiving a contiguous block of threads (default at node level).
- cyclic: The threads are distributed round-robin across the available nodes/sockets (default at socket level).
- fcyclic: The threads are distributed round-robin across consecutive sockets.
- plane: The threads are distributed in batches of SLURM_DIST_PLANESIZE across the allocated nodes.
- arbitrary: The processes are distributed arbitrarily across the allocated nodes, as specified in the file given by SLURM_HOSTFILE=path/to/file.
The format of the SLURM_HOSTFILE is one NodeId per line: line number N (counting from 0) names the node to which the process with ProcessId = N should be bound. If a Slurm job has 12 processes on two nodes, the hostfile will have 12 lines, each containing a NodeId according to where the user wants each process to go.
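As an illustration (the node names are hypothetical), a hostfile sending processes 0 and 2 to the first node and process 1 to the second could be built and used like this:

```shell
# Hypothetical hostfile: line N holds the node for the process with id N
cat > hostfile.txt <<'EOF'
node0
node1
node0
EOF

SLURM_HOSTFILE=hostfile.txt srun -n 3 --distribution=arbitrary python threadsleep.py
```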
The usage is the following, with * being the default:
--distribution={*|block|cyclic|arbitrary|plane}[:{*|block|cyclic|fcyclic}]
The first field refers to the node-level distribution, and the second one, after the ":", sets the distribution at socket level. Note that plane and arbitrary can only be used to control node distribution, and fcyclic only applies at socket level.
Distribution Examples
- Block:Cyclic[Default]
- Block:Cyclic and binding
- Block:block
- Block:fcyclic
- Cyclic:Block
- Plane:block
This mask behaviour is expected because Block:Cyclic is the default distribution, so this is the same test as the first cpu-bind example, only with a different number of processes and threads.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 16 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 17 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 19 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
0 | 31 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
1 | 18 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
1 | 20 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
1 | 29 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
1 | 32 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 0 |
2 | 7 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
2 | 32 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
2 | 33 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
2 | 40 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, [...], 40, 41, 42, 43, 44, 45, 46, 47] | 1 |
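Based on the description above, a command of roughly this shape (an illustration assuming a two-node allocation, not the verbatim original) would produce this layout:

```shell
# Default distribution (block across nodes, cyclic across sockets), no binding
srun -n 3 --cpus-per-task=4 --distribution=block:cyclic python threadsleep.py
```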
We can see that process one now binds to the second socket of the first node, while processes zero and two bind to the first socket of their respective nodes. Moreover, now that we have set --cpu-bind=cores, each thread can only execute on four consecutive cores.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 2, 3] | 0 |
0 | 1 | [0, 1, 2, 3] | 0 |
0 | 2 | [0, 1, 2, 3] | 0 |
0 | 3 | [0, 1, 2, 3] | 0 |
1 | 24 | [24, 25, 26, 27] | 0 |
1 | 25 | [24, 25, 26, 27] | 0 |
1 | 26 | [24, 25, 26, 27] | 0 |
1 | 27 | [24, 25, 26, 27] | 0 |
2 | 0 | [0, 1, 2, 3] | 1 |
2 | 1 | [0, 1, 2, 3] | 1 |
2 | 2 | [0, 1, 2, 3] | 1 |
2 | 3 | [0, 1, 2, 3] | 1 |
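This run adds core binding to the default distribution; a command along these lines (an illustration) would reproduce it:

```shell
# Same distribution, but each task is now bound to its four allocated cores
srun -n 3 --cpus-per-task=4 --distribution=block:cyclic --cpu-bind=cores python threadsleep.py
```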
Now that we distribute in blocks at both node and socket levels, Slurm places processes 0 and 1 on node 0 and process 2 on node 1, all on the first socket. Changing one small option alters the result considerably.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 2, 3] | 0 |
0 | 1 | [0, 1, 2, 3] | 0 |
0 | 2 | [0, 1, 2, 3] | 0 |
0 | 3 | [0, 1, 2, 3] | 0 |
1 | 4 | [4, 5, 6, 7] | 0 |
1 | 5 | [4, 5, 6, 7] | 0 |
1 | 6 | [4, 5, 6, 7] | 0 |
1 | 7 | [4, 5, 6, 7] | 0 |
2 | 0 | [0, 1, 2, 3] | 1 |
2 | 1 | [0, 1, 2, 3] | 1 |
2 | 2 | [0, 1, 2, 3] | 1 |
2 | 3 | [0, 1, 2, 3] | 1 |
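The block:block layout above corresponds to a command of roughly this shape (an illustration):

```shell
# Block at both levels: consecutive tasks fill the first socket before the second is used
srun -n 3 --cpus-per-task=4 --distribution=block:block --cpu-bind=cores python threadsleep.py
```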
The last socket-level distribution method, fcyclic, divides each process's threads equally between the sockets, dealing its cores round-robin across them.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 24, 25] | 0 |
0 | 1 | [0, 1, 24, 25] | 0 |
0 | 24 | [0, 1, 24, 25] | 0 |
0 | 25 | [0, 1, 24, 25] | 0 |
1 | 2 | [2, 3, 26, 27] | 0 |
1 | 3 | [2, 3, 26, 27] | 0 |
1 | 26 | [2, 3, 26, 27] | 0 |
1 | 27 | [2, 3, 26, 27] | 0 |
2 | 0 | [0, 1, 24, 25] | 1 |
2 | 1 | [0, 1, 24, 25] | 1 |
2 | 24 | [0, 1, 24, 25] | 1 |
2 | 25 | [0, 1, 24, 25] | 1 |
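A command along these lines (an illustration) would produce the split masks above:

```shell
# fcyclic: each task's cores are dealt round-robin across the two sockets
srun -n 3 --cpus-per-task=4 --distribution=block:fcyclic --cpu-bind=cores python threadsleep.py
```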
With these options, Slurm applies round-robin to the processes across the nodes and then places each process's threads in a block. With two nodes, odd processes go to node 1 and even processes to node 0.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 2, 3] | 0 |
0 | 1 | [0, 1, 2, 3] | 0 |
0 | 2 | [0, 1, 2, 3] | 0 |
0 | 3 | [0, 1, 2, 3] | 0 |
1 | 0 | [0, 1, 2, 3] | 1 |
1 | 1 | [0, 1, 2, 3] | 1 |
1 | 2 | [0, 1, 2, 3] | 1 |
1 | 3 | [0, 1, 2, 3] | 1 |
2 | 4 | [4, 5, 6, 7] | 0 |
2 | 5 | [4, 5, 6, 7] | 0 |
2 | 6 | [4, 5, 6, 7] | 0 |
2 | 7 | [4, 5, 6, 7] | 0 |
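The cyclic node distribution above corresponds to a command of roughly this shape (an illustration):

```shell
# Cyclic across nodes: consecutive tasks go to alternating nodes
srun -n 3 --cpus-per-task=4 --distribution=cyclic:block --cpu-bind=cores python threadsleep.py
```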
The plane example was run with:
SLURM_DIST_PLANESIZE=1 srun -l -n 3 --cpus-per-task=4 --distribution=plane:cyclic --cpu-bind=cores python threadsleep.py
In the previous examples with block node distribution, processes 0 and 1 went to node 0 and process 2 went to node 1; with a plane size of 1, processes 0 and 2 go to node 0 and process 1 to node 1.
PythonId | CoreId | CPU_MASK | NodeId |
---|---|---|---|
0 | 0 | [0, 1, 2, 3] | 0 |
0 | 1 | [0, 1, 2, 3] | 0 |
0 | 2 | [0, 1, 2, 3] | 0 |
0 | 3 | [0, 1, 2, 3] | 0 |
1 | 0 | [0, 1, 2, 3] | 1 |
1 | 1 | [0, 1, 2, 3] | 1 |
1 | 2 | [0, 1, 2, 3] | 1 |
1 | 3 | [0, 1, 2, 3] | 1 |
2 | 4 | [4, 5, 6, 7] | 0 |
2 | 5 | [4, 5, 6, 7] | 0 |
2 | 6 | [4, 5, 6, 7] | 0 |
2 | 7 | [4, 5, 6, 7] | 0 |
Let us mix things up!
We have already seen that Slurm allows the user to combine distribution and CPU binding. To keep this documentation concise and to the point, we do not cover the details and nuances of every possible permutation; finding the configuration that best suits your jobs is left as an exercise for the reader.
For more detailed documentation, please refer to SLURM official wiki (distribution, cpu-bind, multicore-multithreaded architectures). Alternatively, you can contact our Support team at:
- support AT bsc DOT es