Use the HOSTLIMIT_PER_JOB parameter in lsb.queues to limit the number of hosts that a job can use. For example, if a user submits a parallel job with bsub -n 1,4096 -R "span[ptile=1]", the job requests up to 4096 hosts from the cluster (one slot per host). If you specify a limit of 20 hosts per job, that job is allowed to use at most 20 hosts.
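The arithmetic in this example can be sketched as a small Python model. The function name max_hosts_for_job is hypothetical and not part of LSF. The sketch assumes that span[ptile=N] places N slots on each host (the last host may receive fewer), and that the per-job host limit then caps the resulting host count.

```python
import math


def max_hosts_for_job(max_slots: int, ptile: int, host_limit: int) -> int:
    """Hypothetical helper: most hosts a job can occupy under HOSTLIMIT_PER_JOB.

    Simplified model: with span[ptile=N] each host receives N slots (the last
    host may receive fewer), so the request spans ceil(max_slots / N) hosts,
    and the per-job host limit caps that number.
    """
    requested_hosts = math.ceil(max_slots / ptile)
    return min(requested_hosts, host_limit)


# bsub -n 1,4096 -R "span[ptile=1]" with a limit of 20 hosts per job
print(max_hosts_for_job(4096, 1, 20))  # 20
```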
HOSTLIMIT_PER_JOB = integer
Specify the maximum number of hosts that a job can use. If the number of hosts requested for a parallel job exceeds this limit, the parallel job will pend.
If the number of hosts requested in a parallel job is unknown during the submission stage, the per-job host limit does not apply and the job submission is accepted.
The per-job host limit is verified during resource allocation. If the per-job host limit is exceeded and the minimum number of requested hosts cannot be satisfied, the parallel job will pend.
This parameter does not prevent a suspended parallel job from resuming, even if the job's host allocation exceeds the per-job host limit.
If a parallel job is submitted with a range of slots (bsub -n min,max), the per-job host limit applies to the minimum number of requested slots. That is, if the minimum number of requested slots can be satisfied within the per-job host limit, the job submission is accepted.
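The submission-stage rules above (an unknown host count is accepted, and a slot range is checked against its minimum) can be modelled roughly as follows. This is a simplified sketch, not LSF's implementation: min_request_fits is a hypothetical name, a missing ptile stands in for "host count unknown at submission", and the host count for a known ptile is assumed to be ceil(min_slots / ptile).

```python
import math
from typing import Optional


def min_request_fits(min_slots: int, ptile: Optional[int], host_limit: int) -> bool:
    """Hypothetical model of the submission-stage HOSTLIMIT_PER_JOB check.

    Returns True if the job passes the check and False if it would pend.
    """
    if ptile is None:
        # No ptile: the host count is unknown at submission, so the limit
        # is not applied here and is left to resource allocation.
        return True
    # With span[ptile=N], the minimum request needs ceil(min_slots / N) hosts.
    hosts_needed = math.ceil(min_slots / ptile)
    return hosts_needed <= host_limit


print(min_request_fits(1, 1, 20))     # range 1,4096 checked at its minimum: True
print(min_request_fits(30, 1, 20))    # 30 hosts requested, limit 20: False
print(min_request_fits(10, None, 2))  # host count unknown: True
```

In this model, bsub -n 1,4096 -R "span[ptile=1]" with a limit of 20 corresponds to min_request_fits(1, 1, 20), which passes because the one-slot minimum needs only one host.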
For example, suppose HOSTLIMIT_PER_JOB=2, hostA has two slots available, hostB and hostC each have four slots available, and hostD has eight slots available. If you submit a job that requires ten slots and has no ptile specification, the scheduler determines that selecting hostA, hostB, and hostC satisfies the slot requirement; because this allocation uses three hosts, the job pends. This is a false scheduling failure, because selecting hostA and hostD (2 + 8 = 10 slots on two hosts) would satisfy the requirement within the limit.
To avoid false scheduling failures when HOSTLIMIT_PER_JOB is specified, submit jobs with a ptile value in the span resource requirement or add order[slots] to the resource requirements.
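The false scheduling failure, and why ordering hosts by free slots helps, can be illustrated with a toy first-fit model. This is not LSF's actual scheduling algorithm. The names first_fit, FREE_SLOTS, and HOST_LIMIT are hypothetical, and the sketch assumes that order[slots] makes the scheduler consider hosts with more free slots first.

```python
from typing import Dict, List, Tuple


def first_fit(free_slots: Dict[str, int], order: List[str], needed: int) -> List[Tuple[str, int]]:
    """Hypothetical helper: take slots host by host, in the given order,
    until `needed` slots are allocated. Returns (host, slots) pairs."""
    allocation = []
    remaining = needed
    for host in order:
        if remaining == 0:
            break
        take = min(free_slots[host], remaining)
        if take > 0:
            allocation.append((host, take))
            remaining -= take
    if remaining > 0:
        raise ValueError("not enough free slots in the cluster")
    return allocation


FREE_SLOTS = {"hostA": 2, "hostB": 4, "hostC": 4, "hostD": 8}
HOST_LIMIT = 2  # models HOSTLIMIT_PER_JOB=2

# Hosts considered in list order: three hosts are needed, so the job pends.
naive = first_fit(FREE_SLOTS, ["hostA", "hostB", "hostC", "hostD"], 10)

# Assumed order[slots] behaviour: hosts with the most free slots come first.
by_slots = sorted(FREE_SLOTS, key=lambda h: FREE_SLOTS[h], reverse=True)
ordered = first_fit(FREE_SLOTS, by_slots, 10)

for label, alloc in (("list order", naive), ("order[slots]", ordered)):
    verdict = "dispatch" if len(alloc) <= HOST_LIMIT else "pend"
    print(f"{label}: {alloc} -> {verdict}")
```

With hosts taken in list order, the model allocates hostA, hostB, and hostC and the job pends. With the slot-ordered list, it allocates hostD and hostB, which, like the hostA and hostD pairing in the example above, fits within two hosts.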