
Limit the number of allocated hosts

Use the HOSTLIMIT_PER_JOB parameter in lsb.queues to limit the number of hosts that a job can use. For example, a parallel job submitted with bsub -n 1,4096 -R "span[ptile=1]" can request up to 4096 hosts from the cluster, one slot per host. If you specify a limit of 20 hosts per job, that job is allowed to use at most 20 hosts.

Syntax

HOSTLIMIT_PER_JOB = integer

Specify the maximum number of hosts that a job can use. If the number of hosts requested for a parallel job exceeds this limit, the parallel job will pend.

How HOSTLIMIT_PER_JOB affects submission of parallel jobs

span[ptile=value] resource requirements
If a parallel job is submitted with the span[ptile=processors_per_host] resource requirement, the exact number of hosts requested is known at submission: it is the number of processors divided by the processors per host. The job is rejected if the number of hosts requested exceeds the HOSTLIMIT_PER_JOB value. Other commands that specify a span[ptile=processors_per_host] resource requirement (such as bmod) are also subject to this per-job host limit.
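The submission-time check described above can be sketched as follows. This is a minimal illustration, not LSF's implementation: the function names are hypothetical, and rounding a partial host up to a whole host is an assumption, since the text only says the processors are divided by the processors per host.

```python
def hosts_requested(num_processors: int, ptile: int) -> int:
    """Number of hosts needed when each host contributes at most `ptile` processors."""
    # Assumption: a partially filled last host still counts as one host (ceiling division).
    return -(-num_processors // ptile)


def accepted_at_submission(num_processors: int, ptile: int, host_limit: int) -> bool:
    """True if the known host count stays within the per-job host limit."""
    return hosts_requested(num_processors, ptile) <= host_limit


# Hypothetical requests checked against HOSTLIMIT_PER_JOB = 20
print(hosts_requested(64, 4))             # 16
print(accepted_at_submission(64, 4, 20))  # True, 16 hosts fit within the limit
print(accepted_at_submission(64, 2, 20))  # False, 32 hosts exceed the limit
```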
Compound resource requirements
If there is any part of the compound resource requirement that does not have a ptile specification, that part is considered to have a minimum of one host requested (before multiplying) when calculating the number of hosts requested.
For example:
  • 2*{span[ptile=1]}+3*{-} is considered to have a minimum of three hosts requested: the first part uses two hosts and the last part uses at least one host.
  • 2*{-}+3*{-}+4*{-} is considered to have a minimum of three hosts requested because each of the three parts uses at least one host.
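The minimum host count for a compound resource requirement can be worked through with a short sketch. The representation of each part as a (slots, ptile) pair and the function name are assumptions made for illustration; the sketch does not parse real resource requirement strings.

```python
from typing import List, Optional, Tuple

# Each part is (num_slots, ptile); ptile is None when the part has no ptile specification.
Part = Tuple[int, Optional[int]]


def min_hosts_compound(parts: List[Part]) -> int:
    """Minimum number of hosts requested by a compound resource requirement."""
    total = 0
    for num_slots, ptile in parts:
        if ptile is None:
            # No ptile: the part counts as one host, regardless of its multiplier.
            total += 1
        else:
            # With ptile: slots divided by slots per host, rounded up (assumption).
            total += -(-num_slots // ptile)
    return total


# 2*{span[ptile=1]}+3*{-}
print(min_hosts_compound([(2, 1), (3, None)]))             # 3
# 2*{-}+3*{-}+4*{-}
print(min_hosts_compound([(2, None), (3, None), (4, None)]))  # 3
```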
Alternative resource requirements
The smallest number of hosts calculated across all sets of alternative resource requirements is compared with the per-job host limit. Any set that contains compound resource requirements is calculated as a compound resource requirement (that is, any part of the compound resource requirement that does not have a ptile specification is considered to have a minimum of one host requested, before multiplying, when calculating the number of hosts requested).
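As a rough sketch of the rule for alternative resource requirements, the following takes the smallest host count over all alternatives and compares it with the limit. The data layout, the function names, and the example alternatives are hypothetical; each alternative is modeled as a list of (slots, ptile) parts, with ptile set to None where no ptile is specified.

```python
from typing import List, Optional, Tuple

Part = Tuple[int, Optional[int]]  # (num_slots, ptile or None)


def min_hosts(parts: List[Part]) -> int:
    """Host count for one alternative, using the compound-requirement rule."""
    # A part without ptile counts as one host; otherwise round slots/ptile up (assumption).
    return sum(1 if ptile is None else -(-slots // ptile) for slots, ptile in parts)


def within_limit(alternatives: List[List[Part]], host_limit: int) -> bool:
    """Compare the smallest host count over all alternatives with the per-job host limit."""
    return min(min_hosts(parts) for parts in alternatives) <= host_limit


# Two hypothetical alternatives: 4 slots with ptile=1, or 2*{span[ptile=1]}+3*{-}
alternatives = [[(4, 1)], [(2, 1), (3, None)]]
print(within_limit(alternatives, 3))  # True, the second alternative needs only 3 hosts
print(within_limit(alternatives, 2))  # False, every alternative needs more than 2 hosts
```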

If the number of hosts requested in a parallel job is unknown during the submission stage, the per-job host limit does not apply and the job submission is accepted.

The per-job host limit is verified during resource allocation. If the per-job host limit is exceeded and the minimum number of requested hosts cannot be satisfied, the parallel job will pend.

This parameter does not stop the parallel job from resuming even if the job's host allocation exceeds the per-job host limit specified in this parameter.

If a parallel job is submitted with a range of slots (bsub -n "min,max"), the per-job host limit applies to the minimum number of requested slots. That is, if the minimum number of requested slots can be satisfied within the per-job host limit, the job submission is accepted.
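The slot-range rule can be illustrated with the bsub -n 1,4096 -R "span[ptile=1]" example and a limit of 20 hosts. This is a sketch under assumptions: the function names are hypothetical, and the cap on usable slots (host limit multiplied by ptile) is inferred from the description rather than taken from LSF itself.

```python
def accepted_with_range(min_slots: int, ptile: int, host_limit: int) -> bool:
    """Submission check: only the minimum of the slot range is compared with the limit."""
    min_hosts = -(-min_slots // ptile)  # round up to whole hosts (assumption)
    return min_hosts <= host_limit


def max_usable_slots(max_slots: int, ptile: int, host_limit: int) -> int:
    """Upper bound on slots the job can receive once the host limit is enforced."""
    # Assumption: at most `host_limit` hosts, each contributing at most `ptile` slots.
    return min(max_slots, host_limit * ptile)


# bsub -n 1,4096 -R "span[ptile=1]" with HOSTLIMIT_PER_JOB = 20
print(accepted_with_range(1, 1, 20))   # True, the minimum of one slot needs one host
print(max_usable_slots(4096, 1, 20))   # 20
```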

Note: If you do not use a ptile specification in your resource requirements, LSF might report a false scheduling failure (that is, LSF might fail to find an allocation for a parallel job even if a valid allocation exists). This occurs because of the computational complexity of finding an allocation with complex resource and limit relationships.

For example, hostA has two slots available, hostB and hostC have four slots available, and hostD has eight slots available, and HOSTLIMIT_PER_JOB=2. If you submit a job that requires ten slots and no ptile specification, the scheduler will determine that selecting hostA, hostB, and hostC will satisfy the requirements, but since this requires three hosts, the job will pend. This is a false scheduling failure because selecting hostA and hostD would satisfy this requirement.
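The example above can be reproduced with a small sketch that contrasts a simple first-fit pass over the hosts with an exhaustive search. Neither function is LSF's actual scheduling algorithm; both are hypothetical stand-ins that show how a valid two-host allocation can exist while a simpler strategy misses it.

```python
from itertools import combinations
from typing import Dict, List, Optional

# Available slots per host, in the order given in the example
hosts = {"hostA": 2, "hostB": 4, "hostC": 4, "hostD": 8}


def first_fit(hosts: Dict[str, int], slots_needed: int) -> Optional[List[str]]:
    """Take hosts in candidate order until enough slots are collected."""
    chosen, total = [], 0
    for name, slots in hosts.items():
        if total >= slots_needed:
            break
        chosen.append(name)
        total += slots
    return chosen if total >= slots_needed else None


def exhaustive(hosts: Dict[str, int], slots_needed: int, host_limit: int) -> Optional[List[str]]:
    """Try every host subset of size 1..host_limit and return the first that fits."""
    for size in range(1, host_limit + 1):
        for combo in combinations(hosts, size):
            if sum(hosts[h] for h in combo) >= slots_needed:
                return list(combo)
    return None


print(first_fit(hosts, 10))      # ['hostA', 'hostB', 'hostC'], three hosts, over the limit of 2
print(exhaustive(hosts, 10, 2))  # ['hostA', 'hostD'], a valid two-host allocation
```

The exhaustive search grows quickly with the number of hosts, which is consistent with the note's point about computational complexity.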

To avoid false scheduling failure when HOSTLIMIT_PER_JOB is specified, submit jobs with the ptile resource requirement or add order[slots] to the resource requirements.
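To see why ordering by slots helps, the following sketch applies a first-fit pass after sorting the hosts from the previous example. It assumes that order[slots] places hosts with the most available slots first; the function name and the selection logic are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Optional


def most_slots_first(hosts: Dict[str, int], slots_needed: int, host_limit: int) -> Optional[List[str]]:
    """First-fit over hosts sorted by available slots, largest first."""
    chosen, total = [], 0
    # Assumption: order[slots] behaves like a descending sort on available slots.
    for name, slots in sorted(hosts.items(), key=lambda kv: kv[1], reverse=True):
        if total >= slots_needed or len(chosen) == host_limit:
            break
        chosen.append(name)
        total += slots
    return chosen if total >= slots_needed else None


hosts = {"hostA": 2, "hostB": 4, "hostC": 4, "hostD": 8}
print(most_slots_first(hosts, 10, 2))  # ['hostD', 'hostB'], ten slots on two hosts
```

Picking the largest hosts first uses the fewest hosts for a given slot count, so the allocation stays within HOSTLIMIT_PER_JOB whenever any allocation of that size can.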
