The IBM Parallel Operating Environment (POE) interfaces with the Resource Manager to allow users to run parallel jobs requiring dedicated access to the InfiniBand hardware. The LSF integration for IBM PE systems provides support for submitting POE jobs to run on IBM PE hosts. An IBM HPS system consists of multiple nodes running AIX. The system can be configured with a high-performance switch to allow high-bandwidth, low-latency communication between the nodes. The allocation of the switch to jobs, as well as the division of nodes into pools, is controlled by the HPS Resource Manager.
Download the installation package and the distribution tar files:
lsf9.1.1_linux2.6-glibc2.3-x86_64.tar.Z
lsf9.1.1_lsfinstall.tar.Z
This is the standard installation package. Use this package in a heterogeneous cluster with a mix of systems other than x86-64.
lsf9.1_lsfinstall_linux_x86_64.tar.Z
Use this smaller installation package in a homogeneous x86-64 cluster. If you add other non-x86-64 hosts, you must use the standard installation package.
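For reference, a minimal sketch of extracting the installer package, assuming both downloaded files are in the current directory (the distribution tar file stays compressed and is read during installation):
zcat lsf9.1.1_lsfinstall.tar.Z | tar xvf -
cd lsf9.1.1_lsfinstall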
Install LSF as usual. All required configuration is added by the installation scripts.
After installation, run chown to change the owner of nrt_api to root, and then run chmod to set the setuid bit (chmod u+s).
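For example, assuming nrt_api is installed in LSF_SERVERDIR (an assumption; substitute the actual location at your site):
# Path is illustrative; adjust to where nrt_api is installed
chown root $LSF_SERVERDIR/nrt_api
chmod u+s $LSF_SERVERDIR/nrt_api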
During installation, lsfinstall configures a queue named hpc_ibm in lsb.queues for running POE jobs. The queue defines requeue exit values so that POE jobs are requeued when other users submit jobs that require exclusive access to the node.
Begin Queue
QUEUE_NAME = hpc_ibm
PRIORITY = 30
NICE = 20
...
RES_REQ = select[ poe > 0 ]
REQUEUE_EXIT_VALUES = 133 134 135
...
DESCRIPTION = This queue is to run POE jobs ONLY.
End Queue
The poejob script exits with 133 when the job must be requeued. Do not submit other types of jobs to this queue; if they exit with 133, they are requeued as well.
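For illustration only (the executable name and task count are assumptions), a POE job could be submitted to this queue as follows:
bsub -q hpc_ibm -n 8 poe ./my_mpi_app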
Ensure that the HPS node names are the same as their host names. That is, st_status should return the same node names that lsload returns.
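A quick consistency check, assuming st_status prints one node name per line in its first column (verify the actual output format on your system):
# Compare node names reported by the Resource Manager and by LSF
st_status | awk '{print $1}' | sort > /tmp/rm_nodes
lsload -w | awk 'NR>1 {print $1}' | sort > /tmp/lsf_nodes
diff /tmp/rm_nodes /tmp/lsf_nodes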
Configure per-slot resource reservation (lsb.resources).
To support the IBM HPS architecture, LSF must reserve resources based on job slots. During installation, lsfinstall configures the ReservationUsage section in lsb.resources to reserve HPS resources on a per-slot basis.
Resource usage defined in the ReservationUsage section overrides the cluster-wide RESOURCE_RESERVE_PER_SLOT parameter defined in lsb.params if it also exists.
Begin ReservationUsage
RESOURCE METHOD
nrt_windows PER_SLOT
adapter_windows PER_SLOT
ntbl_windows PER_SLOT
csss PER_SLOT
css0 PER_SLOT
End ReservationUsage
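With the PER_SLOT method, a rusage value in a job's resource requirement is reserved once per job slot. For example (executable and values are illustrative), the following reserves one nrt_windows per task, four in total:
bsub -q hpc_ibm -n 4 -R "rusage[nrt_windows=1]" poe ./my_mpi_app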
Enable exclusive mode in lsb.queues (optional):
To support the MP_ADAPTER_USE and -adapter_use POE job options, you must enable the LSF exclusive mode for each queue. To enable exclusive mode, edit lsb.queues and set EXCLUSIVE=Y:
Begin Queue
. . .
EXCLUSIVE=Y
. . .
End Queue
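With EXCLUSIVE=Y set, jobs in this queue can request exclusive use of a node with bsub -x. A sketch, with the executable, task count, and adapter-use value as assumptions:
bsub -x -q hpc_ibm -n 4 poe ./my_mpi_app -adapter_use dedicated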
Define resource management pools (rmpool) and node locking queue threshold (optional):
If you schedule jobs based on resource management pools, you must configure rmpools as a static resource in LSF. Resource management pools are collections of SP2 nodes that together contain all available SP2 nodes without any overlap.
For example, to configure 2 resource management pools, p1 and p2, made up of 6 SP2 nodes (sp2n1, sp2n2, sp2n3, ..., sp2n6):
Edit lsf.shared and add an external resource called pool. For example:
Begin Resource
RESOURCENAME TYPE INTERVAL INCREASING DESCRIPTION
. . .
pool Numeric () () (sp2 resource mgmt pool)
lock Numeric 60 Y (IBM SP Node lock status)
. . .
End Resource
pool represents the resource management pool the node is in, and lock indicates whether the switch is locked.
Edit lsf.cluster.cluster_name and allocate the pool resource. For example:
Begin ResourceMap
RESOURCENAME LOCATION
. . .
pool (p1@[sp2n1 sp2n2 sp2n3] p2@[sp2n4 sp2n5 sp2n6])
. . .
End ResourceMap
Edit lsb.queues and add a threshold for the lock index in the hpc_ibm queue:
Begin Queue
QUEUE_NAME = hpc_ibm
. . .
lock=0
. . .
End Queue
The scheduling threshold on the lock index prevents dispatching to nodes which are being used in exclusive mode by other jobs.
Define system partitions in spname (optional):
If you schedule jobs based on system partition names, you must configure the static resource spname. System partitions are collections of HPS nodes that together contain all available HPS nodes without any overlap. For example, to configure two system partition names, spp1 and spp2, made up of 6 SP2 nodes (sp2n1, sp2n2, sp2n3, ..., sp2n6):
Edit lsf.shared and add an external resource called spname. For example:
Begin Resource
RESOURCENAME TYPE INTERVAL INCREASING DESCRIPTION
. . .
spname String () () (sp2 sys partition name)
. . .
End Resource
Edit lsf.cluster.cluster_name and allocate the spname resource. For example:
Begin ResourceMap
RESOURCENAME LOCATION
. . .
spname (spp1@[sp2n1 sp2n2 sp2n3] spp2@[sp2n4 sp2n5 sp2n6])
. . .
End ResourceMap
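Jobs can then be directed to a particular system partition with a resource requirement on spname. A sketch, assuming standard LSF string-resource selection syntax and an illustrative executable name:
bsub -q hpc_ibm -n 4 -R "select[spname==spp1]" poe ./my_mpi_app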
Allocate switch adapter specific resources. If you use a switch adapter, you must define specific resources in LSF. During installation, lsfinstall defines the following external resources in lsf.shared:
poe — numeric resource to show POE availability, updated through ELIM
nrt_windows — number of free network table windows on IBM HPS systems updated through ELIM
adapter_windows — number of free adapter windows on IBM SP Switch2 systems
ntbl_windows — number of free network table windows on IBM HPS systems
css0 — number of free adapter windows on css0 on IBM SP Switch2 systems
csss — number of free adapter windows on csss on IBM SP Switch2 systems
dedicated_tasks — number of running dedicated tasks
ip_tasks — number of running IP (Internet Protocol communication subsystem) tasks
us_tasks — number of running US (User Space communication subsystem) tasks
LSF installation adds a shared nrt_windows resource to run and monitor POE jobs over the InfiniBand interconnect. These resources are defined in lsf.shared and updated through elim.hpc:
Begin Resource
RESOURCENAME TYPE INTERVAL INCREASING DESCRIPTION
. . .
adapter_windows Numeric 30 N (free adapter windows on css0 on IBM SP)
ntbl_windows Numeric 30 N (free ntbl windows on IBM HPS)
poe Numeric 30 N (poe availability)
css0 Numeric 30 N (free adapter windows on css0 on IBM SP)
dedicated_tasks Numeric () Y (running dedicated tasks)
ip_tasks Numeric () Y (running IP tasks)
us_tasks Numeric () Y (running US tasks)
nrt_windows Numeric 30 N (free NRT windows on IBM POE over IB)
. . .
End Resource
You must edit lsf.cluster.cluster_name and allocate the external resources. For example, to configure a switch adapter for six SP2 nodes (sp2n1, sp2n2, sp2n3, ..., sp2n6):
Begin ResourceMap
RESOURCENAME LOCATION
. . .
adapter_windows [default]
poe [default]
nrt_windows [default]
ntbl_windows [default]
css0 [default]
csss [default]
dedicated_tasks (0@[default])
ip_tasks (0@[default])
us_tasks (0@[default])
. . .
End ResourceMap
The adapter_windows and ntbl_windows resources are required for all POE jobs. The dedicated_tasks, ip_tasks, and us_tasks resources are only required when you run IP and US jobs at the same time.
Tune PAM parameters (optional):
To improve performance and scalability for large POE jobs, tune the following lsf.conf parameters. The user's environment can override these.
LSF_HPC_PJL_LOADENV_TIMEOUT — Timeout value in seconds for PJL to load or unload the environment. For example, the time needed for IBM POE to load or unload adapter windows. At job startup, the PJL times out if the first task fails to register within the specified timeout value. At job shutdown, the PJL times out if it fails to exit after the last Taskstarter termination report within the specified timeout value.
Default: LSF_HPC_PJL_LOADENV_TIMEOUT=300
LSF_PAM_RUSAGE_UPD_FACTOR — This factor adjusts the update interval according to the following calculation:
RUSAGE_UPDATE_INTERVAL + num_tasks * 1 * LSF_PAM_RUSAGE_UPD_FACTOR.
PAM updates resource usage for each task every SBD_SLEEP_TIME + num_tasks * 1 seconds (by default, SBD_SLEEP_TIME=15). For large parallel jobs, this interval is too long. As the number of parallel tasks increases, LSF_PAM_RUSAGE_UPD_FACTOR causes more frequent updates.
Default: LSF_PAM_RUSAGE_UPD_FACTOR=0.01
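A minimal lsf.conf sketch showing both parameters set explicitly (the values shown are simply the documented defaults; tune them for your workload):
# lsf.conf
LSF_HPC_PJL_LOADENV_TIMEOUT=300
LSF_PAM_RUSAGE_UPD_FACTOR=0.01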
Reconfigure to apply the changes:
Run badmin ckconfig to check the configuration changes. If any errors are reported, fix the problem and check the configuration again.
Reconfigure the cluster:
badmin reconfig
Checking configuration files ...
No errors found.
Do you want to reconfigure? [y/n] y
Reconfiguration initiated
LSF checks for any configuration errors. If no fatal errors are found, you are asked to confirm reconfiguration. If fatal errors are found, reconfiguration is aborted.
An external LIM (ELIM) for POE jobs is supplied with LSF. ELIM uses the nrt_status command to collect information from the Resource Manager. The ELIM searches the following path for the poe and st_status commands:
PATH="/usr/bin:/bin:/usr/local/bin:/local/bin:/sbin:/usr/sbin:/usr/ucb:/usr/ sbin:/usr/bsd:${PATH}"
If these commands are installed in a different directory, you must modify the PATH variable in LSF_SERVERDIR/elim.hpc to point to the correct directory. Run lsload to display the nrt_windows and poe resources:
lsload -l
HOST_NAME status r15s r1m r15m ut pg io ls it tmp swp mem nrt_windows poe
hostA ok 0.0 0.0 0.0 1% 8.1 4 1 0 1008M 4090M 6976M 128.0 1.0
hostB ok 0.0 0.0 0.0 0% 0.7 1 0 0 1006M 4092M 7004M 128.0 1.0
The POE PJL (Parallel Job Launcher) wrapper, poejob, parses the POE job options, and filters out those that have been set by LSF.
The $LSF_BINDIR/poejob script is the driver script that launches a PE job under LSF. It creates temporary files after a job is dispatched and cleans them up after the job finishes or exits. The default location for the files is $HOME; you can customize the location.
To change the location, edit $LSF_BINDIR/poejob and replace all occurrences of ${HOME} with the directory in which you want to store the temporary files while the job is running, then save the file. You do not need to restart or reconfigure LSF.
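For example, to store the temporary files under /scratch/poe_tmp instead of $HOME (the directory is an assumption; back up the script first):
cp $LSF_BINDIR/poejob $LSF_BINDIR/poejob.orig
# Replace every literal ${HOME} reference with the chosen directory
sed 's|${HOME}|/scratch/poe_tmp|g' $LSF_BINDIR/poejob.orig > $LSF_BINDIR/poejob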