The IBM Parallel Operating Environment (POE) interfaces with the Resource Manager to allow users to run parallel jobs requiring dedicated access to the InfiniBand hardware. The LSF integration for IBM PE systems provides support for submitting POE jobs to run on IBM PE hosts. An IBM HPS system consists of multiple nodes running AIX. The system can be configured with a high-performance switch to allow high-bandwidth, low-latency communication between the nodes. The allocation of the switch to jobs, as well as the division of nodes into pools, is controlled by the HPS Resource Manager.
Download the installation package and the distribution tar files:
lsf9.1.1_linux2.6-glibc2.3-x86_64.tar.Z
lsf9.1.1_lsfinstall.tar.Z
This is the standard installation package. Use this package in a heterogeneous cluster with a mix of systems other than x86-64.
lsf9.1_lsfinstall_linux_x86_64.tar.Z
Use this smaller installation package in a homogeneous x86-64 cluster. If you add other non-x86-64 hosts, you must use the standard installation package.
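For reference, a minimal sketch of extracting the installer package, assuming both downloaded files are in the current directory (the distribution tar file stays compressed and is read during installation):
zcat lsf9.1.1_lsfinstall.tar.Z | tar xvf -
cd lsf9.1.1_lsfinstall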
Install LSF as usual. All required configuration is added by the installation scripts.
After installation, run chown to change the owner of nrt_api to root, and then run chmod to set the setuid bit (chmod u+s).
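For example, assuming nrt_api is installed in LSF_SERVERDIR (an assumption; substitute the actual location at your site):
# Path is illustrative; adjust to where nrt_api is installed
chown root $LSF_SERVERDIR/nrt_api
chmod u+s $LSF_SERVERDIR/nrt_api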
During installation, lsfinstall configures a queue named hpc_ibm in lsb.queues for running POE jobs. The queue defines requeue exit values so that POE jobs are requeued when other users submit jobs that require exclusive access to the node.
Begin Queue
QUEUE_NAME = hpc_ibm
PRIORITY = 30
NICE = 20
...
RES_REQ = select[ poe > 0 ]
REQUEUE_EXIT_VALUES = 133 134 135
...
DESCRIPTION = This queue is to run POE jobs ONLY.
End Queue
The poejob script exits with 133 when the job must be requeued. Do not submit other types of jobs to this queue; if they exit with 133, they are requeued as well.
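For illustration only (the executable name and task count are assumptions), a POE job could be submitted to this queue as follows:
bsub -q hpc_ibm -n 8 poe ./my_mpi_app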
Ensure that the HPS node names are the same as their host names. That is, st_status should return the same node names that lsload returns.
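A quick consistency check, assuming st_status prints one node name per line in its first column (verify the actual output format on your system):
# Compare node names reported by the Resource Manager and by LSF
st_status | awk '{print $1}' | sort > /tmp/rm_nodes
lsload -w | awk 'NR>1 {print $1}' | sort > /tmp/lsf_nodes
diff /tmp/rm_nodes /tmp/lsf_nodes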
Configure per-slot resource reservation (lsb.resources).
To support the IBM HPS architecture, LSF must reserve resources based on job slots. During installation, lsfinstall configures the ReservationUsage section in lsb.resources to reserve HPS resources on a per-slot basis.
Resource usage defined in the ReservationUsage section overrides the cluster-wide RESOURCE_RESERVE_PER_SLOT parameter defined in lsb.params if it also exists.
Begin ReservationUsage
RESOURCE METHOD
nrt_windows PER_SLOT
adapter_windows PER_SLOT
ntbl_windows PER_SLOT
csss PER_SLOT
css0 PER_SLOT
End ReservationUsage
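With the PER_SLOT method, a rusage value in a job's resource requirement is reserved once per job slot. For example (executable and values are illustrative), the following reserves one nrt_windows per task, four in total:
bsub -q hpc_ibm -n 4 -R "rusage[nrt_windows=1]" poe ./my_mpi_app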
Enable exclusive mode in lsb.queues (optional):
To support the MP_ADAPTER_USE and -adapter_use POE job options, you must enable the LSF exclusive mode for each queue. To enable exclusive mode, edit lsb.queues and set EXCLUSIVE=Y:
Begin Queue
. . .
EXCLUSIVE=Y
. . .
End Queue
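With EXCLUSIVE=Y set, jobs in this queue can request exclusive use of a node with bsub -x. A sketch, with the executable, task count, and adapter-use value as assumptions:
bsub -x -q hpc_ibm -n 4 poe ./my_mpi_app -adapter_use dedicated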
Define resource management pools (rmpool) and node locking queue threshold (optional):
If you schedule jobs based on resource management pools, you must configure rmpools as a static resource in LSF. Resource management pools are collections of SP2 nodes that together contain all available SP2 nodes without any overlap.
For example, to configure 2 resource management pools, p1 and p2, made up of 6 SP2 nodes (sp2n1, sp2n2, sp2n3, ..., sp2n6):
Edit lsf.shared and add an external resource called pool. For example:
Begin Resource
RESOURCENAME TYPE INTERVAL INCREASING DESCRIPTION
. . .
pool Numeric () () (sp2 resource mgmt pool)
lock Numeric 60 Y (IBM SP Node lock status)
. . .
End Resource
pool represents the resource management pool the node is in, and lock indicates whether the switch is locked.
Edit lsf.cluster.cluster_name and allocate the pool resource. For example:
Begin ResourceMap
RESOURCENAME LOCATION
. . .
pool (p1@[sp2n1 sp2n2 sp2n3] p2@[sp2n4 sp2n5 sp2n6])
. . .
End ResourceMap
Edit lsb.queues and add a threshold for the lock index in the hpc_ibm queue:
Begin Queue
QUEUE_NAME = hpc_ibm
. . .
lock=0
. . .
End Queue
The scheduling threshold on the lock index prevents dispatching to nodes which are being used in exclusive mode by other jobs.
Define system partitions in spname (optional):
If you schedule jobs based on system partition names, you must configure the static resource spname. System partitions are collections of HPS nodes that together contain all available HPS nodes without any overlap. For example, to configure two system partition names, spp1 and spp2, made up of 6 SP2 nodes (sp2n1, sp2n2, sp2n3, ..., sp2n6):
Edit lsf.shared and add an external resource called spname. For example:
Begin Resource
RESOURCENAME TYPE INTERVAL INCREASING DESCRIPTION
. . .
spname String () () (sp2 sys partition name)
. . .
End Resource
Edit lsf.cluster.cluster_name and allocate the spname resource. For example:
Begin ResourceMap
RESOURCENAME LOCATION
. . .
spname (spp1@[sp2n1 sp2n2 sp2n3] spp2@[sp2n4 sp2n5 sp2n6])
. . .
End ResourceMap
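Jobs can then be directed to a particular system partition with a resource requirement on spname. A sketch, assuming standard LSF string-resource selection syntax and an illustrative executable name:
bsub -q hpc_ibm -n 4 -R "select[spname==spp1]" poe ./my_mpi_app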
Allocate switch adapter specific resources. If you use a switch adapter, you must define specific resources in LSF. During installation, lsfinstall defines the following external resources in lsf.shared:
poe — numeric resource to show POE availability, updated through ELIM
nrt_windows — number of free network table windows on IBM HPS systems updated through ELIM
adapter_windows — number of free adapter windows on IBM SP Switch2 systems
ntbl_windows — number of free network table windows on IBM HPS systems
css0 — number of free adapter windows on css0 on IBM SP Switch2 systems
csss — number of free adapter windows on csss on IBM SP Switch2 systems
dedicated_tasks — number of running dedicated tasks
ip_tasks — number of running IP (Internet Protocol communication subsystem) tasks
us_tasks — number of running US (User Space communication subsystem) tasks
LSF installation adds a shared nrt_windows resource to run and monitor POE jobs over the InfiniBand interconnect. These resources are defined in lsf.shared and updated through elim.hpc:
Begin Resource
RESOURCENAME TYPE INTERVAL INCREASING DESCRIPTION
. . .
adapter_windows Numeric 30 N (free adapter windows on css0 on IBM SP)
ntbl_windows Numeric 30 N (free ntbl windows on IBM HPS)
poe Numeric 30 N (poe availability)
css0 Numeric 30 N (free adapter windows on css0 on IBM SP)
dedicated_tasks Numeric () Y (running dedicated tasks)
ip_tasks Numeric () Y (running IP tasks)
us_tasks Numeric () Y (running US tasks)
nrt_windows Numeric 30 N (free NRT windows on IBM POE over IB)
. . .
End Resource
You must edit lsf.cluster.cluster_name and allocate the external resources. For example, to configure a switch adapter for six SP2 nodes (sp2n1, sp2n2, sp2n3, ..., sp2n6):
Begin ResourceMap
RESOURCENAME LOCATION
. . .
adapter_windows [default]
poe [default]
nrt_windows [default]
ntbl_windows [default]
css0 [default]
csss [default]
dedicated_tasks (0@[default])
ip_tasks (0@[default])
us_tasks (0@[default])
. . .
End ResourceMap
The adapter_windows and ntbl_windows resources are required for all POE jobs. The dedicated_tasks, ip_tasks, and us_tasks resources are only required when you run IP and US jobs at the same time.
Tune PAM parameters (optional):
To improve performance and scalability for large POE jobs, tune the following lsf.conf parameters. The user's environment can override these.
LSF_HPC_PJL_LOADENV_TIMEOUT — Timeout value in seconds for PJL to load or unload the environment. For example, the time needed for IBM POE to load or unload adapter windows. At job startup, the PJL times out if the first task fails to register within the specified timeout value. At job shutdown, the PJL times out if it fails to exit after the last Taskstarter termination report within the specified timeout value.
Default: LSF_HPC_PJL_LOADENV_TIMEOUT=300
LSF_PAM_RUSAGE_UPD_FACTOR — This factor adjusts the update interval according to the following calculation:
RUSAGE_UPDATE_INTERVAL + num_tasks * 1 * LSF_PAM_RUSAGE_UPD_FACTOR.
PAM updates resource usage for each task every SBD_SLEEP_TIME + num_tasks * 1 seconds (by default, SBD_SLEEP_TIME=15). For large parallel jobs, this interval is too long. As the number of parallel tasks increases, LSF_PAM_RUSAGE_UPD_FACTOR causes more frequent updates.
Default: LSF_PAM_RUSAGE_UPD_FACTOR=0.01
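A minimal lsf.conf sketch showing both parameters set explicitly (the values shown are simply the documented defaults; tune them for your workload):
# lsf.conf
LSF_HPC_PJL_LOADENV_TIMEOUT=300
LSF_PAM_RUSAGE_UPD_FACTOR=0.01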
Reconfigure to apply the changes:
Run badmin ckconfig to check the configuration changes. If any errors are reported, fix the problem and check the configuration again.
Reconfigure the cluster:
badmin reconfig
Checking configuration files ...
No errors found.
Do you want to reconfigure? [y/n] y
Reconfiguration initiated
LSF checks for any configuration errors. If no fatal errors are found, you are asked to confirm reconfiguration. If fatal errors are found, reconfiguration is aborted.
An external LIM (ELIM) for POE jobs is supplied with LSF. ELIM uses the nrt_status command to collect information from the Resource Manager. The ELIM searches the following path for the poe and st_status commands:
PATH="/usr/bin:/bin:/usr/local/bin:/local/bin:/sbin:/usr/sbin:/usr/ucb:/usr/ sbin:/usr/bsd:${PATH}"
If these commands are installed in a different directory, you must modify the PATH variable in LSF_SERVERDIR/elim.hpc to point to the correct directory. Run lsload to display the nrt_windows and poe resources:
lsload -l
HOST_NAME status r15s r1m r15m ut pg io ls it tmp swp mem nrt_windows poe
hostA ok 0.0 0.0 0.0 1% 8.1 4 1 0 1008M 4090M 6976M 128.0 1.0
hostB ok 0.0 0.0 0.0 0% 0.7 1 0 0 1006M 4092M 7004M 128.0 1.0
The POE PJL (Parallel Job Launcher) wrapper, poejob, parses the POE job options, and filters out those that have been set by LSF.
The $LSF_BINDIR/poejob script is the driver script that launches a PE job under LSF. It creates temporary files after a job is dispatched and cleans them up after the job finishes or exits. The default location for the files is $HOME; you can customize the location.
To change the location, edit $LSF_BINDIR/poejob and replace all occurrences of ${HOME} with the directory in which you want to store the temporary files while the job is running, then save the file. You do not need to restart or reconfigure LSF.
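For example, to store the temporary files under /scratch/poe_tmp instead of $HOME (the directory is an assumption; back up the script first):
cp $LSF_BINDIR/poejob $LSF_BINDIR/poejob.orig
# Replace every literal ${HOME} reference with the chosen directory
sed 's|${HOME}|/scratch/poe_tmp|g' $LSF_BINDIR/poejob.orig > $LSF_BINDIR/poejob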