Remove hung jobs from Platform LSF

About this task

If a job is submitted with a run limit, Platform LSF attempts to kill the job when it reaches the run limit. A job becomes a hung job if Platform LSF cannot kill it after its run limit has expired (specifically, mbatchd signals sbatchd to kill the job, but sbatchd is unable to kill it).

Hung jobs occur for one of the following reasons:

  • sbatchd on the execution host is down (that is, the host is in the unreach or unavail status).

    Jobs running on an execution host when sbatchd goes down go into the UNKWN state. These UNKWN jobs continue to occupy shared resources, making the shared resources unavailable for other jobs.

  • Reasons specific to the operating system on the execution host.

    Jobs that cannot be killed because of an operating system issue remain in the RUN state even after the run limit has expired.

Hung jobs continue to reserve shared resources, making the shared resources unavailable for other jobs.

If you enable hung job management, Platform LSF removes hung jobs after the grace period expires. The grace period of a hung job is calculated as follows:

job_grace_period = 10 minutes + MAX(6 seconds, JOB_TERMINATE_INTERVAL)

Here, JOB_TERMINATE_INTERVAL is a parameter that is specified in lsb.params. The grace period begins only after the job's run limit has expired.
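
As a worked illustration of the grace period formula, the following minimal Python sketch evaluates it for a few sample values. The function name hung_job_grace_period is hypothetical and not part of Platform LSF, and the sketch assumes that JOB_TERMINATE_INTERVAL is expressed in seconds.

```python
def hung_job_grace_period(job_terminate_interval: float) -> float:
    """Return the hung-job grace period in seconds.

    Implements: 10 minutes + MAX(6 seconds, JOB_TERMINATE_INTERVAL).
    Assumption: job_terminate_interval is given in seconds.
    """
    base_seconds = 10 * 60  # the fixed 10-minute component
    return base_seconds + max(6, job_terminate_interval)


if __name__ == "__main__":
    # Sample values chosen for illustration only.
    for interval in (0, 10, 30):
        total = hung_job_grace_period(interval)
        print(f"JOB_TERMINATE_INTERVAL={interval}s -> grace period {total}s")
```

Under these assumptions, any JOB_TERMINATE_INTERVAL of 6 seconds or less yields the same grace period of 606 seconds, counted from the moment the run limit expires.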

Procedure

To enable hung job management, edit lsb.params and set REMOVE_HUNG_JOBS_FOR=runlimit.
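
The following Python sketch shows one way to confirm the edit. It scans the text of an lsb.params file and reports the value of REMOVE_HUNG_JOBS_FOR. The helper name hung_job_setting and the sample file contents are hypothetical, and the sketch assumes that the parameter is set inside the Parameters section as a NAME = value line, with # starting a comment. Changes to lsb.params normally take effect only after mbatchd is reconfigured (for example, with badmin reconfig).

```python
from typing import Optional


def hung_job_setting(lsb_params_text: str) -> Optional[str]:
    """Return the REMOVE_HUNG_JOBS_FOR value found in lsb.params text, or None."""
    in_parameters = False
    value = None
    for raw_line in lsb_params_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        words = line.lower().split()
        if words == ["begin", "parameters"]:
            in_parameters = True
        elif words == ["end", "parameters"]:
            in_parameters = False
        elif in_parameters and "=" in line:
            name, _, val = line.partition("=")
            if name.strip() == "REMOVE_HUNG_JOBS_FOR":
                value = val.strip()  # a later definition overrides an earlier one
    return value


# Hypothetical lsb.params contents, for illustration only.
SAMPLE = """\
Begin Parameters
JOB_TERMINATE_INTERVAL = 30      # seconds (sample value)
REMOVE_HUNG_JOBS_FOR = runlimit  # enable hung job management
End Parameters
"""

if __name__ == "__main__":
    setting = hung_job_setting(SAMPLE)
    print("REMOVE_HUNG_JOBS_FOR =", setting)
```

The check is deliberately narrow: it reads only the Parameters section and does not validate the value itself.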