The bjobs command displays the current state of the job.
Most jobs enter only three states:
| Job state | Description |
|---|---|
| PEND | Waiting in a queue for scheduling and dispatch |
| RUN | Dispatched to a host and running |
| DONE | Finished normally with a zero exit value |
A suspended job is in one of three states:
| Job state | Description |
|---|---|
| PSUSP | Suspended by its owner or the LSF administrator while in PEND state |
| USUSP | Suspended by its owner or the LSF administrator after being dispatched |
| SSUSP | Suspended by the LSF system after being dispatched |
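
The six states in the two tables above can be captured in a small lookup. The following is a minimal Python sketch, not part of LSF: the names `JOB_STATES`, `SUSPENDED_STATES`, `parse_stat`, and `describe` are hypothetical, the sample line is made up, and the parser assumes that STAT is the third whitespace-separated column of a bjobs output line.

```python
# Job states described in the tables above, keyed by the STAT value.
JOB_STATES = {
    "PEND": "Waiting in a queue for scheduling and dispatch",
    "RUN": "Dispatched to a host and running",
    "DONE": "Finished normally with a zero exit value",
    "PSUSP": "Suspended by its owner or the LSF administrator while in PEND state",
    "USUSP": "Suspended by its owner or the LSF administrator after being dispatched",
    "SSUSP": "Suspended by the LSF system after being dispatched",
}

SUSPENDED_STATES = {"PSUSP", "USUSP", "SSUSP"}


def parse_stat(bjobs_line):
    """Return the STAT field, assuming it is the third whitespace-separated column."""
    fields = bjobs_line.split()
    if len(fields) < 3:
        raise ValueError("line has fewer than three columns")
    return fields[2]


def describe(state):
    """Return the description for a state, or a fallback for states not listed here."""
    return JOB_STATES.get(state, "State not covered by this sketch")


# Hypothetical sample line (assumed column layout: JOBID USER STAT ...).
sample = "1234 user1 SSUSP normal hostA hostB myjob Oct 10 12:00"
state = parse_stat(sample)
print(state, "-", describe(state))
print("suspended:", state in SUSPENDED_STATES)
```

Keeping the suspended states in their own set makes it easy to tell a suspended job apart from one that is pending, running, or finished.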
A job goes through a series of state transitions until it eventually completes its task, fails, or is terminated. The possible states of a job during its life cycle are shown in the diagram.
A job remains pending until all conditions for its execution are met. Some of the conditions are:
- Start time that is specified by the user when the job is submitted
- Load conditions on qualified hosts
- Dispatch windows during which the queue can dispatch and qualified hosts can accept jobs
- Run windows during which jobs from the queue can run
- Limits on the number of job slots that are configured for a queue, a host, or a user
- Relative priority to other users and jobs
- Availability of the specified resources
- Job dependency and pre-execution conditions
If the user or user group that submits the job has reached the pending job threshold specified by MAX_PEND_JOBS (either in the User section of lsb.users, or cluster-wide in lsb.params), LSF rejects any further job submission requests from that user or user group. The system resends the job submission request at the interval specified by SUB_TRY_INTERVAL in lsb.params until the number of attempts equals the value of the LSB_NTRIES environment variable. If LSB_NTRIES is undefined and LSF rejects the job submission request, the system by default continues to resend the request indefinitely.
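
The retry behavior described above can be illustrated with a short loop. This is a minimal Python sketch of the described logic, not LSF's own client code: `submit_with_retries` and its `submit` callback are hypothetical, `try_interval` stands in for SUB_TRY_INTERVAL, and `max_tries` stands in for LSB_NTRIES (with `None` meaning the variable is undefined).

```python
import time


def submit_with_retries(submit, try_interval, max_tries=None, sleep=time.sleep):
    """Call submit() until it succeeds or max_tries attempts have been made.

    submit       -- hypothetical callable that returns True when the request is accepted
    try_interval -- seconds between attempts (stands in for SUB_TRY_INTERVAL)
    max_tries    -- attempt limit (stands in for LSB_NTRIES); None retries indefinitely
    sleep        -- wait function, injectable so the sketch can run without delays

    Returns the number of attempts used, or None if every attempt was rejected.
    """
    attempts = 0
    while max_tries is None or attempts < max_tries:
        attempts += 1
        if submit():
            return attempts
        # Wait before the next attempt unless the limit has been reached.
        if max_tries is None or attempts < max_tries:
            sleep(try_interval)
    return None


# Simulated scheduler: rejects the first two requests, then accepts the third.
responses = iter([False, False, True])
waits = []
result = submit_with_retries(
    lambda: next(responses), try_interval=60, max_tries=5, sleep=waits.append
)
print(result, waits)
```

Passing `waits.append` as the wait function records the intervals instead of sleeping, which keeps the example fast while still showing when a wait would occur.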
A job can be suspended at any time by its owner, by the LSF administrator, by the root user (superuser), or by LSF.
After a job is dispatched and started on a host, it can be suspended by LSF. When a job is running, LSF periodically checks the load level on the execution host. If any load index is beyond either its per-host or its per-queue suspending conditions, the lowest priority batch job on that host is suspended.
If the load on the execution host or hosts becomes too high, batch jobs might be interfering with each other or with interactive jobs. In either case, some jobs should be suspended to maximize host performance or to guarantee interactive response time.
LSF suspends jobs according to the priority of the job’s queue. When a host is busy, LSF suspends lower priority jobs first unless the scheduling policy associated with the job dictates otherwise.
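
The selection rule described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not LSF's scheduler: the function names are hypothetical, every load index is treated as "higher is worse", a larger number means a higher queue priority, the threshold values are invented, and scheduling policies that override the default order are ignored.

```python
def exceeds(load, thresholds):
    """True if any load index is above its threshold (higher is assumed to be worse)."""
    return any(load.get(index, 0) > limit for index, limit in thresholds.items())


def pick_job_to_suspend(load, host_thresholds, queue_thresholds, running_jobs):
    """Return the ID of the job to suspend, or None if no suspension is needed.

    running_jobs -- list of (job_id, queue_priority) tuples; in this sketch a
                    larger number means a higher priority
    """
    if not running_jobs:
        return None
    if exceeds(load, host_thresholds) or exceeds(load, queue_thresholds):
        # Suspend the job whose queue has the lowest priority.
        return min(running_jobs, key=lambda job: job[1])[0]
    return None


# Hypothetical load indices and suspending thresholds for one execution host.
load = {"r1m": 3.2, "pg": 4.0}
host_thresholds = {"r1m": 2.5}
queue_thresholds = {"pg": 10.0}
jobs = [(101, 40), (102, 20), (103, 30)]

victim = pick_job_to_suspend(load, host_thresholds, queue_thresholds, jobs)
print(victim)
```

Here the `r1m` value is above the per-host threshold, so the job from the lowest-priority queue is chosen even though the per-queue condition is not met.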
Jobs are also suspended by the system if the job queue has a run window and the current time goes outside the run window.
A system-suspended job can later be resumed by LSF if the load condition on the execution hosts falls low enough or when the closed run window of the queue opens again.
If you have configured chunk job queues, members of a chunk job that are waiting to run are displayed as WAIT by bjobs. Any jobs in WAIT status are included in the count of pending jobs by bqueues and busers, even though the entire chunk job has been dispatched and occupies a job slot. The bhosts command shows the single job slot occupied by the entire chunk job in the number of jobs shown in the NJOBS column.
You can switch (bswitch) or migrate (bmig) a chunk job member in WAIT state to another queue.
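
The counting rule for chunk jobs can be made concrete with a small example. This is a minimal Python sketch with hypothetical names (`summarize_chunk`) and made-up member states; it only mirrors the accounting described above for a chunk job that has already been dispatched.

```python
from collections import Counter


def summarize_chunk(member_states):
    """Summarize a dispatched chunk job from the states of its members.

    Returns (pending, slots): WAIT members are counted as pending, and the
    whole chunk occupies a single job slot.
    """
    counts = Counter(member_states)
    pending = counts["PEND"] + counts["WAIT"]
    slots = 1 if member_states else 0
    return pending, slots


# Hypothetical chunk of four members: one running, three waiting their turn.
pending, slots = summarize_chunk(["RUN", "WAIT", "WAIT", "WAIT"])
print(pending, slots)
```

The three waiting members appear in the pending count, while the chunk as a whole accounts for one job slot.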
An exited job is a job that ended with a non-zero exit status.
A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:
- The job is canceled by its owner or the LSF administrator while pending, or after being dispatched to a host.
- The job is not able to be dispatched before it reaches its termination deadline that is set by bsub -t, and thus is terminated by LSF.
- The job fails to start successfully. For example, the wrong executable is specified by the user when the job is submitted.
- The application exits with a non-zero exit code.
You can configure hosts so that LSF detects an abnormally high rate of job exit from a host.
Some jobs may not be considered complete until some post-job processing is performed. For example, a job may need to exit from a post-execution job script, clean up job files, or transfer job output after the job completes.
The DONE or EXIT job states do not indicate whether post-processing is complete, so jobs that depend on post-processing may start prematurely. Use the post_done and post_err keywords with the bsub -w option to specify job dependency conditions for job post-processing. The corresponding job states POST_DONE and POST_ERR indicate the state of the post-processing.
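
A dependency on post-processing can be expressed by building the bsub arguments programmatically. The following is a minimal Python sketch: the helper `post_dependency_command`, the job ID 312, and the script `./collect_results.sh` are hypothetical, and the sketch only assembles the command line without running it.

```python
import shlex


def post_dependency_command(job_id, command, on_error=False):
    """Build a bsub argument list that waits for another job's post-processing.

    job_id   -- ID of the job whose post-processing must finish first
    command  -- hypothetical command to run, given as a list of arguments
    on_error -- use the post_err keyword instead of post_done
    """
    keyword = "post_err" if on_error else "post_done"
    return ["bsub", "-w", f"{keyword}({job_id})"] + list(command)


args = post_dependency_command(312, ["./collect_results.sh"])
# shlex.join quotes the dependency expression so that it is safe to paste into a shell.
print(shlex.join(args))
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting issues; when the command is typed in a shell, the dependency expression must be quoted because it contains parentheses.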
After the job completes, you cannot perform any job control on the post-processing. Post-processing exit codes are not reported to LSF.
The post-processing of a repetitive job cannot be longer than the repetition period.