If a job is forwarded to a remote cluster and then fails to start, it returns to the submission queue and LSF retries the job. After a certain number of failed retry attempts, LSF suspends the job (PSUSP). The job remains in that state until the job owner or administrator takes action to resume, modify, or remove the job.
By default, the retry threshold is 5, so LSF tries to start a job up to 6 times: the initial attempt plus 5 retries. The retry threshold is configurable.
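The relationship between the retry threshold and the total number of start attempts can be illustrated with a small model. This is only a sketch of the counting described above, not LSF's scheduler logic; the function `forward_with_retries` and the `try_start` callback are hypothetical names introduced for illustration.

```python
from typing import Callable, Tuple

DEFAULT_RETRY_THRESHOLD = 5  # default value stated in the text above


def forward_with_retries(
    try_start: Callable[[], bool],
    retry_threshold: int = DEFAULT_RETRY_THRESHOLD,
) -> Tuple[str, int]:
    """Model the retry behavior and return (final_state, start_attempts).

    try_start is a hypothetical stand-in for forwarding the job to a
    remote cluster; it returns True if the job starts there.
    """
    attempts = 0
    # One initial attempt plus retry_threshold retries.
    while attempts < retry_threshold + 1:
        attempts += 1
        if try_start():
            return "RUN", attempts
    # Threshold reached without a successful start: the job is suspended.
    return "PSUSP", attempts
```

With a job that never starts, the model ends in PSUSP after 6 attempts at the default threshold, and after 3 attempts when the threshold is 2.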
You can also configure LSF to send email to the job owner when the job is suspended. This allows the job owner to investigate the problem promptly. By default, LSF does not alert users when a job has reached its retry threshold.
Set LSB_MC_INITFAIL_RETRY in lsf.conf and specify the maximum number of retry attempts. For example, to attempt to start a job no more than 3 times in total, specify 2 retry attempts:
LSB_MC_INITFAIL_RETRY=2
To make LSF email the user when a job is suspended after reaching the retry threshold, set LSB_MC_INITFAIL_MAIL in lsf.conf to y:
LSB_MC_INITFAIL_MAIL=y
By default, LSF does not notify the user.
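To check what a given configuration means in practice, a short script can read the two parameters and apply the defaults described above. The following is a rough sketch, not an LSF tool: `read_initfail_settings` is a hypothetical helper, it handles only simple NAME=VALUE lines, and it assumes that any LSB_MC_INITFAIL_MAIL value other than y means no notification.

```python
def read_initfail_settings(conf_text: str) -> dict:
    """Extract the two retry parameters from lsf.conf-style text."""
    # Defaults described above: 5 retries, no email notification.
    settings = {"retry": 5, "mail": False}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, value = (part.strip().strip('"') for part in line.split("=", 1))
        if name == "LSB_MC_INITFAIL_RETRY":
            settings["retry"] = int(value)
        elif name == "LSB_MC_INITFAIL_MAIL":
            # Assumption: only "y" (any case) turns notification on.
            settings["mail"] = value.lower() == "y"
    return settings


sample = "LSB_MC_INITFAIL_RETRY=2\nLSB_MC_INITFAIL_MAIL=y\n"
settings = read_initfail_settings(sample)
total_attempts = settings["retry"] + 1  # initial attempt plus retries
```

For the sample text, `total_attempts` is 3 and `settings["mail"]` is true, matching the two examples above.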