About job checkpoint and restart

Checkpointing enables LSF users to restart a job on the same execution host or to migrate a job to a different execution host. LSF controls checkpointing and restart by means of interfaces named echkpnt and erestart. By default, when a user specifies a checkpoint directory using bsub -k or bmod -k, or submits a job to a queue that has a checkpoint directory specified, echkpnt sends checkpoint instructions to an executable named echkpnt.default.

When LSF checkpoints a job, the echkpnt interface creates a checkpoint file in the directory checkpoint_dir/job_ID, and then checkpoints and resumes the job. The job continues to run, even if checkpointing fails.
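As a minimal sketch of the two paragraphs above, the following Python helpers assemble a bsub -k submission command and compute the directory that would hold the checkpoint file. The helper names, the example directory /share/chkpnt, the job command my_job, and the job ID are hypothetical. The sketch assumes that the -k value is the checkpoint directory, optionally followed by a checkpoint period in minutes. The command is only built as an argument list, not executed.

```python
import posixpath


def build_checkpoint_submit(checkpoint_dir, job_command, period_minutes=None):
    """Build (but do not run) a bsub argument list for a checkpointable job.

    Assumption: the -k value is the checkpoint directory, optionally
    followed by a checkpoint period in minutes.
    """
    k_value = checkpoint_dir
    if period_minutes is not None:
        k_value = f"{checkpoint_dir} {period_minutes}"
    return ["bsub", "-k", k_value, *job_command]


def checkpoint_file_dir(checkpoint_dir, job_id):
    """Return checkpoint_dir/job_ID, the directory holding the checkpoint file."""
    return posixpath.join(checkpoint_dir, str(job_id))


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_checkpoint_submit("/share/chkpnt", ["my_job"], period_minutes=30))
    print(checkpoint_file_dir("/share/chkpnt", 1234))
```

Building the argument list separately from running it makes it easy to review the exact command before submitting it to a real cluster.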

When LSF restarts a stopped job, the erestart interface recovers job state information from the checkpoint file, including information about the execution environment, and restarts the job from the point at which the job stopped. At job restart, LSF:
  1. Resubmits the job to its original queue and assigns a new job ID

  2. Dispatches the job when a suitable host becomes available (not necessarily the original execution host)

  3. Re-creates the execution environment based on information from the checkpoint file

  4. Restarts the job from its most recent checkpoint
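The four restart steps above can be illustrated with a toy model. This is only a sketch of the described sequence, not the way LSF implements restart: the Job and Checkpoint classes, their fields, and the restart_job function are all hypothetical, and the integer progress field stands in for real job state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    environment: dict  # execution environment saved at checkpoint time
    progress: int      # stand-in for the saved job state


@dataclass(frozen=True)
class Job:
    job_id: int
    queue: str
    host: str
    environment: dict
    progress: int


def restart_job(stopped, checkpoint, new_job_id, available_host):
    """Toy model of the four restart steps (hypothetical, not LSF internals)."""
    return Job(
        job_id=new_job_id,                         # 1. new job ID ...
        queue=stopped.queue,                       #    ... in the original queue
        host=available_host,                       # 2. any suitable host
        environment=dict(checkpoint.environment),  # 3. re-created environment
        progress=checkpoint.progress,              # 4. most recent checkpoint
    )


if __name__ == "__main__":
    ckpt = Checkpoint(environment={"TMPDIR": "/tmp"}, progress=70)
    old = Job(101, "normal", "hostA", {"TMPDIR": "/tmp"}, progress=85)
    print(restart_job(old, ckpt, new_job_id=202, available_host="hostB"))
```

In this model the restarted job resumes at the checkpointed progress (70), not at the point the job had reached when it stopped (85), and it may land on a different host under a new job ID.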

[Figure: Default behavior (job checkpoint and restart not enabled)]

[Figure: With job checkpoint and restart enabled]

Kernel-level checkpoint and restart

The operating system provides checkpoint and restart functionality that is transparent to your applications and enabled by default. To implement job checkpoint and restart at the kernel level, the LSF echkpnt and erestart executables invoke operating system-specific calls.

LSF uses the default executables echkpnt.default and erestart.default for kernel-level checkpoint and restart.

Application-level checkpoint and restart

Different applications have different checkpointing implementations that require the use of customized external executables (echkpnt.application and erestart.application). Application-level checkpoint and restart enables you to configure LSF to use specific echkpnt.application and erestart.application executables for a job, queue, or cluster. You can write customized checkpoint and restart executables for each application that you use.

LSF always uses a matching pair of checkpoint and restart executables. For example, if you use echkpnt.fluent to checkpoint a particular job, LSF uses erestart.fluent to restart the checkpointed job. You cannot override this behavior or configure LSF to use a specific restart executable.
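The pairing rule can be expressed as a simple name mapping. The following Python sketch is illustrative only: the function name is hypothetical, and it shows just the naming convention described above, not how LSF resolves the executables internally.

```python
def restart_executable_for(checkpoint_executable):
    """Map an echkpnt.<method> name to the matching erestart.<method> name."""
    prefix = "echkpnt."
    method = checkpoint_executable[len(prefix):]
    if not checkpoint_executable.startswith(prefix) or not method:
        raise ValueError(f"not an echkpnt.<method> name: {checkpoint_executable!r}")
    return f"erestart.{method}"


if __name__ == "__main__":
    print(restart_executable_for("echkpnt.fluent"))
    print(restart_executable_for("echkpnt.default"))
```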

Scope

Applicability and details

Operating system

  • Kernel-level checkpoint and restart using the LSF checkpoint libraries works only with supported operating system versions and architectures.

Job types

  • Non-interactive batch jobs submitted with bsub or bmod

  • Non-interactive batch jobs, including chunk jobs, checkpointed with bchkpnt

  • Non-interactive batch jobs migrated with bmig

  • Non-interactive batch jobs restarted with brestart

Dependencies

  • UNIX and Windows user accounts must be valid on all hosts in the cluster, or the correct type of account mapping must be enabled.
    • For a mixed UNIX/Windows cluster, UNIX/Windows user account mapping must be enabled.

    • For a cluster with a non-uniform user name space, between-host account mapping must be enabled.

    • For a MultiCluster environment with a non-uniform user name space, cross-cluster user account mapping must be enabled.

  • The checkpoint and restart executables run under the user account of the user who submits the job. User accounts must have the correct permissions to:
    • Successfully run executables located in LSF_SERVERDIR or LSB_ECHKPNT_METHOD_DIR

    • Write to the checkpoint directory

  • The erestart.application executable must have access to the original command line used to submit the job.

  • For user-level checkpoint and restart, you must have access to your application object (.o) files.

  • To allow restart of a checkpointed job on a different host than the host on which the job originally ran, both the original and the new hosts must:
    • Be binary compatible

    • Run the same dot version of the operating system for predictable results

    • Have network connectivity and read/execute permissions to the checkpoint and restart executables (in LSF_SERVERDIR by default)

    • Have network connectivity and read/write permissions to the checkpoint directory and the checkpoint file

    • Have access to all files open during job execution so that LSF can locate them using an absolute path name
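Some of the dependencies above can be checked before relying on a restart. The following Python sketch, run as the submitting user on a candidate host, tests two of them: read/write access to the checkpoint directory, and absolute path names for the files the job opens. The function name and the returned messages are hypothetical, and the check covers only these two items, not binary compatibility or operating system version.

```python
import os


def restart_prerequisite_problems(checkpoint_dir, open_files):
    """Return a list of problems found on the local host (empty if none).

    Hypothetical helper: checks only checkpoint directory access and
    absolute path names of the files the job opens.
    """
    problems = []
    if not os.path.isdir(checkpoint_dir):
        problems.append(f"checkpoint directory not found: {checkpoint_dir}")
    elif not os.access(checkpoint_dir, os.R_OK | os.W_OK):
        problems.append(f"no read/write access: {checkpoint_dir}")
    for path in open_files:
        if not os.path.isabs(path):
            problems.append(f"not an absolute path: {path}")
    return problems


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for problem in restart_prerequisite_problems("/share/chkpnt", ["input.dat"]):
        print(problem)
```

Because access rights depend on the account and host, the same check would need to run on both the original host and the new host to be meaningful.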

Limitations

  • bmod cannot change the echkpnt and erestart executables associated with a job.

  • On Linux 32, AIX, and HP platforms with NFS (Network File System), checkpoint directories (including path and file name) must be shorter than 1000 characters.

  • On Linux 64 with NFS (Network File System), checkpoint directories (including path and file name) must be shorter than 2000 characters.
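The NFS length limits above can be checked before a job is submitted. In the following Python sketch, the platform labels (linux32, aix, hp, linux64) and the function name are hypothetical, and the limits are taken directly from the two items above. The sketch assumes the full path, including the file name, is the string being measured.

```python
# Exclusive upper bounds on checkpoint path length over NFS, in characters.
# The platform labels are hypothetical keys chosen for this sketch.
NFS_PATH_LIMITS = {
    "linux32": 1000,
    "aix": 1000,
    "hp": 1000,
    "linux64": 2000,
}


def checkpoint_path_ok(platform, checkpoint_path):
    """Return True if the path (including file name) is within the NFS limit."""
    limit = NFS_PATH_LIMITS.get(platform)
    if limit is None:
        return True  # no limit listed for this platform
    return len(checkpoint_path) < limit


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    print(checkpoint_path_ok("linux32", "/share/chkpnt/1234/chkpnt_file"))
```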