Checkpointing enables LSF users to restart a job on the same execution host or to
migrate a job to a different execution host. LSF controls
checkpointing and restart by means of interfaces named echkpnt and erestart.
By default, when a user specifies a checkpoint directory using bsub
-k or bmod -k or submits a job to a queue
that has a checkpoint directory specified, echkpnt sends
checkpoint instructions to an executable named echkpnt.default.
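As an illustration of the submission step, here is a minimal Python sketch that assembles a bsub -k command line without running it. The checkpoint directory /share/chkpnt, the job command ./my_app, and the helper name bsub_checkpoint_argv are hypothetical. The optional method=name suffix inside the -k argument is an assumption based on the usual bsub syntax and should be verified against the bsub reference for your LSF version.

```python
import shlex
from typing import List, Optional


def bsub_checkpoint_argv(
    checkpoint_dir: str,
    job_command: List[str],
    method: Optional[str] = None,
) -> List[str]:
    """Build (but do not run) a bsub command that makes a job checkpointable.

    Assumption: the -k argument is a single string that starts with the
    checkpoint directory and may carry a "method=name" suffix.
    """
    spec = checkpoint_dir
    if method is not None:
        spec += f" method={method}"
    return ["bsub", "-k", spec, *job_command]


# Hypothetical values, for illustration only.
argv = bsub_checkpoint_argv("/share/chkpnt", ["./my_app"], method="fluent")
print(shlex.join(argv))  # shell-quoted form, suitable for copy and paste
```

Building the argument list separately from execution keeps the -k string intact as one argument, which matters because it contains spaces.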
When LSF checkpoints a job, the echkpnt interface creates
a checkpoint file in the directory checkpoint_dir/job_ID,
and then checkpoints and resumes the job. The job continues to run,
even if checkpointing fails.
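The checkpoint_dir/job_ID layout described above can be sketched in Python as follows. The directory /share/chkpnt, the job ID 1234, and the helper name checkpoint_job_dir are hypothetical and serve only to show how the two components combine.

```python
from pathlib import PurePosixPath


def checkpoint_job_dir(checkpoint_dir: str, job_id: int) -> PurePosixPath:
    """Return checkpoint_dir/job_ID, the per-job checkpoint location."""
    if job_id <= 0:
        raise ValueError("job_id must be a positive integer")
    return PurePosixPath(checkpoint_dir) / str(job_id)


# Hypothetical values, for illustration only.
print(checkpoint_job_dir("/share/chkpnt", 1234))  # /share/chkpnt/1234
```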
When LSF restarts a stopped job, the erestart interface recovers job state information
from the checkpoint file, including information about the execution
environment, and restarts the job from the point at which the job
stopped. At job restart, LSF:
Resubmits the job to its original queue and assigns a new job ID
Dispatches the job when a suitable host becomes available (not necessarily the original execution host)
Re-creates the execution environment based on information from the checkpoint file
Restarts the job from its most recent checkpoint
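A restart request that triggers the steps above might be assembled as in the following Python sketch. It assumes the common brestart form that takes the checkpoint directory followed by the ID of the checkpointed job. The values and the helper name brestart_argv are hypothetical, and the exact options should be checked against the brestart reference.

```python
from typing import List


def brestart_argv(checkpoint_dir: str, job_id: int) -> List[str]:
    """Build (but do not run) a brestart command for a checkpointed job.

    job_id is the ID the job had when it was checkpointed; LSF assigns
    a new job ID when the job is resubmitted.
    """
    if job_id <= 0:
        raise ValueError("job_id must be a positive integer")
    return ["brestart", checkpoint_dir, str(job_id)]


# Hypothetical values, for illustration only.
print(brestart_argv("/share/chkpnt", 1234))
```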
Figure: Default behavior (job checkpoint and restart not enabled)
Figure: With job checkpoint and restart enabled
Kernel-level checkpoint and restart
The operating system provides checkpoint and restart functionality that
is transparent to your applications and enabled by default. To implement
job checkpoint and restart at the kernel level, the LSF echkpnt and erestart executables
invoke operating system-specific calls.
LSF uses the default executables echkpnt.default and erestart.default for
kernel-level checkpoint and restart.
Application-level checkpoint and restart
Different applications have different checkpointing implementations that require
the use of customized external executables (echkpnt.application and erestart.application).
Application-level checkpoint and restart enables you to configure
LSF to use specific echkpnt.application and erestart.application executables
for a job, queue, or cluster. You can write customized checkpoint
and restart executables for each application that you use.
LSF always pairs a checkpoint executable with its corresponding restart executable.
For example, if you use echkpnt.fluent to checkpoint
a particular job, LSF uses erestart.fluent to
restart the checkpointed job. You cannot override this behavior or
configure LSF to use a specific restart executable.
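The pairing rule can be expressed as a small Python sketch. It models only the naming convention described above (echkpnt.name maps to erestart.name) and is not how LSF itself resolves executables. The helper name restart_executable_for is hypothetical.

```python
def restart_executable_for(checkpoint_executable: str) -> str:
    """Map echkpnt.<name> to the matching erestart.<name>."""
    prefix = "echkpnt."
    if not checkpoint_executable.startswith(prefix):
        raise ValueError(f"not a checkpoint executable: {checkpoint_executable!r}")
    name = checkpoint_executable[len(prefix):]
    if not name:
        raise ValueError("missing application name after 'echkpnt.'")
    return "erestart." + name


print(restart_executable_for("echkpnt.fluent"))   # erestart.fluent
print(restart_executable_for("echkpnt.default"))  # erestart.default
```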
Scope
Applicability
|
Details
|
Operating system
|
|
Job types
|
Non-interactive batch jobs submitted with bsub or bmod
Non-interactive batch jobs, including chunk jobs, checkpointed
with bchkpnt
Non-interactive batch jobs migrated with bmig
Non-interactive batch jobs restarted with brestart
|
Dependencies
|
UNIX and Windows user accounts must be valid on all hosts in
the cluster, or the correct type of account mapping must be enabled.
For a mixed UNIX/Windows cluster, UNIX/Windows user account
mapping must be enabled.
For a cluster with a non-uniform user name space, between-host
account mapping must be enabled.
For a MultiCluster environment with a non-uniform user name
space, cross-cluster user account mapping must be enabled.
The checkpoint and restart executables run under the user account
of the user who submits the job. User accounts must have the correct
permissions to
The erestart.application executable
must have access to the original command line used to submit the job.
For user-level checkpoint and restart, you must have access
to your application object (.o) files.
To allow restart of a checkpointed job on a different host
than the host on which the job originally ran, both the original and
the new hosts must:
Be binary compatible
Run the same dot version of the operating system for predictable results
Have network connectivity and read/execute permissions to the checkpoint and restart executables (in LSF_SERVERDIR by default)
Have network connectivity and read/write permissions to the checkpoint directory and the checkpoint file
Have access to all files open during job execution so that LSF can locate them using an absolute path name
|
Limitations
|
bmod cannot change the echkpnt and erestart executables
associated with a job.
On Linux 32, AIX, and HP platforms with NFS (network file systems),
checkpoint directories (including path and file name) must be shorter
than 1000 characters.
On Linux 64 with NFS (network file systems), checkpoint directories
(including path and file name) must be shorter than 2000 characters.
|
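The NFS path-length limits listed under Limitations could be checked before submission with a sketch like the one below. The platform keys (linux32, aix, hp, linux64) and the helper name checkpoint_path_ok are hypothetical labels chosen for this example; only the numeric limits come from the table above, and they apply to the full checkpoint path including the file name.

```python
# Limits from the Limitations row; the keys are hypothetical labels.
NFS_PATH_LIMITS = {
    "linux32": 1000,
    "aix": 1000,
    "hp": 1000,
    "linux64": 2000,
}


def checkpoint_path_ok(full_path: str, platform: str) -> bool:
    """Return True if the path is strictly shorter than the platform limit."""
    try:
        limit = NFS_PATH_LIMITS[platform]
    except KeyError:
        raise ValueError(f"no documented limit for platform {platform!r}") from None
    return len(full_path) < limit


# Hypothetical path, for illustration only.
print(checkpoint_path_ok("/share/chkpnt/1234/chklog", "linux32"))  # True
```

Because the limit is "shorter than", a path of exactly 1000 characters fails the check on the 1000-character platforms.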