Changes between Version 19 and Version 20 of net.sf.basedb.opengrid/using


Timestamp:
Aug 24, 2020, 1:30:10 PM (16 months ago)
Author:
Nicklas Nordborg
Comment:

Updated documentation with Slurm information

  • net.sf.basedb.opengrid/using

    v19 v20  
    194194One important feature is that extensions can get notified when a job running on the cluster has ended. This is implemented in an asynchronous manner and it should not matter if the BASE server is updated or restarted or otherwise modified while a job is running. In the background there are three parts that work together to make this feature work.
    195195
    196  * The BASE system for requesting job progress information about external jobs has been set up to send requests to the `OpenGridService` whenever it wants new information about a job. This is the reason why it is important to create a BASE job item as a proxy for the Open Grid Cluster jobs. Without it, no progress information is requested and we never get to know when the job has ended.
     196 * The BASE system for requesting job progress information about external jobs has been set up to send requests to the `OpenGridService` whenever it wants new information about a job. This is the reason why it is important to create a BASE job item as a proxy for all jobs submitted to a cluster. Without it, no progress information is requested and we never get to know when the job has ended.
    197197 * The `OpenGridService` polls each registered cluster at regular intervals, typically once every minute, but it may be more or less often depending on whether any known jobs are executing. The `OpenGridSession.qstat()` and `OpenGridSession.qacct()` methods are used for this and will detect waiting, running and completed jobs. For running jobs, the service downloads the `progress` file (see `ScriptBuilder.progress()` above) and updates the progress information in the BASE database.
    198198 * Once a job has been detected as completed, the service initiates the job completion sequence. This is implemented as a custom extension point (`net.sf.basedb.opengrid.job-complete`) that receives messages about completed jobs. Extensions that want to get notified should extend this extension point. Note that all registered extensions are notified about all jobs! It doesn't matter which extension originally submitted the job to the cluster. Notifications are sent both for successful and failed jobs. '''Each extension is responsible for filtering and ignoring notifications about jobs that are of no interest to it'''. This is why it is important to set name, plug-in version, etc. on the job when submitting it. We recommend that this filtering step is done in the `ActionFactory` implementation that is registered for the `net.sf.basedb.opengrid.job-complete` extension point. Note that a single notification may handle more than one completed job. Thus, the `prepareContext()` method is called once and without any information about the jobs, while the `getActions()` method is called once for every job.
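
The filtering step described in the last bullet can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual BASE or `OpenGridService` API: `JobInfo`, `CompletionHandler`, `handlerFor()` and the `my-extension:` name prefix are hypothetical stand-ins, chosen only to show the "return `null` unless the job is ours" pattern that a real `ActionFactory.getActions()` implementation would follow.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: every type here is a hypothetical stand-in, not a BASE class.
class JobFilterSketch {

    // Stand-in for the BASE job item; holds only what the filter needs.
    static final class JobInfo {
        final String name;
        final String pluginVersion;
        JobInfo(String name, String pluginVersion) {
            this.name = name;
            this.pluginVersion = pluginVersion;
        }
    }

    // Stand-in for a job completion handler.
    interface CompletionHandler {
        String jobCompleted(JobInfo job);
    }

    // Assumed convention: this extension tags its jobs with a name prefix
    // and a plug-in version when submitting them to the cluster.
    static final String NAME_PREFIX = "my-extension:";
    static final String PLUGIN_VERSION = "1.0";

    // Plays the role of getActions(): called once per completed job and
    // returns null for jobs that some other extension submitted.
    static CompletionHandler handlerFor(JobInfo job) {
        boolean ours = job.name != null
            && job.name.startsWith(NAME_PREFIX)
            && PLUGIN_VERSION.equals(job.pluginVersion);
        if (!ours) return null;
        // The returned string would become the job's status message.
        return j -> "Handled " + j.name;
    }

    public static void main(String[] args) {
        // One notification may cover several completed jobs.
        List<JobInfo> completed = Arrays.asList(
            new JobInfo("my-extension:align-42", "1.0"),
            new JobInfo("other-extension:qc-7", "2.3"));
        for (JobInfo job : completed) {
            CompletionHandler handler = handlerFor(job);
            System.out.println(job.name + " -> "
                + (handler == null ? "ignored" : handler.jobCompleted(job)));
        }
    }
}
```

Keeping the check cheap and free of side effects matters because it runs for every completed job on every registered cluster, including jobs that belong to other extensions.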
     
    255255The `ActionFactory.getActions()` implementation should not do anything except check if the job is of interest or not. It should return `null` if it is not interested in the job, and an instance implementing the `JobCompletionHandler` interface otherwise. This interface defines a single method: `JobCompletionHandler.jobCompleted(SessionControl, OpenGridSession, Job, JobStatus)`. The `Job` and `JobStatus` objects are the same as in the `ActionFactory`, but in this method you also get access to a `SessionControl` instance and an `OpenGridSession` that is connected to the cluster the job was running on. The `OpenGridSession` can, for example, be used to download and parse result files. The `SessionControl` can be used to access BASE and update items and/or annotations. The good thing about the `SessionControl` is that it has been automatically configured so that the owner of the job is already logged in and a project (if any is specified on the job) is set as the active project (in the `ActionFactory` the session control is a generic one with the root user logged in).
    256256
    257 Do not update the `Job` item since this may interfere with the updates to the job made by the Open Grid extension. The method may return a string to set the status message of the job, or throw an exception to set the job status to ERROR.
     257Do not update the `Job` item since this may interfere with the updates to the job that are automatically made by the Job Scheduler extension. The method may return a string to set the status message of the job, or throw an exception to set the job status to ERROR.
    258258
    259259{{{
     
    290290== Aborting jobs ==
    291291
    292 This is automatically handled by the Open Grid extension by the same mechanism that is used for progress reporting. The abort is handled by calling the `OpenGridSession.qdel()` method. After that the job is handled just as if any other error had occurred, that is, the job completion sequence is initiated. Extensions that are interested in manually aborted jobs should check for `JobStatus.getStatus() == Job.Status.ERROR` and `JobStatus.getExitCode() == 137`, which indicates that the job was aborted by the user.
     292This is automatically handled by the Job Scheduler extension by the same mechanism that is used for progress reporting. The abort is handled by calling the `OpenGridSession.qdel()` method. After that the job is handled just as if any other error had occurred, that is, the job completion sequence is initiated. Extensions that are interested in manually aborted jobs should check for `JobStatus.getStatus() == Job.Status.ERROR` and `JobStatus.getExitCode() == 137`, which indicates that the job was aborted by the user.
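
The abort check described above could look something like the sketch below. It is self-contained, so the `Status` enum and `StatusInfo` class are hypothetical stand-ins for `Job.Status` and `JobStatus`; only the two conditions (status `ERROR` and exit code 137) come from the text, while the helper name `wasAbortedByUser()` is an assumption made for illustration.

```java
// Sketch only: Status and StatusInfo are hypothetical stand-ins,
// not the BASE Job.Status and JobStatus classes.
class AbortCheckSketch {

    // Stand-in for Job.Status; only ERROR matters for this check.
    enum Status { WAITING, EXECUTING, DONE, ERROR }

    // Exit code that the text says indicates a user abort.
    // (137 is also the conventional 128 + 9 code for a process killed by SIGKILL.)
    static final int ABORTED_EXIT_CODE = 137;

    // Stand-in for JobStatus, exposing the two getters named in the text.
    static final class StatusInfo {
        private final Status status;
        private final int exitCode;
        StatusInfo(Status status, int exitCode) {
            this.status = status;
            this.exitCode = exitCode;
        }
        Status getStatus() { return status; }
        int getExitCode() { return exitCode; }
    }

    // True only when both conditions from the text hold.
    static boolean wasAbortedByUser(StatusInfo info) {
        return info.getStatus() == Status.ERROR
            && info.getExitCode() == ABORTED_EXIT_CODE;
    }

    public static void main(String[] args) {
        StatusInfo aborted = new StatusInfo(Status.ERROR, 137);
        StatusInfo failed = new StatusInfo(Status.ERROR, 1);
        System.out.println("aborted job: " + wasAbortedByUser(aborted));
        System.out.println("failed job:  " + wasAbortedByUser(failed));
    }
}
```

A completion handler could use such a check to skip result parsing for aborted jobs while still cleaning up temporary files.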
    293293
    294294== Advanced usage ==