What about, instead of the (automatic) requeue of the job, using --no-requeue in the first sbatch and, when something goes wrong with the job (why not something wrong with the node?), submitting the job again with --no-requeue and the suspect nodes excluded?

Something like: sbatch --no-requeue file.sh, and then sbatch --no-requeue --exclude=n001 file.sh (options on the command line override the options inside the script).
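A minimal sketch of that resubmission idea, run from outside the job (file.sh and n001 are placeholders for the real batch script and the suspect node):

    # First submission, with automatic requeue disabled
    # (file.sh is a placeholder for the real batch script).
    sbatch --no-requeue file.sh

    # Later, if the job failed and a particular node is suspected,
    # resubmit and exclude that node; command-line options override
    # the #SBATCH options inside the script.
    badnode=n001    # placeholder node name
    sbatch --no-requeue --exclude=$badnode file.sh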
On Thu, Jun 4, 2020 at 17:40, Ransom, Geoffrey M. (<geoffrey.ran...@jhuapl.edu>) wrote:

> Not quite.
>
> The user's job script in question is checking the error status of the
> program it ran while it is running. If a program fails, the running job
> wants to exclude the machine it is currently running on and requeue itself,
> in case it died due to a local machine issue that the scheduler has not
> flagged as a problem.
>
> The current goal is to have a running job step in an array job add the
> current host to its exclude list and requeue itself when it detects a
> problem. I can't seem to modify the exclude list while a job is running,
> but once the task is requeued and back in the queue it is no longer running,
> so it can't modify its own exclude list.
>
> I.e., put something like the following into an sbatch script so each task
> can run it against itself:
>
> if ! $runprogram $args ; then
>     NewExcNodeList="$ExcNodeList,$HOSTNAME"
>     scontrol update job ${SLURM_JOB_ID} ExcNodeList=$NewExcNodeList
>     scontrol requeue ${SLURM_JOB_ID}
>     sleep 10
> fi
>
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> *On Behalf Of* Rodrigo Santibáñez
> *Sent:* Thursday, June 4, 2020 4:16 PM
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject:* [EXT] Re: [slurm-users] Change ExcNodeList on a running job
>
> Hello,
>
> Jobs can be requeued if something goes wrong, and the node with the failure
> excluded by the controller.
>
> *--requeue*
> Specifies that the batch job should be eligible for requeuing. The job may
> be requeued explicitly by a system administrator, after node failure, or
> upon preemption by a higher priority job. When a job is requeued, the batch
> script is initiated from its beginning. Also see the *--no-requeue* option.
> The *JobRequeue* configuration parameter controls the default behavior on
> the cluster.
>
> Also, jobs can be run selecting a specific node or excluding nodes:
>
> *-w*, *--nodelist*=<*node name list*>
> Request a specific list of hosts. The job will contain *all* of these hosts
> and possibly additional hosts as needed to satisfy resource requirements.
> The list may be specified as a comma-separated list of hosts, a range of
> hosts (host[1-5,7,...] for example), or a filename. The host list will be
> assumed to be a filename if it contains a "/" character. If you specify a
> minimum node or processor count larger than can be satisfied by the supplied
> host list, additional resources will be allocated on other nodes as needed.
> Duplicate node names in the list will be ignored. The order of the node
> names in the list is not important; the node names will be sorted by Slurm.
>
> *-x*, *--exclude*=<*node name list*>
> Explicitly exclude certain nodes from the resources granted to the job.
>
> Does this help?
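(As an inline usage sketch of the options quoted above; the node names and the script name are placeholders:)

    # Allow requeueing, but never place the job on n001 or n002.
    sbatch --requeue --exclude=n001,n002 job.sh

    # Or restrict the job to a specific set of hosts instead.
    sbatch --requeue --nodelist=host[1-3] job.sh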
> On Thu, Jun 4, 2020 at 16:03, Ransom, Geoffrey M. (<geoffrey.ran...@jhuapl.edu>) wrote:
>
> Hello,
>
> We are moving from Univa (SGE) to Slurm, and one of our users has jobs that,
> if they detect a failure on the current machine, add that machine to their
> exclude list and requeue themselves. The user wants to emulate that behavior
> in Slurm.
>
> It seems like "scontrol update job ${SLURM_JOB_ID} ExcNodeList=$NewExcNodeList"
> won't work on a running job, but it does work on a job pending in the queue.
> This means the job can't do this step and requeue itself to avoid running on
> the same host as before.
>
> Our user wants his jobs to be able to exclude the current node and requeue
> themselves.
>
> Is there some way to accomplish this in Slurm?
>
> Is there a requeue counter of some sort so a job can see if it has requeued
> itself more than X times and give up?
>
> Thanks.
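On the requeue-counter question at the end: a minimal sketch of one way a job could give up after too many requeues, assuming SLURM_RESTART_COUNT is set in the batch environment (the sbatch documentation describes it as the number of times the job has been restarted or requeued); $runprogram, $args and MAX_REQUEUES are placeholders, not anything from your setup:

    MAX_REQUEUES=3    # placeholder limit
    if ! $runprogram $args ; then
        # SLURM_RESTART_COUNT is unset on the first run, hence the default of 0.
        if [ "${SLURM_RESTART_COUNT:-0}" -ge "$MAX_REQUEUES" ]; then
            echo "Failed after ${SLURM_RESTART_COUNT} requeues; giving up." >&2
            exit 1
        fi
        scontrol requeue ${SLURM_JOB_ID}
        sleep 10
    fi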