Hello,

We are moving from Univa (SGE) to Slurm, and one of our users has jobs that,
when they detect a failure on the current machine, add that machine to their
exclude list and requeue themselves. The user wants to emulate that behavior
in Slurm.

It seems that "scontrol update JobId=${SLURM_JOB_ID} ExcNodeList=$NEWExcNodeList"
does not work on a running job, although it does work on a job pending in the
queue. This means a running job cannot update its exclude list and then
requeue itself to avoid running on the same host as before.
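
For concreteness, here is a minimal sketch of the step the job is trying to
perform. The helper names (add_to_exclude_list, exclude_and_requeue) are made
up for illustration. The current exclude list is assumed to be passed in by
the job script as a plain comma-separated list (no bracketed ranges), and the
node name is assumed to come from SLURMD_NODENAME, falling back to
"hostname -s".

```shell
#!/bin/bash
# Sketch only: the helper names below are hypothetical, not part of Slurm.

# Append a host to a comma-separated exclude list, skipping duplicates
# and tolerating an empty starting list.
add_to_exclude_list() {
    local current="$1" host="$2"
    if [ -z "$current" ]; then
        printf '%s\n' "$host"
        return
    fi
    case ",$current," in
        *",$host,"*) printf '%s\n' "$current" ;;            # already excluded
        *)           printf '%s,%s\n' "$current" "$host" ;; # append new host
    esac
}

# Exclude the node we are running on, then requeue the job.
# $1 is the exclude list the job already has (assumed to be tracked by the
# job script itself). The update step is the one that does not take effect
# while the job is still running.
exclude_and_requeue() {
    local node new_list
    node="${SLURMD_NODENAME:-$(hostname -s)}"
    new_list=$(add_to_exclude_list "$1" "$node")
    scontrol update JobId="${SLURM_JOB_ID}" ExcNodeList="${new_list}" &&
        scontrol requeue "${SLURM_JOB_ID}"
}
```

In a job script this would be called as exclude_and_requeue "$CURRENT_EXCLUDES"
(a hypothetical variable) once a failure is detected.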

Our user wants his jobs to be able to exclude the current node and requeue
themselves. Is there some way to accomplish this in Slurm?

Also, is there a requeue counter of some sort, so that a job can see whether
it has requeued itself more than X times and give up?
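
To illustrate the second question: the sbatch man page lists a
SLURM_RESTART_COUNT environment variable, and if that is the right counter to
use, the give-up check we have in mind would look roughly like this.
MAX_REQUEUES and should_give_up are hypothetical names, and whether the
variable is incremented on every self-requeue is an assumption on our part.

```shell
#!/bin/bash
# Sketch only: MAX_REQUEUES and should_give_up are hypothetical names.
MAX_REQUEUES=3

# Succeeds (returns 0) when the restart count is greater than the limit.
# An unset or empty count is treated as 0, i.e. the first run of the job.
should_give_up() {
    local count="${1:-0}" limit="${2:-$MAX_REQUEUES}"
    [ "$count" -gt "$limit" ]
}
```

The job script would then start with something like:
if should_give_up "${SLURM_RESTART_COUNT:-0}"; then exit 1; fi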

Thanks.
