The behavior of Slurm jobs using the 'afterok' dependency seems to have issues
with ephemeral compute nodes, such as those in a cloud cluster. If the job
named in the dependency ran on a compute node that has since spun down, then a
subsequent job requiring successful completion of that prior job fails with
"Job dependency problem". This occurs when the subsequent job is assigned to a
node that must spin up before it can begin execution. The problem does not
occur if the 'afterany' dependency is used instead. It appears that a job's
completion status is retained when a node spins down, but the information
about whether the job completed successfully is not. There may be other
scenarios that could cause the same issue. Has anyone else seen this problem?
How can it be avoided?
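
For concreteness, here is a minimal sketch of the submission pattern that
triggers this (the script names are hypothetical):

    # Submit the first job and capture its job ID.
    jobid=$(sbatch --parsable step1.sh)

    # Fails with "Job dependency problem" if the node that ran
    # $jobid has already spun down before this job is scheduled:
    sbatch --dependency=afterok:"$jobid" step2.sh

    # The same submission with 'afterany' proceeds normally:
    sbatch --dependency=afterany:"$jobid" step2.sh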

For reference, the relevant dependency types from the sbatch documentation:

    afterany
        This job can begin execution after the specified jobs have
        terminated.
    aftercorr
        A task of this job array can begin execution after the
        corresponding task ID in the specified job has completed
        successfully.
    afternotok
        This job can begin execution after the specified jobs have
        terminated in some failed state.
    afterok
        This job can begin execution after the specified jobs have
        successfully executed.
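
In case it helps anyone reproduce or diagnose this, these are the commands I
would use to inspect what state Slurm has retained after the node spins down
(assuming accounting is enabled; $jobid and $depjobid stand in for the prior
and dependent job IDs):

    # Recorded state and exit code of the completed job; if the
    # exit status is lost on spin-down, it should show up here.
    sacct -j "$jobid" --format=JobID,JobName,State,ExitCode

    # Job ID, state, and pending reason reported by the scheduler
    # for the dependent job.
    squeue -j "$depjobid" -o "%i %T %r"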

