> Hi Chris
>
> You are right in pointing out that the job actually runs, despite the error from
> the sbatch command. The customer mentioned that:
> === start ===
> Problem had usual scenario - job script was submitted and executed, but
> sbatch command returned non-zero exit status to ecflow, which thus assumed
> job to be dead.
> === end ===
>
> Which version of Slurm are you using? I'm using "17.02.4-1", and we are
> wondering about the possibility of upgrading to a newer version; that is, I
> hope that there was a bug and SchedMD has fixed the problem.
Sorry, I missed that. I am not the admin of the system, but I believe we are using 18.08.7. We have a ticket open with SchedMD, and our admin team is working with them. As I understand it, the approach being taken is to capture statistics with sdiag and use that information to tune configuration parameters; they view the problem as a configuration issue rather than a bug in the scheduler.

What this means to me is that the timeouts can only be minimized, not eliminated. And because workflow corruption is such a disastrous event, I have built in attempts to work around it even though occurrences are "rare".
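For what it's worth, the kind of workaround I mean looks roughly like the sketch below. It is only an illustration, not the exact code in our suite: it assumes each submission carries a unique --job-name (a hypothetical convention here) so that a failed sbatch can be cross-checked against squeue/sacct before the task is declared dead.

import subprocess

def submit_with_verification(script, job_name):
    """Submit a job with sbatch; if sbatch reports failure, ask the
    controller whether the job was registered anyway before giving up.

    Assumes job_name is unique per submission (illustrative convention).
    """
    result = subprocess.run(
        ["sbatch", "--parsable", "--job-name", job_name, script],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        # --parsable prints "jobid[;cluster]" on success
        return result.stdout.strip().split(";")[0]

    # sbatch returned non-zero (e.g. a timeout talking to slurmctld),
    # but the job may still have been accepted -- look it up by name,
    # first among pending/running jobs, then in accounting.
    for probe in (["squeue", "-h", "-n", job_name, "-o", "%i"],
                  ["sacct", "-n", "-X", "--name=" + job_name, "--format=JobID"]):
        found = subprocess.run(probe, capture_output=True, text=True)
        job_ids = found.stdout.split()
        if found.returncode == 0 and job_ids:
            return job_ids[0]  # the job exists despite the failed sbatch

    raise RuntimeError(
        "sbatch failed and no job named %s was found:\n%s"
        % (job_name, result.stderr)
    )

The idea is simply to treat a non-zero exit status from sbatch as "unknown" rather than "dead" until the controller has been asked whether the job actually exists.

Chris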