On 2/17/26 12:56 pm, Adam Novak via slurm-users wrote:

> I'm working on the Slurm integration in our Toil workflow runner project. I'm having a problem where an `sbatch` command to submit a job to Slurm can fail (with exit code 1 and message "sbatch: error: Batch job submission failed: Socket timed out on send/recv operation", in my case, but possibly in other ways), but the job can still actually have been submitted, and can still execute.

I know others have given ideas on working around this, but have you had a chance to dig into why this is happening for you? That sort of network timeout points to the slurmctld being totally overwhelmed with RPCs, or wedged in I/O, or to some odd network problem.

Do you see anything diagnostic in the slurmctld logs when that's happening?
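
Alongside the logs, it may help to capture a snapshot of the controller's load at the moment a submission times out. Below is a minimal Python sketch of that idea, not a definitive tool. It assumes `sdiag` is on the PATH and that its output contains "Server thread count" and "Agent queue size" lines in "Name: value" form. The helper names `parse_sdiag` and `snapshot` are made up for illustration.

```python
import re
import subprocess

# Fields assumed to appear in `sdiag` output as "Name: value" lines.
FIELDS = ("Server thread count", "Agent queue size")


def parse_sdiag(text):
    """Return {field: int} for whichever FIELDS appear in sdiag-style text."""
    found = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name in FIELDS:
            match = re.match(r"\s*(\d+)", value)
            if match:
                found[name] = int(match.group(1))
    return found


def snapshot():
    """Run `sdiag` once and return the parsed fields (hypothetical helper)."""
    result = subprocess.run(
        ["sdiag"], capture_output=True, text=True, check=True
    )
    return parse_sdiag(result.stdout)
```

Calling `snapshot()` from the code path that handles the failed `sbatch` and logging the result next to the error would show whether the timeouts line up with a busy controller. What counts as "busy" depends on the site, so compare against values taken when submissions succeed.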

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Philadelphia, PA, USA

--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
