I'm not really in a position to check, since I'm not our cluster admin. I
asked him, and he thought it might be down to high load on the client node
at the time; we often run submission commands from our shared compute
nodes, which can become overloaded because they aren't themselves managed
by a scheduler. If it's *not* that, and it's something we really need to
investigate, that would be good to know.
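For what it's worth, here's a minimal sketch (not Toil's actual code) of how a
submitter might classify this failure as "maybe submitted", so the caller knows
to verify via squeue/sacct before resubmitting. The exit code and error string
are the ones from my report; any other ambiguous messages would be assumptions:

```python
# Sketch only: decide whether a failed `sbatch` call might still have
# queued the job. On a communication timeout, slurmctld may have accepted
# the submission even though the client never saw the reply, so blindly
# resubmitting risks running the job twice.

# Error substrings that indicate the controller may have processed the
# request anyway. Only the message I actually observed is listed here;
# a real implementation might add others.
AMBIGUOUS_ERRORS = (
    "Socket timed out on send/recv operation",
)

def submission_may_have_succeeded(returncode: int, stderr: str) -> bool:
    """Return True if a failed sbatch call might still have queued the job.

    When this returns True, the caller should check the queue (e.g. by a
    unique --job-name passed at submission) before trying sbatch again.
    """
    if returncode == 0:
        return False  # the call succeeded; nothing ambiguous about it
    return any(msg in stderr for msg in AMBIGUOUS_ERRORS)
```

The idea is that a resubmit-on-failure loop only retries immediately for
unambiguous errors, and does a queue lookup first for ambiguous ones.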

On Mon, Feb 23, 2026 at 9:42 PM Christopher Samuel via slurm-users <
[email protected]> wrote:

> On 2/17/26 12:56 pm, Adam Novak via slurm-users wrote:
>
> > I'm working on the Slurm integration in our Toil workflow runner
> > project. I'm having a problem where an `sbatch` command to submit a job
> > to Slurm can fail (with exit code 1 and message "sbatch: error: Batch
> > job submission failed: Socket timed out on send/recv operation", in my
> > case, but possibly in other ways), but the job can still actually have
> > been submitted, and can still execute.
>
> I know others have given ideas on working around this, but have you had
> a chance to dig into why this is happening for you? That sort of network
> timeout points to either the slurmctld being totally overwhelmed with
> RPCs, or wedged in I/O, or some odd network problem.
>
> Do you see anything diagnostic in the slurmctld logs when that's happening?
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Philadelphia, PA, USA
>
> --
> slurm-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
>


-- 
Adam Novak (He/Him)
Senior Software Engineer
Computational Genomics Lab
UC Santa Cruz Genomics Institute
"Revealing life’s code."

Personal Feedback: https://forms.gle/UXZhZc123knF65Dw5