[slurm-users] send/recv timeout and zero bytes transmitted errors

2019-06-11 Thread Andrei Huang
Hi all, We are having issues submitting MPI jobs. There are also intermittent issues where no Slurm commands can be issued. The major concern is that jobs cannot be submitted to a particular set of 16 nodes in its own partition (a single MPI job, for example, not hundreds at once). There are numerous

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Daniel Letai
I had similar problems in the past. The two most common issues were: 1. Controller load - if slurmctld was under heavy load, it sometimes didn't respond in a timely manner, exceeding the timeout limit. 2. Topology and message forwarding and aggregation. For

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Steffen Grunewald
On Tue, 2019-06-11 at 13:56:34 +, Marcelo Garcia wrote: > Hi > > Since mid-March 2019 we are having a strange problem with slurm. Sometimes, > the command "sbatch" fails: > > + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p > operw /home2/mma002/ecf/home/Aos/Prod/

[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Marcelo Garcia
Hi, Since mid-March 2019 we have been having a strange problem with Slurm. Sometimes, the command "sbatch" fails: + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1 sbatch: error: Batch job submission failed: