date:20190611

[slurm-users] send/recv timeout and zero bytes transmitted errors

2019-06-11 Thread Andrei Huang

Hi all, We are having issues submitting MPI jobs. There are intermittent issues as well where no slurm commands can be issued. The major concern is that jobs cannot be submitted to a particular set of 16 nodes in its own partition (1 mpi job for example, not 100s at once..) There are numerous

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Daniel Letai

I had similar problems in the past. The 2 most common issues were: 1. Controller load - if the slurmctld was in heavy use, it sometimes didn't respond in a timely manner, exceeding the timeout limit. 2. Topology and msg forwarding and aggregation. For
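
For the controller-load case described in this reply, a client-side workaround could be to retry the submission when the timeout error appears. The following is only a minimal sketch, not something proposed in the thread: the helper names `run_sbatch` and `submit_with_retry`, the attempt count and the delays are all hypothetical, and the only string taken from the thread is the error message itself.

```python
import subprocess
import time

# Error text as quoted in the thread subject.
TIMEOUT_MARKER = "Socket timed out on send/recv operation"


def run_sbatch(args):
    """Run sbatch with the given arguments; return (returncode, stderr)."""
    proc = subprocess.run(["sbatch", *args], capture_output=True, text=True)
    return proc.returncode, proc.stderr


def submit_with_retry(args, run=run_sbatch, attempts=4, base_delay=2.0,
                      sleep=time.sleep):
    """Hypothetical helper: retry only on the timeout error.

    The wait doubles after each failed attempt (base_delay, 2*base_delay, ...).
    Any other failure, or a success, is returned immediately.
    """
    for attempt in range(attempts):
        code, err = run(args)
        if code == 0 or TIMEOUT_MARKER not in err:
            return code, err
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return code, err
```

A call would look like `submit_with_retry(["-p", "operw", "job.sh"])`, where `job.sh` is a placeholder script name. One caveat with this approach: a submission that timed out on the client side may still have been accepted by the controller, so a retry can create a duplicate job; checking the queue before resubmitting would be the safer variant.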

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Steffen Grunewald

On Tue, 2019-06-11 at 13:56:34 +, Marcelo Garcia wrote: > Hi > > Since mid-March 2019 we are having a strange problem with slurm. Sometimes, > the command "sbatch" fails: > > + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p > operw /home2/mma002/ecf/home/Aos/Prod/

[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Marcelo Garcia

Hi Since mid-March 2019 we are having a strange problem with slurm. Sometimes, the command "sbatch" fails: + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1 sbatch: error: Batch job submission failed:


4 matches
