[slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-12 Thread Alessandro Federico
this bug report https://bugs.schedmd.com/show_bug.cgi?id=4002 but it regards the previous SLURM version and we do not set debug3 on slurmctld. thanks in advance ale -- Alessandro Federico HPC System Management Group System & Technology Department CINECA www.cineca.it Via dei Tiz

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-15 Thread Alessandro Federico
"John DeSantis" > To: "Alessandro Federico" > Cc: slurm-users@lists.schedmd.com, "Isabella Baccarelli" > , hpc-sysmgt-i...@cineca.it > Sent: Friday, January 12, 2018 7:58:38 PM > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/re

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-16 Thread Alessandro Federico
> See Giovanni Torres' post on setting this up... > > http://giovannitorres.me/graphing-sdiag-with-graphite.html > > -- Trevor > > > On Jan 15, 2018, at 4:33 AM, Alessandro Federico > > wrote: > > > > Hi John > > > > thanks for the info.

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-16 Thread Alessandro Federico
han log > contents by themselves. > > See Giovanni Torres' post on setting this up... > > http://giovannitorres.me/graphing-sdiag-with-graphite.html > > -- Trevor > > > On Jan 15, 2018, at 4:33 AM, Alessandro Federico > > wrote: > > > >

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-17 Thread Alessandro Federico
have several partitions which contain the same set of nodes that > > > could be condensed - decreasing the amount of partitions. For > > > example, the partitions bdw_all_serial & bdw_all_rcm could be > > > consolidated into a single partition by: > > > >

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-17 Thread Alessandro Federico
ssage - > From: "John DeSantis" > To: "Alessandro Federico" > Cc: "Slurm User Community List" , "Isabella > Baccarelli" , > hpc-sysmgt-i...@cineca.it > Sent: Wednesday, January 17, 2018 3:30:43 PM > Subject: Re: [slurm-users]

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-22 Thread Alessandro Federico
Hi John, just an update... we do not have a solution for the SSSD issue yet, but we changed the ACL on the 2 partitions from AllowGroups=g2 to AllowAccounts=g2 and the slowdown has gone. Thanks for the help ale - Original Message - > From: "Alessandro Federico" > To: "
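
As an illustration of the ACL change described in the message above, a partition definition in slurm.conf might look like the sketch below. The partition name example_part and the node list are hypothetical placeholders, not taken from the thread; only the switch from AllowGroups=g2 to AllowAccounts=g2 comes from the message itself.

```
# slurm.conf (sketch; partition name and node list are hypothetical)

# Before: access restricted by Unix group
#PartitionName=example_part Nodes=node[001-010] AllowGroups=g2

# After: access restricted by Slurm account
PartitionName=example_part Nodes=node[001-010] AllowAccounts=g2
```

One plausible reading, not confirmed in this excerpt, is that AllowGroups makes slurmctld resolve group membership through the system name service (here SSSD), while AllowAccounts is checked against Slurm's own accounting data, which would explain why the slowdown disappeared.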

[slurm-users] Slurm 17.11.2: defunct slurmd process leaves a sleep in the step_extern cgroup

2018-01-30 Thread Alessandro Federico
62379 We tried to set up an UnkillableStepProgram to kill the sleep process, but the script is not invoked, we guess because the slurmd is defunct. Any idea? Thanks ale -- Alessandro Federico HPC System Management Group System & Technology Department CINECA www.cineca.it V