Re: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

2018-07-11 Thread Douglas Jacobsen
Applying patches d52d8f4f0 and f07f53fc13 to a Slurm 17.11.7 source tree fixes this issue in my experience. It only requires restarting slurmctld.

Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center
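For anyone who wants to try this before 17.11.8 lands, a minimal sketch of what applying the two commits could look like (the tag name, configure flags, and install prefix below are assumptions; adjust for your site):

    # Check out the 17.11.7 source (tag name assumed; verify with `git tag`)
    git clone https://github.com/SchedMD/slurm.git && cd slurm
    git checkout slurm-17-11-7-1

    # Apply the two fix commits mentioned above
    git cherry-pick d52d8f4f0 f07f53fc13

    # Rebuild and reinstall (configure flags are site-specific)
    ./configure --prefix=/opt/slurm && make -j8 && sudo make install

    # Per Doug's note, only the controller needs a restart
    sudo systemctl restart slurmctld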

Re: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

2018-07-11 Thread Taras Shapovalov
Thank you, guys. Let's wait for 17.11.8. Any estimate for the release date?

Best regards,
Taras

On Wed, Jul 11, 2018 at 12:11 AM Kilian Cavalotti <kilian.cavalotti.w...@gmail.com> wrote:
> On Tue, Jul 10, 2018 at 10:34 AM, Taras Shapovalov wrote:
> > I noticed the commit that can be related to this: …

Re: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

2018-07-10 Thread Kilian Cavalotti
On Tue, Jul 10, 2018 at 10:34 AM, Taras Shapovalov wrote:
> I noticed the commit that can be related to this:
> https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e165bf12e69fe59aa7b222f8d8e

Yes. See also this bug: https://bugs.schedmd.com/show_bug.cgi?id=5240

This commit will be reverted in 17.11.8.
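If you build from source and want to check whether your tree actually contains the offending change, something like this should work (a sketch; it only tests commit ancestry, so a later revert would still show the commit as present):

    # In your Slurm source checkout: is the commit from bug 5240 an ancestor of HEAD?
    git merge-base --is-ancestor \
        bf4cb0b1b01f3e165bf12e69fe59aa7b222f8d8e HEAD \
        && echo "commit present: this tree is affected" \
        || echo "commit absent"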

Re: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

2018-07-10 Thread stolarek.marcin
What is the change in the commit you're thinking about?

-------- Original message --------
From: Taras Shapovalov
Date: 10/07/2018 19:34 (GMT+01:00)
To: slurm-us...@schedmd.com
Subject: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

Hey guys, when we upgraded to 17.11.7, on some clusters all jobs are killed with these messages: …

Re: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

2018-07-10 Thread Roberts, John E.
…"slurm-us...@schedmd.com"
Subject: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

Hey guys, when we upgraded to 17.11.7, on some clusters all jobs are killed with these messages:

slurmstepd: error: Job 374 exceeded memory limit (1308 > 1024), being killed
slurmstepd: error: Exceeded job memory limit …

[slurm-users] DefMemPerCPU is reset to 1 after upgrade

2018-07-10 Thread Taras Shapovalov
Hey guys,

When we upgraded to 17.11.7, on some clusters all jobs are killed with these messages:

slurmstepd: error: Job 374 exceeded memory limit (1308 > 1024), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 374 ON node002 CANCELLED AT 2018-06-28T0…
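As a first diagnostic, it helps to compare what the running controller believes against what slurm.conf says (a sketch; the config path and the 4096 value are only examples):

    # Value the running slurmctld is actually using
    scontrol show config | grep -i mempercpu
    # e.g. DefMemPerCPU = 4096   (a value of 1 reproduces the symptom above)

    # Value configured on disk (default path assumed)
    grep -i '^DefMemPerCPU' /etc/slurm/slurm.conf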