Re: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

2018-07-10 Thread Roberts, John E.
Hi, I ran into this recently after upgrading from 16.05.10 to 17.11.7 and couldn’t run any jobs on any partitions. The only way I got around this was to set this flag on all “NodeName” definitions in slurm.conf: RealMemory=<foo>, where foo is the total memory of the nodes in MB. I believe the documen

[slurm-users] Rebooted Nodes & Jobs Stuck in Cleaning State

2018-10-10 Thread Roberts, John E.
Hi, Hopefully this isn't an obvious fix I'm missing. We have a large number of KNL nodes that can get rebooted when their memory or cluster modes are changed by users. I never heard any complaints when running Slurm v16.05.10, but I've seen a number of issues since our upgrade a couple months

[slurm-users] Update: Rebooted Nodes & Jobs Stuck in Cleaning State

2018-10-15 Thread Roberts, John E.
John On 10/10/18, 4:08 PM, "Roberts, John E." wrote: Hi, Hopefully this isn't an obvious fix I'm missing. We have a large number of KNL nodes that can get rebooted when their memory or cluster modes are changed by users. I never heard any complaints when run

Re: [slurm-users] $TMPDIR does not honor "TmpFS"

2018-11-21 Thread Roberts, John E.
In my experience, TmpFS in slurm.conf hasn’t been honored since at least v16.05.10. When I initially configured Slurm, I noticed this myself. As with the user below, we are also just setting this elsewhere. Thanks! John From: slurm-users on behalf of Shenglong Wang Reply-To: Slurm User Comm

[slurm-users] Disable Account Limits Per Partition?

2018-02-21 Thread Roberts, John E.
Hi, I'm not sure of the best way to solve this and I don't see any obvious things I can set in the configuration. Please let me know if I'm missing something. I have several partitions in Slurm (16.05). I also have many accounts with users tied to them and all of the accounts have a CPU hour li

[slurm-users] Restoring Slurm

2018-04-09 Thread Roberts, John E.
Hi, The documentation is a little unclear to me, so I was wondering how to do a complete backup and restore of Slurm for testing and/or disaster recovery. I'm looking to upgrade Slurm from 16.05.10 to the latest and I'm not sure all of what should go. I stood up some VMs to test this upgrade and m

[slurm-users] New Billing TRES Issue

2018-04-27 Thread Roberts, John E.
Hi, I'm testing the newest version of Slurm and I'm seeing an issue when using the newer billing TRES to charge for cpu time on a partition. I've seen that billing should be used now instead of cpu in order to properly use the "TRESBillingWeights" option on a partition. In my test case, I gav

Re: [slurm-users] New Billing TRES Issue

2018-04-27 Thread Roberts, John E.
want. On Fri, Apr 27, 2018 at 11:21 AM, Roberts, John E. wrote: Hi, I'm testing the newest version of Slurm and I'm seeing an issue when using the newer billing TRES to charge for cpu time on a partition. I've seen that billing should be used

Re: [slurm-users] New Billing TRES Issue

2018-04-30 Thread Roberts, John E.
Hi, Unfortunately that can't be a solution in my running production environment for a number of reasons. I did consider it ( Thanks! -John On 4/30/18, 2:40 AM, "slurm-users on behalf of Bjørn-Helge Mevik" wrote: "Roberts, John E." writes: > So no

[slurm-users] Can't run jobs after upgrade to 17.11.5 due to memory?

2018-06-11 Thread Roberts, John E.
Hi, Seeing this after an upgrade today. I now can't get any jobs to run. Things were fine before the upgrade. Any ideas? slurmstepd: error: Job 535721 exceeded memory limit (1160 > 1024), being killed slurmstepd: error: Exceeded job memory limit ulimit shows: $ u

Re: [slurm-users] Can't run jobs after upgrade to 17.11.5 due to memory?

2018-06-11 Thread Roberts, John E.
Renfro, Michael" wrote: Anything in particular set for DefMemPerCPU in your slurm.conf? > On Jun 11, 2018, at 3:50 PM, Roberts, John E. wrote: > > Hi, > >Seeing this after an upgrade today. I now can't get any jobs to run. Thing

Re: [slurm-users] Can't run jobs after upgrade to 17.11.5 due to memory?

2018-06-11 Thread Roberts, John E.
n On 6/11/18, 4:12 PM, "Roberts, John E." wrote: Nothing I assume isn't correct: DefMemPerNode = UNLIMITED MaxMemPerNode = UNLIMITED MemLimitEnforce = Yes PropagateResourceLimitsExcept = MEMLOCK CPU vars aren't