[slurm-users] Backfill CPU jobs on GPU nodes

2019-07-18 Thread Daniel Vecerka
Dear all, we are using SLURM 18.08.6. We have 12 nodes with 4 GPUs each and 21 CPU-only nodes. We have 3 partitions: gpu (only GPU nodes), cpu (only CPU nodes) and longjobs (all nodes). Jobs in longjobs have the lowest priority and can be preempted by suspension. Our goal is to allow using GP

Re: [slurm-users] sacct thinks slurmctld is not up

2019-07-18 Thread Riebs, Andy
Brian, FWIW, we just restart slurmctld when this happens. I’ll be interested to hear if there’s a proper fix. Andy From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Brian Andrus Sent: Thursday, July 18, 2019 11:01 AM To: Slurm User Community List Subject: [slurm-use

[slurm-users] sacct thinks slurmctld is not up

2019-07-18 Thread Brian Andrus
All, I have slurmdbd running and everything is (mostly) happy. It's been working well for months, but fairly regularly, when I run 'sacctmgr show runaway jobs', I get: "sacctmgr: error: Slurmctld running on cluster orion is not up, can't check running jobs". If I run 'sacctmgr show cluster', it lis