Dear all,
we are using SLURM 18.08.6. We have 12 nodes with 4 GPUs each and 21
CPU-only nodes. We have 3 partitions:
gpu: only GPU nodes,
cpu: only CPU nodes,
longjobs: all nodes.
Jobs in longjobs have the lowest priority and can be preempted by being
suspended. Our goal is to allow using GP
Brian, FWIW, we just restart slurmctld when this happens. I’ll be interested to
hear if there’s a proper fix.
Andy
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Brian Andrus
Sent: Thursday, July 18, 2019 11:01 AM
To: Slurm User Community List
Subject: [slurm-use
All,
I have slurmdbd running and everything is (mostly) happy. It's been working
well for months, but fairly regularly, when I do 'sacctmgr show runaway
jobs', I get:
*sacctmgr: error: Slurmctld running on cluster orion is not up, can't check
running jobs*
If I do 'sacctmgr show cluster', it lis