[slurm-users] sacct thinks slurmctld is not up
All,

I have slurmdbd running and everything is (mostly) happy. It has been working well for months, but fairly regularly, when I run 'sacctmgr show runaway jobs', I get:

    sacctmgr: error: Slurmctld running on cluster orion is not up, can't check running jobs

If I run 'sacctmgr show cluster', it lists the cluster, but the ControlHost field is empty. slurmctld is most definitely running (on the same system, even), but the only fix I have found is to restart slurmctld. After that, ControlHost shows an IP again and I am able to check for runaway jobs.

Is this a known issue? Is there a better fix than restarting slurmctld?

Brian Andrus
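For reference, the symptom described above can be checked with standard sacctmgr and scontrol calls; the commands below are only an illustrative sketch (the cluster name 'orion' is taken from the post, and the output fields shown may vary by Slurm version):

    # Show how slurmdbd sees the controller; an empty ControlHost
    # matches the symptom described above.
    sacctmgr show cluster orion format=Cluster,ControlHost,ControlPort,RPC

    # Ask the controller directly; if this responds, slurmctld itself is up
    # and the problem is only its registration with slurmdbd.
    scontrol ping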
Re: [slurm-users] sacct thinks slurmctld is not up
Brian,

FWIW, we just restart slurmctld when this happens. I’ll be interested to hear if there’s a proper fix.

Andy
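The restart workaround Andy mentions amounts to something like the following, assuming a systemd-managed installation (the unit name may differ on your system):

    # Restart the controller so it re-registers with slurmdbd,
    # then confirm ControlHost is populated again.
    systemctl restart slurmctld
    sacctmgr show cluster orion format=Cluster,ControlHost,ControlPort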
[slurm-users] Backfill CPU jobs on GPU nodes
Dear all,

We are using Slurm 18.08.6 with 12 nodes that have 4 x GPUs each and 21 CPU-only nodes. We have 3 partitions:

    gpu: GPU nodes only
    cpu: CPU-only nodes
    longjobs: all nodes

Jobs in longjobs have the lowest priority and can be preempted to suspend. Our goal is to allow GPU nodes to also be used for backfilled CPU jobs. The problem is CPU jobs that require a lot of memory: they can block queued GPU jobs, because suspended jobs do not release their memory, so GPU jobs will not start even when free GPUs are available.

My question is: is there any partition or node option that limits TRES memory, but only on specific nodes? That way, jobs in the longjobs partition with high memory requirements would start only on CPU nodes, while GPU nodes would run only GPU jobs (without a memory limit) and CPU jobs below the memory limit. Put differently: is there any way to reserve some memory on the GPU nodes only for jobs in the gpu partition, so that it cannot be used by jobs in the longjobs partition?

Thanks in advance,

Daniel Vecerka, CTU Prague
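As far as I know, partition-level limits such as MaxMemPerNode apply to every node in the partition, so capping memory only on some nodes of longjobs would need a separate partition. One possible workaround is to split longjobs into two overlapping partitions and cap memory only on the GPU-node partition. A minimal, untested slurm.conf sketch; node ranges, memory values, and partition names here are hypothetical:

    # slurm.conf fragment - a hedged sketch, not a tested configuration.
    # Node names and the 64 GB cap below are made up for illustration.

    # Low-priority, preempt-to-suspend partition covering only the CPU nodes:
    # no extra memory cap needed here.
    PartitionName=longjobs-cpu Nodes=cpu[01-21] PriorityTier=1 PreemptMode=SUSPEND

    # Low-priority partition covering the GPU nodes, with a per-node memory
    # cap so suspended CPU jobs cannot pin all the RAM that GPU jobs need.
    PartitionName=longjobs-gpu Nodes=gpu[01-12] PriorityTier=1 PreemptMode=SUSPEND MaxMemPerNode=65536

    # High-priority GPU partition, no memory cap.
    PartitionName=gpu Nodes=gpu[01-12] PriorityTier=10

Users could then submit long-running CPU jobs with '-p longjobs-cpu,longjobs-gpu' (sbatch accepts a comma-separated partition list), and Slurm will place the job in whichever partition can run it, keeping high-memory jobs off the GPU nodes via the cap.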