[slurm-users] sacct thinks slurmctld is not up

2019-07-18 Thread Brian Andrus
All,

I have slurmdbd running and everything is (mostly) happy. It's been working
well for months, but fairly regularly, when I do 'sacctmgr show runaway
jobs', I get:

sacctmgr: error: Slurmctld running on cluster orion is not up, can't check
running jobs

If I do 'sacctmgr show cluster', it lists the cluster but there is no IP in
the ControlHost field.

slurmctld is most definitely running (on the same system, even), but the only
fix I have found is to restart slurmctld. After that, the ControlHost field
shows an IP and I am able to check for runaway jobs.
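
For reference, the kind of checks involved look roughly like this (the format
fields are just the ones that seem relevant here; adjust as needed):

    # What slurmdbd has registered for the cluster
    sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC

    # Whether slurmctld itself answers
    scontrol ping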

Is this a known issue? Is there a better fix than restarting slurmctld?

Brian Andrus


Re: [slurm-users] sacct thinks slurmctld is not up

2019-07-18 Thread Riebs, Andy
Brian, FWIW, we just restart slurmctld when this happens. I’ll be interested to 
hear if there’s a proper fix.

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Brian Andrus
Sent: Thursday, July 18, 2019 11:01 AM
To: Slurm User Community List 
Subject: [slurm-users] sacct thinks slurmctld is not up



[slurm-users] Backfill CPU jobs on GPU nodes

2019-07-18 Thread Daniel Vecerka

Dear all,

We are using SLURM 18.08.6 and have 12 nodes with 4 x GPUs each and 21
CPU-only nodes. We have 3 partitions:

  gpu: GPU nodes only
  cpu: CPU nodes only
  longjobs: all nodes

Jobs in longjobs have the lowest priority and can be preempted to suspend.
Our goal is to also allow GPU nodes to be used for backfilling CPU jobs. The
problem is CPU jobs that require a lot of memory: they can block GPU jobs in
the queue, because suspended jobs do not release their memory, so GPU jobs
will not start even though free GPUs are available.


My question is: is there a partition or node option that limits TRES memory,
but only on specific nodes? That way, jobs in the longjobs partition with
high memory requirements would start only on CPU nodes, while GPU nodes would
run only GPU jobs (without a memory limit) and CPU jobs below the memory
limit.


Or, put differently: is there a way to reserve some memory on the GPU nodes
for jobs in the gpu partition only, so that it cannot be used by jobs in the
longjobs partition?
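
For illustration, the closest workaround I can think of is splitting
longjobs into two overlapping partitions, so that a MaxMemPerNode cap
applies only to the GPU half. A rough slurm.conf sketch (node names and
memory figures are placeholders, not our real config):

    NodeName=gpu[01-12] Gres=gpu:4 RealMemory=192000
    NodeName=cpu[01-21] RealMemory=192000

    PartitionName=gpu          Nodes=gpu[01-12] PriorityTier=10
    PartitionName=cpu          Nodes=cpu[01-21] PriorityTier=10
    # low-priority backfill, split so the memory cap hits only the GPU nodes
    PartitionName=longjobs     Nodes=cpu[01-21] PriorityTier=1 PreemptMode=SUSPEND
    PartitionName=longjobs-gpu Nodes=gpu[01-12] PriorityTier=1 PreemptMode=SUSPEND MaxMemPerNode=64000

Users could then submit backfill jobs with --partition=longjobs,longjobs-gpu
so Slurm picks whichever partition can start them first, but that still does
not give a single per-node memory limit inside one partition, which is what
I am really after.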


Thanks in advance,
Daniel Vecerka, CTU Prague




