I'm setting up an EC2 SLURM cluster and when an instance doesn't resume fast
enough I get an error like:
node c7-c5-24xl-464 not resumed by ResumeTimeout(600) - marking down and power_save
I keep running into issues where my cloud nodes do not show up in sinfo and I
can't display their informa
down, etc.
Thanks for the help. I think it will solve the issues I’m having.
From: Kirill 'kkm' Katsnelson [mailto:k...@pobox.com]
Sent: Friday, February 28, 2020 5:56 AM
To: Slurm User Community List
Cc: Carter, Allan
Subject: Re: [slurm-users] How to show state of CLOUD nodes
I
I'm perplexed. My cluster has been churning along and tonight it has decided to
start pending jobs even though there are plenty of nodes available.
An example job from squeue:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
409978 interactiver
If a job is pending only because it needs a license and all are being used, can
it preempt jobs in a lower priority partition that are using the license? Or
does preemption only work for compute resources? I've tried to configure
preemption, but when I submit a job that used my only license and