It's the association (account) limit. The problem is that lower
priority jobs were backfilling (even with the builtin scheduler) around
this larger job, preventing it from running.
I have found what looks like the solution: I need to switch to the builtin
scheduler and add "assoc_limit_stop" to SchedulerParameters.
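A minimal sketch of what that change could look like in slurm.conf (illustrative only, not necessarily the exact configuration in use here):

  SchedulerType=sched/builtin
  # assoc_limit_stop: if a job cannot start because of an association/QOS limit,
  # do not start any lower-priority jobs in that partition around it
  SchedulerParameters=assoc_limit_stop

A change in SchedulerType requires restarting slurmctld to take effect.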
On 28/2/19 7:29 am, Michael Gutteridge wrote:
2221670 largenode sleeper. me PD N/A 1 (null) (AssocGrpCpuLimit)
That says the job exceeds some policy limit you have set and so is not
permitted to start; it looks like you've got a limit on the number of cores
the association can use.
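One way to see which association limit is involved (the user name here is a placeholder):

  sacctmgr show assoc where user=<user> format=cluster,account,user,grptres,maxjobs

An AssocGrpCpuLimit reason typically corresponds to a GrpTRES=cpu=... (or older GrpCPUs) limit set on the association.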
sprio --long shows:
JOBID    PARTITION  USER   PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE  TRES
...
2203317  largenode  alice  110       10   0          0        0          100  0
2203318  largenode  alice  110       10   0          0
> You might want to look at the BatchStartTimeout parameter
I've got that set to 300 seconds. Every so often one node here and there
won't start and gets "ResumeTimeoutExceeded", but we're not seeing those
associated with this situation (i.e. nothing in that state in this
particular partition).
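A quick way to double-check the relevant power-save timers on a running cluster:

  scontrol show config | grep -iE 'batchstarttimeout|resumetimeout|suspendtime'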
On Wednesday, 27 February 2019 1:08:56 PM PST Michael Gutteridge wrote:
> Yes, we do have time limits set on partitions: 7 days maximum, 3 days
> default. In this case, the larger job is requesting 3 days of walltime,
> the smaller jobs are requesting 7.
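For context, limits like those usually come from the partition definition in slurm.conf; a sketch using the values mentioned above (all other partition options omitted):

  PartitionName=largenode DefaultTime=3-00:00:00 MaxTime=7-00:00:00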
It sounds like no forward reservation is being made for the large job.
I am not very familiar with the Slurm power saving stuff. You might want
to look at the BatchStartTimeout parameter (see e.g.
https://slurm.schedmd.com/power_save.html).
Otherwise, what state are the power-saving powered-down nodes in when
powered down? From the man pages it sounds like they should be idle.
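For reference, the knobs that page describes all live in slurm.conf; a hypothetical sketch (paths and values are placeholders, apart from BatchStartTimeout=300 mentioned above):

  SuspendTime=600                       # seconds idle before a node is powered down
  SuspendProgram=/opt/slurm/suspend.sh  # placeholder script that powers nodes off
  ResumeProgram=/opt/slurm/resume.sh    # placeholder script that powers nodes on
  ResumeTimeout=300                     # max seconds for a resumed node to come back up
                                        # (exceeding it gives the ResumeTimeoutExceeded state seen above)
  BatchStartTimeout=300                 # how long to wait for the batch job to start
                                        # running on the resumed node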
> You have not provided enough information (cluster configuration, job
> information, etc) to diagnose what accounting policy is being violated.
Yeah, sorry. I'm trying to balance the amount of information and probably
skewed too far towards concise 8-/
The partition looks like:
PartitionName=largenode
Allo
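The full, effective settings for that partition can be dumped on a live cluster with:

  scontrol show partition largenode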
...will be able to finish before the job that requires infinite time.
>
> Andy
>
> --
> *From:* Michael Gutteridge
>
> *Sent:* Wednesday, February 27, 2019 3:29 PM
> *To:* Slurm User Community List
>
> *Cc:*
> *Subject:* [slurm-users] Large job starvation on cloud cluster
The "JobId=2210784 delayed for accounting policy is likely the key as it
indicates the job is currently unable to run, so the lower priority smaller
job bumps ahead of it.
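That message comes from the slurmctld log; one way to look for it (the path is only an example, the real location is whatever SlurmctldLogFile is set to):

  grep 'delayed for accounting policy' /var/log/slurm/slurmctld.log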
You have not provided enough information (cluster configuration, job
information, etc) to diagnose what accounting policy is being violated.
...remaining nodes will be able to finish before the job that requires
infinite time.
Andy
*From:* Michael Gutteridge
*Sent:* Wednesday, February 27, 2019 3:29 PM
*To:* Slurm User Community List
*Cc:*
*Subject:* [slurm-users] Large job starvation on cloud cluster
I've run into a problem with a cluster we've got in a cloud provider;
hoping someone might have some advice.
The problem is that I've got a circumstance where large jobs _never_
start... or more correctly, that large-er jobs don't start when there are
many smaller jobs in the partition. In this c