It's the association (account) limit. The problem was that lower-priority
jobs were backfilling (even with the builtin scheduler) around this larger
job, preventing it from running.
I have found what looks like the solution. I need to switch to the builtin
scheduler and add "assoc_limit_stop" t
On 28/2/19 7:29 am, Michael Gutteridge wrote:
2221670 largenode sleeper. me PD N/A 1 (null) (AssocGrpCpuLimit)
That says the job exceeds some policy limit you have set and so is not
permitted to start. It looks like you've got a limit on the number of cor
sprio --long shows:
JOBID   PARTITION USER  PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS NICE TRES
...
2203317 largenode alice      110  10         0       0         0 100    0
2203318 largenode alice      110  10         0       0
> You might want to look at BatchStartTimeout Parameter
I've got that set to 300 seconds. Every so often one node here and there
won't start and gets "ResumeTimeoutExceeded", but we're not seeing those
associated with this situation (i.e. nothing in that state in this
particular partition).
> wha