By getting "stuck", do you mean the job stays PENDING forever, or does it eventually run? I've seen the latter (and I agree with you: I wish Slurm would log things like "I looked at this job and I am not starting it yet because...") but not the former.
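
If it really is the former, a few things I'd poke at first. This is just a rough sketch (the <jobid> and <partition> bits are placeholders for your own values), not something I've checked against your particular config:

    # Ask the backfill scheduler for its estimated start time for the job
    # (also re-reports the pending Reason):
    squeue -j <jobid> --start

    # Show the priority factors for the stuck job, then list the other
    # pending jobs in the same partition sorted by priority (highest first)
    # to see what could plausibly be ahead of it:
    sprio -j <jobid>
    squeue -p <partition> -t PD -o "%.10i %.10Q %.20R" --sort=-p

    # Scheduler and backfill statistics: cycle times, backfill depth
    # reached per cycle, last cycle timestamp, etc.
    sdiag

sprio should at least tell you where the stuck job sits relative to everything else that's pending, and the backfill section of sdiag will show whether the backfill scheduler is actually getting deep enough into the queue to consider it at all.
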
On Fri, Dec 8, 2023 at 9:00 AM Pacey, Mike <m.pa...@lancaster.ac.uk> wrote:
> Hi folks,
>
> I’m looking for some advice on how to troubleshoot jobs we occasionally
> see on our cluster that are stuck in a pending state despite sufficient
> matching resources being free. In the case I’m trying to troubleshoot, the
> Reason field lists (Priority), but I can’t find any way to get the
> scheduler to tell me exactly which higher-priority job is blocking it.
>
> - I tried setting the scheduler log level to debug3 for 5 minutes at one
>   point, but my logfile ballooned from 0.5G to 1.5G and didn’t offer any
>   useful info for this case.
> - I’ve tried ‘scontrol schedloglevel 1’, but it returns the error:
>   ‘slurm_set_schedlog_level error: Requested operation is presently disabled’
>
> I’m aware that the backfill scheduler will occasionally hold on to free
> resources in order to schedule a larger job with higher priority, but in
> this case I can’t find any pending job that might fit the bill.
>
> And to possibly complicate matters, this is on a large partition that has
> no maximum time limit, and most pending jobs have no time limits either.
> (We use backfill/fairshare as we have smaller partitions of rarer
> resources that benefit from it, plus we’re aiming to use fairshare even
> on the no-time-limits partitions to help balance out usage.)
>
> Hoping someone can provide pointers.
>
> Regards,
>
> Mike