By getting "stuck", do you mean the job stays PENDING forever, or does it eventually run? I've seen the latter (and I agree with you: I wish Slurm would log things like "I looked at this job and I am not starting it yet because...") but not the former.
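
If it really is the former, a few things I'd poke at first. This is just a rough sketch (the <jobid> and <partition> bits are placeholders for your own values), not something I've checked against your particular config:

    # Ask the backfill scheduler for its estimated start time for the job
    # (also re-reports the pending Reason):
    squeue -j <jobid> --start

    # Show the priority factors for the stuck job, then list the other
    # pending jobs in the same partition sorted by priority (highest first)
    # to see what could plausibly be ahead of it:
    sprio -j <jobid>
    squeue -p <partition> -t PD -o "%.10i %.10Q %.20R" --sort=-p

    # Scheduler and backfill statistics: cycle times, backfill depth
    # reached per cycle, last cycle timestamp, etc.
    sdiag

sprio should at least tell you where the stuck job sits relative to everything else that's pending, and the backfill section of sdiag will show whether the backfill scheduler is actually getting deep enough into the queue to consider it at all.
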
On Fri, Dec 8, 2023 at 9:00 AM Pacey, Mike <m.pa...@lancaster.ac.uk> wrote:
> Hi folks,
>
> I’m looking for some advice on how to troubleshoot jobs we occasionally
> see on our cluster that are stuck in a pending state despite sufficient
> matching resources being free. In the case I’m trying to troubleshoot, the
> Reason field lists (Priority), but I can’t find any way to get the
> scheduler to tell me exactly which higher-priority job is blocking it.
>
> - I tried setting the scheduler log level to debug3 for 5 minutes at one
>   point, but my logfile ballooned from 0.5G to 1.5G and didn’t offer any
>   useful info for this case.
> - I’ve tried ‘scontrol schedloglevel 1’, but it returns the error:
>   ‘slurm_set_schedlog_level error: Requested operation is presently disabled’
>
> I’m aware that the backfill scheduler will occasionally hold on to free
> resources in order to schedule a larger job with higher priority, but in
> this case I can’t find any pending job that might fit the bill.
>
> And to possibly complicate matters, this is on a large partition that has
> no maximum time limit, and most pending jobs have no time limits either.
> (We use backfill/fairshare as we have smaller partitions of rarer
> resources that benefit from it, plus we’re aiming to use fairshare even
> on the no-time-limits partitions to help balance out usage.)
>
> Hoping someone can provide pointers.
>
> Regards,
>
> Mike