I am not a Slurm expert by any stretch of the imagination, so my answer is not authoritative.
That said, I am not aware of any functional equivalent for Slurm, and I would love to learn that I am mistaken!

On Tue, Dec 12, 2023 at 1:39 AM Pacey, Mike <m.pa...@lancaster.ac.uk> wrote:

> Hi Davide,
>
> The jobs do eventually run, but can take several minutes or sometimes
> several hours to switch to a running state even when there's plenty of
> resources free immediately.
>
> With Grid Engine it was possible to turn on scheduling diagnostics and get
> a summary of the scheduler's decisions on a pending job by running "qstat
> -j jobid". But there doesn't seem to be any functional equivalent with SLURM?
>
> Regards,
> Mike
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Davide DelVento
> Sent: Monday, December 11, 2023 4:23 PM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: [External] Re: [slurm-users] Troubleshooting job stuck in Pending state
>
> By getting "stuck" do you mean the job stays PENDING forever, or does it
> eventually run? I've seen the latter (and I agree with you that I wish
> Slurm would log things like "I looked at this job and I am not starting it
> yet because...") but not the former.
>
> On Fri, Dec 8, 2023 at 9:00 AM Pacey, Mike <m.pa...@lancaster.ac.uk> wrote:
>
> Hi folks,
>
> I'm looking for some advice on how to troubleshoot jobs we occasionally
> see on our cluster that are stuck in a pending state despite sufficient
> matching resources being free. In the case I'm trying to troubleshoot, the
> Reason field lists (Priority), but I can't find any way to get the
> scheduler to tell me exactly which higher-priority job is blocking it.
>
> - I tried setting the scheduler log level to debug3 for 5 minutes at one
>   point, but my logfile ballooned from 0.5G to 1.5G and didn't offer any
>   useful info for this case.
> - I've tried 'scontrol schedloglevel 1', but it returns the error:
>   'slurm_set_schedlog_level error: Requested operation is presently disabled'
>
> I'm aware that the backfill scheduler will occasionally hold on to free
> resources in order to schedule a larger job with higher priority, but in
> this case I can't find any pending job that might fit the bill.
>
> And to possibly complicate matters, this is on a large partition that has
> no maximum time limit, and most pending jobs have no time limits either.
> (We use backfill/fairshare as we have smaller partitions of rarer resources
> that benefit from it, plus we're aiming to use fairshare even on the
> no-time-limits partitions to help balance out usage.)
>
> Hoping someone can provide pointers.
>
> Regards,
> Mike
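
P.S. While there's no single equivalent to "qstat -j" that I know of, a few commands
give partial views of why a job is still pending. This is a sketch from memory rather
than a tested recipe, and <jobid> is just a placeholder:

    scontrol show job <jobid>          # full job record, including the Reason field
    squeue -j <jobid> --start          # the scheduler's current estimate of the start time, if any
    sprio -j <jobid>                   # breakdown of the job's priority factors (age, fairshare, etc.)
    sdiag                              # main and backfill scheduler statistics since the last reset
    scontrol setdebugflags +Backfill   # log backfill decisions to slurmctld.log (verbose; undo with -Backfill)

I also believe "scontrol schedloglevel" only works when SlurmSchedLogFile is configured
in slurm.conf, which may be why it reports the operation as presently disabled, but I
could be wrong about that.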