You don't need to be a subscriber to search bugs.schedmd.com.

On Tue, Dec 10, 2024 at 9:44 AM Davide DelVento via slurm-users
<slurm-users@lists.schedmd.com> wrote:
>
> Good sleuthing.
>
> It would be nice if Slurm said something like
> Reason=Priority_Lower_Than_Job_XXXX so people could immediately find the
> culprit in such situations. Has anybody with a SchedMD subscription ever
> asked for something like that, or is there some reason why that information
> would be impossible (or too hard) to gather programmatically?
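>
> In the meantime, a rough workaround (untested, and squeue's exact format
> codes are from memory) is to list the pending jobs of the same partition
> sorted by priority and see who is ahead:
>
>   squeue -t PD -p m3 --sort=-p -o "%.10i %.10Q %.10q %.10u %r"
>
> where %Q is the raw priority, %q the QOS and %r the pending reason.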
>
> On Tue, Dec 10, 2024 at 1:09 AM Diego Zuccato via slurm-users 
> <slurm-users@lists.schedmd.com> wrote:
>>
>> Found the problem: another job was blocking access to the reservation.
>> The strangest thing is that the node (gpu03) has always been reserved
>> for a project, and the blocking job did not explicitly request it (and
>> even if it had, it would have been denied access), yet its state was:
>>     JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:gpu03 Dependency=(null)
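>>
>> Something like this makes that kind of blocker easy to spot (the -O field
>> widths are just a guess):
>>
>>   squeue -t PD -O JobID:12,UserName:12,QOS:12,Reason:60 | grep gpu03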
>>
>> Color me surprised...
>>
>> Diego
>>
>> Il 07/12/2024 10:03, Diego Zuccato via slurm-users ha scritto:
>> > Ciao Davide.
>> >
>> > Il 06/12/2024 16:42, Davide DelVento ha scritto:
>> >
>> >> I find it extremely hard to understand situations like this. I wish
>> >> Slurm were clearer about how it reports what it is doing, but I
>> >> digress...
>> > I agree. A "scontrol explain" command could be really useful to pinpoint
>> > the cause :)
>> >
>> >> I suspect there are other jobs with higher priority than this one
>> >> that are supposed to run on that node but cannot start, perhaps
>> >> because those high-priority jobs need several nodes and the other
>> >> nodes are not available at the moment?
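>> >>
>> >> If that's the case, something like this (again untested, format codes
>> >> from memory) should show those multi-node, higher-priority jobs, with
>> >> %Q the priority and %D the requested node count:
>> >>
>> >>   squeue -t PD --sort=-p -o "%.10i %.10Q %.6D %.30r"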
>> > That partition is a single node, and it's IDLE. If another job needed
>> > it, the node would be in PLANNED state (IIRC).
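>> > (A quick way to double-check that, if I remember sinfo's format codes
>> > right:
>> >   sinfo -n mtx20 -o "%12n %10t %C"
>> > which prints the node name, its state and the allocated/idle/other/total
>> > CPU counts.)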
>> >
>> >> Pure speculation, obviously, since I have no idea what the rest of
>> >> your cluster looks like or what the rest of the workflow is, but the
>> >> clue/hint is
>> >>
>> >>  > JobState=PENDING Reason=Priority Dependency=(null)
>> >>
>> >> You are pending because something else has higher priority. Going back
>> >> to my first sentence, I wish Slurm would say which other job (maybe
>> >> there is more than one, but one would suffice for this investigation)
>> >> is trumping this job's priority, so one could more clearly understand
>> >> what is going on without sleuthing.
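>> >>
>> >> (sprio at least shows the priority breakdown of every pending job, so
>> >> the candidates can be compared by hand; the long format lists all the
>> >> weight components:
>> >>
>> >>   sprio -l
>> >>   sprio -l -j 113936
>> >>
>> >> but it still doesn't name the competing job.)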
>> > Couldn't agree more :) The scheduler is quite opaque in its decisions. :(
>> >
>> > Actually, the job the user submitted is not starting and has
>> > Reason=PartitionConfig. But QoS 'debug' (the one I'm using for testing)
>> > does have a higher priority (1000) than QoS 'long' (10, IIRC).
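>> >
>> > (The QoS priorities can be double-checked with
>> >   sacctmgr show qos format=Name,Priority,Preempt
>> > in case I'm misremembering the values.)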
>> >
>> > Diego
>> >
>> >> On Fri, Dec 6, 2024 at 7:36 AM Diego Zuccato via slurm-users
>> >> <slurm-users@lists.schedmd.com> wrote:
>> >>
>> >>     Hello all.
>> >>     A user reported that a job wasn't starting, so I tried to replicate
>> >>     the request, and I get:
>> >>     -8<--
>> >>     [root@ophfe1 root.old]# scontrol show job 113936
>> >>     JobId=113936 JobName=test.sh
>> >>          UserId=root(0) GroupId=root(0) MCS_label=N/A
>> >>          Priority=1 Nice=0 Account=root QOS=long
>> >>          JobState=PENDING Reason=Priority Dependency=(null)
>> >>          Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>> >>          RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
>> >>          SubmitTime=2024-12-06T13:19:36 EligibleTime=2024-12-06T13:19:36
>> >>          AccrueTime=2024-12-06T13:19:36
>> >>          StartTime=Unknown EndTime=Unknown Deadline=N/A
>> >>          SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-12-06T13:21:32 Scheduler=Backfill:*
>> >>          Partition=m3 AllocNode:Sid=ophfe1:855189
>> >>          ReqNodeList=(null) ExcNodeList=(null)
>> >>          NodeList=
>> >>          NumNodes=1-1 NumCPUs=96 NumTasks=96 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>> >>          TRES=cpu=96,mem=95000M,node=1,billing=1296
>> >>          Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> >>          MinCPUsNode=1 MinMemoryNode=95000M MinTmpDiskNode=0
>> >>          Features=(null) DelayBoot=00:00:00
>> >>          OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>> >>          Command=/home/root.old/test.sh
>> >>          WorkDir=/home/root.old
>> >>          StdErr=/home/root.old/%N-%J.err
>> >>          StdIn=/dev/null
>> >>          StdOut=/home/root.old/%N-%J.out
>> >>          Power=
>> >>
>> >>
>> >>     [root@ophfe1 root.old]# scontrol sho partition m3
>> >>     PartitionName=m3
>> >>          AllowGroups=ALL DenyAccounts=formazione AllowQos=ALL
>> >>          AllocNodes=ALL Default=NO QoS=N/A
>> >>          DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>> >>          MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>> >>          Nodes=mtx20
>> >>          PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>> >>          OverTimeLimit=NONE PreemptMode=CANCEL
>> >>          State=UP TotalCPUs=192 TotalNodes=1 SelectTypeParameters=CR_SOCKET_MEMORY
>> >>          JobDefaults=(null)
>> >>          DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>> >>          TRES=cpu=192,mem=1150000M,node=1,billing=2592
>> >>          TRESBillingWeights=CPU=13.500,Mem=2.2378G
>> >>
>> >>     [root@ophfe1 root.old]# scontrol show node mtx20
>> >>     NodeName=mtx20 Arch=x86_64 CoresPerSocket=24
>> >>          CPUAlloc=0 CPUEfctv=192 CPUTot=192 CPULoad=0.00
>> >>          AvailableFeatures=ib,matrix,intel,avx
>> >>          ActiveFeatures=ib,matrix,intel,avx
>> >>          Gres=(null)
>> >>          NodeAddr=mtx20 NodeHostName=mtx20 Version=22.05.6
>> >>          OS=Linux 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 14:48:47 UTC 2022
>> >>          RealMemory=1150000 AllocMem=0 FreeMem=1156606 Sockets=4 Boards=1
>> >>          MemSpecLimit=2048
>> >>          State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=8 Owner=N/A MCS_label=N/A
>> >>          Partitions=m3
>> >>          BootTime=2024-12-06T10:01:42 SlurmdStartTime=2024-12-06T10:02:54
>> >>          LastBusyTime=2024-12-06T10:51:58
>> >>          CfgTRES=cpu=192,mem=1150000M,billing=2592
>> >>          AllocTRES=
>> >>          CapWatts=n/a
>> >>          CurrentWatts=0 AveWatts=0
>> >>          ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>> >>
>> >>     -8<--
>> >>
>> >>     So the node is free and the partition does not impose extra limits
>> >>     (it's used only for accounting factors), but the job does not start.
>> >>
>> >>     Any hints?
>> >>
>> >>     Tks
>> >>
>> >>     --
>> >>     Diego Zuccato
>> >>     DIFA - Dip. di Fisica e Astronomia
>> >>     Servizi Informatici
>> >>     Alma Mater Studiorum - Università di Bologna
>> >>     V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
>> >>     tel.: +39 051 20 95786
>> >>
>> >>
>> >>
>> >
>>
>> --
>> Diego Zuccato
>> DIFA - Dip. di Fisica e Astronomia
>> Servizi Informatici
>> Alma Mater Studiorum - Università di Bologna
>> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
>> tel.: +39 051 20 95786
>>
>>
>
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
