date:20240219

[slurm-users] slurmdbd 17.02: "cluster not registered" (but things work)

2024-02-19 Thread Matthias Leopold via slurm-users


Hi,

I need to take care of a Slurm 17.02 cluster (I'm preparing it for
upgrades). I see that slurmdbd logs various "cluster not registered"
messages at startup (DBD_CLUSTER_TRES, DBD_JOB_START, DBD_STEP_START), but
I don't see a real problem: accounting works. Do I have to worry? Could
this be related to upper/lower case issues with ClusterName? For some
reason ClusterName is all upper case in slurm.conf. According to the docs
(and my experience with Slurm 21.08), though, this should be OK.


Thanks for your help
Matthias


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Slurm Power Saving Guide: Why doesnt slurm mark as failed when resumeProgram returns =/= 0

2024-02-19 Thread Xaver Stiensmeier via slurm-users


Dear slurm-user list,

I have had cases where our ResumeProgram failed due to temporary cloud
timeouts. In those cases the ResumeProgram returns a non-zero exit code.
Why does Slurm still wait until ResumeTimeout expires instead of treating
the startup as failed, which should then lead to the job being rescheduled?

Is there some way to achieve the described effect, i.e. to tell Slurm "You
can stop waiting, the node won't come alive", or am I missing the
correct way to handle this in Slurm?
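
The effect asked about above can be illustrated with a minimal sketch. It is
not an answer confirmed in this thread: it assumes that a resume script may
itself mark the node DOWN with `scontrol update` when the cloud call fails,
and that Slurm then stops waiting for the node. Whether the job is then
requeued depends on the site configuration and is untested here. The names
`resume_node`, `mark_down_command` and `start_instance` are hypothetical.

```python
import subprocess


def mark_down_command(node, reason):
    """Build the scontrol call that sets a node (or hostlist) to DOWN."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DOWN", f"Reason={reason}"]


def resume_node(node, start_instance, run=subprocess.run):
    """Try to start `node`; on failure mark it DOWN and return non-zero.

    `start_instance` is a hypothetical callable wrapping the cloud API.
    `run` is injectable so the logic can be exercised without Slurm.
    """
    try:
        start_instance(node)
    except Exception as exc:
        # Assumption: marking the node DOWN ends the wait for ResumeTimeout.
        run(mark_down_command(node, f"resume failed: {exc}"), check=False)
        return 1
    return 0
```

In a real resume script, `node` would be the argument Slurm passes to the
program, handed through unchanged, and the return value would become the
script's exit code.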

Best regards,
Xaver



[slurm-users] CPU utilisation using two commands scontrol association and sreport makes a huge difference

2024-02-19 Thread prachikakade.lit--- via slurm-users

Dear Team,
I created a small cluster of 3 nodes on VMware to work on the CPU
utilization concept.
I created a user named hpcuser01 and set GrpTRESMins=cpu=5940 (CPU
minutes) and gpu=0.

Now, when I checked his utilization using the scontrol association command:
# scontrol show ass user=hpcuser01 | grep -n15 hpcuser01
Output:
QOS=hpcuser01_test(11)
UsageRaw=5789263754.153907
GrpJobs=N(0) GrpJobsAccrue=N(5) GrpSubmitJobs=N(5) GrpWall=N(1188290.27)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
GrpTRESMins=cpu=5940(59001152),mem=N(293429427523),energy=N(0),node=N(1993778),billing=N(96073808),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=1(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=46080(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0)

Here we can see that GrpTRESMins=cpu=5940(59001152). Converted to hours
(value / 60), this is 99(983352) CPU hours. This means that 983352 CPU
hours have been used by hpcuser01.

But when I checked his utilization using the sreport command:
# sreport -t hour --tres=cpu,gres/gpu,billing cluster AccountUtilizationByUser 
user=hpcuser01 start=2021-01-01T00:00:00 end=2024-02-20T23:59:59 -P | grep -iw 
-e billing -e cpu -e gres/gpu
Output:
 cpu|979403
 billing|1594902 
 gres/gpu|0

Here I found that his utilisation is 979403 CPU hours.

I would like to understand why the two commands report such different
utilisation. The difference is 983352 - 979403 = 3949 CPU hours.
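
The conversion and comparison above can be reproduced with a short sketch. It
only parses the `GrpTRESMins` line quoted earlier and redoes the arithmetic.
It does not explain the discrepancy. The function name `parse_tres_minutes`
is hypothetical, and the `name=limit(used)` layout is assumed from the output
shown.

```python
import re


def parse_tres_minutes(line):
    """Parse 'GrpTRESMins=cpu=5940(59001152),...' into {name: (limit, used)}.

    A limit of 'N' (no limit set) is returned as None.
    """
    _, _, body = line.partition("=")
    result = {}
    for item in body.split(","):
        match = re.fullmatch(r"([^=]+)=(N|\d+)\((\d+)\)", item.strip())
        if match:
            name, limit, used = match.groups()
            result[name] = (None if limit == "N" else int(limit), int(used))
    return result


line = ("GrpTRESMins=cpu=5940(59001152),mem=N(293429427523),energy=N(0),"
        "node=N(1993778),billing=N(96073808),fs/disk=N(0),vmem=N(0),"
        "pages=N(0),gres/gpu=1(0)")

cpu_limit_min, cpu_used_min = parse_tres_minutes(line)["cpu"]
cpu_limit_hours = cpu_limit_min // 60      # 5940 minutes -> 99 hours
cpu_used_hours = cpu_used_min // 60        # 59001152 minutes -> 983352 hours

sreport_cpu_hours = 979403                 # value reported by sreport above
difference = cpu_used_hours - sreport_cpu_hours

print(cpu_limit_hours, cpu_used_hours, difference)  # 99 983352 3949
```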

Could you please help me with this?

Regards,
Prachi

