I think it would be the slurm-slurmctld rpm.

I'm not sure about the timing of updating and restarting.  We noticed the issue 
while testing 18.08.01, so there were no users or jobs at the time and we just 
modified the code and rebuilt.

Jeff

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
David Baker
Sent: Thursday, July 25, 2019 8:30 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Slurm node weights


Hi Jeff,



Thank you for these details. So far we have never implemented any Slurm fixes. 
I suspect the node weights feature is quite important and useful, and it's 
probably worth me investigating this fix. In this respect could you please 
advise me?



If I use the fix to regenerate the "slurm-slurmd" rpm can I then stop the 
slurmctld processes on the servers, re-install the revised rpm and finally 
restart the slurmctld processes? Most importantly, can this replacement/fix be 
done on a live system that is running jobs, etc? That's assuming that we 
regard/announce the system to be at risk. Or alternatively, do we need to 
arrange downtime, etc?



Best regards,

David





________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Sarlo, 
Jeffrey S <jsa...@central.uh.edu>
Sent: 25 July 2019 13:04
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights


This is the fix if you want to modify the code and rebuild:



https://github.com/SchedMD/slurm/commit/f66a2a3e2064

I think 18.08.04 and later have it fixed.

Jeff

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of David 
Baker <d.j.ba...@soton.ac.uk>
Sent: Thursday, July 25, 2019 6:53 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights


Hello,



Thank you for the replies. We're running an early version of Slurm 18.08 and it 
does appear that the node weights are being ignored because of the bug.



We're experimenting with Slurm 19*; however, we don't expect to deploy that new 
version for quite a while. In the meantime, does anyone know if there is any fix 
or alternative strategy that might help us to achieve the same result?



Best regards,

David

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Sarlo, 
Jeffrey S <jsa...@central.uh.edu>
Sent: 25 July 2019 12:26
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights


Which version of Slurm are you running?  I know some of the earlier versions of 
18.08 had a bug and node weights were not working.



Jeff

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of David 
Baker <d.j.ba...@soton.ac.uk>
Sent: Thursday, July 25, 2019 6:09 AM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights


Hello,



As an update, I have tried restarting slurmctld; however, that doesn't help.



Best regards,

David

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of David 
Baker <d.j.ba...@soton.ac.uk>
Sent: 25 July 2019 11:47:35
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Slurm node weights


Hello,



I'm experimenting with node weights and I'm very puzzled by what I see. Looking 
at the documentation, I gathered that jobs will be allocated to the nodes with 
the lowest weight that satisfy their requirements. I have 3 nodes in a 
partition and I have defined the nodes like so:


NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN Weight=50
NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN


So, given that the default weight is 1, I would expect jobs to be allocated to 
orange02 and orange03 first. I find, however, that my test job is always 
allocated to orange01 with the higher weight. Have I overlooked something? I 
would appreciate your advice, please.
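
For readers following the thread, here is a minimal Python sketch of the ordering the message expects. It models only the "lowest weight first" rule described above and is not how the Slurm scheduler is implemented. The helper names (expand, parse_weights) are hypothetical. The node lines are copied from the message, joined onto single lines, and the default weight of 1 is taken from the message.

```python
import re

# Node definitions copied from the message (each joined onto one line).
NODE_LINES = """
NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN Weight=50
NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN
"""

DEFAULT_WEIGHT = 1  # default weight, as stated in the message


def expand(name):
    """Expand a simple bracketed range such as orange[02-03] (hypothetical helper)."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\]", name)
    if not match:
        return [name]
    prefix, low, high = match.groups()
    width = len(low)  # keep zero padding, e.g. 02 and 03
    return [f"{prefix}{i:0{width}d}" for i in range(int(low), int(high) + 1)]


def parse_weights(text):
    """Map each node name to its weight, using the default when Weight is absent."""
    weights = {}
    for line in text.strip().splitlines():
        fields = dict(pair.split("=", 1) for pair in line.split())
        weight = int(fields.get("Weight", DEFAULT_WEIGHT))
        for node in expand(fields["NodeName"]):
            weights[node] = weight
    return weights


weights = parse_weights(NODE_LINES)
# Lowest weight first; ties are broken by node name for a stable listing.
expected_order = sorted(weights, key=lambda node: (weights[node], node))
print(expected_order)  # ['orange02', 'orange03', 'orange01']
```

Under that rule, orange01 (weight 50) should be the last of the three nodes to be chosen, which is why the behaviour reported here looks wrong.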



