I think it would be the slurm-slurmctld rpm. I'm not sure about the timing of updating and restarting. We noticed the issue when we were testing 18.08.01, so we didn't have any users or jobs at the time and just modified and rebuilt.
Jeff

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of David Baker
Sent: Thursday, July 25, 2019 8:30 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Slurm node weights

Hi Jeff,

Thank you for these details. So far we have never implemented any Slurm fixes. I suspect the node weights feature is quite important and useful, and it's probably worth me investigating this fix. In this respect could you please advise me?

If I use the fix to regenerate the "slurm-slurmd" rpm can I then stop the slurmctld processes on the servers, re-install the revised rpm and finally restart the slurmctld processes? Most importantly, can this replacement/fix be done on a live system that is running jobs, etc? That's assuming that we regard/announce the system to be at risk. Or alternatively, do we need to arrange downtime, etc?

Best regards,
David

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Sarlo, Jeffrey S <jsa...@central.uh.edu>
Sent: 25 July 2019 13:04
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights

This is the fix if you want to modify the code and rebuild:

https://github.com/SchedMD/slurm/commit/f66a2a3e2064

I think 18.08.04 and later have it fixed.

Jeff

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of David Baker <d.j.ba...@soton.ac.uk>
Sent: Thursday, July 25, 2019 6:53 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights

Hello,

Thank you for the replies.
We're running an early version of Slurm 18.08 and it does appear that the node weights are being ignored because of the bug. We're experimenting with Slurm 19*, however we don't expect to deploy that new version for quite a while. In the meantime, does anyone know if there is any fix or alternative strategy that might help us to achieve the same result?

Best regards,
David

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Sarlo, Jeffrey S <jsa...@central.uh.edu>
Sent: 25 July 2019 12:26
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights

Which version of Slurm are you running? I know some of the earlier versions of 18.08 had a bug and node weights were not working.

Jeff

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of David Baker <d.j.ba...@soton.ac.uk>
Sent: Thursday, July 25, 2019 6:09 AM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights

Hello,

As an update, I note that I have tried restarting the slurmctld, however that doesn't help.

Best regards,
David

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of David Baker <d.j.ba...@soton.ac.uk>
Sent: 25 July 2019 11:47:35
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Slurm node weights

Hello,

I'm experimenting with node weights and I'm very puzzled by what I see. Looking at the documentation, I gathered that jobs will be allocated to the nodes with the lowest weight that satisfy their requirements.

I have 3 nodes in a partition and I have defined the nodes like so:

NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN Weight=50
NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN

So, given that the default weight is 1, I would expect jobs to be allocated to orange02 and orange03 first. I find, however, that my test job is always allocated to orange01 with the higher weight. Have I overlooked something? I would appreciate your advice, please.
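
For reference, the ordering expected from the node definitions above can be sketched in a few lines of Python. This is only an illustration of the "lowest weight first" rule as described in the message; it is not Slurm's scheduler code, and the names `DEFAULT_WEIGHT`, `nodes` and `preferred_order` are made up for the example.

```python
# Illustrative sketch only: this is not how slurmctld is implemented.
DEFAULT_WEIGHT = 1  # default Weight, as stated in the message above

# Weights taken from the node definitions above. orange02 and orange03
# have no explicit Weight, so they fall back to the default.
nodes = {"orange01": 50, "orange02": None, "orange03": None}


def preferred_order(nodes):
    """Return node names sorted by effective weight, lowest first."""
    effective = {
        name: (weight if weight is not None else DEFAULT_WEIGHT)
        for name, weight in nodes.items()
    }
    # Ties are broken by node name so the result is deterministic.
    return sorted(effective, key=lambda name: (effective[name], name))


print(preferred_order(nodes))  # ['orange02', 'orange03', 'orange01']
```

With working node weights, orange01 would therefore be the last choice for a job that fits on any of the three nodes, which is the opposite of what was observed on the affected 18.08 release.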