Re: [slurm-users] Slurm 21.08.8-2 upgrade

2022-05-05 Thread Juergen Salk
Hi John, this is really bad news. We have stopped our rolling update from Slurm 21.08.6 to Slurm 21.08.8-1 today for exactly that reason: State of compute nodes already running slurmd 21.08.8-1 suddenly started flapping between responding and not responding but all other nodes that were still r

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Christopher Samuel
On 5/5/22 7:08 am, Mark Dixon wrote: I'm confused how this is supposed to be achieved in a configless setting, as slurmctld isn't running to distribute the updated files to slurmd. That's exactly what happens with configless mode, slurmd's retrieve their config from the slurmctld, and will g

Re: [slurm-users] Slurm versions 21.08.8 and 20.11.9 are now available (CVE-2022-29500, 29501, 29502)

2022-05-05 Thread Tim Wickberg
And, what is hopefully my final update on this: Unfortunately I missed including a single last-minute commit in the 21.08.8 release. That missing commit fixes a communication issue between a mix of patched and unpatched slurmd processes that could lead to nodes being incorrectly marked as offl

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Christopher Samuel
On 5/5/22 5:17 am, Steven Varga wrote: Thank you for the quick reply! I know I am pushing my luck here: is it possible to modify slurm: src/common/[read_conf.c, node_conf.c] src/slurmctld/[read_config.c, ...] such that the state can be maintained dynamically? -- or cheaper to write a job manag

Re: [slurm-users] Slurm versions 21.08.8 and 20.11.9 are now available (CVE-2022-29500, 29501, 29502)

2022-05-05 Thread Tim Wickberg
I wanted to provide some elaboration on the new CommunicationParameters=block_null_hash option based on initial feedback. The original email said it was safe to enable after all daemons had been restarted. Unfortunately that statement was incomplete - the flag can only be safely enabled after

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Ole Holm Nielsen
On 5/5/22 16:08, Mark Dixon wrote: On Thu, 5 May 2022, Ole Holm Nielsen wrote: ... That is correct. Just do "scontrol reconfig" on the slurmctld server. If all your slurmd's are truly running Configless[1], they will pick up the new config and reconfigure without restarting. Details are su
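
As a rough illustration of the advice above, here is a minimal Python sketch that issues the reconfigure request from the slurmctld host. It assumes `scontrol` is on the PATH and that the caller has the privileges needed to reconfigure; the function names `reconfigure_command` and `push_config` are hypothetical helpers invented for this sketch, not part of Slurm. It uses the full `reconfigure` keyword, of which the `reconfig` quoted in the message is an abbreviation.

```python
import subprocess
from typing import Callable, List


def reconfigure_command(scontrol: str = "scontrol") -> List[str]:
    """Build the argument list for a reconfigure request.

    The `scontrol` argument is the path to the binary; the default
    assumes it can be found on the PATH.
    """
    return [scontrol, "reconfigure"]


def push_config(run: Callable = subprocess.run) -> int:
    """Ask slurmctld to re-read its configuration.

    `run` is injectable so the function can be exercised without a
    Slurm installation. Returns the exit status of the command.
    """
    result = run(reconfigure_command(), check=False)
    return result.returncode


if __name__ == "__main__":
    # Run on the slurmctld server after editing slurm.conf.
    status = push_config()
    print("scontrol exit status:", status)
```

In a configless setup, per the message above, the slurmd daemons are expected to pick up the new configuration afterwards without being restarted; the sketch does not verify that.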

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Ole Holm Nielsen
On 5/5/22 15:53, Ward Poelmans wrote: Hi Steven, I think truly dynamic adding and removing of nodes is something that's on the roadmap for slurm 23.02? Yes, see slide 37 in https://slurm.schedmd.com/SLUG21/Roadmap.pdf from the Slurm publications site https://slurm.schedmd.com/publications.ht

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Mark Dixon
On Thu, 5 May 2022, Ole Holm Nielsen wrote: ... That is correct. Just do "scontrol reconfig" on the slurmctld server. If all your slurmd's are truly running Configless[1], they will pick up the new config and reconfigure without restarting. Details are summarized in https://wiki.fysik.dtu.dk/n

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Ward Poelmans
Hi Steven, I think truly dynamic adding and removing of nodes is something that's on the roadmap for slurm 23.02? Ward On 5/05/2022 15:28, Steven Varga wrote: Hi Tina, Thank you for sharing. This matches my observations when I checked if slurm could do what I am up to: manage AWS EC2 dynamic(

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Ole Holm Nielsen
Hi Tina, On 5/5/22 14:54, Tina Friedrich wrote: Hi List, out of curiosity - I would assume that if running configless, one doesn't manually need to restart slurmd on the nodes if the config changes? That is correct. Just do "scontrol reconfig" on the slurmctld server. If all your slurmd's

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Brian Andrus
@Tina, Figure slurmd reads the config in once and runs with it. You would need to have it recheck regularly to see if there are any changes. This is exactly what 'scontrol reconfig' does: tells all the slurm nodes to recheck the config. @Steven, It seems to me you could just have a monitor

Re: [slurm-users] CommunicationParameters=block_null_hash issue in 21.08.8

2022-05-05 Thread Ole Holm Nielsen
Hi Marcus, On 5/5/22 14:45, Marcus Boden wrote: we had similar issues on our systems. As I understand from the bug you linked, we just need to wait until all the old jobs are finished (and the old slurmstepd are gone). So a full drain should not be necessary? Yes, I believe that sounds righ
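
One way to check whether "the old slurmstepd are gone" on a node is to look for slurmstepd processes that have been running longer than the time since slurmd was upgraded. The Python sketch below is only an illustration under stated assumptions: it relies on the procps `ps` options `-C` and `-o pid=,etimes=` being available on the node, and the helper names (`parse_ps`, `stale_pids`, `list_stepd`) are hypothetical, not part of Slurm.

```python
import subprocess
from typing import List, Tuple


def parse_ps(output: str) -> List[Tuple[int, int]]:
    """Parse 'pid elapsed_seconds' lines into (pid, elapsed) pairs."""
    pairs = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((int(fields[0]), int(fields[1])))
    return pairs


def stale_pids(pairs: List[Tuple[int, int]],
               seconds_since_upgrade: int) -> List[int]:
    """Return PIDs that started before the upgrade.

    A process is older than the upgrade if it has been running for
    longer than the time elapsed since the upgrade.
    """
    return [pid for pid, elapsed in pairs if elapsed > seconds_since_upgrade]


def list_stepd() -> List[Tuple[int, int]]:
    """List running slurmstepd processes with their elapsed seconds.

    Assumption: procps-style ps; an empty result means none are running.
    """
    out = subprocess.run(
        ["ps", "-C", "slurmstepd", "-o", "pid=,etimes="],
        capture_output=True, text=True, check=False,
    ).stdout
    return parse_ps(out)


if __name__ == "__main__":
    # Hypothetical value: slurmd on this node was upgraded one hour ago.
    old = stale_pids(list_stepd(), seconds_since_upgrade=3600)
    print("slurmstepd processes predating the upgrade:", old)
```

An empty list only suggests that no pre-upgrade slurmstepd remains on that node; the bug report referenced in the thread remains the place to confirm what is actually required.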

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Steven Varga
Hi Tina, Thank you for sharing. This matches my observations when I checked if slurm could do what I am up to: manage AWS EC2 dynamic (spot) instances. After replacing MySQL with REDIS, I now wonder what it would take to make slurm node addition | removal dynamic. I've been looking at the source code

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Tina Friedrich
Hi List, out of curiosity - I would assume that if running configless, one doesn't manually need to restart slurmd on the nodes if the config changes? Hi Steven, I have no idea if you want to do it every couple of minutes and what the implications are of that (although I've certainly manage

Re: [slurm-users] CommunicationParameters=block_null_hash issue in 21.08.8

2022-05-05 Thread Marcus Boden
Hi Ole, we had similar issues on our systems. As I understand from the bug you linked, we just need to wait until all the old jobs are finished (and the old slurmstepd are gone). So a full drain should not be necessary? Best, Marcus On 05.05.22 13:53, Ole Holm Nielsen wrote: Just a heads-u

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Steven Varga
Thank you for the quick reply! I know I am pushing my luck here: is it possible to modify slurm: src/common/[read_conf.c, node_conf.c] src/slurmctld/[read_config.c, ...] such that the state can be maintained dynamically? -- or cheaper to write a job manager with fewer features but supporting dynamic