Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Steven Varga
Thank you for the quick reply! I know I am pushing my luck here: is it
possible to modify Slurm (src/common/[read_conf.c, node_conf.c],
src/slurmctld/[read_config.c, ...]) such that the node state can be maintained
dynamically? -- or would it be cheaper to write a job manager with fewer
features but with support for dynamic nodes from the ground up?
best wishes: steve

On Thu, May 5, 2022 at 12:29 AM Christopher Samuel wrote:

> On 5/4/22 7:26 pm, Steven Varga wrote:
>
> > I am wondering what is the best way to update node changes, such as
> > addition and removal of nodes to SLURM. The excerpts below suggest a
> > full restart, can someone confirm this?
>
> You are correct, you need to restart slurmctld and slurmd daemons at
> present.  See https://slurm.schedmd.com/faq.html#add_nodes
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>


Re: [slurm-users] CommunicationParameters=block_null_hash issue in 21.08.8

2022-05-05 Thread Marcus Boden

Hi Ole,

we had a similar issue on our systems. As I understand from the bug you 
linked, we just need to wait until all the old jobs are finished (and 
the old slurmstepd processes are gone). So a full drain should not be necessary?


Best,
Marcus

On 05.05.22 13:53, Ole Holm Nielsen wrote:
Just a heads-up regarding setting 
CommunicationParameters=block_null_hash in slurm.conf:


On 5/4/22 21:50, Tim Wickberg wrote:

CVE-2022-29500:

An architectural flaw with how credentials are handled can be 
exploited to allow an unprivileged user to impersonate the SlurmUser 
account. Access to the SlurmUser account can be used to execute 
arbitrary processes as root.


This issue impacts all Slurm releases since at least Slurm 1.0.0.

Systems remain vulnerable until all slurmdbd, slurmctld, and slurmd 
processes have been restarted in the cluster.


Once all daemons have been upgraded sites are encouraged to add 
"block_null_hash" to CommunicationParameters. That new option provides 
additional protection against a potential exploit.


The block_null_hash still needs to be documented in the slurm.conf 
man-page.  But in https://bugs.schedmd.com/show_bug.cgi?id=14002 I was 
assured that it's OK to use it now.


I upgraded 21.08.7 to 21.08.8 using RPM packages while the cluster was 
running production jobs.  This is perhaps not recommended (see 
https://slurm.schedmd.com/quickstart_admin.html#upgrade), but it worked 
without a glitch also in this case.


However, when I defined CommunicationParameters=block_null_hash in 
slurm.conf later today, I started getting RPC errors on the compute 
nodes and in slurmctld when jobs were completing, see bug 14002.


I would recommend that sites hold off a bit on 
CommunicationParameters=block_null_hash until we have found a resolution 
in bug 14002.  Draining all jobs from the cluster before setting this 
parameter may be the safer approach(?).


/Ole



--
Marcus Vincent Boden, M.Sc. (he/him)
AG Computing
Tel.:   +49 (0)551 201-2191, E-Mail: mbo...@gwdg.de
-
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen 
(GWDG) Burckhardtweg 4, 37077 Göttingen, URL: https://gwdg.de


Support: Tel.: +49 551 39-3, URL: https://gwdg.de/support
Sekretariat: Tel.: +49 551 39-30001, E-Mail: g...@gwdg.de

Geschäftsführer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Norbert Lossau
Sitz der Gesellschaft: Göttingen
Registergericht: Göttingen, Handelsregister-Nr. B 598

Zertifiziert nach ISO 9001
-




Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Tina Friedrich

Hi List,

out of curiosity - I would assume that if running configless, one 
doesn't manually need to restart slurmd on the nodes if the config changes?


Hi Steven,

I have no idea if you want to do it every couple of minutes and what the 
implications of that are (although I've certainly managed to restart them 
every 5 minutes by accident with no real problems caused), but - 
generally, restarting the daemons (slurmctld, slurmd) is a non-issue, as 
it's a safe operation. There's no risk to running jobs or anything. I 
have the config management restart them if any files change. It also 
doesn't seem to matter if the restarts of the controller & the node 
daemons are splayed a bit (i.e. don't happen at the same time), or what 
order they happen in.


Tina

On 05/05/2022 13:17, Steven Varga wrote:
Thank you for the quick reply! I know I am pushing my luck here: is it 
possible to modify slurm: src/common/[read_conf.c, node_conf.c]  
src/slurmctld/[read_config.c, ...] such that the state can be maintained 
dynamically? -- or cheaper to write a job manager with less features but 
supporting dynamic nodes from ground up?

best wishes: steve

On Thu, May 5, 2022 at 12:29 AM Christopher Samuel wrote:


On 5/4/22 7:26 pm, Steven Varga wrote:

 > I am wondering what is the best way to update node changes, such as
 > addition and removal of nodes to SLURM. The excerpts below suggest a
 > full restart, can someone confirm this?

You are correct, you need to restart slurmctld and slurmd daemons at
present.  See https://slurm.schedmd.com/faq.html#add_nodes


All the best,
Chris
-- 
Chris Samuel  : http://www.csamuel.org/  
:  Berkeley, CA, USA




--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk



Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Steven Varga
Hi Tina,
Thank you for sharing. This matches my observations when I checked whether Slurm
could do what I am up to: managing AWS EC2 dynamic (spot) instances.

After replacing MySQL with Redis, I now wonder what it would take to make
Slurm node addition/removal dynamic. I've been looking at the source code
for many months now, trying to decide if it can be done.

I am using configless mode, 3 controllers, and 2 slurmdbds with a robust
Redis Sentinel based backend.

Steven


On Thu., May 5, 2022, 08:57 Tina Friedrich wrote:

> Hi List,
>
> out of curiosity - I would assume that if running configless, one
> doesn't manually need to restart slurmd on the nodes if the config changes?
>
> Hi Steven,
>
> I have no idea if you want to do it every couple of minutes and what the
> implications are of that (although I've certainly manage to restart them
> every 5 minutes by accident with no real problems caused), but -
> generally, restarting the daemons (slurmctld, slurmd) is a non-issue, as
> it's a safe operation. There's no risk to running jobs or anything. I
> have the config management restart them if any files change. It also
> doesn't seem to matter if the restarts of the controller & the node
> daemons are splayed a bit (i.e. don't happen at the same time), or what
> order they happen in.
>
> Tina
>
> On 05/05/2022 13:17, Steven Varga wrote:
> > Thank you for the quick reply! I know I am pushing my luck here: is it
> > possible to modify slurm: src/common/[read_conf.c, node_conf.c]
> > src/slurmctld/[read_config.c, ...] such that the state can be maintained
> > dynamically? -- or cheaper to write a job manager with less features but
> > supporting dynamic nodes from ground up?
> > best wishes: steve
> >
> > On Thu, May 5, 2022 at 12:29 AM Christopher Samuel wrote:
> >
> > On 5/4/22 7:26 pm, Steven Varga wrote:
> >
> >  > I am wondering what is the best way to update node changes, such
> as
> >  > addition and removal of nodes to SLURM. The excerpts below
> suggest a
> >  > full restart, can someone confirm this?
> >
> > You are correct, you need to restart slurmctld and slurmd daemons at
> > present.  See https://slurm.schedmd.com/faq.html#add_nodes
> > 
> >
> > All the best,
> > Chris
> > --
> > Chris Samuel  : http://www.csamuel.org/ 
> > :  Berkeley, CA, USA
> >
>
> --
> Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
>
> Research Computing and Support Services
> IT Services, University of Oxford
> http://www.arc.ox.ac.uk http://www.it.ox.ac.uk
>
>


Re: [slurm-users] CommunicationParameters=block_null_hash issue in 21.08.8

2022-05-05 Thread Ole Holm Nielsen

Hi Marcus,

On 5/5/22 14:45, Marcus Boden wrote:
we had a similar issue on our systems. As I understand from the bug you 
linked, we just need to wait until all the old jobs are finished (and the 
old slurmstepd processes are gone). So a full drain should not be necessary?


Yes, I believe that sounds right.

I've been thinking about how to determine the timestamp of the oldest job 
running on the cluster, and then make sure this is after the time that all 
slurmd daemons were upgraded to 21.08.8.


This command will tell you the oldest running jobs:

$ squeue -t running -O StartTime | sort | head

You can add more -O options to get JobIDs etc., as long as you sort on the 
StartTime column (Slurm ISO 8601 timestamps[1] can simply be sorted in 
lexicographical order).
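
For example, to also see the job IDs and owners while keeping the sort on the 
StartTime column, something along these lines should work (purely illustrative):

$ squeue -t running -O StartTime,JobID,UserName | sort | head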


I hope this helps.

/Ole


[1] https://en.wikipedia.org/wiki/ISO_8601




On 05.05.22 13:53, Ole Holm Nielsen wrote:
Just a heads-up regarding setting 
CommunicationParameters=block_null_hash in slurm.conf:


On 5/4/22 21:50, Tim Wickberg wrote:

CVE-2022-29500:

An architectural flaw with how credentials are handled can be exploited 
to allow an unprivileged user to impersonate the SlurmUser account. 
Access to the SlurmUser account can be used to execute arbitrary 
processes as root.


This issue impacts all Slurm releases since at least Slurm 1.0.0.

Systems remain vulnerable until all slurmdbd, slurmctld, and slurmd 
processes have been restarted in the cluster.


Once all daemons have been upgraded sites are encouraged to add 
"block_null_hash" to CommunicationParameters. That new option provides 
additional protection against a potential exploit.


The block_null_hash still needs to be documented in the slurm.conf 
man-page.  But in https://bugs.schedmd.com/show_bug.cgi?id=14002 I was 
assured that it's OK to use it now.


I upgraded 21.08.7 to 21.08.8 using RPM packages while the cluster was 
running production jobs.  This is perhaps not recommended (see 
https://slurm.schedmd.com/quickstart_admin.html#upgrade), but it worked 
without a glitch also in this case.


However, when I defined CommunicationParameters=block_null_hash in 
slurm.conf later today, I started getting RPC errors on the compute 
nodes and in slurmctld when jobs were completing, see bug 14002.


I would recommend that sites hold off a bit on 
CommunicationParameters=block_null_hash until we have found a resolution 
in bug 14002.  Draining all jobs from the cluster before setting this 
parameter may be the safer approach(?).




Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Brian Andrus

@Tina,

Figure slurmd reads the config in once and runs with it. You would need 
to have it recheck regularly to see if there are any changes. This is 
exactly what 'scontrol reconfig' does: it tells all the slurm nodes to 
recheck the config.



@Steven,

It seems to me you could just have a monitor daemon that keeps things 
up-to-date.
It could watch for the alert that AWS sends (2 minute warning, IIRC) and 
take appropriate action, such as draining the node and cancelling/checkpointing 
the job. In addition, it could keep an eye on things in the event a warning 
wasn't received and a node 'vanishes'.  I suspect Nagios even has the 
hooks to make that work. You could also email the user to let them know 
their job was ended because the spot instance was pulled.
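
As a rough sketch of that idea (purely illustrative: the metadata endpoint, 
polling interval and the scontrol/scancel calls are assumptions to adapt, not a 
tested recipe), something like this could run on each spot node:

#!/bin/bash
# Hypothetical spot-interruption watcher: drain this node when AWS posts the
# ~2 minute reclaim warning (IMDSv2 token handling omitted for brevity).
NODE=$(hostname -s)
while sleep 5; do
    if curl -sf http://169.254.169.254/latest/meta-data/spot/instance-action >/dev/null; then
        scontrol update nodename="$NODE" state=drain reason="EC2 spot reclaim"
        # optionally requeue/cancel whatever is still running here:
        # scancel --state=running --nodelist="$NODE"
        break
    fi
done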


Just some ideas,

Brian Andrus

On 5/5/2022 6:28 AM, Steven Varga wrote:

Hi Tina,
Thank you for sharing. This matches my observations when I checked if 
slurm could do what I am upto: manage AWS EC2 dynamic(spot) instances.


After replacing MySQL with REDIS now i wonder what would it take to 
make slurm node addition | removal dynamic. I've been looking at the 
source code for many months now and trying to decide if it can be done.


I am using configless, 3 controllers, 2 slurmdbs with a redis sentinel 
based robust backend.


Steven


On Thu., May 5, 2022, 08:57 Tina Friedrich wrote:


Hi List,

out of curiosity - I would assume that if running configless, one
doesn't manually need to restart slurmd on the nodes if the config
changes?

Hi Steven,

I have no idea if you want to do it every couple of minutes and
what the
implications are of that (although I've certainly manage to
restart them
every 5 minutes by accident with no real problems caused), but -
generally, restarting the daemons (slurmctld, slurmd) is a
non-issue, as
it's a safe operation. There's no risk to running jobs or anything. I
have the config management restart them if any files change. It also
doesn't seem to matter if the restarts of the controller & the node
daemons are splayed a bit (i.e. don't happen at the same time), or
what
order they happen in.

Tina

On 05/05/2022 13:17, Steven Varga wrote:
> Thank you for the quick reply! I know I am pushing my luck here:
is it
> possible to modify slurm: src/common/[read_conf.c, node_conf.c]
> src/slurmctld/[read_config.c, ...] such that the state can be
maintained
> dynamically? -- or cheaper to write a job manager with less
features but
> supporting dynamic nodes from ground up?
> best wishes: steve
>
> On Thu, May 5, 2022 at 12:29 AM Christopher Samuel wrote:
>
>     On 5/4/22 7:26 pm, Steven Varga wrote:
>
>      > I am wondering what is the best way to update node
changes, such as
>      > addition and removal of nodes to SLURM. The excerpts
below suggest a
>      > full restart, can someone confirm this?
>
>     You are correct, you need to restart slurmctld and slurmd
daemons at
>     present.  See https://slurm.schedmd.com/faq.html#add_nodes
>     
>
>     All the best,
>     Chris
>     --
>     Chris Samuel  : http://www.csamuel.org/

>     :  Berkeley, CA, USA
>

-- 
Tina Friedrich, Advanced Research Computing Snr HPC Systems

Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk


Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Ole Holm Nielsen

Hi Tina,

On 5/5/22 14:54, Tina Friedrich wrote:

Hi List,

out of curiosity - I would assume that if running configless, one doesn't 
manually need to restart slurmd on the nodes if the config changes?


That is correct.  Just do "scontrol reconfig" on the slurmctld server.  If 
all your slurmd's are truly running Configless[1], they will pick up the 
new config and reconfigure without restarting.


Details are summarized in 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#reconfiguration-of-slurm-conf. 
 Beware that you can't add or remove nodes without restarting.  Also, 
changing certain slurm.conf parameters requires restarting.


/Ole

[1] https://slurm.schedmd.com/configless_slurm.html



Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Ward Poelmans

Hi Steven,

I think truly dynamic adding and removing of nodes is something that's on the 
roadmap for slurm 23.02?

Ward

On 5/05/2022 15:28, Steven Varga wrote:

Hi Tina,
Thank you for sharing. This matches my observations when I checked if slurm 
could do what I am upto: manage AWS EC2 dynamic(spot) instances.

After replacing MySQL with REDIS now i wonder what would it take to make slurm 
node addition | removal dynamic. I've been looking at the source code for many 
months now and trying to decide if it can be done.

I am using configless, 3 controllers, 2 slurmdbs with a redis sentinel based 
robust backend.

Steven


On Thu., May 5, 2022, 08:57 Tina Friedrich, <tina.friedr...@it.ox.ac.uk> wrote:

Hi List,

out of curiosity - I would assume that if running configless, one
doesn't manually need to restart slurmd on the nodes if the config changes?

Hi Steven,

I have no idea if you want to do it every couple of minutes and what the
implications are of that (although I've certainly manage to restart them
every 5 minutes by accident with no real problems caused), but -
generally, restarting the daemons (slurmctld, slurmd) is a non-issue, as
it's a safe operation. There's no risk to running jobs or anything. I
have the config management restart them if any files change. It also
doesn't seem to matter if the restarts of the controller & the node
daemons are splayed a bit (i.e. don't happen at the same time), or what
order they happen in.

Tina

On 05/05/2022 13:17, Steven Varga wrote:
 > Thank you for the quick reply! I know I am pushing my luck here: is it
 > possible to modify slurm: src/common/[read_conf.c, node_conf.c]
 > src/slurmctld/[read_config.c, ...] such that the state can be maintained
 > dynamically? -- or cheaper to write a job manager with less features but
 > supporting dynamic nodes from ground up?
 > best wishes: steve
 >
 > On Thu, May 5, 2022 at 12:29 AM Christopher Samuel <ch...@csamuel.org> wrote:
 >
 >     On 5/4/22 7:26 pm, Steven Varga wrote:
 >
 >      > I am wondering what is the best way to update node changes, such 
as
 >      > addition and removal of nodes to SLURM. The excerpts below 
suggest a
 >      > full restart, can someone confirm this?
 >
 >     You are correct, you need to restart slurmctld and slurmd daemons at
 >     present.  See https://slurm.schedmd.com/faq.html#add_nodes 

 >     >
 >
 >     All the best,
 >     Chris
 >     --
 >     Chris Samuel  : http://www.csamuel.org/  
>
 >     :  Berkeley, CA, USA
 >

-- 
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator


Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk  http://www.it.ox.ac.uk 








Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Mark Dixon

On Thu, 5 May 2022, Ole Holm Nielsen wrote:
...

That is correct.  Just do "scontrol reconfig" on the slurmctld server.  If
all your slurmd's are truly running Configless[1], they will pick up the
new config and reconfigure without restarting.

Details are summarized in
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#reconfiguration-of-slurm-conf.
Beware that you can't add or remove nodes without restarting.  Also,
changing certain slurm.conf parameters require restarting.

...

However...

Given that the normal recommendation for adding/removing nodes safely is 
to:


* stop slurmctld
* edit slurm.conf etc.
* restart the slurmd nodes to pick up new slurm.conf
* start slurmctld

I'm confused how this is supposed to be achieved in a configless setting, 
as slurmctld isn't running to distribute the updated files to slurmd.


Best,

Mark



Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Ole Holm Nielsen

On 5/5/22 15:53, Ward Poelmans wrote:

Hi Steven,

I think truly dynamic adding and removing of nodes is something that's on 
the roadmap for slurm 23.02?


Yes, see slide 37 in https://slurm.schedmd.com/SLUG21/Roadmap.pdf from the 
Slurm publications site https://slurm.schedmd.com/publications.html


/Ole



On 5/05/2022 15:28, Steven Varga wrote:

Hi Tina,
Thank you for sharing. This matches my observations when I checked if 
slurm could do what I am upto: manage AWS EC2 dynamic(spot) instances.


After replacing MySQL with REDIS now i wonder what would it take to make 
slurm node addition | removal dynamic. I've been looking at the source 
code for many months now and trying to decide if it can be done.


I am using configless, 3 controllers, 2 slurmdbs with a redis sentinel 
based robust backend.


Steven


On Thu., May 5, 2022, 08:57 Tina Friedrich wrote:


    Hi List,

    out of curiosity - I would assume that if running configless, one
    doesn't manually need to restart slurmd on the nodes if the config 
changes?


    Hi Steven,

    I have no idea if you want to do it every couple of minutes and what 
the
    implications are of that (although I've certainly manage to restart 
them

    every 5 minutes by accident with no real problems caused), but -
    generally, restarting the daemons (slurmctld, slurmd) is a 
non-issue, as

    it's a safe operation. There's no risk to running jobs or anything. I
    have the config management restart them if any files change. It also
    doesn't seem to matter if the restarts of the controller & the node
    daemons are splayed a bit (i.e. don't happen at the same time), or what
    order they happen in.

    Tina

    On 05/05/2022 13:17, Steven Varga wrote:
 > Thank you for the quick reply! I know I am pushing my luck here: 
is it

 > possible to modify slurm: src/common/[read_conf.c, node_conf.c]
 > src/slurmctld/[read_config.c, ...] such that the state can be 
maintained
 > dynamically? -- or cheaper to write a job manager with less 
features but

 > supporting dynamic nodes from ground up?
 > best wishes: steve
 >
 > On Thu, May 5, 2022 at 12:29 AM Christopher Samuel <ch...@csamuel.org> wrote:
 >
 >     On 5/4/22 7:26 pm, Steven Varga wrote:
 >
 >      > I am wondering what is the best way to update node 
changes, such as
 >      > addition and removal of nodes to SLURM. The excerpts below 
suggest a

 >      > full restart, can someone confirm this?
 >
 >     You are correct, you need to restart slurmctld and slurmd 
daemons at
 >     present.  See https://slurm.schedmd.com/faq.html#add_nodes 

 >     >

 >
 >     All the best,
 >     Chris
 >     --
 >     Chris Samuel  : http://www.csamuel.org/ 
 >

 >     :  Berkeley, CA, USA
 >

    --     Tina Friedrich, Advanced Research Computing Snr HPC Systems 
Administrator


    Research Computing and Support Services
    IT Services, University of Oxford
    http://www.arc.ox.ac.uk  
http://www.it.ox.ac.uk 






Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Ole Holm Nielsen




On 5/5/22 16:08, Mark Dixon wrote:

On Thu, 5 May 2022, Ole Holm Nielsen wrote:
...

That is correct.  Just do "scontrol reconfig" on the slurmctld server.  If
all your slurmd's are truly running Configless[1], they will pick up the
new config and reconfigure without restarting.

Details are summarized in
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#reconfiguration-of-slurm-conf. 


Beware that you can't add or remove nodes without restarting.  Also,
changing certain slurm.conf parameters require restarting.

...

However...

Given that the normal recommendation for adding/removing nodes safely is to:

* stop slurmctld
* edit slurm.conf etc.
* restart the slurmd nodes to pick up new slurm.conf
* start slurmctld

I'm confused how this is supposed to be achieved in a configless setting, 
as slurmctld isn't running to distribute the updated files to slurmd.


You're right; for Configless the correct order should probably be (a sketch follows below):

* stop slurmctld
* edit slurm.conf etc.
* start slurmctld
* restart the slurmd nodes to pick up new slurm.conf
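
For illustration, on a typical systemd-based site that sequence might look like 
this (host names, paths and the pdsh fan-out are assumptions, not a recipe):

# on the slurmctld host
systemctl stop slurmctld
$EDITOR /etc/slurm/slurm.conf        # add or remove NodeName= / partition lines
systemctl start slurmctld
# then restart slurmd everywhere, e.g. from the admin host
pdsh -w 'node[001-100]' systemctl restart slurmd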

See also slides 29-34 in 
https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf from the Slurm 
publications site https://slurm.schedmd.com/publications.html


Less-Safe, but usually okay, procedure:
1. Change configs
2. Restart slurmctld
3. Restart all slurmd processes really quickly


/Ole




Re: [slurm-users] Slurm versions 21.08.8 and 20.11.9 are now available (CVE-2022-29500, 29501, 29502)

2022-05-05 Thread Tim Wickberg
I wanted to provide some elaboration on the new 
CommunicationParameters=block_null_hash option based on initial feedback.


The original email said it was safe to enable after all daemons had been 
restarted. Unfortunately that statement was incomplete - the flag can 
only be safely enabled after all daemons have been restarted *and* all 
currently running jobs have completed.


The new maintenance releases - with or without this new option enabled - 
do fix the reported issues. The option is not required to secure your 
system.


This option provides an additional - redundant - layer of security 
within the cluster, and we do encourage sites to enable it at their 
earliest convenience, but only after currently running jobs (with an 
associated unpatched slurmstepd process) have all completed.
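
For reference, enabling it is a one-line slurm.conf change (shown here purely as 
an illustration), rolled out with whatever mechanism your site normally uses for 
slurm.conf updates:

# slurm.conf - only once all daemons are upgraded and all pre-upgrade jobs have finished
CommunicationParameters=block_null_hash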


- Tim



Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Christopher Samuel

On 5/5/22 5:17 am, Steven Varga wrote:

Thank you for the quick reply! I know I am pushing my luck here: is it 
possible to modify slurm: src/common/[read_conf.c, node_conf.c] 
src/slurmctld/[read_config.c, ...] such that the state can be maintained 
dynamically? -- or cheaper to write a job manager with less features but 
supporting dynamic nodes from ground up?


I had said "at present" because it looks like you will be in luck with the 
next release (though it sounds like it needs a little config):


From https://github.com/SchedMD/slurm/blob/master/RELEASE_NOTES:

 -- Allow nodes to be dynamically added and removed from the system. Configure
    MaxNodeCount to accommodate nodes created with dynamic node registrations
    (slurmd -Z --conf="") and scontrol.
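
Going by that note alone (the option syntax has not been verified here against a 
released version, so treat the values as placeholders), a dynamic registration 
might eventually look roughly like:

# slurm.conf on the controller: leave headroom for dynamically registered nodes
MaxNodeCount=1024

# on the joining node: register itself, passing its own node configuration
slurmd -Z --conf="CPUs=16 RealMemory=64000"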


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Slurm versions 21.08.8 and 20.11.9 are now available (CVE-2022-29500, 29501, 29502)

2022-05-05 Thread Tim Wickberg

And, what is hopefully my final update on this:

Unfortunately I missed including a single last-minute commit in the 
21.08.8 release. That missing commit fixes a communication issue between 
a mix of patched and unpatched slurmd processes that could lead to nodes 
being incorrectly marked as offline.


That patch was included in 20.11.9. That missing commit is included in a 
new 21.08.8-2 release which is on our download page now.


If you've already started rolling out 21.08.8 on your systems, the best 
path forward is to restart all slurmd processes in the cluster immediately.


- Tim



Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Christopher Samuel

On 5/5/22 7:08 am, Mark Dixon wrote:

I'm confused how this is supposed to be achieved in a configless 
setting, as slurmctld isn't running to distribute the updated files to 
slurmd.


That's exactly what happens in configless mode: the slurmds retrieve 
their config from the slurmctld, and will grab it again on an "scontrol 
reconfigure". There's no reason to stop slurmctld for this.


So your slurm.conf should only exist on the slurmctld node - this is how 
we operate on our latest system.
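
For anyone setting this up, a minimal configless arrangement looks roughly like 
the following (the host name is an assumption; a DNS SRV record for 
_slurmctld._tcp can be used instead of the --conf-server flag):

# slurm.conf, present only on the slurmctld host
SlurmctldParameters=enable_configless

# slurmd on the compute nodes fetches its config from the controller
slurmd --conf-server ctld-host.example.org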


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Slurm 21.08.8-2 upgrade

2022-05-05 Thread Juergen Salk
Hi John,

this is really bad news. We have stopped our rolling update from Slurm
21.08.6 to Slurm 21.08.8-1 today for exactly that reason: the state of 
compute nodes already running slurmd 21.08.8-1 suddenly started 
flapping between responding and not responding, while all other nodes 
that were still running the 21.08.6 slurmd were not affected.

For the affected nodes we did not see any obvious reason in slurmd.log
even with SlurmdDebug set to debug3 but we noticed the following
in slurmctld.log with SlurmctldDebug=debug and DebugFlags=route
enabled.

[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1423 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1424 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1425 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1426 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1811 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:41.397] error: Nodes n[1423-1426,1811] not responding

So you've seen this as well with 21.08.8-2?

We didn't have CommunicationParameters=block_null_hash set, btw. 

Actually, after Tim's last announcement, I was hoping that we could start 
over tomorrow morning with 21.08.8-2 to resolve this issue. Therefore, 
I would also be highly interested in what others can say about rolling updates 
from Slurm 21.08.6 to Slurm 21.08.8-2, which, at least temporarily, entails a 
mix of patched and unpatched slurmd versions on the compute nodes. 

If the 21.08.8-2 slurmd still does not work together with the 21.08.6 slurmd, 
we may have to drain the whole cluster for updating Slurm, which 
is something that I'd actually wished to avoid. 

Best regards
Jürgen



* Legato, John (NIH/NHLBI) [E]  [220505 22:30]:
> Hello,
> 
> We are in the process of upgrading from Slurm 21.08.6 to Slurm 21.08.8-2. 
> We’ve upgraded the controller and a few partitions worth of nodes. We notice 
> the nodes are
> losing contact with the controller but slurmd is still up. We thought that 
> this issue was fixed in -2 based on this bug report:
> 
> https://bugs.schedmd.com/show_bug.cgi?id=14011
> 
> However we are still seeing the same behavior. I note that nodes running 
> 21.08.6 are having no issues with communication. I could
> upgrade the remaining 21.08.6 nodes but hesitate to do that as it seems like 
> it would completely kill the functioning nodes.
> 
> Is anyone else still seeing this in -2?
> 
> Thanks
> 
> John
> 
> 
>