On 30/5/22 3:01 am, byron wrote:

The one thing I'm unsure about is as much a Linux / NFS issue as a Slurm one.  When I change the soft link for "default" to point to the new 20.11 Slurm install while all the compute nodes are still running the old 19.05 version because they haven't been restarted yet, will that cause any problems?  Or will they just keep seeing the same old 19.05 version of Slurm that they are running until they are restarted?

That may cause issues: whilst the ASAP flag to scontrol reboot guarantees no new jobs will start on the selected nodes until after they've rebooted, that doesn't (and shouldn't) stop new job steps from srun starting on them for jobs that are already running.
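For reference, the reboot-with-drain behaviour described above is requested like this (the node list and reason string here are illustrative, not from the original post):

```shell
# Drain-and-reboot sketch: ASAP stops new *jobs* starting on these nodes
# until they've rebooted, and nextstate=resume returns them to service
# afterwards. Running jobs (and their new srun steps) are unaffected.
scontrol reboot ASAP nextstate=resume reason="upgrade to 20.11" node[01-04]
```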

If you switch that symlink, those running jobs will pick up the 20.11 srun binary against a 19.05 slurmd, and that's where you may come unstuck.

This is one of the reasons why we do everything with Slurm installed via RPM inside an image: you get a pretty straightforward A -> B transition.

If your symlink was node-local in some way (say created at boot time via some config management system before slurmd starts) then that could work around the problem, as the nodes would keep seeing the Slurm binaries that match their running slurmd.
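A minimal sketch of that node-local arrangement (all paths here are hypothetical examples, not from the original post): a boot script pins a local symlink at the versioned install the node's slurmd will run, so the shared "default" symlink on NFS can be flipped without affecting nodes that haven't rebooted yet.

```shell
#!/bin/sh
# Illustrative sketch only: create a node-local "default" symlink at boot,
# before slurmd starts. Directory layout is a made-up example.
set -eu

SLURM_ROOT=/tmp/demo/apps/slurm         # shared (NFS) install tree
LOCAL_LINK=/tmp/demo/opt/slurm-default  # node-local symlink location

mkdir -p "$SLURM_ROOT/19.05" "$SLURM_ROOT/20.11" "$(dirname "$LOCAL_LINK")"

# Pin this node to the version its slurmd was started with (19.05 here).
# -n treats an existing link as a file rather than descending into it,
# -f replaces it in place.
ln -sfn "$SLURM_ROOT/19.05" "$LOCAL_LINK"

# srun and friends would then be resolved via $LOCAL_LINK/bin on each
# node, not via the shared "default" symlink:
readlink "$LOCAL_LINK"   # -> /tmp/demo/apps/slurm/19.05
```

After a node reboots into the new version, the same boot-time script would simply point its local link at $SLURM_ROOT/20.11 instead.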

Best of luck!
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA