Hi John,
This is really bad news. We have stopped our rolling update from Slurm
21.08.6 to Slurm 21.08.8-1 today for exactly that reason: the state of
compute nodes already running slurmd 21.08.8-1 suddenly started
flapping between responding and not responding, but all other nodes
that were still r
On 5/5/22 7:08 am, Mark Dixon wrote:
I'm confused how this is supposed to be achieved in a configless
setting, as slurmctld isn't running to distribute the updated files to
slurmd.
That's exactly what happens with configless mode, slurmd's retrieve
their config from the slurmctld, and will g
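For anyone following along, a minimal configless setup looks roughly like the sketch below (option names are from the Slurm documentation; the hostname and port are placeholders):

```shell
# slurm.conf on the slurmctld host: allow slurmd/scontrol to fetch the config
SlurmctldParameters=enable_configless

# On each compute node, start slurmd pointing at the controller instead of
# shipping a local slurm.conf (DNS SRV records are the other documented way
# for slurmd to locate the controller):
slurmd --conf-server ctld.example.com:6817
```

With that in place, slurmd caches the fetched config locally and re-fetches it whenever the controller announces a reconfigure.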
And, what is hopefully my final update on this:
Unfortunately I missed including a single last-minute commit in the
21.08.8 release. That missing commit fixes a communication issue between
a mix of patched and unpatched slurmd processes that could lead to nodes
being incorrectly marked as offline.
On 5/5/22 5:17 am, Steven Varga wrote:
Thank you for the quick reply! I know I am pushing my luck here: is it
possible to modify slurm: src/common/[read_conf.c, node_conf.c]
src/slurmctld/[read_config.c, ...] such that the state can be maintained
dynamically? -- or cheaper to write a job manager with fewer features but
supporting dynamic
I wanted to provide some elaboration on the new
CommunicationParameters=block_null_hash option based on initial feedback.
The original email said it was safe to enable after all daemons had been
restarted. Unfortunately that statement was incomplete - the flag can
only be safely enabled after
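Although the precondition is cut off above, the rest of the thread suggests it is "after every daemon, including slurmstepd processes belonging to jobs started under the old release, is running patched code." Enabling the flag itself is a one-line slurm.conf change (a sketch, under that assumption):

```shell
# slurm.conf -- add only once no unpatched slurmctld/slurmd/slurmstepd
# remains anywhere on the cluster (i.e. all pre-upgrade jobs have finished)
CommunicationParameters=block_null_hash
```

followed by "scontrol reconfig" (or a daemon restart) so the setting takes effect.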
On 5/5/22 16:08, Mark Dixon wrote:
On Thu, 5 May 2022, Ole Holm Nielsen wrote:
...
That is correct. Just do "scontrol reconfig" on the slurmctld server. If
all your slurmd's are truly running Configless[1], they will pick up the
new config and reconfigure without restarting.
Details are summarized in
https://wiki.fysik.dtu.dk/n
On 5/5/22 15:53, Ward Poelmans wrote:
Hi Steven,
I think truly dynamic adding and removing of nodes is something that's on
the roadmap for slurm 23.02?
Yes, see slide 37 in https://slurm.schedmd.com/SLUG21/Roadmap.pdf from the
Slurm publications site https://slurm.schedmd.com/publications.html
On Thu, 5 May 2022, Ole Holm Nielsen wrote:
...
That is correct. Just do "scontrol reconfig" on the slurmctld server. If
all your slurmd's are truly running Configless[1], they will pick up the
new config and reconfigure without restarting.
Details are summarized in
https://wiki.fysik.dtu.dk/n
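The configless update cycle Ole describes amounts to the following (paths and the spot-checked parameter are illustrative):

```shell
# 1. Edit the master copy of the config on the slurmctld host
vi /etc/slurm/slurm.conf

# 2. Tell slurmctld to re-read it; configless slurmds pick up the new
#    config and reconfigure -- no slurmd restarts needed
scontrol reconfig

# 3. Spot-check on a compute node that the change arrived, e.g. grep for
#    whichever parameter you just edited
scontrol show config | grep -i MaxJobCount
```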
Hi Steven,
I think truly dynamic adding and removing of nodes is something that's on the
roadmap for slurm 23.02?
Ward
On 5/05/2022 15:28, Steven Varga wrote:
Hi Tina,
Thank you for sharing. This matches my observations when I checked if slurm
could do what I am up to: manage AWS EC2 dynamic(spot) instances.
Hi Tina,
On 5/5/22 14:54, Tina Friedrich wrote:
Hi List,
out of curiosity - I would assume that if running configless, one doesn't
manually need to restart slurmd on the nodes if the config changes?
That is correct. Just do "scontrol reconfig" on the slurmctld server. If
all your slurmd's are truly running Configless[1], they will pick up the
new config and reconfigure without restarting.
@Tina,
I figure slurmd reads the config once and runs with it. You would need
to have it recheck regularly to see if there are any changes. This is
exactly what 'scontrol reconfig' does: it tells all the Slurm nodes to
recheck the config.
@Steven,
It seems to me you could just have a monitor
Hi Marcus,
On 5/5/22 14:45, Marcus Boden wrote:
we had similar issues on our systems. As I understand from the bug you
linked, we just need to wait until all the old jobs are finished (and the
old slurmstepd are gone). So a full drain should not be necessary?
Yes, I believe that sounds right.
Hi Tina,
Thank you for sharing. This matches my observations when I checked if slurm
could do what I am up to: manage AWS EC2 dynamic(spot) instances.
After replacing MySQL with REDIS, I now wonder what it would take to make
slurm node addition | removal dynamic. I've been looking at the source code
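On the dynamic-node question: besides the 23.02 roadmap item mentioned elsewhere in the thread, Slurm 22.05 introduced "dynamic nodes", which is roughly the shape a spot-instance manager needs. A hedged sketch (the node name, feature tag, and MaxNodeCount value are placeholders):

```shell
# slurm.conf: reserve headroom for nodes that are not statically listed
MaxNodeCount=64

# On a freshly launched spot instance, register the node with the
# controller at slurmd startup; -Z marks it as dynamic, and --conf can
# attach extra attributes beyond what slurmd detects itself
slurmd -Z --conf "Feature=spot"

# When the instance is reclaimed, remove the node from the controller
scontrol delete nodename=ec2-spot-01
```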
Hi List,
out of curiosity - I would assume that if running configless, one
doesn't manually need to restart slurmd on the nodes if the config changes?
Hi Steven,
I have no idea if you want to do it every couple of minutes and what the
implications of that are (although I've certainly managed
Hi Ole,
we had similar issues on our systems. As I understand from the bug you
linked, we just need to wait until all the old jobs are finished (and
the old slurmstepd are gone). So a full drain should not be necessary?
Best,
Marcus
On 05.05.22 13:53, Ole Holm Nielsen wrote:
Just a heads-up
Thank you for the quick reply! I know I am pushing my luck here: is it
possible to modify slurm: src/common/[read_conf.c, node_conf.c]
src/slurmctld/[read_config.c, ...] such that the state can be maintained
dynamically? -- or cheaper to write a job manager with fewer features but
supporting dynamic