Since nobody has replied after this: if the nodes are incapable of running
the jobs due to insufficient resources, the default
“EnforcePartLimits=No” [1] might be the issue. That setting allows a job to
stay queued even if it is impossible to run.
[1] https://slurm.schedmd.com/slurm.conf.
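For what it is worth, a minimal sketch of the slurm.conf change (assuming
you want such submissions rejected outright; check what fits your site):

    # slurm.conf: reject jobs at submit time if they can never satisfy
    # the limits of the requested partition(s)
    EnforcePartLimits=ALL

After editing slurm.conf the change still has to be picked up, e.g. with
“scontrol reconfigure”.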
Hi all,
We hit a snag when updating our clusters from Slurm 23.02 to 24.05. After
updating the slurmdbd, our multi-cluster setup was broken until everything
had been updated to 24.05. We had not anticipated this.
SchedMD says that fixing it would be a very complex operation.
Hence, this warning to others planning a similar upgrade.
Hi Guillaume,
as Rob already mentioned, this could be a way for you (a partition created
just temporarily for testing). You could also add MaxTRES=node=1 for more
restrictions. We do something similar with a QOS to restrict the number of
CPUs per user in certain partitions.
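For reference, a rough sketch of such a setup (the QOS name, partition name
and limit values below are made up, not taken from this thread):

    # create a QOS, cap CPUs per user, and limit each job to one node
    sacctmgr add qos restricted
    sacctmgr modify qos restricted set MaxTRESPerUser=cpu=64
    sacctmgr modify qos restricted set MaxTRES=node=1

    # slurm.conf: apply the QOS limits to every job in the partition
    PartitionName=test Nodes=node[01-04] QOS=restricted

Using it as a partition QOS means users do not have to request the QOS
themselves; its limits apply to all jobs submitted to that partition.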
The trick, I think (and Guillaume can certainly correct me), is that the aim
is to allow the user to run as many (up to) 200G mem jobs as they want, so
long as they do not consume more than 200G on any single node. So they could
run 10 200G jobs on 10 different nodes. So the mem limit isn't a plain
per-user total; it has to be per user per node.
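For contrast, the per-user memory limit that Slurm does support directly
(and which is not what is wanted here) would look something like this (QOS
name assumed):

    # caps the user's total memory across all running jobs in this QOS,
    # i.e. cluster-wide, not per node (value in MB, ~200G)
    sacctmgr modify qos normal set MaxTRESPerUser=mem=200000

That limit would also block the ten 200G jobs on ten different nodes
described above, which is exactly what should remain allowed.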
I am pretty sure there is no way to do exactly a per user per node limit
in SLURM. I cannot think of a good reason why one would do this. Can
you explain?
I don't see why it matters, if you have two users submitting two 200G jobs,
whether those jobs are spread out over two nodes rather than landing on one.
Hello,
Thank you all for your answers.
Carsten, as said by Rob, we need a limit per node, not only per user.
Paul, we know we are asking for something quite unorthodox. The thing is, we
overbook the memory on our cluster (i.e., if a worker has 200G of memory, Slurm
can allocate up to 280G on it).
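For illustration only (the thread does not say how the overbooking is
configured; the node name and numbers below are made up), one common way to
overbook is to advertise a RealMemory larger than the physical RAM and tell
slurmd not to complain about the mismatch:

    # slurm.conf: node physically has 200G, but Slurm may allocate ~280G
    SlurmdParameters=config_overrides
    NodeName=worker01 CPUs=32 RealMemory=286720   # MB, roughly 280G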