Community:
What do you do to ensure database reliability in your SLURM environment? We
can have multiple controllers and multiple slurmdbds, but my understanding is
that slurmdbd can be configured with a single MySQL server, so what do you do?
Do you have that “single MySQL server” be a cluster?
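(As a sketch of one common arrangement, not a recommendation: slurmdbd talks to one storage host at a time, but slurmdbd.conf does accept a StorageBackupHost, and the MySQL side is usually made redundant outside of Slurm, e.g. a Galera cluster or a primary/replica pair behind a virtual IP. Hostnames and credentials below are placeholders, not anything from the original post.)

    # slurmdbd.conf (illustrative sketch only)
    DbdHost=slurmdbd1
    DbdBackupHost=slurmdbd2              # redundant slurmdbd daemon
    StorageType=accounting_storage/mysql
    StorageHost=db-vip.example.org       # VIP in front of a replicated MySQL setup
    StorageBackupHost=db2.example.org    # tried if the primary storage host is unreachable
    StoragePort=3306
    StorageUser=slurm
    StoragePass=changeme
    StorageLoc=slurm_acct_db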
I think it would be useful, yes, and mostly for the epilog script.
In the job script itself, you are creating such files, so some of the
proposed use cases are a bit tricky to get right in the way you described
them. For example, if you scp these files, you are scp'ing them in the state they were in before
Hi Tim and community,
We started having the same issue (cgroups not working, it seems, with all GPUs
visible to jobs) on a GPU-compute node (DGX A100) a couple of days ago, after
a full update (apt upgrade). Now whenever we launch a job on that partition,
we get the error message mentioned by Tim.
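(In case it helps while debugging: for GPU confinement the pieces that normally have to agree are ConstrainDevices in cgroup.conf, the cgroup task plugin in slurm.conf, and the node's gres.conf. A minimal sketch follows; the device paths and GPU count are assumptions for a DGX A100, not taken from the original post.)

    # slurm.conf (relevant lines only; sketch)
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # cgroup.conf (sketch)
    ConstrainDevices=yes     # without this, jobs can see every GPU on the node
    ConstrainCores=yes
    ConstrainRAMSpace=yes

    # gres.conf on the node (sketch; either autodetect via NVML or explicit device files)
    AutoDetect=nvml
    # Name=gpu Type=a100 File=/dev/nvidia[0-7]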
I'm trying to set up gang scheduling with Slurm on my single-node server so
the people at the lab can run experiments without blocking each other (so
if, say, someone has to run some code that takes days to finish, other shorter
jobs have the chance to run alternately with it, and so they don't