[slurm-users] Database cluster

2024-01-22 Thread Daniel L'Hommedieu
Community: What do you do to ensure database reliability in your SLURM environment? We can have multiple controllers and multiple slurmdbds, but my understanding is that slurmdbd can be configured with a single MySQL server, so what do you do? Do you have that “single MySQL server” be a clust

Re: [slurm-users] propose environment variables SLURM_STDOUT, SLURM_STDERR, SLURM_STDIN

2024-01-22 Thread Davide DelVento
I think it would be useful, yes, and mostly for the epilog script. In the job script itself, you are creating such files, so some of the proposed use cases are a bit tricky to get right in the way you described them. For example, if you scp these files, you are scp'ing them to their status before

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-22 Thread Cristóbal Navarro
Hi Tim and community, We have had the same issue (cgroups apparently not working, so jobs see all GPUs) on a GPU compute node (DGX A100) since a full update (apt upgrade) a couple of days ago. Now whenever we launch a job on that partition, we get the error message mentioned by Tim.

[slurm-users] Tried setting up GANG scheduling for timeslicing, but jobs are not alternating

2024-01-22 Thread Francisco José Letterio
I'm trying to set up GANG scheduling with Slurm on my single-node server so the people at the lab can run experiments without blocking each other (so if, say, someone has to run some code that takes days to finish, shorter jobs have the chance to run alternately with it and so they don't