This is basically always somebody filling up /tmp, with /tmp residing on the same filesystem as the actual SlurmdSpoolDir.
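A quick way to check whether that is the situation on an affected node is to compare the device IDs of the two directories. The following is only a rough sketch: the helper names `same_filesystem` and `free_gib` are made up for illustration, and `/var/spool/slurmd` is an assumed path (the documented default), so substitute whatever your slurm.conf actually sets.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """Return True if both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # Assumption: SlurmdSpoolDir is at the documented default location.
    # Replace this with the value from your own slurm.conf.
    spool_dir = "/var/spool/slurmd"
    tmp_dir = "/tmp"

    for path in (spool_dir, tmp_dir):
        if not os.path.exists(path):
            raise SystemExit(f"{path} does not exist on this node")

    if same_filesystem(spool_dir, tmp_dir):
        print(f"{tmp_dir} and {spool_dir} share a filesystem; "
              f"anything filling {tmp_dir} also fills the spool dir")
    else:
        print(f"{tmp_dir} and {spool_dir} are on different filesystems")

    print(f"free on {spool_dir}: {free_gib(spool_dir):.1f} GiB")
    print(f"free on {tmp_dir}: {free_gib(tmp_dir):.1f} GiB")
```

On a running node, `scontrol show config | grep SlurmdSpoolDir` should show the path actually in use. If the two directories do share a filesystem, job scratch files written to /tmp would be the first thing to suspect.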
Without modifications, /tmp is almost certainly the wrong place for temporary HPC files. They are too large.

> On Dec 8, 2023, at 10:02, Xaver Stiensmeier <xaverstiensme...@gmx.de> wrote:
>
> Dear slurm-user list,
>
> during a larger cluster run (the same one I mentioned earlier, 242 nodes), I
> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
> directory on the workers that is used for job state information
> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
> I was unable to find more precise information on that directory. We
> compute all data on another volume so SlurmdSpoolDir has roughly 38 GB
> of free space where nothing is intentionally put during the run. This
> error only occurred on very few nodes.
>
> I would like to understand what Slurmd is placing in this dir that fills
> up the space. Do you have any ideas? Due to the workflow used, we have a
> hard time reconstructing the exact scenario that caused this error. I
> guess the "fix" is to just pick a bit larger disk, but I am unsure
> whether Slurm behaves normally here.
>
> Best regards
> Xaver Stiensmeier