Hello all,

Since yesterday we’ve been having some trouble with Slurm, where it crashes and isn’t able to recover. I’ve managed to track the fault to a zero-sized file by launching slurmctld -Dvvvv:

slurmctld: File /mnt/nfs/lobo/IMM-NFS/slurm/hash.4/job.2044004/environment has zero size

That path is under our StateSaveLocation, so the environment file for this particular job is not being created correctly. I don’t believe it’s a space issue, as there’s about 2TB of free space on this mountpoint. It shouldn’t be permissions either, as other jobs run fine and complete.

For now I’ve been launching slurmctld -i to work around this issue, killing the job in question. This way Slurm can keep running for our users.

Any ideas where I should look next to troubleshoot this issue?

Thanks in advance for all the help.

Best regards,

Pedro Luiz de Castro
IT Support & System Administrator
Information Systems
Faculdade de Medicina, Universidade de Lisboa
Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
iMM Lisboa general contact (+351) 217 999 411 - ext: 47356
imm.medicina.ulisboa.pt