Hello all

Since yesterday we’ve been having some trouble with Slurm, where it crashes and
isn’t able to recover.
I’ve managed to track the fault to a zero-sized file. Launching slurmctld -Dvvvv
reports:

slurmctld: File /mnt/nfs/lobo/IMM-NFS/slurm/hash.4/job.2044004/environment has zero size

That path is under our StateSaveLocation, so the environment file for this
particular job is not being created correctly.
I don’t believe it’s a space issue as there’s about 2TB of free space on this 
mountpoint.
Shouldn’t be permissions either, as other jobs run fine and get completed.
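
To double-check both points, here is a minimal Python sketch (not part of
Slurm; the function names find_zero_size_files and fs_headroom are made up for
illustration). It walks a directory tree and lists any zero-sized files, and
it reports free bytes and free inodes for the filesystem, since a mount can
have plenty of space and still be short on inodes. It assumes a Unix host,
and inode counts reported over NFS may not be meaningful.

```python
import os


def find_zero_size_files(root):
    """Return the sorted paths of regular files under root that are 0 bytes."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.isfile(path) and os.path.getsize(path) == 0:
                    empty.append(path)
            except OSError:
                # The file vanished or is unreadable; state files can change
                # while the controller is running, so skip it.
                continue
    return sorted(empty)


def fs_headroom(path):
    """Return (free_bytes, free_inodes) available to unprivileged users."""
    st = os.statvfs(path)
    # f_bavail is counted in units of f_frsize; f_favail is the number of
    # free inodes (some network filesystems report 0 or a dummy value).
    return st.f_bavail * st.f_frsize, st.f_favail
```

Running find_zero_size_files("/mnt/nfs/lobo/IMM-NFS/slurm") should list the
environment file above along with any other empty state files, assuming that
directory is the root of the StateSaveLocation.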

For now I’ve been launching slurmctld -i to work around this issue, killing the 
job in question.
This way Slurm keeps running for our users.

Any ideas where I should look next to try and troubleshoot this issue?

Thanks in advance for any help.

Best regards,
Pedro Luiz de Castro
IT Support & System Administrator
Information Systems
Faculdade de Medicina, Universidade de Lisboa
Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
iMM Lisboa general contact (+351) 217 999 411 - ext: 47356
imm.medicina.ulisboa.pt
