Hello,

I am using cgroups to track processes and limit memory. Occasionally a job
will use too much memory, and instead of being killed it ends up in an
unkillable state waiting on NFS I/O. There are no other signs of NFS
trouble; in fact, other jobs (even on the same node) communicate with the
same NFS server without problems at the same time. I only get hung-task
errors for the one process that exceeded its memory limit.
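
For context, this is roughly how I inspect the stuck task when it happens.
The PID here is a placeholder, and the cgroup path assumes cgroup v1 with
the memory controller mounted; adjust for your setup:

```shell
pid=$$   # placeholder: substitute the PID of the stuck job process

# Process state: "D" means uninterruptible sleep, usually blocked on I/O.
ps -o pid,stat,wchan:32,comm -p "$pid"

# Kernel stack of the task, to confirm it is waiting inside the NFS client
# (may require root to read; ignore the error if not).
cat /proc/"$pid"/stack 2>/dev/null || true

# Memory cgroup of the task, to check whether its limit was hit
# (cgroup v1 path; silently skipped if the hierarchy differs).
cgdir=$(awk -F: '/memory/ {print $3}' /proc/"$pid"/cgroup)
cat /sys/fs/cgroup/memory"$cgdir"/memory.failcnt 2>/dev/null || true
```

When it hangs, the process shows state D and the kernel stack is deep in
NFS/RPC wait paths, which matches the hung-task messages in dmesg.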

Has anyone else run into this? Searching this mailing list's archive I
found some similar reports, but those seemed to concern installing Slurm
itself on an NFSv4 mount rather than jobs simply using an NFSv4 mount.

Any advice is greatly appreciated.

Thanks,
Brendan