Hello, I am using cgroups to track processes and limit memory. Occasionally a job will use too much memory, and instead of being killed it ends up in an unkillable state waiting on NFS I/O. There are no other signs of NFS trouble; in fact, other jobs (even on the same node) have no problem communicating with the same NFS server at the same time. I just get hung task errors for the one process that used too much memory.
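
For what it's worth, here is a minimal sketch of one way to spot the stuck task from the node and dump its kernel stack. The cgroup path is just a placeholder assuming a cgroup v1 memory controller layout (adjust for your setup), and reading /proc/<pid>/stack needs root:

#!/usr/bin/env python3
# Sketch: list tasks in a job's cgroup that are in uninterruptible
# sleep (D state) and print their kernel stacks.
import pathlib
import sys

# Placeholder path for a job's cgroup under the cgroup v1 memory
# controller; replace the UID and job ID with real values.
CGROUP = pathlib.Path("/sys/fs/cgroup/memory/slurm/uid_1000/job_12345")

def main() -> None:
    try:
        pids = (CGROUP / "cgroup.procs").read_text().split()
    except FileNotFoundError:
        sys.exit(f"cgroup not found: {CGROUP}")

    for pid in pids:
        proc = pathlib.Path("/proc") / pid
        try:
            # Third field of /proc/<pid>/stat is the process state;
            # 'D' means uninterruptible sleep (usually blocked on I/O).
            state = proc.joinpath("stat").read_text().rsplit(")", 1)[1].split()[0]
        except (FileNotFoundError, IndexError):
            continue  # process exited between listing and reading
        if state != "D":
            continue
        comm = proc.joinpath("comm").read_text().strip()
        print(f"PID {pid} ({comm}) is in D state; kernel stack:")
        try:
            print(proc.joinpath("stack").read_text())
        except PermissionError:
            print("  (need root to read /proc/<pid>/stack)")

if __name__ == "__main__":
    main()

In my case the stack for the stuck process shows it waiting on NFS I/O.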
Has anyone else run into this? Searching the mailing list archive I found some similar threads, but those seemed to concern installing Slurm itself on an NFSv4 mount rather than jobs simply using an NFSv4 mount. Any advice is greatly appreciated. Thanks, Brendan
