On a cluster I managed (without Lustre), we had many problems with users running nodes out of RAM, which often killed the node.  We added cgroup support to Slurm and those problems disappeared: nearly 100% of the time we'd get a cgroup OOM instead of a kernel OOM, and the nodes would stay up and stable.  This became doubly important when we started allowing jobs to share nodes and didn't want job A to be able to crash job B.
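
In case it's useful, the setup was roughly the standard Slurm cgroup configuration (a sketch from memory, so verify the parameter names against your Slurm version's docs):

    # slurm.conf -- track and contain tasks with cgroups
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # cgroup.conf -- enforce each job's requested memory
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes

With that in place, a job that exceeds its requested memory is killed by its own cgroup's OOM handler rather than taking down the node.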

I've tried the same on a Lustre-enabled cluster, and the culprit seems to be the memory used by Lustre, which I believe lives in the kernel and therefore outside the job's cgroup.  Part of the problem, I think, is that Lustre caches metadata in the Linux page cache, but not data.  I've tried reducing the RAM available to Slurm, but I'm still getting kernel OOMs instead of cgroup OOMs.
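
To be concrete about "reducing the RAM": I lowered the memory the nodes advertise so Slurm would leave headroom for Lustre, roughly like this (the numbers are made up for illustration):

    # slurm.conf -- node has 256 GB physical; advertise less so jobs
    # can't claim the memory Lustre's kernel-side caches will use
    NodeName=node[001-064] RealMemory=245760    # ~240 of 256 GB

Even with that headroom the kernel OOM killer still fires, so presumably Lustre's usage can spike past whatever margin I leave.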

Anyone have a suggestion for fixing this?  Is there any way to limit Lustre's memory use in the kernel?  Or force that caching into userspace and inside the cgroup?  Or possibly move it out of RAM and onto a client-local NVMe?
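
By "limit Lustre's memory use" I mean something along these lines, assuming the client tunables actually cap what I think they cap (I haven't verified either):

    # cap the client's per-mount data cache, e.g. at 16 GB
    lctl set_param llite.*.max_cached_mb=16384

    # shrink the DLM lock LRU that pins locks/metadata in kernel memory
    lctl set_param ldlm.namespaces.*.lru_size=1024

If anyone has experience with whether these (or something else) actually keep a client's kernel-side footprint bounded, I'd appreciate pointers.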
