> On Thu, Sep 26, 2024 at 08:46:17AM GMT, Dmitry Dolgov wrote:
> > On Thu, Sep 26, 2024 at 07:57:12AM GMT, Gabriele Bartolini wrote:
> > Hi Dmitry,
> >
> > I've been attempting to replicate this issue directly in Kubernetes, but I
> > haven't been successful so far. I've been using EKS nodes, and it seems
> > that they all run cgroup v2 now. Do you have anything that could help me
> > get started on this more quickly?
>
> Thanks for testing. I can check if I can get some EKS clusters to
> experiment with. In the meantime, what about the reproducing script for
> cgroup v2 (the plain one that I've attached with the patch, that doesn't
> require any k8s cluster), doesn't it work for you?

Looks like there is a plot twist. After talking to Gabriele off list and
testing on EKS, I've discovered that since 5.7 the Linux kernel supports
hugetlb reservation via hugetlbfs [1]. That means that, in addition to the
original limit enforced at page fault time, there is now one enforced at
reservation time, which has a separate knob in cgroupfs:

    # cgroup v2, hugetlb controller
    #
    # original limit, page fault level
    hugetlb.2MB.limit_in_bytes
    #
    # new one, reservation level
    hugetlb.2MB.rsvd.limit_in_bytes

This means that there could still be people facing the original issue the
patch is trying to address: for that one needs to either run an older
kernel, or have a container orchestration tool that does not set the rsvd
value (it looks like there are such examples). But in the long term I would
expect everyone to converge on using reservation limits correctly, so maybe
the patch is not needed after all.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cdc2fcfea79b9873bb63159f8ed973f4046018c8
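
To check which of the two knobs a given container actually has set, something
like the following could be used. It is only a rough sketch under several
assumptions: the helper names (read_limit, fault_limit_without_rsvd) are made
up for illustration, the cgroup directory has to be supplied by the caller,
and besides the file names quoted above it also tries the hugetlb.2MB.max and
hugetlb.2MB.rsvd.max spellings, which is how I understand the cgroup v2
interface names these files. A missing file or the literal value "max" is
treated as "no limit"; the very large number that cgroup v1 reports for an
unlimited group is not special-cased.

```python
from pathlib import Path
from typing import Optional

# Candidate file names. The first entry of each pair is spelled as in the
# message above; the second is the assumed cgroup v2 spelling.
FAULT_FILES = ("hugetlb.{size}.limit_in_bytes", "hugetlb.{size}.max")
RSVD_FILES = ("hugetlb.{size}.rsvd.limit_in_bytes", "hugetlb.{size}.rsvd.max")


def read_limit(cgroup_dir, templates, size="2MB") -> Optional[int]:
    """Return the limit in bytes from the first file found, or None when
    no file exists or the file contains "max" (assumed to mean unlimited)."""
    for template in templates:
        path = Path(cgroup_dir) / template.format(size=size)
        if path.is_file():
            text = path.read_text().strip()
            return None if text == "max" else int(text)
    return None


def fault_limit_without_rsvd(cgroup_dir, size="2MB") -> bool:
    """True when a page-fault-level limit is set but no reservation-level
    limit is, i.e. the configuration described as still exposed above."""
    fault = read_limit(cgroup_dir, FAULT_FILES, size)
    rsvd = read_limit(cgroup_dir, RSVD_FILES, size)
    return fault is not None and rsvd is None
```

Calling fault_limit_without_rsvd() on the cgroup directory of a container
(wherever that is mounted on the node in question) would then tell whether
only the page fault level limit is in place.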