On Thu, 7 Nov 2024 22:31:21 GMT, Sergey Chernyshev <schernys...@openjdk.org> 
wrote:

>> Cgroup V1 subsustem fails to initialize mounted controllers properly in 
>> certain cases, that may lead to controllers left undetected/inactive. We 
>> observed the behavior in CloudFoundry deployments, it affects also host 
>> systems.
>> 
>> The relevant /proc/self/mountinfo line is
>> 
>> 
>> 2207 2196 0:43 
>> /system.slice/garden.service/garden/good/2f57368b-0eda-4e52-64d8-af5c 
>> /sys/fs/cgroup/cpu,cpuacct ro,nosuid,nodev,noexec,relatime master:25 - 
>> cgroup cgroup rw,cpu,cpuacct
>> 
>> 
>> /proc/self/cgroup:
>> 
>> 
>> 11:cpu,cpuacct:/system.slice/garden.service/garden/bad/2f57368b-0eda-4e52-64d8-af5c
>> 
>> 
>> Here, Java runs inside containerized process that is being moved cgroups due 
>> to load balancing.
>> 
>> Let's examine the condition at line 64 here 
>> https://github.com/openjdk/jdk/blob/55a7cf14453b6cd1de91362927b2fa63cba400a1/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp#L59-L72
>> It is always FALSE and the branch is never taken. The issue was spotted 
>> earlier by @jerboaa in 
>> [JDK-8288019](https://bugs.openjdk.org/browse/JDK-8288019). 
>> 
>> The original logic was intended to find the common prefix of `_root`and 
>> `cgroup_path` and concatenate the remaining suffix to the `_mount_point` 
>> (lines 67-68). That could lead to the following results: 
>> 
>> Example input
>> 
>> _root = "/a"
>> cgroup_path = "/a/b"
>> _mount_point = "/sys/fs/cgroup/cpu,cpuacct"
>> 
>> 
>> result _path
>> 
>> "/sys/fs/cgroup/cpu,cpuacct/b"
>> 
>> 
>> Here, cgroup_path comes from /proc/self/cgroup 3rd column. The man page 
>> (https://man7.org/linux/man-pages/man7/cgroups.7.html#NOTES) for control 
>> groups states:
>> 
>> 
>> ...
>>        /proc/pid/cgroup (since Linux 2.6.24)
>>               This file describes control groups to which the process
>>               with the corresponding PID belongs.  The displayed
>>               information differs for cgroups version 1 and version 2
>>               hierarchies.
>>               For each cgroup hierarchy of which the process is a
>>               member, there is one entry containing three colon-
>>               separated fields:
>> 
>>                   hierarchy-ID:controller-list:cgroup-path
>> 
>>               For example:
>> 
>>                   5:cpuacct,cpu,cpuset:/daemons
>> ...
>>               [3]  This field contains the pathname of the control group
>>                    in the hierarchy to which the process belongs. This
>>                    pathname is relative to the mount point of the
>>                    hierarchy.
>> 
>> 
>> This explicitly states the "pathname is relative t...
>
> Sergey Chernyshev has updated the pull request with a new target base due to 
> a merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains four additional 
> commits since the last revision:
> 
>  - Merge branch 'master' into JDK-8343191
>  - patch reimplemented
>  - fix the logic that skips duplicate controller's mount points
>  - 8343191: Cgroup v1 subsystem fails to set subsystem path

We really need to consider adding container tests for this use-case. Perhaps we 
can require root perms for it. The tricky part would be adding appropriate 
synchronization for the cgroup move of the shell process and subsequent `java` 
runs.

src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 54:

> 52:     ss.print_raw(_mount_point);
> 53:     if (strcmp(_root, "/") == 0) {
> 54:       // host processes / containers w/private cgroup namespace

Suggestion:

      // host processes and containers with cgroupns=private

src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 62:

> 60:       // containers only, warn if doesn't match
> 61:       log_warning(os, container)("Cgroup v1 controller (%s) mounting root 
> [%s] doesn't match cgroup [%s]",
> 62:         _mount_point, _root, cgroup_path);

Why this warning?

It appears it would make more sense to produce this warning when `cgroup_path` 
contains `../` and falling back to the `mount_path` for the subsystem path 
which indicates we have a `cgroupns=private` deployment on CG v1, but would 
likely get away with it since `memory.limit_in_bytes` will be present at the 
mount root.

If `cgroup_path` doesn't contain `../` we should append the `cgroup_path` to 
the `_mount_point` similar to what we do for cg v2. In the cloudflare case we'd 
end up with a subsystem path of 
`/sys/fs/cgroup/cpu,cpuacct/system.slice/garden.service/garden/bad/2f57368b-0eda-4e52-64d8-af5c`.
 Since the `cgroup_path != _root` we trigger path adjustment increasing the 
chance to detect any lower limit in any of the paths down to the mount point.

src/java.base/linux/classes/jdk/internal/platform/cgroupv1/CgroupV1SubsystemController.java
 line 46:

> 44:     }
> 45: 
> 46:     public void setPath(String cgroupPath) {

This should behave the same as Hotspot and also append the cgroup path to the 
mount point. Then let 
[JDK-8336881](https://bugs.openjdk.org/browse/JDK-8336881) kick in to reduce it 
down to the mount point (if necessary).

-------------

PR Review: https://git.openjdk.org/jdk/pull/21808#pullrequestreview-2430488512
PR Review Comment: https://git.openjdk.org/jdk/pull/21808#discussion_r1838641975
PR Review Comment: https://git.openjdk.org/jdk/pull/21808#discussion_r1838656361
PR Review Comment: https://git.openjdk.org/jdk/pull/21808#discussion_r1838666453

Reply via email to