On 3/11/26 02:12, Chen Ridong wrote:
On 2026/3/2 20:37, Natalie Vock wrote:
Callers can use this feedback to be more aggressive in making space for
allocations of a cgroup if they know it is protected.
These are counterparts to memcg's mem_cgroup_below_{min,low}.
Signed-off-by: Natalie Vock <[email protected]>
---
include/linux/cgroup_dmem.h | 16 ++++++++++++
kernel/cgroup/dmem.c | 62 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 78 insertions(+)
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..1a88cd0c9eb00 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -24,6 +24,10 @@ void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
bool ignore_low, bool *ret_hit_low);
+bool dmem_cgroup_below_min(struct dmem_cgroup_pool_state *root,
+ struct dmem_cgroup_pool_state *test);
+bool dmem_cgroup_below_low(struct dmem_cgroup_pool_state *root,
+ struct dmem_cgroup_pool_state *test);
void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
#else
@@ -59,6 +63,18 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
return true;
}
+static inline bool dmem_cgroup_below_min(struct dmem_cgroup_pool_state *root,
+ struct dmem_cgroup_pool_state *test)
+{
+ return false;
+}
+
+static inline bool dmem_cgroup_below_low(struct dmem_cgroup_pool_state *root,
+ struct dmem_cgroup_pool_state *test)
+{
+ return false;
+}
+
static inline void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool)
{ }
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 9d95824dc6fa0..28227405f7cfe 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -694,6 +694,68 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
}
EXPORT_SYMBOL_GPL(dmem_cgroup_try_charge);
+/**
+ * dmem_cgroup_below_min() - Tests whether current usage is within min limit.
+ *
+ * @root: Root of the subtree to calculate protection for, or NULL to calculate global protection.
+ * @test: The pool to test the usage/min limit of.
+ *
+ * Return: true if usage is below min and the cgroup is protected, false otherwise.
+ */
+bool dmem_cgroup_below_min(struct dmem_cgroup_pool_state *root,
+ struct dmem_cgroup_pool_state *test)
+{
+ if (root == test || !pool_parent(test))
+ return false;
+
+ if (!root) {
+ for (root = test; pool_parent(root); root = pool_parent(root))
+ {}
+ }
It seems we don't have to find the global protection (root), since the root's
protection cannot be set. If !root, we can return false directly, right?
Or am I missing something?
```
{
.name = "min",
.write = dmem_cgroup_region_min_write,
.seq_show = dmem_cgroup_region_min_show,
.flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "low",
.write = dmem_cgroup_region_low_write,
.seq_show = dmem_cgroup_region_low_show,
.flags = CFTYPE_NOT_ON_ROOT,
},
```
That's not quite how it works. You're correct that the min/low
properties don't exist on the root cgroup, but we never read the root's
own min/low settings here.
The reason we have a root here in the first place has to do with how
recursive memory protection works in cgroups. Note that for the test
cgroup, we don't read the literal min/low protection setting, but the
"emin"/"elow" value, referring to effective protection. The effective
protection value depends not just on the settings of the "test" cgroup,
but also its ancestors (and potentially, their sibling groups). See [1]
for some documentation on how effective protection varies with different
cgroup relationships.
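
To make the dependence on ancestors and siblings a bit more concrete, here is a deliberately simplified sketch in C. It is not the kernel's implementation: toy_effective_protection() and its parameters are hypothetical names, and the model only covers the basic proportional-distribution case described in [1], ignoring any redistribution of unclaimed protection.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel function.
 *
 * usage:              current usage of the child
 * setting:            the child's configured min (or low) value
 * parent_effective:   effective protection of the child's parent
 * siblings_protected: sum of min(usage, setting) over all children of the
 *                     parent, including this child
 */
static uint64_t toy_effective_protection(uint64_t usage, uint64_t setting,
					 uint64_t parent_effective,
					 uint64_t siblings_protected)
{
	/* A child can only claim protection for memory it actually uses. */
	uint64_t claimed = usage < setting ? usage : setting;

	/*
	 * If the children together claim more than the parent can pass
	 * down, each child gets a share proportional to its own claim.
	 */
	if (siblings_protected > parent_effective)
		return claimed * parent_effective / siblings_protected;

	return claimed;
}
```

For instance, with a parent whose effective protection is 5 and two children claiming 6 and 4, the children end up with effective protections of 3 and 2 in this model. If the parent could pass down 20 instead, both children would keep their full claims.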
The "root" parameter here refers to the root of the common subtree
between the test cgroup and what the documentation refers to as the
"reclaim target". For device memory there usually isn't really any
reclaim happening in the traditional sense, but e.g. TTM evictions
follow the same principle (the reclaim target is simply the cgroup
owning the buffer that is to be evicted).
Sometimes, precise reclaim targets may not really be known yet (or we
want to try evicting different buffers originating from different
cgroups). In that case, the "root" parameter here is NULL. However, we
obviously know that all cgroups must be descendants of the root cgroup,
so the root cgroup is a guaranteed safe value for the shared subtree
between the test cgroup and any potential reclaim target.
In practice, this means that the effective min/low protection will be
capped by the protection values configured on every ancestor, which is
the most conservative estimate.
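
As a rough illustration of why a NULL root gives the most conservative answer, consider the following toy model in C. Everything here is an assumption made for the example: struct toy_pool stands in for struct dmem_cgroup_pool_state, the toy_*() helpers are hypothetical, and effective protection is reduced to "the smallest configured min between the test pool and the root", with sibling distribution ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct dmem_cgroup_pool_state. */
struct toy_pool {
	struct toy_pool *parent;	/* NULL for the root cgroup's pool */
	uint64_t min;			/* configured min protection */
	uint64_t usage;			/* current usage */
};

/* With no known reclaim target, fall back to the top of the hierarchy. */
static struct toy_pool *toy_resolve_root(struct toy_pool *root,
					 struct toy_pool *test)
{
	if (root)
		return root;

	for (root = test; root->parent; root = root->parent)
		;
	return root;
}

/*
 * Simplified effective min: the smallest configured min on the path from
 * test up to, but not including, root.
 */
static uint64_t toy_emin(struct toy_pool *root, struct toy_pool *test)
{
	uint64_t emin = test->min;

	for (struct toy_pool *p = test->parent; p && p != root; p = p->parent) {
		if (p->min < emin)
			emin = p->min;
	}
	return emin;
}

static bool toy_below_min(struct toy_pool *root, struct toy_pool *test)
{
	/* The subtree root itself and the root cgroup are never protected. */
	if (root == test || !test->parent)
		return false;

	root = toy_resolve_root(root, test);
	return test->usage <= toy_emin(root, test);
}
```

Take a hierarchy root -> A (min 4) -> B (min 16, usage 8). With root set to A, only B's own setting counts and B is reported as protected. With root set to NULL, A's smaller min caps B's effective protection at 4, so B is no longer reported as protected.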
Regards,
Natalie
[1] https://docs.kernel.org/admin-guide/cgroup-v2.html#reclaim-protection
+
+ /*
+ * In mem_cgroup_below_min(), the memcg counterpart, this call is missing.
+ * mem_cgroup_below_min() gets called during traversal of the cgroup tree, where
+ * protection is already calculated as part of the traversal. dmem cgroup eviction
+ * does not traverse the cgroup tree, so we need to recalculate effective protection
+ * here.
+ */
+ dmem_cgroup_calculate_protection(root, test);
+ return page_counter_read(&test->cnt) <= READ_ONCE(test->cnt.emin);
+}
+EXPORT_SYMBOL_GPL(dmem_cgroup_below_min);
+
+/**
+ * dmem_cgroup_below_low() - Tests whether current usage is within low limit.
+ *
+ * @root: Root of the subtree to calculate protection for, or NULL to calculate global protection.
+ * @test: The pool to test the usage/low limit of.
+ *
+ * Return: true if usage is below low and the cgroup is protected, false otherwise.
+ */
+bool dmem_cgroup_below_low(struct dmem_cgroup_pool_state *root,
+ struct dmem_cgroup_pool_state *test)
+{
+ if (root == test || !pool_parent(test))
+ return false;
+
+ if (!root) {
+ for (root = test; pool_parent(root); root = pool_parent(root))
+ {}
+ }
+
+ /*
+ * In mem_cgroup_below_low(), the memcg counterpart, this call is missing.
+ * mem_cgroup_below_low() gets called during traversal of the cgroup tree, where
+ * protection is already calculated as part of the traversal. dmem cgroup eviction
+ * does not traverse the cgroup tree, so we need to recalculate effective protection
+ * here.
+ */
+ dmem_cgroup_calculate_protection(root, test);
+ return page_counter_read(&test->cnt) <= READ_ONCE(test->cnt.elow);
+}
+EXPORT_SYMBOL_GPL(dmem_cgroup_below_low);
+
static int dmem_cgroup_region_capacity_show(struct seq_file *sf, void *v)
{
struct dmem_cgroup_region *region;