On 4/8/25 6:22 PM, Roman Gushchin wrote:
On Sat, Apr 05, 2025 at 10:40:10PM -0400, Waiman Long wrote:
The test_memcg_protection() function is used for the test_memcg_min and
test_memcg_low sub-tests. This function generates a set of parent/child
cgroups like:

   parent:  memory.min/low = 50M
   child 0: memory.min/low = 75M,  memory.current = 50M
   child 1: memory.min/low = 25M,  memory.current = 50M
   child 2: memory.min/low = 0,    memory.current = 50M

After applying memory pressure, the function expects the following
actual memory usages.

   parent:  memory.current ~= 50M
   child 0: memory.current ~= 29M
   child 1: memory.current ~= 21M
   child 2: memory.current ~= 0

In reality, the actual memory usages can differ quite a bit from the
expected values. The test uses an error tolerance of 10% with the
values_close() helper.

Both the test_memcg_min and test_memcg_low sub-tests can fail
sporadically because the actual memory usage exceeds the 10% error
tolerance. Below is a sample of the usage data from test runs
that failed.

   Child   Actual usage    Expected usage    %err
   -----   ------------    --------------    ----
     1       16990208         22020096      -12.9%
     1       17252352         22020096      -12.1%
     0       37699584         30408704      +10.7%
     1       14368768         22020096      -21.0%
     1       16871424         22020096      -13.2%

The current 10% error tolerance might have been right at the time
test_memcontrol.c was first introduced in the v4.18 kernel, but memory
reclaim has certainly evolved quite a bit since then, which may result
in a bit more run-to-run variation than previously expected.

Increase the error tolerance to 15% for child 0 and 20% for child 1 to
minimize the chance of this type of failure. The tolerance is bigger
for child 1 because an upswing in child 0 corresponds to a smaller
%err than a similar downswing in child 1 due to the way %err is used
in values_close().
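
The asymmetry described above can be sketched as follows. This assumes
values_close() compares |a - b| against err percent of (a + b), which is
consistent with the %err column in the table; pct_err(), MB() and
check_swing_example() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define MB(x) ((long)(x) << 20)

/* Assumed form of the helper: the tolerance is err% of (a + b). */
static int values_close(long a, long b, int err)
{
	return labs(a - b) <= (a + b) / 100 * err;
}

/* Hypothetical: signed error as a percentage of (actual + expected). */
static double pct_err(long actual, long expected)
{
	return 100.0 * (double)(actual - expected) /
	       (double)(actual + expected);
}

/*
 * The parent stays near 50M, so a 5M upswing in child 0 (29M -> 34M)
 * is matched by a 5M downswing in child 1 (21M -> 16M).
 */
static void check_swing_example(void)
{
	/* child 0: 5M / 63M is about 7.9%, inside a 10% tolerance */
	assert(values_close(MB(34), MB(29), 10));
	assert(pct_err(MB(34), MB(29)) < 8.0);

	/* child 1: 5M / 37M is about 13.5%, outside 10%, inside 15% */
	assert(!values_close(MB(16), MB(21), 10));
	assert(values_close(MB(16), MB(21), 15));
	assert(pct_err(MB(16), MB(21)) < -13.0);
}
```

Because the denominator is (a + b), the same absolute swing shrinks the
denominator for child 1 and grows it for child 0, so child 1 sees the
larger relative error.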

Before this patch, 100 test runs of test_memcontrol produced the
following results:

      17 not ok 1 test_memcg_min
      22 not ok 2 test_memcg_low

After applying this patch, there were no test failures for test_memcg_min
and test_memcg_low in 100 test runs.
Ideally we want to calculate these values dynamically based on the machine
size (number of cpus and total memory size).

We can calculate the memcg error margin and scale memcg sizes if necessary.
It's the only way to make it pass both on a 2-CPU VM and on a 512-CPU physical
server.

Not a blocker for this patch, just an idea for the future.

Thanks for the suggestion.

As I said in a previous mail, the test works by waiting until the memory.current of the parent is close to 50M; it then checks the memory.current of each child to see how much usage each of them has. I am not sure if the number of CPUs or the total memory size is really a factor here. We will probably need to run some experiments to find out. Anyway, that will be a future patch if they really are a factor.

Cheers,
Longman


Thanks!


