On Fri, Apr 16, 2021 at 02:18:10PM +0000, Dennis Zhou wrote:
> Hello,
>
> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > Hello Roman,
> >
> > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> >
> > My results of the percpu_test are as follows:
> > Intel KVM 4CPU:4G
> > Vanilla 5.12-rc6
> > # ./percpu_test.sh
> > Percpu:             1952 kB
> > Percpu:           219648 kB
> > Percpu:           219648 kB
> >
> > 5.12-rc6 + with patchset applied
> > # ./percpu_test.sh
> > Percpu:             2080 kB
> > Percpu:           219712 kB
> > Percpu:            72672 kB
> >
> > I'm able to see an improvement comparable to the one you're seeing too.
> >
> > However, on POWERPC I'm unable to reproduce these improvements with the
> > patchset in the same configuration.
> >
> > POWER9 KVM 4CPU:4G
> > Vanilla 5.12-rc6
> > # ./percpu_test.sh
> > Percpu:             5888 kB
> > Percpu:           118272 kB
> > Percpu:           118272 kB
> >
> > 5.12-rc6 + with patchset applied
> > # ./percpu_test.sh
> > Percpu:             6144 kB
> > Percpu:           119040 kB
> > Percpu:           119040 kB
> >
> > I'm wondering if there's any architecture-specific code that needs plumbing
> > here?
>
> There shouldn't be. Can you send me the percpu_stats debug output before
> and after?
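
For reference, a minimal way to capture that output (a sketch; it assumes a
kernel built with CONFIG_PERCPU_STATS=y, debugfs mounted at
/sys/kernel/debug, and percpu_test.sh in the current directory):

--
#!/bin/bash

# Dump the percpu allocator debug stats before and after the test run.
STATS=/sys/kernel/debug/percpu_stats

cat "$STATS" > percpu_stats.before
./percpu_test.sh
cat "$STATS" > percpu_stats.after

# Compare the two snapshots.
diff -u percpu_stats.before percpu_stats.after
--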
Btw, sidelined chunks are not listed in the debug output. It was actually
on my to-do list, looks like I need to prioritize it a bit.

> > I will also look through the code to find the reason why POWER isn't
> > depopulating pages.
> >
> > Thank you,
> > Pratik
> >
> > On 08/04/21 9:27 am, Roman Gushchin wrote:
> > > In our production experience the percpu memory allocator is sometimes
> > > struggling with returning the memory to the system. A typical example
> > > is a creation of several thousands memory cgroups (each has several
> > > chunks of the percpu data used for vmstats, vmevents, ref counters etc).
> > > Deletion and complete releasing of these cgroups doesn't always lead to
> > > a shrinkage of the percpu memory, so that sometimes there are several
> > > GB's of memory wasted.
> > >
> > > The underlying problem is the fragmentation: to release an underlying
> > > chunk all percpu allocations should be released first. The percpu
> > > allocator tends to top up chunks to improve the utilization. It means
> > > new small-ish allocations (e.g. percpu ref counters) are placed onto
> > > almost filled old-ish chunks, effectively pinning them in memory.
> > >
> > > This patchset solves this problem by implementing a partial depopulation
> > > of percpu chunks: chunks with many empty pages are being asynchronously
> > > depopulated and the pages are returned to the system.
> > >
> > > To illustrate the problem the following script can be used:
> > >
> > > --
> > > #!/bin/bash
> > >
> > > cd /sys/fs/cgroup
> > >
> > > mkdir percpu_test
> > > echo "+memory" > percpu_test/cgroup.subtree_control
> > >
> > > cat /proc/meminfo | grep Percpu
> > >
> > > for i in `seq 1 1000`; do
> > >     mkdir percpu_test/cg_"${i}"
> > >     for j in `seq 1 10`; do
> > >         mkdir percpu_test/cg_"${i}"_"${j}"
> > >     done
> > > done
> > >
> > > cat /proc/meminfo | grep Percpu
> > >
> > > for i in `seq 1 1000`; do
> > >     for j in `seq 1 10`; do
> > >         rmdir percpu_test/cg_"${i}"_"${j}"
> > >     done
> > > done
> > >
> > > sleep 10
> > >
> > > cat /proc/meminfo | grep Percpu
> > >
> > > for i in `seq 1 1000`; do
> > >     rmdir percpu_test/cg_"${i}"
> > > done
> > >
> > > rmdir percpu_test
> > > --
> > >
> > > It creates 11000 memory cgroups and removes every 10 out of 11.
> > > It prints the initial size of the percpu memory, the size after
> > > creating all cgroups and the size after deleting most of them.
> > >
> > > Results:
> > >   vanilla:
> > >     ./percpu_test.sh
> > >     Percpu:             7488 kB
> > >     Percpu:           481152 kB
> > >     Percpu:           481152 kB
> > >
> > >   with this patchset applied:
> > >     ./percpu_test.sh
> > >     Percpu:             7488 kB
> > >     Percpu:           481408 kB
> > >     Percpu:           135552 kB
> > >
> > > So the total size of the percpu memory was reduced by more than 3.5 times.
> > >
> > > v3:
> > >  - introduced pcpu_check_chunk_hint()
> > >  - fixed a bug related to the hint check
> > >  - minor cosmetic changes
> > >  - s/pretends/fixes (cc Vlastimil)
> > >
> > > v2:
> > >  - depopulated chunks are sidelined
> > >  - depopulation happens in the reverse order
> > >  - depopulate list made per-chunk type
> > >  - better results due to better heuristics
> > >
> > > v1:
> > >  - depopulation heuristics changed and optimized
> > >  - chunks are put into a separate list, depopulation scan this list
> > >  - chunk->isolated is introduced, chunk->depopulate is dropped
> > >  - rearranged patches a bit
> > >  - fixed a panic discovered by krobot
> > >  - made pcpu_nr_empty_pop_pages per chunk type
> > >  - minor fixes
> > >
> > > rfc:
> > > https://lwn.net/Articles/850508/
> > >
> > >
> > > Roman Gushchin (6):
> > >   percpu: fix a comment about the chunks ordering
> > >   percpu: split __pcpu_balance_workfn()
> > >   percpu: make pcpu_nr_empty_pop_pages per chunk type
> > >   percpu: generalize pcpu_balance_populated()
> > >   percpu: factor out pcpu_check_chunk_hint()
> > >   percpu: implement partial chunk depopulation
> > >
> > >  mm/percpu-internal.h |   4 +-
> > >  mm/percpu-stats.c    |   9 +-
> > >  mm/percpu.c          | 306 +++++++++++++++++++++++++++++++++++--------
> > >  3 files changed, 261 insertions(+), 58 deletions(-)
> > >
>
> Roman, sorry for the delay. I'm looking to apply this today to for-5.14.

Great, thanks!
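
One practical note on the test script: the chunks are depopulated
asynchronously, so the final Percpu reading depends on when it's taken.
A trivial way to watch the counter settle after the rmdir phase (a rough
sketch relying only on /proc/meminfo; the 30 x 1s window is arbitrary):

--
#!/bin/bash

# Sample the Percpu counter once per second while depopulation runs.
for i in `seq 1 30`; do
    grep Percpu /proc/meminfo
    sleep 1
done
--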