Hi,

This nerd-sniped me badly :)

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> > I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> > remap the .text segment to huge pages at program start. Attached is a
> > hackish, Meson-only, "works on my machine" patchset to experiment with this
> > idea.
>
> I wonder how far we can get with just using the linker hints to align
> sections. I know that the linux folks are working on promoting sufficiently
> aligned executable pages to huge pages too, and might have succeeded already.
>
> IOW, adding the linker flags might be a good first step.

Indeed, I did see that that works to some degree on the 5.19 kernel I was
running. However, it never seems to get around to using huge pages
sufficiently to compete with explicit use of huge pages.

More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
added into linux 6.1. That explicitly remaps a region and uses huge pages for
it. Of course that's going to take a while to be widely available, but it
seems like a safer approach than the remapping approach from this thread.

I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
the address / length), and it seems to work nicely.

With the weird caveat that, depending on the filesystem, one needs to make
sure that the executable doesn't use reflinks to reuse parts of other files,
which both the mold linker and cp do... Not a concern on ext4, but it is on
xfs.

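For anyone wanting to try MADV_COLLAPSE without hardcoding the address / length,
here is a rough Python sketch of the idea. It is not the hack described above:
it looks up the executable's r-x mapping in /proc/self/maps, shrinks it to
huge-page boundaries and calls madvise() through ctypes. The helper names, the
2 MiB huge page size and the literal value 25 for MADV_COLLAPSE (taken from the
kernel's generic mman-common.h) are assumptions on my part, and the call will
simply fail on kernels before 6.1.

```python
import ctypes
import ctypes.util
import os

HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumption: 2 MiB huge pages (x86-64)
MADV_COLLAPSE = 25                # assumption: value from asm-generic/mman-common.h


def huge_aligned_range(start, end, huge=HUGE_PAGE_SIZE):
    """Largest [lo, hi) inside [start, end) aligned to `huge`, or None."""
    lo = (start + huge - 1) // huge * huge  # round start up
    hi = end // huge * huge                 # round end down
    return (lo, hi) if lo < hi else None


def executable_text_mappings(maps_text, path):
    """Yield (start, end) of r-x mappings backed by `path` in maps-style text."""
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[1].startswith("r-x") and fields[5] == path:
            start, end = (int(x, 16) for x in fields[0].split("-"))
            yield start, end


def collapse_own_text():
    """Best-effort MADV_COLLAPSE on this process's executable; True on success."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.madvise.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
    libc.madvise.restype = ctypes.c_int
    exe = os.readlink("/proc/self/exe")
    with open("/proc/self/maps") as f:
        maps_text = f.read()
    ok = False
    for start, end in executable_text_mappings(maps_text, exe):
        rng = huge_aligned_range(start, end)
        if rng is None:
            continue  # mapping does not contain a whole huge page
        lo, hi = rng
        if libc.madvise(lo, hi - lo, MADV_COLLAPSE) == 0:
            ok = True
        else:
            print("madvise failed:", os.strerror(ctypes.get_errno()))
    return ok
```

Nothing runs on import; calling collapse_own_text() attempts the collapse for
the running interpreter. A mapping smaller than one aligned huge page is
skipped, which is why the linker alignment flags still matter.
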
To avoid reflinks, I took to copying the postgres binary with
cp --reflink=never


FWIW, you can see the state of the page mapping in more detail with the
kernel's page-types tool

sudo /home/andres/src/kernel/tools/vm/page-types -L -p 12297 -a 0x555555800,0x555556122
sudo /home/andres/src/kernel/tools/vm/page-types -f /srv/dev/build/m-opt/src/backend/postgres2


Perf results:

c=150;psql -f ~/tmp/prewarm.sql;perf stat -a -e cycles,iTLB-loads,iTLB-load-misses,itlb_misses.walk_active,itlb_misses.walk_completed_4k,itlb_misses.walk_completed_2m_4m,itlb_misses.walk_completed_1g pgbench -n -M prepared -S -P1 -c$c -j$c -T10

without MADV_COLLAPSE:

tps = 1038230.070771 (without initial connection time)

 Performance counter stats for 'system wide':

   1,184,344,476,152  cycles                            (71.41%)
       2,846,146,710  iTLB-loads                        (71.43%)
       2,021,885,782  iTLB-load-misses                  # 71.04% of all iTLB cache accesses  (71.44%)
      75,633,850,933  itlb_misses.walk_active           (71.44%)
       2,020,962,930  itlb_misses.walk_completed_4k     (71.44%)
           1,213,368  itlb_misses.walk_completed_2m_4m  (57.12%)
               2,293  itlb_misses.walk_completed_1g     (57.11%)

        10.064352587 seconds time elapsed


with MADV_COLLAPSE:

tps = 1113717.114278 (without initial connection time)

 Performance counter stats for 'system wide':

   1,173,049,140,611  cycles                            (71.42%)
       1,059,224,678  iTLB-loads                        (71.44%)
         653,603,712  iTLB-load-misses                  # 61.71% of all iTLB cache accesses  (71.44%)
      26,135,902,949  itlb_misses.walk_active           (71.44%)
         628,314,285  itlb_misses.walk_completed_4k     (71.44%)
          25,462,916  itlb_misses.walk_completed_2m_4m  (57.13%)
               2,228  itlb_misses.walk_completed_1g     (57.13%)


Note that while the rate of itlb-misses stays roughly the same, the total
number of iTLB loads is reduced substantially, and the number of cycles in
which an itlb miss was in progress is 1/3 of what it was before.


A lot of the remaining misses are from the context switches. The iTLB is
flushed on context switches, and of course pgbench -S is extremely context
switch heavy.

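To make the comparison above easier to check, here is a small Python sketch
that recomputes the ratios from the two perf stat runs. The counter values are
copied from the output above; the dict keys and helper names are made up for
illustration, and since perf was multiplexing counters (the percentages in
parentheses) the results should be read as approximate.

```python
# Counter values copied from the "without" / "with MADV_COLLAPSE" runs above.
without = {
    "tps": 1038230.070771,
    "cycles": 1_184_344_476_152,
    "itlb_loads": 2_846_146_710,
    "itlb_load_misses": 2_021_885_782,
    "walk_active": 75_633_850_933,
}
with_collapse = {
    "tps": 1113717.114278,
    "cycles": 1_173_049_140_611,
    "itlb_loads": 1_059_224_678,
    "itlb_load_misses": 653_603_712,
    "walk_active": 26_135_902_949,
}


def ratio(key):
    """Value with MADV_COLLAPSE relative to the run without it."""
    return with_collapse[key] / without[key]


def miss_rate(run):
    """Fraction of iTLB loads that missed."""
    return run["itlb_load_misses"] / run["itlb_loads"]


def walk_share(run):
    """Fraction of all cycles during which an iTLB page walk was active."""
    return run["walk_active"] / run["cycles"]


if __name__ == "__main__":
    for key in ("tps", "itlb_loads", "walk_active"):
        print(f"{key}: {ratio(key):.3f}x")
    for name, run in (("without", without), ("with", with_collapse)):
        print(f"{name}: miss rate {miss_rate(run):.2%}, walk share {walk_share(run):.2%}")
```

The walk_active ratio comes out at roughly 0.35, which is the "1/3" mentioned
above, while tps goes up by a bit over 7%.
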
Comparing plain -S with 10 pipelined -S transactions (using -t 100000 / -t
10000 to compare the same amount of work) I get:


without MADV_COLLAPSE:

not pipelined:

tps = 1037732.722805 (without initial connection time)

 Performance counter stats for 'system wide':

   1,691,411,678,007  cycles                            (62.48%)
           8,856,107  itlb.itlb_flush                   (62.48%)
       4,600,041,062  iTLB-loads                        (62.48%)
       2,598,218,236  iTLB-load-misses                  # 56.48% of all iTLB cache accesses  (62.50%)
     100,095,862,126  itlb_misses.walk_active           (62.53%)
       2,595,376,025  itlb_misses.walk_completed_4k     (50.02%)
           2,558,713  itlb_misses.walk_completed_2m_4m  (50.00%)
               2,146  itlb_misses.walk_completed_1g     (49.98%)

        14.582927646 seconds time elapsed

pipelined:

tps = 161947.008995 (without initial connection time)

 Performance counter stats for 'system wide':

   1,095,948,341,745  cycles                            (62.46%)
             877,556  itlb.itlb_flush                   (62.46%)
       4,576,237,561  iTLB-loads                        (62.48%)
         307,971,166  iTLB-load-misses                  # 6.73% of all iTLB cache accesses  (62.52%)
      15,565,279,213  itlb_misses.walk_active           (62.55%)
         306,240,104  itlb_misses.walk_completed_4k     (50.03%)
           1,753,560  itlb_misses.walk_completed_2m_4m  (50.00%)
               2,189  itlb_misses.walk_completed_1g     (49.96%)

         9.374687885 seconds time elapsed


with MADV_COLLAPSE:

not pipelined:

tps = 1112040.859643 (without initial connection time)

 Performance counter stats for 'system wide':

   1,569,546,236,696  cycles                            (62.50%)
           7,094,291  itlb.itlb_flush                   (62.51%)
       1,599,845,097  iTLB-loads                        (62.51%)
         692,042,864  iTLB-load-misses                  # 43.26% of all iTLB cache accesses  (62.51%)
      31,529,641,124  itlb_misses.walk_active           (62.51%)
         669,849,177  itlb_misses.walk_completed_4k     (49.99%)
          22,708,146  itlb_misses.walk_completed_2m_4m  (49.99%)
               2,752  itlb_misses.walk_completed_1g     (49.99%)

        13.611206182 seconds time elapsed

pipelined:

tps = 162484.443469 (without initial connection time)

 Performance counter stats for 'system wide':

   1,092,897,514,658  cycles                            (62.48%)
             942,351  itlb.itlb_flush                   (62.48%)
         233,996,092  iTLB-loads                        (62.48%)
         102,155,575  iTLB-load-misses                  # 43.66% of all iTLB cache accesses  (62.49%)
       6,419,597,286  itlb_misses.walk_active           (62.52%)
          98,758,409  itlb_misses.walk_completed_4k     (50.03%)
           3,342,332  itlb_misses.walk_completed_2m_4m  (50.02%)
               2,190  itlb_misses.walk_completed_1g     (49.98%)

         9.355239897 seconds time elapsed


The difference in itlb.itlb_flush between pipelined / non-pipelined cases
unsurprisingly is stark.

While the pipelined case still sees a good bit reduced itlb traffic, the total
amount of cycles in which a walk is active is just not large enough to matter,
by the looks of it.

Greetings,

Andres Freund