Re: remap the .text segment into huge pages at run time

Andres Freund Fri, 04 Nov 2022 11:33:33 -0700

Hi,

This nerd-sniped me badly :)


On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> > I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> > remap the .text segment to huge pages at program start. Attached is a
> > hackish, Meson-only, "works on my machine" patchset to experiment with this
> > idea.
>
> I wonder how far we can get with just using the linker hints to align
> sections. I know that the linux folks are working on promoting sufficiently
> aligned executable pages to huge pages too, and might have succeeded already.
>
> IOW, adding the linker flags might be a good first step.

Indeed, I did see that that works to some degree on the 5.19 kernel I was
running. However, it never seems to get around to using huge pages
sufficiently to compete with explicit use of huge pages.

More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
added to Linux 6.1. It explicitly remaps a region and uses huge pages for
it. Of course that's going to take a while to be widely available, but it
seems safer than the remapping approach from this thread.

I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
the address / length), and it seems to work nicely.
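
For illustration, a minimal sketch of what such a call could look like. This
is not the actual hack from this mail: the helper names (align_up, align_down,
collapse_range) and the 2MB huge page size are assumptions, and the fallback
value 25 for MADV_COLLAPSE is the one from the Linux 6.1 uapi headers, for
systems whose libc headers don't define it yet. The sketch only asks for
whole huge-page-aligned chunks inside the given range to be collapsed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25		/* value from the Linux 6.1 uapi headers */
#endif

/* Assumption: 2MB huge pages, as on x86-64. */
#define HUGE_PAGE_SIZE ((uintptr_t) 2 * 1024 * 1024)

/* Round addr up to the next multiple of align (align must be a power of 2). */
static uintptr_t
align_up(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* Round addr down to a multiple of align (align must be a power of 2). */
static uintptr_t
align_down(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/*
 * Ask the kernel to back [start, end) with huge pages.
 *
 * Returns 0 on success, -1 (with errno set by madvise) on failure, and 1
 * if the range does not contain a single whole huge page, in which case
 * madvise is not called at all.
 */
static int
collapse_range(uintptr_t start, uintptr_t end)
{
	uintptr_t	first = align_up(start, HUGE_PAGE_SIZE);
	uintptr_t	last = align_down(end, HUGE_PAGE_SIZE);

	if (first >= last)
		return 1;

	return madvise((void *) first, (size_t) (last - first), MADV_COLLAPSE);
}
```

On kernels older than 6.1 the madvise call is expected to fail with EINVAL,
so a caller would presumably want to treat a -1 return as "no huge pages"
rather than as a fatal error.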

With the weird caveat that, on filesystems that support reflinks, one needs to
make sure that the executable doesn't use reflinks to share parts of other
files, which is what both the mold linker and cp do... Not a concern on ext4,
but it is on xfs. I took to copying the postgres binary with
cp --reflink=never.


FWIW, you can see the state of the page mapping in more detail with the
kernel's page-types tool

sudo /home/andres/src/kernel/tools/vm/page-types -L -p 12297 -a 0x555555800,0x555556122
sudo /home/andres/src/kernel/tools/vm/page-types -f /srv/dev/build/m-opt/src/backend/postgres2
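
A coarser check that needs no kernel source tree is to sum a per-mapping
field from /proc/<pid>/smaps. The sketch below is an alternative to
page-types, not something that was used for the numbers in this mail. It
assumes the kernel exposes the FilePmdMapped field (file-backed memory mapped
with huge pages) in smaps, and the helper names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * If line has the form "<key>:   <number> kB", store the number in *kb and
 * return 1. Otherwise return 0 and leave *kb untouched.
 */
static int
smaps_field_kb(const char *line, const char *key, unsigned long *kb)
{
	size_t		keylen = strlen(key);
	unsigned long value;

	if (strncmp(line, key, keylen) != 0 || line[keylen] != ':')
		return 0;
	if (sscanf(line + keylen + 1, "%lu", &value) != 1)
		return 0;
	*kb = value;
	return 1;
}

/*
 * Sum the given field over all mappings listed in an smaps file.
 * Returns the total in kB, or -1 if the file cannot be opened.
 */
static long
sum_smaps_field_kb(const char *path, const char *key)
{
	FILE	   *f = fopen(path, "r");
	char		line[8192];
	unsigned long kb;
	unsigned long total = 0;

	if (f == NULL)
		return -1;

	while (fgets(line, sizeof(line), f) != NULL)
	{
		if (smaps_field_kb(line, key, &kb))
			total += kb;
	}

	fclose(f);
	return (long) total;
}
```

For the backend used in the page-types example, the call would be
sum_smaps_field_kb("/proc/12297/smaps", "FilePmdMapped"). A non-zero result
would suggest that at least part of a file mapping is backed by huge pages,
but unlike page-types it says nothing about which pages are.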


Perf results:

c=150;psql -f ~/tmp/prewarm.sql;perf stat -a -e cycles,iTLB-loads,iTLB-load-misses,itlb_misses.walk_active,itlb_misses.walk_completed_4k,itlb_misses.walk_completed_2m_4m,itlb_misses.walk_completed_1g pgbench -n -M prepared -S -P1 -c$c -j$c -T10

without MADV_COLLAPSE:

tps = 1038230.070771 (without initial connection time)

 Performance counter stats for 'system wide':

 1,184,344,476,152      cycles                              (71.41%)
     2,846,146,710      iTLB-loads                          (71.43%)
     2,021,885,782      iTLB-load-misses                 #   71.04% of all iTLB cache accesses  (71.44%)
    75,633,850,933      itlb_misses.walk_active             (71.44%)
     2,020,962,930      itlb_misses.walk_completed_4k       (71.44%)
         1,213,368      itlb_misses.walk_completed_2m_4m    (57.12%)
             2,293      itlb_misses.walk_completed_1g       (57.11%)

      10.064352587 seconds time elapsed



with MADV_COLLAPSE:

tps = 1113717.114278 (without initial connection time)

 Performance counter stats for 'system wide':

 1,173,049,140,611      cycles                              (71.42%)
     1,059,224,678      iTLB-loads                          (71.44%)
       653,603,712      iTLB-load-misses                 #   61.71% of all iTLB cache accesses  (71.44%)
    26,135,902,949      itlb_misses.walk_active             (71.44%)
       628,314,285      itlb_misses.walk_completed_4k       (71.44%)
        25,462,916      itlb_misses.walk_completed_2m_4m    (57.13%)
             2,228      itlb_misses.walk_completed_1g       (57.13%)

Note that while the iTLB miss rate changes comparatively little (71% to 62%),
the total number of iTLB loads is reduced substantially, and the number of
cycles in which an iTLB miss was in progress is about 1/3 of what it was before.
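
To make those ratios explicit, here is a small sketch that recomputes them
from the two perf stat runs above. The counter values are copied from the
output; the struct and function names are made up for illustration. The
walk_active ratio comes out at roughly 0.35, in line with the "1/3" above.

```c
#include <stdint.h>
#include <stdio.h>

/* Counter values copied from the two perf stat runs (c=150, -T10). */
struct run
{
	double		tps;
	uint64_t	itlb_loads;
	uint64_t	itlb_load_misses;
	uint64_t	walk_active;
};

static const struct run without_collapse = {
	1038230.070771, 2846146710ULL, 2021885782ULL, 75633850933ULL
};

static const struct run with_collapse = {
	1113717.114278, 1059224678ULL, 653603712ULL, 26135902949ULL
};

/* Value of a counter after the change, relative to before it. */
static double
ratio(uint64_t after, uint64_t before)
{
	return (double) after / (double) before;
}

/* Fraction of iTLB loads that missed, within a single run. */
static double
miss_rate(const struct run *r)
{
	return ratio(r->itlb_load_misses, r->itlb_loads);
}

/* Print how the "after" run compares to the "before" run. */
static void
print_comparison(const struct run *before, const struct run *after)
{
	printf("tps:         x%.3f\n", after->tps / before->tps);
	printf("iTLB loads:  x%.3f\n",
		   ratio(after->itlb_loads, before->itlb_loads));
	printf("walk_active: x%.3f\n",
		   ratio(after->walk_active, before->walk_active));
	printf("miss rate:   %.1f%% -> %.1f%%\n",
		   100.0 * miss_rate(before), 100.0 * miss_rate(after));
}
```

Calling print_comparison(&without_collapse, &with_collapse) from a main
function prints the four comparisons. Keep in mind that the counters are
multiplexed (the percentages in parentheses), so the inputs are scaled
estimates to begin with.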


A lot of the remaining misses are from the context switches. The iTLB is
flushed on context switches, and of course pgbench -S is extremely context
switch heavy.

Comparing plain -S with 10 pipelined -S transactions (using -t 100000 / -t
10000 to compare the same amount of work) I get:


without MADV_COLLAPSE:

not pipelined:

tps = 1037732.722805 (without initial connection time)

 Performance counter stats for 'system wide':

 1,691,411,678,007      cycles                              (62.48%)
         8,856,107      itlb.itlb_flush                     (62.48%)
     4,600,041,062      iTLB-loads                          (62.48%)
     2,598,218,236      iTLB-load-misses                 #   56.48% of all iTLB cache accesses  (62.50%)
   100,095,862,126      itlb_misses.walk_active             (62.53%)
     2,595,376,025      itlb_misses.walk_completed_4k       (50.02%)
         2,558,713      itlb_misses.walk_completed_2m_4m    (50.00%)
             2,146      itlb_misses.walk_completed_1g       (49.98%)

      14.582927646 seconds time elapsed


pipelined:

tps = 161947.008995 (without initial connection time)

 Performance counter stats for 'system wide':

 1,095,948,341,745      cycles                              (62.46%)
           877,556      itlb.itlb_flush                     (62.46%)
     4,576,237,561      iTLB-loads                          (62.48%)
       307,971,166      iTLB-load-misses                 #    6.73% of all iTLB cache accesses  (62.52%)
    15,565,279,213      itlb_misses.walk_active             (62.55%)
       306,240,104      itlb_misses.walk_completed_4k       (50.03%)
         1,753,560      itlb_misses.walk_completed_2m_4m    (50.00%)
             2,189      itlb_misses.walk_completed_1g       (49.96%)

       9.374687885 seconds time elapsed



with MADV_COLLAPSE:

not pipelined:

tps = 1112040.859643 (without initial connection time)

 Performance counter stats for 'system wide':

 1,569,546,236,696      cycles                              (62.50%)
         7,094,291      itlb.itlb_flush                     (62.51%)
     1,599,845,097      iTLB-loads                          (62.51%)
       692,042,864      iTLB-load-misses                 #   43.26% of all iTLB cache accesses  (62.51%)
    31,529,641,124      itlb_misses.walk_active             (62.51%)
       669,849,177      itlb_misses.walk_completed_4k       (49.99%)
        22,708,146      itlb_misses.walk_completed_2m_4m    (49.99%)
             2,752      itlb_misses.walk_completed_1g       (49.99%)

      13.611206182 seconds time elapsed


pipelined:

tps = 162484.443469 (without initial connection time)

 Performance counter stats for 'system wide':

 1,092,897,514,658      cycles                              (62.48%)
           942,351      itlb.itlb_flush                     (62.48%)
       233,996,092      iTLB-loads                          (62.48%)
       102,155,575      iTLB-load-misses                 #   43.66% of all iTLB cache accesses  (62.49%)
     6,419,597,286      itlb_misses.walk_active             (62.52%)
        98,758,409      itlb_misses.walk_completed_4k       (50.03%)
         3,342,332      itlb_misses.walk_completed_2m_4m    (50.02%)
             2,190      itlb_misses.walk_completed_1g       (49.98%)

       9.355239897 seconds time elapsed

The difference in itlb.itlb_flush between pipelined / non-pipelined cases
unsurprisingly is stark.

While the pipelined case still sees a good bit less iTLB traffic with
MADV_COLLAPSE, the total number of cycles in which a walk is active is just
not large enough to matter, by the looks of it.
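
As a rough cross-check of that impression, the sketch below computes
itlb_misses.walk_active as a share of cycles for the four runs above. The
counter values are copied from the perf stat output; the struct, array and
function names are made up for illustration. Treating this share as an upper
bound on what could be won is an assumption, since cycles with a walk in
progress are not necessarily stalled cycles.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One perf stat run: system-wide cycles, walk-active cycles, iTLB flushes. */
struct sample
{
	const char *label;
	uint64_t	cycles;
	uint64_t	walk_active;
	uint64_t	itlb_flush;
};

/* Values copied from the four perf stat outputs. */
static const struct sample samples[] = {
	{"no collapse, not pipelined", 1691411678007ULL, 100095862126ULL, 8856107ULL},
	{"no collapse, pipelined", 1095948341745ULL, 15565279213ULL, 877556ULL},
	{"collapse, not pipelined", 1569546236696ULL, 31529641124ULL, 7094291ULL},
	{"collapse, pipelined", 1092897514658ULL, 6419597286ULL, 942351ULL},
};

#define NUM_SAMPLES (sizeof(samples) / sizeof(samples[0]))

/* Percentage of all cycles during which an iTLB page walk was active. */
static double
walk_share_percent(const struct sample *s)
{
	return 100.0 * (double) s->walk_active / (double) s->cycles;
}

/* Print one line per run with its walk share and number of iTLB flushes. */
static void
print_walk_shares(void)
{
	size_t		i;

	for (i = 0; i < NUM_SAMPLES; i++)
		printf("%-28s walk share %5.2f%%, flushes %llu\n",
			   samples[i].label,
			   walk_share_percent(&samples[i]),
			   (unsigned long long) samples[i].itlb_flush);
}
```

With these inputs the share is close to 6% for the non-pipelined run without
MADV_COLLAPSE and about 2% with it, whereas both pipelined runs stay below
1.5%, so there is far less left to win in the pipelined case.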

Greetings,

Andres Freund
