Hi,

I've been investigating the regressions in some of the benchmark results, together with the generation context benchmarks [1].

It turns out it's pretty difficult to benchmark this, because the results depend strongly on what the backend did before. For example, if I run slab_bench_fifo with the "decreasing" test for 32kB blocks and 512B chunks, I get this:

  select * from slab_bench_fifo(1000000, 32768, 512, 100, 10000, 5000);

   mem_allocated | alloc_ms | free_ms
  ---------------+----------+---------
       528547840 |   155394 |   87440


i.e. palloc() takes ~155ms and pfree() ~87ms (and these results are stable - the numbers don't change much across repeated runs).

But if I run a set of "lifo" tests in the backend first, the results look like this:

   mem_allocated | alloc_ms | free_ms
  ---------------+----------+---------
       528547840 |    41728 |   71524
  (1 row)

so the pallocs are suddenly ~4x faster. Clearly, what the backend did before may have a pretty dramatic impact on the results, even for simple benchmarks like this.

Note: The benchmark was a single SQL script, running all the different workloads in the same backend.

I did a fair amount of perf profiling, and the main difference between the slow and fast runs seems to be this:

0 page-faults:u 0 minor-faults:u 0 major-faults:u

vs

20,634,153 page-faults:u 20,634,153 minor-faults:u 0 major-faults:u

Attached is a more complete perf stat output, but the page faults seem to be the main issue. My theory is that in the "fast" case, the past backend activity puts the glibc memory management into a state that prevents page faults in the benchmark.
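
To make that a bit more concrete, here's a minimal stand-alone sketch of the effect I have in mind (entirely separate from the benchmark - the 256MB size and the memset are arbitrary choices, just to force fresh pages from the kernel). The first pass over newly allocated memory has to fault the pages in, the second pass over the same memory does not:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/resource.h>

  /* minor page faults for this process so far */
  static long minor_faults(void)
  {
      struct rusage ru;

      getrusage(RUSAGE_SELF, &ru);
      return ru.ru_minflt;
  }

  int main(void)
  {
      size_t  sz = 256 * 1024 * 1024;   /* large enough to get fresh pages */
      char   *p = malloc(sz);
      long    before, after;

      if (!p)
          return 1;

      /* first touch - pages are mapped but not resident yet */
      before = minor_faults();
      memset(p, 1, sz);
      after = minor_faults();
      printf("first touch:  %ld minor faults\n", after - before);

      /* second touch of the same memory - already resident */
      before = minor_faults();
      memset(p, 2, sz);
      after = minor_faults();
      printf("second touch: %ld minor faults\n", after - before);

      /* keep the stores from being optimized out */
      printf("byte: %d\n", p[sz - 1]);

      free(p);
      return 0;
  }

Whether the benchmark ends up in the "first touch" or the "second touch" situation would then depend on what memory the process touched (and kept mapped) before.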

Of course, this theory may be incomplete - for example, it's not clear why running the benchmark repeatedly wouldn't "condition" the backend the same way. But it doesn't, and the timing stays at ~150ms even for repeated runs.

Secondly, I'm not sure this explains why some of the timings actually got much slower with the 0003 patch, when the sequence of steps stays the same. It's possible, of course, that 0003 changes the allocation pattern a bit and interferes with the glibc memory management.

This leads to a couple of interesting questions, I think:

1) I've only tested this on Linux, with glibc. I wonder how it'd behave on other platforms, or with other allocators.

2) Which case is more important - a backend that's already warmed up, or a fresh backend for each benchmark? The "new backend" seems to be something like a worst case, leading to more page faults, so maybe that's the thing to watch. OTOH we're unlikely to ever run in a completely new backend, so maybe not.

3) Can this teach us something about how to allocate stuff, to better "prepare" the backend for future allocations? For example, it's a bit strange that repeated runs of the same benchmark don't do the trick.
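
One obvious thing to try here (purely as an illustration of the kind of glibc state that might be involved - not something the benchmark or the patches do) would be the mallopt() knobs that control whether freed memory is handed back to the kernel:

  #include <malloc.h>

  /*
   * Keep freed memory mapped in the process, so that later allocations
   * don't have to fault the pages in again. The values are the usual
   * "disable trimming / don't use mmap for large chunks" idiom, not
   * something tuned for this benchmark.
   */
  static void
  keep_freed_memory_mapped(void)
  {
      /* never give the top of the main heap back to the kernel on free() */
      mallopt(M_TRIM_THRESHOLD, -1);

      /* serve large requests from the heap instead of separate mmap()s,
       * so free() doesn't munmap() them right away */
      mallopt(M_MMAP_MAX, 0);
  }

No idea yet whether that actually explains the difference between the "conditioned" and "fresh" backend, but it should at least make it possible to test the theory.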



regards


[1] https://www.postgresql.org/message-id/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
 Performance counter stats for process id '11829':

   219,869,739,737      cycles:u
   119,722,280,436      instructions:u            #    0.54  insn per cycle
                                                   #    1.56  stalled cycles per insn
     7,784,566,858      cache-references:u
     2,487,257,287      cache-misses:u            #   31.951 % of all cache refs
     5,942,054,520      bus-cycles:u
                 0      page-faults:u
                 0      minor-faults:u
                 0      major-faults:u
   187,181,661,719      stalled-cycles-frontend:u #   85.13% frontend cycles idle
   144,274,017,071      stalled-cycles-backend:u  #   65.62% backend cycles idle

      60.000876248 seconds time elapsed


 Performance counter stats for process id '11886':

   145,093,090,692      cycles:u
    74,986,543,212      instructions:u            #    0.52  insn per cycle
                                                   #    2.63  stalled cycles per insn
     4,753,764,781      cache-references:u
     1,342,653,549      cache-misses:u            #   28.244 % of all cache refs
     3,925,175,515      bus-cycles:u
        20,634,153      page-faults:u
        20,634,153      minor-faults:u
                 0      major-faults:u
   197,130,461,632      stalled-cycles-frontend:u #  135.86% frontend cycles idle
   168,434,343,213      stalled-cycles-backend:u  #  116.09% backend cycles idle

      60.000891867 seconds time elapsed
