Hi,

I've been investigating the regressions in some of the benchmark results, together with the generation context benchmarks [1].

It turns out it's pretty difficult to benchmark this, because the results depend strongly on what the backend did before. For example, if I run slab_bench_fifo with the "decreasing" test for 32kB blocks and 512B chunks, I get this:

  select * from slab_bench_fifo(1000000, 32768, 512, 100, 10000, 5000);

   mem_allocated | alloc_ms | free_ms
  ---------------+----------+---------
       528547840 |   155394 |   87440


i.e. palloc() takes ~155ms and pfree() ~87ms (and these results are stable - the numbers don't change much across repeated runs).

But if I run a set of "lifo" tests in the backend first, the results look like this:

   mem_allocated | alloc_ms | free_ms
  ---------------+----------+---------
       528547840 |    41728 |   71524
  (1 row)

so the pallocs are suddenly ~4x faster. Clearly, what the backend did before may have a pretty dramatic impact on the results, even for simple benchmarks like this.

Note: The benchmark was a single SQL script, running all the different workloads in the same backend.

I did a fair amount of perf profiling, and the main difference between the slow and fast runs seems to be this:

0 page-faults:u 0 minor-faults:u 0 major-faults:u

vs

20,634,153 page-faults:u 20,634,153 minor-faults:u 0 major-faults:u

Attached is a more complete perf stat output, but the page faults seem to be the main issue. My theory is that in the "fast" case, the past backend activity puts the glibc memory management into a state that prevents page faults in the benchmark.
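
To make that a bit more concrete, here's a minimal stand-alone sketch of the effect I have in mind (entirely separate from the benchmark - the 256MB size and the memset are arbitrary choices, just to force fresh pages from the kernel). The first pass over newly allocated memory has to fault the pages in, the second pass over the same memory does not:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/resource.h>

  /* minor page faults for this process so far */
  static long minor_faults(void)
  {
      struct rusage ru;

      getrusage(RUSAGE_SELF, &ru);
      return ru.ru_minflt;
  }

  int main(void)
  {
      size_t  sz = 256 * 1024 * 1024;   /* large enough to get fresh pages */
      char   *p = malloc(sz);
      long    before, after;

      if (!p)
          return 1;

      /* first touch - pages are mapped but not resident yet */
      before = minor_faults();
      memset(p, 1, sz);
      after = minor_faults();
      printf("first touch:  %ld minor faults\n", after - before);

      /* second touch of the same memory - already resident */
      before = minor_faults();
      memset(p, 2, sz);
      after = minor_faults();
      printf("second touch: %ld minor faults\n", after - before);

      /* keep the stores from being optimized out */
      printf("byte: %d\n", p[sz - 1]);

      free(p);
      return 0;
  }

Whether the benchmark ends up in the "first touch" or the "second touch" situation would then depend on what memory the process touched (and kept mapped) before.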

Of course, this theory may be incomplete - for example, it's not clear why running the benchmark repeatedly wouldn't "condition" the backend the same way. But it doesn't, and the timing stays at ~150ms even for repeated runs.

Secondly, I'm not sure this explains why some of the timings actually got much slower with the 0003 patch, when the sequence of steps stays the same. It's possible, of course, that 0003 changes the allocation pattern a bit and interferes with the glibc memory management.

This leads to a couple of interesting questions, I think:

1) I've only tested this on Linux, with glibc. I wonder how it'd behave on other platforms, or with other allocators.

2) Which case is more important - a backend that's already warmed up, or a fresh backend for each benchmark? The "new backend" seems to be something like a worst case, leading to more page faults, so maybe that's the thing to watch. OTOH we're unlikely to ever run in a completely new backend, so maybe not.

3) Can this teach us something about how to allocate stuff, to better "prepare" the backend for future allocations? For example, it's a bit strange that repeated runs of the same benchmark don't do the trick.
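
One obvious thing to try here (purely as an illustration of the kind of glibc state that might be involved - not something the benchmark or the patches do) would be the mallopt() knobs that control whether freed memory is handed back to the kernel:

  #include <malloc.h>

  /*
   * Keep freed memory mapped in the process, so that later allocations
   * don't have to fault the pages in again. The values are the usual
   * "disable trimming / don't use mmap for large chunks" idiom, not
   * something tuned for this benchmark.
   */
  static void
  keep_freed_memory_mapped(void)
  {
      /* never give the top of the main heap back to the kernel on free() */
      mallopt(M_TRIM_THRESHOLD, -1);

      /* serve large requests from the heap instead of separate mmap()s,
       * so free() doesn't munmap() them right away */
      mallopt(M_MMAP_MAX, 0);
  }

No idea yet whether that actually explains the difference between the "conditioned" and "fresh" backend, but it should at least make it possible to test the theory.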



regards


[1] https://www.postgresql.org/message-id/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
 Performance counter stats for process id '11829':

   219,869,739,737      cycles:u
   119,722,280,436      instructions:u            #    0.54  insn per cycle
                                                   #    1.56  stalled cycles per insn
     7,784,566,858      cache-references:u
     2,487,257,287      cache-misses:u            #   31.951 % of all cache refs
     5,942,054,520      bus-cycles:u
                 0      page-faults:u
                 0      minor-faults:u
                 0      major-faults:u
   187,181,661,719      stalled-cycles-frontend:u #   85.13% frontend cycles idle
   144,274,017,071      stalled-cycles-backend:u  #   65.62% backend cycles idle

      60.000876248 seconds time elapsed


 Performance counter stats for process id '11886':

   145,093,090,692      cycles:u
    74,986,543,212      instructions:u            #    0.52  insn per cycle
                                                   #    2.63  stalled cycles per insn
     4,753,764,781      cache-references:u
     1,342,653,549      cache-misses:u            #   28.244 % of all cache refs
     3,925,175,515      bus-cycles:u
        20,634,153      page-faults:u
        20,634,153      minor-faults:u
                 0      major-faults:u
   197,130,461,632      stalled-cycles-frontend:u #  135.86% frontend cycles idle
   168,434,343,213      stalled-cycles-backend:u  #  116.09% backend cycles idle

      60.000891867 seconds time elapsed
