Hi,

On 2025-01-09 20:10:24 +0200, Ants Aasma wrote:
> On Thu, 9 Jan 2025 at 18:25, Andres Freund <and...@anarazel.de> wrote:
> > > I'm curious about this because the checksum code should be fast enough
> > > to easily handle that throughput.
> >
> > It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
> > workstation. But we don't have a good ready-made way of testing that without
> > also doing IO, so it's kinda hard to say.
>
> Interesting, I wonder if it's related to Intel increasing vpmulld
> latency to 10 already back in Haswell. The Zen 3 I'm testing on has
> latency 3 and has twice the throughput.

> Attached is a naive and crude benchmark that I used for testing here.
> Compiled with:
>
> gcc -O2 -funroll-loops -ftree-vectorize -march=native \
>   -I$(pg_config --includedir-server) \
>   bench-checksums.c -o bench-checksums-native
>
> Just fills up an array of pages and checksums them, first argument is
> number of checksums, second is array size. I used 1M checksums and 100
> pages for in cache behavior and 100000 pages for in memory
> performance.
>
> 869.85927ms @ 9.418 GB/s - generic from memory
> 772.12252ms @ 10.610 GB/s - generic in cache
> 442.61869ms @ 18.508 GB/s - native from memory
> 137.07573ms @ 59.763 GB/s - native in cache
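
The harness is, as described, just a loop around pg_checksum_page(). A
minimal sketch of that shape (a hypothetical reconstruction - the real
bench-checksums.c is in Ants' attachment and isn't reproduced here; assumes
building with -I$(pg_config --includedir-server), the same way the frontend
tools include checksum_impl.h directly):

#include "postgres_fe.h"

#include <time.h>

#include "storage/checksum_impl.h"	/* provides pg_checksum_page() */

int
main(int argc, char **argv)
{
	long		nsums = atol(argv[1]);	/* number of checksum calls */
	long		npages = atol(argv[2]); /* working-set size, in pages */
	char	   *pages = malloc(npages * BLCKSZ);
	volatile uint16 sink = 0;	/* keeps the calls from being elided */
	struct timespec start, end;
	double		ms;

	/* fill the pages with something non-zero */
	for (long i = 0; i < npages * BLCKSZ; i++)
		pages[i] = (char) i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < nsums; i++)
		sink |= pg_checksum_page(pages + (i % npages) * BLCKSZ,
								 (BlockNumber) i);
	clock_gettime(CLOCK_MONOTONIC, &end);

	ms = (end.tv_sec - start.tv_sec) * 1000.0 +
		(end.tv_nsec - start.tv_nsec) / 1000000.0;
	printf("%.5fms @ %.3f GB/s\n", ms,
		   (double) nsums * BLCKSZ / (ms / 1000.0) / 1e9);
	return 0;
}

Here's what I ran with it, across the x86-64 microarchitecture levels: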

printf '%16s\t%16s\t%s\n' march mem result
for mem in 100 100000 1000000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    printf "%16s\t%16s\t" $march $mem
    gcc -g -g3 -O2 -funroll-loops -ftree-vectorize -march=$march \
      -I ~/src/postgresql/src/include/ -I src/include/ \
      /tmp/bench-checksums.c -o bench-checksums-native \
      && numactl --physcpubind 1 --membind 0 \
        ./bench-checksums-native 1000000 $mem
  done
done

Workstation w/ 2x Xeon Gold 6442Y:

           march                     mem        result
          x86-64                     100        731.87779ms @ 11.193 GB/s
       x86-64-v2                     100        327.18580ms @ 25.038 GB/s
       x86-64-v3                     100        264.03547ms @ 31.026 GB/s
       x86-64-v4                     100        282.08065ms @ 29.041 GB/s
          native                     100        246.13766ms @ 33.282 GB/s
          x86-64                  100000        842.66827ms @ 9.722 GB/s
       x86-64-v2                  100000        604.52959ms @ 13.551 GB/s
       x86-64-v3                  100000        477.16239ms @ 17.168 GB/s
       x86-64-v4                  100000        476.07039ms @ 17.208 GB/s
          native                  100000        456.08080ms @ 17.962 GB/s
          x86-64                 1000000        845.51132ms @ 9.689 GB/s
       x86-64-v2                 1000000        612.07973ms @ 13.384 GB/s
       x86-64-v3                 1000000        485.23738ms @ 16.882 GB/s
       x86-64-v4                 1000000        483.86411ms @ 16.930 GB/s
          native                 1000000        462.88461ms @ 17.698 GB/s



Zen 4 laptop (AMD Ryzen 7 PRO 7840U):
           march                     mem        result
          x86-64                     100        417.19762ms @ 19.636 GB/s
       x86-64-v2                     100        130.67596ms @ 62.689 GB/s
       x86-64-v3                     100        97.07758ms @ 84.386 GB/s
       x86-64-v4                     100        95.67704ms @ 85.621 GB/s
          native                     100        95.15734ms @ 86.089 GB/s
          x86-64                  100000        431.38370ms @ 18.990 GB/s
       x86-64-v2                  100000        215.74856ms @ 37.970 GB/s
       x86-64-v3                  100000        199.74492ms @ 41.012 GB/s
       x86-64-v4                  100000        186.98300ms @ 43.811 GB/s
          native                  100000        187.68125ms @ 43.648 GB/s
          x86-64                 1000000        433.87893ms @ 18.881 GB/s
       x86-64-v2                 1000000        217.46561ms @ 37.670 GB/s
       x86-64-v3                 1000000        200.40667ms @ 40.877 GB/s
       x86-64-v4                 1000000        187.51978ms @ 43.686 GB/s
          native                 1000000        190.29273ms @ 43.049 GB/s


Workstation w/ 2x Xeon Gold 5215:
           march                     mem        result
          x86-64                     100        780.38881ms @ 10.497 GB/s
       x86-64-v2                     100        389.62005ms @ 21.026 GB/s
       x86-64-v3                     100        323.97294ms @ 25.286 GB/s
       x86-64-v4                     100        274.19493ms @ 29.877 GB/s
          native                     100        283.48674ms @ 28.897 GB/s
          x86-64                  100000        1112.63898ms @ 7.363 GB/s
       x86-64-v2                  100000        831.45641ms @ 9.853 GB/s
       x86-64-v3                  100000        696.20789ms @ 11.767 GB/s
       x86-64-v4                  100000        685.61636ms @ 11.948 GB/s
          native                  100000        689.78023ms @ 11.876 GB/s
          x86-64                 1000000        1128.65580ms @ 7.258 GB/s
       x86-64-v2                 1000000        843.92594ms @ 9.707 GB/s
       x86-64-v3                 1000000        718.78848ms @ 11.397 GB/s
       x86-64-v4                 1000000        687.68258ms @ 11.912 GB/s
          native                 1000000        705.34731ms @ 11.614 GB/s


That's quite a drastic difference between AMD and Intel. Of course it's also
comparing a multi-core server uarch (lower per-core bandwidth, much higher
aggregate bandwidth) with a client uarch.


The difference between the baseline CPU target and a more modern profile is
also rather impressive.  Looks like some CPU-capability-based dispatch would
likely be worth it, even though it didn't matter in my case due to
-march=native.
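
To spell out what I mean by dispatch - a one-time runtime choice through a
function pointer, e.g. keyed off __builtin_cpu_supports(). Purely an
illustrative sketch with made-up names, not how the tree builds the checksum
code today:

#include <stddef.h>
#include <stdint.h>

/* stand-in for the real checksum loop */
static uint64_t
checksum_generic(const uint8_t *buf, size_t len)
{
	uint64_t	sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/*
 * Same source; the target attribute just lets the compiler auto-vectorize
 * this copy with AVX2, independent of the -march baseline.
 */
__attribute__((target("avx2")))
static uint64_t
checksum_avx2(const uint8_t *buf, size_t len)
{
	uint64_t	sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

static uint64_t (*checksum_impl) (const uint8_t *, size_t);

uint64_t
checksum(const uint8_t *buf, size_t len)
{
	/* pick an implementation on first use */
	if (checksum_impl == NULL)
		checksum_impl = __builtin_cpu_supports("avx2") ?
			checksum_avx2 : checksum_generic;
	return checksum_impl(buf, len);
}

gcc/clang's __attribute__((target_clones(...))) would get the same effect
with less boilerplate, at the cost of requiring ifunc support in the
toolchain and libc.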


I just realized that

a) The meson build doesn't use the relevant flags for bufpage.c - it didn't
   matter in my numbers though, because I was building with -O3 and
   -march=native.

   This clearly ought to be fixed.

b) Neither build uses the optimized flags for pg_checksums and pg_upgrade,
   both of which include checksum_impl.h directly.

   This probably should be fixed too - perhaps by building the relevant code
   once as part of fe_utils or such?
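
   I.e., roughly, a hypothetical fe_utils/checksum.c built once with
   CFLAGS_UNROLL_LOOPS/CFLAGS_VECTORIZE, mirroring what the backend's
   storage/page/checksum.c already does:

   #include "postgres_fe.h"

   #include "storage/checksum_impl.h"

   with pg_checksums and pg_upgrade then linking against fe_utils instead of
   each including the implementation header themselves.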


It probably matters less than it used to - these days -O2 turns on
-ftree-loop-vectorize and -ftree-slp-vectorize. But loop unrolling isn't
enabled.

I do see a perf difference at -O2 between using/not using
-funroll-loops. Interestingly not at -O3, despite -funroll-loops not actually
being enabled by -O3. I think the relevant option that *is* turned on by -O3
is -fpeel-loops.

Here's a comparison of different flags, run on the 6442Y:

printf '%16s\t%32s\t%16s\t%s\n' march flags mem result
for mem in 100 100000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    for flags in "-O2" "-O2 -funroll-loops" "-O3" "-O3 -funroll-loops"; do
      printf "%16s\t%32s\t%16s\t" "$march" "$flags" "$mem"
      gcc $flags -march=$march \
        -I ~/src/postgresql/src/include/ -I src/include/ \
        /tmp/bench-checksums.c -o bench-checksums-native \
        && numactl --physcpubind 3 --membind 0 \
          ./bench-checksums-native 3000000 $mem
    done
  done
done
           march                             flags     mem  result
          x86-64                               -O2     100  2280.86253ms @ 10.775 GB/s
          x86-64                -O2 -funroll-loops     100  2195.66942ms @ 11.193 GB/s
          x86-64                               -O3     100  2422.57588ms @ 10.145 GB/s
          x86-64                -O3 -funroll-loops     100  2243.75826ms @ 10.953 GB/s
       x86-64-v2                               -O2     100  1243.68063ms @ 19.761 GB/s
       x86-64-v2                -O2 -funroll-loops     100  979.67783ms @ 25.086 GB/s
       x86-64-v2                               -O3     100  988.80296ms @ 24.854 GB/s
       x86-64-v2                -O3 -funroll-loops     100  991.31632ms @ 24.791 GB/s
       x86-64-v3                               -O2     100  1146.90165ms @ 21.428 GB/s
       x86-64-v3                -O2 -funroll-loops     100  785.81395ms @ 31.275 GB/s
       x86-64-v3                               -O3     100  800.53627ms @ 30.699 GB/s
       x86-64-v3                -O3 -funroll-loops     100  790.21230ms @ 31.101 GB/s
       x86-64-v4                               -O2     100  883.82916ms @ 27.806 GB/s
       x86-64-v4                -O2 -funroll-loops     100  831.55372ms @ 29.554 GB/s
       x86-64-v4                               -O3     100  843.23141ms @ 29.145 GB/s
       x86-64-v4                -O3 -funroll-loops     100  821.19969ms @ 29.927 GB/s
          native                               -O2     100  1197.41357ms @ 20.524 GB/s
          native                -O2 -funroll-loops     100  718.05253ms @ 34.226 GB/s
          native                               -O3     100  747.94090ms @ 32.858 GB/s
          native                -O3 -funroll-loops     100  751.52379ms @ 32.702 GB/s
          x86-64                               -O2  100000  2911.47087ms @ 8.441 GB/s
          x86-64                -O2 -funroll-loops  100000  2525.45504ms @ 9.731 GB/s
          x86-64                               -O3  100000  2497.42016ms @ 9.841 GB/s
          x86-64                -O3 -funroll-loops  100000  2346.33551ms @ 10.474 GB/s
       x86-64-v2                               -O2  100000  2124.10102ms @ 11.570 GB/s
       x86-64-v2                -O2 -funroll-loops  100000  1819.09659ms @ 13.510 GB/s
       x86-64-v2                               -O3  100000  1613.45823ms @ 15.232 GB/s
       x86-64-v2                -O3 -funroll-loops  100000  1607.09245ms @ 15.292 GB/s
       x86-64-v3                               -O2  100000  1972.89390ms @ 12.457 GB/s
       x86-64-v3                -O2 -funroll-loops  100000  1432.58229ms @ 17.155 GB/s
       x86-64-v3                               -O3  100000  1533.18003ms @ 16.029 GB/s
       x86-64-v3                -O3 -funroll-loops  100000  1539.39779ms @ 15.965 GB/s
       x86-64-v4                               -O2  100000  1591.96881ms @ 15.437 GB/s
       x86-64-v4                -O2 -funroll-loops  100000  1434.91828ms @ 17.127 GB/s
       x86-64-v4                               -O3  100000  1454.30133ms @ 16.899 GB/s
       x86-64-v4                -O3 -funroll-loops  100000  1429.13733ms @ 17.196 GB/s
          native                               -O2  100000  1980.53734ms @ 12.409 GB/s
          native                -O2 -funroll-loops  100000  1373.95337ms @ 17.887 GB/s
          native                               -O3  100000  1517.90164ms @ 16.191 GB/s
          native                -O3 -funroll-loops  100000  1508.37021ms @ 16.293 GB/s



> > > Is it just that the calculation is slow, or is it the fact that 
> > > checksumming
> > > needs to bring the page into the CPU cache. Did you notice any hints which
> > > might be the case?
> >
> > I don't think the issue is that checksumming pulls the data into CPU caches
> >
> > 1) This is visible with SELECT that actually uses the data
> >
> > 2) I added prefetching to avoid any meaningful amount of cache misses and it
> >    doesn't change the overall timing much
> >
> > 3) It's visible with buffered IO, which has pulled the data into CPU caches
> >    already
>
> I didn't yet check the code: when doing aio completions, will checksumming
> be running on the same core as the one that is going to be using the page?

With io_uring, normally yes; the exception is that another backend that needs
the same page could end up running the completion.

With worker mode, normally no.

Greetings,

Andres Freund

