Hi,

On 2025-01-09 20:10:24 +0200, Ants Aasma wrote:
> On Thu, 9 Jan 2025 at 18:25, Andres Freund <and...@anarazel.de> wrote:
> > > I'm curious about this because the checksum code should be fast enough
> > > to easily handle that throughput.
> >
> > It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
> > workstation. But we don't have a good ready-made way of testing that without
> > also doing IO, so it's kinda hard to say.
>
> Interesting, I wonder if it's related to Intel increasing vpmulld
> latency to 10 already back in Haswell. The Zen 3 I'm testing on has
> latency 3 and has twice the throughput.

> Attached is a naive and crude benchmark that I used for testing here.
> Compiled with:
>
> gcc -O2 -funroll-loops -ftree-vectorize -march=native \
>   -I$(pg_config --includedir-server) \
>   bench-checksums.c -o bench-checksums-native
>
> Just fills up an array of pages and checksums them, first argument is
> number of checksums, second is array size. I used 1M checksums and 100
> pages for in cache behavior and 100000 pages for in memory
> performance.
>
> 869.85927ms @ 9.418 GB/s - generic from memory
> 772.12252ms @ 10.610 GB/s - generic in cache
> 442.61869ms @ 18.508 GB/s - native from memory
> 137.07573ms @ 59.763 GB/s - native in cache
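
The harness is, as described, just a loop around pg_checksum_page(). A
minimal sketch of that shape (a hypothetical reconstruction - the real
bench-checksums.c is in Ants' attachment and isn't reproduced here; assumes
building with -I$(pg_config --includedir-server), the same way the frontend
tools include checksum_impl.h directly):

#include "postgres_fe.h"

#include <time.h>

#include "storage/checksum_impl.h"	/* provides pg_checksum_page() */

int
main(int argc, char **argv)
{
	long		nsums = atol(argv[1]);	/* number of checksum calls */
	long		npages = atol(argv[2]); /* working-set size, in pages */
	char	   *pages = malloc(npages * BLCKSZ);
	volatile uint16 sink = 0;	/* keeps the calls from being elided */
	struct timespec start, end;
	double		ms;

	/* fill the pages with something non-zero */
	for (long i = 0; i < npages * BLCKSZ; i++)
		pages[i] = (char) i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < nsums; i++)
		sink |= pg_checksum_page(pages + (i % npages) * BLCKSZ,
								 (BlockNumber) i);
	clock_gettime(CLOCK_MONOTONIC, &end);

	ms = (end.tv_sec - start.tv_sec) * 1000.0 +
		(end.tv_nsec - start.tv_nsec) / 1000000.0;
	printf("%.5fms @ %.3f GB/s\n", ms,
		   (double) nsums * BLCKSZ / (ms / 1000.0) / 1e9);
	return 0;
}

Here's what I ran with it, across the x86-64 microarchitecture levels: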

printf '%16s\t%16s\t%s\n' march mem result
for mem in 100 100000 1000000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    printf "%16s\t%16s\t" $march $mem
    gcc -g -g3 -O2 -funroll-loops -ftree-vectorize -march=$march \
      -I ~/src/postgresql/src/include/ -I src/include/ \
      /tmp/bench-checksums.c -o bench-checksums-native \
      && numactl --physcpubind 1 --membind 0 \
        ./bench-checksums-native 1000000 $mem
  done
done

Workstation w/ 2x Xeon Gold 6442Y:

           march                     mem        result
          x86-64                     100        731.87779ms @ 11.193 GB/s
       x86-64-v2                     100        327.18580ms @ 25.038 GB/s
       x86-64-v3                     100        264.03547ms @ 31.026 GB/s
       x86-64-v4                     100        282.08065ms @ 29.041 GB/s
          native                     100        246.13766ms @ 33.282 GB/s
          x86-64                  100000        842.66827ms @ 9.722 GB/s
       x86-64-v2                  100000        604.52959ms @ 13.551 GB/s
       x86-64-v3                  100000        477.16239ms @ 17.168 GB/s
       x86-64-v4                  100000        476.07039ms @ 17.208 GB/s
          native                  100000        456.08080ms @ 17.962 GB/s
          x86-64                 1000000        845.51132ms @ 9.689 GB/s
       x86-64-v2                 1000000        612.07973ms @ 13.384 GB/s
       x86-64-v3                 1000000        485.23738ms @ 16.882 GB/s
       x86-64-v4                 1000000        483.86411ms @ 16.930 GB/s
          native                 1000000        462.88461ms @ 17.698 GB/s



Zen 4 laptop (AMD Ryzen 7 PRO 7840U):
           march                     mem        result
          x86-64                     100        417.19762ms @ 19.636 GB/s
       x86-64-v2                     100        130.67596ms @ 62.689 GB/s
       x86-64-v3                     100        97.07758ms @ 84.386 GB/s
       x86-64-v4                     100        95.67704ms @ 85.621 GB/s
          native                     100        95.15734ms @ 86.089 GB/s
          x86-64                  100000        431.38370ms @ 18.990 GB/s
       x86-64-v2                  100000        215.74856ms @ 37.970 GB/s
       x86-64-v3                  100000        199.74492ms @ 41.012 GB/s
       x86-64-v4                  100000        186.98300ms @ 43.811 GB/s
          native                  100000        187.68125ms @ 43.648 GB/s
          x86-64                 1000000        433.87893ms @ 18.881 GB/s
       x86-64-v2                 1000000        217.46561ms @ 37.670 GB/s
       x86-64-v3                 1000000        200.40667ms @ 40.877 GB/s
       x86-64-v4                 1000000        187.51978ms @ 43.686 GB/s
          native                 1000000        190.29273ms @ 43.049 GB/s


Workstation w/ 2x Xeon Gold 5215:
           march                     mem        result
          x86-64                     100        780.38881ms @ 10.497 GB/s
       x86-64-v2                     100        389.62005ms @ 21.026 GB/s
       x86-64-v3                     100        323.97294ms @ 25.286 GB/s
       x86-64-v4                     100        274.19493ms @ 29.877 GB/s
          native                     100        283.48674ms @ 28.897 GB/s
          x86-64                  100000        1112.63898ms @ 7.363 GB/s
       x86-64-v2                  100000        831.45641ms @ 9.853 GB/s
       x86-64-v3                  100000        696.20789ms @ 11.767 GB/s
       x86-64-v4                  100000        685.61636ms @ 11.948 GB/s
          native                  100000        689.78023ms @ 11.876 GB/s
          x86-64                 1000000        1128.65580ms @ 7.258 GB/s
       x86-64-v2                 1000000        843.92594ms @ 9.707 GB/s
       x86-64-v3                 1000000        718.78848ms @ 11.397 GB/s
       x86-64-v4                 1000000        687.68258ms @ 11.912 GB/s
          native                 1000000        705.34731ms @ 11.614 GB/s


That's quite a drastic difference between AMD and Intel. Of course it's also
comparing a multi-core server uarch (lower per-core bandwidth, much higher
aggregate bandwidth) with a client uarch.


The difference between the baseline CPU target and a more modern profile is
also rather impressive.  Looks like some CPU-capability-based dispatch would
likely be worth it, even though it didn't matter in my case due to
-march=native.
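
To spell out what I mean by dispatch - a one-time runtime choice through a
function pointer, e.g. keyed off __builtin_cpu_supports(). Purely an
illustrative sketch with made-up names, not how the tree builds the checksum
code today:

#include <stddef.h>
#include <stdint.h>

/* stand-in for the real checksum loop */
static uint64_t
checksum_generic(const uint8_t *buf, size_t len)
{
	uint64_t	sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/*
 * Same source; the target attribute just lets the compiler auto-vectorize
 * this copy with AVX2, independent of the -march baseline.
 */
__attribute__((target("avx2")))
static uint64_t
checksum_avx2(const uint8_t *buf, size_t len)
{
	uint64_t	sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

static uint64_t (*checksum_impl) (const uint8_t *, size_t);

uint64_t
checksum(const uint8_t *buf, size_t len)
{
	/* pick an implementation on first use */
	if (checksum_impl == NULL)
		checksum_impl = __builtin_cpu_supports("avx2") ?
			checksum_avx2 : checksum_generic;
	return checksum_impl(buf, len);
}

gcc/clang's __attribute__((target_clones(...))) would get the same effect
with less boilerplate, at the cost of requiring ifunc support in the
toolchain and libc.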


I just realized that

a) The meson build doesn't use the relevant flags for bufpage.c - it didn't
   matter in my numbers though, because I was building with -O3 and
   -march=native.

   This clearly ought to be fixed.

b) Neither build uses the optimized flags for pg_checksums and pg_upgrade,
   both of which include checksum_impl.h directly.

   This probably should be fixed too - perhaps by building the relevant code
   once as part of fe_utils or such?
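
   I.e., roughly, a hypothetical fe_utils/checksum.c built once with
   CFLAGS_UNROLL_LOOPS/CFLAGS_VECTORIZE, mirroring what the backend's
   storage/page/checksum.c already does:

   #include "postgres_fe.h"

   #include "storage/checksum_impl.h"

   with pg_checksums and pg_upgrade then linking against fe_utils instead of
   each including the implementation header themselves.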


It probably matters less than it used to - these days -O2 turns on
-ftree-loop-vectorize and -ftree-slp-vectorize. But loop unrolling isn't
enabled.

I do see a perf difference at -O2 between using/not using
-funroll-loops. Interestingly not at -O3, despite -funroll-loops not actually
being enabled by -O3. I think the relevant option that *is* turned on by -O3
is -fpeel-loops.

Here's a comparison of different flags, run on the 6442Y:

printf '%16s\t%32s\t%16s\t%s\n' march flags mem result
for mem in 100 100000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    for flags in "-O2" "-O2 -funroll-loops" "-O3" "-O3 -funroll-loops"; do
      printf "%16s\t%32s\t%16s\t" "$march" "$flags" "$mem"
      gcc $flags -march=$march \
        -I ~/src/postgresql/src/include/ -I src/include/ \
        /tmp/bench-checksums.c -o bench-checksums-native \
        && numactl --physcpubind 3 --membind 0 \
          ./bench-checksums-native 3000000 $mem
    done
  done
done
           march                             flags     mem  result
          x86-64                               -O2     100  2280.86253ms @ 10.775 GB/s
          x86-64                -O2 -funroll-loops     100  2195.66942ms @ 11.193 GB/s
          x86-64                               -O3     100  2422.57588ms @ 10.145 GB/s
          x86-64                -O3 -funroll-loops     100  2243.75826ms @ 10.953 GB/s
       x86-64-v2                               -O2     100  1243.68063ms @ 19.761 GB/s
       x86-64-v2                -O2 -funroll-loops     100  979.67783ms @ 25.086 GB/s
       x86-64-v2                               -O3     100  988.80296ms @ 24.854 GB/s
       x86-64-v2                -O3 -funroll-loops     100  991.31632ms @ 24.791 GB/s
       x86-64-v3                               -O2     100  1146.90165ms @ 21.428 GB/s
       x86-64-v3                -O2 -funroll-loops     100  785.81395ms @ 31.275 GB/s
       x86-64-v3                               -O3     100  800.53627ms @ 30.699 GB/s
       x86-64-v3                -O3 -funroll-loops     100  790.21230ms @ 31.101 GB/s
       x86-64-v4                               -O2     100  883.82916ms @ 27.806 GB/s
       x86-64-v4                -O2 -funroll-loops     100  831.55372ms @ 29.554 GB/s
       x86-64-v4                               -O3     100  843.23141ms @ 29.145 GB/s
       x86-64-v4                -O3 -funroll-loops     100  821.19969ms @ 29.927 GB/s
          native                               -O2     100  1197.41357ms @ 20.524 GB/s
          native                -O2 -funroll-loops     100  718.05253ms @ 34.226 GB/s
          native                               -O3     100  747.94090ms @ 32.858 GB/s
          native                -O3 -funroll-loops     100  751.52379ms @ 32.702 GB/s
          x86-64                               -O2  100000  2911.47087ms @ 8.441 GB/s
          x86-64                -O2 -funroll-loops  100000  2525.45504ms @ 9.731 GB/s
          x86-64                               -O3  100000  2497.42016ms @ 9.841 GB/s
          x86-64                -O3 -funroll-loops  100000  2346.33551ms @ 10.474 GB/s
       x86-64-v2                               -O2  100000  2124.10102ms @ 11.570 GB/s
       x86-64-v2                -O2 -funroll-loops  100000  1819.09659ms @ 13.510 GB/s
       x86-64-v2                               -O3  100000  1613.45823ms @ 15.232 GB/s
       x86-64-v2                -O3 -funroll-loops  100000  1607.09245ms @ 15.292 GB/s
       x86-64-v3                               -O2  100000  1972.89390ms @ 12.457 GB/s
       x86-64-v3                -O2 -funroll-loops  100000  1432.58229ms @ 17.155 GB/s
       x86-64-v3                               -O3  100000  1533.18003ms @ 16.029 GB/s
       x86-64-v3                -O3 -funroll-loops  100000  1539.39779ms @ 15.965 GB/s
       x86-64-v4                               -O2  100000  1591.96881ms @ 15.437 GB/s
       x86-64-v4                -O2 -funroll-loops  100000  1434.91828ms @ 17.127 GB/s
       x86-64-v4                               -O3  100000  1454.30133ms @ 16.899 GB/s
       x86-64-v4                -O3 -funroll-loops  100000  1429.13733ms @ 17.196 GB/s
          native                               -O2  100000  1980.53734ms @ 12.409 GB/s
          native                -O2 -funroll-loops  100000  1373.95337ms @ 17.887 GB/s
          native                               -O3  100000  1517.90164ms @ 16.191 GB/s
          native                -O3 -funroll-loops  100000  1508.37021ms @ 16.293 GB/s



> > > Is it just that the calculation is slow, or is it the fact that 
> > > checksumming
> > > needs to bring the page into the CPU cache. Did you notice any hints which
> > > might be the case?
> >
> > I don't think the issue is that checksumming pulls the data into CPU caches
> >
> > 1) This is visible with SELECT that actually uses the data
> >
> > 2) I added prefetching to avoid any meaningful amount of cache misses and it
> >    doesn't change the overall timing much
> >
> > 3) It's visible with buffered IO, which has pulled the data into CPU caches
> >    already
>
> I didn't yet check the code: when doing aio completions, will checksumming
> be running on the same core as the one that is going to be using the page?

With io_uring, normally yes; the exception is that another backend that needs
the same page could end up running the completion.

With worker mode, normally no.

Greetings,

Andres Freund

