Hi,

On 2025-01-09 20:10:24 +0200, Ants Aasma wrote:
> On Thu, 9 Jan 2025 at 18:25, Andres Freund <and...@anarazel.de> wrote:
> > > I'm curious about this because the checksum code should be fast enough
> > > to easily handle that throughput.
> >
> > It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
> > workstation. But we don't have a good ready-made way of testing that without
> > also doing IO, so it's kinda hard to say.
>
> Interesting, I wonder if it's related to Intel increasing vpmulld
> latency to 10 already back in Haswell. The Zen 3 I'm testing on has
> latency 3 and has twice the throughput.
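To spell out why that latency would matter here (a simplified illustration
only, not the actual checksum_impl.h code - the real CHECKSUM_COMP also xors
in a shifted copy of the intermediate value):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Each accumulator does an xor followed by a 32-bit multiply (vpmulld) whose
 * result feeds the next iteration, so the multiply latency bounds how fast a
 * single accumulator can go; hiding a ~10 cycle latency needs correspondingly
 * many independent accumulators in flight.
 */
__attribute__((target("avx2"))) static __m256i
mix_step(__m256i sum, __m256i value)
{
	const __m256i fnv_prime = _mm256_set1_epi32(16777619);

	sum = _mm256_xor_si256(sum, value);
	return _mm256_mullo_epi32(sum, fnv_prime);	/* vpmulld */
}

__attribute__((target("avx2"))) int
main(void)
{
	uint32_t	page[2048] = {0};	/* one 8kB page worth of 32-bit words */
	__m256i		sum = _mm256_setzero_si256();

	/* a single accumulator: every step waits on the previous multiply */
	for (int i = 0; i < 2048; i += 8)
		sum = mix_step(sum, _mm256_loadu_si256((const __m256i *) &page[i]));

	printf("%u\n", (unsigned) _mm256_extract_epi32(sum, 0));
	return 0;
}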
> Attached is a naive and crude benchmark that I used for testing here.
> Compiled with:
>
> gcc -O2 -funroll-loops -ftree-vectorize -march=native \
>     -I$(pg_config --includedir-server) \
>     bench-checksums.c -o bench-checksums-native
>
> Just fills up an array of pages and checksums them, first argument is
> number of checksums, second is array size. I used 1M checksums and 100
> pages for in cache behavior and 100000 pages for in memory
> performance.
>
> 869.85927ms @ 9.418 GB/s - generic from memory
> 772.12252ms @ 10.610 GB/s - generic in cache
> 442.61869ms @ 18.508 GB/s - native from memory
> 137.07573ms @ 59.763 GB/s - native in cache

printf '%16s\t%16s\t%s\n' march mem result
for mem in 100 100000 1000000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    printf "%16s\t%16s\t" $march $mem
    gcc -g -g3 -O2 -funroll-loops -ftree-vectorize -march=$march \
        -I ~/src/postgresql/src/include/ -I src/include/ \
        /tmp/bench-checksums.c -o bench-checksums-native && \
      numactl --physcpubind 1 --membind 0 ./bench-checksums-native 1000000 $mem
  done
done

Workstation w/ 2x Xeon Gold 6442Y:

     march        mem   result
    x86-64        100    731.87779ms @ 11.193 GB/s
 x86-64-v2        100    327.18580ms @ 25.038 GB/s
 x86-64-v3        100    264.03547ms @ 31.026 GB/s
 x86-64-v4        100    282.08065ms @ 29.041 GB/s
    native        100    246.13766ms @ 33.282 GB/s
    x86-64     100000    842.66827ms @ 9.722 GB/s
 x86-64-v2     100000    604.52959ms @ 13.551 GB/s
 x86-64-v3     100000    477.16239ms @ 17.168 GB/s
 x86-64-v4     100000    476.07039ms @ 17.208 GB/s
    native     100000    456.08080ms @ 17.962 GB/s
    x86-64    1000000    845.51132ms @ 9.689 GB/s
 x86-64-v2    1000000    612.07973ms @ 13.384 GB/s
 x86-64-v3    1000000    485.23738ms @ 16.882 GB/s
 x86-64-v4    1000000    483.86411ms @ 16.930 GB/s
    native    1000000    462.88461ms @ 17.698 GB/s

Zen 4 laptop (AMD Ryzen 7 PRO 7840U):

     march        mem   result
    x86-64        100    417.19762ms @ 19.636 GB/s
 x86-64-v2        100    130.67596ms @ 62.689 GB/s
 x86-64-v3        100     97.07758ms @ 84.386 GB/s
 x86-64-v4        100     95.67704ms @ 85.621 GB/s
    native        100     95.15734ms @ 86.089 GB/s
    x86-64     100000    431.38370ms @ 18.990 GB/s
 x86-64-v2     100000    215.74856ms @ 37.970 GB/s
 x86-64-v3     100000    199.74492ms @ 41.012 GB/s
 x86-64-v4     100000    186.98300ms @ 43.811 GB/s
    native     100000    187.68125ms @ 43.648 GB/s
    x86-64    1000000    433.87893ms @ 18.881 GB/s
 x86-64-v2    1000000    217.46561ms @ 37.670 GB/s
 x86-64-v3    1000000    200.40667ms @ 40.877 GB/s
 x86-64-v4    1000000    187.51978ms @ 43.686 GB/s
    native    1000000    190.29273ms @ 43.049 GB/s

Workstation w/ 2x Xeon Gold 5215:

     march        mem   result
    x86-64        100    780.38881ms @ 10.497 GB/s
 x86-64-v2        100    389.62005ms @ 21.026 GB/s
 x86-64-v3        100    323.97294ms @ 25.286 GB/s
 x86-64-v4        100    274.19493ms @ 29.877 GB/s
    native        100    283.48674ms @ 28.897 GB/s
    x86-64     100000   1112.63898ms @ 7.363 GB/s
 x86-64-v2     100000    831.45641ms @ 9.853 GB/s
 x86-64-v3     100000    696.20789ms @ 11.767 GB/s
 x86-64-v4     100000    685.61636ms @ 11.948 GB/s
    native     100000    689.78023ms @ 11.876 GB/s
    x86-64    1000000   1128.65580ms @ 7.258 GB/s
 x86-64-v2    1000000    843.92594ms @ 9.707 GB/s
 x86-64-v3    1000000    718.78848ms @ 11.397 GB/s
 x86-64-v4    1000000    687.68258ms @ 11.912 GB/s
    native    1000000    705.34731ms @ 11.614 GB/s

That's quite the drastic difference between amd and intel. Of course it's also
comparing a multi-core server uarch (lower per-core bandwidth, much higher
aggregate bandwidth) with a client uarch.

The difference between the baseline CPU target and a more modern profile is
also rather impressive. Looks like some cpu-capability based dispatch would
likely be worth it, even if it didn't matter in my case due to -march=native.
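Just to make the dispatch idea a bit more concrete, here's a minimal sketch
using gcc/clang function multiversioning (the loop is only a stand-in with the
same accumulator structure as the real checksum, and none of this is from an
actual patch):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define N_SUMS 32

/*
 * The function is compiled once per listed target and an ifunc resolver picks
 * the best variant at program start, so no -march flag is needed at build
 * time to get AVX2/AVX-512 code on capable CPUs.
 */
__attribute__((target_clones("default", "avx2", "avx512f")))
uint32_t
checksum_block(const uint32_t *data, size_t nwords)
{
	uint32_t	sums[N_SUMS] = {0};
	uint32_t	result = 0;

	/* assumes nwords is a multiple of N_SUMS, as it is for 8kB pages */
	for (size_t i = 0; i < nwords; i += N_SUMS)
		for (int j = 0; j < N_SUMS; j++)
			sums[j] = (sums[j] ^ data[i + j]) * 16777619;

	for (int j = 0; j < N_SUMS; j++)
		result ^= sums[j];
	return result;
}

int
main(void)
{
	static uint32_t page[2048];	/* one zeroed 8kB page */

	printf("%u\n", checksum_block(page, 2048));
	return 0;
}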
I just realized that

a) The meson build doesn't use the relevant flags for bufpage.c - it didn't
   matter in my numbers though, because I was building with -O3 and
   -march=native. This clearly ought to be fixed.

b) Neither build uses the optimized flags for pg_checksums and pg_upgrade,
   both of which include checksum_impl.h directly. This probably should be
   fixed too - perhaps by building the relevant code once as part of fe_utils
   or such? (rough sketch below, after the table)

It probably matters less than it used to - these days -O2 turns on
-ftree-loop-vectorize and -ftree-slp-vectorize. But loop unrolling isn't
enabled. I do see a perf difference at -O2 between using/not using
-funroll-loops. Interestingly not at -O3, despite -funroll-loops not actually
being enabled by -O3. I think the relevant option that *is* turned on by -O3
is -fpeel-loops.

Here's a comparison of different flags, run on the 6442Y:

printf '%16s\t%32s\t%16s\t%s\n' march flags mem result
for mem in 100 100000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    for flags in "-O2" "-O2 -funroll-loops" "-O3" "-O3 -funroll-loops"; do
      printf "%16s\t%32s\t%16s\t" "$march" "$flags" "$mem"
      gcc $flags -march=$march \
          -I ~/src/postgresql/src/include/ -I src/include/ \
          /tmp/bench-checksums.c -o bench-checksums-native && \
        numactl --physcpubind 3 --membind 0 ./bench-checksums-native 3000000 $mem
    done
  done
done

     march                   flags      mem   result
    x86-64                     -O2      100   2280.86253ms @ 10.775 GB/s
    x86-64      -O2 -funroll-loops      100   2195.66942ms @ 11.193 GB/s
    x86-64                     -O3      100   2422.57588ms @ 10.145 GB/s
    x86-64      -O3 -funroll-loops      100   2243.75826ms @ 10.953 GB/s
 x86-64-v2                     -O2      100   1243.68063ms @ 19.761 GB/s
 x86-64-v2      -O2 -funroll-loops      100    979.67783ms @ 25.086 GB/s
 x86-64-v2                     -O3      100    988.80296ms @ 24.854 GB/s
 x86-64-v2      -O3 -funroll-loops      100    991.31632ms @ 24.791 GB/s
 x86-64-v3                     -O2      100   1146.90165ms @ 21.428 GB/s
 x86-64-v3      -O2 -funroll-loops      100    785.81395ms @ 31.275 GB/s
 x86-64-v3                     -O3      100    800.53627ms @ 30.699 GB/s
 x86-64-v3      -O3 -funroll-loops      100    790.21230ms @ 31.101 GB/s
 x86-64-v4                     -O2      100    883.82916ms @ 27.806 GB/s
 x86-64-v4      -O2 -funroll-loops      100    831.55372ms @ 29.554 GB/s
 x86-64-v4                     -O3      100    843.23141ms @ 29.145 GB/s
 x86-64-v4      -O3 -funroll-loops      100    821.19969ms @ 29.927 GB/s
    native                     -O2      100   1197.41357ms @ 20.524 GB/s
    native      -O2 -funroll-loops      100    718.05253ms @ 34.226 GB/s
    native                     -O3      100    747.94090ms @ 32.858 GB/s
    native      -O3 -funroll-loops      100    751.52379ms @ 32.702 GB/s
    x86-64                     -O2   100000   2911.47087ms @ 8.441 GB/s
    x86-64      -O2 -funroll-loops   100000   2525.45504ms @ 9.731 GB/s
    x86-64                     -O3   100000   2497.42016ms @ 9.841 GB/s
    x86-64      -O3 -funroll-loops   100000   2346.33551ms @ 10.474 GB/s
 x86-64-v2                     -O2   100000   2124.10102ms @ 11.570 GB/s
 x86-64-v2      -O2 -funroll-loops   100000   1819.09659ms @ 13.510 GB/s
 x86-64-v2                     -O3   100000   1613.45823ms @ 15.232 GB/s
 x86-64-v2      -O3 -funroll-loops   100000   1607.09245ms @ 15.292 GB/s
 x86-64-v3                     -O2   100000   1972.89390ms @ 12.457 GB/s
 x86-64-v3      -O2 -funroll-loops   100000   1432.58229ms @ 17.155 GB/s
 x86-64-v3                     -O3   100000   1533.18003ms @ 16.029 GB/s
 x86-64-v3      -O3 -funroll-loops   100000   1539.39779ms @ 15.965 GB/s
 x86-64-v4                     -O2   100000   1591.96881ms @ 15.437 GB/s
 x86-64-v4      -O2 -funroll-loops   100000   1434.91828ms @ 17.127 GB/s
 x86-64-v4                     -O3   100000   1454.30133ms @ 16.899 GB/s
 x86-64-v4      -O3 -funroll-loops   100000   1429.13733ms @ 17.196 GB/s
    native                     -O2   100000   1980.53734ms @ 12.409 GB/s
    native      -O2 -funroll-loops   100000   1373.95337ms @ 17.887 GB/s
    native                     -O3   100000   1517.90164ms @ 16.191 GB/s
    native      -O3 -funroll-loops   100000   1508.37021ms @ 16.293 GB/s
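For b), a rough sketch of the shape such a translation unit could have (file
name and placement are assumptions, not a patch; the backend already does the
equivalent in src/backend/storage/page/checksum.c):

/*
 * hypothetical src/fe_utils/checksum.c: compile the checksum implementation
 * exactly once for frontend code, with the optimized flags applied to this
 * file, and have pg_checksums/pg_upgrade link against it instead of each
 * including checksum_impl.h themselves.
 */
#include "postgres_fe.h"

#include "storage/checksum.h"
#include "storage/checksum_impl.h"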
> > > Is it just that the calculation is slow, or is it the fact that
> > > checksumming needs to bring the page into the CPU cache. Did you
> > > notice any hints which might be the case?
> >
> > I don't think the issue is that checksumming pulls the data into CPU caches
> >
> > 1) This is visible with SELECT that actually uses the data
> >
> > 2) I added prefetching to avoid any meaningful amount of cache misses and it
> > doesn't change the overall timing much
> >
> > 3) It's visible with buffered IO, which has pulled the data into CPU caches
> > already
>
> I didn't yet check the code, but when doing aio completions, will checksumming
> be running on the same core as is going to be using the page?

With io_uring normally yes, the exception being that another backend that
needs the same page could end up running the completion.

With worker mode normally no.

Greetings,

Andres Freund