On Thu, 9 Jan 2025 at 18:25, Andres Freund <and...@anarazel.de> wrote:
> > I'm curious about this because the checksum code should be fast enough
> > to easily handle that throughput.
>
> It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
> workstation. But we don't have a good ready-made way of testing that without
> also doing IO, so it's kinda hard to say.
Interesting, I wonder if it's related to Intel having increased vpmulld
latency to 10 as far back as Haswell. The Zen 3 I'm testing on has latency 3
and twice the throughput.

Attached is a naive and crude benchmark that I used for testing here.
Compiled with:

gcc -O2 -funroll-loops -ftree-vectorize -march=native \
    -I$(pg_config --includedir-server) \
    bench-checksums.c -o bench-checksums-native

It just fills up an array of pages and checksums them; the first argument is
the number of checksums, the second is the array size in pages. I used 1M
checksums, with 100 pages for in-cache behavior and 100000 pages for
in-memory performance.

 869.85927ms @  9.418 GB/s - generic from memory
 772.12252ms @ 10.610 GB/s - generic in cache
 442.61869ms @ 18.508 GB/s - native from memory
 137.07573ms @ 59.763 GB/s - native in cache

> > Is it just that the calculation is slow, or is it the fact that checksumming
> > needs to bring the page into the CPU cache. Did you notice any hints which
> > might be the case?
>
> I don't think the issue is that checksumming pulls the data into CPU caches
>
> 1) This is visible with SELECT that actually uses the data
>
> 2) I added prefetching to avoid any meaningful amount of cache misses and it
>    doesn't change the overall timing much
>
> 3) It's visible with buffered IO, which has pulled the data into CPU caches
>    already

I didn't yet check the code: when doing AIO completions, will the checksumming
be running on the same core that is going to be using the page? It could also
be that the checksumming is creating extra traffic on the memory bus or the
CPU-internal rings, which causes contention because of the large amount of
data already flying around.

> > I don't really have a machine at hand that can do anywhere close to this
> > amount of I/O.
>
> It's visible even when pulling from the page cache, if to a somewhat lesser
> degree.

Good point, I'll see if I can reproduce that.

> I wonder if it's worth adding a test function that computes checksums of all
> shared buffers in memory already. That'd allow exercising the checksum code in
> a realistic context (i.e. buffer locking etc preventing some out-of-order
> effects, using 8kB chunks etc) without also needing to involve the IO path.

Out-of-order effects shouldn't matter that much: over here, even in the best
case, it's still taking 500+ cycles per iteration.

> > I'm asking because if it's the calculation that is slow then it seems
> > like it's time to compile different ISA extension variants of the
> > checksum code and select the best one at runtime.
>
> You think it's ISA specific? I don't see a significant effect of compiling
> with -march=native or not - and that should suffice to make the checksum code
> built with sufficiently high ISA support, right?

Right, the disassembly below looked very good.

> FWIW CPU profiles show all the time being spent in the "main checksum
> calculation" loop:

.. disassembly omitted for brevity

I'm not sure whether it's applicable here, due to microarch differences, but
in my case, when bounded by memory bandwidth, the main-loop profile events
were clustered around a few instructions, much like here, whereas when running
from cache all instructions were about equally represented.

> I did briefly experiment with changing N_SUMS. 16 is substantially worse, 64
> seems to be about the same as 32.

This suggests that mulld latency is not the culprit.

Regards,
Ants
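PS: If such a test function does turn out to be useful, a minimal (untested)
sketch of what it could look like is below. The function name is made up, and
it copies each valid page out under the content lock instead of dealing with
pinning, since pg_checksum_page temporarily scribbles on the checksum field
while computing it.

/* Untested sketch: checksum every valid page currently in shared buffers. */
#include "postgres.h"

#include "fmgr.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/checksum.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(checksum_shared_buffers);

Datum
checksum_shared_buffers(PG_FUNCTION_ARGS)
{
    uint64      sum = 0;

    for (int i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(i);
        LWLock     *content_lock = BufferDescriptorGetContentLock(bufHdr);
        PGAlignedBlock copy;

        /* only look at buffers that hold a valid page */
        if (!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID))
            continue;

        /*
         * Copy the page out under the content lock so pg_checksum_page's
         * temporary write to pd_checksum doesn't touch shared memory.
         * A real version would pin the buffer first.
         */
        LWLockAcquire(content_lock, LW_SHARED);
        memcpy(copy.data,
               BufferGetBlock(BufferDescriptorGetBuffer(bufHdr)),
               BLCKSZ);
        LWLockRelease(content_lock);

        sum += pg_checksum_page(copy.data, bufHdr->tag.blockNum);
    }

    PG_RETURN_INT64((int64) sum);
}

Calling it from SQL would just need the usual CREATE FUNCTION ... RETURNS
bigint ... LANGUAGE C wrapper.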
#include "postgres.h" #include "storage/checksum_impl.h" #include <time.h> #undef printf int __attribute__ ((noinline)) checksum_block(char *page, uint32 blockno) { return pg_checksum_page(page, blockno); } int main(int argc, char *argv[]) { char *page; uint64 i; uint64 sum = 0; struct timespec start; struct timespec end; double delta; if (argc<3) { printf("Usage: %s niterations nblocks\n", argv[0]); return 1; } uint64 n = strtoull(argv[1], 0, 10); uint64 b = strtoull(argv[2], 0, 10); page = malloc(BLCKSZ*b); for (i = 0; i < BLCKSZ*b; i++) page[i] = (i*997) & 0xFF; clock_gettime(CLOCK_MONOTONIC_RAW, &start); for (i = 0; i < n; i++) sum += checksum_block(page + BLCKSZ*(i % b), (uint32) i); clock_gettime(CLOCK_MONOTONIC_RAW, &end); delta = (double)(end.tv_sec - start.tv_sec) + (1e-9*(double) (end.tv_nsec - start.tv_nsec)); printf("%0.5fms @ %0.3f GB/s\n", delta*1000, (8192.0 * n)/delta/1e9); return 0; }