Hi,

On 2025-01-09 10:59:22 +0200, Ants Aasma wrote:
> On Wed, 8 Jan 2025 at 22:58, Andres Freund <and...@anarazel.de> wrote:
> > master:                 ~18 GB/s
> > patch, buffered:        ~20 GB/s
> > patch, direct, worker:  ~28 GB/s
> > patch, direct, uring:   ~35 GB/s
> >
> > This was with io_workers=32, io_max_concurrency=128,
> > effective_io_concurrency=1000 (doesn't need to be that high, but it's
> > what I still have the numbers for).
> >
> > This was without data checksums enabled as otherwise the checksum code
> > becomes a *huge* bottleneck.
>
> I'm curious about this because the checksum code should be fast enough
> to easily handle that throughput.
It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y workstation.
But we don't have a good ready-made way of testing that without also doing
IO, so it's kinda hard to say.


> I remember checksum overhead being negligible even when pulling in pages
> from page cache.

It's indeed much less of an issue when pulling pages from the page cache, as
the copy from the page cache is fairly slow. With direct IO, where the copy
from the page cache isn't the main driver of CPU use anymore, it becomes much
clearer.

Even with buffered IO it became a bigger issue in 17, due to
io_combine_limit: it turns out that lots of tiny syscalls are slow, so the
peak throughput that could reach the checksumming code was lower.

I created a 21554MB relation and measured the time to do a pg_prewarm() of
that relation after evicting all of shared buffers (the first time buffers
are touched has somewhat different perf characteristics).

I am using direct IO and io_uring here, as buffered IO would include the page
cache copy cost, and worker mode could parallelize the checksum computation
on reads. The checksum cost is a bigger issue for writes than for reads, but
it's harder to quickly generate enough dirty data for a repeatable benchmark.

This system can do about 12.5GB/s of read IO.
Just to show the effect of the read size on page cache copy performance:

config                                         checksums   time in ms
buffered io_engine=sync io_combine_limit=1     0           6712.153
buffered io_engine=sync io_combine_limit=2     0           5919.215
buffered io_engine=sync io_combine_limit=4     0           5738.496
buffered io_engine=sync io_combine_limit=8     0           5396.415
buffered io_engine=sync io_combine_limit=16    0           5312.803
buffered io_engine=sync io_combine_limit=32    0           5275.389

To see the effect of page cache copy overhead:

config                                         checksums   time in ms
buffered io_engine=io_uring                    0           3901.625
direct   io_engine=io_uring                    0           2075.330

Now to show the effect of checksums (enabled/disabled with pg_checksums):

config                                         checksums   time in ms
buffered io_engine=io_uring                    0           3883.127
buffered io_engine=io_uring                    1           5880.892
direct   io_engine=io_uring                    0           2067.142
direct   io_engine=io_uring                    1           3835.968

So with direct + uring w/o checksums, we can reach 10427 MB/s (close-ish to
disk speed), but with checksums we only reach 5620 MB/s.


> Is it just that the calculation is slow, or is it the fact that checksumming
> needs to bring the page into the CPU cache. Did you notice any hints which
> might be the case?

I don't think the issue is that checksumming pulls the data into CPU caches:

1) It's visible with a SELECT that actually uses the data.

2) I added prefetching to avoid any meaningful amount of cache misses, and it
   doesn't change the overall timing much.

3) It's visible with buffered IO, which has pulled the data into CPU caches
   already.


> I don't really have a machine at hand that can do anywhere close to this
> amount of I/O.

It's visible even when pulling from the page cache, if to a somewhat lesser
degree.

I wonder if it's worth adding a test function that computes checksums of all
shared buffers already in memory. That'd allow exercising the checksum code
in a realistic context (i.e. buffer locking etc. preventing some out-of-order
effects, using 8kB chunks, etc.) without also needing to involve the IO path.
> I'm asking because if it's the calculation that is slow then it seems
> like it's time to compile different ISA extension variants of the
> checksum code and select the best one at runtime.

You think it's ISA specific? I don't see a significant effect of compiling
with -march=native or not - and that should suffice to make the checksum code
built with sufficiently high ISA support, right?

FWIW, CPU profiles show all the time being spent in the "main checksum
calculation" loop:

Percent |	Source code & Disassembly of postgres for cycles:P (5866 samples, percent: local period)
--------------------------------------------------------------------------------------------------------
         :
         :
         :
         :   3     Disassembly of section .text:
         :
         :   5     00000000009e3670 <pg_checksum_page>:
         :   6      * calculation isn't affected by the old checksum stored on the page.
         :   7      * Restore it after, because actually updating the checksum is NOT part of
         :   8      * the API of this function.
         :   9      */
         :  10     save_checksum = cpage->phdr.pd_checksum;
         :  11     cpage->phdr.pd_checksum = 0;
    0.00 :   9e3670:  xor    %eax,%eax
         :  13     CHECKSUM_COMP(sums[j], page->data[i][j]);
    0.00 :   9e3672:  mov    $0x1000193,%r8d
         :  15     cpage->phdr.pd_checksum = 0;
    0.00 :   9e3678:  vmovdqa -0x693fa0(%rip),%ymm3        # 34f6e0 <.LC0>
    0.05 :   9e3680:  vmovdqa -0x6935c8(%rip),%ymm4        # 3500c0 <.LC1>
    0.00 :   9e3688:  vmovdqa -0x693c10(%rip),%ymm0        # 34fa80 <.LC2>
    0.00 :   9e3690:  vmovdqa -0x693598(%rip),%ymm1        # 350100 <.LC3>
         :  20     {
    0.00 :   9e3698:  mov    %esi,%ecx
    0.02 :   9e369a:  lea    0x2000(%rdi),%rdx
         :  23     save_checksum = cpage->phdr.pd_checksum;
    0.00 :   9e36a1:  movzwl 0x8(%rdi),%esi
         :  25     CHECKSUM_COMP(sums[j], page->data[i][j]);
    0.00 :   9e36a5:  vpbroadcastd %r8d,%ymm5
         :  27     cpage->phdr.pd_checksum = 0;
    0.00 :   9e36ab:  mov    %ax,0x8(%rdi)
         :  29     for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
    0.14 :   9e36af:  mov    %rdi,%rax
    0.00 :   9e36b2:  nopw   0x0(%rax,%rax,1)
         :  32     CHECKSUM_COMP(sums[j], page->data[i][j]);
   15.36 :   9e36b8:  vpxord (%rax),%ymm1,%ymm1
    4.79 :   9e36be:  vmovdqu 0x80(%rax),%ymm2
         :  35     for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
    0.07 :   9e36c6:  add    $0x100,%rax
         :  37     CHECKSUM_COMP(sums[j], page->data[i][j]);
    2.45 :   9e36cc:  vpxord -0xe0(%rax),%ymm0,%ymm0
    2.85 :   9e36d3:  vpmulld %ymm5,%ymm1,%ymm6
    0.02 :   9e36d8:  vpsrld $0x11,%ymm1,%ymm1
    3.17 :   9e36dd:  vpternlogd $0x96,%ymm6,%ymm1,%ymm2
    2.01 :   9e36e4:  vpmulld %ymm5,%ymm0,%ymm6
   13.16 :   9e36e9:  vpmulld %ymm5,%ymm2,%ymm1
    0.03 :   9e36ee:  vpsrld $0x11,%ymm2,%ymm2
    0.02 :   9e36f3:  vpsrld $0x11,%ymm0,%ymm0
    2.57 :   9e36f8:  vpxord %ymm2,%ymm1,%ymm1
    0.89 :   9e36fe:  vmovdqu -0x60(%rax),%ymm2
    0.12 :   9e3703:  vpternlogd $0x96,%ymm6,%ymm0,%ymm2
    4.48 :   9e370a:  vpmulld %ymm5,%ymm2,%ymm0
    0.49 :   9e370f:  vpsrld $0x11,%ymm2,%ymm2
    0.99 :   9e3714:  vpxord %ymm2,%ymm0,%ymm0
   11.88 :   9e371a:  vpxord -0xc0(%rax),%ymm4,%ymm2
    2.80 :   9e3721:  vpmulld %ymm5,%ymm2,%ymm6
    0.68 :   9e3726:  vpsrld $0x11,%ymm2,%ymm4
    4.94 :   9e372b:  vmovdqu -0x40(%rax),%ymm2
    1.45 :   9e3730:  vpternlogd $0x96,%ymm6,%ymm4,%ymm2
    8.63 :   9e3737:  vpmulld %ymm5,%ymm2,%ymm4
    0.17 :   9e373c:  vpsrld $0x11,%ymm2,%ymm2
    1.81 :   9e3741:  vpxord %ymm2,%ymm4,%ymm4
    0.10 :   9e3747:  vpxord -0xa0(%rax),%ymm3,%ymm2
    0.70 :   9e374e:  vpmulld %ymm5,%ymm2,%ymm6
    1.65 :   9e3753:  vpsrld $0x11,%ymm2,%ymm3
    0.03 :   9e3758:  vmovdqu -0x20(%rax),%ymm2
    0.85 :   9e375d:  vpternlogd $0x96,%ymm6,%ymm3,%ymm2
    3.73 :   9e3764:  vpmulld %ymm5,%ymm2,%ymm3
    0.07 :   9e3769:  vpsrld $0x11,%ymm2,%ymm2
    1.48 :   9e376e:  vpxord %ymm2,%ymm3,%ymm3
         :  68     for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
    0.02 :   9e3774:  cmp    %rax,%rdx
    2.32 :   9e3777:  jne    9e36b8 <pg_checksum_page+0x48>
         :  71     CHECKSUM_COMP(sums[j], 0);
    0.17 :   9e377d:  vpmulld %ymm5,%ymm0,%ymm7
    0.07 :   9e3782:  vpmulld %ymm5,%ymm1,%ymm6
         :  74     checksum = pg_checksum_block(cpage);
         :  75     cpage->phdr.pd_checksum = save_checksum;
    0.00 :   9e3787:  mov    %si,0x8(%rdi)
         :  77     CHECKSUM_COMP(sums[j], 0);
    0.02 :   9e378b:  vpsrld $0x11,%ymm0,%ymm0
    0.02 :   9e3790:  vpsrld $0x11,%ymm1,%ymm1
    0.02 :   9e3795:  vpsrld $0x11,%ymm4,%ymm2
    0.00 :   9e379a:  vpxord %ymm0,%ymm7,%ymm7
    0.10 :   9e37a0:  vpmulld %ymm5,%ymm4,%ymm0
    0.00 :   9e37a5:  vpxord %ymm1,%ymm6,%ymm6
    0.17 :   9e37ab:  vpmulld %ymm5,%ymm3,%ymm1
    0.19 :   9e37b0:  vpmulld %ymm5,%ymm6,%ymm4
    0.00 :   9e37b5:  vpsrld $0x11,%ymm6,%ymm6
    0.02 :   9e37ba:  vpxord %ymm2,%ymm0,%ymm0
    0.00 :   9e37c0:  vpsrld $0x11,%ymm3,%ymm2
    0.22 :   9e37c5:  vpmulld %ymm5,%ymm7,%ymm3
    0.02 :   9e37ca:  vpsrld $0x11,%ymm7,%ymm7
    0.00 :   9e37cf:  vpxord %ymm2,%ymm1,%ymm1
    0.03 :   9e37d5:  vpsrld $0x11,%ymm0,%ymm2
    0.15 :   9e37da:  vpmulld %ymm5,%ymm0,%ymm0
         :  94     result ^= sums[i];
    0.00 :   9e37df:  vpternlogd $0x96,%ymm3,%ymm7,%ymm2
         :  96     CHECKSUM_COMP(sums[j], 0);
    0.05 :   9e37e6:  vpsrld $0x11,%ymm1,%ymm3
    0.19 :   9e37eb:  vpmulld %ymm5,%ymm1,%ymm1
         :  99     result ^= sums[i];
    0.02 :   9e37f0:  vpternlogd $0x96,%ymm4,%ymm6,%ymm0
    0.10 :   9e37f7:  vpxord %ymm1,%ymm0,%ymm0
    0.07 :   9e37fd:  vpternlogd $0x96,%ymm2,%ymm3,%ymm0
    0.15 :   9e3804:  vextracti32x4 $0x1,%ymm0,%xmm1
    0.03 :   9e380b:  vpxord %xmm0,%xmm1,%xmm0
    0.14 :   9e3811:  vpsrldq $0x8,%xmm0,%xmm1
    0.12 :   9e3816:  vpxord %xmm1,%xmm0,%xmm0
    0.09 :   9e381c:  vpsrldq $0x4,%xmm0,%xmm1
    0.12 :   9e3821:  vpxord %xmm1,%xmm0,%xmm0
    0.05 :   9e3827:  vmovd  %xmm0,%eax
         :
         : 111     /* Mix in the block number to detect transposed pages */
         : 112     checksum ^= blkno;
    0.07 :   9e382b:  xor    %ecx,%eax
         :
         : 115     /*
         : 116      * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
         : 117      * one. That avoids checksums of zero, which seems like a good idea.
         : 118      */
         : 119     return (uint16) ((checksum % 65535) + 1);
    0.00 :   9e382d:  mov    $0x80008001,%ecx
    0.03 :   9e3832:  mov    %eax,%edx
    0.27 :   9e3834:  imul   %rcx,%rdx
    0.09 :   9e3838:  shr    $0x2f,%rdx
    0.07 :   9e383c:  lea    0x1(%rax,%rdx,1),%eax
    0.00 :   9e3840:  vzeroupper
         : 126     }
    0.15 :   9e3843:  ret

I did briefly experiment with changing N_SUMS. 16 is substantially worse, 64
seems to be about the same as 32.

Greetings,

Andres Freund