Hi,

On 2025-01-09 10:59:22 +0200, Ants Aasma wrote:
> On Wed, 8 Jan 2025 at 22:58, Andres Freund <and...@anarazel.de> wrote:
> > master:                 ~18 GB/s
> > patch, buffered:        ~20 GB/s
> > patch, direct, worker:  ~28 GB/s
> > patch, direct, uring:   ~35 GB/s
> >
> > This was with io_workers=32, io_max_concurrency=128,
> > effective_io_concurrency=1000 (doesn't need to be that high, but it's
> > what I still have the numbers for).
> >
> > This was without data checksums enabled as otherwise the checksum code
> > becomes a *huge* bottleneck.
>
> I'm curious about this because the checksum code should be fast enough
> to easily handle that throughput.
It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y workstation.
But we don't have a good ready-made way of testing that without also doing
IO, so it's kinda hard to say.


> I remember checksum overhead being negligible even when pulling in pages
> from page cache.

It's indeed much less of an issue when pulling pages from the page cache, as
the copy from the page cache is fairly slow. With direct IO, where the copy
from the page cache isn't the main driver of CPU use anymore, it becomes much
clearer.

Even with buffered IO it became a bigger issue in 17, due to
io_combine_limit: it turns out that lots of tiny syscalls are slow, so the
peak throughput that could reach the checksumming code was lower.

I created a 21554MB relation and measured the time to do a pg_prewarm() of
that relation after evicting all of shared buffers (the first time buffers
are touched has somewhat different perf characteristics).

I am using direct IO and io_uring here, as buffered IO would include the page
cache copy cost, and worker mode could parallelize the checksum computation
on reads. The checksum cost is a bigger issue for writes than for reads, but
it's harder to quickly generate enough dirty data for a repeatable benchmark.

This system can do about 12.5GB/s of read IO.
Just to show the effect of the read size on page cache copy performance:

config                                         checksums   time in ms
buffered io_engine=sync io_combine_limit=1     0           6712.153
buffered io_engine=sync io_combine_limit=2     0           5919.215
buffered io_engine=sync io_combine_limit=4     0           5738.496
buffered io_engine=sync io_combine_limit=8     0           5396.415
buffered io_engine=sync io_combine_limit=16    0           5312.803
buffered io_engine=sync io_combine_limit=32    0           5275.389

To see the effect of page cache copy overhead:

config                                         checksums   time in ms
buffered io_engine=io_uring                    0           3901.625
direct   io_engine=io_uring                    0           2075.330

Now to show the effect of checksums (enabled/disabled with pg_checksums):

config                                         checksums   time in ms
buffered io_engine=io_uring                    0           3883.127
buffered io_engine=io_uring                    1           5880.892
direct   io_engine=io_uring                    0           2067.142
direct   io_engine=io_uring                    1           3835.968

So with direct + uring w/o checksums, we can reach 10427 MB/s (close-ish to
disk speed), but with checksums we only reach 5620 MB/s.


> Is it just that the calculation is slow, or is it the fact that checksumming
> needs to bring the page into the CPU cache. Did you notice any hints which
> might be the case?

I don't think the issue is that checksumming pulls the data into CPU caches:

1) It's visible with a SELECT that actually uses the data.

2) I added prefetching to avoid any meaningful amount of cache misses, and it
   doesn't change the overall timing much.

3) It's visible with buffered IO, which has pulled the data into CPU caches
   already.


> I don't really have a machine at hand that can do anywhere close to this
> amount of I/O.

It's visible even when pulling from the page cache, if to a somewhat lesser
degree.

I wonder if it's worth adding a test function that computes checksums of all
shared buffers already in memory. That'd allow exercising the checksum code
in a realistic context (i.e. buffer locking etc. preventing some out-of-order
effects, using 8kB chunks, etc.) without also needing to involve the IO path.
> I'm asking because if it's the calculation that is slow then it seems
> like it's time to compile different ISA extension variants of the
> checksum code and select the best one at runtime.

You think it's ISA specific? I don't see a significant effect of compiling
with -march=native or not - and that should suffice to make the checksum code
built with sufficiently high ISA support, right?

FWIW, CPU profiles show all the time being spent in the "main checksum
calculation" loop:

Percent |	Source code & Disassembly of postgres for cycles:P (5866 samples, percent: local period)
--------------------------------------------------------------------------------------------------------
         :
         :
         :
         :   3     Disassembly of section .text:
         :
         :   5     00000000009e3670 <pg_checksum_page>:
         :   6      * calculation isn't affected by the old checksum stored on the page.
         :   7      * Restore it after, because actually updating the checksum is NOT part of
         :   8      * the API of this function.
         :   9      */
         :  10     save_checksum = cpage->phdr.pd_checksum;
         :  11     cpage->phdr.pd_checksum = 0;
    0.00 :   9e3670:  xor    %eax,%eax
         :  13     CHECKSUM_COMP(sums[j], page->data[i][j]);
    0.00 :   9e3672:  mov    $0x1000193,%r8d
         :  15     cpage->phdr.pd_checksum = 0;
    0.00 :   9e3678:  vmovdqa -0x693fa0(%rip),%ymm3        # 34f6e0 <.LC0>
    0.05 :   9e3680:  vmovdqa -0x6935c8(%rip),%ymm4        # 3500c0 <.LC1>
    0.00 :   9e3688:  vmovdqa -0x693c10(%rip),%ymm0        # 34fa80 <.LC2>
    0.00 :   9e3690:  vmovdqa -0x693598(%rip),%ymm1        # 350100 <.LC3>
         :  20     {
    0.00 :   9e3698:  mov    %esi,%ecx
    0.02 :   9e369a:  lea    0x2000(%rdi),%rdx
         :  23     save_checksum = cpage->phdr.pd_checksum;
    0.00 :   9e36a1:  movzwl 0x8(%rdi),%esi
         :  25     CHECKSUM_COMP(sums[j], page->data[i][j]);
    0.00 :   9e36a5:  vpbroadcastd %r8d,%ymm5
         :  27     cpage->phdr.pd_checksum = 0;
    0.00 :   9e36ab:  mov    %ax,0x8(%rdi)
         :  29     for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
    0.14 :   9e36af:  mov    %rdi,%rax
    0.00 :   9e36b2:  nopw   0x0(%rax,%rax,1)
         :  32     CHECKSUM_COMP(sums[j], page->data[i][j]);
   15.36 :   9e36b8:  vpxord (%rax),%ymm1,%ymm1
    4.79 :   9e36be:  vmovdqu 0x80(%rax),%ymm2
         :  35     for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
    0.07 :   9e36c6:  add    $0x100,%rax
         :  37     CHECKSUM_COMP(sums[j], page->data[i][j]);
    2.45 :   9e36cc:  vpxord -0xe0(%rax),%ymm0,%ymm0
    2.85 :   9e36d3:  vpmulld %ymm5,%ymm1,%ymm6
    0.02 :   9e36d8:  vpsrld $0x11,%ymm1,%ymm1
    3.17 :   9e36dd:  vpternlogd $0x96,%ymm6,%ymm1,%ymm2
    2.01 :   9e36e4:  vpmulld %ymm5,%ymm0,%ymm6
   13.16 :   9e36e9:  vpmulld %ymm5,%ymm2,%ymm1
    0.03 :   9e36ee:  vpsrld $0x11,%ymm2,%ymm2
    0.02 :   9e36f3:  vpsrld $0x11,%ymm0,%ymm0
    2.57 :   9e36f8:  vpxord %ymm2,%ymm1,%ymm1
    0.89 :   9e36fe:  vmovdqu -0x60(%rax),%ymm2
    0.12 :   9e3703:  vpternlogd $0x96,%ymm6,%ymm0,%ymm2
    4.48 :   9e370a:  vpmulld %ymm5,%ymm2,%ymm0
    0.49 :   9e370f:  vpsrld $0x11,%ymm2,%ymm2
    0.99 :   9e3714:  vpxord %ymm2,%ymm0,%ymm0
   11.88 :   9e371a:  vpxord -0xc0(%rax),%ymm4,%ymm2
    2.80 :   9e3721:  vpmulld %ymm5,%ymm2,%ymm6
    0.68 :   9e3726:  vpsrld $0x11,%ymm2,%ymm4
    4.94 :   9e372b:  vmovdqu -0x40(%rax),%ymm2
    1.45 :   9e3730:  vpternlogd $0x96,%ymm6,%ymm4,%ymm2
    8.63 :   9e3737:  vpmulld %ymm5,%ymm2,%ymm4
    0.17 :   9e373c:  vpsrld $0x11,%ymm2,%ymm2
    1.81 :   9e3741:  vpxord %ymm2,%ymm4,%ymm4
    0.10 :   9e3747:  vpxord -0xa0(%rax),%ymm3,%ymm2
    0.70 :   9e374e:  vpmulld %ymm5,%ymm2,%ymm6
    1.65 :   9e3753:  vpsrld $0x11,%ymm2,%ymm3
    0.03 :   9e3758:  vmovdqu -0x20(%rax),%ymm2
    0.85 :   9e375d:  vpternlogd $0x96,%ymm6,%ymm3,%ymm2
    3.73 :   9e3764:  vpmulld %ymm5,%ymm2,%ymm3
    0.07 :   9e3769:  vpsrld $0x11,%ymm2,%ymm2
    1.48 :   9e376e:  vpxord %ymm2,%ymm3,%ymm3
         :  68     for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
    0.02 :   9e3774:  cmp    %rax,%rdx
    2.32 :   9e3777:  jne    9e36b8 <pg_checksum_page+0x48>
         :  71     CHECKSUM_COMP(sums[j], 0);
    0.17 :   9e377d:  vpmulld %ymm5,%ymm0,%ymm7
    0.07 :   9e3782:  vpmulld %ymm5,%ymm1,%ymm6
         :  74     checksum = pg_checksum_block(cpage);
         :  75     cpage->phdr.pd_checksum = save_checksum;
    0.00 :   9e3787:  mov    %si,0x8(%rdi)
         :  77     CHECKSUM_COMP(sums[j], 0);
    0.02 :   9e378b:  vpsrld $0x11,%ymm0,%ymm0
    0.02 :   9e3790:  vpsrld $0x11,%ymm1,%ymm1
    0.02 :   9e3795:  vpsrld $0x11,%ymm4,%ymm2
    0.00 :   9e379a:  vpxord %ymm0,%ymm7,%ymm7
    0.10 :   9e37a0:  vpmulld %ymm5,%ymm4,%ymm0
    0.00 :   9e37a5:  vpxord %ymm1,%ymm6,%ymm6
    0.17 :   9e37ab:  vpmulld %ymm5,%ymm3,%ymm1
    0.19 :   9e37b0:  vpmulld %ymm5,%ymm6,%ymm4
    0.00 :   9e37b5:  vpsrld $0x11,%ymm6,%ymm6
    0.02 :   9e37ba:  vpxord %ymm2,%ymm0,%ymm0
    0.00 :   9e37c0:  vpsrld $0x11,%ymm3,%ymm2
    0.22 :   9e37c5:  vpmulld %ymm5,%ymm7,%ymm3
    0.02 :   9e37ca:  vpsrld $0x11,%ymm7,%ymm7
    0.00 :   9e37cf:  vpxord %ymm2,%ymm1,%ymm1
    0.03 :   9e37d5:  vpsrld $0x11,%ymm0,%ymm2
    0.15 :   9e37da:  vpmulld %ymm5,%ymm0,%ymm0
         :  94     result ^= sums[i];
    0.00 :   9e37df:  vpternlogd $0x96,%ymm3,%ymm7,%ymm2
         :  96     CHECKSUM_COMP(sums[j], 0);
    0.05 :   9e37e6:  vpsrld $0x11,%ymm1,%ymm3
    0.19 :   9e37eb:  vpmulld %ymm5,%ymm1,%ymm1
         :  99     result ^= sums[i];
    0.02 :   9e37f0:  vpternlogd $0x96,%ymm4,%ymm6,%ymm0
    0.10 :   9e37f7:  vpxord %ymm1,%ymm0,%ymm0
    0.07 :   9e37fd:  vpternlogd $0x96,%ymm2,%ymm3,%ymm0
    0.15 :   9e3804:  vextracti32x4 $0x1,%ymm0,%xmm1
    0.03 :   9e380b:  vpxord %xmm0,%xmm1,%xmm0
    0.14 :   9e3811:  vpsrldq $0x8,%xmm0,%xmm1
    0.12 :   9e3816:  vpxord %xmm1,%xmm0,%xmm0
    0.09 :   9e381c:  vpsrldq $0x4,%xmm0,%xmm1
    0.12 :   9e3821:  vpxord %xmm1,%xmm0,%xmm0
    0.05 :   9e3827:  vmovd  %xmm0,%eax
         :
         : 111     /* Mix in the block number to detect transposed pages */
         : 112     checksum ^= blkno;
    0.07 :   9e382b:  xor    %ecx,%eax
         :
         : 115     /*
         : 116      * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
         : 117      * one. That avoids checksums of zero, which seems like a good idea.
         : 118      */
         : 119     return (uint16) ((checksum % 65535) + 1);
    0.00 :   9e382d:  mov    $0x80008001,%ecx
    0.03 :   9e3832:  mov    %eax,%edx
    0.27 :   9e3834:  imul   %rcx,%rdx
    0.09 :   9e3838:  shr    $0x2f,%rdx
    0.07 :   9e383c:  lea    0x1(%rax,%rdx,1),%eax
    0.00 :   9e3840:  vzeroupper
         : 126     }
    0.15 :   9e3843:  ret

I did briefly experiment with changing N_SUMS. 16 is substantially worse, 64
seems to be about the same as 32.

Greetings,

Andres Freund