Re: AIO v2.5

Noah Misch Sun, 23 Mar 2025 17:36:25 -0700

commit 247ce06b wrote:
> +                     pgaio_io_reopen(ioh);
> +
> +                     /*
> +                      * To be able to exercise the reopen-fails path, allow 
> injection
> +                      * points to trigger a failure at this point.
> +                      */
> +                     pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");
> +
> +                     error_errno = 0;
> +                     error_ioh = NULL;
> +
> +                     /*
> +                      * We don't expect this to ever fail with ERROR or 
> FATAL, no need
> +                      * to keep error_ioh set to the IO.
> +                      * pgaio_io_perform_synchronously() contains a critical 
> section to
> +                      * ensure we don't accidentally fail.
> +                      */
> +                     pgaio_io_perform_synchronously(ioh);

A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee
smgr_aio_reopen() stores.  Hence, I think smgrfd() should assert that
interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a
HOLD_INTERRUPTS() should surround the above region of code.  It's likely hard
to reproduce a problem, because pgaio_io_call_inj() does nothing in many
builds, and pgaio_io_perform_synchronously() starts by entering a critical
section.

On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> Attached v2.11

> Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd

> +int
> +FileStartReadV(PgAioHandle *ioh, File file,
> +                        int iovcnt, off_t offset,
> +                        uint32 wait_event_info)
> +{
> +     int                     returnCode;
> +     Vfd                *vfdP;
> +
> +     Assert(FileIsValid(file));
> +
> +     DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
> +                        file, VfdCache[file].fileName,
> +                        (int64) offset,
> +                        iovcnt));
> +
> +     returnCode = FileAccess(file);
> +     if (returnCode < 0)
> +             return returnCode;
> +
> +     vfdP = &VfdCache[file];
> +
> +     pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);

FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED
w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode.  I didn't expect that from
functions so named.  The "start" verb sounds to me like unconditional
PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED.  I like
the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at
PGAIO_HS_STAGED succinctly covers what to expect.  Hence, I recommend names
FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread.  How do
you see it?

> +/*
> + * AIO error reporting callback for mdstartreadv().
> + *
> + * Errors are encoded as follows:
> + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0

I recommend replacing "errno != 0" with either "that errno" or "errno ==
error_data".

> Subject: [PATCH v2.11 07/27] aio: Add README.md explaining higher level design

Ready for commit apart from some trivia:

> +if (ioret.result.status == PGAIO_RS_ERROR)
> +    pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR);

I think ioret and aio_ret are supposed to be the same object.  If that's
right, change one of the names.  Likewise elsewhere in this file.

> +The central API piece for postgres' AIO abstraction are AIO handles. To
> +execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) 
> and
> +then "defined", i.e. associate an IO operation with the handle.

s/"defined"/"define" it/ or similar

> +The "solution" to this the ability to associate multiple completion callbacks

s/this the/this is the/

> Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well

> @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer)
>               Assert(refcount > 0);
>               if (refcount != 1)
>                       return false;
> +
> +             /*
> +              * Check that the AIO subsystem doesn't have a pin. Likely not
> +              * possible today, but better safe than sorry.
> +              */
> +             bufHdr = GetLocalBufferDescriptor(-buffer - 1);
> +             buf_state = pg_atomic_read_u32(&bufHdr->state);
> +             refcount = BUF_STATE_GET_REFCOUNT(buf_state);
> +             Assert(refcount > 0);
> +             if (refcount != 1)
> +                     return false;
> +

LockBufferForCleanup() should get code like this
ConditionalLockBufferForCleanup() code, either now or when "not possible
today" ends.  Currently, it just assumes all local buffers are
cleanup-lockable:

        /* Nobody else to wait for */
        if (BufferIsLocal(buffer))
                return;

> @@ -570,7 +577,13 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool 
> check_unreferenced)
>  
>       buf_state = pg_atomic_read_u32(&bufHdr->state);
>  
> -     if (check_unreferenced && LocalRefCount[bufid] != 0)
> +     /*
> +      * We need to test not just LocalRefCount[bufid] but also the BufferDesc
> +      * itself, as the latter is used to represent a pin by the AIO 
> subsystem.
> +      * This can happen if AIO is initiated and then the query errors out.
> +      */
> +     if (check_unreferenced &&
> +             (LocalRefCount[bufid] != 0 || BUF_STATE_GET_REFCOUNT(buf_state) 
> != 0))
>               elog(ERROR, "block %u of %s is still referenced (local %u)",

I didn't write a test to prove it, but I'm suspecting we'll reach the above
ERROR with this sequence:

  CREATE TEMP TABLE foo ...;
  [some command that starts reading a block of foo into local buffers, then 
ERROR with IO ongoing]
  DROP TABLE foo;

DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true).  I
think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
the particular rel) before InvalidateLocalBuffer().  Or use something like the
logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
corresponding bufmgr code.  I think that bufmgr ERROR is unreachable, since
only a private refcnt triggers that bufmgr ERROR.  Is there something
preventing the localbuf error from being a problem?  (This wouldn't require
changes to the current patch; responsibility would fall in a bufmgr AIO
patch.)

> Subject: [PATCH v2.11 09/27] bufmgr: Implement AIO read support

(Still reviewing this and later patches, but incidental observations follow.)

> +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> +                                               bool failed, bool is_temp)
> +{
...
> +     PgAioResult result;
...
> +     result.status = PGAIO_RS_OK;
...
> +     return result;

gcc 14.2.0 -Werror gives me:

  bufmgr.c:7297:16: error: ‘result’ may be used uninitialized 
[-Werror=maybe-uninitialized]

Zeroing the unset fields silenced it:

--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -7221,3 +7221,3 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, 
uint8 flags,
        char       *bufdata = BufferGetBlock(buffer);
-       PgAioResult result;
+       PgAioResult result = { .status = PGAIO_RS_OK };
        uint32          set_flag_bits;
@@ -7238,4 +7238,2 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, 
uint8 flags,

-       result.status = PGAIO_RS_OK;
-
        /* check for garbage data */

> Subject: [PATCH v2.11 13/27] aio: Basic read_stream adjustments for real AIO

> @@ -416,6 +418,13 @@ read_stream_start_pending_read(ReadStream *stream)
>  static void
>  read_stream_look_ahead(ReadStream *stream)
>  {
> +     /*
> +      * Allow amortizing the cost of submitting IO over multiple IOs. This
> +      * requires that we don't do any operations that could lead to a 
> deadlock
> +      * with staged-but-unsubmitted IO.
> +      */
> +     pgaio_enter_batchmode();

We call read_stream_get_block() while in batchmode, so the stream callback
needs to be ready for that.  A complicated case is
collect_corrupt_items_read_stream_next_block(), which may do its own buffer
I/O to read in a vmbuffer for VM_ALL_FROZEN().  That's feeling to me like a
recipe for corner cases reaching ERROR "starting batch while batch already in
progress".  Are there mitigating factors?

> Subject: [PATCH v2.11 17/27] aio: Add test_aio module

> +     # verify that page verification errors are detected even as part of a
> +     # shortened multi-block read (tbl_corr, block 1 is tbl_corred)

Is "tbl_corred" a typo of something?

> --- /dev/null
> +++ b/src/test/modules/test_aio/test_aio.c
> @@ -0,0 +1,657 @@
> +/*-------------------------------------------------------------------------
> + *
> + * delay_execution.c
> + *           Test module to allow delay between parsing and execution of a 
> query.
> + *
> + * The delay is implemented by taking and immediately releasing a specified
> + * advisory lock.  If another process has previously taken that lock, the
> + * current process will be blocked until the lock is released; otherwise,
> + * there's no effect.  This allows an isolationtester script to reliably
> + * test behaviors where some specified action happens in another backend
> + * between parsing and execution of any desired query.
> + *
> + * Copyright (c) 2020-2025, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *     src/test/modules/delay_execution/delay_execution.c

Header comment is surviving from copy-paste of delay_execution.c.

> +      * Tor tests we don't want the resowner release preventing us from

s/Tor/For/

Re: AIO v2.5

Reply via email to