Hi,

On 2022-01-11 12:10:54 -0500, Melanie Plageman wrote:
> On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
> <melanieplage...@gmail.com> wrote:
> >
> > I have attached a v3 which includes two commits -- one of which
> > implements the directmgr API and uses it and the other which adds
> > functionality to use either directmgr or bufmgr API during index build.
> >
> > Also registering for march commitfest.
> 
> Forgot directmgr.h. Attached v4

Are you looking at the failures Justin pointed out? Something isn't quite
right yet. See https://postgr.es/m/20220116202559.GW14051%40telsasoft.com and
the subsequent mail about it also triggering once on linux.


> Thus, the backend must ensure that
> either the Redo pointer has not moved or that the data is fsync'd before
> freeing the page.

"freeing"?


> This is not a problem with pages written in shared buffers because the
> checkpointer will block until all buffers that were dirtied before it
> began finish before it moves the Redo pointer past their associated WAL
> entries.

> This commit makes two main changes:
> 
> 1) It wraps smgrextend() and smgrwrite() in functions from a new API
>    for writing data outside of shared buffers, directmgr.
> 
> 2) It saves the XLOG Redo pointer location before doing the write or
>    extend. It also adds an fsync request for the page to the
>    checkpointer's pending-ops table. Then, after doing the write or
>    extend, if the Redo pointer has moved (meaning a checkpoint has
>    started since it saved it last), then the backend fsync's the page
>    itself. Otherwise, it lets the checkpointer take care of fsync'ing
>    the page the next time it processes the pending-ops table.

Why combine those two into one commit?
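
FWIW, as I understand it, 2) boils down to something like the following
sketch (the unbuffered_* steps are the patch's, the bodies are my paraphrase;
GetRedoRecPtr() stands in for however the patch snapshots the redo pointer):

	/* unbuffered_prep(): remember the current redo pointer */
	wstate->redo = GetRedoRecPtr();

	/*
	 * Write / extend with skipFsync = false, so an fsync request for
	 * the page ends up in the checkpointer's pending-ops table.
	 */
	smgrextend(smgr, forknum, blocknum, page, false);

	/*
	 * unbuffered_finish(): if a checkpoint started in the meantime, it
	 * may already have processed the pending-ops table, so sync here.
	 */
	if (wstate->redo != GetRedoRecPtr())
		smgrimmedsync(smgr, forknum);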


> @@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
>       /* Now extend the file */
>       while (vm_nblocks_now < vm_nblocks)
>       {
> -             PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
> -
> -             smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
> +             // TODO: aren't these pages empty? why checksum them
> +             unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);

Yea, it's a bit odd. PageSetChecksumInplace() will just return immediately:

        /* If we don't need a checksum, just return */
        if (PageIsNew(page) || !DataChecksumsEnabled())
                return;

OTOH, it seems easier to have it there than to later forget it, when
e.g. adding some actual initial content to the pages during the smgrextend().



> @@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
>  
>       wstate.heap = btspool->heap;
>       wstate.index = btspool->index;
> +     wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
> +     wstate.ub_wstate.redo = InvalidXLogRecPtr;
>       wstate.inskey = _bt_mkscankey(wstate.index, NULL);
>       /* _bt_mkscankey() won't set allequalimage without metapage */
>       wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
> @@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>               if (!wstate->btws_zeropage)
>                       wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
>               /* don't set checksum for all-zero page */
> -             smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> -                                wstate->btws_pages_written++,
> -                                (char *) wstate->btws_zeropage,
> -                                true);
> +             unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage, true);
>       }

There's a bunch of long lines in here...
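
Presumably pgindent would want something along these lines:

	unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM,
			  wstate->btws_pages_written++,
			  wstate->btws_zeropage, true);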


> -     /*
> -      * When we WAL-logged index pages, we must nonetheless fsync index files.
> -      * Since we're building outside shared buffers, a CHECKPOINT occurring
> -      * during the build has no way to flush the previously written data to
> -      * disk (indeed it won't know the index even exists).  A crash later on
> -      * would replay WAL from the checkpoint, therefore it wouldn't replay our
> -      * earlier WAL entries. If we do not fsync those pages here, they might
> -      * still not be on disk when the crash occurs.
> -      */
>       if (wstate->btws_use_wal)
> -             smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +             unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
>  }

An API where unbuffered_finish() only sometimes gets called, but
unbuffered_prep() is unconditional, strikes me as prone to bugs. Perhaps it'd
make sense to instead pass in whether the relation needs to be synced or not?
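
I.e. something like this shape (just a sketch, not the patch's actual
signatures):

	/* caller states up front whether a final sync will be needed */
	unbuffered_prep(&wstate->ub_wstate, /* do_sync */ wstate->btws_use_wal);

	...

	/* always called; only syncs if do_sync was set at prep time */
	unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);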



>  spgbuildempty(Relation index)
>  {
>       Page            page;
> +     UnBufferedWriteState wstate;
> +
> +     wstate.smgr_rel = RelationGetSmgr(index);
> +     unbuffered_prep(&wstate);

I don't think that's actually safe, and one of the instances could be the
cause of the bug CI is seeing:

 * Note: since a relcache flush can cause the file handle to be closed again,
 * it's unwise to hold onto the pointer returned by this function for any
 * long period.  Recommended practice is to just re-execute RelationGetSmgr
 * each time you need to access the SMgrRelation.  It's quite cheap in
 * comparison to whatever an smgr function is going to do.
 */
static inline SMgrRelation
RelationGetSmgr(Relation rel)


Greetings,

Andres Freund

