Re: Relation bulk write facility

Andres Freund Sat, 18 Nov 2023 16:04:45 -0800

Hi,

On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
> The new facility makes it easier to optimize bulk loading, as the
> logic for buffering, WAL-logging, and syncing the relation only needs
> to be implemented once. It's also less error-prone: We have had a
> number of bugs in how a relation is fsync'd - or not - at the end of a
> bulk loading operation. By centralizing that logic to one place, we
> only need to write it correctly once.


One thing I'd like to use the centralized handling for is to track such
writes in pg_stat_io. I don't mean as part of the initial patch, just that
that's another reason I like the facility.


> The new facility is faster for small relations: Instead of of calling
> smgrimmedsync(), we register the fsync to happen at next checkpoint,
> which avoids the fsync latency. That can make a big difference if you
> are e.g. restoring a schema-only dump with lots of relations.

I think this would be a huge win for people running their application tests
against postgres.


> +     bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
> +
>       nblocks = smgrnblocks(src, forkNum);
>  
>       for (blkno = 0; blkno < nblocks; blkno++)
>       {
> +             Page            page;
> +
>               /* If we got a cancel signal during the copy of the data, quit 
> */
>               CHECK_FOR_INTERRUPTS();
>  
> -             smgrread(src, forkNum, blkno, buf.data);
> +             page = bulkw_alloc_buf(bulkw);
> +             smgrread(src, forkNum, blkno, page);
>  
>               if (!PageIsVerifiedExtended(page, blkno,
>                                                                       
> PIV_LOG_WARNING | PIV_REPORT_STAT))
> @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
>                * page this is, so we have to log the full page including any 
> unused
>                * space.
>                */
> -             if (use_wal)
> -                     log_newpage(&dst->smgr_rlocator.locator, forkNum, 
> blkno, page, false);
> -
> -             PageSetChecksumInplace(page, blkno);
> -
> -             /*
> -              * Now write the page.  We say skipFsync = true because there's 
> no
> -              * need for smgr to schedule an fsync for this write; we'll do 
> it
> -              * ourselves below.
> -              */
> -             smgrextend(dst, forkNum, blkno, buf.data, true);
> +             bulkw_write(bulkw, blkno, page, false);

I wonder if bulkw_alloc_buf() is a good name - if you naively read this
change, it looks like it'll just leak memory. It also might be taken to be
valid until freed, which I don't think is the case?


> +/*-------------------------------------------------------------------------
> + *
> + * bulk_write.c
> + *     Efficiently and reliably populate a new relation
> + *
> + * The assumption is that no other backends access the relation while we are
> + * loading it, so we can take some shortcuts.  Alternatively, you can use the
> + * buffer manager as usual, if performance is not critical, but you must not
> + * mix operations through the buffer manager and the bulk loading interface 
> at
> + * the same time.

>From "Alternatively" onward this is is somewhat confusing.


> + * We bypass the buffer manager to avoid the locking overhead, and call
> + * smgrextend() directly.  A downside is that the pages will need to be
> + * re-read into shared buffers on first use after the build finishes.  That's
> + * usually a good tradeoff for large relations, and for small relations, the
> + * overhead isn't very significant compared to creating the relation in the
> + * first place.

FWIW, I doubt the "isn't very significant" bit is really true.


> + * One tricky point is that because we bypass the buffer manager, we need to
> + * register the relation for fsyncing at the next checkpoint ourselves, and
> + * make sure that the relation is correctly fsync by us or the checkpointer
> + * even if a checkpoint happens concurrently.

"fsync'ed" or such? Otherwise this reads awkwardly for me.



> + *
> + *
> + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + *     src/backend/storage/smgr/bulk_write.c
> + *
> + *-------------------------------------------------------------------------
> + */
> +#include "postgres.h"
> +
> +#include "access/xloginsert.h"
> +#include "access/xlogrecord.h"
> +#include "storage/bufmgr.h"
> +#include "storage/bufpage.h"
> +#include "storage/bulk_write.h"
> +#include "storage/proc.h"
> +#include "storage/smgr.h"
> +#include "utils/rel.h"
> +
> +#define MAX_BUFFERED_PAGES XLR_MAX_BLOCK_ID
> +
> +typedef struct BulkWriteBuffer
> +{
> +     Page            page;
> +     BlockNumber blkno;
> +     bool            page_std;
> +     int16           order;
> +} BulkWriteBuffer;
> +

The name makes it sound like this struct itself contains a buffer - but it's
just pointing to one. *BufferRef or such maybe?

I was wondering how you dealt with the alignment of buffers given the struct
definition, which is what lead me to look at this...


> +/*
> + * Bulk writer state for one relation fork.
> + */
> +typedef struct BulkWriteState
> +{
> +     /* Information about the target relation we're writing */
> +     SMgrRelation smgr;

Isn't there a danger of this becoming a dangling pointer? At least until
https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
is merged?


> +     ForkNumber      forknum;
> +     bool            use_wal;
> +
> +     /* We keep several pages buffered, and WAL-log them in batches */
> +     int                     nbuffered;
> +     BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
> +
> +     /* Current size of the relation */
> +     BlockNumber pages_written;
> +
> +     /* The RedoRecPtr at the time that the bulk operation started */
> +     XLogRecPtr      start_RedoRecPtr;
> +
> +     Page            zeropage;               /* workspace for filling zeroes 
> */

We really should just have one such page in shared memory somewhere... For WAL
writes as well.

But until then - why do you allocate the page? Seems like we could just use a
static global variable?


> +/*
> + * Write all buffered pages to disk.
> + */
> +static void
> +bulkw_flush(BulkWriteState *bulkw)
> +{
> +     int                     nbuffered = bulkw->nbuffered;
> +     BulkWriteBuffer *buffers = bulkw->buffers;
> +
> +     if (nbuffered == 0)
> +             return;
> +
> +     if (nbuffered > 1)
> +     {
> +             int                     o;
> +
> +             qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
> +
> +             /*
> +              * Eliminate duplicates, keeping the last write of each block.
> +              * (buffer_cmp uses 'order' as the last sort key)
> +              */

Huh - which use cases would actually cause duplicate writes?


Greetings,

Andres Freund

Re: Relation bulk write facility

Reply via email to