Hi, On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote: > The new facility makes it easier to optimize bulk loading, as the > logic for buffering, WAL-logging, and syncing the relation only needs > to be implemented once. It's also less error-prone: We have had a > number of bugs in how a relation is fsync'd - or not - at the end of a > bulk loading operation. By centralizing that logic to one place, we > only need to write it correctly once.
One thing I'd like to use the centralized handling for is to track such writes in pg_stat_io. I don't mean as part of the initial patch, just that that's another reason I like the facility. > The new facility is faster for small relations: Instead of of calling > smgrimmedsync(), we register the fsync to happen at next checkpoint, > which avoids the fsync latency. That can make a big difference if you > are e.g. restoring a schema-only dump with lots of relations. I think this would be a huge win for people running their application tests against postgres. > + bulkw = bulkw_start_smgr(dst, forkNum, use_wal); > + > nblocks = smgrnblocks(src, forkNum); > > for (blkno = 0; blkno < nblocks; blkno++) > { > + Page page; > + > /* If we got a cancel signal during the copy of the data, quit > */ > CHECK_FOR_INTERRUPTS(); > > - smgrread(src, forkNum, blkno, buf.data); > + page = bulkw_alloc_buf(bulkw); > + smgrread(src, forkNum, blkno, page); > > if (!PageIsVerifiedExtended(page, blkno, > > PIV_LOG_WARNING | PIV_REPORT_STAT)) > @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst, > * page this is, so we have to log the full page including any > unused > * space. > */ > - if (use_wal) > - log_newpage(&dst->smgr_rlocator.locator, forkNum, > blkno, page, false); > - > - PageSetChecksumInplace(page, blkno); > - > - /* > - * Now write the page. We say skipFsync = true because there's > no > - * need for smgr to schedule an fsync for this write; we'll do > it > - * ourselves below. > - */ > - smgrextend(dst, forkNum, blkno, buf.data, true); > + bulkw_write(bulkw, blkno, page, false); I wonder if bulkw_alloc_buf() is a good name - if you naively read this change, it looks like it'll just leak memory. It also might be taken to be valid until freed, which I don't think is the case? > +/*------------------------------------------------------------------------- > + * > + * bulk_write.c > + * Efficiently and reliably populate a new relation > + * > + * The assumption is that no other backends access the relation while we are > + * loading it, so we can take some shortcuts. Alternatively, you can use the > + * buffer manager as usual, if performance is not critical, but you must not > + * mix operations through the buffer manager and the bulk loading interface > at > + * the same time. >From "Alternatively" onward this is is somewhat confusing. > + * We bypass the buffer manager to avoid the locking overhead, and call > + * smgrextend() directly. A downside is that the pages will need to be > + * re-read into shared buffers on first use after the build finishes. That's > + * usually a good tradeoff for large relations, and for small relations, the > + * overhead isn't very significant compared to creating the relation in the > + * first place. FWIW, I doubt the "isn't very significant" bit is really true. > + * One tricky point is that because we bypass the buffer manager, we need to > + * register the relation for fsyncing at the next checkpoint ourselves, and > + * make sure that the relation is correctly fsync by us or the checkpointer > + * even if a checkpoint happens concurrently. "fsync'ed" or such? Otherwise this reads awkwardly for me. > + * > + * > + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group > + * Portions Copyright (c) 1994, Regents of the University of California > + * > + * > + * IDENTIFICATION > + * src/backend/storage/smgr/bulk_write.c > + * > + *------------------------------------------------------------------------- > + */ > +#include "postgres.h" > + > +#include "access/xloginsert.h" > +#include "access/xlogrecord.h" > +#include "storage/bufmgr.h" > +#include "storage/bufpage.h" > +#include "storage/bulk_write.h" > +#include "storage/proc.h" > +#include "storage/smgr.h" > +#include "utils/rel.h" > + > +#define MAX_BUFFERED_PAGES XLR_MAX_BLOCK_ID > + > +typedef struct BulkWriteBuffer > +{ > + Page page; > + BlockNumber blkno; > + bool page_std; > + int16 order; > +} BulkWriteBuffer; > + The name makes it sound like this struct itself contains a buffer - but it's just pointing to one. *BufferRef or such maybe? I was wondering how you dealt with the alignment of buffers given the struct definition, which is what lead me to look at this... > +/* > + * Bulk writer state for one relation fork. > + */ > +typedef struct BulkWriteState > +{ > + /* Information about the target relation we're writing */ > + SMgrRelation smgr; Isn't there a danger of this becoming a dangling pointer? At least until https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com is merged? > + ForkNumber forknum; > + bool use_wal; > + > + /* We keep several pages buffered, and WAL-log them in batches */ > + int nbuffered; > + BulkWriteBuffer buffers[MAX_BUFFERED_PAGES]; > + > + /* Current size of the relation */ > + BlockNumber pages_written; > + > + /* The RedoRecPtr at the time that the bulk operation started */ > + XLogRecPtr start_RedoRecPtr; > + > + Page zeropage; /* workspace for filling zeroes > */ We really should just have one such page in shared memory somewhere... For WAL writes as well. But until then - why do you allocate the page? Seems like we could just use a static global variable? > +/* > + * Write all buffered pages to disk. > + */ > +static void > +bulkw_flush(BulkWriteState *bulkw) > +{ > + int nbuffered = bulkw->nbuffered; > + BulkWriteBuffer *buffers = bulkw->buffers; > + > + if (nbuffered == 0) > + return; > + > + if (nbuffered > 1) > + { > + int o; > + > + qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp); > + > + /* > + * Eliminate duplicates, keeping the last write of each block. > + * (buffer_cmp uses 'order' as the last sort key) > + */ Huh - which use cases would actually cause duplicate writes? Greetings, Andres Freund