Hi, On Fri, 23 Aug 2024 at 22:01, Noah Misch <n...@leadboat.com> wrote: > > On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote: > > On Tue, 20 Aug 2024 at 21:47, Noah Misch <n...@leadboat.com> wrote: > > > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote: > > > > The downside of this approach is there are too many "vmbuffer is valid > > > > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so > > > > vmbuffer needs to be read again" cases in the read stream version (700 > > > > vs 20 for the 6 GB table). This is caused by the callback function of > > > > the read stream reading a new vmbuf while getting next block numbers. > > > > So, vmbuf is wrong when we are checking visibility map bits that might > > > > have changed while we were acquiring the page lock. > > > > > > Did the test that found 700 "read again" use a different procedure than > > > the > > > one shared as setup.sh down-thread? Using that script alone, none of the > > > vm > > > bits are set, so I'd not expect any heap page reads. > > > > Yes, it is basically the same script but the query is 'SELECT > > pg_check_visible('${TABLE}'::regclass);'. > > I gather the scripts deal exclusively in tables with no vm bits set, so > pg_visibility did no heap page reads under those scripts. But the '700 "read > again"' result suggests either I'm guessing wrong, or that result came from a > different test procedure. Thoughts?
Sorry, yes. You need to run VACUUM on the test table before running the query. The script is attached. You can run the script with [corrupt | visibility] arguments to test the related patches. > > > The "vmbuffer needs to be read again" regression fits what I would expect > > > the > > > v1 patch to do with a table having many vm bits set. In general, I think > > > the > > > fix is to keep two heap Buffer vars, so the callback can work on one > > > vmbuffer > > > while collect_corrupt_items() works on another vmbuffer. Much of the > > > time, > > > they'll be the same buffer. It could be as simple as that, but you could > > > consider further optimizations like these: > > > > Thanks for the suggestion. Keeping two vmbuffers lowered the count of > > "read-again" cases to ~40. I ran the test again and the timing > > improvement is ~5% now. > > > I think the number of "read-again" cases are low enough now. > > Agreed. The increase from 20 to 40 is probably an increase in buffer mapping > lookups, not an increase in I/O. > > > Does creating something like > > pg_general_read_stream_next_block() in read_stream code and exporting > > it makes sense? > > To me, that name isn't conveying the function's behavior, and the words "pg_" > and "general_" aren't adding information there. How about one of these names > or similar: > > seq_read_stream_cb > sequential_read_stream_cb > block_range_read_stream_cb I liked the block_range_read_stream_cb. Attached patches for that (first 3 patches). I chose an nblocks way instead of last_blocks in the struct. > > > The callback doesn't return blocks having zero vm bits, so the blkno > > > variable > > > is not accurate. I didn't test, but I think the loop's "Recheck to avoid > > > returning spurious results." looks at the bit for the wrong block. If > > > that's > > > what v1 does, could you expand the test file to catch that? For example, > > > make > > > a two-block table with only the second block all-visible. > > > > Yes, it was not accurate. I am getting blockno from the buffer now. I > > checked and confirmed it is working as expected by manually logging > > blocknos returned from the read stream. I am not sure how to add a > > test case for this. > > VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular > TID, even if rolled back, clears both vm bits for the TID's page. Past tests > like that had instability problems. One cause is a concurrent session's XID > or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may > have been one of the countermeasures, but I don't remember clearly. Hence, > please search the archives or the existing pg_visibility tests for how we > dealt with that. It may not be problem for this particular test. Thanks for the information, I will check these. What I still do not understand is how to make sure that only the second block is processed and the first one is skipped. pg_check_visible() and pg_check_frozen() returns TIDs that cause corruption in the visibility map, there is no information about block numbers. > > There is one behavioral change introduced with the patches. It could > > happen when collect_corrupt_items() is called with both all_visible > > and all_frozen being true. > > -> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns > > false in callback. Callback returns this block number because > > VM_ALL_FROZEN() is true. > > -> At the /* Recheck to avoid returning spurious results. */ part, we > > should only check VM_ALL_FROZEN() again as this was returning true in > > the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE(). > > VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now > > (vice versa of callback). > > -> We were skipping this block in the master but the patched version > > does not skip that. > > > > Is this a problem? If this is a problem, a solution might be to > > callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the > > per_buffer_data. > > No, I don't consider that a problem. That's not making us do additional I/O, > just testing more bits within the pages we're already loading. The difference > in behavior is user-visible, but it's the same behavior change the user would > get if the timing had been a bit different. Thanks for the clarification. -- Regards, Nazir Bilal Yavuz Microsoft
From 55e8e78d05b135babf8ff8c84fb9747ad19c58c1 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz <byavuz81@gmail.com> Date: Mon, 26 Aug 2024 12:12:52 +0300 Subject: [PATCH v3 1/5] Add general-use struct and callback to read stream Number of callbacks that just iterates over block range in read stream are increasing. So, add general-use struct and callback to read stream. --- src/include/storage/read_stream.h | 12 ++++++++++++ src/backend/storage/aio/read_stream.c | 17 +++++++++++++++++ src/tools/pgindent/typedefs.list | 1 + 3 files changed, 30 insertions(+) diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h index 4e599904f26..7613c1e361c 100644 --- a/src/include/storage/read_stream.h +++ b/src/include/storage/read_stream.h @@ -45,11 +45,23 @@ struct ReadStream; typedef struct ReadStream ReadStream; +/* + * General-use struct to use in callback functions for block range scans. + */ +typedef struct BlockRangeReadStreamPrivate +{ + BlockNumber blocknum; + BlockNumber nblocks; +} BlockRangeReadStreamPrivate; + /* Callback that returns the next block number to read. */ typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream, void *callback_private_data, void *per_buffer_data); +extern BlockNumber block_range_read_stream_cb(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data); extern ReadStream *read_stream_begin_relation(int flags, BufferAccessStrategy strategy, Relation rel, diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index a83c18c2a4b..937d001ef08 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -166,6 +166,23 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index) stream->per_buffer_data_size * buffer_index; } +/* + * General-use callback function for block range scans. Callback loops between + * blocknum (inclusive) and nblocks (exclusive). + */ +BlockNumber +block_range_read_stream_cb(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data) +{ + BlockRangeReadStreamPrivate *p = callback_private_data; + + if (p->blocknum < p->nblocks) + return p->blocknum++; + + return InvalidBlockNumber; +} + /* * Ask the callback which block it would like us to read next, with a small * buffer in front to allow read_stream_unget_block() to work and to allow the diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list index 9e951a9e6f3..df3f336bec0 100644 --- a/src/tools/pgindent/typedefs.list +++ b/src/tools/pgindent/typedefs.list @@ -275,6 +275,7 @@ BlockId BlockIdData BlockInfoRecord BlockNumber +BlockRangeReadStreamPrivate BlockRefTable BlockRefTableBuffer BlockRefTableChunk -- 2.45.2
From 01de5ff954f7db6a749ebfc0b1281ed010b5e8f9 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz <byavuz81@gmail.com> Date: Mon, 26 Aug 2024 12:16:06 +0300 Subject: [PATCH v3 2/5] pg_prewarm: Use generic-use read stream struct and callback Instead of creating another read stream struct and callback, use generic one that is exported from read stream. --- contrib/pg_prewarm/pg_prewarm.c | 25 +++---------------------- 1 file changed, 3 insertions(+), 22 deletions(-) diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c index 5c859e983c5..889639a09a3 100644 --- a/contrib/pg_prewarm/pg_prewarm.c +++ b/contrib/pg_prewarm/pg_prewarm.c @@ -39,25 +39,6 @@ typedef enum static PGIOAlignedBlock blockbuffer; -struct pg_prewarm_read_stream_private -{ - BlockNumber blocknum; - int64 last_block; -}; - -static BlockNumber -pg_prewarm_read_stream_next_block(ReadStream *stream, - void *callback_private_data, - void *per_buffer_data) -{ - struct pg_prewarm_read_stream_private *p = callback_private_data; - - if (p->blocknum <= p->last_block) - return p->blocknum++; - - return InvalidBlockNumber; -} - /* * pg_prewarm(regclass, mode text, fork text, * first_block int8, last_block int8) @@ -203,7 +184,7 @@ pg_prewarm(PG_FUNCTION_ARGS) } else if (ptype == PREWARM_BUFFER) { - struct pg_prewarm_read_stream_private p; + BlockRangeReadStreamPrivate p; ReadStream *stream; /* @@ -212,13 +193,13 @@ pg_prewarm(PG_FUNCTION_ARGS) /* Set up the private state for our streaming buffer read callback. */ p.blocknum = first_block; - p.last_block = last_block; + p.nblocks = last_block + 1; stream = read_stream_begin_relation(READ_STREAM_FULL, NULL, rel, forkNumber, - pg_prewarm_read_stream_next_block, + block_range_read_stream_cb, &p, 0); -- 2.45.2
From afc8c50c39212e016f71ee287062c287fffbacc2 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz <byavuz81@gmail.com> Date: Mon, 26 Aug 2024 12:24:26 +0300 Subject: [PATCH v3 3/5] RelationCopyStorageUsingBuffer: Use generic-use read stream struct and callback Instead of creating another read stream struct and callback, use generic one that is exported from read stream. --- src/backend/storage/buffer/bufmgr.c | 31 ++--------------------------- 1 file changed, 2 insertions(+), 29 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index fb38c7bdd45..c04e4733921 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -136,33 +136,6 @@ typedef struct SMgrSortArray SMgrRelation srel; } SMgrSortArray; -/* - * Helper struct for read stream object used in - * RelationCopyStorageUsingBuffer() function. - */ -struct copy_storage_using_buffer_read_stream_private -{ - BlockNumber blocknum; - BlockNumber nblocks; -}; - -/* - * Callback function to get next block for read stream object used in - * RelationCopyStorageUsingBuffer() function. - */ -static BlockNumber -copy_storage_using_buffer_read_stream_next_block(ReadStream *stream, - void *callback_private_data, - void *per_buffer_data) -{ - struct copy_storage_using_buffer_read_stream_private *p = callback_private_data; - - if (p->blocknum < p->nblocks) - return p->blocknum++; - - return InvalidBlockNumber; -} - /* GUC variables */ bool zero_damaged_pages = false; int bgwriter_lru_maxpages = 100; @@ -4710,7 +4683,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator, PGIOAlignedBlock buf; BufferAccessStrategy bstrategy_src; BufferAccessStrategy bstrategy_dst; - struct copy_storage_using_buffer_read_stream_private p; + BlockRangeReadStreamPrivate p; ReadStream *src_stream; SMgrRelation src_smgr; @@ -4750,7 +4723,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator, src_smgr, permanent ? RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED, forkNum, - copy_storage_using_buffer_read_stream_next_block, + block_range_read_stream_cb, &p, 0); -- 2.45.2
From fbbca6e1c27e45afae9137228811d9fa11a7a197 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz <byavuz81@gmail.com> Date: Mon, 26 Aug 2024 15:45:20 +0300 Subject: [PATCH v3 4/5] Use read stream in pg_visibility in collect_visibility_data() Instead of reading blocks with ReadBufferExtended() in collect_visibility_data(), use read stream. This change provides about 4% performance improvement. --- contrib/pg_visibility/pg_visibility.c | 26 ++++++++++++++++++++++++-- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c index 773ba92e454..6c64f5ead7e 100644 --- a/contrib/pg_visibility/pg_visibility.c +++ b/contrib/pg_visibility/pg_visibility.c @@ -21,6 +21,7 @@ #include "storage/bufmgr.h" #include "storage/proc.h" #include "storage/procarray.h" +#include "storage/read_stream.h" #include "storage/smgr.h" #include "utils/rel.h" #include "utils/snapmgr.h" @@ -478,6 +479,8 @@ collect_visibility_data(Oid relid, bool include_pd) BlockNumber blkno; Buffer vmbuffer = InvalidBuffer; BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD); + BlockRangeReadStreamPrivate p; + ReadStream *stream = NULL; rel = relation_open(relid, AccessShareLock); @@ -489,6 +492,20 @@ collect_visibility_data(Oid relid, bool include_pd) info->next = 0; info->count = nblocks; + /* Crate a read stream if read will be done. */ + if (include_pd) + { + p.blocknum = 0; + p.nblocks = nblocks; + stream = read_stream_begin_relation(READ_STREAM_FULL, + bstrategy, + rel, + MAIN_FORKNUM, + block_range_read_stream_cb, + &p, + 0); + } + for (blkno = 0; blkno < nblocks; ++blkno) { int32 mapbits; @@ -513,8 +530,7 @@ collect_visibility_data(Oid relid, bool include_pd) Buffer buffer; Page page; - buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, - bstrategy); + buffer = read_stream_next_buffer(stream, NULL); LockBuffer(buffer, BUFFER_LOCK_SHARE); page = BufferGetPage(buffer); @@ -525,6 +541,12 @@ collect_visibility_data(Oid relid, bool include_pd) } } + if (include_pd && stream) + { + Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer); + read_stream_end(stream); + } + /* Clean up. */ if (vmbuffer != InvalidBuffer) ReleaseBuffer(vmbuffer); -- 2.45.2
From aae5ded7c3b3f7600f01674a6200645d10d07116 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz <byavuz81@gmail.com> Date: Fri, 23 Aug 2024 10:15:09 +0300 Subject: [PATCH v3 5/5] Use read stream in pg_visibility in collect_corrupt_items() Instead of reading blocks with ReadBufferExtended() in collect_corrupt_items(), use read stream. This change provides about 5% performance improvement. --- contrib/pg_visibility/pg_visibility.c | 87 ++++++++++++++++++++++----- 1 file changed, 72 insertions(+), 15 deletions(-) diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c index 6c64f5ead7e..75a5b362fe2 100644 --- a/contrib/pg_visibility/pg_visibility.c +++ b/contrib/pg_visibility/pg_visibility.c @@ -42,6 +42,20 @@ typedef struct corrupt_items ItemPointer tids; } corrupt_items; +/* + * Helper struct for read stream object used in + * collect_corrupt_items() function. + */ +struct collect_corrupt_items_read_stream_private +{ + bool all_frozen; + bool all_visible; + BlockNumber blocknum; + BlockNumber nblocks; + Relation rel; + Buffer *vmbuffer; +}; + PG_FUNCTION_INFO_V1(pg_visibility_map); PG_FUNCTION_INFO_V1(pg_visibility_map_rel); PG_FUNCTION_INFO_V1(pg_visibility); @@ -632,6 +646,38 @@ GetStrictOldestNonRemovableTransactionId(Relation rel) } } +/* + * Callback function to get next block for read stream object used in + * collect_corrupt_items() function. + */ +static BlockNumber +collect_corrupt_items_read_stream_next_block(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data) +{ + struct collect_corrupt_items_read_stream_private *p = callback_private_data; + + for (; p->blocknum < p->nblocks; p->blocknum++) + { + bool check_frozen = false; + bool check_visible = false; + + /* Make sure we are interruptible. */ + CHECK_FOR_INTERRUPTS(); + + if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer)) + check_frozen = true; + if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer)) + check_visible = true; + if (!check_visible && !check_frozen) + continue; + + return p->blocknum++; + } + + return InvalidBlockNumber; +} + /* * Returns a list of items whose visibility map information does not match * the status of the tuples on the page. @@ -654,8 +700,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen) corrupt_items *items; BlockNumber blkno; Buffer vmbuffer = InvalidBuffer; + Buffer stream_vmbuffer = InvalidBuffer; BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD); TransactionId OldestXmin = InvalidTransactionId; + struct collect_corrupt_items_read_stream_private p; + ReadStream *stream; + Buffer buffer; rel = relation_open(relid, AccessShareLock); @@ -680,12 +730,25 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen) items->count = 64; items->tids = palloc(items->count * sizeof(ItemPointerData)); + p.blocknum = 0; + p.nblocks = nblocks; + p.rel = rel; + p.vmbuffer = &stream_vmbuffer; + p.all_frozen = all_frozen; + p.all_visible = all_visible; + stream = read_stream_begin_relation(READ_STREAM_FULL, + bstrategy, + rel, + MAIN_FORKNUM, + collect_corrupt_items_read_stream_next_block, + &p, + 0); + /* Loop over every block in the relation. */ - for (blkno = 0; blkno < nblocks; ++blkno) + while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer) { bool check_frozen = false; bool check_visible = false; - Buffer buffer; Page page; OffsetNumber offnum, maxoff; @@ -693,29 +756,19 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen) /* Make sure we are interruptible. */ CHECK_FOR_INTERRUPTS(); - /* Use the visibility map to decide whether to check this page. */ - if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer)) - check_frozen = true; - if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer)) - check_visible = true; - if (!check_visible && !check_frozen) - continue; - - /* Read and lock the page. */ - buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, - bstrategy); LockBuffer(buffer, BUFFER_LOCK_SHARE); page = BufferGetPage(buffer); maxoff = PageGetMaxOffsetNumber(page); + blkno = BufferGetBlockNumber(buffer); /* * The visibility map bits might have changed while we were acquiring * the page lock. Recheck to avoid returning spurious results. */ - if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer)) + if (all_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer)) check_frozen = false; - if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer)) + if (all_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer)) check_visible = false; if (!check_visible && !check_frozen) { @@ -800,10 +853,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen) UnlockReleaseBuffer(buffer); } + Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer); + read_stream_end(stream); /* Clean up. */ if (vmbuffer != InvalidBuffer) ReleaseBuffer(vmbuffer); + if (stream_vmbuffer != InvalidBuffer) + ReleaseBuffer(stream_vmbuffer); relation_close(rel, AccessShareLock); /* -- 2.45.2
setup.sh
Description: Bourne shell script