Hi, Thanks for the review and feedback!
On Tue, 20 Aug 2024 at 21:47, Noah Misch <n...@leadboat.com> wrote: > > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote: > > 2- collect_corrupt_items() > > > > This one is more complicated. The read stream callback function loops > > until it finds a suitable block to read. So, if the callback returns > > an InvalidBlockNumber; it means that the stream processed all possible > > blocks and the stream should be finished. There is ~3% timing > > improvement with this change. I started the server with the default > > settings and created a 6 GB table. Then run 100 times > > pg_check_visible() by clearing the OS cache between each run. > > > > The downside of this approach is there are too many "vmbuffer is valid > > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so > > vmbuffer needs to be read again" cases in the read stream version (700 > > vs 20 for the 6 GB table). This is caused by the callback function of > > the read stream reading a new vmbuf while getting next block numbers. > > So, vmbuf is wrong when we are checking visibility map bits that might > > have changed while we were acquiring the page lock. > > Did the test that found 700 "read again" use a different procedure than the > one shared as setup.sh down-thread? Using that script alone, none of the vm > bits are set, so I'd not expect any heap page reads. Yes, it is basically the same script but the query is 'SELECT pg_check_visible('${TABLE}'::regclass);'. > The "vmbuffer needs to be read again" regression fits what I would expect the > v1 patch to do with a table having many vm bits set. In general, I think the > fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer > while collect_corrupt_items() works on another vmbuffer. Much of the time, > they'll be the same buffer. It could be as simple as that, but you could > consider further optimizations like these: Thanks for the suggestion. Keeping two vmbuffers lowered the count of "read-again" cases to ~40. I ran the test again and the timing improvement is ~5% now. > - Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping > lookup when the other Buffer var already has the block you want. > > - [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c > to stop calling the ReadStreamBlockNumberCB for awhile. This could help if > nonzero vm bits are infrequent, causing us to visit few heap blocks per vm > block. For example, if one block out of every 33000 is all-visible, every > heap block we visit has a different vmbuffer. It's likely not optimal for > the callback to pin and unpin 20 vmbuffers, then have > collect_corrupt_items() pin and unpin the same 20 vmbuffers. pg_visibility > could notice this trend and request a stop of the callbacks until more of > the heap block work completes. If pg_visibility is going to be the only > place in the code with this use case, it's probably not worth carrying the > extra API just for pg_visibility. However, if we get a stronger use case > later, pg_visibility could be another beneficiary. I think the number of "read-again" cases are low enough now. I am working on 'using ReadRecentBuffer() or a similar technique' but that may take more time, so I attached patches without these optimizations. > > +/* > > + * Callback function to get next block for read stream object used in > > + * collect_visibility_data() function. > > + */ > > +static BlockNumber > > +collect_visibility_data_read_stream_next_block(ReadStream *stream, > > + > > void *callback_private_data, > > + > > void *per_buffer_data) > > +{ > > + struct collect_visibility_data_read_stream_private *p = > > callback_private_data; > > + > > + if (p->blocknum < p->nblocks) > > + return p->blocknum++; > > + > > + return InvalidBlockNumber; > > This is the third callback that just iterates over a block range, after > pg_prewarm_read_stream_next_block() and > copy_storage_using_buffer_read_stream_next_block(). While not a big problem, > I think it's time to have a general-use callback for block range scans. The > quantity of duplicate code is low, but the existing function names are long > and less informative than a behavior-based name. I agree. Does creating something like pg_general_read_stream_next_block() in read_stream code and exporting it makes sense? > > +static BlockNumber > > +collect_corrupt_items_read_stream_next_block(ReadStream *stream, > > + > > void *callback_private_data, > > + > > void *per_buffer_data) > > +{ > > + struct collect_corrupt_items_read_stream_private *p = > > callback_private_data; > > + > > + for (; p->blocknum < p->nblocks; p->blocknum++) > > + { > > + bool check_frozen = false; > > + bool check_visible = false; > > + > > + if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, > > p->vmbuffer)) > > + check_frozen = true; > > + if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, > > p->vmbuffer)) > > + check_visible = true; > > + if (!check_visible && !check_frozen) > > + continue; > > If a vm has zero bits set, this loop will scan the entire vm before returning. > Hence, this loop deserves a CHECK_FOR_INTERRUPTS() or a comment about how > VM_ALL_FROZEN/VM_ALL_VISIBLE reaches a CHECK_FOR_INTERRUPTS(). Done. I added CHECK_FOR_INTERRUPTS() to the loop in callback. > > @@ -687,6 +734,20 @@ collect_corrupt_items(Oid relid, bool all_visible, > > bool all_frozen) > > items->count = 64; > > items->tids = palloc(items->count * sizeof(ItemPointerData)); > > > > + p.blocknum = 0; > > + p.nblocks = nblocks; > > + p.rel = rel; > > + p.vmbuffer = &vmbuffer; > > + p.all_frozen = all_frozen; > > + p.all_visible = all_visible; > > + stream = read_stream_begin_relation(READ_STREAM_FULL, > > + > > bstrategy, > > + > > rel, > > + > > MAIN_FORKNUM, > > + > > collect_corrupt_items_read_stream_next_block, > > + > > &p, > > + > > 0); > > + > > /* Loop over every block in the relation. */ > > for (blkno = 0; blkno < nblocks; ++blkno) > > The callback doesn't return blocks having zero vm bits, so the blkno variable > is not accurate. I didn't test, but I think the loop's "Recheck to avoid > returning spurious results." looks at the bit for the wrong block. If that's > what v1 does, could you expand the test file to catch that? For example, make > a two-block table with only the second block all-visible. Yes, it was not accurate. I am getting blockno from the buffer now. I checked and confirmed it is working as expected by manually logging blocknos returned from the read stream. I am not sure how to add a test case for this. > Since the callback skips blocks, this function should use the > read_stream_reset() style of looping: Done. There is one behavioral change introduced with the patches. It could happen when collect_corrupt_items() is called with both all_visible and all_frozen being true. -> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns false in callback. Callback returns this block number because VM_ALL_FROZEN() is true. -> At the /* Recheck to avoid returning spurious results. */ part, we should only check VM_ALL_FROZEN() again as this was returning true in the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE(). VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now (vice versa of callback). -> We were skipping this block in the master but the patched version does not skip that. Is this a problem? If this is a problem, a solution might be to callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the per_buffer_data. v2 of the patches are attached, only 0002 is updated. -- Regards, Nazir Bilal Yavuz Microsoft
From e7ec4f3af86499b8433379b3b102eb58f75cf361 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz <byavu...@gmail.com> Date: Fri, 9 Aug 2024 11:41:52 +0300 Subject: [PATCH v2 1/2] Use read stream in pg_visibility in collect_visibility_data() Instead of reading blocks with ReadBufferExtended() in collect_visibility_data(), use read stream. This change provides about 4% performance improvement. --- contrib/pg_visibility/pg_visibility.c | 53 ++++++++++++++++++++++++++- 1 file changed, 51 insertions(+), 2 deletions(-) diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c index 773ba92e454..ece67ccbe5f 100644 --- a/contrib/pg_visibility/pg_visibility.c +++ b/contrib/pg_visibility/pg_visibility.c @@ -21,6 +21,7 @@ #include "storage/bufmgr.h" #include "storage/proc.h" #include "storage/procarray.h" +#include "storage/read_stream.h" #include "storage/smgr.h" #include "utils/rel.h" #include "utils/snapmgr.h" @@ -41,6 +42,16 @@ typedef struct corrupt_items ItemPointer tids; } corrupt_items; +/* + * Helper struct for read stream object used in + * collect_visibility_data() function. + */ +struct collect_visibility_data_read_stream_private +{ + BlockNumber blocknum; + BlockNumber nblocks; +}; + PG_FUNCTION_INFO_V1(pg_visibility_map); PG_FUNCTION_INFO_V1(pg_visibility_map_rel); PG_FUNCTION_INFO_V1(pg_visibility); @@ -463,6 +474,23 @@ pg_visibility_tupdesc(bool include_blkno, bool include_pd) return BlessTupleDesc(tupdesc); } +/* + * Callback function to get next block for read stream object used in + * collect_visibility_data() function. + */ +static BlockNumber +collect_visibility_data_read_stream_next_block(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data) +{ + struct collect_visibility_data_read_stream_private *p = callback_private_data; + + if (p->blocknum < p->nblocks) + return p->blocknum++; + + return InvalidBlockNumber; +} + /* * Collect visibility data about a relation. * @@ -478,6 +506,8 @@ collect_visibility_data(Oid relid, bool include_pd) BlockNumber blkno; Buffer vmbuffer = InvalidBuffer; BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD); + struct collect_visibility_data_read_stream_private p; + ReadStream *stream = NULL; rel = relation_open(relid, AccessShareLock); @@ -489,6 +519,20 @@ collect_visibility_data(Oid relid, bool include_pd) info->next = 0; info->count = nblocks; + /* Crate a read stream if read will be done. */ + if (include_pd) + { + p.blocknum = 0; + p.nblocks = nblocks; + stream = read_stream_begin_relation(READ_STREAM_FULL, + bstrategy, + rel, + MAIN_FORKNUM, + collect_visibility_data_read_stream_next_block, + &p, + 0); + } + for (blkno = 0; blkno < nblocks; ++blkno) { int32 mapbits; @@ -513,8 +557,7 @@ collect_visibility_data(Oid relid, bool include_pd) Buffer buffer; Page page; - buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, - bstrategy); + buffer = read_stream_next_buffer(stream, NULL); LockBuffer(buffer, BUFFER_LOCK_SHARE); page = BufferGetPage(buffer); @@ -525,6 +568,12 @@ collect_visibility_data(Oid relid, bool include_pd) } } + if (include_pd && stream) + { + Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer); + read_stream_end(stream); + } + /* Clean up. */ if (vmbuffer != InvalidBuffer) ReleaseBuffer(vmbuffer); -- 2.45.2
From 1ed469a6301d8ac889228c49198368028a6b6a0e Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz <byavu...@gmail.com> Date: Fri, 23 Aug 2024 10:15:09 +0300 Subject: [PATCH v2 2/2] Use read stream in pg_visibility in collect_corrupt_items() Instead of reading blocks with ReadBufferExtended() in collect_corrupt_items(), use read stream. This change provides about 5% performance improvement. --- contrib/pg_visibility/pg_visibility.c | 87 ++++++++++++++++++++++----- 1 file changed, 72 insertions(+), 15 deletions(-) diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c index ece67ccbe5f..cfd94bd3be9 100644 --- a/contrib/pg_visibility/pg_visibility.c +++ b/contrib/pg_visibility/pg_visibility.c @@ -52,6 +52,20 @@ struct collect_visibility_data_read_stream_private BlockNumber nblocks; }; +/* + * Helper struct for read stream object used in + * collect_corrupt_items() function. + */ +struct collect_corrupt_items_read_stream_private +{ + bool all_frozen; + bool all_visible; + BlockNumber blocknum; + BlockNumber nblocks; + Relation rel; + Buffer *vmbuffer; +}; + PG_FUNCTION_INFO_V1(pg_visibility_map); PG_FUNCTION_INFO_V1(pg_visibility_map_rel); PG_FUNCTION_INFO_V1(pg_visibility); @@ -659,6 +673,38 @@ GetStrictOldestNonRemovableTransactionId(Relation rel) } } +/* + * Callback function to get next block for read stream object used in + * collect_corrupt_items() function. + */ +static BlockNumber +collect_corrupt_items_read_stream_next_block(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data) +{ + struct collect_corrupt_items_read_stream_private *p = callback_private_data; + + for (; p->blocknum < p->nblocks; p->blocknum++) + { + bool check_frozen = false; + bool check_visible = false; + + /* Make sure we are interruptible. */ + CHECK_FOR_INTERRUPTS(); + + if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer)) + check_frozen = true; + if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer)) + check_visible = true; + if (!check_visible && !check_frozen) + continue; + + return p->blocknum++; + } + + return InvalidBlockNumber; +} + /* * Returns a list of items whose visibility map information does not match * the status of the tuples on the page. @@ -681,8 +727,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen) corrupt_items *items; BlockNumber blkno; Buffer vmbuffer = InvalidBuffer; + Buffer stream_vmbuffer = InvalidBuffer; BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD); TransactionId OldestXmin = InvalidTransactionId; + struct collect_corrupt_items_read_stream_private p; + ReadStream *stream; + Buffer buffer; rel = relation_open(relid, AccessShareLock); @@ -707,12 +757,25 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen) items->count = 64; items->tids = palloc(items->count * sizeof(ItemPointerData)); + p.blocknum = 0; + p.nblocks = nblocks; + p.rel = rel; + p.vmbuffer = &stream_vmbuffer; + p.all_frozen = all_frozen; + p.all_visible = all_visible; + stream = read_stream_begin_relation(READ_STREAM_FULL, + bstrategy, + rel, + MAIN_FORKNUM, + collect_corrupt_items_read_stream_next_block, + &p, + 0); + /* Loop over every block in the relation. */ - for (blkno = 0; blkno < nblocks; ++blkno) + while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer) { bool check_frozen = false; bool check_visible = false; - Buffer buffer; Page page; OffsetNumber offnum, maxoff; @@ -720,29 +783,19 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen) /* Make sure we are interruptible. */ CHECK_FOR_INTERRUPTS(); - /* Use the visibility map to decide whether to check this page. */ - if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer)) - check_frozen = true; - if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer)) - check_visible = true; - if (!check_visible && !check_frozen) - continue; - - /* Read and lock the page. */ - buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, - bstrategy); LockBuffer(buffer, BUFFER_LOCK_SHARE); page = BufferGetPage(buffer); maxoff = PageGetMaxOffsetNumber(page); + blkno = BufferGetBlockNumber(buffer); /* * The visibility map bits might have changed while we were acquiring * the page lock. Recheck to avoid returning spurious results. */ - if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer)) + if (all_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer)) check_frozen = false; - if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer)) + if (all_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer)) check_visible = false; if (!check_visible && !check_frozen) { @@ -827,10 +880,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen) UnlockReleaseBuffer(buffer); } + Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer); + read_stream_end(stream); /* Clean up. */ if (vmbuffer != InvalidBuffer) ReleaseBuffer(vmbuffer); + if (stream_vmbuffer != InvalidBuffer) + ReleaseBuffer(stream_vmbuffer); relation_close(rel, AccessShareLock); /* -- 2.45.2