Hi,

Thanks for the review and feedback!

On Tue, 20 Aug 2024 at 21:47, Noah Misch <n...@leadboat.com> wrote:
>
> On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> > 2- collect_corrupt_items()
> >
> > This one is more complicated. The read stream callback function loops
> > until it finds a suitable block to read. So, if the callback returns
> > an InvalidBlockNumber; it means that the stream processed all possible
> > blocks and the stream should be finished. There is ~3% timing
> > improvement with this change. I started the server with the default
> > settings and created a 6 GB table. Then run 100 times
> > pg_check_visible() by clearing the OS cache between each run.
> >
> > The downside of this approach is there are too many "vmbuffer is valid
> > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> > vmbuffer needs to be read again" cases in the read stream version (700
> > vs 20 for the 6 GB table). This is caused by the callback function of
> > the read stream reading a new vmbuf while getting next block numbers.
> > So, vmbuf is wrong when we are checking visibility map bits that might
> > have changed while we were acquiring the page lock.
>
> Did the test that found 700 "read again" use a different procedure than the
> one shared as setup.sh down-thread?  Using that script alone, none of the vm
> bits are set, so I'd not expect any heap page reads.

Yes, it is basically the same script but the query is 'SELECT
pg_check_visible('${TABLE}'::regclass);'.

> The "vmbuffer needs to be read again" regression fits what I would expect the
> v1 patch to do with a table having many vm bits set.  In general, I think the
> fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
> while collect_corrupt_items() works on another vmbuffer.  Much of the time,
> they'll be the same buffer.  It could be as simple as that, but you could
> consider further optimizations like these:

Thanks for the suggestion. Keeping two vmbuffers lowered the count of
"read-again" cases to ~40. I ran the test again and the timing
improvement is ~5% now.

> - Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping
>   lookup when the other Buffer var already has the block you want.
>
> - [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c
>   to stop calling the ReadStreamBlockNumberCB for awhile.  This could help if
>   nonzero vm bits are infrequent, causing us to visit few heap blocks per vm
>   block.  For example, if one block out of every 33000 is all-visible, every
>   heap block we visit has a different vmbuffer.  It's likely not optimal for
>   the callback to pin and unpin 20 vmbuffers, then have
>   collect_corrupt_items() pin and unpin the same 20 vmbuffers.  pg_visibility
>   could notice this trend and request a stop of the callbacks until more of
>   the heap block work completes.  If pg_visibility is going to be the only
>   place in the code with this use case, it's probably not worth carrying the
>   extra API just for pg_visibility.  However, if we get a stronger use case
>   later, pg_visibility could be another beneficiary.

I think the number of "read-again" cases are low enough now. I am
working on 'using ReadRecentBuffer() or a similar technique' but that
may take more time, so I attached patches without these optimizations.

> > +/*
> > + * Callback function to get next block for read stream object used in
> > + * collect_visibility_data() function.
> > + */
> > +static BlockNumber
> > +collect_visibility_data_read_stream_next_block(ReadStream *stream,
> > +                                                                           
> >              void *callback_private_data,
> > +                                                                           
> >              void *per_buffer_data)
> > +{
> > +     struct collect_visibility_data_read_stream_private *p = 
> > callback_private_data;
> > +
> > +     if (p->blocknum < p->nblocks)
> > +             return p->blocknum++;
> > +
> > +     return InvalidBlockNumber;
>
> This is the third callback that just iterates over a block range, after
> pg_prewarm_read_stream_next_block() and
> copy_storage_using_buffer_read_stream_next_block().  While not a big problem,
> I think it's time to have a general-use callback for block range scans.  The
> quantity of duplicate code is low, but the existing function names are long
> and less informative than a behavior-based name.

I agree. Does creating something like
pg_general_read_stream_next_block() in read_stream code and exporting
it makes sense?

> > +static BlockNumber
> > +collect_corrupt_items_read_stream_next_block(ReadStream *stream,
> > +                                                                           
> >            void *callback_private_data,
> > +                                                                           
> >            void *per_buffer_data)
> > +{
> > +     struct collect_corrupt_items_read_stream_private *p = 
> > callback_private_data;
> > +
> > +     for (; p->blocknum < p->nblocks; p->blocknum++)
> > +     {
> > +             bool            check_frozen = false;
> > +             bool            check_visible = false;
> > +
> > +             if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, 
> > p->vmbuffer))
> > +                     check_frozen = true;
> > +             if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, 
> > p->vmbuffer))
> > +                     check_visible = true;
> > +             if (!check_visible && !check_frozen)
> > +                     continue;
>
> If a vm has zero bits set, this loop will scan the entire vm before returning.
> Hence, this loop deserves a CHECK_FOR_INTERRUPTS() or a comment about how
> VM_ALL_FROZEN/VM_ALL_VISIBLE reaches a CHECK_FOR_INTERRUPTS().

Done. I added CHECK_FOR_INTERRUPTS() to the loop in callback.

> > @@ -687,6 +734,20 @@ collect_corrupt_items(Oid relid, bool all_visible, 
> > bool all_frozen)
> >       items->count = 64;
> >       items->tids = palloc(items->count * sizeof(ItemPointerData));
> >
> > +     p.blocknum = 0;
> > +     p.nblocks = nblocks;
> > +     p.rel = rel;
> > +     p.vmbuffer = &vmbuffer;
> > +     p.all_frozen = all_frozen;
> > +     p.all_visible = all_visible;
> > +     stream = read_stream_begin_relation(READ_STREAM_FULL,
> > +                                                                           
> >   bstrategy,
> > +                                                                           
> >   rel,
> > +                                                                           
> >   MAIN_FORKNUM,
> > +                                                                           
> >   collect_corrupt_items_read_stream_next_block,
> > +                                                                           
> >   &p,
> > +                                                                           
> >   0);
> > +
> >       /* Loop over every block in the relation. */
> >       for (blkno = 0; blkno < nblocks; ++blkno)
>
> The callback doesn't return blocks having zero vm bits, so the blkno variable
> is not accurate.  I didn't test, but I think the loop's "Recheck to avoid
> returning spurious results." looks at the bit for the wrong block.  If that's
> what v1 does, could you expand the test file to catch that?  For example, make
> a two-block table with only the second block all-visible.

Yes, it was not accurate. I am getting blockno from the buffer now. I
checked and confirmed it is working as expected by manually logging
blocknos returned from the read stream. I am not sure how to add a
test case for this.

> Since the callback skips blocks, this function should use the
> read_stream_reset() style of looping:

Done.

There is one behavioral change introduced with the patches. It could
happen when collect_corrupt_items() is called with both all_visible
and all_frozen being true.
-> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
false in callback. Callback returns this block number because
VM_ALL_FROZEN() is true.
-> At the /* Recheck to avoid returning spurious results. */ part, we
should only check VM_ALL_FROZEN() again as this was returning true in
the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
(vice versa of callback).
-> We were skipping this block in the master but the patched version
does not skip that.

Is this a problem? If this is a problem, a solution might be to
callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
per_buffer_data.

v2 of the patches are attached, only 0002 is updated.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From e7ec4f3af86499b8433379b3b102eb58f75cf361 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavu...@gmail.com>
Date: Fri, 9 Aug 2024 11:41:52 +0300
Subject: [PATCH v2 1/2] Use read stream in pg_visibility in
 collect_visibility_data()

Instead of reading blocks with ReadBufferExtended() in
collect_visibility_data(), use read stream.

This change provides about 4% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 53 ++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 773ba92e454..ece67ccbe5f 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -21,6 +21,7 @@
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
@@ -41,6 +42,16 @@ typedef struct corrupt_items
 	ItemPointer tids;
 } corrupt_items;
 
+/*
+ * Helper struct for read stream object used in
+ * collect_visibility_data() function.
+ */
+struct collect_visibility_data_read_stream_private
+{
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -463,6 +474,23 @@ pg_visibility_tupdesc(bool include_blkno, bool include_pd)
 	return BlessTupleDesc(tupdesc);
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_visibility_data() function.
+ */
+static BlockNumber
+collect_visibility_data_read_stream_next_block(ReadStream *stream,
+											   void *callback_private_data,
+											   void *per_buffer_data)
+{
+	struct collect_visibility_data_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum < p->nblocks)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Collect visibility data about a relation.
  *
@@ -478,6 +506,8 @@ collect_visibility_data(Oid relid, bool include_pd)
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	struct collect_visibility_data_read_stream_private p;
+	ReadStream *stream = NULL;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -489,6 +519,20 @@ collect_visibility_data(Oid relid, bool include_pd)
 	info->next = 0;
 	info->count = nblocks;
 
+	/* Crate a read stream if read will be done. */
+	if (include_pd)
+	{
+		p.blocknum = 0;
+		p.nblocks = nblocks;
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											bstrategy,
+											rel,
+											MAIN_FORKNUM,
+											collect_visibility_data_read_stream_next_block,
+											&p,
+											0);
+	}
+
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
 		int32		mapbits;
@@ -513,8 +557,7 @@ collect_visibility_data(Oid relid, bool include_pd)
 			Buffer		buffer;
 			Page		page;
 
-			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-										bstrategy);
+			buffer = read_stream_next_buffer(stream, NULL);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 			page = BufferGetPage(buffer);
@@ -525,6 +568,12 @@ collect_visibility_data(Oid relid, bool include_pd)
 		}
 	}
 
+	if (include_pd && stream)
+	{
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
+	}
+
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
-- 
2.45.2

From 1ed469a6301d8ac889228c49198368028a6b6a0e Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavu...@gmail.com>
Date: Fri, 23 Aug 2024 10:15:09 +0300
Subject: [PATCH v2 2/2] Use read stream in pg_visibility in
 collect_corrupt_items()

Instead of reading blocks with ReadBufferExtended() in
collect_corrupt_items(), use read stream.

This change provides about 5% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 87 ++++++++++++++++++++++-----
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index ece67ccbe5f..cfd94bd3be9 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -52,6 +52,20 @@ struct collect_visibility_data_read_stream_private
 	BlockNumber nblocks;
 };
 
+/*
+ * Helper struct for read stream object used in
+ * collect_corrupt_items() function.
+ */
+struct collect_corrupt_items_read_stream_private
+{
+	bool		all_frozen;
+	bool		all_visible;
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+	Relation	rel;
+	Buffer	   *vmbuffer;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -659,6 +673,38 @@ GetStrictOldestNonRemovableTransactionId(Relation rel)
 	}
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_corrupt_items() function.
+ */
+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+											 void *callback_private_data,
+											 void *per_buffer_data)
+{
+	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+	for (; p->blocknum < p->nblocks; p->blocknum++)
+	{
+		bool		check_frozen = false;
+		bool		check_visible = false;
+
+		/* Make sure we are interruptible. */
+		CHECK_FOR_INTERRUPTS();
+
+		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
+			check_frozen = true;
+		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
+			check_visible = true;
+		if (!check_visible && !check_frozen)
+			continue;
+
+		return p->blocknum++;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Returns a list of items whose visibility map information does not match
  * the status of the tuples on the page.
@@ -681,8 +727,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	corrupt_items *items;
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
+	Buffer		stream_vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
+	struct collect_corrupt_items_read_stream_private p;
+	ReadStream *stream;
+	Buffer		buffer;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -707,12 +757,25 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	items->count = 64;
 	items->tids = palloc(items->count * sizeof(ItemPointerData));
 
+	p.blocknum = 0;
+	p.nblocks = nblocks;
+	p.rel = rel;
+	p.vmbuffer = &stream_vmbuffer;
+	p.all_frozen = all_frozen;
+	p.all_visible = all_visible;
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										collect_corrupt_items_read_stream_next_block,
+										&p,
+										0);
+
 	/* Loop over every block in the relation. */
-	for (blkno = 0; blkno < nblocks; ++blkno)
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
 	{
 		bool		check_frozen = false;
 		bool		check_visible = false;
-		Buffer		buffer;
 		Page		page;
 		OffsetNumber offnum,
 					maxoff;
@@ -720,29 +783,19 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 		/* Make sure we are interruptible. */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Use the visibility map to decide whether to check this page. */
-		if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
-			check_frozen = true;
-		if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
-			check_visible = true;
-		if (!check_visible && !check_frozen)
-			continue;
-
-		/* Read and lock the page. */
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									bstrategy);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 		page = BufferGetPage(buffer);
 		maxoff = PageGetMaxOffsetNumber(page);
+		blkno = BufferGetBlockNumber(buffer);
 
 		/*
 		 * The visibility map bits might have changed while we were acquiring
 		 * the page lock.  Recheck to avoid returning spurious results.
 		 */
-		if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+		if (all_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
 			check_frozen = false;
-		if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+		if (all_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
 			check_visible = false;
 		if (!check_visible && !check_frozen)
 		{
@@ -827,10 +880,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 
 		UnlockReleaseBuffer(buffer);
 	}
+	Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+	read_stream_end(stream);
 
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
+	if (stream_vmbuffer != InvalidBuffer)
+		ReleaseBuffer(stream_vmbuffer);
 	relation_close(rel, AccessShareLock);
 
 	/*
-- 
2.45.2

Reply via email to