Hi,

On Fri, 23 Aug 2024 at 22:01, Noah Misch <n...@leadboat.com> wrote:
>
> On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:
> > On Tue, 20 Aug 2024 at 21:47, Noah Misch <n...@leadboat.com> wrote:
> > > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> > > > The downside of this approach is there are too many "vmbuffer is valid
> > > > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> > > > vmbuffer needs to be read again" cases in the read stream version (700
> > > > vs 20 for the 6 GB table). This is caused by the callback function of
> > > > the read stream reading a new vmbuf while getting next block numbers.
> > > > So, vmbuf is wrong when we are checking visibility map bits that might
> > > > have changed while we were acquiring the page lock.
> > >
> > > Did the test that found 700 "read again" use a different procedure than 
> > > the
> > > one shared as setup.sh down-thread?  Using that script alone, none of the 
> > > vm
> > > bits are set, so I'd not expect any heap page reads.
> >
> > Yes, it is basically the same script but the query is 'SELECT
> > pg_check_visible('${TABLE}'::regclass);'.
>
> I gather the scripts deal exclusively in tables with no vm bits set, so
> pg_visibility did no heap page reads under those scripts.  But the '700 "read
> again"' result suggests either I'm guessing wrong, or that result came from a
> different test procedure.  Thoughts?

Sorry, yes. You need to run VACUUM on the test table before running
the query. The script is attached. You can run the script with
[corrupt | visibility] arguments to test the related patches.

> > > The "vmbuffer needs to be read again" regression fits what I would expect 
> > > the
> > > v1 patch to do with a table having many vm bits set.  In general, I think 
> > > the
> > > fix is to keep two heap Buffer vars, so the callback can work on one 
> > > vmbuffer
> > > while collect_corrupt_items() works on another vmbuffer.  Much of the 
> > > time,
> > > they'll be the same buffer.  It could be as simple as that, but you could
> > > consider further optimizations like these:
> >
> > Thanks for the suggestion. Keeping two vmbuffers lowered the count of
> > "read-again" cases to ~40. I ran the test again and the timing
> > improvement is ~5% now.
>
> > I think the number of "read-again" cases are low enough now.
>
> Agreed.  The increase from 20 to 40 is probably an increase in buffer mapping
> lookups, not an increase in I/O.
>
> > Does creating something like
> > pg_general_read_stream_next_block() in read_stream code and exporting
> > it makes sense?
>
> To me, that name isn't conveying the function's behavior, and the words "pg_"
> and "general_" aren't adding information there.  How about one of these names
> or similar:
>
> seq_read_stream_cb
> sequential_read_stream_cb
> block_range_read_stream_cb

I liked the block_range_read_stream_cb. Attached patches for that
(first 3 patches). I chose an nblocks way instead of last_blocks in
the struct.

> > > The callback doesn't return blocks having zero vm bits, so the blkno 
> > > variable
> > > is not accurate.  I didn't test, but I think the loop's "Recheck to avoid
> > > returning spurious results." looks at the bit for the wrong block.  If 
> > > that's
> > > what v1 does, could you expand the test file to catch that?  For example, 
> > > make
> > > a two-block table with only the second block all-visible.
> >
> > Yes, it was not accurate. I am getting blockno from the buffer now. I
> > checked and confirmed it is working as expected by manually logging
> > blocknos returned from the read stream. I am not sure how to add a
> > test case for this.
>
> VACUUM FREEZE makes an all-visible, all-frozen table.  DELETE of a particular
> TID, even if rolled back, clears both vm bits for the TID's page.  Past tests
> like that had instability problems.  One cause is a concurrent session's XID
> or snapshot, which can prevent VACUUM setting vm bits.  Using a TEMP table may
> have been one of the countermeasures, but I don't remember clearly.  Hence,
> please search the archives or the existing pg_visibility tests for how we
> dealt with that.  It may not be problem for this particular test.

Thanks for the information, I will check these. What I still do not
understand is how to make sure that only the second block is processed
and the first one is skipped. pg_check_visible() and pg_check_frozen()
returns TIDs that cause corruption in the visibility map, there is no
information about block numbers.

> > There is one behavioral change introduced with the patches. It could
> > happen when collect_corrupt_items() is called with both all_visible
> > and all_frozen being true.
> > -> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
> > false in callback. Callback returns this block number because
> > VM_ALL_FROZEN() is true.
> > -> At the /* Recheck to avoid returning spurious results. */ part, we
> > should only check VM_ALL_FROZEN() again as this was returning true in
> > the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
> > VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
> > (vice versa of callback).
> > -> We were skipping this block in the master but the patched version
> > does not skip that.
> >
> > Is this a problem? If this is a problem, a solution might be to
> > callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
> > per_buffer_data.
>
> No, I don't consider that a problem.  That's not making us do additional I/O,
> just testing more bits within the pages we're already loading.  The difference
> in behavior is user-visible, but it's the same behavior change the user would
> get if the timing had been a bit different.

Thanks for the clarification.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 55e8e78d05b135babf8ff8c84fb9747ad19c58c1 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 12:12:52 +0300
Subject: [PATCH v3 1/5] Add general-use struct and callback to read stream

Number of callbacks that just iterates over block range in read stream
are increasing. So, add general-use struct and callback to read stream.
---
 src/include/storage/read_stream.h     | 12 ++++++++++++
 src/backend/storage/aio/read_stream.c | 17 +++++++++++++++++
 src/tools/pgindent/typedefs.list      |  1 +
 3 files changed, 30 insertions(+)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 4e599904f26..7613c1e361c 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -45,11 +45,23 @@
 struct ReadStream;
 typedef struct ReadStream ReadStream;
 
+/*
+ * General-use struct to use in callback functions for block range scans.
+ */
+typedef struct BlockRangeReadStreamPrivate
+{
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+} BlockRangeReadStreamPrivate;
+
 /* Callback that returns the next block number to read. */
 typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
 												void *callback_private_data,
 												void *per_buffer_data);
 
+extern BlockNumber block_range_read_stream_cb(ReadStream *stream,
+											  void *callback_private_data,
+											  void *per_buffer_data);
 extern ReadStream *read_stream_begin_relation(int flags,
 											  BufferAccessStrategy strategy,
 											  Relation rel,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index a83c18c2a4b..937d001ef08 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -166,6 +166,23 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index)
 		stream->per_buffer_data_size * buffer_index;
 }
 
+/*
+ * General-use callback function for block range scans. Callback loops between
+ * blocknum (inclusive) and nblocks (exclusive).
+ */
+BlockNumber
+block_range_read_stream_cb(ReadStream *stream,
+						   void *callback_private_data,
+						   void *per_buffer_data)
+{
+	BlockRangeReadStreamPrivate *p = callback_private_data;
+
+	if (p->blocknum < p->nblocks)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Ask the callback which block it would like us to read next, with a small
  * buffer in front to allow read_stream_unget_block() to work and to allow the
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e951a9e6f3..df3f336bec0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -275,6 +275,7 @@ BlockId
 BlockIdData
 BlockInfoRecord
 BlockNumber
+BlockRangeReadStreamPrivate
 BlockRefTable
 BlockRefTableBuffer
 BlockRefTableChunk
-- 
2.45.2

From 01de5ff954f7db6a749ebfc0b1281ed010b5e8f9 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 12:16:06 +0300
Subject: [PATCH v3 2/5] pg_prewarm: Use generic-use read stream struct and
 callback

Instead of creating another read stream struct and callback, use generic
one that is exported from read stream.
---
 contrib/pg_prewarm/pg_prewarm.c | 25 +++----------------------
 1 file changed, 3 insertions(+), 22 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 5c859e983c5..889639a09a3 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -39,25 +39,6 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
-struct pg_prewarm_read_stream_private
-{
-	BlockNumber blocknum;
-	int64		last_block;
-};
-
-static BlockNumber
-pg_prewarm_read_stream_next_block(ReadStream *stream,
-								  void *callback_private_data,
-								  void *per_buffer_data)
-{
-	struct pg_prewarm_read_stream_private *p = callback_private_data;
-
-	if (p->blocknum <= p->last_block)
-		return p->blocknum++;
-
-	return InvalidBlockNumber;
-}
-
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -203,7 +184,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
-		struct pg_prewarm_read_stream_private p;
+		BlockRangeReadStreamPrivate p;
 		ReadStream *stream;
 
 		/*
@@ -212,13 +193,13 @@ pg_prewarm(PG_FUNCTION_ARGS)
 
 		/* Set up the private state for our streaming buffer read callback. */
 		p.blocknum = first_block;
-		p.last_block = last_block;
+		p.nblocks = last_block + 1;
 
 		stream = read_stream_begin_relation(READ_STREAM_FULL,
 											NULL,
 											rel,
 											forkNumber,
-											pg_prewarm_read_stream_next_block,
+											block_range_read_stream_cb,
 											&p,
 											0);
 
-- 
2.45.2

From afc8c50c39212e016f71ee287062c287fffbacc2 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 12:24:26 +0300
Subject: [PATCH v3 3/5] RelationCopyStorageUsingBuffer: Use generic-use read
 stream struct and callback

Instead of creating another read stream struct and callback, use generic
one that is exported from read stream.
---
 src/backend/storage/buffer/bufmgr.c | 31 ++---------------------------
 1 file changed, 2 insertions(+), 29 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fb38c7bdd45..c04e4733921 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -136,33 +136,6 @@ typedef struct SMgrSortArray
 	SMgrRelation srel;
 } SMgrSortArray;
 
-/*
- * Helper struct for read stream object used in
- * RelationCopyStorageUsingBuffer() function.
- */
-struct copy_storage_using_buffer_read_stream_private
-{
-	BlockNumber blocknum;
-	BlockNumber nblocks;
-};
-
-/*
- * Callback function to get next block for read stream object used in
- * RelationCopyStorageUsingBuffer() function.
- */
-static BlockNumber
-copy_storage_using_buffer_read_stream_next_block(ReadStream *stream,
-												 void *callback_private_data,
-												 void *per_buffer_data)
-{
-	struct copy_storage_using_buffer_read_stream_private *p = callback_private_data;
-
-	if (p->blocknum < p->nblocks)
-		return p->blocknum++;
-
-	return InvalidBlockNumber;
-}
-
 /* GUC variables */
 bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
@@ -4710,7 +4683,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	PGIOAlignedBlock buf;
 	BufferAccessStrategy bstrategy_src;
 	BufferAccessStrategy bstrategy_dst;
-	struct copy_storage_using_buffer_read_stream_private p;
+	BlockRangeReadStreamPrivate p;
 	ReadStream *src_stream;
 	SMgrRelation src_smgr;
 
@@ -4750,7 +4723,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 												 src_smgr,
 												 permanent ? RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED,
 												 forkNum,
-												 copy_storage_using_buffer_read_stream_next_block,
+												 block_range_read_stream_cb,
 												 &p,
 												 0);
 
-- 
2.45.2

From fbbca6e1c27e45afae9137228811d9fa11a7a197 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 15:45:20 +0300
Subject: [PATCH v3 4/5] Use read stream in pg_visibility in
 collect_visibility_data()

Instead of reading blocks with ReadBufferExtended() in
collect_visibility_data(), use read stream.

This change provides about 4% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 773ba92e454..6c64f5ead7e 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -21,6 +21,7 @@
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
@@ -478,6 +479,8 @@ collect_visibility_data(Oid relid, bool include_pd)
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	BlockRangeReadStreamPrivate p;
+	ReadStream *stream = NULL;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -489,6 +492,20 @@ collect_visibility_data(Oid relid, bool include_pd)
 	info->next = 0;
 	info->count = nblocks;
 
+	/* Crate a read stream if read will be done. */
+	if (include_pd)
+	{
+		p.blocknum = 0;
+		p.nblocks = nblocks;
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											bstrategy,
+											rel,
+											MAIN_FORKNUM,
+											block_range_read_stream_cb,
+											&p,
+											0);
+	}
+
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
 		int32		mapbits;
@@ -513,8 +530,7 @@ collect_visibility_data(Oid relid, bool include_pd)
 			Buffer		buffer;
 			Page		page;
 
-			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-										bstrategy);
+			buffer = read_stream_next_buffer(stream, NULL);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 			page = BufferGetPage(buffer);
@@ -525,6 +541,12 @@ collect_visibility_data(Oid relid, bool include_pd)
 		}
 	}
 
+	if (include_pd && stream)
+	{
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
+	}
+
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
-- 
2.45.2

From aae5ded7c3b3f7600f01674a6200645d10d07116 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Fri, 23 Aug 2024 10:15:09 +0300
Subject: [PATCH v3 5/5] Use read stream in pg_visibility in
 collect_corrupt_items()

Instead of reading blocks with ReadBufferExtended() in
collect_corrupt_items(), use read stream.

This change provides about 5% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 87 ++++++++++++++++++++++-----
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 6c64f5ead7e..75a5b362fe2 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -42,6 +42,20 @@ typedef struct corrupt_items
 	ItemPointer tids;
 } corrupt_items;
 
+/*
+ * Helper struct for read stream object used in
+ * collect_corrupt_items() function.
+ */
+struct collect_corrupt_items_read_stream_private
+{
+	bool		all_frozen;
+	bool		all_visible;
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+	Relation	rel;
+	Buffer	   *vmbuffer;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -632,6 +646,38 @@ GetStrictOldestNonRemovableTransactionId(Relation rel)
 	}
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_corrupt_items() function.
+ */
+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+											 void *callback_private_data,
+											 void *per_buffer_data)
+{
+	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+	for (; p->blocknum < p->nblocks; p->blocknum++)
+	{
+		bool		check_frozen = false;
+		bool		check_visible = false;
+
+		/* Make sure we are interruptible. */
+		CHECK_FOR_INTERRUPTS();
+
+		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
+			check_frozen = true;
+		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
+			check_visible = true;
+		if (!check_visible && !check_frozen)
+			continue;
+
+		return p->blocknum++;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Returns a list of items whose visibility map information does not match
  * the status of the tuples on the page.
@@ -654,8 +700,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	corrupt_items *items;
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
+	Buffer		stream_vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
+	struct collect_corrupt_items_read_stream_private p;
+	ReadStream *stream;
+	Buffer		buffer;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -680,12 +730,25 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	items->count = 64;
 	items->tids = palloc(items->count * sizeof(ItemPointerData));
 
+	p.blocknum = 0;
+	p.nblocks = nblocks;
+	p.rel = rel;
+	p.vmbuffer = &stream_vmbuffer;
+	p.all_frozen = all_frozen;
+	p.all_visible = all_visible;
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										collect_corrupt_items_read_stream_next_block,
+										&p,
+										0);
+
 	/* Loop over every block in the relation. */
-	for (blkno = 0; blkno < nblocks; ++blkno)
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
 	{
 		bool		check_frozen = false;
 		bool		check_visible = false;
-		Buffer		buffer;
 		Page		page;
 		OffsetNumber offnum,
 					maxoff;
@@ -693,29 +756,19 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 		/* Make sure we are interruptible. */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Use the visibility map to decide whether to check this page. */
-		if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
-			check_frozen = true;
-		if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
-			check_visible = true;
-		if (!check_visible && !check_frozen)
-			continue;
-
-		/* Read and lock the page. */
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									bstrategy);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 		page = BufferGetPage(buffer);
 		maxoff = PageGetMaxOffsetNumber(page);
+		blkno = BufferGetBlockNumber(buffer);
 
 		/*
 		 * The visibility map bits might have changed while we were acquiring
 		 * the page lock.  Recheck to avoid returning spurious results.
 		 */
-		if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+		if (all_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
 			check_frozen = false;
-		if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+		if (all_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
 			check_visible = false;
 		if (!check_visible && !check_frozen)
 		{
@@ -800,10 +853,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 
 		UnlockReleaseBuffer(buffer);
 	}
+	Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+	read_stream_end(stream);
 
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
+	if (stream_vmbuffer != InvalidBuffer)
+		ReleaseBuffer(stream_vmbuffer);
 	relation_close(rel, AccessShareLock);
 
 	/*
-- 
2.45.2

Attachment: setup.sh
Description: Bourne shell script

Reply via email to