Re: Pluggable Storage - Andres's take

Andres Freund Thu, 25 Apr 2019 15:44:21 -0700

Hi Heikki, Ashwin, Tom,

On 2019-04-23 15:52:01 -0700, Andres Freund wrote:
> On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> > index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM
> > doesn't use normal data files, that won't work. I bumped into that with my
> > toy implementation, which wouldn't need to create any data files, if it
> > wasn't for this.
> 
> There are a few more of these:


> I'm not sure about doing so for v12 though. 1) and 3) are fairly
> trivial, but 2) would involve changing the FDW interface, by changing
> the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
> we're not even in beta1.

Hm. I think some of those changes would be a bit bigger than I initially
though. Attached is a more minimal fix that'd route
RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
think it's definitely the right answer for 1), probably the pragmatic
answer to 2), but certainly not for 3).

I've for now made the AM return the size in bytes, and then convert that
into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
are going to continue to want it internally as pages (otherwise there's
going to be way too much churn, without a benefit I can see). So I
thinkt that's OK.

There's also a somewhat weird bit of returning the total relation size
for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
postgres' current forks, but have equivalent concepts. And without that
there'd be no way to get that size.  I'm not sure I like this, input
welcome. But it seems good to offer the ability to get the entire size
somehow.

Btw, isn't RelationGetNumberOfBlocksForFork() currently weirdly placed?
I don't see why bufmgr.c would be appropriate? Although I don't think
it's particulary clear where it'd best reside - I'd tentatively say
storage.c.

Heikki, Ashwin, your inputs would be appreciated here, in particular the
tid fetch bit below.


The attached patch isn't intended to be applied as-is, just basis for
discussion.


> 1) index_update_stats(), computing pg_class.relpages
> 
>    Feels like the number of both heap and index blocks should be
>    computed by the index build and stored in IndexInfo. That'd also get
>    a bit closer towards allowing indexams not going through smgr (useful
>    e.g. for memory only ones).

Due to parallel index builds that'd actually be hard. Given the number
of places wanting to compute relpages for pg_class I think the above
patch routing RelationGetNumberOfBlocksForFork() through tableam is the
right fix.


> 2) commands/analyze.c, computing pg_class.relpages
> 
>    This should imo be moved to the tableam callback. It's currently done
>    a bit weirdly imo, with fdws computing relpages the callback, but
>    then also returning the acquirefunc. Seems like it should entirely be
>    computed as part of calling acquirefunc.

Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
tableam wouldn't be the right minimal approach too. It has the
disadvantage of implying certain values for the
RelationGetNumberOfBlocksForFork(MAIN) return value.  The alternative
would be to return the desire sampling range in
table_beginscan_analyze() - but that'd require some duplication because
currently that just uses the generic scan_begin() callback.

I suspect - as previously mentioned- that we're going to have to extend
statistics collection beyond the current approach at some point, but I
don't think that's now. At least to me it's not clear how to best
represent the stats, and how to best use them, if the underlying storage
is fundamentally not block best.  Nor how we'd avoid code duplication...


> 3) nodeTidscan, skipping over too large tids
>    I think this should just be moved into the AMs, there's no need to
>    have this in nodeTidscan.c

I think here it's *not* actually correct at all to use the relation
size. It's currently doing:

        /*
         * We silently discard any TIDs that are out of range at the time of 
scan
         * start.  (Since we hold at least AccessShareLock on the table, it 
won't
         * be possible for someone to truncate away the blocks we intend to
         * visit.)
         */
        nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);

which is fine (except for a certain abstraction leakage) for an AM like
heap or zheap, but I suspect strongly that that's not ok for Ashwin &
Heikki's approach where tid isn't tied to physical representation.


The obvious answer would be to just move that check into the
table_fetch_row_version implementation (currently just calling
heap_fetch()) - but that doesn't seem OK from a performance POV, because
we'd then determine the relation size once for each tid, rather than
once per tidscan.  And it'd also check in cases where we know the tid is
supposed to be valid (e.g. fetching trigger tuples and such).

The proper fix seems to be to introduce a new scan variant
(e.g. table_beginscan_tid()), and then have table_fetch_row_version take
a scan as a parameter.  But it seems we'd have to introduce that as a
separate tableam callback, because we'd not want to incur the overhead
of creating an additional scan / RelationGetNumberOfBlocks() checks for
triggers et al.


Greetings,

Andres Freund

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6584a9cb8da..132e9466450 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1962,6 +1962,31 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
 }
 
 
+/* ------------------------------------------------------------------------
+ * Miscellaneous callbacks for the heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static uint64
+heapam_relation_size(Relation rel, ForkNumber forkNumber)
+{
+	uint64		nblocks = 0;
+
+	/* Open it at the smgr level if not already done */
+	RelationOpenSmgr(rel);
+
+	/* InvalidForkNumber indicates the size of all forks */
+	if (forkNumber == InvalidForkNumber)
+	{
+		for (int i = 0; i < MAX_FORKNUM; i++)
+			nblocks += smgrnblocks(rel->rd_smgr, i);
+	}
+	else
+		nblocks = smgrnblocks(rel->rd_smgr, forkNumber);
+
+	return nblocks * BLCKSZ;
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2543,6 +2568,8 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 
+	.relation_size = heapam_relation_size,
+
 	.relation_estimate_size = heapam_estimate_rel_size,
 
 	.scan_bitmap_next_block = heapam_scan_bitmap_next_block,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index bfd713f3af1..2b632e002c4 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -86,6 +86,9 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->scan_analyze_next_tuple != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
+
+	Assert(routine->relation_size != NULL);
+
 	Assert(routine->relation_estimate_size != NULL);
 
 	/* optional, but one callback implies presence of hte other */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 887023fc8a5..10ef0de78b4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -33,6 +33,7 @@
 #include <sys/file.h>
 #include <unistd.h>
 
+#include "access/tableam.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
@@ -2789,14 +2790,50 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
+ *
+ * Note that the accuracy of the result will depend on the details of the
+ * relation's storage. For all builtin AMs it'll be accurate, but for external
+ * AMs it might not perfectly be.
  */
 BlockNumber
 RelationGetNumberOfBlocksInFork(Relation relation, ForkNumber forkNum)
 {
-	/* Open it at the smgr level if not already done */
-	RelationOpenSmgr(relation);
+	switch (relation->rd_rel->relkind)
+	{
+		case RELKIND_SEQUENCE:
+		case RELKIND_INDEX:
+		case RELKIND_PARTITIONED_INDEX:
+			/* Open it at the smgr level if not already done */
+			RelationOpenSmgr(relation);
 
-	return smgrnblocks(relation->rd_smgr, forkNum);
+			return smgrnblocks(relation->rd_smgr, forkNum);
+
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			{
+				/*
+				 * Not every table AM uses BLCKSZ wide fixed size
+				 * blocks. Therefore tableam returns the size in bytes - but
+				 * for the purpose of this routine, we want the number of
+				 * blocks. Therefore divide, rounding up.
+				 */
+				uint64 szbytes;
+
+				szbytes = table_relation_size(relation, forkNum);
+
+				return (szbytes + (BLCKSZ - 1)) / BLCKSZ;
+			}
+		case RELKIND_VIEW:
+		case RELKIND_COMPOSITE_TYPE:
+		case RELKIND_FOREIGN_TABLE:
+		case RELKIND_PARTITIONED_TABLE:
+		default:
+			Assert(false);
+			break;
+	}
+
+	return 0; /* satisfy compiler */
 }
 
 /*
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c018a44267a..00d105af9e5 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -533,6 +533,22 @@ typedef struct TableAmRoutine
 										struct ValidateIndexState *state);
 
 
+	/* ------------------------------------------------------------------------
+	 * Miscellaneous functions.
+	 * ------------------------------------------------------------------------
+	 */
+
+	/*
+	 * See table_relation_size().
+	 *
+	 * Note that currently a few use the MAIN_FORKNUM size to vet the validity
+	 * of tids (e.g. nodeTidscans.c), and others use it to figure out the
+	 * range of potentially interesting blocks (brin, analyze). The
+	 * abstraction around this will need to be improved in the near future.
+	 */
+	uint64		(*relation_size) (Relation rel, ForkNumber forkNumber);
+
+
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
 	 * ------------------------------------------------------------------------
@@ -543,6 +559,10 @@ typedef struct TableAmRoutine
 	 *
 	 * While block oriented, it shouldn't be too hard for an AM that doesn't
 	 * doesn't internally use blocks to convert into a usable representation.
+	 *
+	 * This differs from the relation_size callback by returning size
+	 * estimates (both relation size and tuple count) for planning purposes,
+	 * rather than returning a currently correct estimate.
 	 */
 	void		(*relation_estimate_size) (Relation rel, int32 *attr_widths,
 										   BlockNumber *pages, double *tuples,
@@ -1492,6 +1512,26 @@ table_index_validate_scan(Relation heap_rel,
 }
 
 
+/* ----------------------------------------------------------------------------
+ * Miscellaneous functionality
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Return the current size of `rel` in bytes. If `forkNumber` is
+ * InvalidForkNumber return the relation's overall size, otherwise the size
+ * for the indicated fork.
+ *
+ * Note that the overall size might not be the equivalent of the sum of sizes
+ * for the individual forks for some AMs, e.g. because the AMs storage does
+ * not neatly map onto the builtin types of forks.
+ */
+static inline uint64
+table_relation_size(Relation rel, ForkNumber forkNumber)
+{
+	return rel->rd_tableam->relation_size(rel, forkNumber);
+}
+
 /* ----------------------------------------------------------------------------
  * Planner related functionality
  * ----------------------------------------------------------------------------

Re: Pluggable Storage - Andres's take

Reply via email to