Hi,
The last version on this thread (v7, the "Rebased" post) used the
RowBatch design: the AM handed the executor a RowBatch carrying a
slice of tuples, a single scan slot was re-pointed at the current
tuple through a repoint_slot AM callback, and an executor_batch_rows
GUC controlled the batch size. As I described in my pgconf.dev talk,
I have regrouped around a smaller, incremental foundation and dropped
that design. This series is the result; it supersedes v7 rather than
extending it.

What changed from the RowBatch design:

* RowBatch is gone. There is no batch container passed across the
AM/executor boundary, no RowBatchOps, and no am_payload indirection.
The batch lives in the scan slot itself.
* v7 already used a single re-pointed scan slot (the slot-array
design, with separate in/out arrays for the qual evaluator, was
dropped before that). What changes here is that the re-point is a
slot op (batch_next) rather than a separate repoint_slot AM callback,
so the executor drives iteration through the normal slot interface and
the AM exposes nothing beyond its scan slot.
* executor_batch_rows is gone. Batching is not opt-in or
size-tuned: the AM serves a natural batch (for heap, one page's
visible tuples) and the executor consumes it a tuple at a time. There
is no GUC and no per-query batch sizing.
* EXPLAIN (ANALYZE, BATCHES) is gone. Its counters reported the
effect of the executor_batch_rows size knob; with a batch fixed at one
page there is nothing batch-specific left to show, since a batch count
would just track pages scanned. The instrumentation that would be
worth having -- time and cardinality per batch as it crosses a plan
edge -- only has something to measure once batches propagate beyond
the scan node, so I would revisit it when batching reaches further
into the executor.
* The batch qual evaluator is also not part of this series. Batched
expression evaluation remains future work; quals here are evaluated
per tuple through the existing path.

The interface is two table-AM callbacks -- scan_getnextbatch and
batch_slot_callbacks -- plus a batch_next slot op. As the series
stands a sequentially scanned AM must provide them: ExecInitSeqScan
takes the scan slot from table_slot_batch_callbacks() and SeqNext
drives table_scan_getnextbatch(), with no fallback to getnextslot, so
an AM lacking them cannot be seqscanned. That is deliberate -- it
keeps SeqNext to one path rather than a per-row capability branch --
but it does make these required of any table AM, the way
scan_getnextslot is required today, and I would like opinions on
whether that is acceptable or whether a getnextslot fallback for AMs
that do not implement batching is worth the branch. (An out-of-tree
AM would need to add the two callbacks; both have straightforward
implementations on top of the existing page scan.)
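
To make the single-path control flow concrete, here is a minimal
standalone sketch of the drain-then-refill loop. It is a toy model, not
code from the patch: ToyPage, ToyScan, ToySlot and the toy_* functions are
hypothetical stand-ins for a heap page, the scan descriptor, the batch
slot, and for batch_next, scan_getnextbatch and SeqNext. It covers forward
scans only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types; none of these names come from the patch. */
#define TOY_MAX_PER_PAGE 4

typedef struct ToyPage
{
	int			ntuples;	/* visible tuples on this page */
	int			tuples[TOY_MAX_PER_PAGE];
} ToyPage;

typedef struct ToyScan
{
	const ToyPage *pages;
	int			npages;
	int			next_page;	/* next page to hand out */
} ToyScan;

typedef struct ToySlot
{
	const int  *tuples;		/* current batch, NULL when the slot is empty */
	int			ntuples;
	int			cindex;		/* index of last-returned tuple, -1 if none */
	int			current;	/* value of the current tuple */
} ToySlot;

/* Stand-in for the batch_next slot op: step to the next tuple of the batch. */
static bool
toy_batch_next(ToySlot *slot)
{
	int			next = slot->cindex + 1;

	if (slot->tuples == NULL || next >= slot->ntuples)
		return false;
	slot->cindex = next;
	slot->current = slot->tuples[next];
	return true;
}

/* Stand-in for scan_getnextbatch: load the next non-empty page. */
static bool
toy_getnextbatch(ToyScan *scan, ToySlot *slot)
{
	while (scan->next_page < scan->npages)
	{
		const ToyPage *page = &scan->pages[scan->next_page++];

		if (page->ntuples == 0)
			continue;			/* empty page, try the next one */
		slot->tuples = page->tuples;
		slot->ntuples = page->ntuples;
		slot->cindex = -1;
		return true;
	}
	return false;
}

/* Stand-in for SeqNext: one path, no per-row capability branch. */
static bool
toy_seq_next(ToyScan *scan, ToySlot *slot)
{
	if (toy_batch_next(slot))
		return true;

	/* batch drained: clear the slot, then ask for the next batch */
	slot->tuples = NULL;
	slot->ntuples = 0;
	slot->cindex = -1;

	if (!toy_getnextbatch(scan, slot))
		return false;
	return toy_batch_next(slot);
}

/* Drive a whole scan, copying up to maxout tuple values into out[]. */
static int
toy_scan_all(const ToyPage *pages, int npages, int *out, int maxout)
{
	ToyScan		scan = {pages, npages, 0};
	ToySlot		slot = {NULL, 0, -1, 0};
	int			n = 0;

	while (n < maxout && toy_seq_next(&scan, &slot))
		out[n++] = slot.current;
	return n;
}
```

A getnextslot fallback would add a second branch at the top of
toy_seq_next(); that branch is what the question above is about.
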
The interface does not assume heap's representation: an AM that does
not produce per-tuple HeapTupleData (a columnar AM, say) is free to
choose how its batch holds data internally. What it must provide is
batch_next, which advances the slot to the current row and leaves it
deformable through the slot's ordinary deform routines (getsomeattrs
and friends); how the batch arrives at that row -- decoding a column
strip, materializing on demand -- is up to the AM. So the internal
layout is the AM's choice while the per-row face the executor sees is
fixed. The executor no longer allocates or manages receiving slots
and there is no row-oriented container an AM must fit into, which
addresses the AM-agnosticism concern from the earlier discussion.
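
As a rough illustration of "internal layout is the AM's choice, per-row
face is fixed", the sketch below runs the same consumer loop over a
row-oriented batch and a columnar one. Everything in it is hypothetical:
ToySlotOps, its getattr member (standing in for the slot's deform
routines) and the two toy AMs are illustration only and do not reflect the
actual TupleTableSlotOps layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToySlot ToySlot;

/* Hypothetical per-row face: advance to a row, then read an attribute. */
typedef struct ToySlotOps
{
	bool		(*batch_next) (ToySlot *slot);
	int			(*getattr) (ToySlot *slot, int attno);
} ToySlotOps;

struct ToySlot
{
	const ToySlotOps *ops;
	int			nrows;		/* rows in the current batch */
	int			cindex;		/* current row, -1 before the first */
	const void *batch;		/* AM-private batch representation */
};

/* Row-oriented toy AM: the batch is an array of rows. */
typedef struct ToyRow
{
	int			attrs[2];
} ToyRow;

static bool
toy_row_next(ToySlot *slot)
{
	if (slot->cindex + 1 >= slot->nrows)
		return false;
	slot->cindex++;
	return true;
}

static int
toy_row_getattr(ToySlot *slot, int attno)
{
	const ToyRow *rows = (const ToyRow *) slot->batch;

	return rows[slot->cindex].attrs[attno];
}

/* Columnar toy AM: the batch is one array per column. */
typedef struct ToyColumns
{
	const int  *cols[2];
} ToyColumns;

static bool
toy_col_next(ToySlot *slot)
{
	/* a real columnar AM might decode a column strip here */
	if (slot->cindex + 1 >= slot->nrows)
		return false;
	slot->cindex++;
	return true;
}

static int
toy_col_getattr(ToySlot *slot, int attno)
{
	const ToyColumns *c = (const ToyColumns *) slot->batch;

	return c->cols[attno][slot->cindex];
}

static const ToySlotOps toy_row_ops = {toy_row_next, toy_row_getattr};
static const ToySlotOps toy_col_ops = {toy_col_next, toy_col_getattr};

/* The consumer sees only the slot ops, never the batch layout. */
static long
toy_sum_attr(ToySlot *slot, int attno)
{
	long		sum = 0;

	while (slot->ops->batch_next(slot))
		sum += slot->ops->getattr(slot, attno);
	return sum;
}
```

toy_sum_attr() plays the role of the executor: it never inspects
slot->batch, so either AM can change its internal layout without the
consumer noticing.
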

Patches are:

0001 - heapam: store full HeapTupleData in rs_vistuples[].
Stores the per-tuple headers that page_collect_tuples() already
builds, instead of rederiving them per tuple in heapgettup_pagemode().
A standalone improvement to the existing pagemode path, independent of
the rest of the series and worth considering on its own; it also gives the
batch path pre-built tuple headers to hand out. (This is the
rs_vistuples[] change from v7, essentially unchanged.)
0002 - tableam/slot interface for batched scans.
Adds scan_getnextbatch and batch_slot_callbacks to TableAmRoutine and
batch_next to TupleTableSlotOps, with their inline wrappers. Interface
only; no implementation, no caller.
0003 - heap implementation + sequential scan.
Implements the interface in heapam and uses it from the sequential
scan node. ExecInitSeqScan obtains the scan slot from
table_slot_batch_callbacks(); the existing ExecSeqScan variants drive
the batch slot unchanged. Forward and backward scans, including a
direction change within a batch, share one path, and the batch slot
deforms like a regular buffer-heap slot so EvalPlanQual and the rest
of the executor are unaffected.
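
The direction handling in 0003 reduces to a small cursor rule: store the
batch with the cursor just outside the edge the scan will enter from, then
step by the scan direction. The sketch below models only that rule; the
ToyBatchCursor type and toy_* names are hypothetical, and the real slot
also repoints its tuple and resets deform state on each step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ScanDirection (forward = +1, backward = -1). */
typedef enum ToyDirection
{
	TOY_BACKWARD = -1,
	TOY_FORWARD = 1
} ToyDirection;

typedef struct ToyBatchCursor
{
	int			ntuples;	/* tuples in the batch */
	int			cindex;		/* index of the last-returned tuple */
} ToyBatchCursor;

/* Position the cursor just before the first tuple for this direction. */
static void
toy_store_batch(ToyBatchCursor *c, int ntuples, ToyDirection dir)
{
	assert(ntuples > 0);
	c->ntuples = ntuples;
	c->cindex = (dir == TOY_BACKWARD) ? ntuples : -1;
}

/* Step relative to the last-returned tuple; false at either batch edge. */
static bool
toy_batch_next(ToyBatchCursor *c, ToyDirection dir)
{
	int			next = c->cindex + (int) dir;

	if (next < 0 || next >= c->ntuples)
		return false;
	c->cindex = next;
	return true;
}
```

With a three-tuple batch stored for a forward scan, two forward steps
return indexes 0 and 1, and a backward step after that returns index 0
again, without any separate code path for the direction change.
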

Performance (meson release builds, master vs patched, pg_prewarm'd
table, vacuum-frozen for the all-visible case; change in median ms over
the 1M..10M row sizes, ranges across two runs):

                            all-visible     not-all-visible
count(*) (no qual)          -35% to -43%    -21% to -31%
count(*) WHERE pass-all     -17% to -23%    -14% to -16%
count(*) WHERE pass-none    -15% to -20%    -13% to -18%

The win is largest where per-tuple scan overhead dominates -- no qual,
and all-visible pages where the visibility check is cheap -- and
proportionally smaller as qual evaluation (unchanged by this series)
is added. Two runs agree to within a couple of points at 5M and 10M;
the 1-2M figures are noisier on my machine, so the larger sizes are
the ones to trust.

Open items:

- Only sequential scan uses the batch interface; the other scan
nodes keep their existing fetch paths. The heap-page-oriented ones
(sample, TID-range, bitmap heap) look convertible along the same
lines; index and index-only scans are less direct and would more
likely connect through the ongoing index-prefetching work. I left
these out to keep the first step small, not because the interface
cannot express them.
- Batched expression evaluation (a batch_next-driven qual opcode)
and any non-HeapTupleData / columnar batch consumption remain
follow-on work, as discussed at pgconf.dev and earlier on this thread.

Where this is going:

This series stops at the scan/TAM boundary. Profiling a selective
count(*) ... WHERE shows why that is the right first cut: batching
removes the per-tuple scan-fetch overhead (heapgettup_pagemode and
friends), which is where the win comes from, and what remains is
per-tuple deform and per-tuple expression evaluation, each about a
quarter of the cycles, with the predicate operator itself a couple of
percent. Batching only the scan does not touch those, and a throwaway
patch I wrote that batched the qual loop moved almost nothing, so the
remaining cost is in the per-tuple executor work, not the loop around
it.

Some of that is improvable in the scalar path with no batching or
columnar representation at all (a denser per-attribute slot layout,
and avoiding the per-tuple indirect deform call where the slot type is
fixed); those help the row-at-a-time executor generally and overlap
the seqscan inefficiencies Andres has catalogued, and I am pursuing
them separately. Beyond that, letting expression evaluation or a
parent node consume a batch as columns rather than a tuple at a time
is the larger direction, but it turns on how batch column data should
be represented, which I would not want to settle yet. What this
series tries to get right for all of it is that the batch lives in the
slot and batch_next is the row-compatible way to walk it, so later
work can reach the batch without a new cross-node container and
anything not converted keeps working unchanged.
--
Thanks, Amit Langote
From 5aa95474f6808ff2167b7613f77d712adc05ca1f Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Fri, 3 Jul 2026 12:35:04 +0900
Subject: [PATCH v8 3/3] Implement batched sequential scans for heap
Implement the batch scan interface for heapam and use it from the
sequential scan node, so a heap SeqScan fetches the visible tuples of a
page per AM call and serves them to the executor one tuple at a time.
Heap
----
heap_getnextbatch() advances to the next block via heap_fetch_next_buffer()
and runs heap_prepare_pagescan() to populate rs_vistuples[], then exposes
the page's tuples through the scan slot with ExecStoreBatchBufferHeapTuples().
BatchBufferHeapTupleTableSlot (TTSOpsBatchBufferHeapTuple) extends
BufferHeapTupleTableSlot with a pointer into rs_vistuples[], a tuple count,
and a cursor. It holds its own pin on the page buffer so the tuples stay
valid for the slot's lifetime. It deforms exactly as TTSOpsBufferHeapTuple
does and shares getsomeattrs, getsysattr and the copy callbacks with it;
only clear, materialize and batch_next differ. materialize copies the
current tuple and retains the buffer pin, since the remaining batch entries
still reference the page.
batch_next moves to the tuple at cindex + dir, the same resume rule
heapgettup_pagemode() uses with rs_cindex, giving one path for forward and
backward iteration and for a direction change within a batch.
ExecStoreBatchBufferHeapTuples positions the cursor before the first tuple
of the requested direction.
TTS_IS_BUFFERTUPLE() recognizes the batch slot; pgstat_count_heap_getnext()
takes a tuple count so a batch is accounted in one call.
Sequential scan
---------------
A heap SeqScan uses the batch slot as its scan slot: ExecInitSeqScan
obtains it from table_slot_batch_callbacks(). SeqNext returns the next
tuple of the current batch with slot_batch_next() and fetches the next
batch with table_scan_getnextbatch() once the current one is drained,
passing es_direction so a single path covers both scan directions. The
ExecSeqScan variants drive the batch slot as they do any scan slot and are
unchanged.
Required callbacks
------------------
scan_getnextbatch and batch_slot_callbacks are now required of any table
AM: the sequential scan node uses them with no fallback, so an AM that
omits them cannot be sequentially scanned. GetTableAmRoutine() asserts
their presence alongside the other required callbacks.
Expression evaluation
---------------------
A scan slot's ops are recorded as a fixed deform target (fetch.kind) when
scan expressions are compiled, and a node may evaluate the same expression
against more than one slot type: an EvalPlanQual recheck reuses the scan's
expressions against the EPQ slot, a TTSOpsBufferHeapTuple. The batch slot
and TTSOpsBufferHeapTuple share their deform code, so
CheckOpSlotCompatibility() treats them, and TTSOpsHeapTuple, as equivalent
fixed targets.
---
src/backend/access/heap/heapam.c | 73 ++++++++-
src/backend/access/heap/heapam_handler.c | 9 +-
src/backend/access/table/tableamapi.c | 2 +
src/backend/executor/execExprInterp.c | 23 +++
src/backend/executor/execTuples.c | 198 ++++++++++++++++++++++-
src/backend/executor/nodeSeqscan.c | 24 ++-
src/include/access/heapam.h | 2 +
src/include/executor/tuptable.h | 37 ++++-
src/include/pgstat.h | 4 +-
9 files changed, 355 insertions(+), 17 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2b85e6805b8..d8fb8d998c7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1464,7 +1464,7 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
return scan->rs_ctup_p;
}
@@ -1492,12 +1492,79 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
ExecStoreBufferHeapTuple(scan->rs_ctup_p, slot, scan->rs_cbuf);
return true;
}
+/*
+ * heap_getnextbatch
+ *
+ * Modeled on heapgettup_pagemode() but without the per-tuple
+ * iteration loop. Advances to the next page with visible tuples
+ * via heap_fetch_next_buffer(), calls heap_prepare_pagescan() to
+ * populate rs_vistuples[], then hands the whole array to the slot.
+ *
+ * The caller must fully consume the previous batch (via batch_next)
+ * before calling again.
+ *
+ * Returns true if a non-empty batch was produced, false when the
+ * scan is exhausted.
+ */
+bool
+heap_getnextbatch(TableScanDesc sscan,
+ ScanDirection direction,
+ TupleTableSlot *slot)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+
+ /*
+ * Each call returns a full page worth of tuples, so there is no
+ * resume-mid-page path like heapgettup_pagemode's rs_inited +
+ * goto continue_page. We always advance to the next block.
+ */
+ while (true)
+ {
+ heap_fetch_next_buffer(scan, direction);
+
+ /* did we run out of blocks to scan? */
+ if (!BufferIsValid(scan->rs_cbuf))
+ break;
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+
+ /* prune the page and collect its visible tuples into rs_vistuples[] */
+ heap_prepare_pagescan(sscan);
+
+ if (scan->rs_ntuples == 0)
+ continue; /* empty page, try next */
+
+ /*
+ * Hand the whole page to the slot.
+ */
+ ExecStoreBatchBufferHeapTuples(scan->rs_vistuples,
+ scan->rs_ntuples,
+ scan->rs_cbuf,
+ direction,
+ slot);
+
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, scan->rs_ntuples);
+ scan->rs_inited = true;
+ return true;
+ }
+
+ /* end of scan */
+ if (BufferIsValid(scan->rs_cbuf))
+ ReleaseBuffer(scan->rs_cbuf);
+ scan->rs_cbuf = InvalidBuffer;
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+
+ return false;
+}
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
@@ -1639,7 +1706,7 @@ heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
* if we get here it means we have a new current scan tuple, so point to
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
ExecStoreBufferHeapTuple(scan->rs_ctup_p, slot, scan->rs_cbuf);
return true;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d4c2c67f564..9a6e5e4707a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -79,6 +79,11 @@ heapam_slot_callbacks(Relation relation)
return &TTSOpsBufferHeapTuple;
}
+static const TupleTableSlotOps *
+heap_batch_slot_callbacks(Relation rel)
+{
+ return &TTSOpsBatchBufferHeapTuple;
+}
/* ------------------------------------------------------------------------
* Callbacks for non-modifying operations on individual tuples for heap AM
@@ -2293,7 +2298,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
ExecStoreBufferHeapTuple(tuple, slot, hscan->rs_cbuf);
/* Count successfully-fetched tuples as heap fetches */
- pgstat_count_heap_getnext(scan->rs_rd);
+ pgstat_count_heap_getnext(scan->rs_rd, 1);
return true;
}
@@ -2654,11 +2659,13 @@ static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
.slot_callbacks = heapam_slot_callbacks,
+ .batch_slot_callbacks = heap_batch_slot_callbacks,
.scan_begin = heap_beginscan,
.scan_end = heap_endscan,
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_getnextbatch = heap_getnextbatch,
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 5450a27faeb..9ef6b5cca65 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -45,6 +45,8 @@ GetTableAmRoutine(Oid amhandler)
Assert(routine->scan_end != NULL);
Assert(routine->scan_rescan != NULL);
Assert(routine->scan_getnextslot != NULL);
+ Assert(routine->scan_getnextbatch != NULL);
+ Assert(routine->batch_slot_callbacks != NULL);
Assert(routine->parallelscan_estimate != NULL);
Assert(routine->parallelscan_initialize != NULL);
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 0634af964a9..c61c0872f35 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -2458,6 +2458,29 @@ CheckOpSlotCompatibility(ExprEvalStep *op, TupleTableSlot *slot)
op->d.fetch.kind == &TTSOpsBufferHeapTuple)
return;
+ /*
+ * The batch buffer-heap slot shares getsomeattrs/getsysattr with
+ * TTSOpsBufferHeapTuple, so it deforms identically to it, and buffer-heap
+ * and heap are already interchangeable above. Extend that same
+ * equivalence to the batch slot, in either direction.
+ */
+
+ /*
+ * Compiled for the batch slot, run against a buffer-heap/heap slot. This
+ * is the case that arises today: a scan expression compiled for the batch
+ * slot is re-run under EPQ against the plain buffer-heap EPQ slot.
+ */
+ if (op->d.fetch.kind == &TTSOpsBatchBufferHeapTuple &&
+ (slot->tts_ops == &TTSOpsBufferHeapTuple ||
+ slot->tts_ops == &TTSOpsHeapTuple))
+ return;
+
+ /* the reverse: compiled for buffer-heap/heap, run against the batch slot */
+ if (slot->tts_ops == &TTSOpsBatchBufferHeapTuple &&
+ (op->d.fetch.kind == &TTSOpsBufferHeapTuple ||
+ op->d.fetch.kind == &TTSOpsHeapTuple))
+ return;
+
/*
* At the moment we consider it OK if a virtual slot is used instead of a
* specific type of slot, as a virtual slot never needs to be deformed.
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 7f4ebf95432..d565ed017d6 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -992,6 +992,133 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
}
}
+/*
+ * clear -- release buffer pin and reset batch state.
+ */
+static void
+tts_batch_buffer_heap_clear(TupleTableSlot *slot)
+{
+ BatchBufferHeapTupleTableSlot *bslot =
+ (BatchBufferHeapTupleTableSlot *) slot;
+ BufferHeapTupleTableSlot *bhslot = &bslot->base;
+
+
+ if (TTS_SHOULDFREE(slot))
+ {
+ /* batch slots may still have a valid buffer here */
+ heap_freetuple(bhslot->base.tuple);
+ slot->tts_flags &= ~TTS_FLAG_SHOULDFREE;
+ }
+
+ if (BufferIsValid(bhslot->buffer))
+ {
+ ReleaseBuffer(bhslot->buffer);
+ bhslot->buffer = InvalidBuffer;
+ }
+ bhslot->base.tuple = NULL;
+ bhslot->base.off = 0;
+
+ bslot->tuples = NULL;
+ bslot->ntuples = 0;
+ bslot->cindex = -1;
+
+ slot->tts_nvalid = 0;
+ slot->tts_flags |= TTS_FLAG_EMPTY;
+ ItemPointerSetInvalid(&slot->tts_tid);
+}
+
+static void
+tts_batch_buffer_heap_materialize(TupleTableSlot *slot)
+{
+ BatchBufferHeapTupleTableSlot *bslot =
+ (BatchBufferHeapTupleTableSlot *) slot;
+ BufferHeapTupleTableSlot *bhslot = &bslot->base;
+ HeapTupleTableSlot *hslot = &bhslot->base;
+ MemoryContext oldContext;
+
+ Assert(!TTS_EMPTY(slot));
+
+ /* already materialized */
+ if (TTS_SHOULDFREE(slot))
+ return;
+
+ oldContext = MemoryContextSwitchTo(slot->tts_mcxt);
+
+ hslot->off = 0;
+ slot->tts_nvalid = 0;
+
+ if (!hslot->tuple)
+ {
+ hslot->tuple = heap_form_tuple(slot->tts_tupleDescriptor,
+ slot->tts_values,
+ slot->tts_isnull);
+ }
+ else
+ {
+ hslot->tuple = heap_copytuple(hslot->tuple);
+
+ /*
+ * Unlike the non-batch case, do NOT release the buffer pin.
+ * The remaining tuples in the batch still have t_data pointing
+ * into the pinned page. The pin is released when the slot
+ * is cleared after the batch is fully consumed.
+ */
+ }
+
+ slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+ MemoryContextSwitchTo(oldContext);
+}
+
+/*
+ * tts_batch_buffer_heap_next -- advance to the next tuple in the batch.
+ *
+ * Steps cindex by the scan direction and points the underlying slot at
+ * tuples[cindex]. Returns false when the batch is exhausted.
+ */
+static bool
+tts_batch_buffer_heap_next(TupleTableSlot *slot, ScanDirection direction)
+{
+ BatchBufferHeapTupleTableSlot *bslot =
+ (BatchBufferHeapTupleTableSlot *) slot;
+ HeapTupleTableSlot *hslot = &bslot->base.base;
+ int dir = (int) direction; /* +1 forward, -1 backward */
+ int next;
+
+ Assert(direction != NoMovementScanDirection);
+
+ /*
+ * Resume relative to the last-returned tuple, mirroring
+ * heapgettup_pagemode()'s rs_cindex + dir logic. This keeps the
+ * iteration correct across a scan-direction change within a batch:
+ * after returning index i, a forward step yields i+1 and a backward
+ * step yields i-1, regardless of how the batch was first entered.
+ */
+ next = bslot->cindex + dir;
+
+ if (next < 0 || next >= bslot->ntuples)
+ return false;
+
+ /*
+ * If the slot was materialized (e.g. by a parent node), free the
+ * palloc'd copy before overwriting the pointer.
+ */
+ if (TTS_SHOULDFREE(slot))
+ {
+ heap_freetuple(hslot->tuple);
+ slot->tts_flags &= ~TTS_FLAG_SHOULDFREE;
+ }
+
+ hslot->tuple = &bslot->tuples[next];
+ hslot->off = 0;
+ bslot->cindex = next;
+
+ slot->tts_nvalid = 0;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+ slot->tts_tid = hslot->tuple->t_self;
+
+ return true;
+}
+
/*
* slot_deform_heap_tuple
* Given a TupleTableSlot, extract data from the slot's physical tuple
@@ -1281,7 +1408,8 @@ const TupleTableSlotOps TTSOpsVirtual = {
.get_heap_tuple = NULL,
.get_minimal_tuple = NULL,
.copy_heap_tuple = tts_virtual_copy_heap_tuple,
- .copy_minimal_tuple = tts_virtual_copy_minimal_tuple
+ .copy_minimal_tuple = tts_virtual_copy_minimal_tuple,
+ .batch_next = NULL
};
const TupleTableSlotOps TTSOpsHeapTuple = {
@@ -1299,7 +1427,8 @@ const TupleTableSlotOps TTSOpsHeapTuple = {
/* A heap tuple table slot can not "own" a minimal tuple. */
.get_minimal_tuple = NULL,
.copy_heap_tuple = tts_heap_copy_heap_tuple,
- .copy_minimal_tuple = tts_heap_copy_minimal_tuple
+ .copy_minimal_tuple = tts_heap_copy_minimal_tuple,
+ .batch_next = NULL
};
const TupleTableSlotOps TTSOpsMinimalTuple = {
@@ -1317,7 +1446,8 @@ const TupleTableSlotOps TTSOpsMinimalTuple = {
.get_heap_tuple = NULL,
.get_minimal_tuple = tts_minimal_get_minimal_tuple,
.copy_heap_tuple = tts_minimal_copy_heap_tuple,
- .copy_minimal_tuple = tts_minimal_copy_minimal_tuple
+ .copy_minimal_tuple = tts_minimal_copy_minimal_tuple,
+ .batch_next = NULL
};
const TupleTableSlotOps TTSOpsBufferHeapTuple = {
@@ -1335,9 +1465,29 @@ const TupleTableSlotOps TTSOpsBufferHeapTuple = {
/* A buffer heap tuple table slot can not "own" a minimal tuple. */
.get_minimal_tuple = NULL,
.copy_heap_tuple = tts_buffer_heap_copy_heap_tuple,
- .copy_minimal_tuple = tts_buffer_heap_copy_minimal_tuple
+ .copy_minimal_tuple = tts_buffer_heap_copy_minimal_tuple,
+ .batch_next = NULL
};
+/*
+ * Everything except clear(), materialize() and batch_next() is shared with
+ * TTSOpsBufferHeapTuple.
+ */
+const TupleTableSlotOps TTSOpsBatchBufferHeapTuple = {
+ .base_slot_size = sizeof(BatchBufferHeapTupleTableSlot),
+ .init = tts_buffer_heap_init,
+ .release = tts_buffer_heap_release,
+ .clear = tts_batch_buffer_heap_clear,
+ .getsomeattrs = tts_buffer_heap_getsomeattrs,
+ .getsysattr = tts_buffer_heap_getsysattr,
+ .materialize = tts_batch_buffer_heap_materialize,
+ .copyslot = tts_buffer_heap_copyslot,
+ .get_heap_tuple = tts_buffer_heap_get_heap_tuple,
+ .get_minimal_tuple = NULL,
+ .copy_heap_tuple = tts_buffer_heap_copy_heap_tuple,
+ .copy_minimal_tuple = tts_buffer_heap_copy_minimal_tuple,
+ .batch_next = tts_batch_buffer_heap_next
+};
/* ----------------------------------------------------------------
* tuple table create/delete functions
@@ -1708,6 +1858,46 @@ ExecStorePinnedBufferHeapTuple(HeapTuple tuple,
return slot;
}
+/*
+ * ExecStoreBatchBufferHeapTuples -- load a batch into a
+ * BatchBufferHeapTupleTableSlot.
+ *
+ * tuples[] must remain valid (buffer pinned) for the slot's lifetime.
+ * No copy -- the slot takes a pointer and a count.
+ */
+void
+ExecStoreBatchBufferHeapTuples(HeapTupleData *tuples,
+ int ntuples,
+ Buffer buffer,
+ ScanDirection direction,
+ TupleTableSlot *slot)
+{
+ BatchBufferHeapTupleTableSlot *bslot =
+ (BatchBufferHeapTupleTableSlot *) slot;
+ BufferHeapTupleTableSlot *bhslot = &bslot->base;
+
+ Assert(ntuples > 0);
+
+ bslot->tuples = tuples;
+ bslot->ntuples = ntuples;
+
+ /*
+ * Position the cursor just before the first tuple to return, so that
+ * the first batch_next (which advances by the scan direction) lands on
+ * the correct edge: index 0 for a forward scan, ntuples-1 for backward.
+ */
+ if (ScanDirectionIsBackward(direction))
+ bslot->cindex = ntuples; /* ntuples + (-1) = ntuples - 1 */
+ else
+ bslot->cindex = -1; /* -1 + 1 = 0 */
+
+ if (BufferIsValid(bhslot->buffer))
+ ReleaseBuffer(bhslot->buffer);
+
+ bhslot->buffer = buffer;
+ IncrBufferRefCount(buffer);
+}
+
/*
* Store a minimal tuple into TTSOpsMinimalTuple type slot.
*
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 5bcb0a861d7..943944ededa 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -85,11 +85,22 @@ SeqNext(SeqScanState *node)
}
/*
- * get the next tuple from the table
+ * Serve the next tuple from the current batch held in the scan slot.
+ * When the batch is exhausted, ask the AM for the next one. The AM
+ * builds the batch honoring scan direction and sets the slot's batch
+ * cursor accordingly, so this single path serves forward and backward
+ * scans alike.
*/
- if (table_scan_getnextslot(scandesc, direction, slot))
+ if (slot_batch_next(slot, direction))
return slot;
- return NULL;
+
+ ExecClearTuple(slot);
+
+ if (!table_scan_getnextbatch(scandesc, direction, slot))
+ return slot;
+
+ slot_batch_next(slot, direction);
+ return slot;
}
/*
@@ -220,6 +231,7 @@ SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
SeqScanState *scanstate;
+ Relation rel;
/*
* Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -245,15 +257,15 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
/*
* open the scan relation
*/
- scanstate->ss.ss_currentRelation =
+ scanstate->ss.ss_currentRelation = rel =
ExecOpenScanRelation(estate,
node->scan.scanrelid,
eflags);
/* and create slot with the appropriate rowtype */
ExecInitScanTupleSlot(estate, &scanstate->ss,
- RelationGetDescr(scanstate->ss.ss_currentRelation),
- table_slot_callbacks(scanstate->ss.ss_currentRelation),
+ RelationGetDescr(rel),
+ table_slot_batch_callbacks(rel),
TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS);
/*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index c17076455bd..c3340f8e29a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -362,6 +362,8 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+extern bool heap_getnextbatch(TableScanDesc sscan,
+ ScanDirection direction, TupleTableSlot *slot);
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 890115314b0..524a3b78772 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -260,11 +260,13 @@ extern PGDLLIMPORT const TupleTableSlotOps TTSOpsVirtual;
extern PGDLLIMPORT const TupleTableSlotOps TTSOpsHeapTuple;
extern PGDLLIMPORT const TupleTableSlotOps TTSOpsMinimalTuple;
extern PGDLLIMPORT const TupleTableSlotOps TTSOpsBufferHeapTuple;
+extern PGDLLIMPORT const TupleTableSlotOps TTSOpsBatchBufferHeapTuple;
#define TTS_IS_VIRTUAL(slot) ((slot)->tts_ops == &TTSOpsVirtual)
#define TTS_IS_HEAPTUPLE(slot) ((slot)->tts_ops == &TTSOpsHeapTuple)
#define TTS_IS_MINIMALTUPLE(slot) ((slot)->tts_ops == &TTSOpsMinimalTuple)
-#define TTS_IS_BUFFERTUPLE(slot) ((slot)->tts_ops == &TTSOpsBufferHeapTuple)
+#define TTS_IS_BUFFERTUPLE(slot) ((slot)->tts_ops == &TTSOpsBufferHeapTuple || \
+ (slot)->tts_ops == &TTSOpsBatchBufferHeapTuple)
/*
@@ -309,6 +311,34 @@ typedef struct BufferHeapTupleTableSlot
Buffer buffer; /* tuple's buffer, or InvalidBuffer */
} BufferHeapTupleTableSlot;
+/*
+ * BatchBufferHeapTupleTableSlot -- a batch of heap tuples residing in a
+ * pinned buffer page.
+ *
+ * Extends BufferHeapTupleTableSlot to hold a pointer to an array of
+ * pre-built HeapTupleData entries living in HeapScanDesc.rs_vistuples[].
+ * The tuple headers' t_data pointers reference the pinned buffer page
+ * and remain valid for the lifetime of the pin. No data is copied.
+ *
+ * When consumed tuple-at-a-time by the existing executor machinery,
+ * the batch_next slot ops callback points the underlying
+ * BufferHeapTupleTableSlot at successive tuples in the array.
+ * Standard slot_getsomeattrs() then deforms the current tuple
+ * through the normal heap deform path.
+ *
+ * Other table AMs that support batching would define their own batch
+ * slot type with AM-specific internals and TupleTableSlotOps.
+ */
+typedef struct BatchBufferHeapTupleTableSlot
+{
+ BufferHeapTupleTableSlot base;
+
+ HeapTupleData *tuples; /* points into HeapScanDesc->rs_vistuples[] */
+ int ntuples; /* number of tuples in this batch */
+ int cindex; /* index of last-returned tuple; before the first
+ * batch_next, -1 (forward) or ntuples (backward) */
+} BatchBufferHeapTupleTableSlot;
+
typedef struct MinimalTupleTableSlot
{
pg_node_attr(abstract)
@@ -360,6 +390,11 @@ extern TupleTableSlot *ExecStoreBufferHeapTuple(HeapTuple tuple,
extern TupleTableSlot *ExecStorePinnedBufferHeapTuple(HeapTuple tuple,
TupleTableSlot *slot,
Buffer buffer);
+extern void ExecStoreBatchBufferHeapTuples(HeapTupleData *tuples,
+ int ntuples,
+ Buffer buffer,
+ ScanDirection direction,
+ TupleTableSlot *slot);
extern TupleTableSlot *ExecStoreMinimalTuple(MinimalTuple mtup,
TupleTableSlot *slot,
bool shouldFree);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 58a44857f13..bc3f56a6724 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -732,10 +732,10 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.numscans++; \
} while (0)
-#define pgstat_count_heap_getnext(rel) \
+#define pgstat_count_heap_getnext(rel, n) \
do { \
if (pgstat_should_count_relation(rel)) \
- (rel)->pgstat_info->counts.tuples_returned++; \
+ (rel)->pgstat_info->counts.tuples_returned += (n); \
} while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
--
2.47.3
From 7f998a084205ff08e1e95a432a5258515c0ca693 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 22 Jun 2026 15:40:35 +0000
Subject: [PATCH v8 1/3] heapam: store full HeapTupleData in rs_vistuples[] for
pagemode scans
page_collect_tuples() builds full HeapTupleData headers for every
visible tuple on a page -- t_data, t_len, t_self, t_tableOid -- but
previously discarded them immediately after writing just the
OffsetNumber of each survivor into rs_vistuples[]. heapgettup_pagemode()
then re-derived those same values on every call by reading the saved
OffsetNumber, calling PageGetItemId() and PageGetItem() to locate the
tuple on the page, and populating rs_ctup field by field.
Change rs_vistuples[] element type from OffsetNumber to HeapTupleData
and populate it inside page_collect_tuples() while lpp, lineoff, page,
and block are already in scope, so no additional page reads are needed.
A new rs_ctup_p field on HeapScanDesc points directly into
rs_vistuples[] for pagemode scans (or into rs_ctup for the non-pagemode
path). Callers read through rs_ctup_p and check rs_ctup_p == NULL for
scan exhaustion instead of rs_ctup.t_data == NULL.
The same simplification applies to heapam_scan_bitmap_next_tuple() and
BitmapHeapScanNextBlock(); SampleHeapTupleVisible() extracts the offset
from rs_vistuples[].t_self.ip_posid.
Having pre-built HeapTupleData headers available at the scan descriptor
level also lays groundwork for a batched tuple interface, where an AM
can serve multiple tuples per call without repeating the line-pointer
traversal.
Suggested-by: Andres Freund <[email protected]>
---
src/backend/access/heap/heapam.c | 126 ++++++++++----------
src/backend/access/heap/heapam_handler.c | 22 +---
src/backend/access/heap/heapam_visibility.c | 19 ++-
src/include/access/heapam.h | 7 +-
4 files changed, 79 insertions(+), 95 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index abfd8e8970a..2b85e6805b8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -462,6 +462,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
+ scan->rs_ctup_p = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
@@ -526,7 +527,6 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
BlockNumber block, int lines,
bool all_visible, bool check_serializable)
{
- Oid relid = RelationGetRelid(scan->rs_base.rs_rd);
int ntup = 0;
int nvis = 0;
BatchMVCCState batchmvcc;
@@ -534,43 +534,27 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
/* page at a time should have been disabled otherwise */
Assert(IsMVCCSnapshot(snapshot));
- /* first find all tuples on the page */
+ /*
+ * Collect all normal tuples on the page into rs_vistuples[].
+ * Every entry gets a fully populated HeapTupleData header with
+ * t_data pointing into the pinned page. t_tableOid was set
+ * once at beginscan time.
+ */
for (OffsetNumber lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
{
ItemId lpp = PageGetItemId(page, lineoff);
- HeapTuple tup;
+ HeapTuple tup = &scan->rs_vistuples[ntup];
if (unlikely(!ItemIdIsNormal(lpp)))
continue;
- /*
- * If the page is not all-visible or we need to check serializability,
- * maintain enough state to be able to refind the tuple efficiently,
- * without again first needing to fetch the item and then via that the
- * tuple.
- */
- if (!all_visible || check_serializable)
- {
- tup = &batchmvcc.tuples[ntup];
+ tup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+ tup->t_len = ItemIdGetLength(lpp);
+ Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
+ ItemPointerSet(&(tup->t_self), block, lineoff);
- tup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
- tup->t_len = ItemIdGetLength(lpp);
- tup->t_tableOid = relid;
- ItemPointerSet(&(tup->t_self), block, lineoff);
- }
-
- /*
- * If the page is all visible, these fields otherwise won't be
- * populated in loop below.
- */
- if (all_visible)
- {
- if (check_serializable)
- {
- batchmvcc.visible[ntup] = true;
- }
- scan->rs_vistuples[ntup] = lineoff;
- }
+ if (all_visible && check_serializable)
+ batchmvcc.visible[ntup] = true;
ntup++;
}
@@ -600,11 +584,30 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
{
HeapCheckForSerializableConflictOut(batchmvcc.visible[i],
scan->rs_base.rs_rd,
- &batchmvcc.tuples[i],
+ &scan->rs_vistuples[i],
buffer, snapshot);
}
}
+ /*
+ * Compact rs_vistuples[] to contain only visible survivors in
+ * [0..nvis-1]. This is done here rather than inside
+ * HeapTupleSatisfiesMVCCBatch() because the serializable conflict
+ * check above needs visible[i] and rs_vistuples[i] to correspond
+ * before compaction. Callers iterate rs_vistuples[] directly
+ * without rechecking visibility.
+ */
+ if (!all_visible)
+ {
+ int dst = 0;
+ for (int i = 0; i < ntup; i++)
+ {
+ if (batchmvcc.visible[i])
+ scan->rs_vistuples[dst++] = scan->rs_vistuples[i];
+ }
+ Assert(dst == nvis);
+ }
+
return nvis;
}
@@ -1035,6 +1038,7 @@ continue_page:
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
scan->rs_coffset = lineoff;
+ scan->rs_ctup_p = &scan->rs_ctup;
return;
}
@@ -1052,7 +1056,7 @@ continue_page:
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
scan->rs_prefetch_block = InvalidBlockNumber;
- tuple->t_data = NULL;
+ scan->rs_ctup_p = NULL;
scan->rs_inited = false;
}
@@ -1075,15 +1079,13 @@ heapgettup_pagemode(HeapScanDesc scan,
int nkeys,
ScanKey key)
{
- HeapTuple tuple = &(scan->rs_ctup);
- Page page;
uint32 lineindex;
uint32 linesleft;
if (likely(scan->rs_inited))
{
/* continue from previously returned page/tuple */
- page = BufferGetPage(scan->rs_cbuf);
+ Assert(BufferIsValid(scan->rs_cbuf));
lineindex = scan->rs_cindex + dir;
if (ScanDirectionIsForward(dir))
@@ -1111,37 +1113,32 @@ heapgettup_pagemode(HeapScanDesc scan,
/* prune the page and determine visible tuple offsets */
heap_prepare_pagescan((TableScanDesc) scan);
- page = BufferGetPage(scan->rs_cbuf);
linesleft = scan->rs_ntuples;
lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
- /* block is the same for all tuples, set it once outside the loop */
- ItemPointerSetBlockNumber(&tuple->t_self, scan->rs_cblock);
-
/* lineindex now references the next or previous visible tid */
continue_page:
for (; linesleft > 0; linesleft--, lineindex += dir)
{
- ItemId lpp;
- OffsetNumber lineoff;
-
- Assert(lineindex < scan->rs_ntuples);
- lineoff = scan->rs_vistuples[lineindex];
- lpp = PageGetItemId(page, lineoff);
- Assert(ItemIdIsNormal(lpp));
-
- tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
- tuple->t_len = ItemIdGetLength(lpp);
- ItemPointerSetOffsetNumber(&tuple->t_self, lineoff);
-
/* skip any tuples that don't match the scan key */
- if (key != NULL &&
- !HeapKeyTest(tuple, RelationGetDescr(scan->rs_base.rs_rd),
- nkeys, key))
+ if (key != NULL)
+ {
+ /*
+ * Headers were pre-built by page_collect_tuples() into
+ * rs_vistuples[]. Point at the entry in place; its t_data
+ * points into the pinned page, which stays valid for the
+ * lifetime of the current page scan.
+ */
+ HeapTuple tuple = &scan->rs_vistuples[lineindex];
+
+ if (!HeapKeyTest(tuple, RelationGetDescr(scan->rs_base.rs_rd),
+ nkeys, key))
continue;
+ }
scan->rs_cindex = lineindex;
+ scan->rs_ctup_p = &scan->rs_vistuples[lineindex];
return;
}
}
@@ -1152,7 +1149,7 @@ continue_page:
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
scan->rs_prefetch_block = InvalidBlockNumber;
- tuple->t_data = NULL;
+ scan->rs_ctup_p = NULL;
scan->rs_inited = false;
}
@@ -1248,6 +1245,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
/* we only need to set this up once */
scan->rs_ctup.t_tableOid = RelationGetRelid(relation);
+ for (int i = 0; i < MaxHeapTuplesPerPage; i++)
+ scan->rs_vistuples[i].t_tableOid = RelationGetRelid(relation);
/*
* Allocate memory to keep track of page allocation for parallel workers
@@ -1457,7 +1456,7 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
heapgettup(scan, direction,
scan->rs_base.rs_nkeys, scan->rs_base.rs_key);
- if (scan->rs_ctup.t_data == NULL)
+ if (scan->rs_ctup_p == NULL)
return NULL;
/*
@@ -1467,7 +1466,7 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
pgstat_count_heap_getnext(scan->rs_base.rs_rd);
- return &scan->rs_ctup;
+ return scan->rs_ctup_p;
}
bool
@@ -1482,7 +1481,7 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
else
heapgettup(scan, direction, sscan->rs_nkeys, sscan->rs_key);
- if (scan->rs_ctup.t_data == NULL)
+ if (scan->rs_ctup_p == NULL)
{
ExecClearTuple(slot);
return false;
@@ -1495,8 +1494,7 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
pgstat_count_heap_getnext(scan->rs_base.rs_rd);
- ExecStoreBufferHeapTuple(&scan->rs_ctup, slot,
- scan->rs_cbuf);
+ ExecStoreBufferHeapTuple(scan->rs_ctup_p, slot, scan->rs_cbuf);
return true;
}
@@ -1589,7 +1587,7 @@ heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
else
heapgettup(scan, direction, sscan->rs_nkeys, sscan->rs_key);
- if (scan->rs_ctup.t_data == NULL)
+ if (scan->rs_ctup_p == NULL)
{
ExecClearTuple(slot);
return false;
@@ -1601,7 +1599,7 @@ heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
* we're scanning for. Here we must filter out any tuples from these
* pages that are outside of that range.
*/
- if (ItemPointerCompare(&scan->rs_ctup.t_self, mintid) < 0)
+ if (ItemPointerCompare(&scan->rs_ctup_p->t_self, mintid) < 0)
{
ExecClearTuple(slot);
@@ -1620,7 +1618,7 @@ heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
* Likewise for the final page, we must filter out TIDs greater than
* maxtid.
*/
- if (ItemPointerCompare(&scan->rs_ctup.t_self, maxtid) > 0)
+ if (ItemPointerCompare(&scan->rs_ctup_p->t_self, maxtid) > 0)
{
ExecClearTuple(slot);
@@ -1643,7 +1641,7 @@ heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
*/
pgstat_count_heap_getnext(scan->rs_base.rs_rd);
- ExecStoreBufferHeapTuple(&scan->rs_ctup, slot, scan->rs_cbuf);
+ ExecStoreBufferHeapTuple(scan->rs_ctup_p, slot, scan->rs_cbuf);
return true;
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2268cc277bc..d4c2c67f564 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2099,9 +2099,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
{
BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) scan;
HeapScanDesc hscan = (HeapScanDesc) bscan;
- OffsetNumber targoffset;
- Page page;
- ItemId lp;
/*
* Out of range? If so, nothing more to look at on this page
@@ -2115,16 +2112,7 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
if (!BitmapHeapScanNextBlock(scan, recheck, lossy_pages, exact_pages))
return false;
}
-
- targoffset = hscan->rs_vistuples[hscan->rs_cindex];
- page = BufferGetPage(hscan->rs_cbuf);
- lp = PageGetItemId(page, targoffset);
- Assert(ItemIdIsNormal(lp));
-
- hscan->rs_ctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
- hscan->rs_ctup.t_len = ItemIdGetLength(lp);
- hscan->rs_ctup.t_tableOid = scan->rs_rd->rd_id;
- ItemPointerSet(&hscan->rs_ctup.t_self, hscan->rs_cblock, targoffset);
+ hscan->rs_ctup_p = &hscan->rs_vistuples[hscan->rs_cindex];
pgstat_count_heap_fetch(scan->rs_rd);
@@ -2132,7 +2120,7 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
* Set up the result slot to point to this tuple. Note that the slot
* acquires a pin on the buffer.
*/
- ExecStoreBufferHeapTuple(&hscan->rs_ctup,
+ ExecStoreBufferHeapTuple(hscan->rs_ctup_p,
slot,
hscan->rs_cbuf);
@@ -2479,7 +2467,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
while (start < end)
{
uint32 mid = start + (end - start) / 2;
- OffsetNumber curoffset = hscan->rs_vistuples[mid];
+ OffsetNumber curoffset = hscan->rs_vistuples[mid].t_self.ip_posid;
if (tupoffset == curoffset)
return true;
@@ -2599,7 +2587,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
ItemPointerSet(&tid, block, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
&heapTuple, NULL, true))
- hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ hscan->rs_vistuples[ntup++] = heapTuple;
}
}
else
@@ -2628,7 +2616,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
if (valid)
{
- hscan->rs_vistuples[ntup++] = offnum;
+ hscan->rs_vistuples[ntup++] = loctup;
PredicateLockTID(scan->rs_rd, &loctup.t_self, snapshot,
HeapTupleHeaderGetXmin(loctup.t_data));
}
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 361b76e5065..235493f0e99 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1674,13 +1674,13 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
* Perform HeapTupleSatisfiesMVCC() on each passed in tuple. This is more
* efficient than doing HeapTupleSatisfiesMVCC() one-by-one.
*
- * To be checked tuples are passed via BatchMVCCState->tuples. Each tuple's
- * visibility is stored in batchmvcc->visible[]. In addition,
- * ->vistuples_dense is set to contain the offsets of visible tuples.
+ * Each tuple's visibility is stored in batchmvcc->visible[]. The caller
+ * is responsible for compacting the tuples array to contain only visible
+ * survivors after this function returns.
*
- * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that it
- * avoids a cross-translation-unit function call for each tuple, allows the
- * compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
+ * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that
+ * it avoids a cross-translation-unit function call for each tuple, allows
+ * the compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
* setting hint bits more efficiently (see the one BufferFinishSetHintBits()
* call below).
*
@@ -1690,7 +1690,7 @@ int
HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
int ntups,
BatchMVCCState *batchmvcc,
- OffsetNumber *vistuples_dense)
+ HeapTupleData *tuples)
{
int nvis = 0;
SetHintBitsState state = SHB_INITIAL;
@@ -1700,16 +1700,13 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
for (int i = 0; i < ntups; i++)
{
bool valid;
- HeapTuple tup = &batchmvcc->tuples[i];
+ HeapTuple tup = &tuples[i];
valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer, &state);
batchmvcc->visible[i] = valid;
if (likely(valid))
- {
- vistuples_dense[nvis] = tup->t_self.ip_posid;
nvis++;
- }
}
if (state == SHB_ENABLED)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 5176478c295..c17076455bd 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -102,7 +102,9 @@ typedef struct HeapScanDescData
/* these fields only used in page-at-a-time mode and for bitmap scans */
uint32 rs_cindex; /* current tuple's index in vistuples */
uint32 rs_ntuples; /* number of visible tuples on page */
- OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
+ HeapTupleData rs_vistuples[MaxHeapTuplesPerPage]; /* tuples */
+ HeapTuple rs_ctup_p; /* points to current tuple in rs_vistuples[]
+ * or &rs_ctup depending on scan mode */
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
@@ -498,14 +500,13 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
*/
typedef struct BatchMVCCState
{
- HeapTupleData tuples[MaxHeapTuplesPerPage];
bool visible[MaxHeapTuplesPerPage];
} BatchMVCCState;
extern int HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
int ntups,
BatchMVCCState *batchmvcc,
- OffsetNumber *vistuples_dense);
+ HeapTupleData *tuples);
/*
* To avoid leaking too much knowledge about reorderbuffer implementation
--
2.47.3
From 25a2dc6de70f7492c4f2c6fec6ee686a5e780b5a Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Fri, 3 Jul 2026 12:12:41 +0900
Subject: [PATCH v8 2/3] Add table AM and slot interface for batched scans
Add an interface by which a table AM can deliver tuples to the executor a
batch at a time while the executor continues to consume them one tuple at
a time through the existing slot interface, amortizing the per-tuple cost
of the AM call.
TableAmRoutine gains:
const TupleTableSlotOps *(*batch_slot_callbacks)(Relation rel);
bool (*scan_getnextbatch)(TableScanDesc scan, ScanDirection direction,
TupleTableSlot *slot);
batch_slot_callbacks returns the slot ops for the AM's batch slot type.
scan_getnextbatch fetches the AM's next batch into that slot and resets
its cursor to the start; what constitutes a batch is defined by the AM.
The caller must fully consume the current batch, via the slot's
batch_next (below), before calling scan_getnextbatch again.
table_scan_getnextbatch() and table_slot_batch_callbacks() are the inline
wrappers.
TupleTableSlotOps gains:
bool (*batch_next)(TupleTableSlot *slot, ScanDirection direction);
batch_next advances the slot to the next tuple of the batch it currently
holds, in the given scan direction. After it returns true the slot behaves
as an ordinary single-tuple slot: getsomeattrs, getsysattr, materialize
and so on operate on the current tuple. It returns false when the batch is
exhausted. slot_batch_next() is the inline wrapper.
---
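The calling convention described above (fetch a batch, drain it through the
slot, fetch the next) can be sketched with mock types. This is an
illustration under stated assumptions, not code from the series:
MockBatchSlot, MockBatchScan, MOCK_PAGE_TUPLES, mock_getnextbatch(),
mock_batch_next() and mock_sum_scan() are hypothetical stand-ins for a
batch-capable slot, a scan descriptor, scan_getnextbatch and batch_next.
Tuples are plain ints, a "page" is a fixed-size run of them, and only the
forward direction is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical batch size: tuples per mock page. */
#define MOCK_PAGE_TUPLES 3

/* Hypothetical stand-in for a slot that holds a whole batch. */
typedef struct MockBatchSlot
{
	const int  *tuples;		/* tuples of the current batch */
	int			ntuples;	/* number of tuples in the current batch */
	int			cursor;		/* index of current tuple, -1 before the first */
	int			current;	/* the tuple that ordinary slot readers see */
} MockBatchSlot;

/* Hypothetical stand-in for a scan descriptor over a flat relation. */
typedef struct MockBatchScan
{
	const int  *rel;		/* all tuples of the relation */
	int			nrel;		/* number of tuples in the relation */
	int			next;		/* index of the first tuple of the next batch */
} MockBatchScan;

/*
 * Like scan_getnextbatch: load the next page's tuples into the slot and
 * reset its cursor.  Returns false when the scan is exhausted.
 */
static bool
mock_getnextbatch(MockBatchScan *scan, MockBatchSlot *slot)
{
	int			remaining = scan->nrel - scan->next;

	/* The contract: the previous batch must have been fully consumed. */
	assert(slot->cursor + 1 >= slot->ntuples);

	if (remaining <= 0)
		return false;

	slot->tuples = scan->rel + scan->next;
	slot->ntuples = remaining < MOCK_PAGE_TUPLES ? remaining : MOCK_PAGE_TUPLES;
	slot->cursor = -1;
	scan->next += slot->ntuples;
	return true;
}

/*
 * Like batch_next: make the next tuple of the held batch current.
 * Returns false when the batch is exhausted.
 */
static bool
mock_batch_next(MockBatchSlot *slot)
{
	if (slot->cursor + 1 >= slot->ntuples)
		return false;
	slot->cursor++;
	slot->current = slot->tuples[slot->cursor];
	return true;
}

/* Consumer loop: one AM call per batch, one slot op per tuple. */
static long
mock_sum_scan(MockBatchScan *scan, int *nbatches)
{
	MockBatchSlot slot = {NULL, 0, -1, 0};
	long		sum = 0;

	*nbatches = 0;
	while (mock_getnextbatch(scan, &slot))
	{
		(*nbatches)++;
		while (mock_batch_next(&slot))
			sum += slot.current;
	}
	return sum;
}
```

With seven tuples and three per page, the outer loop runs three times while
the inner loop still visits every tuple once, which is the amortization the
commit message refers to.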
src/include/access/tableam.h | 49 +++++++++++++++++++++++++++++++++
src/include/executor/tuptable.h | 23 ++++++++++++++++
2 files changed, 72 insertions(+)
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f2c36696bca..50cc312c822 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -334,6 +334,12 @@ typedef struct TableAmRoutine
*/
const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+ /*
+ * Return the TupleTableSlotOps for the batch slot type used by
+ * this AM. NULL if the AM does not support batching.
+ */
+ const TupleTableSlotOps *(*batch_slot_callbacks) (Relation rel);
+
/* ------------------------------------------------------------------------
* Table scan callbacks.
@@ -384,6 +390,25 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /*
+ * Fetch the next batch of tuples from the scan into the given slot.
+ *
+ * What constitutes a batch is defined by the AM. For heap, it is
+ * all visible tuples on one heap page. The AM populates the slot's
+ * batch state and resets the cursor to the start. The slot's
+ * batch_next callback is then used to iterate through individual
+ * tuples.
+ *
+ * Returns true if a non-empty batch was produced, false when the
+ * scan is exhausted.
+ *
+ * The caller must fully consume the previous batch before calling
+ * again.
+ */
+ bool (*scan_getnextbatch) (TableScanDesc scan,
+ ScanDirection direction,
+ TupleTableSlot *slot);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1104,6 +1129,30 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_scan_getnextbatch
+ */
+static inline bool
+table_scan_getnextbatch(TableScanDesc sscan,
+ ScanDirection direction,
+ TupleTableSlot *slot)
+{
+ slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+ return sscan->rs_rd->rd_tableam->scan_getnextbatch(sscan,
+ direction,
+ slot);
+}
+
+/*
+ * table_slot_batch_callbacks
+ */
+static inline const TupleTableSlotOps *
+table_slot_batch_callbacks(Relation rel)
+{
+ return rel->rd_tableam->batch_slot_callbacks(rel);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 3db6c9c9bd0..890115314b0 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -15,6 +15,7 @@
#define TUPTABLE_H
#include "access/htup.h"
+#include "access/sdir.h"
#include "access/sysattr.h"
#include "access/tupdesc.h"
#include "storage/buf.h"
@@ -239,6 +240,16 @@ struct TupleTableSlotOps
* with the minimal tuple without the need for an additional allocation.
*/
MinimalTuple (*copy_minimal_tuple) (TupleTableSlot *slot, Size extra);
+
+ /*
+ * Advance to the next tuple in a batch. Returns true if a tuple
+ * is available, false when the batch is exhausted. After returning
+ * true, the slot behaves as a regular single-tuple slot for
+ * getsomeattrs, getsysattr, etc.
+ *
+ * NULL for non-batch slot types.
+ */
+ bool (*batch_next)(TupleTableSlot *slot, ScanDirection direction);
};
/*
@@ -553,6 +564,18 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
return dstslot;
}
+/*
+ * slot_batch_next
+ */
+static inline bool
+slot_batch_next(TupleTableSlot *slot, ScanDirection direction)
+{
+ Assert(slot->tts_ops->batch_next != NULL);
+
+ return slot->tts_ops->batch_next(slot, direction);
+}
+
+
#endif /* FRONTEND */
#endif /* TUPTABLE_H */
--
2.47.3