Hi Junwang,

On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <[email protected]> wrote:
> On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <[email protected]> wrote:
> > I re-ran the benchmarks (same test as yours, different machine):
> >
> > create table pk (a numeric primary key);
> > create table fk (a bigint references pk);
> > insert into pk select generate_series(1, 2000000);
> > insert into fk select generate_series(1, 2000000, 2);
> >
> > master: 2444 ms (median of 3 runs)
> > 0001: 1382 ms (43% faster)
> > 0001+0002: 1202 ms (51% faster, 13% over 0001 alone)
>
> I can get similar improvement on my old mac intel chip:
>
> master: 12963.993 ms
> 0001: 6641.692 ms, 48.8% faster
> 0001+0002: 5771.703 ms, 55.5% faster
>
> > Also, with int PK / int FK (1M rows):
> >
> > create table pk (a int primary key);
> > create table fk (a int references pk);
> > insert into pk select generate_series(1, 1000000);
> > insert into fk select generate_series(1, 1000000);
> >
> > master: 1000 ms
> > 0001: 520 ms (48% faster)
> > 0001+0002: 432 ms (57% faster, 17% over 0001 alone)
>
> master: 11134.583 ms
> 0001: 5240.298 ms, 52.9% faster
> 0001+0002: 4554.215 ms, 59.1% faster
Thanks for testing; good to see similar numbers. I forgot to note that
these results are for the case where the PK index probes don't do any
I/O, though you may well be aware of that. Below I report some numbers
that Tomas Vondra shared with me off-list for a workload where the
probes do have to perform I/O; there the benefit from this patch set
alone is only marginal.
> I don't have any additional comments on the patch except one minor nit,
> maybe merge the following two if conditions into one, not a strong opinion
> though.
>
> if (use_cache)
> {
>     /*
>      * The snapshot was registered once when the cache entry was created.
>      * We just patch curcid to reflect the new command counter.
>      * SnapshotSetCommandId() only patches process-global statics, not
>      * registered copies, so we do it directly.
>      *
>      * The xmin/xmax/xip fields don't need refreshing: within a single
>      * statement batch, only curcid changes between rows.
>      */
>     Assert(fpentry && fpentry->snapshot != NULL);
>     snapshot = fpentry->snapshot;
>     snapshot->curcid = GetCurrentCommandId(false);
> }
> else
>     snapshot = RegisterSnapshot(GetLatestSnapshot());
>
> if (use_cache)
> {
>     pk_rel = fpentry->pk_rel;
>     idx_rel = fpentry->idx_rel;
>     scandesc = fpentry->scandesc;
>     slot = fpentry->slot;
> }
> else
> {
>     pk_rel = table_open(riinfo->pk_relid, RowShareLock);
>     idx_rel = index_open(riinfo->conindid, AccessShareLock);
>     scandesc = index_beginscan(pk_rel, idx_rel,
>                                snapshot, NULL,
>                                riinfo->nkeys, 0);
>     slot = table_slot_create(pk_rel, NULL);
> }
Good idea, done.
While polishing 0002, I revisited the snapshot caching semantics. The
previous commit message hand-waved that only curcid changes between
rows, but GetLatestSnapshot() also picks up other backends' commits,
so reusing the cached snapshot is a deliberate semantic change from
the SPI path. I think it's safe for three reasons: curcid is all we
need for intra-statement visibility; concurrent commits either
happened before our snapshot was taken (and are visible) or are racing
with our statement and wouldn't be seen reliably even with per-row
snapshots, since the order in which FK rows are checked is
nondeterministic; and LockTupleKeyShare prevents the PK row from
disappearing regardless. In essence, we're treating all the FK checks
within a trigger-firing cycle as a single plan execution that happens
to scan N rows, rather than as N independent SPI queries each taking a
fresh snapshot. That's the natural model -- a normal SELECT ... FOR
KEY SHARE plan doesn't re-take GetLatestSnapshot() between rows
either.
Similarly, the permission check (schema USAGE + table SELECT) is now
done once at cache-entry creation in ri_FastPathGetEntry() rather than
on every flush. The RI check runs as the PK table's owner, so we're
verifying that the owner can access their own table -- a condition
that won't change unless someone explicitly revokes the owner's
privileges, which would break the SPI path as well.
> > David Rowley mentioned off-list that it might be worth batching
> > multiple FK values into a single index probe, leveraging the
> > ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
> > to buffer FK values across trigger invocations in the per-constraint
> > cache (0002 already has the right structure for this), build a
> > SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
> > pages in one sorted traversal instead of one tree descent per row. The
> > locking and recheck would still be per-tuple, but the index traversal
> > cost drops significantly. Single-column FKs are the obvious starting
> > point. That seems worth exploring but can be done as a separate patch
> > on top of this.
>
> I will take a look at this in the following weeks.
I ended up going ahead with the batching and SAOP idea that David
mentioned -- I had a proof of concept working shortly after posting v3
and kept iterating on it. So the attached set is now:
0001 - Core fast path (your 0001+0002 reworked, as before)
0002 - Per-batch resource caching (PK relation, index, scandesc, snapshot)
0003 - FK row buffering: materialize FK tuples into a per-constraint
batch buffer (64 rows), flush when full or at batch end
0004 - SK_SEARCHARRAY for single-column FKs: build an array from the
buffered FK values and do one index scan instead of 64 separate tree
descents. Multi-column FKs fall back to a per-row loop.
0003 is pure infrastructure -- it doesn't improve performance on its
own because the per-row index descent still dominates. The payoff
comes in 0004.
Numbers (same machine as before, median of 3 runs):
numeric PK / bigint FK, 1M rows:
master: 2487 ms
0001..0004: 1168 ms (2.1x)
int PK / int FK, 500K rows:
master: 1043 ms
0001..0004: 335 ms (3.1x)
The int/int case benefits most because the per-row cost is lower, so
the SAOP traversal savings are a larger fraction of the total. The
numeric/bigint case still sees a solid improvement despite the
cross-type cast overhead.
Tomas Vondra also tested an I/O-bound workload: a dataset larger than
shared_buffers, combined with his and Peter Geoghegan's I/O
prefetching patches. The numbers he shared off-list show that the
patches here don't make a big dent on their own when the main
bottleneck is I/O:

master: 161617 ms
ri-check (0001..0004): 149446 ms (1.08x)
ri-check + i/o prefetching: 50885 ms (3.2x)

So the RI patches alone give only ~8% here, since most of the time is
spent waiting on reads. But the batching gives the prefetch machinery
a window of upcoming probes to issue readahead against, so the two
together yield 3.2x -- the batching + SAOP approach helps in the
I/O-bound case too, not just the CPU-bound / memory-resident one.
Tomas also caught a memory context bug in the batch flush path: the
cached scandesc lives in TopTransactionContext, but the btree AM
defers _bt_preprocess_keys allocation to the first getnext call, which
pallocs into CurrentMemoryContext. If that's a short-lived
per-trigger-row context, the scandesc has dangling pointers on the
next rescan. Fixed by switching to TopTransactionContext before the
probe loop.
Finally, I've fixed a number of other small and not-so-small bugs
found while polishing the old patches and made other stylistic
improvements. One notable change is that I introduced a FastPathMeta
struct to store the fast path metadata instead of dumping those arrays
in the RI_ConstraintInfo. It's allocated lazily on first use and holds
the per-key compare entries, operator procedures, and index strategy
info needed by the scan key construction, so RI_ConstraintInfo doesn't
pay for them when the fast path isn't used.
On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <[email protected]> wrote:
>
> Hi Amit,
>
> On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <[email protected]> wrote:
> >
> > Hi Junwang,
> >
> > On Mon, Dec 1, 2025 at 3:09 PM Junwang Zhao <[email protected]> wrote:
> > > As Amit has already stated, we are approaching a hybrid "fast-path +
> > > fallback"
> > > design.
> > >
> > > 0001 adds a fast path optimization for foreign key constraint checks
> > > that bypasses the SPI executor, the fast path applies when the referenced
> > > table is not partitioned, and the constraint does not involve temporal
> > > semantics.
> > >
> > > With the following test:
> > >
> > > create table pk (a numeric primary key);
> > > create table fk (a bigint references pk);
> > > insert into pk select generate_series(1, 2000000);
> > >
> > > head:
> > >
> > > [local] zhjwpku@postgres:5432-90419=# insert into fk select
> > > generate_series(1, 2000000, 2);
> > > INSERT 0 1000000
> > > Time: 13516.177 ms (00:13.516)
> > >
> > > [local] zhjwpku@postgres:5432-90419=# update fk set a = a + 1;
> > > UPDATE 1000000
> > > Time: 15057.638 ms (00:15.058)
> > >
> > > patched:
> > >
> > > [local] zhjwpku@postgres:5432-98673=# insert into fk select
> > > generate_series(1, 2000000, 2);
> > > INSERT 0 1000000
> > > Time: 8248.777 ms (00:08.249)
> > >
> > > [local] zhjwpku@postgres:5432-98673=# update fk set a = a + 1;
> > > UPDATE 1000000
> > > Time: 10117.002 ms (00:10.117)
> > >
> > > 0002 cache fast-path metadata used by the index probe, at the current
> > > time only comparison operator hash entries, operator function OIDs
> > > and strategy numbers and subtypes for index scans. But this cache
> > > doesn't buy any performance improvement.
> > >
> > > Caching additional metadata should improve performance for foreign key
> > > checks.
> > >
> > > Amit suggested introducing a mechanism for ri_triggers.c to register a
> > > cleanup callback in the EState, which AfterTriggerEndQuery() could then
> > > invoke to release per-statement cached metadata (such as the
> > > IndexScanDesc).
> > > However, I haven't been able to implement this mechanism yet.
> >
> > Thanks for working on this. I've taken your patches as a starting
> > point and reworked the series into two patches (attached): 1st is your
> > 0001+0002 as the core patch that adds a gated fast-path alternative to
> > SPI and 2nd where I added per-statement resource caching. Doing the
> > latter turned out to be not so hard thanks to the structure you chose
> > to build the core fast path. Good call on adding the RLS and ACL test
> > cases, btw.
> >
> > So, 0001 is a functionally complete fast path: concurrency handling,
> > REPEATABLE READ crosscheck, cross-type operators, security context,
> > and metadata caching. 0002 implements the per-statement resource
> > caching we discussed, though instead of sharing the EState between
> > trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback
> > mechanism that fires at the end of each trigger-firing cycle
> > (per-statement for immediate constraints, or until COMMIT for deferred
> > ones). It layers resource caching on top so that the PK relation,
> > index, scan descriptor, and snapshot stay open across all FK trigger
> > invocations within a single trigger-firing cycle rather than being
> > opened and closed per row.
> >
> > Note that the previous 0002 (metadata caching) is folded into 0001,
> > and most of the new fast-path logic added in 0001 now lives in
> > ri_FastPathCheck() rather than inline in RI_FKey_check(), so the
> > RI_FKey_check diff is just the gating call and SPI fallback.
> >
> > I re-ran the benchmarks (same test as yours, different machine):
> >
> > create table pk (a numeric primary key);
> > create table fk (a bigint references pk);
> > insert into pk select generate_series(1, 2000000);
> > insert into fk select generate_series(1, 2000000, 2);
> >
> > master: 2444 ms (median of 3 runs)
> > 0001: 1382 ms (43% faster)
> > 0001+0002: 1202 ms (51% faster, 13% over 0001 alone)
>
> I can get similar improvement on my old mac intel chip:
>
> master: 12963.993 ms
> 0001: 6641.692 ms, 48.8% faster
> 0001+0002: 5771.703 ms, 55.5% faster
>
> >
> > Also, with int PK / int FK (1M rows):
> >
> > create table pk (a int primary key);
> > create table fk (a int references pk);
> > insert into pk select generate_series(1, 1000000);
> > insert into fk select generate_series(1, 1000000);
> >
> > master: 1000 ms
> > 0001: 520 ms (48% faster)
> > 0001+0002: 432 ms (57% faster, 17% over 0001 alone)
>
> master: 11134.583 ms
> 0001: 5240.298 ms, 52.9% faster
> 0001+0002: 4554.215 ms, 59.1% faster
>
> >
> > The incremental gain from 0002 comes from eliminating per-row relation
> > open/close, scan begin/end, slot alloc/free, and replacing per-row
> > GetSnapshotData() with only curcid adjustment on the registered
> > snapshot copy in the cache.
> >
> > The two current limitations are partitioned referenced tables and
> > temporal foreign keys. Partitioned PKs are relatively uncommon in
> > practice, so the non-partitioned case should cover most FK workloads,
> > so I'm not sure it's worth the added complexity to support them.
> > Temporal FKs are inherently multi-row, so they're a poor fit for a
> > single-probe fast path.
> >
> > David Rowley mentioned off-list that it might be worth batching
> > multiple FK values into a single index probe, leveraging the
> > ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
> > to buffer FK values across trigger invocations in the per-constraint
> > cache (0002 already has the right structure for this), build a
> > SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
> > pages in one sorted traversal instead of one tree descent per row. The
> > locking and recheck would still be per-tuple, but the index traversal
> > cost drops significantly. Single-column FKs are the obvious starting
> > point. That seems worth exploring but can be done as a separate patch
> > on top of this.
>
> I will take a look at this in the following weeks.
>
> >
> > I think the series is in reasonable shape but would appreciate extra
> > eyeballs, especially on the concurrency handling in ri_LockPKTuple()
> > in 0001 and the snapshot lifecycle in 0002. Or anything else that
> > catches one's eye.
> >
> > --
> > Thanks, Amit Langote
>
> I don't have any additional comments on the patch except one minor nit,
> maybe merge the following two if conditions into one, not a strong opinion
> though.
>
> if (use_cache)
> {
>     /*
>      * The snapshot was registered once when the cache entry was created.
>      * We just patch curcid to reflect the new command counter.
>      * SnapshotSetCommandId() only patches process-global statics, not
>      * registered copies, so we do it directly.
>      *
>      * The xmin/xmax/xip fields don't need refreshing: within a single
>      * statement batch, only curcid changes between rows.
>      */
>     Assert(fpentry && fpentry->snapshot != NULL);
>     snapshot = fpentry->snapshot;
>     snapshot->curcid = GetCurrentCommandId(false);
> }
> else
>     snapshot = RegisterSnapshot(GetLatestSnapshot());
>
> if (use_cache)
> {
>     pk_rel = fpentry->pk_rel;
>     idx_rel = fpentry->idx_rel;
>     scandesc = fpentry->scandesc;
>     slot = fpentry->slot;
> }
> else
> {
>     pk_rel = table_open(riinfo->pk_relid, RowShareLock);
>     idx_rel = index_open(riinfo->conindid, AccessShareLock);
>     scandesc = index_beginscan(pk_rel, idx_rel,
>                                snapshot, NULL,
>                                riinfo->nkeys, 0);
>     slot = table_slot_create(pk_rel, NULL);
> }
>
> --
> Regards
> Junwang Zhao
--
Thanks, Amit Langote
0001-Add-fast-path-for-foreign-key-constraint-checks.patch
0004-Use-SK_SEARCHARRAY-for-batched-fast-path-FK-probes.patch
0003-Buffer-FK-rows-for-batched-fast-path-probing.patch
0002-Cache-per-batch-resources-for-fast-path-foreign-key-.patch
