Fix logging for invalid recovery timeline
Hackers, I noticed while debugging a user issue that the checkpoint logged in "Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X." is being pulled from the value stored in the control file rather than the value read from backup_label (in the case where there is a backup_label). This only affects logging since the timeline check is done against the checkpoint/TLI as read from backup_label. This patch updates the checkpoint and TLI to (what I believe are) the correct values for logging. I think this should also be back-patched. Regards, -David
From f14e30b18cde216131bd3e069ee8ecd5da3301b0 Mon Sep 17 00:00:00 2001 From: David Steele Date: Fri, 20 Dec 2024 15:13:59 + Subject: Fix logging for invalid recovery timeline. If the requested recovery timeline is not reachable, the logged checkpoint and TLI should be the values read from backup_label when backup_label is present. --- src/backend/access/transam/xlogrecovery.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index c6994b78282..99afd01bf38 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -844,13 +844,13 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr, * tliSwitchPoint will throw an error if the checkpoint's timeline is * not in expectedTLEs at all. */ - switchpoint = tliSwitchPoint(ControlFile->checkPointCopy.ThisTimeLineID, expectedTLEs, NULL); + switchpoint = tliSwitchPoint(CheckPointTLI, expectedTLEs, NULL); ereport(FATAL, (errmsg("requested timeline %u is not a child of this server's history", recoveryTargetTLI), errdetail("Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.", - LSN_FORMAT_ARGS(ControlFile->checkPoint), - ControlFile->checkPointCopy.ThisTimeLineID, + LSN_FORMAT_ARGS(CheckPointLoc), + CheckPointTLI, LSN_FORMAT_ARGS(switchpoint)))); } -- 2.34.1
Re: FileFallocate misbehaving on XFS
Hi, On 2024-12-19 17:47:13 +1100, Michael Harris wrote: > I finally managed to get the patched version installed in a production > database where the error is occurring very regularly. Thanks! > Here is a sample of the output: > > 2024-12-19 01:08:50 CET [2533222]: LOG: mdzeroextend FileFallocate > failing with ENOSPC: free space for filesystem containing > "pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks: > 2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files: > 1073741376, f_ffree: 1069933796 That's ~700 GB of free space... It'd be interesting to see filefrag -v for that segment. > This is a different system to those I previously provided logs from. > It is also running RHEL8 with a similar configuration to the other > system. Given it's a RHEL system, have you raised this as an issue with RH? They probably have somebody with actual XFS hacking experience on staff. RH's kernels are *heavily* patched, so it's possible the issue is actually RH specific. > I have so far not installed the bpftrace that Jakub suggested before - > as I say this is a production machine and I am wary of triggering a > kernel panic or worse (even though it seems like the risk for that > would be low?). While a kernel stack trace would no doubt be helpful > to the XFS developers, from a postgres point of view, would that be > likely to help us decide what to do about this? Well, I'm personally wary of installing workarounds for a problem I don't understand and can't reproduce, which might be specific to old filesystems and/or heavily patched kernels. This clearly is an FS bug. That said, if we learn that somehow this is a fundamental XFS issue that can be triggered on every XFS filesystem, with current kernels, it becomes more reasonable to implement a workaround in PG. Another thing I've been wondering about is if we could reduce the frequency of hitting problems by rounding up the number of blocks we extend by to powers of two. That would probably reduce fragmentation, and the extra space would be quickly used in workloads where we extend by a bunch of blocks at once, anyway. Greetings, Andres Freund
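A minimal sketch of that last idea (hypothetical, not from any posted patch; pg_nextpower2_32() is the existing helper in src/include/port/pg_bitutils.h, and "cap" stands in for whatever upper bound would be chosen):

#include "port/pg_bitutils.h"

/*
 * Hypothetical helper: round a multi-block extension request up to the
 * next power of two, so that repeated bulk extensions produce fewer and
 * better-aligned fallocate() calls.
 */
static uint32
round_extend_blocks(uint32 nblocks, uint32 cap)
{
	uint32		rounded;

	/* leave single-block extensions alone */
	if (nblocks <= 1)
		return nblocks;

	rounded = pg_nextpower2_32(nblocks);

	return Min(rounded, cap);
}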
Document How Commit Handles Aborted Transactions
Hi, The commit reference page lacks an "Outputs" section even though it is capable of outputting both "COMMIT" and "ROLLBACK". The attached adds this section, describes when each applies, and then incorporates the same into the main description for commit as well as the transaction section of the tutorial - which presently seems to be the main discussion area for the topic (the Concurrency Control chapter lacks a section for this introductory material). This was noted as being needed by Tom Lane back in 2006 but it never happened. https://www.postgresql.org/message-id/28798.1142608067%40sss.pgh.pa.us It came up again when I was answering a question on Slack regarding "commit and chain" wondering whether the "and chain" could be made conditional (i.e., could the new transaction start aborted) on whether commit outputted "commit" or "rollback". It's left implied that this behavior of "rollback" is standard-conforming. Please feel free to suggest/add language to the Compatibility section if this is not the case. David J. From f9c40e72c8e62ae7b364a82d26a0f73995ef5082 Mon Sep 17 00:00:00 2001 From: "David G. Johnston" Date: Fri, 20 Dec 2024 08:40:47 -0700 Subject: [PATCH] doc: Commit performs rollback of aborted transactions The Commit command handles an aborted transaction in the same manner as the Rollback command. This needs to be documented. Add it to both the reference page as well as mentioning the behavior in the official material for introducing transactions - the tutorial. In passing, make the description of the Commit reference page self-contained by mentioning the 'and chain' behavior. --- doc/src/sgml/advanced.sgml | 18 +++--- doc/src/sgml/ref/commit.sgml | 33 + 2 files changed, 48 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/advanced.sgml b/doc/src/sgml/advanced.sgml index 755c9f1485..b5cae2335c 100644 --- a/doc/src/sgml/advanced.sgml +++ b/doc/src/sgml/advanced.sgml @@ -149,7 +149,8 @@ DETAIL: Key (city)=(Berkeley) is not present in table "cities". systems. The essential point of a transaction is that it bundles multiple steps into a single, all-or-nothing operation. The intermediate states between the steps are not visible to other concurrent transactions, -and if some failure occurs that prevents the transaction from completing, +and if some failure +occurs that prevents the transaction from completing, then none of the steps affect the database at all. @@ -218,7 +219,8 @@ UPDATE branches SET balance = balance + 100.00 In PostgreSQL, a transaction is set up by surrounding the SQL commands of the transaction with -BEGIN and COMMIT commands. So our banking +<xref linkend="sql-begin"/> and +<xref linkend="sql-commit"/> commands. So our banking transaction would actually look like: @@ -233,7 +235,7 @@ COMMIT; If, partway through the transaction, we decide we do not want to commit (perhaps we just noticed that Alice's balance went negative), -we can issue the command ROLLBACK instead of +we can issue the command <xref linkend="sql-rollback"/> instead of COMMIT, and all our updates so far will be canceled. @@ -256,6 +258,16 @@ COMMIT; + +When a failure does occur during a transaction, it is not ended but instead +goes into an aborted state. While in this state all commands except +<xref linkend="sql-commit"/> and <xref linkend="sql-rollback"/> are ignored and, +importantly, both those commands behave identically - they roll back and close +the current transaction and return the session to a state where new commands can +be issued. They will also automatically begin a new transaction if executed +with an AND CHAIN parameter. 
+ + It's possible to control the statements in a transaction in a more granular fashion through the use of savepoints. Savepoints diff --git a/doc/src/sgml/ref/commit.sgml b/doc/src/sgml/ref/commit.sgml index 7e2dcac5a3..e4ef573fe5 100644 --- a/doc/src/sgml/ref/commit.sgml +++ b/doc/src/sgml/ref/commit.sgml @@ -33,6 +33,19 @@ COMMIT [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ] changes made by the transaction become visible to others and are guaranteed to be durable if a crash occurs. + + If no changes have been made - because the transaction is in an + aborted state - the effect of the commit will look like a rollback, + including the command tag output. + + + In either situation, if the AND CHAIN parameter is + specified, a new, identically configured, transaction is started. + + + For more information regarding transactions see + <xref linkend="tutorial-transactions"/>. + @@ -67,6 +80,25 @@ COMMIT [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ] + + Outputs + + + On successful completion of a non-aborted transaction, a COMMIT + command returns a command tag of the form + +COMMIT + + + + However, if the transaction being affected is aborted, a COMMIT + command returns a command tag of the form + +ROLLBACK + + + + Notes + @@ -107,6 +139,7 @@ CO
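For readers following along, the behavior being documented is easy to see in psql (a minimal demonstration; command tags shown as comments):

BEGIN;
SELECT 1/0;   -- ERROR:  division by zero; the transaction is now aborted
COMMIT;       -- the server reports ROLLBACK, not COMMIT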
Re: Sort functions with specialized comparators
> On 16 Dec 2024, at 14:02, John Naylor wrote: > > Sorry, I forgot this part earlier. Yes, let's have the private function. PFA v6. I was poking around intarray and trying not to bash everything. It seems to me that overall code readability should be seriously reworked, even if no one is going to invent anything new around this code. It looks like overall project code standards have matured far beyond this, but this particular extension has been forgotten. There are a lot of useless checks and optimizations for n < 2, copying of data to new allocations when it's not necessary, small inconsistencies, etc. I don't think it's a matter of this particular thread though. Best regards, Andrey Borodin. v6-0001-Use-specialized-sort-facilities.patch Description: Binary data v6-0003-prefer-non-resizing-to-constructing-empty-array.patch Description: Binary data v6-0002-Use-PREPAREARR-where-possible.patch Description: Binary data
Re: FileFallocate misbehaving on XFS
On Thu, Dec 19, 2024 at 7:49 AM Michael Harris wrote: > Hello, > > I finally managed to get the patched version installed in a production > database where the error is occurring very regularly. > > Here is a sample of the output: > > 2024-12-19 01:08:50 CET [2533222]: LOG: mdzeroextend FileFallocate > failing with ENOSPC: free space for filesystem containing > "pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks: > 2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files: > 1073741376, f_ffree: 1069933796 [..] > I have attached a file containing all the errors I collected. The > error is happening pretty regularly - over 400 times in a ~6 hour > period. The number of blocks being extended varies from ~9 to ~15, and > the statfs result shows plenty of available space & inodes at the > time. The errors do seem to come in bursts. > I couldn't resist: you seem to have entered the quantum realm of free disk space AKA Schrodinger's free space: you both have the space and don't have it... ;) No one else has responded, so I'll try. My take is that we have gotten a very limited number of reports (2-3) of this stuff happening and it always seems to be >90% space used, yet the adoption of PG16 is rising, so we may or may not see more errors of this kind, but on the other side of things: its frequency is so low that it's really wild we don't see more reports like this one. Lots of OS upgrades in the wild are performed by building new standbys (maybe that lowers the fs fragmentation), rather than in-place OS upgrades. To me it sounds like a new bug in XFS that is rare. You can probably live with #undef HAVE_POSIX_FALLOCATE as a way to survive; another would probably be to try running xfs_fsr to defragment the fs. Longer-term: other than collecting the eBPF data to start digging from where it is really triggered, I don't see a way forward. It would be suboptimal to just abandon fallocate() optimizations from commit 31966b151e6ab7a6284deab6e8fe5faddaf2ae4c, just because of a very unusual combination of factors (XFS bug). Well, we could have some kludge like this pseudo-code: if(posix_falloc() == ENOSPC && statfs().free_space_pct >= 1) fallback_to_pwrites(), but it is ugly. Another is a GUC (or even two -- how much to extend, or whether to use posix_fallocate() at all), but people do not like more GUCs... > I have so far not installed the bpftrace that Jakub suggested before - > as I say this is a production machine and I am wary of triggering a > kernel panic or worse (even though it seems like the risk for that > would be low?). While a kernel stack trace would no doubt be helpful > to the XFS developers, from a postgres point of view, would that be > likely to help us decide what to do about this?[..] Well, you could try having a reproduction outside of production, or even clone the storage (but not using backup/restore), but literally clone the XFS LUNs on the storage itself and connect those to a separate VM to have a safe testbed (or even use dd(1) of some smaller XFS fs exhibiting such behaviour to some other place). As for eBPF/bpftrace: it is safe (it's sandboxed anyway), lots of customers are using it, but as always YMMV. There's also xfs_fsr that might help overcome it. You can also experiment with whether -o allocsize helps, or just even try -o allocsize=0 (but that might have some negative effects on performance probably) -J.
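Spelled out, the kludge sketched above might look roughly like this (purely hypothetical C, not proposed code; extend_with_zero_writes() is an imaginary stand-in for the pre-16 write-based extension path):

#include <errno.h>
#include <fcntl.h>
#include <sys/statvfs.h>

/* imaginary fallback that extends the file with plain zero-filled writes */
static int	extend_with_zero_writes(int fd, off_t offset, off_t len);

static int
extend_with_fallback(int fd, off_t offset, off_t len, const char *path)
{
	int			rc = posix_fallocate(fd, offset, len);

	if (rc == ENOSPC)
	{
		struct statvfs fs;

		/* >= 1% of the filesystem really free?  Assume an XFS quirk. */
		if (statvfs(path, &fs) == 0 &&
			fs.f_bavail * 100 >= fs.f_blocks)
			return extend_with_zero_writes(fd, offset, len);
	}

	return rc;
}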
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.
On 12/7/2024 12:42 AM, Devulapalli, Raghuveer wrote: [0] https://cirrus-ci.com/task/6023394207989760 [1] https://cirrus-ci.com/task/5460444254568448 [2] https://cirrus-ci.com/task/6586344161411072 I was able to fix [0] and [1], but I can't think of why [2] fails. When I tried to reproduce this locally, I get a different unrelated error. Any idea why I am seeing this? LINK : fatal error LNK1181: cannot open input file 'C:\Program Files\Git\nologo' Commands: meson setup build && cd build && meson compile Hello! I'm Matthew Sterrett and I'm a coworker of Raghuveer; he asked me to look into the Windows build failures related to pg_comp_crc32c. It seems that the only thing that was required to fix that is to mark pg_comp_crc32c as PGDLLIMPORT, so I added a patch that does just that. I'm new to working with mailing lists, so please tell me if I messed anything up! Matthew Sterrett From 74d085d44d41af8ffb01f7bf2377ac487c7d4cc1 Mon Sep 17 00:00:00 2001 From: Paul Amonson Date: Mon, 6 May 2024 08:34:17 -0700 Subject: [PATCH v10 1/4] Add a Postgres SQL function for crc32c benchmarking. Add a drive_crc32c() function to use for benchmarking crc32c computation. The function takes 2 arguments: (1) count: num of times CRC32C is computed in a loop. (2) num: #bytes in the buffer to calculate crc over. Signed-off-by: Paul Amonson Signed-off-by: Raghuveer Devulapalli --- src/test/modules/meson.build | 1 + src/test/modules/test_crc32c/Makefile | 20 src/test/modules/test_crc32c/meson.build | 22 + .../modules/test_crc32c/test_crc32c--1.0.sql | 1 + src/test/modules/test_crc32c/test_crc32c.c| 47 +++ .../modules/test_crc32c/test_crc32c.control | 4 ++ 6 files changed, 95 insertions(+) create mode 100644 src/test/modules/test_crc32c/Makefile create mode 100644 src/test/modules/test_crc32c/meson.build create mode 100644 src/test/modules/test_crc32c/test_crc32c--1.0.sql create mode 100644 src/test/modules/test_crc32c/test_crc32c.c create mode 100644 src/test/modules/test_crc32c/test_crc32c.control diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build index c829b61953..68d8904dd0 100644 --- a/src/test/modules/meson.build +++ b/src/test/modules/meson.build @@ -15,6 +15,7 @@ subdir('ssl_passphrase_callback') subdir('test_bloomfilter') subdir('test_copy_callbacks') subdir('test_custom_rmgrs') +subdir('test_crc32c') subdir('test_ddl_deparse') subdir('test_dsa') subdir('test_dsm_registry') diff --git a/src/test/modules/test_crc32c/Makefile b/src/test/modules/test_crc32c/Makefile new file mode 100644 index 00..5b747c6184 --- /dev/null +++ b/src/test/modules/test_crc32c/Makefile @@ -0,0 +1,20 @@ +MODULE_big = test_crc32c +OBJS = test_crc32c.o +PGFILEDESC = "test" +EXTENSION = test_crc32c +DATA = test_crc32c--1.0.sql + +first: all + +# test_crc32c.o: CFLAGS+=-g + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = src/test/modules/test_crc32c +top_builddir = ../../../.. 
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/src/test/modules/test_crc32c/meson.build b/src/test/modules/test_crc32c/meson.build new file mode 100644 index 00..7021a6d6cf --- /dev/null +++ b/src/test/modules/test_crc32c/meson.build @@ -0,0 +1,22 @@ +# Copyright (c) 2022-2024, PostgreSQL Global Development Group + +test_crc32c_sources = files( + 'test_crc32c.c', +) + +if host_system == 'windows' + test_crc32c_sources += rc_lib_gen.process(win32ver_rc, extra_args: [ +'--NAME', 'test_crc32c', +'--FILEDESC', 'test_crc32c - test code for crc32c library',]) +endif + +test_crc32c = shared_module('test_crc32c', + test_crc32c_sources, + kwargs: pg_test_mod_args, +) +test_install_libs += test_crc32c + +test_install_data += files( + 'test_crc32c.control', + 'test_crc32c--1.0.sql', +) diff --git a/src/test/modules/test_crc32c/test_crc32c--1.0.sql b/src/test/modules/test_crc32c/test_crc32c--1.0.sql new file mode 100644 index 00..32f8f0fb2e --- /dev/null +++ b/src/test/modules/test_crc32c/test_crc32c--1.0.sql @@ -0,0 +1 @@ +CREATE FUNCTION drive_crc32c (count int, num int) RETURNS bigint AS 'test_crc32c.so' LANGUAGE C; diff --git a/src/test/modules/test_crc32c/test_crc32c.c b/src/test/modules/test_crc32c/test_crc32c.c new file mode 100644 index 00..b350caf5ce --- /dev/null +++ b/src/test/modules/test_crc32c/test_crc32c.c @@ -0,0 +1,47 @@ +/* select drive_crc32c(100, 1024); */ + +#include "postgres.h" +#include "fmgr.h" +#include "port/pg_crc32c.h" +#include "common/pg_prng.h" + +PG_MODULE_MAGIC; + +/* + * drive_crc32c(count: int, num: int) returns bigint + * + * count is the number of loops to perform + * + * num is the number of bytes in the buffer to calculate + * crc32c over. + */ +PG_FUNCTION_INFO_V1(drive_crc32c); +Datum +drive_crc32c(PG_FUNCTION_ARGS) +{ + int64
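For reference, the PGDLLIMPORT fix described above amounts to a one-line change of this shape in src/include/port/pg_crc32c.h (a sketch for the runtime-dispatch case; the exact declaration depends on which CRC implementation the build selects):

/* export the function pointer so that extension DLLs can link to it */
extern PGDLLIMPORT pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc,
												const void *data,
												size_t len);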
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Thu, Dec 19, 2024 at 7:50 PM, Melanie Plageman < melanieplage...@gmail.com> wrote: > On Wed, Dec 18, 2024 at 9:50 PM Richard Guo > wrote: > > > > On Thu, Dec 19, 2024 at 8:18 AM Melanie Plageman > > wrote: > > > I pushed the straightforward option for now so that it's fixed. > > > > I think this binary search code now has a risk of underflow. If 'mid' > > is calculated as zero, the second 'if' branch will cause 'end' to > > underflow. > > Thanks, Richard! > > > Maybe we need to do something like below. > > > > --- a/src/backend/access/heap/heapam_handler.c > > +++ b/src/backend/access/heap/heapam_handler.c > > @@ -2600,7 +2600,11 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer > buffer, > > if (tupoffset == curoffset) > > return true; > > else if (tupoffset < curoffset) > > + { > > + if (mid == 0) > > + return false; > > end = mid - 1; > > + } > > else > > start = mid + 1; > > } > > > > Alternatively, we can revert 'start' and 'end' to signed int as they > > were before. > > What about this instead: > > diff --git a/src/backend/access/heap/heapam_handler.c > b/src/backend/access/heap/heapam_handler.c > index 2da4e4da13e..fb90fd869c2 100644 > --- a/src/backend/access/heap/heapam_handler.c > +++ b/src/backend/access/heap/heapam_handler.c > @@ -2574,11 +2574,8 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer > buffer, > > if (scan->rs_flags & SO_ALLOW_PAGEMODE) > { > - uint32 start, > - end; > - > - if (hscan->rs_ntuples == 0) > - return false; > + uint32 start = 0, > + end = hscan->rs_ntuples; > > /* > * In pageatatime mode, heap_prepare_pagescan() already did > visibility > @@ -2589,10 +2586,8 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer > buffer, > * in increasing order, but it's not clear that there would be > enough > * gain to justify the restriction. > */ > - start = 0; > - end = hscan->rs_ntuples - 1; > > - while (start <= end) > + while (start < end) > { > uint32 mid = (start + end) / 2; > OffsetNumber curoffset = hscan->rs_vistuples[mid]; > @@ -2600,7 +2595,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer > buffer, > if (tupoffset == curoffset) > return true; > else if (tupoffset < curoffset) > - end = mid - 1; > + end = mid; > else > start = mid + 1; > } > > Or to make it easier to read, here: > > uint32 start = 0, > end = hscan->rs_ntuples; > > while (start < end) > { > uint32 mid = (start + end) / 2; > OffsetNumber curoffset = hscan->rs_vistuples[mid]; > > if (tupoffset == curoffset) > return true; > else if (tupoffset < curoffset) > end = mid; > else > start = mid + 1; > } > > I think this gets rid of any subtraction and is still the same. > Looks good to me. I think what you propose can get rid of the early test too. Note the way we can avoid an overflow in the mid calculation. diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 9f17baea5d..bd1335276a 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -2577,9 +2577,6 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer, uint32 start, end; - if (hscan->rs_ntuples == 0) - return false; - /* * In pageatatime mode, heap_prepare_pagescan() already did visibility * checks, so just look at the info it left in rs_vistuples[]. @@ -2590,17 +2587,17 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer, * gain to justify the restriction. 
*/ start = 0; - end = hscan->rs_ntuples - 1; + end = hscan->rs_ntuples; - while (start <= end) + while (start < end) { - uint32 mid = (start + end) / 2; + uint32 mid = (start + end) >> 1; OffsetNumber curoffset = hscan->rs_vistuples[mid]; if (tupoffset == curoffset) return true; else if (tupoffset < curoffset) - end = mid - 1; + end = mid; else start = mid + 1; } best regards, Ranier Vilela
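Putting the suggestions together, the loop becomes a conventional half-open-interval binary search; a consolidated sketch (using the start + (end - start) / 2 form of the midpoint, which cannot overflow for any uint32 values; and since end is only ever assigned mid, never mid - 1, no underflow is possible either):

	uint32		start = 0;
	uint32		end = hscan->rs_ntuples;	/* one past the last entry */

	while (start < end)
	{
		uint32		mid = start + (end - start) / 2;
		OffsetNumber curoffset = hscan->rs_vistuples[mid];

		if (tupoffset == curoffset)
			return true;
		else if (tupoffset < curoffset)
			end = mid;			/* mid itself is now excluded */
		else
			start = mid + 1;
	}

	return false;

If rs_ntuples is zero the loop body never runs, which is why the early return can go away.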
Re: A few patches to clarify snapshot management
On 16/12/2024 23:56, Nathan Bossart wrote: On Mon, Dec 16, 2024 at 12:06:33PM +0200, Heikki Linnakangas wrote: While working on the CSN snapshot patch, I got sidetracked looking closer into the snapshot tracking in snapmgr.c. Attached are a few patches to clarify some things. I haven't yet looked closely at what you are proposing, but big +1 from me for the general idea. I recently found myself wishing for a lot more commentary about this stuff [0]. [0] https://postgr.es/m/Z0dB1ld2iPcS6nC9%40nathan While playing around some more with this, I noticed that this code in GetTransactionSnapshot() is never reached, and AFAICS has always been dead code: Snapshot GetTransactionSnapshot(void) { /* * Return historic snapshot if doing logical decoding. We'll never need a * non-historic transaction snapshot in this (sub-)transaction, so there's * no need to be careful to set one up for later calls to * GetTransactionSnapshot(). */ if (HistoricSnapshotActive()) { Assert(!FirstSnapshotSet); return HistoricSnapshot; } when you think about it, that's good, because it doesn't really make sense to call GetTransactionSnapshot() during logical decoding. We jump through hoops to make the historic catalog decoding possible with historic snapshots, tracking subtransactions that modify catalogs and WAL-logging command ids, but they're not suitable for general purpose queries. So I think we should turn that into an error, per attached patch. Another observation is that we only ever use regular MVCC snapshots as active snapshots. I added a "Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);" to PushActiveSnapshotWithLevel() and all regression tests passed. That's also good, because we assumed that much in a few places anyway: there are a couple of calls that amount to "XidInMVCCSnapshot(..., GetActiveSnapshot()"), in find_inheritance_children_extended() and RelationGetPartitionDesc(). We could add comments and that assertion to make that assumption explicit. And that thought takes me deeper down the rabbit hole: /* * Struct representing all kind of possible snapshots. * * There are several different kinds of snapshots: * * Normal MVCC snapshots * * MVCC snapshots taken during recovery (in Hot-Standby mode) * * Historic MVCC snapshots used during logical decoding * * snapshots passed to HeapTupleSatisfiesDirty() * * snapshots passed to HeapTupleSatisfiesNonVacuumable() * * snapshots used for SatisfiesAny, Toast, Self where no members are * accessed. * * TODO: It's probably a good idea to split this struct using a NodeTag * similar to how parser and executor nodes are handled, with one type for * each different kind of snapshot to avoid overloading the meaning of * individual fields. */ typedef struct SnapshotData I'm thinking of implementing that TODO, splitting SnapshotData into separate structs like MVCCSnapshotData, SnapshotDirtyData, etc. It seems to me most places can assume that you're dealing with MVCC snapshots, and if we had separate types for them, could be using MVCCSnapshot instead of the generic Snapshot. Only the table and index AM functions need to deal with non-MVCC snapshots. -- Heikki Linnakangas Neon (https://neon.tech) From ec248c69cb42a0747ecc6a63ac4e4682cce2ee93 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 20 Dec 2024 18:37:44 +0200 Subject: [PATCH 1/1] Don't allow GetTransactionSnapshot() in logical decoding A historic snapshot should only be used for catalog access, not general queries. 
We never call GetTransactionSnapshot() during logical decoding, which is good because it wouldn't be very sensible, so the code to deal with that was unreachable and untested. Turn it into an error, to make sure we don't start doing that in the future either. --- src/backend/utils/time/snapmgr.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c index e60360338d5..3717869f736 100644 --- a/src/backend/utils/time/snapmgr.c +++ b/src/backend/utils/time/snapmgr.c @@ -212,16 +212,12 @@ Snapshot GetTransactionSnapshot(void) { /* - * Return historic snapshot if doing logical decoding. We'll never need a - * non-historic transaction snapshot in this (sub-)transaction, so there's - * no need to be careful to set one up for later calls to - * GetTransactionSnapshot(). + * This should not be called while doing logical decoding. Historic + * snapshots are only usable for catalog access, not for general-purpose + * queries. */ if (HistoricSnapshotActive()) - { - Assert(!FirstSnapshotSet); - return HistoricSnapshot; - } + elog(ERROR, "cannot take query snapshot during logical decoding"); /* First call in transaction? */ if (!FirstSnapshotSet) -- 2.39.5
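To make the TODO concrete, the split might start out with something like this (purely illustrative; the tag name is made up and the field set is borrowed from today's SnapshotData):

typedef struct MVCCSnapshotData
{
	NodeTag		type;			/* hypothetical T_MVCCSnapshot */
	TransactionId xmin;			/* all XID < xmin are visible to me */
	TransactionId xmax;			/* all XID >= xmax are invisible to me */
	TransactionId *xip;			/* in-progress XIDs at snapshot time */
	uint32		xcnt;			/* # of xact ids in xip[] */
	CommandId	curcid;			/* in my xact, CID < curcid are visible */
	/* ... subxip, suboverflowed, copied, etc. as in SnapshotData ... */
} MVCCSnapshotData;

typedef MVCCSnapshotData *MVCCSnapshot;

Callers that today pass around the generic Snapshot could then take an MVCCSnapshot, and only the table and index AM boundary would need to deal with the other kinds.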
Re: AIO v2.0
On Fri, 20 Dec 2024 at 01:54, Andres Freund wrote: > Arguably the configuration *did* tell us, by having a higher hard limit... > > But opting into a higher rlimit, while obviously adhering to the hard limit > and perhaps some other config knob, seems fine? Yes, totally fine. That's exactly the reasoning why the hard limit is so much larger than the soft limit by default on systems with systemd: https://0pointer.net/blog/file-descriptor-limits.html
Re: AIO v2.0
Hi, On 2024-12-20 18:27:13 +0100, Jelte Fennema-Nio wrote: > On Fri, 20 Dec 2024 at 01:54, Andres Freund wrote: > > Arguably the configuration *did* tell us, by having a higher hard limit... > > > > But opting into a higher rlimit, while obviously adhering to the hard limit > > and perhaps some other config knob, seems fine? > > Yes, totally fine. That's exactly the reasoning why the hard limit is > so much larger than the soft limit by default on systems with systemd: > > https://0pointer.net/blog/file-descriptor-limits.html Good link. This isn't just relevant for using io_uring: There obviously are several people working on threaded postgres. Even if we didn't duplicate fd.c file descriptors between threads (we probably will, at least initially), the client connection FDs alone will mean that we have a lot more FDs open. Due to the select() issue the soft limit won't be increased beyond 1024, requiring everyone to add a 'ulimit -n $somehighnumber' before starting postgres on linux doesn't help anyone. Greetings, Andres Freund
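In code terms, the opt-in being discussed is a small getrlimit()/setrlimit() dance at startup; a sketch (the "desired" value stands in for whatever config knob would control this):

#include <stdbool.h>
#include <sys/resource.h>

/* Raise the soft RLIMIT_NOFILE toward "desired", never past the hard limit. */
static bool
raise_open_file_limit(rlim_t desired)
{
	struct rlimit rlim;

	if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
		return false;

	if (rlim.rlim_cur >= desired)
		return true;			/* already high enough */

	/* respect the hard limit configured by the administrator */
	rlim.rlim_cur = (desired < rlim.rlim_max) ? desired : rlim.rlim_max;

	return setrlimit(RLIMIT_NOFILE, &rlim) == 0;
}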
Re: per backend I/O statistics
Hi, On Fri, Dec 20, 2024 at 08:00:00AM +0200, Alexander Lakhin wrote: > Hello Michael, > > 19.12.2024 06:21, Michael Paquier wrote: > > Fixed that, bumped the two version counters, and done. > > Could you, please, look at recent failures produced by grassquit (which > has fsync = on in its config), on an added test case? For instance, [1]: > --- > /home/bf/bf-build/grassquit/HEAD/pgsql/src/test/regress/expected/stats.out > 2024-12-19 04:44:08.779311933 + > +++ > /home/bf/bf-build/grassquit/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out > 2024-12-19 16:37:41.351784840 + > @@ -1333,7 +1333,7 @@ > AND :my_io_sum_shared_after_fsyncs= 0); > ?column? > -- > - t > + f > (1 row) > > The complete query is: > SELECT current_setting('fsync') = 'off' > OR (:my_io_sum_shared_after_fsyncs = :my_io_sum_shared_before_fsyncs > AND :my_io_sum_shared_after_fsyncs= 0); > > And the corresponding query in 027_stream_regress_primary.log is: > 2024-12-19 16:37:39.907 UTC [4027467][client backend][15/1980:0] LOG: > statement: SELECT current_setting('fsync') = 'off' > OR (1 = 1 > AND 1= 0); > > (I can reproduce this locally with an asan-enabled build too.) > > [1] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-12-19%2016%3A28%3A58 Thanks for the report! I was not able to reproduce (even with an asan-enabled build) but I think the test is wrong. Indeed the backend could fsync too during the test (see register_dirty_segment() and the case where the request queue is full). I think the test should look like the attached instead, thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From a83def4ca95c154e824378452cb15f100887a56c Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 20 Dec 2024 08:59:53 + Subject: [PATCH v1] Fix per-backend IO stats regression test The tests added in 9aea73fc61 did not take into account that the backend can fsync in some circumstances. --- src/test/regress/expected/stats.out | 3 +-- src/test/regress/sql/stats.sql | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-) 50.0% src/test/regress/expected/ 50.0% src/test/regress/sql/ diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out index 150b6dcf74..d18fc7c081 100644 --- a/src/test/regress/expected/stats.out +++ b/src/test/regress/expected/stats.out @@ -1329,8 +1329,7 @@ SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes; (1 row) SELECT current_setting('fsync') = 'off' - OR (:my_io_sum_shared_after_fsyncs = :my_io_sum_shared_before_fsyncs - AND :my_io_sum_shared_after_fsyncs= 0); + OR :my_io_sum_shared_after_fsyncs >= :my_io_sum_shared_before_fsyncs; ?column? -- t diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql index 1e7d0ff665..fc0d2fde31 100644 --- a/src/test/regress/sql/stats.sql +++ b/src/test/regress/sql/stats.sql @@ -648,8 +648,7 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs WHERE object = 'relation' \gset my_io_sum_shared_after_ SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes; SELECT current_setting('fsync') = 'off' - OR (:my_io_sum_shared_after_fsyncs = :my_io_sum_shared_before_fsyncs - AND :my_io_sum_shared_after_fsyncs= 0); + OR :my_io_sum_shared_after_fsyncs >= :my_io_sum_shared_before_fsyncs; -- Change the tablespace so that the table is rewritten directly, then SELECT -- from it to cause it to be read back into shared buffers. -- 2.34.1
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Mon, Dec 16, 2024 at 4:10 PM Nisha Moond wrote: > > Here is the v56 patch set with the above comments incorporated. > Review comments: === 1. + { + {"idle_replication_slot_timeout", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the duration a replication slot can remain idle before " + "it is invalidated."), + NULL, + GUC_UNIT_MS + }, + &idle_replication_slot_timeout_ms, I think users are going to keep the idle slot timeout at least in hours. So, milliseconds seem the wrong choice to me. I suggest keeping the units in minutes. I understand that writing a test would be challenging as spending a minute or more on one test is not advisable. But I don't see any test testing the other GUCs that are in minutes (wal_summary_keep_time and log_rotation_age). The default value should be one day. 2. + /* + * An error is raised if error_if_invalid is true and the slot is found to + * be invalid. + */ + if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE) + { + StringInfoData err_detail; + + initStringInfo(&err_detail); + + switch (s->data.invalidated) + { + case RS_INVAL_WAL_REMOVED: + appendStringInfo(&err_detail, _("This slot has been invalidated because the required WAL has been removed.")); + break; + + case RS_INVAL_HORIZON: + appendStringInfo(&err_detail, _("This slot has been invalidated because the required rows have been removed.")); + break; + + case RS_INVAL_WAL_LEVEL: + /* translator: %s is a GUC variable name */ + appendStringInfo(&err_detail, _("This slot has been invalidated because \"%s\" is insufficient for slot."), + "wal_level"); + break; + + case RS_INVAL_NONE: + pg_unreachable(); + } + + ereport(ERROR, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("can no longer get changes from replication slot \"%s\"", +NameStr(s->data.name)), + errdetail_internal("%s", err_detail.data)); + } + This should be moved to a separate function. 3. +static inline bool +IsSlotIdleTimeoutPossible(ReplicationSlot *s) Would it be better to name this function as CanInvalidateIdleSlot()? The current name doesn't seem to match other similar functions. -- With Regards, Amit Kapila.
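For point 2, the extraction might look like this (a sketch that simply wraps the quoted hunk in a helper; the function name is a placeholder, not from any posted patch):

static void
RaiseSlotInvalidationError(ReplicationSlot *s)
{
	StringInfoData err_detail;

	Assert(s->data.invalidated != RS_INVAL_NONE);

	initStringInfo(&err_detail);

	switch (s->data.invalidated)
	{
		case RS_INVAL_WAL_REMOVED:
			appendStringInfo(&err_detail, _("This slot has been invalidated because the required WAL has been removed."));
			break;

		case RS_INVAL_HORIZON:
			appendStringInfo(&err_detail, _("This slot has been invalidated because the required rows have been removed."));
			break;

		case RS_INVAL_WAL_LEVEL:
			/* translator: %s is a GUC variable name */
			appendStringInfo(&err_detail, _("This slot has been invalidated because \"%s\" is insufficient for slot."),
							 "wal_level");
			break;

		case RS_INVAL_NONE:
			pg_unreachable();
	}

	ereport(ERROR,
			errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
			errmsg("can no longer get changes from replication slot \"%s\"",
				   NameStr(s->data.name)),
			errdetail_internal("%s", err_detail.data));
}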
Re: per backend I/O statistics
Hi, On Thu, Dec 19, 2024 at 06:12:04AM +, Bertrand Drouvot wrote: > On Thu, Dec 19, 2024 at 01:21:54PM +0900, Michael Paquier wrote: > > bumped the two version counters, and done. > >The existing structure could be expanded in the > >future to add more information about other statistics related to > >backends, depending on requirements or ideas. BTW, now that the per backend I/O statistics is done, I'll start working on per backend wal statistics. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Parametrization minimum password lenght
On Thu, Dec 19, 2024 at 09:57:31AM -0600, Nathan Bossart wrote: > On Thu, Dec 19, 2024 at 09:36:17AM -0600, Nathan Bossart wrote: > > On Thu, Dec 19, 2024 at 07:25:30AM +, Bertrand Drouvot wrote: > >> - errmsg("password is too short"))); > >> + errmsg("password is too short: %d (< > >> %d)", pwdlen, min_password_length))); > > > > I think we should explicitly point to the parameter and note its current > > value. > > Like so. LGTM. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: per backend I/O statistics
On Fri, Dec 20, 2024 at 09:09:00AM +, Bertrand Drouvot wrote: > Thanks for the report! I was not able to reproduce (even with an > asan-enabled build) > but I think the test is wrong. Indeed the backend could fsync too during the > test > (see register_dirty_segment() and the case where the request queue is full). > > I think the test should look like the attached instead, thoughts? Hmm. I cannot reproduce that here either. FWIW, I use these settings by default in my local builds, similarly to the CI which did not complain: ASAN_OPTIONS="print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0" UBSAN_OPTIONS="print_stacktrace=1:disable_coredump=0:abort_on_error=1:verbosity=2" CFLAGS+="-fsanitize=undefined " LDFLAGS+="-fsanitize=undefined " grassquit uses something a bit different, which doesn't allow me to see a problem with 027 even with fsync enabled: ASAN_OPTIONS="print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0" CFLAGS+="-fsanitize=address -fno-sanitize-recover=all " LDFLAGS+="-fsanitize=address -fno-sanitize-recover=all " Anyway, I was arriving at the same conclusion as you, because this is just telling us that there could be "some" sync activity. As rewritten, this would just check that the after-the-fact counter is never lower than the before-the-fact counter, which is still better than removing the query. I was thinking about doing the latter and removing the query, but I'm also OK with what you have here, keeping the query with a relaxed check. I'll go adjust that in a bit... -- Michael signature.asc Description: PGP signature
Re: Test to dump and restore objects left behind by regression
On Wed, Dec 18, 2024 at 7:39 PM Daniel Gustafsson wrote: > > > On 18 Dec 2024, at 12:28, Ashutosh Bapat > > wrote: > > In general I think it's fine to have such an expensive test gated behind a > PG_TEST_EXTRA flag, and since it's only run on demand we might as well run it > for all formats while at it. If this ran just once per week in the buildfarm > it would still allow us to catch things in time at fairly low overall cost. > > > I have rebased my patches on the current HEAD. The test now passes and > > does not show any new diff or bug. > > A few comments on this version of the patch: > > + regression run. Not enabled by default because it is time consuming. > Since this test consumes both time and to some degree diskspace (the > dumpfiles) > I wonder if this should be "time and resource consuming". Done. > > > + if ( $ENV{PG_TEST_EXTRA} > + && $ENV{PG_TEST_EXTRA} =~ /\bregress_dump_test\b/) > Should this also test that $oldnode and $newnode have matching pg_version to > keep this from running in a cross-version upgrade test? While it can be > argued > that running this in a cross-version upgrade is breaking it and getting to > keep > both pieces, it's also not ideal to run a resource intensive test we know will > fail. (It can't be done at this exact callsite, just picked to illustrate.) > You already wrote it in parentheses. At the exact callsite $oldnode and $newnode cannot be of different versions. In fact newnode is yet to be created at this point. But $oldnode has the same version as the server run from the code. In a cross-version upgrade this test will not be executed. I am confused as to what this comment is about. > > -sub filter_dump +sub filter_dump_for_upgrade > What is the reason for the rename? filter_dump() is perhaps generic but it's > also local to the upgrade test so it's also not too unclear. > In one of the earlier versions of the patch, there was filter_dump_for_regress or some such function which was used to filter the dump from the regression database. Name was changed to differentiate between the two functions. But the new function is now named as adjust_regress_dumpfile() so this name change is not required anymore. Reverting it. I have left the comment change since the test file now has tests for both upgrade and dump/restore. > > + my $format_spec = substr($format, 0, 1); > This doesn't seem great for readability, how about storing the formats and > specfiers in an array of Perl hashes which can be iterated over with > descriptive names, like $format{'name'} and $format{'spec'}? > Instead of an array of hashes, I used a single hash with format description as key and format spec as value. Hope that's acceptable. > > + || die "opening $dump_adjusted "; > Please include the errno in the message using ": $!" appended to the error > message, it could help in debugging. > I didn't see this being used with other open calls in the file. For that matter we are not using $! with open() in many test files. But it seems useful. Done > +compare the results of dump and retore tests > s/retore/restore/ > Thanks for pointing that out. Fixed. > > + else > + { > + note('first dump file: ' . $dump1); > + note('second dump file: ' . $dump2); > + } > + > This doesn't seem particularly helpful, if the tests don't fail then printing > the names won't bring any needed information. What we could do here is to add > an is() test in compare_dump()s to ensure the filenames differ to catch any > programmer error in passing in the same file twice. Good suggestion. Done. 
0001 - same as 0001 from previous version 0002 - addresses above comments -- Best Wishes, Ashutosh Bapat From 5ab6dd99438dbb1a77151f5faa0a4104aec5ce74 Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Thu, 27 Jun 2024 10:03:53 +0530 Subject: [PATCH 1/2] Test pg_dump/restore of regression objects 002_pg_upgrade.pl tests pg_upgrade of the regression database left behind by the regression run. Modify it to test dump and restore of the regression database as well. The regression database created by a regression run contains almost all the database objects supported by PostgreSQL in various states. Hence the new testcase covers dump and restore scenarios not covered by individual dump/restore cases. Many regression tests mention that they leave objects behind for dump/restore testing. But till now 002_pg_upgrade only tested dump/restore through pg_upgrade which is different from dump/restore through pg_dump. Adding the new testcase closes that gap. Testing dump and restore of the regression database makes this test run longer for a relatively smaller benefit. Hence run it only when explicitly requested by the user by specifying "regress_dump_test" in PG_TEST_EXTRA. Note For the reviewer: The new test has uncovered two bugs so far in one year. 1. Introduced by 14e87ffa5c54. Fixed in fd41ba93e4630921a72ed5127cd0d552a8f3f8fc. 2. Introduced by 0413a556990ba628a3de8a0b58be020fd9a14ed0. Reverted in 7456
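Concretely, the single-hash approach reads something like this (illustrative Perl; the letters are pg_dump's standard -F format specifiers, and test_regress_dump_restore() is a hypothetical name for the per-format routine):

# Map human-readable format names to pg_dump's -F specifiers.
my %format_spec = (
	custom    => 'c',
	directory => 'd',
	plain     => 'p',
	tar       => 't',
);

foreach my $format (sort keys %format_spec)
{
	test_regress_dump_restore($oldnode, $format, $format_spec{$format});
}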
Re: Make tuple deformation faster
On Thu, 5 Dec 2024 at 13:09, Andres Freund wrote: > On 2024-12-05 11:52:01 +1300, David Rowley wrote: > > On Thu, 5 Dec 2024 at 03:51, Andres Freund wrote: > > > Possibly stupid idea: Could we instead store the attributes *before* the > > > main > > > TupleDescData, with increasing "distance" for increased attnos? That way > > > we > > > wouldn't need to calculate any variable offsets. Of course the price > > > would be > > > to have some slightly more complicated invocation of pfree(), but that's > > > comparatively rare. > > > > Are you thinking this to make the address calculation cheaper? or so > > that the hacky code that truncates the TupleDesc would work without > > crashing still? > > Primarily to make the address calculation cheaper. I mostly got that working, but there were quite a few adjustments needed to fix various things. For example, there are a few places that are pfreeing a TupleDesc without going through FreeTupleDesc(); namely index_truncate_tuple() and InsertOneTuple(). There was also a bit more complexity around the TupleDescs stored in DSA memory as the base address of those is the start of the FormData_pg_attribute array, so some offsetting is needed to get the TupleDesc address. None of those changes are complex themselves, but it was quite painful to find all those places and I started to worry that there might end up being a bit more fallout from that method and I basically chickened out. I agree the address calculation is cheaper, but accessing FormData_pg_attribute isn't as performance-critical anymore as CompactAttribute is doing the performance-critical work now. I've now pushed the 0001 patch and the 0002 patch as one commit. I'm going to give the buildfarm a bit of time then push the commit to remove pg_attribute.attcacheoff. I'm planning on letting that settle overnight, then, if all is well, push the attalignby changes. David
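For the record, the abandoned layout would have reduced per-attribute addressing to a single subtraction, along these lines (illustrative only, not committed code):

/*
 * Hypothetical accessor for a layout where the FormData_pg_attribute array
 * is stored immediately *before* the TupleDescData struct, attribute 0
 * closest to it:
 *
 *		[attr N-1] ... [attr 1] [attr 0] [TupleDescData]
 */
static inline Form_pg_attribute
TupleDescAttrReversed(TupleDesc tupdesc, int attnum)
{
	return ((Form_pg_attribute) tupdesc) - (attnum + 1);
}

The price, as noted above, is that the allocation no longer starts at the TupleDesc pointer, which is what broke the callers doing a bare pfree().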
Re: Bug: mdunlinkfiletag unlinks mainfork seg.0 instead of indicated fork+segment
On Sat, 21 Dec 2024 at 01:05, Thomas Munro wrote: > > On Sat, Dec 21, 2024 at 11:41 AM Matthias van de Meent > wrote: > > The unlinking of forks in the FileTag infrastructure has been broken > > since b0a55e43 in PG16, > > while a segment number other than 0 has never > > been unlinked (at least not since the introduction of the system with > > 3eb77eba in PG12) > > Right, and that predates the FileTag refactoring, it's just that the > earlier coding didn't explicitly mention the segment number so it > looks slightly different. In fact it was hard-coded to unlink > relpathperm(entry->rnode, MAIN_FORKNUM) before that work, so both fork > and segment number were always fixed, it's just that the FileTag > mechanism was made slightly more general, really for the SYNC stuff, > not so much for the UNLINK stuff which uses the same tags. I see. > > However, extensions may still make use of this and > > incorrectly assume that only the requested file of the requested fork's > > segment will be unlinked, when it actually unlinks data from the > > main fork. > > It seems unlikely to be useful for any purpose other than tombstones. > And it seems like if someone is already using it, they would have been > in touch to say that it doesn't work. Or perhaps you tried to use it > and noticed this flaw, or know of someone who would like to use it? > Or more likely I guess you're working on smgr extension support. I noticed it when I was browsing NBuffers-sized allocations, which got me looking into the FileTag infrastructure, which got me trying to figure out what FileTag.segno is used for that would require it to be a uint64 in addition to the RelFileNode, which got me looking through this code. So, not exactly for SMGR extension support here, but my experience in that did make it easier for me to figure out that the code doesn't behave as I'd expected it to. > > The attached fixes that for PG16+. PG13-15 will take a little bit more > > effort due to code changes in PG16; though it'd probably still require > > a relatively minor change. > > The patch does not seem unreasonable and I'd like to help tidy this > up, but ... hmm, could we also consider going the other way? > register_unlink_segment(), mdunlinkfiletag() and the macro that > populates md.c's FileTag are internal to md.c, and we don't expect > external code to be pushing md.c SYNC_UNLINK_REQUEST requests into the > request queue (who would do that and what could the motivation > possibly be?) Doesn't feel like a supported usage to me... So my > question is: what bad thing would happen if we just renamed > register_unlink_segment() to register_unlink_tombstone() without > fork/seg arguments, to make it clear that it's not really a general > purpose unreliable segment unlink mechanism that we want anyone to > build more stuff on top of? I just noticed I misinterpreted the conditions in mdunlinkfork, so that I thought it allowed a user to pass their own forknum into register_unlink_segment (as that is called with the user-provided forknum). Instead, that branch is only taken when forknum == MAIN_FORKNUM, so I think you might be right that going in the other direction is more desirable. In that case, something along the lines of the attached would then be better - it removes the fork and segno from register_unlink_segment's arguments (renamed to register_unlink_tombstone), and Asserts() that mdunlinkfiletag only receives a FileTag that contains expected values. Kind regards, Matthias van de Meent. 
v2-0001-MD-smgr-Clarify-FileTag-based-unlinking.patch Description: Binary data
Re: a litter question about mdunlinkfiletag function
On Tue, 15 Oct 2024 at 04:50, px shi wrote: >> >> You will find other places where relpathperm() is called without having >> a FileTag structure available, e.g. ReorderBufferProcessTXN(). > > > I apologize for the confusion. What I meant to say is that in the > mdunlinkfiletag() function, the forknum is currently hardcoded as > MAIN_FORKNUM when calling relpathperm(). While mdunlinkfiletag currently only > handles MAIN_FORKNUM, wouldn’t it be more flexible to retrieve the forknum > from the ftag structure instead? I just noticed this mail thread as I was searching the archives for other mentions of `mdunlinkfiletag` when doing some more digging on uses of that function, to back my own bug report of what looks like the same issue. See [0]. As was explained to me by Thomas, the reason why MAIN_FORKNUM is hardcoded here (and why ftag.segno is also ignored) is that this code is only ever reached for FileTag values with forknum=MAIN_FORKNUM (and segno is also always 0) with the code in Postgres' repository. The patch proposed in [0] is supposed to make that more clear to developers. I suspect the code will be further updated to include the correct fork number and segment number when there is a need to unlink non-MAIN_FORKNUM or non-segno=0 files in mdunlinkfiletag. Kind regards, Matthias van de Meent [0] https://www.postgresql.org/message-id/flat/CAEze2WiWt%2B9%2BOnqW1g9rKz0gqxymmt%3Doe6pKAEDrutdfpDMpTw%40mail.gmail.com
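For readers without the tree handy, the structure under discussion (src/include/storage/sync.h) looks roughly like this:

typedef struct FileTag
{
	int16		handler;		/* SyncRequestHandler value */
	int16		forknum;		/* ForkNumber, stored narrow */
	RelFileLocator rlocator;
	uint64		segno;
} FileTag;

mdunlinkfiletag() receives one of these but, as discussed, only ever sees forknum == MAIN_FORKNUM and segno == 0 from core code.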
Re: Possible integer overflow in bringetbitmap()
On Tue, Dec 10, 2024 at 12:33:08PM +0900, Michael Paquier wrote: > Sure, you could do (a) and (b) together. It also seems to me that it > is just simpler to make totalpages a int64 to map automatically with > the result expected by the caller of bringetbitmap(), and we know that > based on MaxBlockNumber we'll never run out of bits. That should be simple enough. Are you planning to send a proposal of patch? -- Michael signature.asc Description: PGP signature
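For the archives, the change under discussion is essentially a one-line type change in bringetbitmap() (a sketch, untested; the exact declaration may differ), so that "totalpages * 10" is computed in 64 bits and matches the function's int64 return type:

-	int			totalpages = 0;
+	int64		totalpages = 0;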
Re: Bug: mdunlinkfiletag unlinks mainfork seg.0 instead of indicated fork+segment
On Sat, Dec 21, 2024 at 11:41 AM Matthias van de Meent wrote: > The unlinking of forks in the FileTag infrastructure has been broken > since b0a55e43 in PG16, Well spotted. -p = relpathperm(ftag->rlocator, MAIN_FORKNUM); +p = _mdfd_segpath_rflb(rlfb, ftag->forknum, ftag->segno); As you say, a harmless thinko as far as core is concerned, as we only ever use it for the "tombstone" file preventing relfilenode recycling. The tombstone is the bare relfilenode (main segment 0), since every other segment is unlinked on commit. > while a segment number other than 0 has never > been unlinked (at least not since the introduction of the system with > 3eb77eba in PG12) Right, and that predates the FileTag refactoring, it's just that the earlier coding didn't explicitly mention the segment number so it looks slightly different. In fact it was hard-coded to unlink relpathperm(entry->rnode, MAIN_FORKNUM) before that work, so both fork and segment number were always fixed, it's just that the FileTag mechanism was made slightly more general, really for the SYNC stuff, not so much for the UNLINK stuff which uses the same tags. > However, extensions may still make use of this and > incorrectly assume that only the requested file of the requested fork's > segment will be unlinked, when it actually unlinks data from the > main fork. It seems unlikely to be useful for any purpose other than tombstones. And it seems like if someone is already using it, they would have been in touch to say that it doesn't work. Or perhaps you tried to use it and noticed this flaw, or know of someone who would like to use it? Or more likely I guess you're working on smgr extension support. It is not a reliable mechanism (pull the power after the checkpoint record is written and before it processes that list and you've leaked a file), and it's dealing with an edge case we should close in a better way, and then get rid of it. > The attached fixes that for PG16+. PG13-15 will take a little bit more > effort due to code changes in PG16; though it'd probably still require > a relatively minor change. The patch does not seem unreasonable and I'd like to help tidy this up, but ... hmm, could we also consider going the other way? register_unlink_segment(), mdunlinkfiletag() and the macro that populates md.c's FileTag are internal to md.c, and we don't expect external code to be pushing md.c SYNC_UNLINK_REQUEST requests into the request queue (who would do that and what could the motivation possibly be?) Doesn't feel like a supported usage to me... So my question is: what bad thing would happen if we just renamed register_unlink_segment() to register_unlink_tombstone() without fork/seg arguments, to make it clear that it's not really a general purpose unreliable segment unlink mechanism that we want anyone to build more stuff on top of?
Add XMLNamespaces to XMLElement
Hi, I'd like to propose the implementation of the XMLNamespaces option for XMLElement. XMLNAMESPACES(nsuri AS nsprefix) XMLNAMESPACES(DEFAULT default-nsuri) XMLNAMESPACES(NO DEFAULT) * nsprefix: Namespace's prefix. * nsuri: Namespace's URI. * DEFAULT default-nsuri: Specifies the DEFAULT namespace to use within the scope of a namespace declaration. * NO DEFAULT: Specifies that NO DEFAULT namespace is to be used within the scope of a namespace declaration. This basically works pretty much like XMLAttributes, but with a few more restrictions (see SQL/XML:2023, 11.2, XML lexically scoped options): * XML namespace declaration shall contain at most one DEFAULT namespace declaration item. * No namespace prefix shall be equivalent to xml or xmlns. * No namespace URI shall be identical to http://www.w3.org/2000/xmlns/ or to http://www.w3.org/XML/1998/namespace. * The value of a namespace URI contained in a regular namespace declaration item (no DEFAULT) shall not be a zero-length string. Examples: SELECT xmlelement(NAME "foo", xmlnamespaces('http://x.y' AS bar)); xmlelement --- <foo xmlns:bar="http://x.y"/> SELECT xmlelement(NAME "foo", xmlnamespaces(DEFAULT 'http://x.y')); xmlelement --- <foo xmlns="http://x.y"/> SELECT xmlelement(NAME "foo", xmlnamespaces(NO DEFAULT)); xmlelement - <foo/> In transformXmlExpr() it seemed convenient to use the same parameters to store the prefixes and URIs as in XMLAttributes (arg_names and named_args), but I am still not so sure it is the right approach. Is there perhaps a better way? Any thoughts? Feedback welcome! Best, Jim From 8dee3772be0d89b3d49eff17344ff53440e5f566 Mon Sep 17 00:00:00 2001 From: Jim Jones Date: Fri, 20 Dec 2024 14:50:51 +0100 Subject: [PATCH v1] Add XMLNamespaces option to XMLElement This patch adds the scoped option XMLNamespaces to XMLElement, as described in ISO/IEC 9075-14:2023, 11.2 XML lexically scoped options: xmlnamespaces(uri AS prefix, ...) xmlnamespaces(DEFAULT uri, ...) xmlnamespaces(NO DEFAULT, ...) * prefix: Namespace's prefix. * uri: Namespace's URI. * DEFAULT uri: Specifies the DEFAULT namespace to use within the scope of a namespace declaration. * NO DEFAULT: Specifies that NO DEFAULT namespace is to be used within the scope of a namespace declaration. == Examples == SELECT xmlelement(NAME "foo", xmlnamespaces('http:/x.y' AS bar)); xmlelement -- <foo xmlns:bar="http:/x.y"/> SELECT xmlelement(NAME "foo", xmlnamespaces(DEFAULT 'http:/x.y')); xmlelement -- <foo xmlns="http:/x.y"/> SELECT xmlelement(NAME "foo", xmlnamespaces(NO DEFAULT)); xmlelement - <foo/> Tests and documentation were updated accordingly. --- doc/src/sgml/func.sgml | 57 +++- src/backend/parser/gram.y | 73 +++--- src/backend/parser/parse_expr.c | 73 ++ src/backend/utils/adt/xml.c | 32 - src/include/nodes/primnodes.h | 4 +- src/include/utils/xml.h | 6 + src/test/regress/expected/xml.out | 205 src/test/regress/expected/xml_1.out | 151 src/test/regress/expected/xml_2.out | 205 src/test/regress/sql/xml.sql| 100 ++ 10 files changed, 884 insertions(+), 22 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 47370e581a..b33663bc09 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -14482,7 +14482,10 @@ SELECT xmlconcat('', ' -xmlelement ( NAME name , XMLATTRIBUTES ( attvalue AS attname , ... ) , content , ... ) xml +xmlelement ( NAME name +, XMLATTRIBUTES ( attvalue AS attname , ... ) +, XMLNAMESPACES ( {regular-nsuri AS nsprefix | DEFAULT default-nsuri | NO DEFAULT} , ... ) +, content , ... ) xml @@ -14495,7 +14498,39 @@ SELECT xmlconcat('', 'PostgreSQL data type. 
The argument(s) within XMLATTRIBUTES generate attributes of the XML element; the content value(s) are - concatenated to form its content. + concatenated to form its content. The arguments within XMLNAMESPACES + construct namespace declarations from values provided in nsuri + and nsprefix, which correspond to the URI of a namespace and + its prefix, respectively. The option DEFAULT can be used to set the + default namespace declaration (without a prefix) to the URI provided in default-nsuri. + The option NO DEFAULT states that a namespace scope has no default namespace. A valid + XMLNAMESPACES item must fulfill the following conditions: + + + + + Only a single DEFAULT declaration item within the same scope. + + + + + No two nsuri can be equal within the same scope. + + + + + No nsprefix can be equal to xml or xmlns, + and no nsuri can be equal to http://www.w3.org/2000/xmlns/ +
Re: Statistics Import and Export
On Thu, 2024-12-19 at 21:23 -0800, Jeff Davis wrote: > > 0001-0005 - changes to pg_dump/pg_upgrade > > Attached is a version 36j... The testing can use some work here. I noticed that if I take out the stats entirely, the tests still pass, because pg_upgrade still gets the same before/after result. Also, we need some testing of the output and ordering of pg_dump. Granted, in most cases problems would result in errors during the reload. But we have those tests for other kinds of objects, so we should have the tests for stats, too. I like the description "STATISTICS DATA" because it differentiates from the extended stats definitions. It might be worth differentiating between "RELATION STATISTICS DATA" and "ATTRIBUTE STATISTICS DATA" but I'm not sure if there's value in that. But how did you determine what to use for the .tag and prefix? In the output, it uses the form: Name: STATISTICS DATA ; Type: STATISTICS DATA; ... Should that be: Name: ; Type: STATISTICS DATA; ... Or: Data for Name: ...; Name: ...; Type: STATISTICS DATA; ... Or: Statistics for Name: ...; Name: ...; Type: STATISTICS DATA; ... Regards, Jeff Davis
Re: Discussion on a LISTEN-ALL syntax
Trey Boudreau writes: > so I'd like to propose a 'LISTEN *' equivalent to 'UNLISTEN *'. Seems reasonable in the abstract, and given the UNLISTEN * precedent it's hard to quibble with that syntax choice. I think what actually needs discussing are the semantics, specifically how this'd interact with other LISTEN/UNLISTEN actions. Explain what you think should be the behavior after: LISTEN foo; LISTEN *; UNLISTEN *; -- are we still listening on foo? LISTEN *; LISTEN foo; UNLISTEN *; -- how about now? LISTEN *; UNLISTEN foo; -- how about now? LISTEN *; LISTEN foo; UNLISTEN foo; -- does that make a difference? I don't have any strong preferences about this, but we ought to have a clear idea of the behavior we want before we start coding. regards, tom lane
Re: [PoC] Federated Authn/z with OAUTHBEARER
> On 20 Dec 2024, at 02:00, Jacob Champion wrote:

Thanks for the new version, I was doing a v39 review but I'll roll that over into a v40 review now.

As I was reading I was trying to identify parts that can be broken out and committed ahead of time. This is not only to trim down the size, but mostly to shape the final commit into a coherent single commit that brings a single functionality utilizing existing APIs. Basically I think we should keep generic functionality out of the final commit and keep that focused on OAuth and the required APIs and infra.

The async auth support seemed like a candidate to go in before the rest. While there won't be any consumers of it, it's also not limited to OAuth. What do you think about slicing that off and getting it in ahead of time? I took a small stab at separating out the generic bits (it includes the PG_MAX_AUTH_TOKEN_LENGTH move as well, which is unrelated but could also be committed ahead of time) along with some small tweaks on it.

-- Daniel Gustafsson

diff --git a/src/include/libpq/auth.h b/src/include/libpq/auth.h
index 227b41daf6..01bc3c7250 100644
--- a/src/include/libpq/auth.h
+++ b/src/include/libpq/auth.h
@@ -16,6 +16,22 @@
 #include "libpq/libpq-be.h"
 
+/*
+ * Maximum accepted size of GSS and SSPI authentication tokens.
+ * We also use this as a limit on ordinary password packet lengths.
+ *
+ * Kerberos tickets are usually quite small, but the TGTs issued by Windows
+ * domain controllers include an authorization field known as the Privilege
+ * Attribute Certificate (PAC), which contains the user's Windows permissions
+ * (group memberships etc.). The PAC is copied into all tickets obtained on
+ * the basis of this TGT (even those issued by Unix realms which the Windows
+ * realm trusts), and can be several kB in size. The maximum token size
+ * accepted by Windows systems is determined by the MaxAuthToken Windows
+ * registry setting. Microsoft recommends that it is not set higher than
+ * 65535 bytes, so that seems like a reasonable limit for us as well.
+ */
+#define PG_MAX_AUTH_TOKEN_LENGTH	65535
+
 extern PGDLLIMPORT char *pg_krb_server_keyfile;
 extern PGDLLIMPORT bool pg_krb_caseins_users;
 extern PGDLLIMPORT bool pg_gss_accept_delegation;
diff --git a/src/interfaces/libpq/fe-auth-sasl.h b/src/interfaces/libpq/fe-auth-sasl.h
index 258bfd0564..b47011d077 100644
--- a/src/interfaces/libpq/fe-auth-sasl.h
+++ b/src/interfaces/libpq/fe-auth-sasl.h
@@ -30,6 +30,7 @@ typedef enum
 	SASL_COMPLETE = 0,
 	SASL_FAILED,
 	SASL_CONTINUE,
+	SASL_ASYNC,
 } SASLStatus;
 
 /*
@@ -77,6 +78,8 @@ typedef struct pg_fe_sasl_mech
 	 *
 	 * state:	The opaque mechanism state returned by init()
 	 *
+	 * final:	true if the server has sent a final exchange outcome
+	 *
 	 * input:	The challenge data sent by the server, or NULL when
 	 *			generating a client-first initial response (that is, when
 	 *			the server expects the client to send a message to start
@@ -101,12 +104,17 @@
 	 *
 	 *	SASL_CONTINUE:	The output buffer is filled with a client response.
 	 *					Additional server challenge is expected
+	 *	SASL_ASYNC:		Some asynchronous processing external to the
+	 *					connection needs to be done before a response can be
+	 *					generated. The mechanism is responsible for setting up
+	 *					conn->async_auth appropriately before returning.
 	 *	SASL_COMPLETE:	The SASL exchange has completed successfully.
 	 *	SASL_FAILED:	The exchange has failed and the connection should be
 	 *					dropped.
 	 *
 	 */
-	SASLStatus	(*exchange) (void *state, char *input, int inputlen,
+	SASLStatus	(*exchange) (void *state, bool final,
+							 char *input, int inputlen,
 							 char **output, int *outputlen);
 
 /*
diff --git a/src/interfaces/libpq/fe-auth-scram.c b/src/interfaces/libpq/fe-auth-scram.c
index 0bb820e0d9..da168eb2f5 100644
--- a/src/interfaces/libpq/fe-auth-scram.c
+++ b/src/interfaces/libpq/fe-auth-scram.c
@@ -24,7 +24,8 @@
 /* The exported SCRAM callback mechanism. */
 static void *scram_init(PGconn *conn, const char *password,
 						const char *sasl_mechanism);
-static SASLStatus scram_exchange(void *opaq, char *input, int inputlen,
+static SASLStatus scram_exchange(void *opaq, bool final,
+								 char *input, int inputlen,
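To make the new control flow concrete, here is a minimal sketch of a mechanism that uses SASL_ASYNC. The mechanism itself, its state struct, and the flow callback are hypothetical; only the callback signature and status codes come from the patch above:

static SASLStatus
example_exchange(void *state, bool final,
                 char *input, int inputlen,
                 char **output, int *outputlen)
{
    example_state *st = (example_state *) state;    /* hypothetical state */

    if (st->token == NULL)
    {
        /*
         * We can't answer the server yet: an external flow (e.g. fetching
         * a token) must run first.  Per the contract above, the mechanism
         * sets conn->async_auth before returning SASL_ASYNC.
         */
        st->conn->async_auth = example_token_flow;  /* hypothetical callback */
        return SASL_ASYNC;
    }

    /* Token in hand: send the response and let the server decide. */
    *output = example_build_response(st);           /* hypothetical */
    *outputlen = (int) strlen(*output);
    return final ? SASL_COMPLETE : SASL_CONTINUE;
}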
Re: Discussion on a LISTEN-ALL syntax
On Friday, December 20, 2024, Tom Lane wrote:
> Trey Boudreau writes:
> > * "Listen to all but X" seems like a reasonable desire.

This I concur with, and would add: let me name my channels accounting.payables, accounting.receivables, sales.leads; and let me listen to or ignore all accounting/sales channel names. But staying within the existing “deny by default, permissive grants only” design to meet this specific goal seems like a reasonable incremental step to accept. Let others who want to work on a more expansive capability bring those patches forth.

As for exposing this to the user, this allow-all “channel” would be presented as any other normal channel. The reader would need to know about the special meaning of whatever label we end up using. IOW, the wildcard is the label, and no attempt to tie real in-use channel names to it should or even could be attempted.

David J.
Bug: mdunlinkfiletag unlinks mainfork seg.0 instead of indicated fork+segment
Hi,

I noticed that the MD smgr internals misbehave when unlink requests for specific forks or specific segments are sent through SyncOps: it currently always unlinks segment 0 of the main fork, even if only a different fork and/or segment was requested. While probably not extremely critical, it seems bad to not unlink the right segment while in recovery, so here's a patch that unlinks exactly the requested files.

The unlinking of forks in the FileTag infrastructure has been broken since b0a55e43 in PG16, while a segment number other than 0 has never been unlinked (at least not since the introduction of the system with 3eb77eba in PG12). However, extensions may still make use of this and incorrectly assume that only the requested segment of the requested fork will be unlinked, when it actually unlinks data from the main fork.

The attached fixes that for PG16+. PG13-15 will take a little more effort due to code changes in PG16, though it'd probably still require a relatively minor change.

Kind regards,

Matthias van de Meent.
Neon (https://neon.tech)

v1-0001-MD-smgr-Unlink-the-requested-file-segment-not-mai.patch
Description: Binary data
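For readers without the patch at hand, the shape of the fix is roughly the following. This is a sketch against PG16-era md.c conventions, not the patch itself; the key point is that the tag's fork and segment number are honored instead of hardcoding the main fork's segment 0:

static void
mdunlinkfiletag(const FileTag *ftag, char *path)
{
    char       *p;

    /* Use the fork number recorded in the tag, not MAIN_FORKNUM. */
    p = relpathperm(ftag->rlocator, ftag->forknum);

    /* Segment 0 has no suffix; later segments are named "<relpath>.<segno>". */
    if (ftag->segno > 0)
        snprintf(path, MAXPGPATH, "%s.%u", p, ftag->segno);
    else
        strlcpy(path, p, MAXPGPATH);

    pfree(p);
}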
Re: Discussion on a LISTEN-ALL syntax
"David G. Johnston" writes: > On Friday, December 20, 2024, Tom Lane wrote: >> * "Listen to all but X" seems like a reasonable desire. > This I concur with, and would add: let me name my channels > accounting.payables, accounting.receivables, sales.leads; and let me listen > or ignore all accounting/sales channel names. Hmm. That reminds me that there was recently a proposal to allow LISTEN/UNLISTEN with pattern arguments. (It wasn't anything you'd expect like regex patterns or LIKE patterns, but some off-the-wall syntax, which I doubt we'd accept in that form. But clearly there's some desire for that out there.) While I don't say we need to implement that as part of this, it'd be a good idea to anticipate that that will happen. And that kind of blows a hole in my idea, because mine was predicated on the assumption that you could unambiguously match UNLISTENs against LISTENs. A patterned UNLISTEN might revoke a superset or subset of previous LISTENs, and I'm not sure you could readily tell which. I think we can still hold to the idea that LISTEN * or UNLISTEN * cancels all previous requests, but it's feeling like we might have to accumulate subsequent requests without trying to make contradictory ones cancel out. Is it okay if the behavior is explicitly dependent on the order of those requests, more or less "last match wins"? If not, how do we avoid that? > As for exposing this to the user, this allow-all “channel” would be > presented as any other normal channel. The reader would need to know about > the special meaning of whatever label we end up using. IOW, the wildcard is > the label and no attempt to tie real in-use channel names to it should or > even could be attempted. Don't think that quite flies. We might have to regurgitate the state explicitly: LISTEN * UNLISTEN foo.* LISTEN foo.bar.* showing that we're listening to channels foo.bar.*, but not other channels beginning "foo", and also to all channels not beginning "foo". regards, tom lane
Re: Discussion on a LISTEN-ALL syntax
> On 20 Dec 2024, at 23:07, Tom Lane wrote: > ..it makes "LISTEN *" act the same as though you had somehow explicitly listed > every possible channel. When thinking about it while reading this thread, this is what I came up with as well. Since the current workings of LISTEN is so well established I can't see how we could make this anything but a natural extension of the current. -- Daniel Gustafsson
Re: pure parsers and reentrant scanners
On 20.12.24 02:07, Tom Lane wrote:
> I noticed that lapwing is bleating about
>
> ccache gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -fPIC -fvisibility=hidden -I. -I. -I../../src/include -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/et -c -o cubescan.o cubescan.c
> cubescan.c:1689:5: warning: no previous prototype for 'cube_yyget_column' [-Wmissing-prototypes]
> cubescan.c:1765:6: warning: no previous prototype for 'cube_yyset_column' [-Wmissing-prototypes]
>
> and likewise in segscan.c. lapwing is using flex 2.5.35, so probably this is the same bug worked around in parser/scan.l:

Ok, we can fix that, but maybe this is also a good moment to think about whether that is useful. I could not reproduce the issue with flex 2.5.39. I could find no download of flex 2.5.35. The github site only offers versions back to 2.5.39, the sourceforge site back to 2.5.36. lapwing says it's Debian 7.0, which went out of support in 2016 and out of super-duper-extended support in 2020. It also doesn't have a supported OpenSSL version anymore, and IIRC, it has a weird old compiler that occasionally gives bogus warnings. I think it's time to stop supporting this.
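For reference, the existing workaround in parser/scan.l that would otherwise need to be replicated in cubescan.l and segscan.l looks roughly like this (wording of the comment approximated):

/*
 * Work around a bug in flex 2.5.35: it emits definitions of the
 * yyget_column/yyset_column functions without emitting declarations
 * for them, which trips -Wmissing-prototypes.  Declare them ourselves.
 */
extern int  core_yyget_column(yyscan_t yyscanner);
extern void core_yyset_column(int column_no, yyscan_t yyscanner);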
Re: downgrade some aclchk.c errors to internal
On 20.12.24 12:47, Peter Eisentraut wrote: In aclchk.c, there are a few error messages that use ereport() but it seems like they should be internal error messages. Moreover, they are using get_object_class_descr(), which is only meant for internal errors. (For example, it does not have translation support.) I dug through this and it seems like these errors are indeed not or no longer user- facing, so we can downgrade them to internal. See commit messages in the attached patches for further explanations. There was a mistake in the second patch, I had missed some other callers that I have fixed up here. Amended patch set attached. From b9a2945b367bc32f1a1dda421a72e176fecacdda Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Fri, 20 Dec 2024 10:10:19 +0100 Subject: [PATCH v2 1/2] Downgrade errors in object_ownercheck() to internal The "does not exist" errors in object_ownership() were written as ereport(), suggesting that they are user-facing. But no code path except one can reach this function without first checking that the object exists. If this were actually a user-facing error message, then there would be some problems: get_object_class_descr() is meant to be for internal errors only and does not support translation. The one case that can reach this without first checking the object existence is from be_lo_unlink(). (This makes some sense since large objects are referred to by their OID directly.) In this one case, we can add a line of code to check the object existence explicitly, consistent with other LO code. For the rest, downgrade the error messages to elog()s. The new message wordings are the same as in DropObjectById(). --- src/backend/catalog/aclchk.c | 10 -- src/backend/libpq/be-fsstubs.c | 5 + 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c index aaf96a965e4..840122dca44 100644 --- a/src/backend/catalog/aclchk.c +++ b/src/backend/catalog/aclchk.c @@ -4082,9 +4082,8 @@ object_ownercheck(Oid classid, Oid objectid, Oid roleid) tuple = SearchSysCache1(cacheid, ObjectIdGetDatum(objectid)); if (!HeapTupleIsValid(tuple)) - ereport(ERROR, - (errcode(ERRCODE_UNDEFINED_OBJECT), -errmsg("%s with OID %u does not exist", get_object_class_descr(classid), objectid))); + elog(ERROR, "cache lookup failed for %s %u", +get_object_class_descr(classid), objectid); ownerId = DatumGetObjectId(SysCacheGetAttrNotNull(cacheid, tuple, @@ -4113,9 +4112,8 @@ object_ownercheck(Oid classid, Oid objectid, Oid roleid) tuple = systable_getnext(scan); if (!HeapTupleIsValid(tuple)) - ereport(ERROR, - (errcode(ERRCODE_UNDEFINED_OBJECT), -errmsg("%s with OID %u does not exist", get_object_class_descr(classid), objectid))); + elog(ERROR, "could not find tuple for %s %u", +get_object_class_descr(classid), objectid); ownerId = DatumGetObjectId(heap_getattr(tuple, get_object_attnum_owner(classid), diff --git a/src/backend/libpq/be-fsstubs.c b/src/backend/libpq/be-fsstubs.c index 27d317dfdc0..66bec869cbc 100644 --- a/src/backend/libpq/be-fsstubs.c +++ b/src/backend/libpq/be-fsstubs.c @@ -317,6 +317,11 @@ be_lo_unlink(PG_FUNCTION_ARGS) PreventCommandIfReadOnly("lo_unlink()"); + if (!LargeObjectExists(lobjId)) + ereport(ERROR, + (errcode(ERRCODE_UNDEFINED_OBJECT), +errmsg("large object %u does not exist", lobjId))); + /* * Must be owner of the large object. 
It would be cleaner to check this * in inv_drop(), but we want to throw the error before not after closing base-commit: 8ac0021b6f10928a46b7f3d1b25bc21c0ac7f8c5 -- 2.47.1 From b68de2c26fac1d11c52dcfa391992454b65d7f52 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Fri, 20 Dec 2024 14:45:28 +0100 Subject: [PATCH v2 2/2] Downgrade error in object_aclmask_ext() to internal The "does not exist" error in object_aclmask_ext() was written as ereport(), suggesting that it is user-facing. This is problematic: get_object_class_descr() is meant to be for internal errors only and does not support translation. For the has_xxx_privilege functions, the error has not been user-facing since commit 403ac226ddd. The remaining users are pg_database_size() and pg_tablespace_size(). The call stack here is pretty deep and this dependency
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Thu, Dec 19, 2024 at 9:36 PM Richard Guo wrote: > > On Fri, Dec 20, 2024 at 7:50 AM Melanie Plageman > wrote: > Looks correct to me. > > BTW, I kind of doubt that the overflow risk fixed in 28328ec87 is a > real issue in real-world scenarios. Is it realistically possible to > have more than INT_MAX tuples fit on one heap page? > > If it is, then the statement below could also be prone to overflow. > > uint32 mid = (start + end) / 2; > > Maybe you want to change it to: > > uint32 mid = start + (end - start) / 2; I've done this and pushed. Thanks to you and Ranier for your help in identifying and fixing this! - Melanie
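For anyone following along, the overflow pattern under discussion is the classic binary-search midpoint bug. As Richard notes it is probably unreachable with realistic page contents, but the arithmetic is easy to demonstrate:

/* With 32-bit unsigned arithmetic, the naive midpoint can wrap: */
uint32      start = 3000000000;
uint32      end = 4000000000;

uint32      mid_bad = (start + end) / 2;        /* sum wraps mod 2^32, giving 1352516352 */
uint32      mid_ok = start + (end - start) / 2; /* 3500000000, as intended */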
Additional comments around need_escapes in pg_parse_json()
I recently had occasion to use the pg_parse_json() function for creating input functions for pg_ndistinct and pg_dependencies. While it is obvious now that I should have been parsing with need_escapes=true, it wasn't obvious from the outset, and so I'm proposing adding a few comments to help the next person come to that conclusion sooner than I did.

From 84a8990622a4b83729600d9db43e8678502548f5 Mon Sep 17 00:00:00 2001
From: Corey Huinker
Date: Fri, 20 Dec 2024 05:08:50 -0500
Subject: [PATCH v1] Explain impact of need_escapes in JSON parsing.

Adding some additional comments to help the reader discover the purpose and effect of the need_escapes parameter. Nearly every new use of pg_parse_json and pg_parse_json_incremental will need need_escapes=true.
---
 src/include/common/jsonapi.h | 11 +++
 src/common/jsonapi.c | 7 ++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/src/include/common/jsonapi.h b/src/include/common/jsonapi.h
index 167615a557..fa19005822 100644
--- a/src/include/common/jsonapi.h
+++ b/src/include/common/jsonapi.h
@@ -118,6 +118,11 @@ typedef struct JsonLexContext
 	struct jsonapi_StrValType *errormsg;
 } JsonLexContext;
 
+/*
+ * Function types for custom json parsing actions.
+ *
+ * fname and token will always be NULL if the context has need_escapes=false.
+ */
 typedef JsonParseErrorType (*json_struct_action) (void *state);
 typedef JsonParseErrorType (*json_ofield_action) (void *state, char *fname, bool isnull);
 typedef JsonParseErrorType (*json_aelem_action) (void *state, bool isnull);
@@ -197,6 +202,12 @@ extern JsonParseErrorType json_count_array_elements(JsonLexContext *lex,
  * struct is allocated.
  *
  * If need_escapes is true, ->strval stores the unescaped lexemes.
+ *
+ * Setting need_escapes to true is necessary if the operation needs
+ * to reference field names or scalar values. This is true of any
+ * operation beyond purely checking the json-validity of the source
+ * document.
+ *
  * Unescaping is expensive, so only request it when necessary.
  *
  * If need_escapes is true or lex was given as NULL, then the caller is
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 0e2a82ad7a..307182e478 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -1297,7 +1297,10 @@ parse_scalar(JsonLexContext *lex, const JsonSemAction *sem)
 		return result;
 	}
 
-	/* invoke the callback, which may take ownership of val */
+	/*
+	 * invoke the callback, which may take ownership of val.
+	 * val is always NULL when need_escapes=false
+	 */
 	result = (*sfunc) (sem->semstate, val, tok);
 
 	if (lex->flags & JSONLEX_CTX_OWNS_TOKENS)
@@ -1349,6 +1352,7 @@ parse_object_field(JsonLexContext *lex, const JsonSemAction *sem)
 
 	if (ostart != NULL)
 	{
+		/* fname is always NULL when need_escapes=false */
 		result = (*ostart) (sem->semstate, fname, isnull);
 		if (result != JSON_SUCCESS)
 			goto ofield_cleanup;
@@ -1370,6 +1374,7 @@ parse_object_field(JsonLexContext *lex, const JsonSemAction *sem)
 
 	if (oend != NULL)
 	{
+		/* fname is always NULL when need_escapes=false */
 		result = (*oend) (sem->semstate, fname, isnull);
 		if (result != JSON_SUCCESS)
 			goto ofield_cleanup;
-- 2.47.1
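A minimal usage sketch of the pattern these comments document. The callback and its state are hypothetical, but the jsonapi calls themselves are the existing public API; the point is that object_field_start only receives a non-NULL fname because need_escapes was passed as true:

static JsonParseErrorType
my_object_field_start(void *state, char *fname, bool isnull)
{
    /* fname is non-NULL only because we passed need_escapes = true */
    elog(DEBUG1, "saw key: %s", fname);
    return JSON_SUCCESS;
}

static void
parse_keys(const char *json)
{
    JsonLexContext lex;
    JsonSemAction sem = {0};

    sem.object_field_start = my_object_field_start;

    makeJsonLexContextCstringLen(&lex, json, strlen(json), PG_UTF8,
                                 true);     /* need_escapes: we use fname */

    if (pg_parse_json(&lex, &sem) != JSON_SUCCESS)
        elog(ERROR, "could not parse json");

    freeJsonLexContext(&lex);
}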
Some ExecSeqScan optimizations
Hi, I’ve been looking into possible optimizations for ExecSeqScan() and chatting with colleagues about cases where it shows up prominently in analytics-style query plans. For example, in queries like SELECT agg(somecol) FROM big_table WHERE , ExecScan() often dominates the profile. Digging into it, I found two potential sources of overhead: 1. Run-time checks for PlanState.qual and PlanState.ps_ProjInfo nullness: these checks are done repeatedly, which seems unnecessary if we know the values at plan init time. 2. Overhead from ExecScanFetch() when EvalPlanQual() isn’t relevant: Andres pointed out that ExecScanFetch() introduces unnecessary overhead even in the common case where EvalPlanQual() isn’t applicable. To address (1), I tried assigning specialized functions to PlanState.ExecProcNode in ExecInitSeqScan() based on whether qual or projInfo are NULL. Inspired by David Rowley’s suggestion to look at ExecHashJoinImpl(), I wrote variants like ExecSeqScanNoQual() (for qual == NULL) and ExecSeqScanNoProj() (for projInfo == NULL). These call a local version of ExecScan() that lives in nodeSeqScan.c, marked always-inline. This local copy takes qual and projInfo as arguments, letting compilers inline and optimize unnecessary branches away. For (2), the local ExecScan() copy avoids the generic ExecScanFetch() logic, simplifying things further when EvalPlanQual() doesn’t apply. That has the additional benefit of allowing SeqNext() to be called directly instead of via an indirect function pointer. This reduces the overhead of indirect calls and enables better compiler optimizations like inlining. Junwang Zhao helped with creating a benchmark to test the patch, the results of which can be accessed in the spreadsheet at [1]. The results show that the patch makes the latency of queries of shape `SELECT agg(somecol or *) FROM big_table WHERE ` generally faster with up to 5% improvement in some cases. Would love to hear thoughts. -- Thanks, Amit Langote [1] https://docs.google.com/spreadsheets/d/1AsJOUgIfSsYIJUJwbXk4aO9FVOFOrBCvrfmdQYkHIw4/edit?usp=sharing v1-0001-Introduce-optimized-ExecSeqScan-variants-for-tail.patch Description: Binary data
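To make the shape of (1) and (2) concrete, here is a sketch of the dispatch being described. Function and field names are illustrative; the posted patch may differ:

/*
 * Local, always-inlined copy of ExecScan() that takes qual/projInfo as
 * arguments and calls SeqNext() directly rather than going through
 * ExecScanFetch().  When a caller passes constant NULLs, the compiler
 * can delete the corresponding branches.
 */
static pg_attribute_always_inline TupleTableSlot *
ExecSeqScanImpl(PlanState *pstate, ExprState *qual,
                ProjectionInfo *projInfo);

/* Specialized entry points installed once, at plan-init time: */
static TupleTableSlot *
ExecSeqScanNoQual(PlanState *pstate)
{
    return ExecSeqScanImpl(pstate, NULL, pstate->ps_ProjInfo);
}

static TupleTableSlot *
ExecSeqScanNoProj(PlanState *pstate)
{
    return ExecSeqScanImpl(pstate, pstate->qual, NULL);
}

/* ...and in ExecInitSeqScan(), choose the variant one time: */
if (scanstate->ss.ps.qual == NULL)
    scanstate->ss.ps.ExecProcNode = ExecSeqScanNoQual;
else if (scanstate->ss.ps.ps_ProjInfo == NULL)
    scanstate->ss.ps.ExecProcNode = ExecSeqScanNoProj;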
downgrade some aclchk.c errors to internal
In aclchk.c, there are a few error messages that use ereport() but it seems like they should be internal error messages. Moreover, they are using get_object_class_descr(), which is only meant for internal errors. (For example, it does not have translation support.) I dug through this and it seems like these errors are indeed not or no longer user-facing, so we can downgrade them to internal. See commit messages in the attached patches for further explanations.From b9a2945b367bc32f1a1dda421a72e176fecacdda Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Fri, 20 Dec 2024 10:10:19 +0100 Subject: [PATCH 1/2] Downgrade errors in object_ownercheck() to internal The "does not exist" errors in object_ownership() were written as ereport(), suggesting that they are user-facing. But no code path except one can reach this function without first checking that the object exists. If this were actually a user-facing error message, then there would be some problems: get_object_class_descr() is meant to be for internal errors only and does not support translation. The one case that can reach this without first checking the object existence is from be_lo_unlink(). (This makes some sense since large objects are referred to by their OID directly.) In this one case, we can add a line of code to check the object existence explicitly, consistent with other LO code. For the rest, downgrade the error messages to elog()s. The new message wordings are the same as in DropObjectById(). --- src/backend/catalog/aclchk.c | 10 -- src/backend/libpq/be-fsstubs.c | 5 + 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c index aaf96a965e4..840122dca44 100644 --- a/src/backend/catalog/aclchk.c +++ b/src/backend/catalog/aclchk.c @@ -4082,9 +4082,8 @@ object_ownercheck(Oid classid, Oid objectid, Oid roleid) tuple = SearchSysCache1(cacheid, ObjectIdGetDatum(objectid)); if (!HeapTupleIsValid(tuple)) - ereport(ERROR, - (errcode(ERRCODE_UNDEFINED_OBJECT), -errmsg("%s with OID %u does not exist", get_object_class_descr(classid), objectid))); + elog(ERROR, "cache lookup failed for %s %u", +get_object_class_descr(classid), objectid); ownerId = DatumGetObjectId(SysCacheGetAttrNotNull(cacheid, tuple, @@ -4113,9 +4112,8 @@ object_ownercheck(Oid classid, Oid objectid, Oid roleid) tuple = systable_getnext(scan); if (!HeapTupleIsValid(tuple)) - ereport(ERROR, - (errcode(ERRCODE_UNDEFINED_OBJECT), -errmsg("%s with OID %u does not exist", get_object_class_descr(classid), objectid))); + elog(ERROR, "could not find tuple for %s %u", +get_object_class_descr(classid), objectid); ownerId = DatumGetObjectId(heap_getattr(tuple, get_object_attnum_owner(classid), diff --git a/src/backend/libpq/be-fsstubs.c b/src/backend/libpq/be-fsstubs.c index 27d317dfdc0..66bec869cbc 100644 --- a/src/backend/libpq/be-fsstubs.c +++ b/src/backend/libpq/be-fsstubs.c @@ -317,6 +317,11 @@ be_lo_unlink(PG_FUNCTION_ARGS) PreventCommandIfReadOnly("lo_unlink()"); + if (!LargeObjectExists(lobjId)) + ereport(ERROR, + (errcode(ERRCODE_UNDEFINED_OBJECT), +errmsg("large object %u does not exist", lobjId))); + /* * Must be owner of the large object. 
It would be cleaner to check this * in inv_drop(), but we want to throw the error before not after closing base-commit: 8ac0021b6f10928a46b7f3d1b25bc21c0ac7f8c5 -- 2.47.1 From 320a62cd669cc7a43cab950ed60f956984cd65b4 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Fri, 20 Dec 2024 10:48:20 +0100 Subject: [PATCH 2/2] Downgrade error in object_aclmask_ext() to internal The "does not exist" error in object_aclmask_ext() was written as ereport(), suggesting that it is user-facing. But this has not been true since commit 403ac226ddd. So we can downgrade this to a normal "cache lookup failed" internal error. If this were actually a user-facing error message, then there would be some problems: get_object_class_descr() is meant to be for internal errors only and does not support translation. --- src/backend/catalog/aclchk.c | 10 ++ 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c index 840122dca44..25b15027447 100644 ---
Re: Fix crash when non-creator being an iteration on shared radix tree
On Fri, Dec 20, 2024 at 4:12 AM Masahiko Sawada wrote:
>
> On Wed, Dec 18, 2024 at 10:32 PM John Naylor wrote:
> > 2. The iter_context is separate because the creator's new context
> > could be a bump context which doesn't support pfree. But above we
> > assume we can pfree in the caller's context. Also, IIUC we only
> > allocate small iter objects, and it'd be unusual to need more than one
> > at a time per backend, so it's a bit strange to have an entire context
> > for that. Since we use a standard pattern of "begin; while(iter);
> > end;", it seems unlikely that someone will cause a leak because of a
> > coding mistake in iteration.

v3-0001 allocates the iter data in the caller's context. It's a small patch, but still a core behavior change, so proposed for master-only. I believe your v1 is still the least invasive fix for PG17.

> > If these tiny admin structs were always, not sometimes, in the callers
> > current context, I think it would be easier to reason about because
> > then the creator's passed context would be used only for local memory,
> > specifically only for leaves and the inner node child contexts.

0002 does this.

> Fair points. Given that we need only one iterator at a time per
> backend, it would be simpler if the caller passes the pointer to an
> iterator that is a stack variable to RT_BEGIN_ITEREATE(). For example,
> TidStoreBeginIterate() would be like:
>
> if (TidStoreIsShared(ts))
>     shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared);
> else
>     local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.shared);

Hard for me to tell if it'd be simpler.

> > 3. I was never a fan of trying to second-guess the creator's new
> > context and instead use slab for fixed-sized leaf allocations. If the
> > creator passes a bump context, we say "no, no, no, use slab -- it's
> > good for ya". Let's assume the caller knows what they're doing.
>
> That's a valid argument but how can a user use the slab context for
> leaf allocations?

It's trivial after 0001-02: 0003 makes one test use slab as the passed context (only 32-bit systems would actually use it currently).

Also, with a bit more work we could allow a NULL context for when the caller has purposely arranged to use pointer-sized values. Did you see any of Heikki's CSN patches? There is a radix tree used as a cache in a context where the tree could be created and destroyed frequently.
Something about the memory blocks seems to have tickled some bad case in the glibc allocator, and one less context might be good insurance: https://www.postgresql.org/message-id/718d1788-b058-40e6-bc37-8f15612b5646%40iki.fi -- John Naylor Amazon Web Services From cd5f4e63dd7a49a0884249e55bf2b78293faa4b9 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Fri, 20 Dec 2024 12:01:50 +0700 Subject: [PATCH v3 1/3] Use caller's current memory context for radix tree iteration --- src/include/lib/radixtree.h | 15 +++ 1 file changed, 3 insertions(+), 12 deletions(-) diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h index 1301f3fee4..8fe2302652 100644 --- a/src/include/lib/radixtree.h +++ b/src/include/lib/radixtree.h @@ -719,7 +719,6 @@ struct RT_RADIX_TREE /* leaf_context is used only for single-value leaves */ MemoryContextData *leaf_context; #endif - MemoryContextData *iter_context; }; /* @@ -1836,14 +1835,6 @@ RT_CREATE(MemoryContext ctx) tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE)); tree->context = ctx; - /* - * Separate context for iteration in case the tree context doesn't support - * pfree - */ - tree->iter_context = AllocSetContextCreate(ctx, - RT_STR(RT_PREFIX) "_radix_tree iter context", - ALLOCSET_SMALL_SIZES); - #ifdef RT_SHMEM tree->dsa = dsa; dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL)); @@ -2075,7 +2066,8 @@ RT_FREE(RT_RADIX_TREE * tree) /* ITERATION */ /* - * Create and return the iterator for the given radix tree. + * Create and return the iterator for the given radix tree, + * in the caller's current memory context. * * Taking a lock in shared mode during the iteration is the caller's * responsibility. @@ -2086,8 +2078,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE * tree) RT_ITER*iter; RT_CHILD_PTR root; - iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context, - sizeof(RT_ITER)); + iter = (RT_ITER *) palloc0(sizeof(RT_ITER)); iter->tree = tree; Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root)); -- 2.47.1 From 79d22b266f0df433dcb9147d0d64db53c681f6c7 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Fri, 20 Dec 2024 14:48:24 +0700 Subject: [PATCH v3 3/3] Always use the caller-provided context for radix tree leaves Previously, RT_CREATE would create an additional slab context if the values were fixed-length and larger than pointer size. Commit X arranged so that the caller-provided context is only used for leaves and nothing else, so if we ove
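For context, this is the standard iteration pattern in question, shown with the template's generic names; after 0001 the RT_ITER allocation simply lands in whatever memory context is current at the call site:

RT_ITER    *iter;
uint64      key;
RT_VALUE_TYPE *value;

iter = RT_BEGIN_ITERATE(tree);          /* palloc0() in the caller's context */

while ((value = RT_ITERATE_NEXT(iter, &key)) != NULL)
    process(key, value);                /* hypothetical consumer */

RT_END_ITERATE(iter);                   /* now a plain pfree() */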
Re: Make tuple deformation faster
On Fri, 20 Dec 2024 at 23:04, David Rowley wrote: > I've now pushed the 0001 patch and the 0002 patch as one commit. I'm > going to give the buildfarm a bit of time then push the commit to > remove pg_attribute.attcacheoff. I'm planning on letting that settle > overnight then if all is well push the attalignby changes. The attcacheoff removal is now pushed. I've attached the two remaining patches. David v8-0001-Optimize-alignment-calculations-in-tuple-form-def.patch Description: Binary data v8-0002-Speedup-tuple-deformation-with-additional-functio.patch Description: Binary data
Re: pure parsers and reentrant scanners
Peter Eisentraut writes:
> On 20.12.24 02:07, Tom Lane wrote:
>> I noticed that lapwing is bleating about
>> cubescan.c:1689:5: warning: no previous prototype for 'cube_yyget_column' [-Wmissing-prototypes]
>> cubescan.c:1765:6: warning: no previous prototype for 'cube_yyset_column' [-Wmissing-prototypes]
>> and likewise in segscan.c. lapwing is using flex 2.5.35, so probably
>> this is the same bug worked around in parser/scan.l:

> Ok, we can fix that, but maybe this is also a good moment to think about
> whether that is useful. I could not reproduce the issue with flex
> 2.5.39. I could find no download of flex 2.5.35. The github site only
> offers versions back to 2.5.39, the sourceforge site back to 2.5.36. lapwing
> says it's Debian 7.0, which went out of support in 2016 and out of
> super-duper-extended support in 2020. It also doesn't have a supported
> OpenSSL version anymore, and IIRC, it has a weird old compiler that
> occasionally gives bogus warnings. I think it's time to stop supporting
> this.

OK, that's fair. I do see lapwing called out a lot in the commit log, though it's not clear how much of that is about 32-bitness and how much about old tools. It's surely still valuable to have i386 machines in the buildfarm, but I agree that supporting unobtainable tool versions is a bit much. Could we get that animal updated to some newer OS version?

Presumably, we should also rip out the existing yyget_column and yyset_column kluges in:

src/backend/parser/scan.l:      extern int  core_yyget_column(yyscan_t yyscanner);
src/bin/psql/psqlscanslash.l:   extern int  slash_yyget_column(yyscan_t yyscanner);
src/bin/pgbench/exprscan.l:     extern int  expr_yyget_column(yyscan_t yyscanner);
src/fe_utils/psqlscan.l:        extern int  psql_yyget_column(yyscan_t yyscanner);

regards, tom lane
Re: pure parsers and reentrant scanners
On Fri, Dec 20, 2024 at 11:23 PM Tom Lane wrote: > > Could we get that animal updated to > some newer OS version? There is already adder animal that is running debian sid on i386. The only remaining interest in lapwing is to have older versions of everything, so if that's useless I can just trash that vm.
Re: pure parsers and reentrant scanners
Julien Rouhaud writes: > On Fri, Dec 20, 2024 at 11:23 PM Tom Lane wrote: >> Could we get that animal updated to >> some newer OS version? > There is already adder animal that is running debian sid on i386. The > only remaining interest in lapwing is to have older versions of > everything, so if that's useless I can just trash that vm. Hmm, sid is the opposite extreme no? Maybe switching lapwing to whatever is currently the oldest supported Debian release would be a good answer. regards, tom lane
Re: Speed up ICU case conversion by using ucasemap_utf8To*()
On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote: > SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE > "sv-SE-x-icu") FROM generate_series(1, 100) i); > > master: ~540 ms > Patched: ~460 ms > glibc: ~410 ms It looks like you are opening and closing the UCaseMap object each time. Why not save it in pg_locale_t? That should speed it up even more and hopefully beat libc. Also, to support older ICU versions consistently, we need to fix up the locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out that logic? Regards, Jeff Davis
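Something along these lines, perhaps; the cached field in pg_locale_t is hypothetical, but ucasemap_open() is the real ICU API, and opening once per locale amortizes the setup cost across calls:

/* Sketch: open the UCaseMap once per pg_locale_t and reuse it. */
static UCaseMap *
get_ucasemap(pg_locale_t locale)
{
    if (locale->info.icu.ucasemap == NULL)      /* hypothetical field */
    {
        UErrorCode  status = U_ZERO_ERROR;

        locale->info.icu.ucasemap = ucasemap_open(locale->info.icu.locale,
                                                  0, &status);
        if (U_FAILURE(status))
            ereport(ERROR,
                    (errmsg("ucasemap_open failed: %s", u_errorName(status))));
    }
    return locale->info.icu.ucasemap;
}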
Re: Memory leak in WAL sender with pgoutput (v10~)
On Thu, Dec 19, 2024 at 6:31 PM Michael Paquier wrote: > > On Thu, Dec 19, 2024 at 09:27:04AM -0800, Masahiko Sawada wrote: > > On Wed, Dec 18, 2024 at 6:26 PM Michael Paquier wrote: > >> v2 does not have these weaknesses by design. > > > > I agree that v2 is better than v3 in terms of that. > > Okay. In terms of the backbranches, would you prefer that I handle > this patch myself as I have done the HEAD part? This would need a > second, closer, review but I could do that at the beginning of next > week. Thanks. Please proceed with this fix as you've already fixed the HEAD part. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Discussion on a LISTEN-ALL syntax
Trey Boudreau writes:
> My first pass at the documentation looks like this:
>
> The special wildcard * cancels all listener registrations for the
> current session and replaces them with a virtual registration that
> matches all channels. Further LISTEN and UNLISTEN channel commands
> will be ignored until the session sees the UNLISTEN * command.

Hmph. After thinking about it a bit I have a different idea (and I see David has yet a third one). So maybe this is more contentious than it seems. But at any rate, I have two fundamental thoughts:

* "Listen to all but X" seems like a reasonable desire.

* The existing implementation already has the principle that you can't listen to a channel more than once; that is,

LISTEN foo;
LISTEN foo; -- this is a no-op, not a duplicate subscription

Therefore I propose:

* "LISTEN *" wipes away all previous listen state, and sets up a state where you're listening to all channels (within your database).

* "UNLISTEN *" wipes away all previous listen state, and sets up a state where you're listening to no channels (which is the same as it does now).

* "LISTEN foo" adds "foo" to what you are listening to, with no effect if you already were listening to foo (whether it was a virtual or explicit listen).

* "UNLISTEN foo" removes "foo" from what you are listening to, with no effect if you already weren't listening to foo.

This is just about the same as the current behavior, and it makes "LISTEN *" act the same as though you had somehow explicitly listed every possible channel. Which I think is a lot cleaner than conceptualizing it as an independent gating behavior, as well as more useful because it'll permit "all but" behavior.

The implementation of this could be something like

struct
{
    bool    all;    /* true if listening to all */
    List   *plus;   /* channels explicitly listened */
    List   *minus;  /* channels explicitly unlistened */
} ListenChannels;

with the proviso that "plus" must be empty if "all" is true, while "minus" must be empty if "all" is false. The two lists are always empty right after LISTEN * or UNLISTEN *, but could be manipulated by subsequent channel-specific LISTEN/UNLISTEN.

(Since only one list would be in use at a time, you could alternatively combine "plus" and "minus" into a single list of exceptions to the all/none state. I suspect that would be confusingly error-prone to code; but perhaps it would turn out elegantly.)

One other thing that needs to be thought about in any case is what the pg_listening_channels() function ought to return in these newly-possible states.

regards, tom lane
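Under that representation the membership test stays cheap; a sketch of what it might look like, assuming the struct above with the invariant that only one list is populated at a time:

static bool
IsListeningOn(const char *channel)
{
    ListCell   *lc;
    bool        listed = false;

    /* with "all" set, "minus" holds exceptions; otherwise "plus" holds grants */
    foreach(lc, ListenChannels.all ? ListenChannels.minus : ListenChannels.plus)
    {
        if (strcmp((const char *) lfirst(lc), channel) == 0)
        {
            listed = true;
            break;
        }
    }

    return ListenChannels.all ? !listed : listed;
}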
Re: Discussion on a LISTEN-ALL syntax
> On Dec 20, 2024, at 2:58 PM, Tom Lane wrote:
> Seems reasonable in the abstract, and given the UNLISTEN * precedent
> it's hard to quibble with that syntax choice. I think what actually
> needs discussing are the semantics, specifically how this'd interact
> with other LISTEN/UNLISTEN actions.

My first pass at the documentation looks like this:

    The special wildcard * cancels all listener registrations for the
    current session and replaces them with a virtual registration that
    matches all channels. Further LISTEN and UNLISTEN channel commands
    will be ignored until the session sees the UNLISTEN * command.

> Explain what you think should
> be the behavior after:
>
> LISTEN foo;
> LISTEN *;
> UNLISTEN *;
> -- are we still listening on foo?

No, as the ‘LISTEN *’ wipes existing registrations.

> LISTEN *;
> LISTEN foo;
> UNLISTEN *;
> -- how about now?

Not listening on ‘foo’ or anything else.

> LISTEN *;
> UNLISTEN foo;
> -- how about now?

‘UNLISTEN foo’ ignored.

> LISTEN *;
> LISTEN foo;
> UNLISTEN foo;
> -- does that make a difference?

‘LISTEN foo’ and ‘UNLISTEN foo’ ignored, leaving only the wildcard.

> I don't have any strong preferences about this, but we ought to
> have a clear idea of the behavior we want before we start coding.

These semantics made sense to me, but I have limited experience and a very specific use case in mind. Changing the behavior of ‘UNLISTEN *’ feels extremely impolite, and if we leave that alone I don’t see using the ‘LISTEN *’ syntax with behavior that leaves other LISTENs in place. We could have a different set of keywords, like LISTEN_ALL/UNLISTEN_ALL, that doesn’t interfere with the existing behavior.

-- Trey
Re: Discussion on a LISTEN-ALL syntax
On Fri, Dec 20, 2024 at 1:58 PM Tom Lane wrote: > Trey Boudreau writes: > > so I'd like to propose a 'LISTEN *' equivalent to 'UNLISTEN *'. > > Seems reasonable in the abstract, and given the UNLISTEN * precedent > it's hard to quibble with that syntax choice. I think what actually > needs discussing are the semantics, specifically how this'd interact > with other LISTEN/UNLISTEN actions. Explain what you think should > be the behavior after: > > Answers premised on the framing explained below: LISTEN foo; > LISTEN *; > UNLISTEN *; > -- are we still listening on foo? > Yes; the channels are orthogonal and thus order doesn't matter. > LISTEN *; > LISTEN foo; > UNLISTEN *; > -- how about now? > Yes > LISTEN *; > UNLISTEN foo; > -- how about now? > The unlisten was a no-op since listen foo was not issued; * receives everything, always. > LISTEN *; > LISTEN foo; > UNLISTEN foo; > -- does that make a difference? > If any notify foo happened in between listen foo and unlisten foo the session would receive the notify message twice - once implicitly via * and once explicitly via foo. Alternatively, the server could see that "foo" is subscribed too for PID listener, send the message and then skip over looking for a * subscription for PID listener. Basically document that we won't send duplicates if both listen * and listen foo are present. > I don't have any strong preferences about this, but we ought to > have a clear idea of the behavior we want before we start coding. > > I'm inclined to make this clearly distinct from the semantics of listen/notify. Both in command form, what is affected, and the message. Something like: MONITOR NOTIFICATION QUEUE; UNMONITOR NOTIFICATION QUEUE; Asynchronous notification "foo" [with payload ...] sent by server process with PID nnn. If you also LISTEN foo you would also receive: Asynchronous notification "foo" [with payload ...] received from server process with PID nnn. Unlisten undoes Listen Unmonitor undoes Monitor Upon session disconnect both Unlisten * and Unmonitor are executed. If we must shoehorn this into the existing syntax and messages I'd still want to say that * is simply a special channel name that the system recognizes and sends all notify messages to. There is no way to limit which messages get sent to you via unlisten and if you also listen to the channel foo explicitly you end up receiving multiple messages. (Alternatively, send it just to foo and have the server not look for a * listen for that specific session.) Adding a "do not send" listing (or option) to the implementation doesn't seem beneficial enough to deal with, and would be the only way: Listen *; Unlisten foo; would be capable of not having foo messages sent to the * subscribing client. In short, a "deny (do not send) all" base posture and then permit-only policies built on top of it. Listen * is the permit-all policy. David J.
Discussion on a LISTEN-ALL syntax
Howdy all,

NOTE: Grey-beard coder, pgsql newbie. All info/tips/suggestions welcome!

I have a use-case where I'd like to LISTEN for all NOTIFY channels. Right now I simply issue a LISTEN for every channel name of interest, but in production the channels will number in the low thousands. The current implementation uses a linked list, and a linear probe through the list of desired channels which will always return true becomes quite expensive at this scale.

I have a work-around available by creating the "ALL" channel and making the payload include the actual channel name (see the sketch after this message), but this has a few drawbacks:

* it does not play nice with clients that actually want a small subset of channels;
* it requires code modification at every NOTIFY;
* it requires extra code on the client side.

The work-around subjects the developer (me :-) to significant risk of foot-gun disease, so I'd like to propose a 'LISTEN *' equivalent to 'UNLISTEN *'. The implementation in src/backend/commands/async.c seems straightforward enough, but it feels prudent to select a syntax that doesn't make some kind of actual pattern matching syntactically ugly in the future. Choosing 'LISTEN *' has a nice symmetry with 'UNLISTEN *', but I don't have enough SQL chops to know whether it causes problems.

If anyone has a better work-around, please speak up! If not, and we can come to some resolution on a future-resistant syntax, I'd happily start working up a patch set.

Thanks,
-- Trey Boudreau
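The workaround in question, sketched with hypothetical channel names: everything is funneled through one catch-all channel and demultiplexed client-side by parsing the payload.

-- every notifier must remember to do this:
NOTIFY all_events, 'accounting.payables:payload goes here';

-- instead of the natural:
NOTIFY "accounting.payables", 'payload goes here';

-- and every client listens to the one channel, then re-parses payloads:
LISTEN all_events;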
Re: Fix crash when non-creator being an iteration on shared radix tree
On Fri, Dec 20, 2024 at 2:27 AM John Naylor wrote: > > On Fri, Dec 20, 2024 at 4:12 AM Masahiko Sawada wrote: > > > > On Wed, Dec 18, 2024 at 10:32 PM John Naylor > > wrote: > > > > 2. The iter_context is separate because the creator's new context > > > could be a bump context which doesn't support pfree. But above we > > > assume we can pfree in the caller's context. Also, IIUC we only > > > allocate small iter objects, and it'd be unusual to need more than one > > > at a time per backend, so it's a bit strange to have an entire context > > > for that. Since we use a standard pattern of "begin; while(iter); > > > end;", it seems unlikely that someone will cause a leak because of a > > > coding mistake in iteration. > > v3-0001 allocates the iter data in the caller's context. It's a small > patch, but still a core behavior change so proposed for master-only. I > believe your v1 is still the least invasive fix for PG17. I agree to use v1 for v17. > > > > If these tiny admin structs were always, not sometimes, in the callers > > > current context, I think it would be easier to reason about because > > > then the creator's passed context would be used only for local memory, > > > specifically only for leaves and the inner node child contexts. > > 0002 does this. > > > Fair points. Given that we need only one iterator at a time per > > backend, it would be simpler if the caller passes the pointer to an > > iterator that is a stack variable to RT_BEGIN_ITEREATE(). For example, > > TidStoreBeginIterate() would be like: > > > > if (TidStoreIsShared(ts)) > > shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared); > > else > >local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.shared); > > Hard for me to tell if it'd be simpler. > > > > 3. I was never a fan of trying to second-guess the creator's new > > > context and instead use slab for fixed-sized leaf allocations. If the > > > creator passes a bump context, we say "no, no, no, use slab -- it's > > > good for ya". Let's assume the caller knows what they're doing. > > > > That's a valid argument but how can a user use the slab context for > > leaf allocations? > > It's trivial after 0001-02: 0003 removes makes one test use slab as > the passed context (only 32-bit systems would actually use it > currently). These changes make sense to me. Here are a few comments: RT_RADIX_TREE has 'leaf_context' but it seems that we use it only for local memory. Do we want to declare only in the !RT_SHMEM case? --- /* * Similar to TidStoreCreateLocal() but create a shared TidStore on a * DSA area. The TID storage will live in the DSA area, and the memory * context rt_context will have only meta data of the radix tree. * * The returned object is allocated in backend-local memory. */ We need to update the comment about rt_context too since we no longer use rt_context for shared tidstore. > > Also, with a bit more work we could allow a NULL context for when the > caller has purposely arranged to use pointer-sized values. Did you see > any of Heikki's CSN patches? There is a radix tree used as a cache in > a context where the tree could be created and destroyed frequently. > Something about the memory blocks seems to have tickled some bad case > in the glibc allocator, and one less context might be good insurance: > > https://www.postgresql.org/message-id/718d1788-b058-40e6-bc37-8f15612b5646%40iki.fi Will check these patches. Thanks! Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: allow changing autovacuum_max_workers without restarting
I think I've been saying I would commit this since August, but now I am planning to do so first thing in the new year. In v11 of the patch, I moved the initial startup WARNING to autovac_init() to avoid repeatedly logging when the launcher restarts (e.g., for emergency vacuums when the autovacuum parameter is disabled). Otherwise, I just made a couple of cosmetic alterations and added a commit message. -- nathan >From dad43c4138c2cc0712006565bb4984c2a211e8fe Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 20 Dec 2024 13:33:26 -0600 Subject: [PATCH v11 1/1] Allow changing autovacuum_max_workers without restarting. This commit introduces a new parameter named autovacuum_worker_slots that controls how many autovacuum worker slots to reserve during server startup. Modifying this new parameter's value does require a server restart, but it should typically be set to the upper bound of what you might realistically need to set autovacuum_max_workers. With that new parameter in place, autovacuum_max_workers can now be changed with a SIGHUP (e.g., pg_ctl reload). If autovacuum_max_workers is set higher than autovacuum_worker_slots, a WARNING is emitted, and the server will only start up to autovacuum_worker_slots workers at a given time. If autovacuum_max_workers is set to a value less than the number of currently-running autovacuum workers, the existing workers will continue running, but no new workers will be started until the number of running autovacuum workers drops below autovacuum_max_workers. Reviewed-by: Sami Imseih, Justin Pryzby, Robert Haas, Andres Freund, Yogesh Sharma Discussion: https://postgr.es/m/20240410212344.GA1824549%40nathanxps13 --- doc/src/sgml/config.sgml | 28 ++- doc/src/sgml/runtime.sgml | 4 +- src/backend/access/transam/xlog.c | 2 +- src/backend/postmaster/autovacuum.c | 82 +++ src/backend/postmaster/pmchild.c | 4 +- src/backend/storage/lmgr/proc.c | 6 +- src/backend/utils/init/postinit.c | 6 +- src/backend/utils/misc/guc_tables.c | 11 ++- src/backend/utils/misc/postgresql.conf.sample | 3 +- src/include/postmaster/autovacuum.h | 1 + src/test/perl/PostgreSQL/Test/Cluster.pm | 1 + 11 files changed, 117 insertions(+), 31 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index fbdd6ce5740..740ff5d5044 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -8630,6 +8630,25 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; + + autovacuum_worker_slots (integer) + + autovacuum_worker_slots configuration parameter + + + + +Specifies the number of backend slots to reserve for autovacuum worker +processes. The default is 16. This parameter can only be set at server +start. + + +When changing this value, consider also adjusting +. + + + + autovacuum_max_workers (integer) @@ -8640,7 +8659,14 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; Specifies the maximum number of autovacuum processes (other than the autovacuum launcher) that may be running at any one time. The default -is three. This parameter can only be set at server start. +is three. This parameter can only be set in the +postgresql.conf file or on the server command line. + + +Note that a setting for this value which is higher than + will have no effect, +since autovacuum workers are taken from the pool of slots established +by that setting. 
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index 94135e9d5ee..537e356315d 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -839,7 +839,7 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such When using System V semaphores, PostgreSQL uses one semaphore per allowed connection (), allowed autovacuum worker process -(), allowed WAL sender process +(), allowed WAL sender process (), allowed background process (), etc., in sets of 16. The runtime-computed parameter @@ -892,7 +892,7 @@ $ postgres -D $PGDATA -C num_os_semaphores When using POSIX semaphores, the number of semaphores needed is the same as for System V, that is one semaphore per allowed connection (), allowed autovacuum worker process -(), allowed WAL sender process +(), allowed WAL sender process (), allowed background process (), etc. On the platforms where this option is preferred, there is no specific diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 6f58412bcab..706f2127def 100644 --- a/src/backend/acces
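The intended workflow once this is in, sketched with arbitrary values (parameter names per the patch):

-- in postgresql.conf, the hard ceiling; changing it requires a restart:
--   autovacuum_worker_slots = 16

-- later, raise the active worker count without a restart:
ALTER SYSTEM SET autovacuum_max_workers = 8;
SELECT pg_reload_conf();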
Re: Discussion on a LISTEN-ALL syntax
On Fri, Dec 20, 2024 at 2:42 PM Trey Boudreau wrote: > > > On Dec 20, 2024, at 2:58 PM, Tom Lane wrote: > > Seems reasonable in the abstract, and given the UNLISTEN * precedent > > it's hard to quibble with that syntax choice. I think what actually > > needs discussing are the semantics, specifically how this'd interact > > with other LISTEN/UNLISTEN actions. > > My first pass at the documentation looks like this: > > > The special wildcard * cancels all listener > registrations for the current session and replaces them with a > virtual registration that matches all channels. Further > LISTEN and UNLISTEN class="parameter">channel commands will > be ignored until the session sees the UNLISTEN * > command. > > I just sent my thoughts here as well. The choice to "cancel all listener registrations" seems unintuitive and unnecessary - so long as we either document or handle deduplication internally. As I noted in my email, * is a permit-all policy in a "deny by default" system. Such a system is allowed to have other more targeted "allow" policies existing at the same time. If the permit-all policy gets removed then those individual allow policies immediately become useful again. If you want to remove those targeted allowed policies execute Unlisten * before executing Listen *. I dislike the non-symmetric meaning of * in the command sequence above but it likely is better than inventing a whole new syntax. David J.
Re: Discussion on a LISTEN-ALL syntax
On Fri, Dec 20, 2024 at 2:42 PM Trey Boudreau wrote: > We could have a different set of keywords, like LISTEN_ALL/UNLISTEN_ALL > that doesn’t interfere with the existing behavior. > > I think we will need something along these lines. We've given * a meaning in UNLISTEN * that doesn't match what this proposal wants to accomplish. I suggested using monitor/unmonitor but I suppose any unquoted symbol or keyword that is invalid as a channel name would work within the Listen/Unlisten syntax. Otherwise I mis-spoke in my previous design since regardless of whether Listen * unregisters existing channels or not Unlisten * will remove everything and leave the session back at nothing. In which case you might as well just remove the redundant channel listeners. David J.
wrong comments in ClassifyUtilityCommandAsReadOnly
hi.

/*
 * Determine the degree to which a utility command is read only.
 *
 * Note the definitions of the relevant flags in src/include/utility/tcop.h.
 */
static int
ClassifyUtilityCommandAsReadOnly(Node *parsetree)

Is the comment wrong? It should be "* Note the definitions of the relevant flags in src/include/tcop/utility.h."
Re: Discussion on a LISTEN-ALL syntax
On 20/12/2024 23:45, Tom Lane wrote: Don't think that quite flies. We might have to regurgitate the state explicitly: LISTEN * UNLISTEN foo.* LISTEN foo.bar.* showing that we're listening to channels foo.bar.*, but not other channels beginning "foo", and also to all channels not beginning "foo". Could I perhaps propose a sort of wildmat[1] syntax? The above sequence could be expressed simply as: LISTEN *,!foo.*,foo.bar.* I would like this in psql's backslash commands, too. [1] https://en.wikipedia.org/wiki/Wildmat -- Vik Fearing
Re: Fix logging for invalid recovery timeline
> On 20 Dec 2024, at 20:37, David Steele wrote:
>
> "Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X."

I think the errdetail here is very hard to follow. I seem to understand what is going on after reading the errmsg, but the errdetail makes me uncertain.

If we call "tliSwitchPoint(CheckPointTLI, expectedTLEs, NULL);" don't we risk hitting

ereport(ERROR,
        (errmsg("requested timeline %u is not in this server's history", tli)));

again?

Best regards, Andrey Borodin.
Re: Discussion on a LISTEN-ALL syntax
On 21/12/2024 05:23, Tom Lane wrote:
> Vik Fearing writes:
>> Could I perhaps propose a sort of wildmat[1] syntax? The above
>> sequence could be expressed simply as:
>> LISTEN *,!foo.*,foo.bar.*
>
> That doesn't absolve you from having to say what happens if the user
> then issues another "LISTEN zed" or "UNLISTEN foo.bar.baz" command.
> We can't break the existing behavior that "LISTEN foo" followed by
> "LISTEN bar" results in listening to both channels. So on the whole
> this seems like it just adds complexity without removing any. I'm
> inclined to limit things to one pattern per LISTEN/UNLISTEN command,
> with more complex behaviors reached by issuing a sequence of commands.

Fair enough.

--
Vik Fearing
Re: Discussion on a LISTEN-ALL syntax
Vik Fearing writes:
> Could I perhaps propose a sort of wildmat[1] syntax?
> The above sequence could be expressed simply as:
> LISTEN *,!foo.*,foo.bar.*

That doesn't absolve you from having to say what happens if the user then issues another "LISTEN zed" or "UNLISTEN foo.bar.baz" command. We can't break the existing behavior that "LISTEN foo" followed by "LISTEN bar" results in listening to both channels. So on the whole this seems like it just adds complexity without removing any. I'm inclined to limit things to one pattern per LISTEN/UNLISTEN command, with more complex behaviors reached by issuing a sequence of commands.

regards, tom lane
Re: Fix crash when non-creator being an iteration on shared radix tree
On Sat, Dec 21, 2024 at 2:17 AM Masahiko Sawada wrote:
>
> On Fri, Dec 20, 2024 at 2:27 AM John Naylor wrote:
> > v3-0001 allocates the iter data in the caller's context. It's a small
> > patch, but still a core behavior change so proposed for master-only. I
> > believe your v1 is still the least invasive fix for PG17.
>
> I agree to use v1 for v17.

Okay, did you want to commit that separately, or together with my 0001 on master? Either way, I've put a bit more effort into the commit message in v4 and adjusted the comment slightly.

> > It's trivial after 0001-02: 0003 makes one test use slab as
> > the passed context (only 32-bit systems would actually use it
> > currently).
>
> These changes make sense to me. Here are a few comments:
>
> RT_RADIX_TREE has 'leaf_context' but it seems that we use it only for
> local memory. Do we want to declare only in the !RT_SHMEM case?

That's already the case, if I understand your statement correctly.

> ---
> /*
>  * Similar to TidStoreCreateLocal() but create a shared TidStore on a
>  * DSA area. The TID storage will live in the DSA area, and the memory
>  * context rt_context will have only meta data of the radix tree.
>  *
>  * The returned object is allocated in backend-local memory.
>  */
>
> We need to update the comment about rt_context too since we no longer
> use rt_context for shared tidstore.

Fixed. BTW, it seems TidStore.context is unused?

--
John Naylor
Amazon Web Services

From 679cf5c305b3f0affb5c4a679cf18eb0674e8691 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Fri, 20 Dec 2024 14:48:24 +0700
Subject: [PATCH v4 3/3] Always use the caller-provided context for radix tree leaves

Previously, it may not have worked for a caller to pass a slab context,
since it would have been also used for other things which may have
incompatible size. In an attempt to helpfully avoid wasting space due
to aset's power-of-two rounding, RT_CREATE would create an additional
slab context if the values were fixed-length and larger than pointer
size. The problem was, we have since added the bump context type, and
the generation context was a possibility as well, so silently
overriding the caller's choice is not friendly. Commit X arranged so
that the caller-provided context is only used for leaves and nothing
else, so it's safe for the caller to use slab here if they wish. As
demonstration, use slab in one of the radix tree regression tests.

Reviewed by Masahiko Sawada
---
 src/include/lib/radixtree.h                       | 14 --
 src/test/modules/test_radixtree/test_radixtree.c  |  5 +++--
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 97aff227c5..fc7474e0c7 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1849,21 +1849,7 @@ RT_CREATE(MemoryContext ctx)
 							   size_class.allocsize);
 	}

-	/* By default we use the passed context for leaves. */
 	tree->leaf_context = ctx;
-
-#ifndef RT_VARLEN_VALUE_SIZE
-
-	/*
-	 * For leaves storing fixed-length values, we use a slab context to avoid
-	 * the possibility of space wastage by power-of-2 rounding up.
-	 */
-	if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
-		tree->leaf_context = SlabContextCreate(ctx,
-											   RT_STR(RT_PREFIX) "_radix_tree leaf context",
-											   RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
-											   sizeof(RT_VALUE_TYPE));
-#endif							/* !RT_VARLEN_VALUE_SIZE */
 #endif							/* RT_SHMEM */

 	/* add root node now so that RT_SET can assume it exists */
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 8074b83695..5c54962edf 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -313,9 +313,10 @@ test_random(void)
 #else
 	MemoryContext radixtree_ctx;

-	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+	radixtree_ctx = SlabContextCreate(CurrentMemoryContext,
 									  "test_radix_tree",
-									  ALLOCSET_SMALL_SIZES);
+									  SLAB_DEFAULT_BLOCK_SIZE,
+									  sizeof(TestValueType));

 	radixtree = rt_create(radixtree_ctx);
 #endif
--
2.47.1

From 46056af82a7516250b5cdb817579aeeb3952b8de Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Sat, 21 Dec 2024 10:55:31 +0700
Subject: [PATCH v4 1/3] Get rid of radix tree's separate iterator context

Instead, just use the caller's memory context. This relieves backends
from having to create this context when they attach to a tree in shared
memory, and delete when they detach. The iterator state is freed when
ending iteration, and even if a developer failed to follow existing
examples of iteration, the state struct is small enough to pose low
risk.
---
 src/include/lib/radixtree.h | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 1301f3fe
Re: Add XMLNamespaces to XMLElement
Hi

On Sat, Dec 21, 2024 at 0:51, Jim Jones wrote:
> Hi,
>
> I'd like to propose the implementation of the XMLNamespaces option for
> XMLElement.
>
> XMLNAMESPACES(nsuri AS nsprefix)
> XMLNAMESPACES(DEFAULT default-nsuri)
> XMLNAMESPACES(NO DEFAULT)
>
> * nsprefix: Namespace's prefix.
> * nsuri: Namespace's URI.
> * DEFAULT default-nsuri: Specifies the DEFAULT namespace to use within
>   the scope of a namespace declaration.
> * NO DEFAULT: Specifies that NO DEFAULT namespace is to be used within
>   the scope of a namespace declaration.
>
> This basically works pretty much like XMLAttributes, but with a few more
> restrictions (see SQL/XML:2023, 11.2):
>
> * An XML namespace declaration shall contain at most one DEFAULT
>   namespace declaration item.
> * No namespace prefix shall be equivalent to xml or xmlns.
> * No namespace URI shall be identical to http://www.w3.org/2000/xmlns/
>   or to http://www.w3.org/XML/1998/namespace.
> * The value of a namespace URI contained in a regular namespace
>   declaration item (no DEFAULT) shall not be a zero-length string.
>
> Examples:
>
> SELECT xmlelement(NAME "foo", xmlnamespaces('http://x.y' AS bar));
>           xmlelement
> -------------------------------
>  <foo xmlns:bar="http://x.y"/>
>
> SELECT xmlelement(NAME "foo", xmlnamespaces(DEFAULT 'http://x.y'));
>         xmlelement
> ---------------------------
>  <foo xmlns="http://x.y"/>
>
> SELECT xmlelement(NAME "foo", xmlnamespaces(NO DEFAULT));
>    xmlelement
> -----------------
>  <foo xmlns=""/>
>
> In transformXmlExpr() it seemed convenient to use the same parameters to
> store the prefixes and URIs as in XMLAttributes (arg_names and
> named_args), but I am still not so sure it is the right approach. Is
> there perhaps a better way?
>
> Any thoughts? Feedback welcome!

+1

Pavel

> Best,
> Jim
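If the scoping works the way SQL/XML describes, a namespace declared on an element should also govern nested content. A sketch of what that might look like under the proposed patch (the output shown is hypothetical, not taken from the patch's tests):

    SELECT xmlelement(NAME "foo",
                      xmlnamespaces(DEFAULT 'http://x.y'),
                      xmlelement(NAME "bar"));
                  xmlelement
    ---------------------------------------
     <foo xmlns="http://x.y"><bar/></foo>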