Re: Fixing memory leaks in postgres_fdw
Etsuro Fujita writes: > On Sat, May 24, 2025 at 10:10 AM Tom Lane wrote: >> I thought of fixing this by using a memory context reset callback >> to ensure that the PGresult is cleaned up when the executor's context >> goes away, and that seems to work nicely (see 0001 attached). >> However, I feel like this is just a POC, because now that we have that >> concept we might be able to use it elsewhere in postgres_fdw to >> eliminate most or even all of its reliance on PG_TRY. That should be >> faster as well as much less bug-prone. But I'm putting it up at this >> stage for comments, in case anyone thinks this is not the direction to >> head in. > I think that that is a good idea; +1 for removing the reliance not > only in DirectModify but in other places. I think that that would be > also useful if extending batch INSERT to cases with RETURNING data in > postgres_fdw. Here is an attempt at making a bulletproof fix by having all backend users of libpq go through a wrapper layer that provides the memory context callback. Perhaps this is more code churn than we want to accept; I'm not sure. I thought about avoiding most of the niggling code changes by adding #define PGresult BEPGresult #define PQclear BEPQclear #define PQresultStatus BEPQresultStatus and so forth at the bottom of the new header file, but I'm afraid that would create a lot of confusion. There is a lot yet to do towards getting rid of no-longer-needed PG_TRYs and other complication, but I decided to stop here pending comments on the notational decisions I made. One point that people might find particularly dubious is that I put the new stuff into a new header file libpq-be-fe.h, rather than adding it to libpq-be-fe-helpers.h which would seem more obvious. The reason for that is the code layout in postgres_fdw. postgres_fdw.h needs to include libpq-fe.h to get the PGresult typedef, and with these changes it instead needs to get BEPGresult. But only connection.c currently includes libpq-be-fe-helpers.h, and I didn't like the idea of making all of postgres_fdw's .c files include that. Maybe that's not worth worrying about though. The 0002 patch is the same as before. regards, tom lane From 2fc53191595be88592fcd83b8b78c5849b623049 Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Sun, 25 May 2025 15:16:23 -0400 Subject: [PATCH v2 1/2] Fix memory leakage in postgres_fdw's DirectModify code path. postgres_fdw tries to use PG_TRY blocks to ensure that it will eventually free the PGresult created by the remote modify command. However, it's fundamentally impossible for this scheme to work reliably when there's RETURNING data, because the query could fail in between invocations of postgres_fdw's DirectModify methods. There is at least one instance of exactly this situation in the regression tests, and the ensuing session-lifespan leak is visible under Valgrind. We can improve matters by using a memory context reset callback attached to the ExecutorState context. That ensures that the PGresult will be freed when the ExecutorState context is torn down, even if control never reaches postgresEndDirectModify. Also, by switching to this approach, we can eliminate some PG_TRY overhead since it's no longer necessary to be so cautious about errors. This patch adds infrastructure that wraps each PGresult in a "BEPGresult" that provides the reset callback. Code using this abstraction is inherently memory-safe (as long as it attaches the reset callbacks to an appropriate memory context). 
The amount of notational churn is slightly annoying, but I think that it's probably worth it to forestall future leaks. So far I created the infrastructure and made relevant code use it, but I have not done very much towards making the caller simplifications that should now be possible. I did remove a few PQclear calls in error-exit paths that are now no longer necessary. But we should be able to remove a lot of uses of PG_TRY and "volatile" variable markings, as well as simplify error handling, now that it's not necessary to be so paranoid about freeing PGresults before throwing an error. This state of the patch is good for reviewing the notational choices I made, though. Author: Tom Lane Discussion: https://postgr.es/m/2976982.1748049...@sss.pgh.pa.us --- contrib/dblink/dblink.c | 159 +++--- contrib/postgres_fdw/connection.c | 54 ++--- contrib/postgres_fdw/postgres_fdw.c | 186 contrib/postgres_fdw/postgres_fdw.h | 10 +- .../libpqwalreceiver/libpqwalreceiver.c | 145 + src/backend/utils/mmgr/mcxt.c | 39 +++- src/include/libpq/libpq-be-fe-helpers.h | 68 +++--- src/include/libpq/libpq-be-fe.h | 203 ++ src/include/utils/palloc.h| 2 + src/tools/pgindent/typedefs.list | 1 + 10 files changed, 531 insertions(+), 336 deletions(-) create mode 10
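For readers not familiar with the mechanism being discussed, the reset-callback idea boils down to something like the minimal sketch below. This is illustrative only: the wrapper struct and function names are invented here, and the actual patch additionally deals with no-OOM allocation and with releasing results early via its wrapped PQclear.

#include "postgres.h"
#include "libpq-fe.h"

/* Illustrative wrapper: tie a PGresult's lifetime to a memory context. */
typedef struct PGresultHolder
{
	PGresult   *res;
	MemoryContextCallback cb;
} PGresultHolder;

static void
pgresult_holder_reset(void *arg)
{
	PGresultHolder *holder = (PGresultHolder *) arg;

	if (holder->res)
		PQclear(holder->res);
	holder->res = NULL;
}

/*
 * Fetch the next PGresult and arrange for it to be PQclear'd automatically
 * when "context" (e.g. the ExecutorState context) is reset or deleted,
 * even if an error is thrown before the caller gets to clean up.
 */
static PGresult *
get_result_in_context(PGconn *conn, MemoryContext context)
{
	PGresultHolder *holder = MemoryContextAlloc(context, sizeof(PGresultHolder));

	holder->res = NULL;
	holder->cb.func = pgresult_holder_reset;
	holder->cb.arg = holder;
	MemoryContextRegisterResetCallback(context, &holder->cb);

	holder->res = PQgetResult(conn);
	return holder->res;
}

Real code would also want a cheap way to free the result in the normal (non-error) path, which is what the patch's redirected PQclear provides.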
Re: MERGE issues around inheritance
On Sun, 25 May 2025 at 13:41, Dean Rasheed wrote: > > On Sun, 25 May 2025 at 13:06, Tender Wang wrote: > > > > For a partitioned table, we must pass rootResultRelInfo to ExecInsert(). I > > added the check before calling ExecInsert() > > If it is a partitioned table, we continue to pass rootResultRelInfo. > > Otherwise, we pass resultRelInfo. > > Please see the attached diff file. The patch passed all regression test > > cases. > > > > No, I don't think that's the right fix. I'm looking at it now, and I > think I have a fix, but it's more complicated than that. I'll post an > update later. > The reason that MERGE must pass rootResultRelInfo to ExecInsert() for a plain-inheritance table dates back to 387f9ed0a08. As that commit demonstrates, it is possible for the parent to be excluded from the plan, and so all of the entries in the resultRelInfo array may be for different relations than rootResultRelInfo. So there are indeed two related bugs here: 1. The projection built by ExecInitMerge() in the INSERT case may use a tuple slot that's not compatible with the target table. 2. ExecInitModifyTable() does not initialize the WCO lists or RETURNING list for rootResultRelInfo, so those never get executed. As it happens, it is possible to construct cases where (1) causes a crash, even without WCO's or a RETURNING list (see the first test case in the attached patch), so this bug goes all the way back to v15, where MERGE was introduced. So I think we need to do something like the attached. There is perhaps scope to reduce the code duplication between this and ExecInitPartitionInfo(), but that'd likely make the patch bigger, so I think it's best to leave that for now. Regards, Dean From 74db4187272171853b69d4d8c7155a778846f299 Mon Sep 17 00:00:00 2001 From: Dean Rasheed Date: Sun, 25 May 2025 20:00:48 +0100 Subject: [PATCH v1] Fix MERGE into a plain inheritance parent table. When a MERGE's target table is the parent of an inheritance tree, any INSERT actions insert into the parent table using ModifyTableState's rootResultRelInfo. However, there are at least two bugs in the way this is initialized: 1. ExecInitMerge() incorrectly uses a ResultRelInfo entry from the ModifyTableState's resultRelInfo array to build the insert projection, which may not be compatible with the rootResultRelInfo. 2. ExecInitModifyTable() does not fully initialize rootResultRelInfo (ri_WithCheckOptions, ri_WithCheckOptionExprs, ri_returningList, and ri_projectReturning are not initialized). This can lead to crashes, or a failure to check WCO's or to evaluate the RETURNING list for MERGE INSERT actions. Fix both these bugs in ExecInitMerge(), noting that it is only necessary to fully initialize rootResultRelInfo for a MERGE with INSERT actions, when the target table is an inheritance parent. Backpatch to v15, where MERGE was introduced. 
Reported-by: Andres Freund Discussion: https://postgr.es/m/4rlmjfniiyffp6b3kv4pfy4jw3pciy6mq72rdgnedsnbsx7qe5@j5hlpiwdguvc Backpatch-through: 15 --- src/backend/executor/nodeModifyTable.c | 123 - src/test/regress/expected/merge.out| 70 ++ src/test/regress/sql/merge.sql | 49 ++ 3 files changed, 239 insertions(+), 3 deletions(-) diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index 2bc89bf84dc..871c421fd5b 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -64,6 +64,7 @@ #include "nodes/nodeFuncs.h" #include "optimizer/optimizer.h" #include "rewrite/rewriteHandler.h" +#include "rewrite/rewriteManip.h" #include "storage/lmgr.h" #include "utils/builtins.h" #include "utils/datum.h" @@ -3735,6 +3736,7 @@ ExecInitMerge(ModifyTableState *mtstate, EState *estate) switch (action->commandType) { case CMD_INSERT: + /* INSERT actions always use rootRelInfo */ ExecCheckPlanOutput(rootRelInfo->ri_RelationDesc, action->targetList); @@ -3774,9 +3776,23 @@ ExecInitMerge(ModifyTableState *mtstate, EState *estate) } else { - /* not partitioned? use the stock relation and slot */ - tgtslot = resultRelInfo->ri_newTupleSlot; - tgtdesc = RelationGetDescr(resultRelInfo->ri_RelationDesc); + /* + * If the MERGE targets an inherited table, we insert + * into the root table, so we must initialize its + * "new" tuple slot, if not already done, and use its + * relation descriptor for the projection. + * + * For non-inherited tables, rootRelInfo and + * resultRelInfo are the same, and the "new" tuple + * slot will already have been initialized. + */ + if (rootRelInfo->ri_newTupleSlot == NULL) + rootRelInfo->ri_newTupleSlot = +table_slot_create(rootRelInfo->ri_RelationDesc, + &estate->es_tupleTable); + + tgtslot = rootRelInfo->ri_newTupleSlot; + tgtdesc = RelationGetDescr(rootRelInfo->ri_Relatio
Re: MERGE issues around inheritance
Dean Rasheed 于2025年5月26日周一 04:10写道: > On Sun, 25 May 2025 at 13:41, Dean Rasheed > wrote: > > > > On Sun, 25 May 2025 at 13:06, Tender Wang wrote: > > > > > > For a partitioned table, we must pass rootResultRelInfo to > ExecInsert(). I added the check before calling ExecInsert() > > > If it is a partitioned table, we continue to pass rootResultRelInfo. > Otherwise, we pass resultRelInfo. > > > Please see the attached diff file. The patch passed all regression > test cases. > > > > > > > No, I don't think that's the right fix. I'm looking at it now, and I > > think I have a fix, but it's more complicated than that. I'll post an > > update later. > > > > The reason that MERGE must pass rootResultRelInfo to ExecInsert() for > a plain-inheritance table dates back to 387f9ed0a08. As that commit > demonstrates, it is possible for the parent to be excluded from the > plan, and so all of the entries in the resultRelInfo array may be for > different relations than rootResultRelInfo. > Thanks for the information. Passing resultRelInfo to ExecInsert() is wrong. > So there are indeed two related bugs here: > > 1. The projection built by ExecInitMerge() in the INSERT case may use > a tuple slot that's not compatible with the target table. > > 2. ExecInitModifyTable() does not initialize the WCO lists or > RETURNING list for rootResultRelInfo, so those never get executed. > > As it happens, it is possible to construct cases where (1) causes a > crash, even without WCO's or a RETURNING list (see the first test case > in the attached patch), so this bug goes all the way back to v15, > where MERGE was introduced. > > So I think we need to do something like the attached. > I tested the attached patch, and there's no problem anymore. LGTM. > There is perhaps scope to reduce the code duplication between this and > ExecInitPartitionInfo(), but that'd likely make the patch bigger, so I > think it's best to leave that for now. > Agreed. -- Thanks, Tender Wang
Custom GUCs and typos
Hi,

While going through this thread [0] I formed some different views and questions on this problem, so I'm starting a separate thread. This is just an attempt; I might be entirely wrong here.

1) On top of the OP's patch, I added support for warning when the prefix of a custom GUC is invalid. A valid objection to that is: "How do you know that's a bogus prefix? It could perfectly well be a fully valid setting for an extension that the installation doesn't choose to preload." We can answer that using extension_file_exists() (thanks to Greg for proposing this), which tells us whether an extension is installed but not preloaded and therefore a potentially valid owner of the prefix. If the prefix is neither in reserved_class_prefix nor matched by anything in the prefix/share/extension directory, then it is certainly invalid, so I warn and remove the GUC from the hash table.

2) For preloaded extensions, if the suffix/parameter is a typo, or if we try to create a new custom GUC such as "pg_stat_statements.count_something = 1", we currently throw a warning: WARNING: invalid configuration parameter name "pg_stat_statements.count_something", removing it — because "pg_stat_statements" is now a reserved prefix. I take this to mean you can't create a new custom GUC manually; you have to use the "DefineCustomXXXVariable" routines. However, no such warning exists for non-preloaded extensions, so I added the same behavior for them, again checking extension_file_exists(): if the extension has been built/installed, its prefix is potentially valid, so trying to add "plpgsql.ironman = 3000" now warns: WARNING: invalid configuration parameter name "plpgsql.ironman", removing it DETAIL: "plpgsql.ironman" has a valid prefix.

3) I also added all of the above support to the SET command, which previously lacked it.

4) TODO: add support for ALTER SYSTEM SET, if this patch makes sense.

Thoughts?

[0] https://www.postgresql.org/message-id/flat/196f3f8e12f.87666a6c16169.9160742057750715009%40zohocorp.com

-- Thanks, Srinath Reddy Sadipiralla EDB: https://www.enterprisedb.com/

v1-0001-Reject-and-warn-invalid-custom-GUCs.patch Description: Binary data
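To make the proposed check in (1) concrete, the decision logic amounts to roughly the following sketch. It is illustrative only: prefix_could_be_valid() and find_in_reserved_class_prefixes() are made-up names, and treating extension_file_exists() as callable from the GUC code is an assumption taken from the proposal above, not an existing exported API.

/* Sketch only; helper names are hypothetical, per the description above. */
static bool
prefix_could_be_valid(const char *classname)
{
	/* prefixes registered by preloaded extensions are always acceptable */
	if (find_in_reserved_class_prefixes(classname))		/* hypothetical helper */
		return true;

	/* installed but not preloaded: give the prefix the benefit of the doubt */
	if (extension_file_exists(classname))	/* assumed to be made available by the patch */
		return true;

	return false;
}

	/* ... in the placeholder-GUC code path ... */
	if (!prefix_could_be_valid(classname))
	{
		ereport(WARNING,
				(errcode(ERRCODE_INVALID_NAME),
				 errmsg("invalid configuration parameter name \"%s\", removing it",
						name),
				 errdetail("\"%s\" is not a reserved prefix and no installed extension provides it.",
						   classname)));
		/* drop the placeholder from the GUC hash table here */
	}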
Re: POC: Parallel processing of indexes in autovacuum
Hi, On Fri, May 23, 2025 at 6:12 AM Masahiko Sawada wrote: > > On Thu, May 22, 2025 at 12:44 AM Daniil Davydov <3daniss...@gmail.com> wrote: > > > > On Wed, May 21, 2025 at 5:30 AM Masahiko Sawada > > wrote: > > > > > > I find that the name "autovacuum_reserved_workers_num" is generic. It > > > would be better to have a more specific name for parallel vacuum such > > > as autovacuum_max_parallel_workers. This parameter is related to > > > neither autovacuum_worker_slots nor autovacuum_max_workers, which > > > seems fine to me. Also, max_parallel_maintenance_workers doesn't > > > affect this parameter. > > > > This was my headache when I created names for variables. Autovacuum > > initially implies parallelism, because we have several parallel a/v > > workers. > > I'm not sure if it's parallelism. We can have multiple autovacuum > workers simultaneously working on different tables, which seems not > parallelism to me. Hm, I didn't thought about the 'parallelism' definition in this way. But I see your point - the next v4 patch will contain the naming that you suggest. > > > So I think that parameter like > > `autovacuum_max_parallel_workers` will confuse somebody. > > If we want to have a more specific name, I would prefer > > `max_parallel_index_autovacuum_workers`. > > It's better not to use 'index' as we're trying to extend parallel > vacuum to heap scanning/vacuuming as well[1]. OK, I'll fix it. > > > + /* > > > +* If we are running autovacuum - decide whether we need to process > > > indexes > > > +* of table with given oid in parallel. > > > +*/ > > > + if (AmAutoVacuumWorkerProcess() && > > > + params->index_cleanup != VACOPTVALUE_DISABLED && > > > + RelationAllowsParallelIdxAutovac(rel)) > > > > > > I think that this should be done in autovacuum code. > > > > We need params->index cleanup variable to decide whether we need to > > use parallel index a/v. In autovacuum.c we have this code : > > *** > > /* > > * index_cleanup and truncate are unspecified at first in autovacuum. > > * They will be filled in with usable values using their reloptions > > * (or reloption defaults) later. > > */ > > tab->at_params.index_cleanup = VACOPTVALUE_UNSPECIFIED; > > tab->at_params.truncate = VACOPTVALUE_UNSPECIFIED; > > *** > > This variable is filled in inside the `vacuum_rel` function, so I > > think we should keep the above logic in vacuum.c. > > I guess that we can specify the parallel degree even if index_cleanup > is still UNSPECIFIED. vacuum_rel() would then decide whether to use > index vacuuming and vacuumlazy.c would decide whether to use parallel > vacuum based on the specified parallel degree and index_cleanup value. > > > > > > +#define AV_PARALLEL_DEADTUP_THRESHOLD 1024 > > > > > > These fixed values really useful in common cases? I think we already > > > have an optimization where we skip vacuum indexes if the table has > > > fewer dead tuples (see BYPASS_THRESHOLD_PAGES). > > > > When we allocate dead items (and optionally init parallel autocuum) we > > don't have sane value for `vacrel->lpdead_item_pages` (which should be > > compared with BYPASS_THRESHOLD_PAGES). > > The only criterion that we can focus on is the number of dead tuples > > indicated in the PgStat_StatTabEntry. > > My point is that this criterion might not be useful. We have the > bypass optimization for index vacuuming and having many dead tuples > doesn't necessarily mean index vacuuming taking a long time. 
For > example, even if the table has a few dead tuples, index vacuuming > could take a very long time and parallel index vacuuming would help > the situation, if the table is very large and has many indexes. That sounds reasonable. I'll fix it. > > But autovacuum (as I think) should work as stable as possible and > > `unnoticed` by other processes. Thus, we must : > > 1) Compute resources (such as the number of parallel workers for a > > single table's indexes vacuuming) as efficiently as possible. > > 2) Provide a guarantee that as many tables as possible (among > > requested) will be processed in parallel. > > > > (1) can be achieved by calculating the parameters on the fly. > > NUM_INDEXES_PER_PARALLEL_WORKER is a rough mock. I can provide more > > accurate value in the near future. > > I think it requires more things than the number of indexes on the > table to achieve (1). Suppose that there is a very large table that > gets updates heavily and has a few indexes. If users want to avoid the > table from being bloated, it would be a reasonable idea to use > parallel vacuum during autovacuum and it would not be a good idea to > disallow using parallel vacuum solely because it doesn't have more > than 30 indexes. On the other hand, if the table had got many updates > but not so now, users might want to use resources for autovacuums on > other tables. We might need to consider autovacuum frequencies per > table, the statistics of the previous autovacuum, or system loads etc. > So I thi
Re: Non-reproducible AIO failure
On Sun, May 25, 2025 at 3:22 PM Tom Lane wrote: > Thomas Munro writes: > > Can you get a core and print *ioh in the debugger? > > So far, I've failed to get anything useful out of core files > from this failure. The trace goes back no further than > > (lldb) bt > * thread #1 > * frame #0: 0x00018de39388 libsystem_kernel.dylib`__pthread_kill + 8 > > That's quite odd in itself: while I don't find the debugging > environment on macOS to be the greatest, it's not normally > this unhelpful. (And Alexander reported the same off-list.). It's interesting that the elog.c backtrace stuff is able to analyse the stack and it looks normal AFAICS. Could that be interfering with the stack in the core?! I doubt it ... I kinda wonder if the debugger might be confused about libsystem sometimes since it has ceased to be a regular Mach-O file on disk, but IDK; maybe gdb (from MacPorts etc) would offer a clue? So far we have: TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 20931 0 postgres0x000105299c84 ExceptionalCondition + 108 1 postgres0x0001051159ac WaitReadBuffers + 616 2 postgres0x0001053611ec read_stream_next_buffer.cold.1 + 184 3 postgres0x000105111630 read_stream_next_buffer + 300 4 postgres0x000104e0b994 heap_fetch_next_buffer + 136 5 postgres0x000104e018f4 heapgettup_pagemode + 204 Hmm, looking around that code and wracking my brain for things that might happen on one OS but not others, I wonder about partial I/Os. Perhaps combined with some overlapping requests. TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355 0 postgres0x000104f078f4 ExceptionalCondition + 236 1 postgres0x000104c0ebd4 pgaio_io_before_start + 260 2 postgres0x000104c0ea94 pgaio_io_start_readv + 36 3 postgres0x000104c2d4e8 FileStartReadV + 252 4 postgres0x000104c807c8 mdstartreadv + 668 5 postgres0x000104c83db0 smgrstartreadv + 116 But this one seems like a more basic confusion... wild writes somewhere? Hmm, we need to see what's in that struct. If we can't get a debugger to break there or a core file to be analysable, maybe we should try logging as much info as possible at those points to learn a bit more? I would be digging like that myself but I haven't seen this failure on my little M4 MacBook Air yet (Sequoia 15.5, Apple clang-1700.0.13.3). It is infected with corporate security-ware that intercepts at least file system stuff and slows it down and I can't even convince it to dump core files right now. Could you guys please share your exact repro steps?
Re: Fixing memory leaks in postgres_fdw
I wrote: > Here is an attempt at making a bulletproof fix by having all backend > users of libpq go through a wrapper layer that provides the memory > context callback. Perhaps this is more code churn than we want to > accept; I'm not sure. I thought about avoiding most of the niggling > code changes by adding > #define PGresult BEPGresult > #define PQclear BEPQclear > #define PQresultStatus BEPQresultStatus > and so forth at the bottom of the new header file, but I'm afraid > that would create a lot of confusion. I tried that, and it leads to such a less-messy patch that I think we should probably do it this way, confusion or no. One argument that can be made in favor is that we don't really want random notational differences between frontend and backend code that's doing the same thing. Also, I'd been struggling with the assumption that we should palloc the wrapper object before calling PQgetResult; there doesn't seem to be any nice way to make that transparent to callers. I realized that we can make it simple as long as we're willing to assume that allocating with MCXT_ALLOC_NO_OOM can't throw an error. But we assume that in other usages too. Hence, v3 attached. The 0002 patch is still the same as before. regards, tom lane From be4b888d0d2936cb80e63092d14a3133fab590da Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Sun, 25 May 2025 19:49:33 -0400 Subject: [PATCH v3 1/2] Fix memory leakage in postgres_fdw's DirectModify code path. postgres_fdw tries to use PG_TRY blocks to ensure that it will eventually free the PGresult created by the remote modify command. However, it's fundamentally impossible for this scheme to work reliably when there's RETURNING data, because the query could fail in between invocations of postgres_fdw's DirectModify methods. There is at least one instance of exactly this situation in the regression tests, and the ensuing session-lifespan leak is visible under Valgrind. We can improve matters by using a memory context reset callback attached to the ExecutorState context. That ensures that the PGresult will be freed when the ExecutorState context is torn down, even if control never reaches postgresEndDirectModify. Also, by switching to this approach, we can eliminate some PG_TRY overhead since it's no longer necessary to be so cautious about errors. This patch adds infrastructure that wraps each PGresult in a "libpqsrv_PGresult" that provides the reset callback. Code using this abstraction is inherently memory-safe (so long as it attaches the reset callbacks to an appropriate memory context). Furthermore, we add some macros that automatically redirect calls of the libpq functions concerned with PGresults to use this infrastructure, so that almost no source-code changes are needed to wheel this infrastructure into place. So far I created the infrastructure and made relevant code use it, but I have not done very much towards making the caller simplifications that should now be possible. I did remove a few PQclear calls in error-exit paths that are now no longer necessary. But we should be able to remove a lot of uses of PG_TRY and "volatile" variable markings, as well as simplify error handling, now that it's not necessary to be so paranoid about freeing PGresults before throwing an error. I did fix libpqsrv_get_result_last() that way, just as proof of concept. 
Author: Tom Lane Discussion: https://postgr.es/m/2976982.1748049...@sss.pgh.pa.us --- contrib/postgres_fdw/postgres_fdw.c | 19 +- contrib/postgres_fdw/postgres_fdw.h | 2 +- .../libpqwalreceiver/libpqwalreceiver.c | 31 +-- src/backend/utils/mmgr/mcxt.c | 39 ++- src/include/libpq/libpq-be-fe-helpers.h | 59 ++-- src/include/libpq/libpq-be-fe.h | 259 ++ src/include/utils/palloc.h| 2 + src/tools/pgindent/typedefs.list | 1 + 8 files changed, 327 insertions(+), 85 deletions(-) create mode 100644 src/include/libpq/libpq-be-fe.h diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 331f3fc088d..3cc27c5a1e2 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -3779,9 +3779,6 @@ create_cursor(ForeignScanState *node) /* * Get the result, and check for success. - * - * We don't use a PG_TRY block here, so be careful not to throw error - * without releasing the PGresult. */ res = pgfdw_get_result(conn); if (PQresultStatus(res) != PGRES_COMMAND_OK) @@ -4178,9 +4175,6 @@ execute_foreign_modify(EState *estate, /* * Get the result, and check for success. - * - * We don't use a PG_TRY block here, so be careful not to throw error - * without releasing the PGresult. */ res = pgfdw_get_result(fmstate->conn); if (PQresultStatus(res) != @@ -4248,9 +4242,6 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) /* * Get the result, and check for success. - * -
Re: Non-reproducible AIO failure
Thomas Munro writes: > Could you guys please share your exact repro steps? I've just been running 027_stream_regress.pl over and over. It's not a recommendable answer though because the failure probability is tiny, under 1%. It sounded like Alexander had a better way. regards, tom lane
Re: Add CHECK_FOR_INTERRUPTS in polling loop code path in XactLockTableWait
On 2025/05/24 5:41, Kevin K Biju wrote: Hi, While creating a logical replication slot, we wait for older transactions to complete to reach a "consistent point", which can take a while on busy databases. If we're creating a slot on a primary instance, it's pretty clear that we're waiting on a transaction: postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE pid=804; pid | wait_event_type | wait_event | state | query -+-+---++ 804 | Lock | transactionid | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput'); (1 row) postgres=# SELECT locktype,transactionid,pid,mode FROM pg_locks WHERE pid=804 AND granted='f'; locktype | transactionid | pid | mode ---+---+-+--- transactionid | 6077 | 804 | ShareLock However, creating a slot on a hot standby behaves very differently when blocked on a transaction. Querying pg_stat_activity doesn't give us any indication on what the issue is: postgres=# SELECT pg_is_in_recovery(); pg_is_in_recovery --- t (1 row) postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE pid=5074; pid | wait_event_type | wait_event | state | query --+-+++ 5074 | | | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput'); And more importantly, a backend in this state cannot be terminated via the usual methods and sometimes requires a server restart. postgres=# SELECT pg_cancel_backend(5074), pg_terminate_backend(5074); pg_cancel_backend | pg_terminate_backend ---+-- t | t (1 row) postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE pid=5074; pid | wait_event_type | wait_event | state | query --+-+++ 5074 | | | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput'); I've taken a look at the code that "awaits" on transactions and the function of interest is lmgr.c/XactLockTableWait. On a primary, it is able to acquire a ShareLock on the xid and the lock manager does a good job on making this wait efficient as well as visible externally. However, on a standby, it cannot wait on the xid since it is not running the transaction. However, it knows the transaction is still running from KnownAssignedXids, and then ends up on this codepath: * * Some uses of this function don't involve tuple visibility -- such * as when building snapshots for logical decoding. It is possible to * see a transaction in ProcArray before it registers itself in the * locktable. The topmost transaction in that case is the same xid, * so we try again after a short sleep. (Don't sleep the first time * through, to avoid slowing down the normal case.) */ if (!first) pg_usleep(1000L); The attached comment suggests that this sleep was only meant to be hit a few times while we wait for the lock to appear so we can wait on it. However, in the hot standby case, this ends up becoming a polling loop since the lock will never appear. There is no interrupt processing in this loop either and so the only way out is to wait for the transaction on the primary to complete. I agree with your analysis. It makes sense to me. Attached is a patch that adds CHECK_FOR_INTERRUPTS before the sleep in XactLockTableWait and ConditionalXactLockTableWait. This allows backends waiting on a transaction during slot creation on a standby to be interrupted. +1 It would also be nice if there was a way for users to tell what we're waiting on (maybe a different wait event?) but I'd like input on that. 
Just an idea: how about calling pgstat_report_wait_start() and _end() around pg_usleep(), similar to what WaitExceedsMaxStandbyDelay() does? Since this is more of an improvement than a bug fix, I think it would be better to make it a separate patch from the CFI addition. Also, would waiting only 1ms per loop cycle be too aggressive? If so, maybe we could increase the sleep time gradually during recovery (but not beyond 1 second), again similar to WaitExceedsMaxStandbyDelay(). Regards, -- Fujii Masao NTT DATA Japan Corporation
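Putting the CFI addition and both of those suggestions together, the retry path in XactLockTableWait() could end up looking roughly like the fragment below. This is only a sketch: WAIT_EVENT_XACT_LOCK_POLL is a hypothetical wait-event name, and the lock-probing logic around it is elided.

	/* sketch of the polling loop; surrounding logic elided */
	long		sleep_us = 1000L;	/* start at 1ms, as the current code does */

	for (;;)
	{
		/* ... try to take ShareLock on the xid; return once the xact is gone ... */

		CHECK_FOR_INTERRUPTS();

		pgstat_report_wait_start(WAIT_EVENT_XACT_LOCK_POLL);	/* hypothetical event */
		pg_usleep(sleep_us);
		pgstat_report_wait_end();

		/*
		 * Back off gradually while in recovery, capped at 1 second, in the
		 * style of WaitExceedsMaxStandbyDelay().
		 */
		if (RecoveryInProgress() && sleep_us < 1000000L)
			sleep_us *= 2;
	}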
Re: Fix slot synchronization with two_phase decoding enabled
to On Fri, May 23, 2025 at 10:06 PM Masahiko Sawada wrote: > > > Here are review comments for v14 patch: > Thank you for the review. > I think we need to include a basic test case where we simply create a > subscription with two_phase=true and then enable the failover via > ALTER SUBSCRIPTION after making sure that slot's restart_lsn passed > two_phase_at. Probably if we use this case in 003_logical_slots.pl, we > can avoid creating regress_sub2? > A test has been added in 040_standby_failover_slots_sync.pl to enable failover via ALTER SUBSCRIPTION. Regarding regress_sub2 in 003_logical_slots.pl : If we use ALTER SUBSCRIPTION to set failover=true on regress_sub, it leads to pg_upgrade failure, as it attempts to create a slot on the new_node(upgraded version) with both two_phase and failover enabled, which is an unsupported combination. > --- > +# Advance the slot's restart_lsn to allow enabling the failover option > +# on a two_phase-enabled subscription using ALTER SUBSCRIPTION. > +$publisher->safe_psql( > +'postgres', qq( > +BEGIN; > +SELECT txid_current(); > +SELECT pg_log_standby_snapshot(); > +COMMIT; > +BEGIN; > +SELECT txid_current(); > +SELECT pg_log_standby_snapshot(); > +COMMIT; > +)); > + > +# Alter subscription to enable failover > +$subscriber1->psql( > +'postgres', > +"ALTER SUBSCRIPTION regress_mysub2 DISABLE; > + ALTER SUBSCRIPTION regress_mysub2 SET (failover = true);"); > > I think we need to wait for the subscription to consume these changes > before disabling the subscription. If the publisher doesn't consume > these WAL records for some reason, the subsequent "ALTER SUBSCRIPTION > ... SET (failover = true)" will fail. > Done. > --- > @@ -25,6 +25,8 @@ $publisher->init( > $publisher->append_conf('postgresql.conf', 'autovacuum = off'); > $publisher->start; > > +$publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int)"); > + > $publisher->safe_psql('postgres', > "CREATE PUBLICATION regress_mypub FOR ALL TABLES;"); > > @@ -42,6 +44,8 @@ my $slot_creation_time_on_primary = $publisher->safe_psql( > SELECT current_timestamp; > ]); > > +$subscriber1->safe_psql('postgres', "CREATE TABLE tab_full (a int)"); > + > # Create a subscription that enables failover. > $subscriber1->safe_psql('postgres', > "CREATE SUBSCRIPTION regress_mysub1 CONNECTION > '$publisher_connstr' PUBLICATION regress_mypub WITH (slot_name = > lsub1_slot, copy_data = false, failover = true, enabled = false);" > @@ -98,6 +102,114 @@ my ($result, $stdout, $stderr) = > $subscriber1->psql('postgres', > ok( $stderr =~ /ERROR: cannot set failover for enabled subscription/, > "altering failover is not allowed for enabled subscription"); > > I think it's better to create the tables before the new test starts > with a comment explaining why we need to create the table here. > Done. ~~~ Please find the updated patch v15, addressing above comments. -- Thanks, Nisha v15-0001-PG17-Approach-3-Fix-slot-synchronization-for-two.patch Description: Binary data
Re: on_error table, saving error info to a table
> > On Tue, Dec 17, 2024 at 12:31 PM Kirill Reshke > wrote: > > > > On Mon, 16 Dec 2024 at 16:50, Nishant Sharma > > wrote: > > > Also, I think Andrew's suggestion can resolve the concern me and Krill > > > had on forcing users to create tables with correct column names and > > > numbers. Also, will make error table checking simpler. No need for the > > > above kind of checks. > > > > +1 on that. > > > > Syntax: COPY (on_error table, table error_saving_tbl); > seems not ideal. > > but auto-create on_error table if this table does not exist, seems way > more harder. > > Since we can not use SPI interface here, maybe we can use DefineRelation > also, to auto-create a table, what if table already exists, then > our operation would be stuck for not COPY related reason. > also auto-create means we need to come up with a magic table name for > all COPY (on_error table) > operations, which seems not ideal IMO. > > i realized we should error out case like: > COPY err_tbl FROM STDIN WITH (DELIMITER ',', on_error table, table > err_tbl); > > also by changing copy_generic_opt_arg, now we can > COPY err_tbl FROM STDIN WITH (DELIMITER ',', on_error table, table x); > previously, we can only do > COPY err_tbl FROM STDIN WITH (DELIMITER ',', on_error 'table', table x); > I am not sure if you understood Andrew's suggestion. As per my understanding he did not suggest auto-creating the error table, he suggested using TYPED TABLES for the error saving table. For example: postgres=# CREATE TYPE error_saving_table_type AS (userid oid, copy_tbl oid, filename text, lineno bigint, line text, colname text, raw_field_value text, err_message text, err_detail text, errorcode text); CREATE TYPE We can have something similar like above in some initdb script, which will help in making the above type kind of derived or standard error saving table type in PG. And then, user can use above TYPE to create error saving table like below: postgres=# CREATE TABLE error_saving_table OF error_saving_table_type; CREATE TABLE After this, user can use error_saving_table with the COPY command like below: COPY t_copy_tbl(a,b) FROM STDIN WITH (DELIMITER ',', on_error table, table error_saving_table); Here's manual example of insert in that table: postgres=# INSERT INTO error_saving_table VALUES (1234, 4321, 'abcd', 12, 'This is was getting copied', 'xyz', 'pqr', 'testing type error table', 'inserting into typed table error saving table', 'test error code'); INSERT 0 1 postgres=# SELECT * from error_saving_table; userid | copy_tbl | filename | lineno |line| colname | raw_field_value | err_message| err_detail |errorcode +--+--+++-+- +--+--- +- 1234 | 4321 | abcd | 12 | This is was getting copied | xyz | pqr | testing type error table | insert ing into typed table error saving table | test error code (1 row) With the above we don't need to check all the 12 column's count, their data types etc. in the patch. *"Then all you would need to check is the reloftype to* *make **sure it's the right type," *as quoted by Andrew. This will make patch simpler and also will remove burden on users to create error saving tables with correct columns. As its TYPE will be already available by default for the users to create error saving tables. Regards, Nishant.
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Sat, May 24, 2025 at 6:59 PM Alexander Korotkov wrote: > > Hi, Amit! > > Thank you for your attention to this patchset! > > On Sat, May 24, 2025 at 2:15 PM Amit Kapila wrote: > > On Sat, May 24, 2025 at 4:08 PM Alexander Korotkov > > wrote: > > > > > > I spend more time on this. The next revision is attached. It > > > contains revised comments and other cosmetic changes. I'm going to > > > backpatch 0001 to all supported branches, > > > > > > > Is my understanding correct that we need 0001 because > > PhysicalConfirmReceivedLocation() doesn't save the slot to disk after > > changing the slot's restart_lsn? > > Yes. Also, even if it would save slot to the disk, there is still > race condition that concurrent checkpoint could use updated value from > the shared memory to clean old WAL segments, and then crash happens > before we managed to write the slot to the disk. > How can that happen, if we first write the updated value to disk and then update the shared memory as we do in LogicalConfirmReceivedLocation? BTW, won't there be a similar problem with physical slot's xmin computation as well? In PhysicalReplicationSlotNewXmin(), after updating the slot's xmin computation, we mark it as dirty and update shared memory values. Now, say after checkpointer writes these xmin values to disk, walsender receives another feedback message and updates the slot's xmin values. Now using these updated shared memory values, vacuum removes the rows, however, a restart will show the older xmin values in the slot, which mean vacuum would have removed the required rows before restart. As per my understanding, neither the xmin nor the LSN problem exists for logical slots. I am pointing this out to indicate we may need to think of a different solution for physical slots, if these are problems only for physical slots. -- With Regards, Amit Kapila.
Re: Automatically sizing the IO worker pool
HI > On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote: > It's hard to know how to set io_workers=3. If it's too small, > io_method=worker's small submission queue overflows and it silently > falls back to synchronous IO. If it's too high, it generates a lot of > pointless wakeups and scheduling overhead, which might be considered > an independent problem or not, but having the right size pool > certainly mitigates it. Here's a patch to replace that GUC with: > > io_min_workers=1 > io_max_workers=8 > io_worker_idle_timeout=60s > io_worker_launch_interval=500ms > > It grows the pool when a backlog is detected (better ideas for this > logic welcome), and lets idle workers time out. I also like idea ,can we set a io_workers= 3 io_max_workers= cpu/4 io_workers_oversubscribe = 3 (range 1-8) io_workers * io_workers_oversubscribe <=io_max_workers On Sun, May 25, 2025 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote: > > It's hard to know how to set io_workers=3. If it's too small, > > io_method=worker's small submission queue overflows and it silently > > falls back to synchronous IO. If it's too high, it generates a lot of > > pointless wakeups and scheduling overhead, which might be considered > > an independent problem or not, but having the right size pool > > certainly mitigates it. Here's a patch to replace that GUC with: > > > > io_min_workers=1 > > io_max_workers=8 > > io_worker_idle_timeout=60s > > io_worker_launch_interval=500ms > > > > It grows the pool when a backlog is detected (better ideas for this > > logic welcome), and lets idle workers time out. > > I like the idea. In fact, I've been pondering about something like a > "smart" configuration for quite some time, and convinced that a similar > approach needs to be applied to many performance-related GUCs. > > Idle timeout and launch interval serving as a measure of sensitivity > makes sense to me, growing the pool when a backlog (queue_depth > > nworkers, so even a slightest backlog?) is detected seems to be somewhat > arbitrary. From what I understand the pool growing velocity is constant > and do not depend on the worker demand (i.e. queue_depth)? It may sounds > fancy, but I've got an impression it should be possible to apply what's > called a "low-pass filter" in the control theory (sort of a transfer > function with an exponential decay) to smooth out the demand and adjust > the worker pool based on that. > > As a side note, it might be far fetched, but there are instruments in > queueing theory to figure out how much workers are needed to guarantee a > certain low queueing probability, but for that one needs to have an > average arrival rate (in our case, average number of IO operations > dispatched to workers) and an average service rate (average number of IO > operations performed by workers). > > >
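As a concrete illustration of the low-pass-filter idea in the quoted message, the worker target could be driven by an exponential moving average of the observed backlog, along the lines of the sketch below. Every name and constant here is hypothetical and not taken from the patch; whether this beats the simpler "grow when a backlog is detected" trigger would need benchmarking.

#include "postgres.h"
#include <math.h>

/* Sketch only: exponentially smoothed backlog drives the worker-pool target. */
static double smoothed_backlog = 0.0;

static int
io_worker_target(int queue_depth, int io_min_workers, int io_max_workers)
{
	const double alpha = 0.1;	/* smaller alpha = smoother, slower response */

	/* low-pass filter (exponential moving average) over the observed backlog */
	smoothed_backlog = alpha * queue_depth + (1.0 - alpha) * smoothed_backlog;

	/* one worker per unit of smoothed backlog, clamped to the configured range */
	return Min(io_max_workers,
			   Max(io_min_workers, (int) ceil(smoothed_backlog)));
}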
Re: Hash table scans outside transactions
On 5/25/25 13:36, Dilip Kumar wrote: > On Wed, May 21, 2025 at 3:02 AM Aidar Imamov wrote: >> >> Hi! >> I tried to resend this thread directly to myself, but for some reason it >> ended up in the whole hackers list.. >> >> I thought I'd chime in on this topic since it hasn't really been >> discussed >> anywhere else (or maybe I missed it). >> I've attached two patches: the first one is a little test extension to >> demonstrate the problem (just add "hash_test" to >> "shared_preload_libraries"), >> and the second is a possible solution. Basically, the idea is that we >> don't >> reset the scan counter if we find scans that started outside of the >> current >> transaction at the end. I have to admit, though, that I can't >> immediately >> say what other implications this might have or what else we need to >> watch >> out for if we try this. >> Maybe any thoughts on that? > > I haven't reviewed the complete patch or tested it, but I don't see > any issues with it. > I may be wrong, but I'd guess the restriction is there for a reason. Maybe it's not needed anymore, or maybe it's needed only in some cases, or something like that. So the most important thing for a patch removing the restriction it is to explain why it's there and why it's safe to relax it. The extension demonstrating that the restriction exists doesn't really help with that, it shows why we have it. Not hat it's safe to remove it. I'm not a dynahash expert, but isn't the first sentence from dynahash.c relevant: * dynahash.c supports both local-to-a-backend hash tables and hash * tables in shared memory. For shared hash tables, it is the caller's * responsibility to provide appropriate access interlocking. The * simplest convention is that a single LWLock protects the whole hash * table. Searches (HASH_FIND or hash_seq_search) need only shared * lock, but any update requires exclusive lock. In other words, let's say you have a hash table, protected by LWLock. That is, you're holding the lock in shared mode during seqscan. And then we drop the locks for some reason (say, transaction end). The table can be modified - won't that break the seqscan? FWIW I think with the use case from the beginning of this thread: 1. Add/update/remove entries in hash table 2. Scan the existing entries and perform one transaction per entry 3. Close scan Why not to simply build a linked list after step (1)? regards -- Tomas Vondra
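To spell out that last suggestion: take a snapshot of whatever the per-entry work needs in a single scan after step (1), under whatever lock the table normally requires, and then drive the one-transaction-per-entry loop from that private list, so no hash_seq scan stays open across transaction boundaries. A rough sketch, where MyEntry and its Oid key are assumptions made up for illustration:

#include "postgres.h"
#include "access/xact.h"
#include "nodes/pg_list.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"

typedef struct MyEntry			/* stand-in for the real entry type */
{
	Oid			key;
	/* ... payload ... */
} MyEntry;

static void
process_all_entries(HTAB *htab)
{
	List	   *worklist = NIL;
	HASH_SEQ_STATUS status;
	MyEntry    *entry;
	ListCell   *lc;
	MemoryContext oldcxt;

	/* After step (1): snapshot the keys; the list must survive transaction ends. */
	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
	hash_seq_init(&status, htab);
	while ((entry = (MyEntry *) hash_seq_search(&status)) != NULL)
		worklist = lappend_oid(worklist, entry->key);	/* copy data, don't keep pointers */
	MemoryContextSwitchTo(oldcxt);

	/* Step (2), reworked: one transaction per entry, no hash scan left open. */
	foreach(lc, worklist)
	{
		Oid			key = lfirst_oid(lc);

		StartTransactionCommand();
		/* re-look-up "key" under the appropriate lock and process it */
		CommitTransactionCommand();
	}

	list_free(worklist);
}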
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 9 May 2025 17:57:35 -0700, Masahiko Sawada wrote: >> Proposed approaches to register custom COPY formats: >> a. Create a function that has the same name of custom COPY >>format >> b. Call a register function from _PG_init() >> >> FYI: I proposed c. approach that uses a. but it always >> requires schema name for format name in other e-mail. > > With approach (c), do you mean that we require users to change all > FORMAT option values like from 'text' to 'pg_catalog.text' after the > upgrade? Or are we exempt the built-in formats? The latter. 'text' must be accepted because existing pg_dump results use 'text'. If we reject 'text', it's a big incompatibility. (We can't dump on old PostgreSQL and restore to new PostgreSQL.) >> Users can register the same format name: >> a. Yes >>* Users can distinct the same format name by schema name >>* If format name doesn't have schema name, the used >> format depends on search_path >> * Pros: >>* Using schema for it is consistent with other >> PostgreSQL mechanisms >>* Custom format never conflict with built-in >> format. For example, an extension register "xml" and >> PostgreSQL adds "xml" later, they are never >> conflicted because PostgreSQL's "xml" is registered >> to pg_catalog. >> * Cons: Different format may be used with the same >>input. For example, "jsonlines" may choose >>"jsonlines" implemented by extension X or implemented >>by extension Y when search_path is different. >> b. No >>* Users can use "${schema}.${name}" for format name >> that mimics PostgreSQL's builtin schema (but it's just >> a string) >> >> >> Built-in formats (text/csv/binary) should be able to >> overwritten by extensions: >> a. (The current patch is no but David's answer is) Yes >>* Pros: Users can use drop-in replacement faster >> implementation without changing input >>* Cons: Users may overwrite them accidentally. >> It may break pg_dump result. >> (This is called as "backward incompatibility.") >> b. No > > The summary matches my understanding. I think the second point is > important. If we go with a tablesample-like API, I agree with David's > point that all FORMAT values including the built-in formats should > depend on the search_path value. While it provides a similar user > experience to other database objects, there is a possibility that a > COPY with built-in format could work differently on v19 than v18 or > earlier depending on the search_path value. Thanks for sharing additional points. David said that the additional point case is a responsibility or DBA not PostgreSQL, right? As I already said, I don't have a strong opinion on which approach is better. My opinion for the (important) second point is no. I feel that the pros of a. isn't realistic. If users want to improve text/csv/binary performance (or something), they should improve PostgreSQL itself instead of replacing it as an extension. (Or they should create another custom copy format such as "faster_text" not "text".) So I'm OK with the approach b. >> Are there any missing or wrong items? > > I think the approach (b) provides more flexibility than (a) in terms > of API design as with (a) we need to do everything based on one > handler function and callbacks. Thanks for sharing this missing point. I have a concern that the flexibility may introduce needless complexity. If it's not a real concern, I'm OK with the approach b. 
>> If we can summarize >> the current discussion here correctly, others will be able >> to chime in this discussion. (At least I can do it.) > > +1 Are there any other people interested in the custom COPY FORMAT implementation design? If not, let's settle it among ourselves. Thanks, -- kou
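For anyone following along, approach (b) would look roughly like the sketch below from an extension author's point of view. Every name in it (CopyFormatRoutine, its fields, RegisterCustomCopyFormat) is invented purely for discussion; designing this API is exactly what is being decided here.

#include "postgres.h"
#include "commands/copy.h"
#include "executor/tuptable.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/* Hypothetical callback table for a custom COPY format. */
typedef struct CopyFormatRoutine
{
	void		(*to_start) (CopyToState cstate, TupleDesc tupDesc);
	void		(*to_one_row) (CopyToState cstate, TupleTableSlot *slot);
	void		(*to_end) (CopyToState cstate);
	/* ... COPY FROM callbacks would go here ... */
} CopyFormatRoutine;

static void jsonlines_to_start(CopyToState cstate, TupleDesc tupDesc);
static void jsonlines_to_one_row(CopyToState cstate, TupleTableSlot *slot);
static void jsonlines_to_end(CopyToState cstate);

static const CopyFormatRoutine jsonlines_routine = {
	.to_start = jsonlines_to_start,
	.to_one_row = jsonlines_to_one_row,
	.to_end = jsonlines_to_end
};

void
_PG_init(void)
{
	/*
	 * Approach (b): register by name at library load time.  Per the
	 * discussion above, the (hypothetical) registration function would
	 * raise an error if the name or ID is already taken.
	 */
	RegisterCustomCopyFormat("jsonlines", &jsonlines_routine);
}

Approach (a), by contrast, would instead have a single handler function, located by the format name (schema-qualified or not), return such a callback table, in the style of tablesample handlers.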
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 9 May 2025 21:29:23 -0700, Masahiko Sawada wrote: >> > So the idea is that the backend process sets the format ID somewhere >> > in st_progress_param, and then the progress view calls a SQL function, >> > say pg_stat_get_copy_format_name(), with the format ID that returns >> > the corresponding format name. >> >> Does it work when we use session_preload_libraries or the >> LOAD command? If we have 2 sessions and both of them load >> "jsonlines" COPY FORMAT extensions, what will be happened? >> >> For example: >> >> 1. Session 1: Register "jsonlines" >> 2. Session 2: Register "jsonlines" >> (Should global format ID <-> format name mapping >> be updated?) >> 3. Session 2: Close this session. >> Unregister "jsonlines". >> (Can we unregister COPY FORMAT extension?) >> (Should global format ID <-> format name mapping >> be updated?) >> 4. Session 1: Close this session. >> Unregister "jsonlines". >> (Can we unregister COPY FORMAT extension?) >> (Should global format ID <-> format name mapping >> be updated?) > > I imagine that only for progress reporting purposes, I think session 1 > and 2 can have different format IDs for the same 'jsonlines' if they > load it by LOAD command. They can advertise the format IDs on the > shmem and we can also provide a SQL function for the progress view > that can get the format name by the format ID. > > Considering the possibility that we might want to use the format ID > also in the cumulative statistics, we might want to strictly provide > the unique format ID for each custom format as the format IDs are > serialized to the pgstat file. One possible way to implement it is > that we manage the custom format IDs in a wiki page like we do for > custom cumulative statistics and custom RMGR[1][2]. That is, a custom > format extension registers the format name along with the format ID > that is pre-registered in the wiki page or the format ID (e.g. 128) > indicating under development. If either the format name or format ID > conflict with an already registered custom format extension, the > registration function raises an error. And we preallocate enough > format IDs for built-in formats. > > As for unregistration, I think that even if we provide an > unregisteration API, it ultimately depends on whether or not custom > format extensions call it in _PG_fini(). Thanks for sharing your idea. With the former ID issuing approach, it seems that we need a global format ID <-> name mapping and a per session registered format name list. The custom COPY FORMAT register function rejects the same format name, right? If we support both of shared_preload_libraries and session_preload_libraries/LOAD, we have different life time custom formats. It may introduce a complexity with the ID issuing approach. With the latter static ID approach, how to implement a function that converts format ID to format name? PostgreSQL itself doesn't know ID <-> name mapping in the Wiki page. It seems that custom COPY FORMAT implementation needs to register its name to PostgreSQL by itself. Thanks, -- kou
Re: Non-reproducible AIO failure
Thomas Munro writes: > On Sun, May 25, 2025 at 3:22 PM Tom Lane wrote: >> So far, I've failed to get anything useful out of core files >> from this failure. The trace goes back no further than >> (lldb) bt >> * thread #1 >> * frame #0: 0x00018de39388 libsystem_kernel.dylib`__pthread_kill + 8 > (And Alexander reported the same off-list.). It's interesting that the > elog.c backtrace stuff is able to analyse the stack and it looks > normal AFAICS. Could that be interfering with the stack in the core?! No, but something is. Just to make sure it wasn't totally broken, I added a sure-to-fail Assert in a random place (I chose pg_backend_pid), and I get both a trace in the postmaster log and a perfectly usable core file: TRAP: failed Assert("MyProcPid == 0"), File: "pgstatfuncs.c", Line: 692, PID: 59063 0 postgres0x0001031f1fa4 ExceptionalCondition + 108 1 postgres0x0001031672b4 pg_stat_get_backend_pid + 0 2 postgres0x000102e9e598 ExecInterpExpr + 5524 3 postgres0x000102edb100 ExecResult + 368 4 postgres0x000102ea6418 standard_ExecutorRun + 316 (lldb) bt * thread #1 * frame #0: 0x0001836b5388 libsystem_kernel.dylib`__pthread_kill + 8 frame #1: 0x0001836ee88c libsystem_pthread.dylib`pthread_kill + 296 frame #2: 0x0001835f7c60 libsystem_c.dylib`abort + 124 frame #3: 0x00010491dfac postgres`ExceptionalCondition(conditionName=, fileName=, lineNumber=692) at assert.c:66:2 [opt] frame #4: 0x00010489329c postgres`pg_backend_pid(fcinfo=) at pgstatfuncs.c:692:2 [opt] frame #5: 0x0001045ca598 postgres`ExecInterpExpr(state=0x00013780d190, econtext=0x00013780ce38, isnull=) at execExprInterp.c:0 [opt] frame #6: 0x000104607100 postgres`ExecResult [inlined] ExecEvalExprNoReturn(state=, econtext=0x00013780ce38) at executor.h:417:13 [opt] frame #7: 0x0001046070f4 postgres`ExecResult [inlined] ExecEvalExprNoReturnSwitchContext(state=, econtext=0x00013780ce38) at executor.h:458:2 [opt] The fact that I can trace through this Assert failure but not the AIO one strongly suggests some system-level problem in the latter. There is something rotten in the state of Denmark. For completeness, this is with Sequoia 15.5 (latest macOS) on an M4 Pro MacBook. > but I haven't seen this failure on my little M4 MacBook Air yet > (Sequoia 15.5, Apple clang-1700.0.13.3). It is infected with > corporate security-ware that intercepts at least file system stuff and > slows it down and I can't even convince it to dump core files right > now. As far as that goes, if you have SIP turned on (which I'm sure a corporate laptop would), there are extra steps needed to get a core to happen. See [1]; that page is old, but the recipe still works for me. regards, tom lane [1] https://nasa.github.io/trick/howto_guides/How-to-dump-core-file-on-MacOS.html
CREATE OR REPLACE FUNCTION now validates its dependents
hi. The attached patch allows CREATE OR REPLACE FUNCTION to correctly update domain constraints when the function they depend on is changed. so this thread[1] mentioned the problem can be resolved. for example: create function sqlcheck(int) returns bool as 'select $1 > 0' language sql; create domain checkedint as int check(sqlcheck(value)); select 0::checkedint; -- fail ERROR: value for domain checkedint violates check constraint "checkedint_check" create or replace function sqlcheck(int) returns bool as 'select $1 <= 0' language sql; select 1::checkedint; -- fail? the last query won't fail on the master. with the patch it will fail. I also make CREATE OR REPLACE FUNCTION validate each domain value when the domain constraint conditions is associated the function we are gonname changes Of course, this will make CREATE OR REPLACE FUNCTION take way longer time compared to the current. Similar to domain constraints, attached patch also apply to table check constraints too. Is this what we want to do? [1]: https://postgr.es/m/12539.1544107316%40sss.pgh.pa.us From d3356a2485143dc81e5b4d4e0311ffeaec56153c Mon Sep 17 00:00:00 2001 From: jian he Date: Mon, 26 May 2025 11:19:00 +0800 Subject: [PATCH v1 1/1] Ensure CREATE OR REPLACE FUNCTION validates its dependents Currently, CREATE OR REPLACE FUNCTION doesn't validate if the new function definition satisfies its dependents. This patch changes that: it will now validate the new function definition against its dependents, raising an error and failing the command if it's not satisfied. demo: create function sqlcheck(int) returns bool as 'select $1 > 0' language sql; create domain checkedint as int check(sqlcheck(value)); select 1::checkedint; -- ok create or replace function sqlcheck(int) returns bool as 'select $1 <= 0' language sql; -- Currently succeeds on master, but would fail with this patch. select 1::checkedint; context: https://postgr.es/m/12539.1544107316%40sss.pgh.pa.us discussion: https://postgr.es/m/ --- doc/src/sgml/config.sgml | 20 ++ doc/src/sgml/ref/create_function.sgml | 9 + src/backend/catalog/pg_proc.c | 202 ++ src/backend/commands/typecmds.c | 3 +- src/backend/utils/cache/typcache.c| 3 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_dump/pg_backup_archiver.c | 3 + src/include/catalog/pg_proc.h | 1 + src/include/commands/typecmds.h | 1 + src/include/utils/guc.h | 1 + src/include/utils/typcache.h | 2 + .../regress/expected/create_function_sql.out | 48 - src/test/regress/expected/domain.out | 43 src/test/regress/sql/create_function_sql.sql | 38 src/test/regress/sql/domain.sql | 30 +++ 16 files changed, 410 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index ca2a567b2b1..d81d08962f7 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -9895,6 +9895,26 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; + + check_function_dependencies (boolean) + + check_function_dependencies configuration parameter + + + + +This parameter is normally on. When set to off, it +disables validation of objects that depentent object on the routine during . Disabling validation avoids side +effects of the validation process, in particular valildating existing invalidated constraint. +Set this parameter +to off before loading functions on behalf of other +users; pg_dump does so automatically. +This parameter should be false if check_function_bodies is set to false. 
+ + + + default_transaction_isolation (enum) diff --git a/doc/src/sgml/ref/create_function.sgml b/doc/src/sgml/ref/create_function.sgml index 0d240484cd3..4c938bb391c 100644 --- a/doc/src/sgml/ref/create_function.sgml +++ b/doc/src/sgml/ref/create_function.sgml @@ -76,6 +76,15 @@ CREATE [ OR REPLACE ] FUNCTION OUT parameters except by dropping the function.) + + If any existing domains or CHECK constraints rely on the + current function, and check_function_bodies is set to + true, then using CREATE OR REPLACE FUNCTION will + effectively verify whether the new function definition satisfies those + domains or CHECK constraints. If not, an error will be + raised. + + When CREATE OR REPLACE FUNCTION is used to replace an existing function, the ownership and permissions of the function diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c index 5fdcf24d5f8..e2caa6bde21 100644 --- a/src/backend/catalog/pg_proc.c +++ b/src/backend/catalog/pg_proc.c @@ -15,17 +15,20 @@ #include "postgres.h"
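As a rough, hypothetical sketch of the table CHECK constraint case mentioned above (the table name checked_tbl is invented; the expected failure assumes the patch validates table constraints the same way it validates domains):

create function sqlcheck(int) returns bool as 'select $1 > 0' language sql;
create table checked_tbl (val int check (sqlcheck(val)));
insert into checked_tbl values (1);  -- ok
-- With the patch, the replacement would be validated against the existing rows of
-- checked_tbl and is expected to fail, since val = 1 no longer passes the check.
create or replace function sqlcheck(int) returns bool as 'select $1 <= 0' language sql;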
Re: CREATE OR REPLACE FUNCTION now validates its dependents
jian he writes: > The attached patch allows CREATE OR REPLACE FUNCTION to correctly update > domain > constraints when the function they depend on is changed. Why is this a problem that needs solving? Our fundamental expectation is that domain constraints are immutable (see the Notes section under CREATE DOMAIN [1]). If the user changes a function used in a constraint in a way that changes its behavior, that's their fault not our problem. I don't think we should add a bunch of code (that we have to maintain) and a bunch of overhead for every CREATE OR REPLACE FUNCTION in order to slap people on the wrist if they get this wrong. regards, tom lane [1] https://www.postgresql.org/docs/current/sql-createdomain.html
JIT works only partially with meson build?
Hi, While building PostgreSQL 17 on Windows, I noticed that bitcode files (.bc) are not generated with meson. Does this mean that "inlining" for JIT doesn't work when PostgreSQL is built with meson? If so, would it be better to describe that in the meson build and JIT sections of the documentation, so that users and packagers know not to expect JIT to work the same way when built with meson as with autoconf/make? Regards, Yugo Nagata -- Yugo Nagata
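For reference, a rough way to exercise JIT inlining from SQL on a JIT-enabled build (the thresholds below are forced to zero purely for testing; the particular query is immaterial):

SET jit = on;
SET jit_above_cost = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;
EXPLAIN (ANALYZE, COSTS OFF) SELECT count(*) FROM pg_class;
-- The JIT summary at the end of the EXPLAIN output shows whether inlining was
-- requested; whether the installed bitcode (.bc) files were actually found is
-- more directly verified by listing the bitcode/ directory under "pg_config --pkglibdir".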
Re: Custom GUCs and typos
On Sun, May 25, 2025 at 7:52 PM Srinath Reddy Sadipiralla < srinath2...@gmail.com> wrote: > 1) On top of OP's patch I added support to warn if the prefix of a custom > GUC is invalid. To address valid questions such as "How do you know that's a bogus > prefix? It could perfectly well be a fully valid setting for an extension > that the installation doesn't choose to preload.", we can get the info on > such extensions using extension_file_exists(), which tells us that it's > installed but not preloaded (thanks to Greg for proposing this), which means > this could be a potentially valid extension. So if it's not in > reserved_class_prefix and also not in the prefix/share/extension dir, then the > prefix is for sure invalid, so I warn and remove the GUC from the hashtable. People can and do create GUCs without any intention or need to write C code or place files onto the server to make use of them. https://www.postgresql.org/docs/current/runtime-config-custom.html "PostgreSQL will accept a setting for any two-part parameter name." We cannot change this. So, please include discussion of their existence in any description of changes you wish to make in this area. David J.
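As a small illustration of that point (the "myapp" prefix here is made up and backed by no extension or C code at all):

SET myapp.batch_size = '100';                      -- accepted as a placeholder GUC
SELECT current_setting('myapp.batch_size');        -- returns '100'
SELECT set_config('myapp.owner', 'alice', false);  -- works equally well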
Fix a minor typo in jit/README
Hi, The attached patch fixes the following typo. -additionally an index is over these is stored to +additionally an index over these is stored to Regards, Yugo Nagata -- Yugo Nagata diff --git a/src/backend/jit/README b/src/backend/jit/README index 5427bdf2153..a40950dfb03 100644 --- a/src/backend/jit/README +++ b/src/backend/jit/README @@ -205,7 +205,7 @@ The ability to do so allows us to get the LLVM IR for all operators bitcode files get installed into the server's $pkglibdir/bitcode/postgres/ Using existing LLVM functionality (for parallel LTO compilation), -additionally an index is over these is stored to +additionally an index over these is stored to $pkglibdir/bitcode/postgres.index.bc Similarly extensions can install code into
Re: Replication slot is not able to sync up
On Sat, May 24, 2025 at 10:37 AM Amit Kapila wrote: > If the > > problem is that we're not able to create a slot on the standby at an > > old enough LSN or XID position to permit its use with the > > corresponding slot on the master, it should be reported that way. > > > > That is the case, and we should improve the LOG message. > Agree that log messages need improvement. Please find the patch attached for the same. I also intend to update the docs in this area for users to understand this feature better, and will work on that soon. thanks Shveta v1-0001-Improve-log-messages-in-slotsync.patch Description: Binary data
Re: SQL:2011 application time
On 17.09.24 11:45, Peter Eisentraut wrote: On 05.09.24 14:09, Peter Eisentraut wrote: On 07.08.24 22:54, Paul Jungwirth wrote: Here are some fixes based on outstanding feedback (some old some new). I have studied your patches v39-0001 through v39-0004, which correspond to what had been reverted plus the new empty range check plus various minor fixes. This looks good to me now, so I propose to go ahead with that. Btw., in your 0003 you point out that this prevents using the WITHOUT OVERLAPS functionality for non-range types. But I think this could be accomplished by adding an "is empty" callback as a support function or something like that. I'm not suggesting to do that here, but it might be worth leaving a comment about that possibility. I have committed these, as explained here. Here we added a gist support function that we internally refer to by the symbol GIST_STRATNUM_PROC. This translated from "well-known" strategy numbers to opfamily-specific strategy numbers. However, we later changed this to fit into index-AM-level compare type mapping, so this function actually now maps from compare type to opfamily-specific strategy numbers. So I'm wondering if this name is still good. Moreover, the index AM level also supports the opposite, a function to map from strategy number to compare type. This is currently not supported in gist, but one might wonder what this function is supposed to be called when it is added. So I went through and updated the naming of the gist-level functionality to be more in line with the index-AM-level functionality; see attached patch. I think this makes sense because these are essentially the same thing on different levels. What do you think? (This would be for PG18.) From 5cada8bb5032e7b2ee6e218dc38bc6ff35096a23 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 26 May 2025 07:48:51 +0200 Subject: [PATCH] WIP: Rename gist stratnum support function --- contrib/btree_gist/btree_gist--1.7--1.8.sql | 54 ++-- contrib/btree_gist/btree_gist.c | 4 +- contrib/btree_gist/expected/stratnum.out | 18 +++ contrib/btree_gist/sql/stratnum.sql | 6 +-- doc/src/sgml/gist.sgml | 25 ++--- doc/src/sgml/xindex.sgml | 2 +- src/backend/access/gist/gistutil.c | 14 ++--- src/backend/access/gist/gistvalidate.c | 6 +-- src/include/access/gist.h| 2 +- src/include/catalog/pg_amproc.dat| 12 ++--- src/include/catalog/pg_proc.dat | 4 +- src/test/regress/expected/misc_functions.out | 18 +++ src/test/regress/sql/misc_functions.sql | 6 +-- 13 files changed, 91 insertions(+), 80 deletions(-) diff --git a/contrib/btree_gist/btree_gist--1.7--1.8.sql b/contrib/btree_gist/btree_gist--1.7--1.8.sql index 4ff9c43a8eb..8f79365a461 100644 --- a/contrib/btree_gist/btree_gist--1.7--1.8.sql +++ b/contrib/btree_gist/btree_gist--1.7--1.8.sql @@ -3,85 +3,85 @@ -- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "ALTER EXTENSION btree_gist UPDATE TO '1.8'" to load this file. 
\quit -CREATE FUNCTION gist_stratnum_btree(int) +CREATE FUNCTION gist_translate_cmptype_btree(int) RETURNS smallint AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE PARALLEL SAFE STRICT; ALTER OPERATOR FAMILY gist_oid_ops USING gist ADD - FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ; + FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ; ALTER OPERATOR FAMILY gist_int2_ops USING gist ADD - FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ; + FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ; ALTER OPERATOR FAMILY gist_int4_ops USING gist ADD - FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ; + FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ; ALTER OPERATOR FAMILY gist_int8_ops USING gist ADD - FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ; + FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ; ALTER OPERATOR FAMILY gist_float4_ops USING gist ADD - FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ; + FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ; ALTER OPERATOR FAMILY gist_float8_ops USING gist ADD - FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ; + FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ; ALTER OPERATOR FAMILY gist_timestamp_ops USING gist ADD - FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ; + FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ; ALTER OPERATOR FAMILY gist_timestamptz_ops USING gist ADD - FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ; + FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ; ALTER OPERATOR FAMILY gist_time_ops USING gist ADD - FUNCTION 12 ("any", "any") gist_stratnum_btree (in
Re: Conflict detection for update_deleted in logical replication
On Sat, May 24, 2025 at 4:46 PM Amit Kapila wrote: > > This sounds reasonable to me. Let us see what others think. > I was looking into the approach for getting the transaction status from the publisher. What I would assume this patch should be doing is to request the publisher status the first time, and if some transactions are still in commit, then we need to wait for them to complete. But in the current design it's possible that, while we are waiting for in-commit transactions to get committed, old running transactions might enter the commit phase and then we wait for them again; is my understanding not correct? Maybe this is a very corner case: there are thousands of old running transactions, and every time we request the status we find some transactions in the commit phase, so the process keeps running for a long time until all the old running transactions eventually get committed. I am thinking, can't we make it more deterministic, such that when we get the status the first time, if we find some transactions in the commit phase, then we should just wait for those transactions to get committed? One idea is to get the list of xids in the commit phase, and the next time we get the list we can simply compare; if the next status contains no xids that were in the commit phase during the previous status, then we are done. But I am not sure whether this is worth the complexity. Maybe not, but shall we add some comment explaining the case and also explaining why this corner case is not harmful? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Hash table scans outside transactions
Aidar Imamov wrote 2025-05-21 00:32: Ashutosh Bapat wrote 2023-03-28 15:58: Bumping it to attract some attention. On Tue, Mar 21, 2023 at 12:51 PM Ashutosh Bapat wrote: Hi, Hash table scans (seq_scan_table/level) are cleaned up at the end of a transaction in AtEOXact_HashTables(). If a hash seq scan continues beyond transaction end it will meet "ERROR: no hash_seq_search scan for hash table" in deregister_seq_scan(). That seems like limiting the hash table usage. Our use case is 1. Add/update/remove entries in hash table 2. Scan the existing entries and perform one transaction per entry 3. Close scan. Repeat the above steps in an infinite loop. Note that we do not add/modify/delete entries in step 2. We can't use linked lists since the entries need to be updated or deleted using hash keys. Because the hash seq scan is cleaned up at the end of the transaction, we encounter the error in the 3rd step. I don't see that the actual hash table scan depends upon the seq_scan_table/level[] which is cleaned up at the end of the transaction. I have the following questions: 1. Is there a way to avoid cleaning up seq_scan_table/level() when the transaction ends? 2. Is there a way we can use the hash table implementation in PostgreSQL code for our purpose? -- Best Wishes, Ashutosh Bapat Hi! I tried to resend this thread directly to myself, but for some reason it ended up in the whole hackers list. I thought I'd chime in on this topic since it hasn't really been discussed anywhere else (or maybe I missed it). I've attached two patches: the first one is a little test extension to demonstrate the problem (just add "hash_test" to "shared_preload_libraries"), and the second is a possible solution. Basically, the idea is that, at transaction end, we don't reset the scan counter if we find scans that started outside of the current transaction. I have to admit, though, that I can't immediately say what other implications this might have or what else we need to watch out for if we try this. Any thoughts on that? regards, Aidar Imamov Just bumping it again regards, Aidar Imamov
Re: Understanding when VM record needs snapshot conflict horizon
On Sat, May 24, 2025 at 2:21 AM Melanie Plageman wrote: > > On Fri, May 23, 2025 at 12:04 PM Andres Freund wrote: > 3) if you are updating the VM and you are not modifying the heap page > at all, then you don't need to include a snapshot conflict horizon in > the record because you can safely assume that a record with the > visibility cutoff xid for that heap page as the snapshot conflict > horizon has already been emitted. And any existing snapshots that > would conflict with it would have conflicted with the previous record. > > I think 3 can only happen if something goes wrong with the VM -- like > it is lost somehow. > > What I am wondering is if it is worth omitting the snapshot conflict > horizon in the third case. > Currently, you would emit an xl_heap_visible record with > InvalidTransactionId as the conflict horizon in this case. But you > aren't saving any space and it doesn't seem like you are saving any > queries from being canceled by not doing this. It simply makes the > logic for what to put in the WAL record more complicated. > IMHO, if we include the snapshot conflict horizon in cases where it is not necessary, don't you think it will impact performance on the standby? Because now it has to loop through the procarray on the standby to check whether there is any conflict before applying this WAL. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: MERGE issues around inheritance
Álvaro Herrera wrote on Sat, May 24, 2025 at 17:11: > On 2025-May-21, Andres Freund wrote: > > > Hi, > > > > In [1] I added some verification to projection building, to check if the > > tupleslot passed to ExecBuildProjectionInfo() is compatible with the > target > > list. One query in merge.sql [2] got flagged. > > > > Trying to debug that issue, I found another problem. This leaves us with: > > > > 1) w/ inheritance INSERTed rows are not returned by RETURNING. This > seems to > >just generally not work with MERGE > > Hmm, curious. One thing to observe is that the original source tuple is > in the child table, but the tuple inserted by MERGE ends up in the > parent table. I'm guessing that the code gets confused as to the > relation that the tuple in the returned slot comes from, and that > somehow makes it not "see" the tuple to process for RETURNING? > I dunno. CC'ing Dean, who is more familiar with this code than I am. > In ExecMergeNotMatched(), we pass mtstate->rootResultRelInfo to ExecInsert(). In this case, the ri_projectReturning of mtstate->rootResultRelInfo is NULL, so in ExecInsert() the "if (resultRelInfo->ri_projectReturning)" branch will not run, and rows INSERTed through the inheritance root are not returned by RETURNING. mtstate->rootResultRelInfo is assigned in ExecInitModifyTable() only here: if (node->rootRelation > 0) { Assert(bms_is_member(node->rootRelation, estate->es_unpruned_relids)); mtstate->rootResultRelInfo = makeNode(ResultRelInfo); ExecInitResultRelation(estate, mtstate->rootResultRelInfo, node->rootRelation); } ri_projectReturning is never assigned there. I tried passing resultRelInfo to ExecInsert(), and then the inherited INSERTed rows are returned by RETURNING, but some regression test cases failed. -- Thanks, Tender Wang
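For anyone trying to reproduce this, here is a hypothetical sketch along the lines of the scenario described above (table names invented; the actual flagged query is the one in merge.sql, not quoted here, and MERGE ... RETURNING is assumed to be available):

CREATE TABLE parent (id int, val text);
CREATE TABLE child () INHERITS (parent);
INSERT INTO child VALUES (1, 'old');

MERGE INTO parent p
USING (VALUES (2, 'new')) AS s(id, val) ON p.id = s.id
WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val)
RETURNING p.id, p.val;
-- Per the report, the row inserted into parent is not returned here,
-- even though the INSERT itself succeeds.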
Re: Hash table scans outside transactions
On Wed, May 21, 2025 at 3:02 AM Aidar Imamov wrote: > > Hi! > I tried to resend this thread directly to myself, but for some reason it > ended up in the whole hackers list.. > > I thought I'd chime in on this topic since it hasn't really been > discussed > anywhere else (or maybe I missed it). > I've attached two patches: the first one is a little test extension to > demonstrate the problem (just add "hash_test" to > "shared_preload_libraries"), > and the second is a possible solution. Basically, the idea is that we > don't > reset the scan counter if we find scans that started outside of the > current > transaction at the end. I have to admit, though, that I can't > immediately > say what other implications this might have or what else we need to > watch > out for if we try this. > Maybe any thoughts on that? I haven't reviewed the complete patch or tested it, but I don't see any issues with it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: MERGE issues around inheritance
Tender Wang wrote on Sun, May 25, 2025 at 19:17: > > > Álvaro Herrera wrote on Sat, May 24, 2025 at 17:11: > >> On 2025-May-21, Andres Freund wrote: >> >> > Hi, >> > >> > In [1] I added some verification to projection building, to check if the >> > tupleslot passed to ExecBuildProjectionInfo() is compatible with the >> target >> > list. One query in merge.sql [2] got flagged. >> > >> > Trying to debug that issue, I found another problem. This leaves us >> with: >> > >> > 1) w/ inheritance INSERTed rows are not returned by RETURNING. This >> seems to >> >just generally not work with MERGE >> >> Hmm, curious. One thing to observe is that the original source tuple is >> in the child table, but the tuple inserted by MERGE ends up in the >> parent table. I'm guessing that the code gets confused as to the >> relation that the tuple in the returned slot comes from, and that >> somehow makes it somehow not "see" the tuple to process for RETURNING? >> I dunno. CC'ing Dean, who is more familiar with this code than I am. >> > > In ExecMergeNotMatched(), we passed the mtstate->rootResultRelInfo to > ExecInsert(). > In this case, the ri_projectReturning of mtstate->rootResultRelInfo is > NULL, in ExecInsert(), > the "if (resultRelInfo->ri_projectReturning)" branch will not run, so > inheritance INSERTed rows are not returned by RETURNING. > > The mtstate->rootResultRelInfo assigned in ExecInitModifyTable() is only > here: > if (node->rootRelation > 0) > { > Assert(bms_is_member(node->rootRelation, estate->es_unpruned_relids)); > mtstate->rootResultRelInfo = makeNode(ResultRelInfo); > ExecInitResultRelation(estate, mtstate->rootResultRelInfo, > node->rootRelation); > } > The ri_projectReturning is not assigned. > > I try to pass resultRelInfo to ExecInsert, the inherited INSERTed rows are > returned by RETURNING. > But some test cases in regression failed. > For a partitioned table, we must pass rootResultRelInfo to ExecInsert(). I added a check before calling ExecInsert(): if it is a partitioned table, we continue to pass rootResultRelInfo; otherwise, we pass resultRelInfo. Please see the attached diff file. The patch passed all regression test cases. -- Thanks, Tender Wang diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index 2bc89bf84dc..443d4523789 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -3612,6 +3612,7 @@ ExecMergeNotMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo, MergeActionState *action = (MergeActionState *) lfirst(l); CmdType commandType = action->mas_action->commandType; TupleTableSlot *newslot; + char relkind; /* * Test condition, if any. @@ -3635,9 +3636,15 @@ ExecMergeNotMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo, */ newslot = ExecProject(action->mas_proj); mtstate->mt_merge_action = action; + relkind = mtstate->rootResultRelInfo->ri_RelationDesc->rd_rel->relkind; - rslot = ExecInsert(context, mtstate->rootResultRelInfo, + if (relkind == RELKIND_PARTITIONED_TABLE) + rslot = ExecInsert(context, mtstate->rootResultRelInfo, newslot, canSetTag, NULL, NULL); + else + rslot = ExecInsert(context, resultRelInfo, + newslot, canSetTag, NULL, NULL); + mtstate->mt_merge_inserted += 1; break; case CMD_NOTHING:
Re: MERGE issues around inheritance
On Sun, 25 May 2025 at 13:06, Tender Wang wrote: > > For a partitioned table, we must pass rootResultRelInfo to ExecInsert(). I > added the check before calling ExecInsert() > If it is a partitioned table, we continue to pass rootResultRelInfo. > Otherwise, we pass resultRelInfo. > Please see the attached diff file. The patch passed all regression test cases. > No, I don't think that's the right fix. I'm looking at it now, and I think I have a fix, but it's more complicated than that. I'll post an update later. Regards, Dean