Re: Fixing memory leaks in postgres_fdw

2025-05-25 Thread Tom Lane
Etsuro Fujita  writes:
> On Sat, May 24, 2025 at 10:10 AM Tom Lane  wrote:
>> I thought of fixing this by using a memory context reset callback
>> to ensure that the PGresult is cleaned up when the executor's context
>> goes away, and that seems to work nicely (see 0001 attached).
>> However, I feel like this is just a POC, because now that we have that
>> concept we might be able to use it elsewhere in postgres_fdw to
>> eliminate most or even all of its reliance on PG_TRY.  That should be
>> faster as well as much less bug-prone.  But I'm putting it up at this
>> stage for comments, in case anyone thinks this is not the direction to
>> head in.

> I think that is a good idea; +1 for removing the reliance not only in
> DirectModify but in other places as well.  It would also be useful
> when extending batch INSERT to cases with RETURNING data in
> postgres_fdw.

Here is an attempt at making a bulletproof fix by having all backend
users of libpq go through a wrapper layer that provides the memory
context callback.  Perhaps this is more code churn than we want to
accept; I'm not sure.  I thought about avoiding most of the niggling
code changes by adding

#define PGresult BEPGresult
#define PQclear BEPQclear
#define PQresultStatus BEPQresultStatus

and so forth at the bottom of the new header file, but I'm afraid
that would create a lot of confusion.

There is a lot yet to do towards getting rid of no-longer-needed
PG_TRYs and other complication, but I decided to stop here pending
comments on the notational decisions I made.

One point that people might find particularly dubious is that
I put the new stuff into a new header file libpq-be-fe.h, rather
than adding it to libpq-be-fe-helpers.h which would seem more
obvious.  The reason for that is the code layout in postgres_fdw.
postgres_fdw.h needs to include libpq-fe.h to get the PGresult
typedef, and with these changes it instead needs to get BEPGresult.
But only connection.c currently includes libpq-be-fe-helpers.h,
and I didn't like the idea of making all of postgres_fdw's .c
files include that.  Maybe that's not worth worrying about though.

The 0002 patch is the same as before.

regards, tom lane

From 2fc53191595be88592fcd83b8b78c5849b623049 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Sun, 25 May 2025 15:16:23 -0400
Subject: [PATCH v2 1/2] Fix memory leakage in postgres_fdw's DirectModify code
 path.

postgres_fdw tries to use PG_TRY blocks to ensure that it will
eventually free the PGresult created by the remote modify command.
However, it's fundamentally impossible for this scheme to work
reliably when there's RETURNING data, because the query could fail
in between invocations of postgres_fdw's DirectModify methods.
There is at least one instance of exactly this situation in the
regression tests, and the ensuing session-lifespan leak is visible
under Valgrind.

We can improve matters by using a memory context reset callback
attached to the ExecutorState context.  That ensures that the
PGresult will be freed when the ExecutorState context is torn
down, even if control never reaches postgresEndDirectModify.
Also, by switching to this approach, we can eliminate some
PG_TRY overhead since it's no longer necessary to be so
cautious about errors.

This patch adds infrastructure that wraps each PGresult in a
"BEPGresult" that provides the reset callback.  Code using this
abstraction is inherently memory-safe (as long as it attaches
the reset callbacks to an appropriate memory context).  The
amount of notational churn is slightly annoying, but I think
that it's probably worth it to forestall future leaks.

So far I created the infrastructure and made relevant code
use it, but I have not done very much towards making the
caller simplifications that should now be possible.  I did
remove a few PQclear calls in error-exit paths that are now
no longer necessary.  But we should be able to remove a lot
of uses of PG_TRY and "volatile" variable markings, as well
as simplify error handling, now that it's not necessary to be
so paranoid about freeing PGresults before throwing an error.
This state of the patch is good for reviewing the notational
choices I made, though.

Author: Tom Lane 
Discussion: https://postgr.es/m/2976982.1748049...@sss.pgh.pa.us
---
 contrib/dblink/dblink.c   | 159 +++---
 contrib/postgres_fdw/connection.c |  54 ++---
 contrib/postgres_fdw/postgres_fdw.c   | 186 
 contrib/postgres_fdw/postgres_fdw.h   |  10 +-
 .../libpqwalreceiver/libpqwalreceiver.c   | 145 +
 src/backend/utils/mmgr/mcxt.c |  39 +++-
 src/include/libpq/libpq-be-fe-helpers.h   |  68 +++---
 src/include/libpq/libpq-be-fe.h   | 203 ++
 src/include/utils/palloc.h|   2 +
 src/tools/pgindent/typedefs.list  |   1 +
 10 files changed, 531 insertions(+), 336 deletions(-)
 create mode 10

Re: MERGE issues around inheritance

2025-05-25 Thread Dean Rasheed
On Sun, 25 May 2025 at 13:41, Dean Rasheed  wrote:
>
> On Sun, 25 May 2025 at 13:06, Tender Wang  wrote:
> >
> > For a partitioned table, we must pass rootResultRelInfo to ExecInsert().
> > I added the check before calling ExecInsert(): if it is a partitioned
> > table, we continue to pass rootResultRelInfo; otherwise, we pass
> > resultRelInfo.
> > Please see the attached diff file. The patch passed all regression test
> > cases.
> >
>
> No, I don't think that's the right fix. I'm looking at it now, and I
> think I have a fix, but it's more complicated than that. I'll post an
> update later.
>

The reason that MERGE must pass rootResultRelInfo to ExecInsert() for
a plain-inheritance table dates back to 387f9ed0a08. As that commit
demonstrates, it is possible for the parent to be excluded from the
plan, and so all of the entries in the resultRelInfo array may be for
different relations than rootResultRelInfo.

So there are indeed two related bugs here:

1. The projection built by ExecInitMerge() in the INSERT case may use
a tuple slot that's not compatible with the target table.

2. ExecInitModifyTable() does not initialize the WCO lists or
RETURNING list for rootResultRelInfo, so those never get executed.

As it happens, it is possible to construct cases where (1) causes a
crash, even without WCO's or a RETURNING list (see the first test case
in the attached patch), so this bug goes all the way back to v15,
where MERGE was introduced.

So I think we need to do something like the attached.

There is perhaps scope to reduce the code duplication between this and
ExecInitPartitionInfo(), but that'd likely make the patch bigger, so I
think it's best to leave that for now.

Regards,
Dean
From 74db4187272171853b69d4d8c7155a778846f299 Mon Sep 17 00:00:00 2001
From: Dean Rasheed 
Date: Sun, 25 May 2025 20:00:48 +0100
Subject: [PATCH v1] Fix MERGE into a plain inheritance parent table.

When a MERGE's target table is the parent of an inheritance tree, any
INSERT actions insert into the parent table using ModifyTableState's
rootResultRelInfo. However, there are at least two bugs in the way
this is initialized:

1. ExecInitMerge() incorrectly uses a ResultRelInfo entry from the
ModifyTableState's resultRelInfo array to build the insert projection,
which may not be compatible with the rootResultRelInfo.

2. ExecInitModifyTable() does not fully initialize rootResultRelInfo
(ri_WithCheckOptions, ri_WithCheckOptionExprs, ri_returningList, and
ri_projectReturning are not initialized).

This can lead to crashes, or a failure to check WCO's or to evaluate
the RETURNING list for MERGE INSERT actions.

Fix both these bugs in ExecInitMerge(), noting that it is only
necessary to fully initialize rootResultRelInfo for a MERGE with
INSERT actions, when the target table is an inheritance parent.

Backpatch to v15, where MERGE was introduced.

Reported-by: Andres Freund 
Discussion: https://postgr.es/m/4rlmjfniiyffp6b3kv4pfy4jw3pciy6mq72rdgnedsnbsx7qe5@j5hlpiwdguvc
Backpatch-through: 15
---
 src/backend/executor/nodeModifyTable.c | 123 -
 src/test/regress/expected/merge.out|  70 ++
 src/test/regress/sql/merge.sql |  49 ++
 3 files changed, 239 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 2bc89bf84dc..871c421fd5b 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -64,6 +64,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
 #include "rewrite/rewriteHandler.h"
+#include "rewrite/rewriteManip.h"
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/datum.h"
@@ -3735,6 +3736,7 @@ ExecInitMerge(ModifyTableState *mtstate, EState *estate)
 			switch (action->commandType)
 			{
 				case CMD_INSERT:
+					/* INSERT actions always use rootRelInfo */
 					ExecCheckPlanOutput(rootRelInfo->ri_RelationDesc,
 										action->targetList);
 
@@ -3774,9 +3776,23 @@ ExecInitMerge(ModifyTableState *mtstate, EState *estate)
 	}
 	else
 	{
-		/* not partitioned? use the stock relation and slot */
-		tgtslot = resultRelInfo->ri_newTupleSlot;
-		tgtdesc = RelationGetDescr(resultRelInfo->ri_RelationDesc);
+		/*
+		 * If the MERGE targets an inherited table, we insert
+		 * into the root table, so we must initialize its
+		 * "new" tuple slot, if not already done, and use its
+		 * relation descriptor for the projection.
+		 *
+		 * For non-inherited tables, rootRelInfo and
+		 * resultRelInfo are the same, and the "new" tuple
+		 * slot will already have been initialized.
+		 */
+		if (rootRelInfo->ri_newTupleSlot == NULL)
+			rootRelInfo->ri_newTupleSlot =
+				table_slot_create(rootRelInfo->ri_RelationDesc,
+								  &estate->es_tupleTable);
+
+		tgtslot = rootRelInfo->ri_newTupleSlot;
+		tgtdesc = RelationGetDescr(rootRelInfo->ri_Relatio

Re: MERGE issues around inheritance

2025-05-25 Thread Tender Wang
Dean Rasheed wrote on Mon, 26 May 2025 at 04:10:

> On Sun, 25 May 2025 at 13:41, Dean Rasheed 
> wrote:
> >
> > On Sun, 25 May 2025 at 13:06, Tender Wang  wrote:
> > >
> > > For a partitioned table, we must pass rootResultRelInfo to
> ExecInsert(). I added the check before calling ExecInsert()
> > > If it is a partitioned table, we continue to pass rootResultRelInfo.
> Otherwise, we pass resultRelInfo.
> > > Please see the attached diff file. The patch passed all regression
> test cases.
> > >
> >
> > No, I don't think that's the right fix. I'm looking at it now, and I
> > think I have a fix, but it's more complicated than that. I'll post an
> > update later.
> >
>
> The reason that MERGE must pass rootResultRelInfo to ExecInsert() for
> a plain-inheritance table dates back to 387f9ed0a08. As that commit
> demonstrates, it is possible for the parent to be excluded from the
> plan, and so all of the entries in the resultRelInfo array may be for
> different relations than rootResultRelInfo.
>

Thanks for the information.  Passing resultRelInfo to ExecInsert() is wrong.


> So there are indeed two related bugs here:
>
> 1. The projection built by ExecInitMerge() in the INSERT case may use
> a tuple slot that's not compatible with the target table.
>
> 2. ExecInitModifyTable() does not initialize the WCO lists or
> RETURNING list for rootResultRelInfo, so those never get executed.
>
> As it happens, it is possible to construct cases where (1) causes a
> crash, even without WCO's or a RETURNING list (see the first test case
> in the attached patch), so this bug goes all the way back to v15,
> where MERGE was introduced.
>
> So I think we need to do something like the attached.
>

I tested the attached patch, and the problem no longer occurs.  LGTM.


> There is perhaps scope to reduce the code duplication between this and
> ExecInitPartitionInfo(), but that'd likely make the patch bigger, so I
> think it's best to leave that for now.
>

Agreed.

-- 
Thanks,
Tender Wang


Custom GUCs and typos

2025-05-25 Thread Srinath Reddy Sadipiralla
Hi,
While going through this thread [0], I formed some different views and
questions on this problem, so I'm creating a separate thread. This is
just an attempt; I might be completely wrong here.

1) On top of the OP's patch, I added support for warning if the prefix of
a custom GUC is invalid. For valid questions such as "How do you know
that's a bogus prefix?  It could perfectly well be a fully valid setting
for an extension that the installation doesn't choose to preload.", we
can get the info about such extensions using extension_file_exists()
(thanks to Greg for proposing this), which tells us that the extension is
installed but not preloaded, and hence could be a potentially valid
extension. So if the prefix is not in reserved_class_prefix and also not
in the prefix/share/extension dir, then it is surely invalid, so I warn
and remove the GUC from the hashtable.

2) For preloaded extensions, if the suffix/parameter is a typo, or if we
want to create a new custom GUC, e.g. "pg_stat_statements.count_something
= 1", we currently throw a warning:

WARNING:  invalid configuration parameter name
"pg_stat_statements.count_something", removing it
DETAIL:  "pg_stat_statements" is now a reserved prefix.

I guess this means you can't create a new custom GUC manually; you have
to use the "DefineCustomXXXVariable" routines. But this kind of warning
is not there for non-preloaded extensions, so I added the same for
non-preloaded ones, once again checking extension_file_exists(): if the
extension was built/installed, it could be a potentially valid extension,
so it warns.

For example, if I try to add "plpgsql.ironman = 3000":

WARNING:  invalid configuration parameter name "plpgsql.ironman", removing
it
DETAIL:  "plpgsql.ironman" has a valid prefix.

3) I also added all of the above support for the SET command, which
previously was not covered.
4) TODO: add support for ALTER SYSTEM SET, if this patch makes sense.

thoughts?


[0]
https://www.postgresql.org/message-id/flat/196f3f8e12f.87666a6c16169.9160742057750715009%40zohocorp.com
-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/


v1-0001-Reject-and-warn-invalid-custom-GUCs.patch
Description: Binary data


Re: POC: Parallel processing of indexes in autovacuum

2025-05-25 Thread Daniil Davydov
Hi,

On Fri, May 23, 2025 at 6:12 AM Masahiko Sawada  wrote:
>
> On Thu, May 22, 2025 at 12:44 AM Daniil Davydov <3daniss...@gmail.com> wrote:
> >
> > On Wed, May 21, 2025 at 5:30 AM Masahiko Sawada  
> > wrote:
> > >
> > > I find that the name "autovacuum_reserved_workers_num" is generic. It
> > > would be better to have a more specific name for parallel vacuum such
> > > as autovacuum_max_parallel_workers. This parameter is related to
> > > neither autovacuum_worker_slots nor autovacuum_max_workers, which
> > > seems fine to me. Also, max_parallel_maintenance_workers doesn't
> > > affect this parameter.
> >
> > This was my headache when I created names for variables. Autovacuum
> > initially implies parallelism, because we have several parallel a/v
> > workers.
>
> I'm not sure if it's parallelism. We can have multiple autovacuum
> workers simultaneously working on different tables, which seems not
> parallelism to me.

Hm, I hadn't thought about the definition of 'parallelism' in this way.
But I see your point - the next v4 patch will contain the naming that
you suggest.

>
> > So I think that parameter like
> > `autovacuum_max_parallel_workers` will confuse somebody.
> > If we want to have a more specific name, I would prefer
> > `max_parallel_index_autovacuum_workers`.
>
> It's better not to use 'index' as we're trying to extend parallel
> vacuum to heap scanning/vacuuming as well[1].

OK, I'll fix it.

> > > +   /*
> > > +* If we are running autovacuum - decide whether we need to process 
> > > indexes
> > > +* of table with given oid in parallel.
> > > +*/
> > > +   if (AmAutoVacuumWorkerProcess() &&
> > > +   params->index_cleanup != VACOPTVALUE_DISABLED &&
> > > +   RelationAllowsParallelIdxAutovac(rel))
> > >
> > > I think that this should be done in autovacuum code.
> >
> > We need params->index cleanup variable to decide whether we need to
> > use parallel index a/v. In autovacuum.c we have this code :
> > ***
> > /*
> >  * index_cleanup and truncate are unspecified at first in autovacuum.
> >  * They will be filled in with usable values using their reloptions
> >  * (or reloption defaults) later.
> >  */
> > tab->at_params.index_cleanup = VACOPTVALUE_UNSPECIFIED;
> > tab->at_params.truncate = VACOPTVALUE_UNSPECIFIED;
> > ***
> > This variable is filled in inside the `vacuum_rel` function, so I
> > think we should keep the above logic in vacuum.c.
>
> I guess that we can specify the parallel degree even if index_cleanup
> is still UNSPECIFIED. vacuum_rel() would then decide whether to use
> index vacuuming and vacuumlazy.c would decide whether to use parallel
> vacuum based on the specified parallel degree and index_cleanup value.
>
> >
> > > +#define AV_PARALLEL_DEADTUP_THRESHOLD  1024
> > >
> > > These fixed values really useful in common cases? I think we already
> > > have an optimization where we skip vacuum indexes if the table has
> > > fewer dead tuples (see BYPASS_THRESHOLD_PAGES).
> >
> > When we allocate dead items (and optionally init parallel autovacuum) we
> > don't have a sane value for `vacrel->lpdead_item_pages` (which should be
> > compared with BYPASS_THRESHOLD_PAGES).
> > The only criterion that we can focus on is the number of dead tuples
> > indicated in the PgStat_StatTabEntry.
>
> My point is that this criterion might not be useful. We have the
> bypass optimization for index vacuuming and having many dead tuples
> doesn't necessarily mean index vacuuming taking a long time. For
> example, even if the table has a few dead tuples, index vacuuming
> could take a very long time and parallel index vacuuming would help
> the situation, if the table is very large and has many indexes.

That sounds reasonable. I'll fix it.

> > But autovacuum (as I see it) should work as stably as possible and
> > go `unnoticed` by other processes. Thus, we must:
> > 1) Compute resources (such as the number of parallel workers for a
> > single table's indexes vacuuming) as efficiently as possible.
> > 2) Provide a guarantee that as many tables as possible (among
> > requested) will be processed in parallel.
> >
> > (1) can be achieved by calculating the parameters on the fly.
> > NUM_INDEXES_PER_PARALLEL_WORKER is a rough mock. I can provide more
> > accurate value in the near future.
>
> I think it requires more things than the number of indexes on the
> table to achieve (1). Suppose that there is a very large table that
> gets updates heavily and has a few indexes. If users want to avoid the
> table from being bloated, it would be a reasonable idea to use
> parallel vacuum during autovacuum and it would not be a good idea to
> disallow using parallel vacuum solely because it doesn't have more
> than 30 indexes. On the other hand, if the table had got many updates
> but not so now, users might want to use resources for autovacuums on
> other tables. We might need to consider autovacuum frequencies per
> table, the statistics of the previous autovacuum, or system loads etc.
> So I thi

Re: Non-reproducible AIO failure

2025-05-25 Thread Thomas Munro
On Sun, May 25, 2025 at 3:22 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Can you get a core and print *ioh in the debugger?
>
> So far, I've failed to get anything useful out of core files
> from this failure.  The trace goes back no further than
>
> (lldb) bt
> * thread #1
>   * frame #0: 0x00018de39388 libsystem_kernel.dylib`__pthread_kill + 8
>
> That's quite odd in itself: while I don't find the debugging
> environment on macOS to be the greatest, it's not normally
> this unhelpful.

(And Alexander reported the same off-list.)  It's interesting that the
elog.c backtrace stuff is able to analyse the stack, and it looks
normal AFAICS.  Could that be interfering with the stack in the core?!
I doubt it ... I kinda wonder if the debugger might be confused about
libsystem sometimes, since it has ceased to be a regular Mach-O file on
disk, but IDK; maybe gdb (from MacPorts etc.) would offer a clue?

So far we have:

TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"),
File: "bufmgr.c", Line: 1605, PID: 20931
0   postgres    0x000105299c84 ExceptionalCondition + 108
1   postgres    0x0001051159ac WaitReadBuffers + 616
2   postgres    0x0001053611ec read_stream_next_buffer.cold.1 + 184
3   postgres    0x000105111630 read_stream_next_buffer + 300
4   postgres    0x000104e0b994 heap_fetch_next_buffer + 136
5   postgres    0x000104e018f4 heapgettup_pagemode + 204

Hmm, looking around that code and wracking my brain for things that
might happen on one OS but not others, I wonder about partial I/Os.
Perhaps combined with some overlapping requests.

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c",
Line: 161, PID: 32355
0   postgres    0x000104f078f4 ExceptionalCondition + 236
1   postgres    0x000104c0ebd4 pgaio_io_before_start + 260
2   postgres    0x000104c0ea94 pgaio_io_start_readv + 36
3   postgres    0x000104c2d4e8 FileStartReadV + 252
4   postgres    0x000104c807c8 mdstartreadv + 668
5   postgres    0x000104c83db0 smgrstartreadv + 116

But this one seems like a more basic confusion...  wild writes
somewhere?  Hmm, we need to see what's in that struct.

If we can't get a debugger to break there or a core file to be
analysable, maybe we should try logging as much info as possible at
those points to learn a bit more?  I would be digging like that myself
but I haven't seen this failure on my little M4 MacBook Air yet
(Sequoia 15.5, Apple clang-1700.0.13.3).  It is infected with
corporate security-ware that intercepts at least file system stuff and
slows it down and I can't even convince it to dump core files right
now.  Could you guys please share your exact repro steps?




Re: Fixing memory leaks in postgres_fdw

2025-05-25 Thread Tom Lane
I wrote:
> Here is an attempt at making a bulletproof fix by having all backend
> users of libpq go through a wrapper layer that provides the memory
> context callback.  Perhaps this is more code churn than we want to
> accept; I'm not sure.  I thought about avoiding most of the niggling
> code changes by adding
> #define PGresult BEPGresult
> #define PQclear BEPQclear
> #define PQresultStatus BEPQresultStatus
> and so forth at the bottom of the new header file, but I'm afraid
> that would create a lot of confusion.

I tried that, and it leads to such a less-messy patch that I think
we should probably do it this way, confusion or no.  One argument
that can be made in favor is that we don't really want random
notational differences between frontend and backend code that's
doing the same thing.

Also, I'd been struggling with the assumption that we should
palloc the wrapper object before calling PQgetResult; there
doesn't seem to be any nice way to make that transparent to
callers.  I realized that we can make it simple as long as
we're willing to assume that allocating with MCXT_ALLOC_NO_OOM
can't throw an error.  But we assume that in other usages too.

Hence, v3 attached.  The 0002 patch is still the same as before.

regards, tom lane

From be4b888d0d2936cb80e63092d14a3133fab590da Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Sun, 25 May 2025 19:49:33 -0400
Subject: [PATCH v3 1/2] Fix memory leakage in postgres_fdw's DirectModify code
 path.

postgres_fdw tries to use PG_TRY blocks to ensure that it will
eventually free the PGresult created by the remote modify command.
However, it's fundamentally impossible for this scheme to work
reliably when there's RETURNING data, because the query could fail
in between invocations of postgres_fdw's DirectModify methods.
There is at least one instance of exactly this situation in the
regression tests, and the ensuing session-lifespan leak is visible
under Valgrind.

We can improve matters by using a memory context reset callback
attached to the ExecutorState context.  That ensures that the
PGresult will be freed when the ExecutorState context is torn
down, even if control never reaches postgresEndDirectModify.
Also, by switching to this approach, we can eliminate some
PG_TRY overhead since it's no longer necessary to be so
cautious about errors.

This patch adds infrastructure that wraps each PGresult in a
"libpqsrv_PGresult" that provides the reset callback.  Code using
this abstraction is inherently memory-safe (so long as it attaches
the reset callbacks to an appropriate memory context).  Furthermore,
we add some macros that automatically redirect calls of the libpq
functions concerned with PGresults to use this infrastructure,
so that almost no source-code changes are needed to wheel this
infrastructure into place.

So far I created the infrastructure and made relevant code
use it, but I have not done very much towards making the
caller simplifications that should now be possible.  I did
remove a few PQclear calls in error-exit paths that are now
no longer necessary.  But we should be able to remove a lot
of uses of PG_TRY and "volatile" variable markings, as well
as simplify error handling, now that it's not necessary to be
so paranoid about freeing PGresults before throwing an error.
I did fix libpqsrv_get_result_last() that way, just as proof
of concept.

Author: Tom Lane 
Discussion: https://postgr.es/m/2976982.1748049...@sss.pgh.pa.us
---
 contrib/postgres_fdw/postgres_fdw.c   |  19 +-
 contrib/postgres_fdw/postgres_fdw.h   |   2 +-
 .../libpqwalreceiver/libpqwalreceiver.c   |  31 +--
 src/backend/utils/mmgr/mcxt.c |  39 ++-
 src/include/libpq/libpq-be-fe-helpers.h   |  59 ++--
 src/include/libpq/libpq-be-fe.h   | 259 ++
 src/include/utils/palloc.h|   2 +
 src/tools/pgindent/typedefs.list  |   1 +
 8 files changed, 327 insertions(+), 85 deletions(-)
 create mode 100644 src/include/libpq/libpq-be-fe.h

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 331f3fc088d..3cc27c5a1e2 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -3779,9 +3779,6 @@ create_cursor(ForeignScanState *node)
 
 	/*
 	 * Get the result, and check for success.
-	 *
-	 * We don't use a PG_TRY block here, so be careful not to throw error
-	 * without releasing the PGresult.
 	 */
 	res = pgfdw_get_result(conn);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
@@ -4178,9 +4175,6 @@ execute_foreign_modify(EState *estate,
 
 	/*
 	 * Get the result, and check for success.
-	 *
-	 * We don't use a PG_TRY block here, so be careful not to throw error
-	 * without releasing the PGresult.
 	 */
 	res = pgfdw_get_result(fmstate->conn);
 	if (PQresultStatus(res) !=
@@ -4248,9 +4242,6 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/*
 	 * Get the result, and check for success.
-	 *
-	 

Re: Non-reproducible AIO failure

2025-05-25 Thread Tom Lane
Thomas Munro  writes:
> Could you guys please share your exact repro steps?

I've just been running 027_stream_regress.pl over and over.
It's not a recommendable answer though because the failure
probability is tiny, under 1%.  It sounded like Alexander
had a better way.

regards, tom lane




Re: Add CHECK_FOR_INTERRUPTS in polling loop code path in XactLockTableWait

2025-05-25 Thread Fujii Masao




On 2025/05/24 5:41, Kevin K Biju wrote:

Hi,

While creating a logical replication slot, we wait for older transactions to complete to 
reach a "consistent point", which can take a while on busy databases. If we're 
creating a slot on a primary instance, it's pretty clear that we're waiting on a 
transaction:


postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM
pg_stat_activity WHERE pid=804;
 pid | wait_event_type |  wait_event   | state  |                            query
-----+-----------------+---------------+--------+----------------------------------------------------------------
 804 | Lock            | transactionid | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput');
(1 row)
postgres=# SELECT locktype,transactionid,pid,mode FROM pg_locks WHERE pid=804
AND granted='f';
   locktype    | transactionid | pid |   mode
---------------+---------------+-----+-----------
 transactionid |          6077 | 804 | ShareLock

However, creating a slot on a hot standby behaves very differently when blocked 
on a transaction. Querying pg_stat_activity doesn't give us any indication on 
what the issue is:


postgres=# SELECT pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)
postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM
pg_stat_activity WHERE pid=5074;
 pid  | wait_event_type | wait_event | state  |                            query
------+-----------------+------------+----------------------------------------------------------------
 5074 |                 |            | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput');

And more importantly, a backend in this state cannot be terminated via the 
usual methods and sometimes requires a server restart.


postgres=# SELECT pg_cancel_backend(5074), pg_terminate_backend(5074);
 pg_cancel_backend | pg_terminate_backend
-------------------+----------------------
 t                 | t
(1 row)
postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM
pg_stat_activity WHERE pid=5074;
 pid  | wait_event_type | wait_event | state  |                            query
------+-----------------+------------+----------------------------------------------------------------
 5074 |                 |            | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput');

I've taken a look at the code that "awaits" on transactions, and the function of
interest is lmgr.c/XactLockTableWait. On a primary, it is able to acquire a ShareLock on
the xid, and the lock manager does a good job of making this wait efficient as well as
visible externally. On a standby, however, it cannot wait on the xid, since the standby
is not running the transaction. But it knows the transaction is still running from
KnownAssignedXids, and then ends up on this codepath:

		/*
		 * Some uses of this function don't involve tuple visibility -- such
		 * as when building snapshots for logical decoding.  It is possible to
		 * see a transaction in ProcArray before it registers itself in the
		 * locktable.  The topmost transaction in that case is the same xid,
		 * so we try again after a short sleep.  (Don't sleep the first time
		 * through, to avoid slowing down the normal case.)
		 */
		if (!first)
			pg_usleep(1000L);

The quoted comment suggests that this sleep was only meant to be hit a few
times while we wait for the lock to appear so we can wait on it. However, in
the hot standby case, this ends up becoming a polling loop, since the lock will
never appear. There is no interrupt processing in this loop either, so the
only way out is to wait for the transaction on the primary to complete.


I agree with your analysis. It makes sense to me.



Attached is a patch that adds CHECK_FOR_INTERRUPTS before the sleep in 
XactLockTableWait and ConditionalXactLockTableWait. This allows backends 
waiting on a transaction during slot creation on a standby to be interrupted.


+1


It would also be nice if there was a way for users to tell what we're waiting 
on (maybe a different wait event?) but I'd like input on that.


Just an idea: how about calling pgstat_report_wait_start() and _end() around
pg_usleep(), similar to what WaitExceedsMaxStandbyDelay() does?
Since this is more of an improvement than a bug fix, I think
it would be better to make it as a separate patch from the CFI addition.


Also, would waiting only 1ms per loop cycle be too aggressive? If so, maybe
we could increase the sleep time gradually during recovery (but not beyond
1 second), again similar to WaitExceedsMaxStandbyDelay().

Regards,

--
Fujii Masao
NTT DATA Japan Corporation





Re: Fix slot synchronization with two_phase decoding enabled

2025-05-25 Thread Nisha Moond
On Fri, May 23, 2025 at 10:06 PM Masahiko Sawada  wrote:
>
>
> Here are review comments for v14 patch:
>

Thank you for the review.

> I think we need to include a basic test case where we simply create a
> subscription with two_phase=true and then enable the failover via
> ALTER SUBSCRIPTION after making sure that slot's restart_lsn passed
> two_phase_at. Probably if we use this case in 003_logical_slots.pl, we
> can avoid creating regress_sub2?
>

A test has been added in 040_standby_failover_slots_sync.pl to enable
failover via ALTER SUBSCRIPTION.
Regarding regress_sub2 in 003_logical_slots.pl:
If we use ALTER SUBSCRIPTION to set failover=true on regress_sub, it
leads to pg_upgrade failure, as it attempts to create a slot on the
new_node(upgraded version) with both two_phase and failover enabled,
which is an unsupported combination.

> ---
> +# Advance the slot's restart_lsn to allow enabling the failover option
> +# on a two_phase-enabled subscription using ALTER SUBSCRIPTION.
> +$publisher->safe_psql(
> +'postgres', qq(
> +BEGIN;
> +SELECT txid_current();
> +SELECT pg_log_standby_snapshot();
> +COMMIT;
> +BEGIN;
> +SELECT txid_current();
> +SELECT pg_log_standby_snapshot();
> +COMMIT;
> +));
> +
> +# Alter subscription to enable failover
> +$subscriber1->psql(
> +'postgres',
> +"ALTER SUBSCRIPTION regress_mysub2 DISABLE;
> + ALTER SUBSCRIPTION regress_mysub2 SET (failover = true);");
>
> I think we need to wait for the subscription to consume these changes
> before disabling the subscription. If the publisher doesn't consume
> these WAL records for some reason, the subsequent "ALTER SUBSCRIPTION
> ... SET (failover = true)" will fail.
>

Done.

> ---
> @@ -25,6 +25,8 @@ $publisher->init(
>  $publisher->append_conf('postgresql.conf', 'autovacuum = off');
>  $publisher->start;
>
> +$publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int)");
> +
>  $publisher->safe_psql('postgres',
> "CREATE PUBLICATION regress_mypub FOR ALL TABLES;");
>
> @@ -42,6 +44,8 @@ my $slot_creation_time_on_primary = $publisher->safe_psql(
>  SELECT current_timestamp;
>  ]);
>
> +$subscriber1->safe_psql('postgres', "CREATE TABLE tab_full (a int)");
> +
>  # Create a subscription that enables failover.
>  $subscriber1->safe_psql('postgres',
> "CREATE SUBSCRIPTION regress_mysub1 CONNECTION
> '$publisher_connstr' PUBLICATION regress_mypub WITH (slot_name =
> lsub1_slot, copy_data = false, failover = true, enabled = false);"
> @@ -98,6 +102,114 @@ my ($result, $stdout, $stderr) =
> $subscriber1->psql('postgres',
>  ok( $stderr =~ /ERROR:  cannot set failover for enabled subscription/,
> "altering failover is not allowed for enabled subscription");
>
> I think it's better to create the tables before the new test starts
> with a comment explaining why we need to create the table here.
>

Done.
~~~

Please find the updated patch v15, addressing the above comments.

--
Thanks,
Nisha


v15-0001-PG17-Approach-3-Fix-slot-synchronization-for-two.patch
Description: Binary data


Re: on_error table, saving error info to a table

2025-05-25 Thread Nishant Sharma
>
> On Tue, Dec 17, 2024 at 12:31 PM Kirill Reshke 
> wrote:
> >
> > On Mon, 16 Dec 2024 at 16:50, Nishant Sharma
> >  wrote:
> > > Also, I think Andrew's suggestion can resolve the concern me and Krill
> > > had on forcing users to create tables with correct column names and
> > > numbers. Also, will make error table checking simpler. No need for the
> > > above kind of checks.
> >
> > +1 on that.
> >
>
> Syntax: COPY (on_error table, table error_saving_tbl);
> seems not ideal.
>
> But auto-creating the on_error table if it does not exist seems much
> harder.
>
> Since we cannot use the SPI interface here, maybe we can use DefineRelation
> to auto-create a table; but if the table already exists, then
> our operation would get stuck for a reason unrelated to COPY.
> Also, auto-creating means we need to come up with a magic table name for
> all COPY (on_error table)
> operations, which seems not ideal IMO.
>
> I realized we should error out in cases like:
> COPY err_tbl FROM STDIN WITH (DELIMITER ',', on_error table, table
> err_tbl);
>
> Also, by changing copy_generic_opt_arg, now we can
> COPY err_tbl FROM STDIN WITH (DELIMITER ',', on_error table, table  x);
> previously, we could only do
> COPY err_tbl FROM STDIN WITH (DELIMITER ',', on_error 'table', table  x);
>

I am not sure if you understood Andrew's suggestion.
As per my understanding he did not suggest auto-creating the error table,
he suggested using TYPED TABLES for the error saving table. For example:

postgres=# CREATE TYPE error_saving_table_type AS (userid oid, copy_tbl oid,
filename text, lineno bigint, line text, colname text, raw_field_value text,
err_message text, err_detail text, errorcode text);
CREATE TYPE

We can have something similar to the above in some initdb script, which will
help make the above type a kind of derived or standard error-saving table
type in PG.

And then, user can use above TYPE to create error saving table like below:
postgres=# CREATE TABLE error_saving_table OF error_saving_table_type;
CREATE TABLE

After this, user can use error_saving_table with the COPY command like
below:
COPY t_copy_tbl(a,b) FROM STDIN WITH (DELIMITER ',', on_error table,
table error_saving_table);

Here's a manual example of an insert into that table:
postgres=# INSERT INTO error_saving_table VALUES (1234, 4321, 'abcd', 12,
'This is was getting copied', 'xyz', 'pqr', 'testing type error table',
'inserting into typed table error saving table', 'test error code');
INSERT 0 1
postgres=# SELECT * from error_saving_table;
 userid | copy_tbl | filename | lineno |            line            | colname | raw_field_value |       err_message        |                  err_detail                   |    errorcode
--------+----------+----------+--------+----------------------------+---------+-----------------+--------------------------+-----------------------------------------------+-----------------
   1234 |     4321 | abcd     |     12 | This is was getting copied | xyz     | pqr             | testing type error table | inserting into typed table error saving table | test error code
(1 row)


With the above we don't need to check all the 12 columns' count, their data
types etc. in the patch. "Then all you would need to check is the reloftype to
make sure it's the right type," as quoted by Andrew.

This will make the patch simpler and also remove the burden on users of
creating error-saving tables with the correct columns, since the TYPE will
already be available by default for creating error-saving tables.


Regards,
Nishant.


Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

2025-05-25 Thread Amit Kapila
On Sat, May 24, 2025 at 6:59 PM Alexander Korotkov  wrote:
>
> Hi, Amit!
>
> Thank you for your attention to this patchset!
>
> On Sat, May 24, 2025 at 2:15 PM Amit Kapila  wrote:
> > On Sat, May 24, 2025 at 4:08 PM Alexander Korotkov  
> > wrote:
> > >
> > > I spend more time on this.  The next revision is attached.  It
> > > contains revised comments and other cosmetic changes.  I'm going to
> > > backpatch 0001 to all supported branches,
> > >
> >
> > Is my understanding correct that we need 0001 because
> > PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> > changing the slot's restart_lsn?
>
> Yes.  Also, even if it did save the slot to disk, there would still be a
> race condition: a concurrent checkpoint could use the updated value from
> shared memory to clean old WAL segments, and then a crash could happen
> before we managed to write the slot to disk.
>

How can that happen, if we first write the updated value to disk and
then update the shared memory as we do in
LogicalConfirmReceivedLocation?

BTW, won't there be a similar problem with physical slot's xmin
computation as well? In PhysicalReplicationSlotNewXmin(), after
updating the slot's xmin computation, we mark it as dirty and update
shared memory values. Now, say after checkpointer writes these xmin
values to disk, walsender receives another feedback message and
updates the slot's xmin values. Now using these updated shared memory
values, vacuum removes the rows; however, a restart will show the
older xmin values in the slot, which means vacuum would have removed
the required rows before the restart.

As per my understanding, neither the xmin nor the LSN problem exists
for logical slots. I am pointing this out to indicate we may need to
think of a different solution for physical slots, if these are
problems only for physical slots.

-- 
With Regards,
Amit Kapila.




Re: Automatically sizing the IO worker pool

2025-05-25 Thread wenhui qiu
Hi,
> On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote:
> It's hard to know how to set io_workers=3.  If it's too small,
> io_method=worker's small submission queue overflows and it silently
> falls back to synchronous IO.  If it's too high, it generates a lot of
> pointless wakeups and scheduling overhead, which might be considered
> an independent problem or not, but having the right size pool
> certainly mitigates it.  Here's a patch to replace that GUC with:
>
>   io_min_workers=1
>   io_max_workers=8
>   io_worker_idle_timeout=60s
>   io_worker_launch_interval=500ms
>
> It grows the pool when a backlog is detected (better ideas for this
> logic welcome), and lets idle workers time out.
I also like the idea. Can we set:
io_workers = 3
io_max_workers = cpu/4
io_workers_oversubscribe = 3 (range 1-8)
io_workers * io_workers_oversubscribe <= io_max_workers

On Sun, May 25, 2025 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote:
> > It's hard to know how to set io_workers=3.  If it's too small,
> > io_method=worker's small submission queue overflows and it silently
> > falls back to synchronous IO.  If it's too high, it generates a lot of
> > pointless wakeups and scheduling overhead, which might be considered
> > an independent problem or not, but having the right size pool
> > certainly mitigates it.  Here's a patch to replace that GUC with:
> >
> >   io_min_workers=1
> >   io_max_workers=8
> >   io_worker_idle_timeout=60s
> >   io_worker_launch_interval=500ms
> >
> > It grows the pool when a backlog is detected (better ideas for this
> > logic welcome), and lets idle workers time out.
>
> I like the idea. In fact, I've been pondering about something like a
> "smart" configuration for quite some time, and am convinced that a similar
> approach needs to be applied to many performance-related GUCs.
>
> Idle timeout and launch interval serving as a measure of sensitivity
> makes sense to me, but growing the pool when a backlog (queue_depth >
> nworkers, so even the slightest backlog?) is detected seems somewhat
> arbitrary. From what I understand the pool growing velocity is constant
> and does not depend on the worker demand (i.e. queue_depth)? It may sound
> fancy, but I've got the impression it should be possible to apply what's
> called a "low-pass filter" in control theory (sort of a transfer
> function with an exponential decay) to smooth out the demand and adjust
> the worker pool based on that.
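The low-pass-filter idea quoted above could look like this integer-only sketch; the fixed-point scaling, the alpha value, and the function names are illustrative assumptions, not anything from the patch:

```c
#include <assert.h>

/* Exponentially smoothed queue depth: a first-order low-pass filter.
 * alpha_x1000 is alpha scaled by 1000, in (0, 1000]; smaller alpha means
 * a slower, smoother response to demand spikes. */
static long smooth_x1000(long prev_x1000, long sample, long alpha_x1000)
{
    /* classic EMA update: prev + alpha * (sample - prev) */
    return prev_x1000 + alpha_x1000 * (sample * 1000 - prev_x1000) / 1000;
}

/* Pick a pool size from the smoothed demand, clamped to [min, max]. */
static int pool_target(long smoothed_x1000, int min_workers, int max_workers)
{
    int target = (int) ((smoothed_x1000 + 999) / 1000);  /* round up */
    if (target < min_workers) target = min_workers;
    if (target > max_workers) target = max_workers;
    return target;
}
```

The launcher would feed each observed queue_depth sample through smooth_x1000 and resize toward pool_target, instead of reacting to every momentary backlog.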
>
> As a side note, it might be far-fetched, but there are instruments in
> queueing theory to figure out how many workers are needed to guarantee a
> certain low queueing probability, but for that one needs to have an
> average arrival rate (in our case, average number of IO operations
> dispatched to workers) and an average service rate (average number of IO
> operations performed by workers).
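For the queueing-theory angle quoted above, a standard Erlang C sketch: given an offered load a = lambda/mu (arrival rate over per-worker service rate), it computes the probability an arriving IO request must queue, and the smallest pool keeping that probability under a target. The function names and bounds are assumptions for illustration only:

```c
#include <assert.h>
#include <math.h>

/* Erlang C formula: probability an arriving request must wait, for an
 * M/M/n queue with offered load a (in worker-equivalents) and n workers.
 * Requires a < n for a stable queue. */
static double erlang_c(double a, int n)
{
    double term = 1.0, sum = 1.0;               /* k = 0 term of the sum */
    for (int k = 1; k < n; k++)
    {
        term *= a / k;                          /* term = a^k / k! */
        sum += term;
    }
    double top = term * (a / n) * (n / (n - a)); /* (a^n/n!) * n/(n-a) */
    return top / (sum + top);
}

/* Smallest pool size keeping the queueing probability under target. */
static int workers_needed(double a, double target, int max_workers)
{
    for (int n = (int) a + 1; n <= max_workers; n++)
        if (erlang_c(a, n) < target)
            return n;
    return max_workers;
}
```

With a measured average arrival and service rate this would give a principled floor for the pool size, though both rates still have to be estimated online.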
>
>
>


Re: Hash table scans outside transactions

2025-05-25 Thread Tomas Vondra
On 5/25/25 13:36, Dilip Kumar wrote:
> On Wed, May 21, 2025 at 3:02 AM Aidar Imamov  wrote:
>>
>> Hi!
>> I tried to resend this thread directly to myself, but for some reason it
>> ended up in the whole hackers list..
>>
>> I thought I'd chime in on this topic since it hasn't really been
>> discussed
>> anywhere else (or maybe I missed it).
>> I've attached two patches: the first one is a little test extension to
>> demonstrate the problem (just add "hash_test" to
>> "shared_preload_libraries"),
>> and the second is a possible solution. Basically, the idea is that we
>> don't
>> reset the scan counter if we find scans that started outside of the
>> current
>> transaction at the end. I have to admit, though, that I can't
>> immediately
>> say what other implications this might have or what else we need to
>> watch
>> out for if we try this.
>> Maybe any thoughts on that?
> 
> I haven't reviewed the complete patch or tested it, but I don't see
> any issues with it.
> 

I may be wrong, but I'd guess the restriction is there for a reason.
Maybe it's not needed anymore, or maybe it's needed only in some cases,
or something like that.

So the most important thing for a patch removing the restriction is to
explain why it's there and why it's safe to relax it. The extension
demonstrating that the restriction exists doesn't really help with that:
it shows why we have it, not that it's safe to remove it.

I'm not a dynahash expert, but isn't the first sentence from dynahash.c
relevant:

 * dynahash.c supports both local-to-a-backend hash tables and hash
 * tables in shared memory.  For shared hash tables, it is the caller's
 * responsibility to provide appropriate access interlocking.  The
 * simplest convention is that a single LWLock protects the whole hash
 * table.  Searches (HASH_FIND or hash_seq_search) need only shared
 * lock, but any update requires exclusive lock.

In other words, let's say you have a hash table, protected by LWLock.
That is, you're holding the lock in shared mode during seqscan. And then
we drop the locks for some reason (say, transaction end). The table can
be modified - won't that break the seqscan?

FWIW I think with the use case from the beginning of this thread:

1. Add/update/remove entries in hash table
2. Scan the existing entries and perform one transaction per entry
3. Close scan

Why not simply build a linked list after step (1)?
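That alternative could be sketched like this: snapshot the keys into a private list while holding the lock, then process the list one transaction per entry, so no hash scan stays open across transactions. A plain array stands in for the hash table and all names are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct KeyCell
{
    int key;
    struct KeyCell *next;
} KeyCell;

/* Copy all keys out of the "hash table" while it is locked; after this
 * the lock can be released and the table freely modified. */
static KeyCell *snapshot_keys(const int *table, int nentries)
{
    KeyCell *head = NULL;
    for (int i = nentries - 1; i >= 0; i--)   /* prepend: O(1) per entry */
    {
        KeyCell *cell = malloc(sizeof(KeyCell));
        if (cell == NULL)
            abort();                          /* sketch-level error handling */
        cell->key = table[i];
        cell->next = head;
        head = cell;
    }
    return head;
}

/* Walk the private list; one transaction per entry would go here. */
static int process_entries(KeyCell *head)
{
    int n = 0;
    for (KeyCell *c = head; c != NULL; c = c->next)
        n++;
    return n;
}
```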


regards

-- 
Tomas Vondra





Re: Make COPY format extendable: Extract COPY TO format implementations

2025-05-25 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Fri, 9 May 2025 17:57:35 -0700,
  Masahiko Sawada  wrote:

>> Proposed approaches to register custom COPY formats:
>> a. Create a function that has the same name of custom COPY
>>format
>> b. Call a register function from _PG_init()
>>
>> FYI: I proposed c. approach that uses a. but it always
>> requires schema name for format name in other e-mail.
> 
> With approach (c), do you mean that we require users to change all
> FORMAT option values like from 'text' to 'pg_catalog.text' after the
> upgrade? Or are we exempt the built-in formats?

The latter. 'text' must be accepted because existing pg_dump
results use 'text'. If we reject 'text', it's a big
incompatibility. (We can't dump on old PostgreSQL and
restore to new PostgreSQL.)


>> Users can register the same format name:
>> a. Yes
>>* Users can distinct the same format name by schema name
>>* If format name doesn't have schema name, the used
>>  format depends on search_path
>>  * Pros:
>>* Using schema for it is consistent with other
>>  PostgreSQL mechanisms
>>* Custom format never conflict with built-in
>>  format. For example, an extension register "xml" and
>>  PostgreSQL adds "xml" later, they are never
>>  conflicted because PostgreSQL's "xml" is registered
>>  to pg_catalog.
>>  * Cons: Different format may be used with the same
>>input. For example, "jsonlines" may choose
>>"jsonlines" implemented by extension X or implemented
>>by extension Y when search_path is different.
>> b. No
>>* Users can use "${schema}.${name}" for format name
>>  that mimics PostgreSQL's builtin schema (but it's just
>>  a string)
>>
>>
>> Built-in formats (text/csv/binary) should be able to
>> overwritten by extensions:
>> a. (The current patch is no but David's answer is) Yes
>>* Pros: Users can use drop-in replacement faster
>>  implementation without changing input
>>* Cons: Users may overwrite them accidentally.
>>  It may break pg_dump result.
>>  (This is called as "backward incompatibility.")
>> b. No
> 
> The summary matches my understanding. I think the second point is
> important. If we go with a tablesample-like API, I agree with David's
> point that all FORMAT values including the built-in formats should
> depend on the search_path value. While it provides a similar user
> experience to other database objects, there is a possibility that a
> COPY with built-in format could work differently on v19 than v18 or
> earlier depending on the search_path value.

Thanks for sharing additional points.

David said that this additional point is the responsibility of the
DBA, not PostgreSQL, right?


As I already said, I don't have a strong opinion on which
approach is better. My opinion on the (important) second
point is no. I feel that the pros of a. aren't realistic. If
users want to improve text/csv/binary performance (or
something), they should improve PostgreSQL itself instead of
replacing it with an extension. (Or they should create another
custom copy format such as "faster_text", not "text".)


So I'm OK with the approach b.

>> Are there any missing or wrong items?
> 
> I think the approach (b) provides more flexibility than (a) in terms
> of API design as with (a) we need to do everything based on one
> handler function and callbacks.

Thanks for sharing this missing point.

I have a concern that the flexibility may introduce needless
complexity. If it's not a real concern, I'm OK with the
approach b.


>> If we can summarize
>> the current discussion here correctly, others will be able
>> to chime in this discussion. (At least I can do it.)
> 
> +1

Are there any more people who are interested in custom COPY
FORMAT implementation design? If no more people, let's
decide it by us.


Thanks,
-- 
kou




Re: Make COPY format extendable: Extract COPY TO format implementations

2025-05-25 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Fri, 9 May 2025 21:29:23 -0700,
  Masahiko Sawada  wrote:

>> > So the idea is that the backend process sets the format ID somewhere
>> > in st_progress_param, and then the progress view calls a SQL function,
>> > say pg_stat_get_copy_format_name(), with the format ID that returns
>> > the corresponding format name.
>>
>> Does it work when we use session_preload_libraries or the
>> LOAD command? If we have 2 sessions and both of them load
>> "jsonlines" COPY FORMAT extensions, what will be happened?
>>
>> For example:
>>
>> 1. Session 1: Register "jsonlines"
>> 2. Session 2: Register "jsonlines"
>>   (Should global format ID <-> format name mapping
>>   be updated?)
>> 3. Session 2: Close this session.
>>   Unregister "jsonlines".
>>   (Can we unregister COPY FORMAT extension?)
>>   (Should global format ID <-> format name mapping
>>   be updated?)
>> 4. Session 1: Close this session.
>>   Unregister "jsonlines".
>>   (Can we unregister COPY FORMAT extension?)
>>   (Should global format ID <-> format name mapping
>>   be updated?)
> 
> I imagine that only for progress reporting purposes, I think session 1
> and 2 can have different format IDs for the same 'jsonlines' if they
> load it by LOAD command. They can advertise the format IDs on the
> shmem and we can also provide a SQL function for the progress view
> that can get the format name by the format ID.
> 
> Considering the possibility that we might want to use the format ID
> also in the cumulative statistics, we might want to strictly provide
> the unique format ID for each custom format as the format IDs are
> serialized to the pgstat file. One possible way to implement it is
> that we manage the custom format IDs in a wiki page like we do for
> custom cumulative statistics and custom RMGR[1][2]. That is, a custom
> format extension registers the format name along with the format ID
> that is pre-registered in the wiki page or the format ID (e.g. 128)
> indicating under development. If either the format name or format ID
> conflict with an already registered custom format extension, the
> registration function raises an error. And we preallocate enough
> format IDs for built-in formats.
> 
> As for unregistration, I think that  even if we provide an
> unregisteration API, it ultimately depends on whether or not custom
> format extensions call it in _PG_fini().
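The registration-with-conflict-check behaviour described above might look roughly like this; the function names, return convention, and fixed-size table are assumptions for illustration, not the actual proposal:

```c
#include <assert.h>
#include <string.h>

#define MAX_FORMATS 16

typedef struct
{
    int         id;     /* pre-registered (wiki) ID, or 128 for development */
    const char *name;
} FormatEntry;

static FormatEntry registry[MAX_FORMATS];
static int nregistered = 0;

/* Register a (name, id) pair; reject if either half is already taken,
 * mirroring the "raise an error on conflict" behaviour described above.
 * Returns 0 on success, -1 on conflict (a real backend would ereport()). */
static int register_copy_format(const char *name, int id)
{
    for (int i = 0; i < nregistered; i++)
        if (registry[i].id == id || strcmp(registry[i].name, name) == 0)
            return -1;
    if (nregistered >= MAX_FORMATS)
        return -1;
    registry[nregistered].id = id;
    registry[nregistered].name = name;
    nregistered++;
    return 0;
}
```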

Thanks for sharing your idea.

With the former ID-issuing approach, it seems that we need a
global format ID <-> name mapping and a per-session
registered format name list. The custom COPY FORMAT register
function rejects the same format name, right? If we support
both shared_preload_libraries and
session_preload_libraries/LOAD, we have custom formats with
different lifetimes. That may introduce complexity with the
ID-issuing approach.

With the latter static ID approach, how do we implement a
function that converts a format ID to a format name? PostgreSQL
itself doesn't know the ID <-> name mapping in the Wiki page. It
seems that a custom COPY FORMAT implementation needs to
register its name with PostgreSQL by itself.


Thanks,
-- 
kou




Re: Non-reproducible AIO failure

2025-05-25 Thread Tom Lane
Thomas Munro  writes:
> On Sun, May 25, 2025 at 3:22 PM Tom Lane  wrote:
>> So far, I've failed to get anything useful out of core files
>> from this failure.  The trace goes back no further than
>> (lldb) bt
>> * thread #1
>> * frame #0: 0x00018de39388 libsystem_kernel.dylib`__pthread_kill + 8

> (And Alexander reported the same off-list.). It's interesting that the
> elog.c backtrace stuff is able to analyse the stack and it looks
> normal AFAICS.  Could that be interfering with the stack in the core?!

No, but something is.  Just to make sure it wasn't totally broken,
I added a sure-to-fail Assert in a random place (I chose
pg_backend_pid), and I get both a trace in the postmaster log and a
perfectly usable core file:

TRAP: failed Assert("MyProcPid == 0"), File: "pgstatfuncs.c", Line: 692, PID: 
59063
0   postgres0x0001031f1fa4 ExceptionalCondition 
+ 108
1   postgres0x0001031672b4 
pg_stat_get_backend_pid + 0
2   postgres0x000102e9e598 ExecInterpExpr + 5524
3   postgres0x000102edb100 ExecResult + 368
4   postgres0x000102ea6418 standard_ExecutorRun 
+ 316

(lldb) bt
* thread #1
  * frame #0: 0x0001836b5388 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x0001836ee88c libsystem_pthread.dylib`pthread_kill + 296
frame #2: 0x0001835f7c60 libsystem_c.dylib`abort + 124
frame #3: 0x00010491dfac 
postgres`ExceptionalCondition(conditionName=, 
fileName=, lineNumber=692) at assert.c:66:2 [opt]
frame #4: 0x00010489329c postgres`pg_backend_pid(fcinfo=) 
at pgstatfuncs.c:692:2 [opt]
frame #5: 0x0001045ca598 
postgres`ExecInterpExpr(state=0x00013780d190, econtext=0x00013780ce38, 
isnull=) at execExprInterp.c:0 [opt]
frame #6: 0x000104607100 postgres`ExecResult [inlined] 
ExecEvalExprNoReturn(state=, econtext=0x00013780ce38) at 
executor.h:417:13 [opt]
frame #7: 0x0001046070f4 postgres`ExecResult [inlined] 
ExecEvalExprNoReturnSwitchContext(state=, 
econtext=0x00013780ce38) at executor.h:458:2 [opt]

The fact that I can trace through this Assert failure but not the
AIO one strongly suggests some system-level problem in the latter.
There is something rotten in the state of Denmark.

For completeness, this is with Sequoia 15.5 (latest macOS) on
an M4 Pro MacBook.

> but I haven't seen this failure on my little M4 MacBook Air yet
> (Sequoia 15.5, Apple clang-1700.0.13.3).  It is infected with
> corporate security-ware that intercepts at least file system stuff and
> slows it down and I can't even convince it to dump core files right
> now.

As far as that goes, if you have SIP turned on (which I'm sure a
corporate laptop would), there are extra steps needed to get a
core to happen.  See [1]; that page is old, but the recipe still
works for me.

regards, tom lane

[1] 
https://nasa.github.io/trick/howto_guides/How-to-dump-core-file-on-MacOS.html




CREATE OR REPLACE FUNCTION now validates its dependents

2025-05-25 Thread jian he
hi.

The attached patch allows CREATE OR REPLACE FUNCTION to correctly update domain
constraints when the function they depend on is changed.

so this thread[1] mentioned the problem can be resolved.
for example:

create function sqlcheck(int) returns bool as 'select $1 > 0' language sql;
create domain checkedint as int check(sqlcheck(value));
select 0::checkedint;  -- fail
ERROR:  value for domain checkedint violates check constraint "checkedint_check"
create or replace function sqlcheck(int) returns bool as 'select $1 <=
0' language sql;
select 1::checkedint;  -- fail?

The last query won't fail on master; with the patch, it will fail.

I also make CREATE OR REPLACE FUNCTION validate each domain value when
a domain constraint condition is associated with the function we are
going to change.
Of course, this will make CREATE OR REPLACE FUNCTION take much longer
than it currently does.



Similar to domain constraints, the attached patch also applies to table
CHECK constraints.
Is this what we want to do?

[1]: https://postgr.es/m/12539.1544107316%40sss.pgh.pa.us
From d3356a2485143dc81e5b4d4e0311ffeaec56153c Mon Sep 17 00:00:00 2001
From: jian he 
Date: Mon, 26 May 2025 11:19:00 +0800
Subject: [PATCH v1 1/1] Ensure CREATE OR REPLACE FUNCTION validates its
 dependents

Currently, CREATE OR REPLACE FUNCTION doesn't validate if the new function
definition satisfies its dependents. This patch changes that: it will now
validate the new function definition against its dependents, raising an error
and failing the command if it's not satisfied.

demo:
create function sqlcheck(int) returns bool as 'select $1 > 0' language sql;
create domain checkedint as int check(sqlcheck(value));
select 1::checkedint; -- ok
create or replace function sqlcheck(int) returns bool as 'select $1 <= 0' language sql;

-- Currently succeeds on master, but would fail with this patch.
select 1::checkedint;

context: https://postgr.es/m/12539.1544107316%40sss.pgh.pa.us
discussion: https://postgr.es/m/
---
 doc/src/sgml/config.sgml  |  20 ++
 doc/src/sgml/ref/create_function.sgml |   9 +
 src/backend/catalog/pg_proc.c | 202 ++
 src/backend/commands/typecmds.c   |   3 +-
 src/backend/utils/cache/typcache.c|   3 +-
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/pg_dump/pg_backup_archiver.c  |   3 +
 src/include/catalog/pg_proc.h |   1 +
 src/include/commands/typecmds.h   |   1 +
 src/include/utils/guc.h   |   1 +
 src/include/utils/typcache.h  |   2 +
 .../regress/expected/create_function_sql.out  |  48 -
 src/test/regress/expected/domain.out  |  43 
 src/test/regress/sql/create_function_sql.sql  |  38 
 src/test/regress/sql/domain.sql   |  30 +++
 16 files changed, 410 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ca2a567b2b1..d81d08962f7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9895,6 +9895,26 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
  
 
+ 
+  check_function_dependencies (boolean)
+  
+   check_function_dependencies configuration parameter
+  
+  
+  
+   
+This parameter is normally on. When set to off, it
+disables validation of objects that depend on the routine during
+CREATE OR REPLACE FUNCTION. Disabling validation avoids side
+effects of the validation process, in particular validating existing
+invalidated constraints.
+Set this parameter
+to off before loading functions on behalf of other
+users; pg_dump does so automatically.
+This parameter should be false if check_function_bodies is set to false.
+   
+  
+ 
+
  
   default_transaction_isolation (enum)
   
diff --git a/doc/src/sgml/ref/create_function.sgml b/doc/src/sgml/ref/create_function.sgml
index 0d240484cd3..4c938bb391c 100644
--- a/doc/src/sgml/ref/create_function.sgml
+++ b/doc/src/sgml/ref/create_function.sgml
@@ -76,6 +76,15 @@ CREATE [ OR REPLACE ] FUNCTION
OUT parameters except by dropping the function.)
   
 
+ 
+   If any existing domains or CHECK constraints rely on the
+   current function, and check_function_bodies  is set to
+   true, then using CREATE OR REPLACE FUNCTION will
+   effectively verify whether the new function definition satisfies those
+   domains or CHECK constraints. If not, an error will be
+   raised.
+  
+
   
When CREATE OR REPLACE FUNCTION is used to replace an
existing function, the ownership and permissions of the function
diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index 5fdcf24d5f8..e2caa6bde21 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -15,17 +15,20 @@
 #include "postgres.h"
 
 

Re: CREATE OR REPLACE FUNCTION now validates its dependents

2025-05-25 Thread Tom Lane
jian he  writes:
> The attached patch allows CREATE OR REPLACE FUNCTION to correctly update 
> domain
> constraints when the function they depend on is changed.

Why is this a problem that needs solving?

Our fundamental expectation is that domain constraints are immutable
(see the Notes section under CREATE DOMAIN [1]).  If the user changes
a function used in a constraint in a way that changes its behavior,
that's their fault not our problem.  I don't think we should add a
bunch of code (that we have to maintain) and a bunch of overhead for
every CREATE OR REPLACE FUNCTION in order to slap people on the wrist
if they get this wrong.

regards, tom lane

[1] https://www.postgresql.org/docs/current/sql-createdomain.html




JIT works only partially with meson build?

2025-05-25 Thread Yugo Nagata
Hi,

While building PostgreSQL 17 on Windows, I noticed that bitcode files (.bc) are
not generated with meson. Does this mean that "inlining" for JIT doesn't
work when PostgreSQL is built with meson?

If so, is it better to describe that in the meson build and JIT sections of the
documentation, so that users and packagers would expect JIT not to work the
same way when built with meson as with autoconf/make?

Regards,
Yugo Nagata

-- 
Yugo Nagata 




Re: Custom GUCs and typos

2025-05-25 Thread David G. Johnston
On Sun, May 25, 2025 at 7:52 PM Srinath Reddy Sadipiralla <
srinath2...@gmail.com> wrote:

> 1) On top of OP's patch I added support to warn if the prefix of a custom
> GUC is invalid. For valid questions such as "How do you know that's a bogus
> prefix?  It could perfectly well be a fully valid setting for an extension
> that the installation doesn't choose to preload.", we can get the info on
> such extensions using extension_file_exists(), which tells us that an
> extension is installed but not preloaded (thanks to Greg for proposing
> this). So if the prefix is not in reserved_class_prefix and also not in the
> prefix/share/extension dir, then it is for sure invalid, so I warn and
> remove the GUC from the hashtable.
>

People can and do create GUCs without any intention or need to write C code
or place files onto the server to make use of them.

https://www.postgresql.org/docs/current/runtime-config-custom.html

"PostgreSQL will accept a setting for any two-part parameter name."

We cannot change this.  So, please include discussion of their existence in
any description of changes you wish to make in this area.

David J.


Fix a minor typo in jit/README

2025-05-25 Thread Yugo Nagata
Hi,

The attached patch fixes the following typo.

-additionally an index is over these is stored to
+additionally an index over these is stored to

Regards,
Yugo Nagata

-- 
Yugo Nagata 
diff --git a/src/backend/jit/README b/src/backend/jit/README
index 5427bdf2153..a40950dfb03 100644
--- a/src/backend/jit/README
+++ b/src/backend/jit/README
@@ -205,7 +205,7 @@ The ability to do so allows us to get the LLVM IR for all operators
 bitcode files get installed into the server's
   $pkglibdir/bitcode/postgres/
 Using existing LLVM functionality (for parallel LTO compilation),
-additionally an index is over these is stored to
+additionally an index over these is stored to
 $pkglibdir/bitcode/postgres.index.bc
 
 Similarly extensions can install code into


Re: Replication slot is not able to sync up

2025-05-25 Thread shveta malik
On Sat, May 24, 2025 at 10:37 AM Amit Kapila  wrote:
>  If the
> > problem is that we're not able to create a slot on the standby at an
> > old enough LSN or XID position to permit its use with the
> > corresponding slot on the master, it should be reported that way.
> >
>
> That is the case, and we should improve the LOG message.
>

Agree that log messages need improvement. Please find the patch
attached for the same. I also intend to update the docs in this area
for users to understand this feature better, and will work on that
soon.

thanks
Shveta


v1-0001-Improve-log-messages-in-slotsync.patch
Description: Binary data


Re: SQL:2011 application time

2025-05-25 Thread Peter Eisentraut

On 17.09.24 11:45, Peter Eisentraut wrote:

On 05.09.24 14:09, Peter Eisentraut wrote:

On 07.08.24 22:54, Paul Jungwirth wrote:
Here are some fixes based on outstanding feedback (some old some new). 


I have studied your patches v39-0001 through v39-0004, which 
correspond to what had been reverted plus the new empty range check 
plus various minor fixes.  This looks good to me now, so I propose to 
go ahead with that.


Btw., in your 0003 you point out that this prevents using the WITHOUT 
OVERLAPS functionality for non-range types.  But I think this could be 
accomplished by adding an "is empty" callback as a support function or 
something like that.  I'm not suggesting to do that here, but it might 
be worth leaving a comment about that possibility.


I have committed these, as explained here.


Here we added a gist support function that we internally refer to by the 
symbol GIST_STRATNUM_PROC.  This translated from "well-known" strategy 
numbers to opfamily-specific strategy numbers.  However, we later changed 
this to fit into the index-AM-level compare-type mapping, so this function 
now actually maps from compare types to opfamily-specific strategy numbers.  
So I'm wondering if this name is still good.


Moreover, the index AM level also supports the opposite, a function to 
map from strategy number to compare type.  This is currently not 
supported in gist, but one might wonder what this function is supposed 
to be called when it is added.


So I went through and updated the naming of the gist-level functionality 
to be more in line with the index-AM-level functionality; see attached 
patch.  I think this makes sense because these are essentially the same 
thing on different levels.  What do you think?  (This would be for PG18.)
From 5cada8bb5032e7b2ee6e218dc38bc6ff35096a23 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 26 May 2025 07:48:51 +0200
Subject: [PATCH] WIP: Rename gist stratnum support function

---
 contrib/btree_gist/btree_gist--1.7--1.8.sql  | 54 ++--
 contrib/btree_gist/btree_gist.c  |  4 +-
 contrib/btree_gist/expected/stratnum.out | 18 +++
 contrib/btree_gist/sql/stratnum.sql  |  6 +--
 doc/src/sgml/gist.sgml   | 25 ++---
 doc/src/sgml/xindex.sgml |  2 +-
 src/backend/access/gist/gistutil.c   | 14 ++---
 src/backend/access/gist/gistvalidate.c   |  6 +--
 src/include/access/gist.h|  2 +-
 src/include/catalog/pg_amproc.dat| 12 ++---
 src/include/catalog/pg_proc.dat  |  4 +-
 src/test/regress/expected/misc_functions.out | 18 +++
 src/test/regress/sql/misc_functions.sql  |  6 +--
 13 files changed, 91 insertions(+), 80 deletions(-)

diff --git a/contrib/btree_gist/btree_gist--1.7--1.8.sql b/contrib/btree_gist/btree_gist--1.7--1.8.sql
index 4ff9c43a8eb..8f79365a461 100644
--- a/contrib/btree_gist/btree_gist--1.7--1.8.sql
+++ b/contrib/btree_gist/btree_gist--1.7--1.8.sql
@@ -3,85 +3,85 @@
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "ALTER EXTENSION btree_gist UPDATE TO '1.8'" to load this file. \quit
 
-CREATE FUNCTION gist_stratnum_btree(int)
+CREATE FUNCTION gist_translate_cmptype_btree(int)
 RETURNS smallint
 AS 'MODULE_PATHNAME'
 LANGUAGE C IMMUTABLE PARALLEL SAFE STRICT;
 
 ALTER OPERATOR FAMILY gist_oid_ops USING gist ADD
-   FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ;
+   FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ;
 
 ALTER OPERATOR FAMILY gist_int2_ops USING gist ADD
-   FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ;
+   FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ;
 
 ALTER OPERATOR FAMILY gist_int4_ops USING gist ADD
-   FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ;
+   FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ;
 
 ALTER OPERATOR FAMILY gist_int8_ops USING gist ADD
-   FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ;
+   FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ;
 
 ALTER OPERATOR FAMILY gist_float4_ops USING gist ADD
-   FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ;
+   FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ;
 
 ALTER OPERATOR FAMILY gist_float8_ops USING gist ADD
-   FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ;
+   FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ;
 
 ALTER OPERATOR FAMILY gist_timestamp_ops USING gist ADD
-   FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ;
+   FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ;
 
 ALTER OPERATOR FAMILY gist_timestamptz_ops USING gist ADD
-   FUNCTION 12 ("any", "any") gist_stratnum_btree (int) ;
+   FUNCTION 12 ("any", "any") gist_translate_cmptype_btree (int) ;
 
 ALTER OPERATOR FAMILY gist_time_ops USING gist ADD
-   FUNCTION 12 ("any", "any") gist_stratnum_btree (in

Re: Conflict detection for update_deleted in logical replication

2025-05-25 Thread Dilip Kumar
On Sat, May 24, 2025 at 4:46 PM Amit Kapila  wrote:
>
> This sounds reasonable to me. Let us see what others think.
>

I was looking into the logic for getting the transaction status from the
publisher. What I would assume this patch should be doing is request the
publisher status the first time, and if some transactions are still in
commit, wait for them to complete.  But in the current design it is
possible that while we are waiting for the in-commit transactions to get
committed, an old running transaction might enter the commit phase, and
then we wait for it again. Is my understanding correct?

Maybe this is a very rare corner case: there are thousands of old running
transactions, and every time we request the status we find some
transactions in the commit phase, so the process keeps running for a long
time until all the old running transactions eventually get committed.

I am wondering whether we can make it more deterministic, such that when
we get the status the first time, if we find some transactions in the
commit phase, we just wait for those transactions to get committed.  One
idea is to get the list of xids in the commit phase; the next time we get
the list we can compare, and if none of the xids that were in the commit
phase during the previous status are still in the commit phase, then we
are done.  But I am not sure whether this is worth the complexity.  Maybe
not, but shall we add a comment explaining the case and also explaining
why this corner case is not harmful?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Hash table scans outside transactions

2025-05-25 Thread Aidar Imamov

Aidar Imamov wrote 2025-05-21 00:32:

Ashutosh Bapat wrote 2023-03-28 15:58:

Bumping it to attract some attention.

On Tue, Mar 21, 2023 at 12:51 PM Ashutosh Bapat
 wrote:


Hi,
Hash table scans (seq_scan_table/level) are cleaned up at the end of a
transaction in AtEOXact_HashTables(). If a hash seq scan continues
beyond transaction end it will meet "ERROR: no hash_seq_search scan
for hash table" in deregister_seq_scan(). That seems like a limit on
hash table usage.

Our use case is
1. Add/update/remove entries in hash table
2. Scan the existing entries and perform one transaction per entry
3. Close scan

repeat above steps in an infinite loop. Note that we do not
add/modify/delete entries in step 2. We can't use linked lists since
the entries need to be updated or deleted using hash keys. Because the
hash seq scan is cleaned up at the end of the transaction, we
encounter error in the 3rd step. I don't see that the actual hash
table scan depends upon the seq_scan_table/level[] which is cleaned up
at the end of the transaction.

I have following questions
1. Is there a way to avoid cleaning up seq_scan_table/level() when the
transaction ends?
2. Is there a way that we can use hash table implementation in
PostgreSQL code for our purpose?


--
Best Wishes,
Ashutosh Bapat


Hi!
I tried to resend this thread directly to myself, but for some reason it
ended up in the whole hackers list..

I thought I'd chime in on this topic since it hasn't really been discussed
anywhere else (or maybe I missed it).
I've attached two patches: the first one is a little test extension to
demonstrate the problem (just add "hash_test" to "shared_preload_libraries"),
and the second is a possible solution. Basically, the idea is that we don't
reset the scan counter if we find scans that started outside of the current
transaction at the end. I have to admit, though, that I can't immediately
say what other implications this might have or what else we need to watch
out for if we try this.
Maybe any thoughts on that?

regards,
Aidar Imamov


Just bumping it again

regards,
Aidar Imamov




Re: Understanding when VM record needs snapshot conflict horizon

2025-05-25 Thread Dilip Kumar
On Sat, May 24, 2025 at 2:21 AM Melanie Plageman
 wrote:
>
> On Fri, May 23, 2025 at 12:04 PM Andres Freund  wrote:

> 3) if you are updating the VM and you are not modifying the heap page
> at all, then you don't need to include a snapshot conflict horizon in
> the record because you can safely assume that a record with the
> visibility cutoff xid for that heap page as the snapshot conflict
> horizon has already been emitted. And any existing snapshots that
> would conflict with it would have conflicted with the previous record.
>
> I think 3 can only happen if something goes wrong with the VM -- like
> it is lost somehow.
>
> What I am wondering is if it is worth omitting the snapshot conflict
> horizon in the third case.
> Currently, you would emit an xl_heap_visible record with
> InvalidTransactionId as the conflict horizon in this case. But you
> aren't saving any space and it doesn't seem like you are saving any
> queries from being canceled by not doing this. It simply makes the
> logic for what to put in the WAL record more complicated.
>
IMHO, if we include the snapshot conflict horizon in cases where it is not
necessary, don't you think it will impact performance on the standby?
Because now it has to loop through the procarray on the standby to check
whether there is any conflict before applying this WAL.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: MERGE issues around inheritance

2025-05-25 Thread Tender Wang
Álvaro Herrera  wrote on Sat, May 24, 2025 at 17:11:

> On 2025-May-21, Andres Freund wrote:
>
> > Hi,
> >
> > In [1] I added some verification to projection building, to check if the
> > tupleslot passed to ExecBuildProjectionInfo() is compatible with the
> target
> > list.  One query in merge.sql [2] got flagged.
> >
> > Trying to debug that issue, I found another problem. This leaves us with:
> >
> > 1) w/ inheritance INSERTed rows are not returned by RETURNING. This
> seems to
> >just generally not work with MERGE
>
> Hmm, curious.  One thing to observe is that the original source tuple is
> in the child table, but the tuple inserted by MERGE ends up in the
> parent table.  I'm guessing that the code gets confused as to the
> relation that the tuple in the returned slot comes from, and that
> somehow makes it somehow not "see" the tuple to process for RETURNING?
> I dunno.  CC'ing Dean, who is more familiar with this code than I am.
>

In ExecMergeNotMatched(), we pass mtstate->rootResultRelInfo to
ExecInsert().  In this case, the ri_projectReturning of
mtstate->rootResultRelInfo is NULL, so in ExecInsert() the
"if (resultRelInfo->ri_projectReturning)" branch is not taken, and the
rows INSERTed via inheritance are not returned by RETURNING.

mtstate->rootResultRelInfo is assigned in ExecInitModifyTable() only here:

if (node->rootRelation > 0)
{
	Assert(bms_is_member(node->rootRelation, estate->es_unpruned_relids));
	mtstate->rootResultRelInfo = makeNode(ResultRelInfo);
	ExecInitResultRelation(estate, mtstate->rootResultRelInfo,
						   node->rootRelation);
}

Its ri_projectReturning is never assigned.

I tried passing resultRelInfo to ExecInsert(); the inherited INSERTed rows
are then returned by RETURNING, but some regression test cases failed.

-- 
Thanks,
Tender Wang


Re: Hash table scans outside transactions

2025-05-25 Thread Dilip Kumar
On Wed, May 21, 2025 at 3:02 AM Aidar Imamov  wrote:
>
> Hi!
> I tried to resend this thread directly to myself, but for some reason it
> ended up in the whole hackers list..
>
> I thought I'd chime in on this topic since it hasn't really been
> discussed
> anywhere else (or maybe I missed it).
> I've attached two patches: the first one is a little test extension to
> demonstrate the problem (just add "hash_test" to
> "shared_preload_libraries"),
> and the second is a possible solution. Basically, the idea is that we
> don't
> reset the scan counter if we find scans that started outside of the
> current
> transaction at the end. I have to admit, though, that I can't
> immediately
> say what other implications this might have or what else we need to
> watch
> out for if we try this.
> Maybe any thoughts on that?

I haven't reviewed the complete patch or tested it, but I don't see
any issues with it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: MERGE issues around inheritance

2025-05-25 Thread Tender Wang
Tender Wang  wrote on Sun, May 25, 2025 at 19:17:

>
>
> Álvaro Herrera  wrote on Sat, May 24, 2025 at 17:11:
>
>> On 2025-May-21, Andres Freund wrote:
>>
>> > Hi,
>> >
>> > In [1] I added some verification to projection building, to check if the
>> > tupleslot passed to ExecBuildProjectionInfo() is compatible with the
>> target
>> > list.  One query in merge.sql [2] got flagged.
>> >
>> > Trying to debug that issue, I found another problem. This leaves us
>> with:
>> >
>> > 1) w/ inheritance INSERTed rows are not returned by RETURNING. This
>> seems to
>> >just generally not work with MERGE
>>
>> Hmm, curious.  One thing to observe is that the original source tuple is
>> in the child table, but the tuple inserted by MERGE ends up in the
>> parent table.  I'm guessing that the code gets confused as to the
>> relation that the tuple in the returned slot comes from, and that
>> somehow makes it somehow not "see" the tuple to process for RETURNING?
>> I dunno.  CC'ing Dean, who is more familiar with this code than I am.
>>
>
>  In ExecMergeNotMatched(), we passed the mtstate->rootResultRelInfo to
> ExecInsert().
> In this case,  the ri_projectReturning of mtstate->rootResultRelInfo  is
> NULL,  in ExecInsert(),
> the "if (resultRelInfo->ri_projectReturning)" branch will not run, so
> inheritance INSERTed rows are not returned by RETURNING.
>
> The mtstate->rootResultRelInfo  assigned in ExecInitModifyTable()  is only
> here:
> if (node->rootRelation > 0)
> {
>  Assert(bms_is_member(node->rootRelation, estate->es_unpruned_relids));
>  mtstate->rootResultRelInfo = makeNode(ResultRelInfo);
>  ExecInitResultRelation(estate, mtstate->rootResultRelInfo,
>  node->rootRelation);
> }
> The ri_projectReturning is not assigned.
>
> I try to pass resultRelInfo to ExecInsert, the inherited INSERTed rows are
> returned by RETURNING.
> But some test cases in regression failed.
>

For a partitioned table, we must pass rootResultRelInfo to ExecInsert(), so
I added a check before calling ExecInsert(): if it is a partitioned table,
we continue to pass rootResultRelInfo; otherwise, we pass resultRelInfo.
Please see the attached diff file. The patch passes all regression test
cases.

-- 
Thanks,
Tender Wang
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 2bc89bf84dc..443d4523789 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -3612,6 +3612,7 @@ ExecMergeNotMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 		MergeActionState *action = (MergeActionState *) lfirst(l);
 		CmdType		commandType = action->mas_action->commandType;
 		TupleTableSlot *newslot;
+		char		relkind;
 
 		/*
 		 * Test condition, if any.
@@ -3635,9 +3636,15 @@ ExecMergeNotMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 				 */
 				newslot = ExecProject(action->mas_proj);
 				mtstate->mt_merge_action = action;
+				relkind = mtstate->rootResultRelInfo->ri_RelationDesc->rd_rel->relkind;
 
-				rslot = ExecInsert(context, mtstate->rootResultRelInfo,
+				if (relkind == RELKIND_PARTITIONED_TABLE)
+					rslot = ExecInsert(context, mtstate->rootResultRelInfo,
 								   newslot, canSetTag, NULL, NULL);
+				else
+					rslot = ExecInsert(context, resultRelInfo,
+									   newslot, canSetTag, NULL, NULL);
+
 				mtstate->mt_merge_inserted += 1;
 				break;
 			case CMD_NOTHING:


Re: MERGE issues around inheritance

2025-05-25 Thread Dean Rasheed
On Sun, 25 May 2025 at 13:06, Tender Wang  wrote:
>
> For a partitioned table, we must pass rootResultRelInfo to ExecInsert(). I 
> added the check before calling ExecInsert()
> If it is a partitioned table, we continue to pass rootResultRelInfo. 
> Otherwise, we pass resultRelInfo.
> Please see the attached diff file. The patch passed all regression test cases.
>

No, I don't think that's the right fix. I'm looking at it now, and I
think I have a fix, but it's more complicated than that. I'll post an
update later.

Regards,
Dean