Re: Conflict detection for update_deleted in logical replication

2025-07-25 Thread shveta malik
On Thu, Jul 24, 2025 at 9:12 AM shveta malik  wrote:
>
>
> 2)
> +   if (MySubscription->retaindeadtuples &&
> +   FindMostRecentlyDeletedTupleInfo(localrel, remoteslot,
> +
>   &conflicttuple.xmin,
> +
>   &conflicttuple.origin,
> +
>   &conflicttuple.ts) &&
> +   conflicttuple.origin != replorigin_session_origin)
> +   type = CT_UPDATE_DELETED;
> +   else
> +   type = CT_UPDATE_MISSING;
>
> Shall the conflict be detected as update_deleted irrespective of origin?
>

On thinking more here, I think that we may have the possibility of
UPDATE after DELETE from the same origin only when a publication
selectively publishes certain operations.

1)
Consider a publication that only publishes UPDATE and DELETE
operations. On the publisher, we may perform operations like DELETE,
INSERT, and UPDATE. On the subscriber, only DELETE and UPDATE events
are received. In this case, should we treat the incoming UPDATE as
update_deleted or update_missing?
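
For reference, such a selective publication can be created like this
(a sketch; the publication and table names are just examples):

CREATE PUBLICATION pub_upd_del FOR TABLE tab1
    WITH (publish = 'update, delete');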

2)
Another topology could be:
pub1 --> pub2 --> sub (origin=any)
pub1 --> sub

- pub1 publishing only DELETEs to pub2 and the same are published to
sub.
- pub1 publishing only UPDATEs to sub.

Now, consider that on pub1, an UPDATE occurs first, followed by a
DELETE. But on the sub, the events are received in reverse order:
DELETE arrives before UPDATE. Since both operations originated from
the same source (pub1), how should the delayed UPDATE's conflict be
interpreted? Should it be detected as update_deleted or
update_missing? Logically, since the DELETE is the more recent
operation, it should be the final one and the UPDATE should be ignored.
But if we detect it as update_missing, we might incorrectly apply the
UPDATE.
Thoughts?

thanks
Shveta




Pushing down a subquery relation's ppi_clauses, and more ...

2025-07-25 Thread Richard Guo
For a subquery relation, currently we consider pushing down its
restriction clauses to become WHERE or HAVING quals of the subquery
itself.  This transformation can potentially avoid the need to
evaluate all subquery output rows before filtering them, which may
lead to a more efficient execution plan.  As an example, please
consider:

explain (costs off)
select * from t t1 join
  (select t2.a, t2.b, sum(t2.b) from t t2 group by t2.a, t2.b) s
  on s.a = 1;
                 QUERY PLAN
---------------------------------------------
 Nested Loop
   ->  Seq Scan on t t1
   ->  Materialize
         ->  HashAggregate
               Group Key: t2.b
               ->  Seq Scan on t t2
                     Filter: (a = 1)
(7 rows)
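
Conceptually, the pushed-down qual makes the query behave as if it had
been written like this (a hand-rewritten sketch, not planner output):

select * from t t1 join
  (select t2.a, t2.b, sum(t2.b) from t t2 where t2.a = 1
   group by t2.a, t2.b) s
  on s.a = 1;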

If the subquery is LATERAL, its ppi_clauses -- that is, join clauses
available from its required outer rels -- sort of behave like
restriction clauses.  As an example, please consider:

explain (costs off)
select * from t t1 join lateral
  (select t2.a, t2.b, sum(t2.b), t1.a x from t t2 group by t2.a, t2.b) s
  on s.a = t1.a;
                 QUERY PLAN
---------------------------------------------
 Nested Loop
   ->  Seq Scan on t t1
   ->  Subquery Scan on s
         Filter: (t1.a = s.a)
         ->  HashAggregate
               Group Key: t2.a, t2.b
               ->  Seq Scan on t t2
(7 rows)

Here, I'd like to discuss whether it's worthwhile to also consider
pushing down a subquery relation's ppi_clauses if the subquery is
LATERAL.

First, it's important to note that pushing down ppi_clauses doesn't
always result in a better execution plan.  While doing so can reduce
the amount of data processed in each aggregation invocation within the
subquery, it also means that the aggregation needs to be re-evaluated
for every outer tuple.  If t1 is very small and t2 is large, pushing
down ppi_clauses can be a win.  As t1 gets larger, this gets less
attractive, and eventually it will have a higher cost than the current
plan, where the aggregation is evaluated only once.

Therefore, if we decide to pursue this approach, we would need to
generate two paths: one with the ppi_clauses pushed down, and one
without, and then compare their costs.  A potential concern is that
this might require re-planning the subquery twice, which could
increase planning overhead.

On the other hand, once this optimization is implemented, in addition
to making lateral subqueries more efficient in some cases, it will
enable us to pull up more complicated sublinks.  To be more concrete,
consider the query:

select * from foo where foo.a >
(select avg(bar.a) from bar where foo.b = bar.b);

... it can be transformed into something like:

select * from foo join
(select bar.b, avg(bar.a) as avg from bar group by bar.b) sub
on foo.b = sub.b and foo.a > sub.avg;

Currently, the planner does not perform this transformation because
it's not always a win -- if foo is very small and bar is large, the
original form tends to be more efficient.  However, if we can push
down the subquery's join clauses into the subquery, the resulting
parameterized paths will be pretty nearly performance-equivalent to
the original form.  We can then compare these parameterized paths with
the unparameterized paths that calculate the aggregation only once.
With this capability, it becomes safe to always pull up such sublinks.
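
For illustration, the parameterized form corresponds roughly to the
following hand-written LATERAL query (a sketch, re-evaluated once per
foo row; not actual planner output):

select * from foo, lateral
  (select avg(bar.a) as avg from bar where bar.b = foo.b) sub
where foo.a > sub.avg;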

In some articles, the PostgreSQL optimizer is noted to lack certain
important transformations.  For example, one recent article from
Microsoft states:

"
Comparison with Volcano/Cascades: In PostgreSQL the separation
of scan/join planning from query flattening and post scan/join phase
prevents the exploitation of important transformations such as the
reordering of GROUP BY and joins, and decorrelation of correlated
subqueries (see Sections 4.4 and 4.5). This results in a reduced search
space of plans and therefore can result in missing out plans with lower
cost.
"

I hope the transformation proposed here can help address part of the
gap related to decorrelation of correlated subqueries.

(BTW, for the "reordering of GROUP BY and joins" transformation, I
have another patch called "Eager Aggregation" that specifically
targets that optimization.)

There are still many details I haven't fully thought through.  Before
investing too much time in this idea, I'd appreciate some feedback to
make sure I'm not overlooking something obvious.

Thanks
Richard




Re: Regression with large XML data input

2025-07-25 Thread Tom Lane
Robert Treat  writes:
> On Thu, Jul 24, 2025 at 8:08 PM Michael Paquier  wrote:
>> If it were discussing things from the perspective where this new code
>> was added after a major version bump of Postgres, I would not argue
>> much about that: breakages happen every year and users adapt their
>> applications to it.  Here, however, we are talking about a change in a
>> stable branch, across a minor version, which should be a bit more
>> flawless from a user perspective?

> While I am pretty sympathetic to the idea that we hang our hats on
> "Postgres doesn't break things in minor version updates", and this
> seems to betray that, one scenario where we would break things is if
> it were the only reasonable option wrt a bug / security fix, which
> this seems potentially close to.

I'll be the first to say that I'm not too pleased with it either.
However, from Jim Jones' result upthread, a "minor update" of libxml2
could also have caused this problem: 2.9.7 and 2.9.14 behave
differently.  So we don't have sole control --- or sole responsibility
--- here.

I'd be more excited about trying to avoid the failure if I were not
afraid that "avoid the failure" really means "re-expose a security
hazard".  Why should we believe that if libxml2 throws a
resource-limit error (for identical inputs) in one code path and not
another, that's anything but a missed error check in the second path?
(Maybe this is the same thing Robert is saying, not quite sure.)

> There are a lot of public data sets that provide xml dumps as a
> generic format for "non-commercial databases", and those can often be
> quite large. I suspect we don't see those use cases a lot because
> historically users have been forced to resort to perl/python/etc
> scripts to convert the data prior to ingesting. Which is to say, I
> think these use cases are more common than we think, and if there were
> ever a stable implementation that supported these large use cases,
> we'll start to see more of it.

Yeah, it's a real shame that we don't have more-reliable
infrastructure for XML.  I'm not volunteering to fix it though...

regards, tom lane




Re: roles management problem after upgrading in PG 17

2025-07-25 Thread Fabrice Chapuis
Thank you for your reply, Robert.
My main goal was to report this observation, because I was surprised to see
that in the dump the grantors were not exported.
Thanks for drawing my attention to the "Observe the following
incompatibilities" section; it is indeed very useful.
For my part, I corrected the problem related to this perfectly justified
restriction by using this query:

SELECT 'GRANT ' || rolname || ' TO group_of_administrators WITH ADMIN
OPTION, INHERIT FALSE, SET TRUE;' as username
FROM pg_roles
where left(rolname,1) in('a','b')
and rolname not in(
'a_xx_administrators'
,'a_xx_standard_users'
,'a_xx_technical_users'
,'a_xx_owners'
...
)
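
which emits one statement per matching role, e.g. (the role name is
just an example):

GRANT a_some_user TO group_of_administrators WITH ADMIN OPTION, INHERIT FALSE, SET TRUE;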

Best regards,

Fabrice

On Thu, Jul 24, 2025 at 7:50 PM Robert Haas  wrote:

> On Thu, Jul 24, 2025 at 7:03 AM Fabrice Chapuis 
> wrote:
> > After upgrading from Postgres version 14 to Postgres version 17, I
> noticed that some "grantor" roles (admin_role) disappear when exporting all
> roles with grants
> >
> > /usr/pgsql-17/bin/pg_dumpall  --globals-only --quote-all-identifiers
> --binary-upgrade --no-sync -f /var/lib/pgsql/roles_bin.sql
> >
> > Consequently, the admin_role could not administer some roles in version
> 17 related to modifications made in version 16 concerning roles
> administration.
> >
> > A message could be displayed when executing pg_upgrade with option
> --check  that the attribute "createrole"  for a user is more restrictive
> since version 16 and anomalies may appear after upgrading in the new
> version.
> >
> > What is your opinion
>
> Well, we do have a section in the release notes that says "Observe the
> following incompatibilities" and I think that's the place that people
> should check when upgrading. In a perfect world, maybe tools like
> pg_dump and pg_upgrade could also warn you about incompatibilities
> that are likely to affect your specific installation, but that would
> be a lot of work to create and maintain and nobody's volunteered to do
> that much work yet. Of course, someone could argue that this
> particular incompatibility is so much worse than any of the others
> that it deserves to be called out specifically even if we don't do
> that in other cases, but I am a bit skeptical of that argument. I
> think it would lead to people being frightened of this particular
> change as compared with anything else, and I'm not sure that's a
> justified reaction. If anything, I think people who are using
> CREATEROLE with a non-SUPERUSER account should probably be very afraid
> of staying on OLD releases, where the behavior is extremely insecure
> and such users can just go take over the superuser account whenever
> they like.
>
> Of course, you or others might see it differently.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


Re: Allow ON CONFLICT DO UPDATE to return EXCLUDED values

2025-07-25 Thread Dean Rasheed
Rebased version attached, following 904f6a593a0.

Regards,
Dean
From a47423cc29abc8dc197fb895b55aace66d9da885 Mon Sep 17 00:00:00 2001
From: Dean Rasheed 
Date: Fri, 25 Jul 2025 08:57:03 +0100
Subject: [PATCH v4] Allow EXCLUDED in RETURNING list of INSERT ON CONFLICT DO
 UPDATE.

This allows an INSERT ... ON CONFLICT DO UPDATE command to return the
excluded values, when conflicts occur. For rows that do not conflict,
the excluded values returned are NULL.

This builds on the infrastructure added to support OLD/NEW values in
the RETURNING list, except that in the case of EXCLUDED, there is no
need for the special syntax to change the name of the excluded
pseudo-relation in the RETURNING list. (That was required for OLD/NEW,
because those names were already used in some situations, such as
trigger functions, but there is no such problem with excluded.)

Discussion: https://postgr.es/m/CAEZATCXXu2ohYmn=4yrrqa9ynwd_fdeeotbpgm_5jhqofca...@mail.gmail.com
---
 doc/src/sgml/dml.sgml | 13 +++
 doc/src/sgml/ref/insert.sgml  | 42 ++---
 src/backend/executor/execExpr.c   | 44 +++--
 src/backend/executor/execExprInterp.c | 21 -
 src/backend/executor/execPartition.c  | 14 +++
 src/backend/executor/nodeModifyTable.c| 51 +++---
 src/backend/optimizer/path/allpaths.c |  3 +-
 src/backend/optimizer/plan/setrefs.c  | 77 ++--
 src/backend/optimizer/prep/prepjointree.c | 22 +++--
 src/backend/optimizer/prep/preptlist.c|  9 +-
 src/backend/optimizer/util/clauses.c  |  2 +-
 src/backend/optimizer/util/var.c  | 11 ++-
 src/backend/parser/analyze.c  | 17 +++-
 src/backend/parser/parse_expr.c   |  2 +-
 src/backend/parser/parse_relation.c   |  4 +-
 src/backend/rewrite/rewriteHandler.c  | 42 +
 src/backend/rewrite/rewriteManip.c| 69 --
 src/include/executor/execExpr.h   |  7 +-
 src/include/nodes/execnodes.h |  2 +
 src/include/nodes/primnodes.h | 26 --
 src/include/parser/parse_node.h   |  7 +-
 src/include/rewrite/rewriteManip.h|  4 +-
 src/test/regress/expected/arrays.out  | 26 --
 .../regress/expected/generated_stored.out | 48 ++
 .../regress/expected/generated_virtual.out| 48 ++
 src/test/regress/expected/inherit.out |  9 +-
 src/test/regress/expected/insert_conflict.out | 92 +--
 src/test/regress/expected/returning.out   | 22 +++--
 src/test/regress/expected/rules.out   | 61 
 src/test/regress/expected/triggers.out| 74 ---
 src/test/regress/expected/updatable_views.out | 27 +-
 src/test/regress/sql/arrays.sql   |  9 +-
 src/test/regress/sql/generated_stored.sql |  4 +
 src/test/regress/sql/generated_virtual.sql|  4 +
 src/test/regress/sql/inherit.sql  |  3 +-
 src/test/regress/sql/insert_conflict.sql  | 29 +++---
 src/test/regress/sql/returning.sql|  8 +-
 src/test/regress/sql/rules.sql| 27 ++
 src/test/regress/sql/triggers.sql | 17 ++--
 src/test/regress/sql/updatable_views.sql  | 12 ++-
 src/tools/pgindent/typedefs.list  |  1 +
 41 files changed, 741 insertions(+), 269 deletions(-)

diff --git a/doc/src/sgml/dml.sgml b/doc/src/sgml/dml.sgml
index 458aee788b7..4712a49071b 100644
--- a/doc/src/sgml/dml.sgml
+++ b/doc/src/sgml/dml.sgml
@@ -392,6 +392,19 @@ UPDATE products SET price = price * 1.10
the new values may be non-NULL.
   
 
+  
+   In an INSERT with an ON CONFLICT DO UPDATE
+   clause, it is also possible to return the values excluded, if there is a
+   conflict.  For example:
+
+INSERT INTO products AS p SELECT * FROM new_products
+  ON CONFLICT (product_no) DO UPDATE SET name = excluded.name
+  RETURNING p.product_no, p.name, excluded.price;
+
+   Excluded values will be NULL for rows that do not
+   conflict.
+  
+
   
If there are triggers () on the target table,
the data available to RETURNING is the row as modified by
diff --git a/doc/src/sgml/ref/insert.sgml b/doc/src/sgml/ref/insert.sgml
index 3f139917790..d1c056296a1 100644
--- a/doc/src/sgml/ref/insert.sgml
+++ b/doc/src/sgml/ref/insert.sgml
@@ -183,7 +183,7 @@ INSERT INTO table_name [ AS ON CONFLICT DO UPDATE
 targets a table named excluded, since that will otherwise
 be taken as the name of the special table representing the row proposed
-for insertion.
+for insertion (see note below).

   
  
@@ -343,6 +343,35 @@ INSERT INTO table_name [ AS ON CONFLICT DO UPDATE clause, the old
 values may be non-NULL.

+
+   
+If the INSERT has an ON CONFLICT DO
+UPDATE clause, a column name or * may be
+qualified using EXCLUDED to return the values
+excluded from insertion.

Re: Retail DDL

2025-07-25 Thread Hannu Krosing
I have been thinking of this from a little different direction. We
already have all the needed functionality in pg_dump, so why not just
have an option to do

CREATE EXTENSION pg_dump;

Which would wrap and expose whatever the current version of pg_dump is doing.

It still would need to resolve the difficult question of naming
things, but otherwise it looks like just a certain amount of
mechanical work.

We could even have just one set of functions with a few possible argument types

pg_dump(object oid, options text);
pg_dump(object text, options text);
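
Usage could then look something like this (entirely hypothetical, since
the extension does not exist):

SELECT pg_dump('public.my_table'::regclass::oid, '--schema-only');
SELECT pg_dump('public.my_table', '--schema-only');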

---
Hannu

On Fri, Jul 25, 2025 at 10:35 AM Álvaro Herrera  wrote:
>
> On 2025-Jul-24, Andrew Dunstan wrote:
>
> > Obviously we already have some functions for things like views and
> > triggers, but most notably we don't have one for tables, something users
> > have long complained about. I have been trying to think of a reasonable
> > interface for a single function, where we would pass in, say, a catalog oid
> > plus an object oid, and maybe some optional extra arguments.
>
> Reproducing a table might need multiple commands.  Do you intend to
> return a single string containing multiple semicolon-separated commands,
> or are you thinking in a RETURNS SETOF where each row contains a single
> command?
>
> What about schema-qualification needed for elements in the commands?  We
> have the option to schema-qualify everything, _or_ to depend on whether
> the schemas are in search_path, _or_ to schema-qualify nothing (which
> gives the user the chance to recreate in any schema by changing
> search_path).
>
>
> > That seems a bit fragile, though. The alternative is that we have a
> > separate function for each object type, e.g. pg_get_{objecttype}_ddl.
> > I'm kinda leaning that way, but I'd like some sort of consensus before
> > any work gets done.
>
> It looks like the discussion is leaning this way too.  I think it's
> a reasonable choice.
>
> --
> Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
>
>




Re: More protocol.h replacements this time into walsender.c

2025-07-25 Thread Dave Cramer
Dave Cramer


On Fri, 25 Jul 2025 at 04:11, Álvaro Herrera  wrote:

> On 2025-Jul-24, Dave Cramer wrote:
>
> > On Thu, 24 Jul 2025 at 17:05, Jacob Champion <
> > jacob.champ...@enterprisedb.com> wrote:
> >
> > > On Thu, Jul 24, 2025 at 12:04 PM Dave Cramer 
> wrote:
>
> > > +/* Replication Protocol sent by the primary */
> > > +
> > > +#define PqMsg_XlogData  'w'
> > > +#define PqMsg_PrimaryKeepAlive  'k'
> > > +#define PqMsg_PrimaryStatusUpdate   's'
> > > +
> > > +
> > > +/* Replication Protocol sent by the standby */
> > > +
> > > +#define PqMsg_StandbyStatus 'r'
> > > +#define PqMsg_HotStandbyFeedback'h'
> > > +#define PqMsg_RequestPrimaryStatus  'p'
> > >
> > > Since these are part of the replication subprotocol (i.e. tunneled,
> > > via CopyData) rather than the top-level wire protocol, do they deserve
> > > their own prefix? PqReplMsg_* maybe?
> >
> > I'm going to wait to see if there are any other opinions. Last time I did
> > this there were quite a few opinions before finally settling on the
> naming
>
> Count me in.
>

FYI, the reason I used XLogData is because the term is used multiple times
here https://www.postgresql.org/docs/devel/protocol-replication.html

Dave


Re: More protocol.h replacements this time into walsender.c

2025-07-25 Thread Álvaro Herrera
On 2025-Jul-25, Dave Cramer wrote:

> FYI, the reason I used XLogData is because the term is used multiple times
> here https://www.postgresql.org/docs/devel/protocol-replication.html

Yeah, we could rename it, as in the attached.  It doesn't harm anything.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
>From de850f70b793f27c0aaa5a1f8589d44450691767 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= 
Date: Fri, 25 Jul 2025 12:12:03 +0200
Subject: [PATCH] rename XLogData to WALData

---
 doc/src/sgml/protocol.sgml |  8 
 src/bin/pg_basebackup/pg_recvlogical.c |  4 ++--
 src/bin/pg_basebackup/receivelog.c | 18 +-
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index b115884acb3..eadac60224e 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2550,8 +2550,8 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
   
 
   
-   
-XLogData (B)
+   
+WALData (B)
 
  
   
@@ -2599,11 +2599,11 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
 
 
 
- A single WAL record is never split across two XLogData messages.
+ A single WAL record is never split across two WALData messages.
  When a WAL record crosses a WAL page boundary, and is therefore
  already split using continuation records, it can be split at the page
  boundary. In other words, the first main WAL record and its
- continuation records can be sent in different XLogData messages.
+ continuation records can be sent in different WALData messages.
 

   
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 8a5dd24e6c9..0e9d2e23947 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -517,7 +517,7 @@ StreamLogicalLog(void)
 		}
 
 		/*
-		 * Read the header of the XLogData message, enclosed in the CopyData
+		 * Read the header of the WALData message, enclosed in the CopyData
 		 * message. We only need the WAL location field (dataStart), the rest
 		 * of the header is ignored.
 		 */
@@ -605,7 +605,7 @@ StreamLogicalLog(void)
 		/*
 		 * We're doing a client-initiated clean exit and have sent CopyDone to
 		 * the server. Drain any messages, so we don't miss a last-minute
-		 * ErrorResponse. The walsender stops generating XLogData records once
+		 * ErrorResponse. The walsender stops generating WALData records once
 		 * it sees CopyDone, so expect this to finish quickly. After CopyDone,
 		 * it's too late for sendFeedback(), even if this were to take a long
 		 * time. Hence, use synchronous-mode PQgetCopyData().
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index d6b7f117fa3..f2b54d3c501 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -38,8 +38,8 @@ static int	CopyStreamReceive(PGconn *conn, long timeout, pgsocket stop_socket,
 			  char **buffer);
 static bool ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf,
 int len, XLogRecPtr blockpos, TimestampTz *last_status);
-static bool ProcessXLogDataMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
-			   XLogRecPtr *blockpos);
+static bool ProcessWALDataMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
+			  XLogRecPtr *blockpos);
 static PGresult *HandleEndOfCopyStream(PGconn *conn, StreamCtl *stream, char *copybuf,
 	   XLogRecPtr blockpos, XLogRecPtr *stoppos);
 static bool CheckCopyStreamStop(PGconn *conn, StreamCtl *stream, XLogRecPtr blockpos);
@@ -831,7 +831,7 @@ HandleCopyStream(PGconn *conn, StreamCtl *stream,
 			}
 			else if (copybuf[0] == 'w')
 			{
-if (!ProcessXLogDataMsg(conn, stream, copybuf, r, &blockpos))
+if (!ProcessWALDataMsg(conn, stream, copybuf, r, &blockpos))
 	goto error;
 
 /*
@@ -1041,11 +1041,11 @@ ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
 }
 
 /*
- * Process XLogData message.
+ * Process WALData message.
  */
 static bool
-ProcessXLogDataMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
-   XLogRecPtr *blockpos)
+ProcessWALDataMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
+  XLogRecPtr *blockpos)
 {
 	int			xlogoff;
 	int			bytes_left;
@@ -1054,13 +1054,13 @@ ProcessXLogDataMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
 
 	/*
 	 * Once we've decided we don't want to receive any more, just ignore any
-	 * subsequent XLogData messages.
+	 * subsequent WALData messages.
 	 */
 	if (!(still_sending))
 		return true;
 
 	/*
-	 * Read the header of the XLogData message, enclosed in the CopyData
+	 * Read the header of the WALData message, enclosed in the CopyData

Re: trivial grammar refactor

2025-07-25 Thread Álvaro Herrera
On 2025-Jul-24, Nathan Bossart wrote:

> On Thu, Jul 24, 2025 at 11:54:10AM +0200, Álvaro Herrera wrote:
> > Yeah, thanks for taking a look.  That duplication is just me being dumb.
> > Here's a version without that.  The only thing that needed to change was
> > changing "CLUSTER opt_verbose" to "CLUSTER VERBOSE" so that the
> > unadorned CLUSTER is handled by "CLUSTER opt_utility_option_list"
> > instead.
> 
> I think we can do something similar for ANALYZE.  But AFAICT you're right
> that we can't use it for VACUUM and EXPLAIN, at least not easily.

Thank you!  Pushed with that change.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Adding basic NUMA awareness

2025-07-25 Thread Jakub Wartak
On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra  wrote:
>
> On 7/4/25 20:12, Tomas Vondra wrote:
> > On 7/4/25 13:05, Jakub Wartak wrote:
> >> ...
> >>
> >> 8. v1-0005 2x + /* if (numa_procs_interleave) */
> >>
> >>Ha! it's a TRAP! I've uncommented it because I wanted to try it out
> >> without it (just by setting GUC off) , but "MyProc->sema" is NULL :
> >>
> >> 2025-07-04 12:31:08.103 CEST [28754] LOG:  starting PostgreSQL
> >> 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
> >> [..]
> >> 2025-07-04 12:31:08.109 CEST [28754] LOG:  io worker (PID 28755)
> >> was terminated by signal 11: Segmentation fault
> >> 2025-07-04 12:31:08.109 CEST [28754] LOG:  terminating any other
> >> active server processes
> >> 2025-07-04 12:31:08.114 CEST [28754] LOG:  shutting down because
> >> "restart_after_crash" is off
> >> 2025-07-04 12:31:08.116 CEST [28754] LOG:  database system is shut down
> >>
> >> [New LWP 28755]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library 
> >> "/lib/x86_64-linux-gnu/libthread_db.so.1".
> >> Core was generated by `postgres: io worker '.
> >> Program terminated with signal SIGSEGV, Segmentation fault.
> >> #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
> >> at ./nptl/sem_waitcommon.c:136
> >> 136 ./nptl/sem_waitcommon.c: No such file or directory.
> >> (gdb) where
> >> #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
> >> at ./nptl/sem_waitcommon.c:136
> >> #1  __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
> >> #2  0x5561918e0cac in PGSemaphoreReset (sema=0x0) at
> >> ../src/backend/port/posix_sema.c:302
> >> #3  0x556191970553 in InitAuxiliaryProcess () at
> >> ../src/backend/storage/lmgr/proc.c:992
> >> #4  0x5561918e51a2 in AuxiliaryProcessMainCommon () at
> >> ../src/backend/postmaster/auxprocess.c:65
> >> #5  0x556191940676 in IoWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at
> >> ../src/backend/storage/aio/method_worker.c:393
> >> #6  0x5561918e8163 in postmaster_child_launch
> >> (child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
> >> startup_data=startup_data@entry=0x0,
> >> startup_data_len=startup_data_len@entry=0,
> >> client_sock=client_sock@entry=0x0) at
> >> ../src/backend/postmaster/launch_backend.c:290
> >> #7  0x5561918ea09a in StartChildProcess
> >> (type=type@entry=B_IO_WORKER) at
> >> ../src/backend/postmaster/postmaster.c:3973
> >> #8  0x5561918ea308 in maybe_adjust_io_workers () at
> >> ../src/backend/postmaster/postmaster.c:4404
> >> [..]
> >> (gdb) print *MyProc->sem
> >> Cannot access memory at address 0x0
> >>
> >
> > Yeah, good catch. I'll look into that next week.
> >
>
> I've been unable to reproduce this issue, but I'm not sure what settings
> you actually used for this instance. Can you give me more details how to
> reproduce this?

Better late than never; feel free to partially ignore me, as I missed
that this is a known issue per the FIXME there. But I would just rip
out that commented-out `if (numa_procs_interleave)` from
FastPathLockShmemSize() and PGProcShmemSize(), unless you want to save
those memory pages of course (in the no-NUMA case). If you do want to
save those pages, I think we have a problem:

For complete picture, steps:

1. patch -p1 < v2-0001-NUMA-interleaving-buffers.patch
2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch

BTW the pgbench accidental indent is still there (part of the v2-0001 patch):
14 out of 14 hunks FAILED -- saving rejects to file
src/bin/pgbench/pgbench.c.rej

3. As I'm just applying 0001 and 0006, I've got two simple rejects,
but fixed them (due to not applying the missing numa_ freelist
patches). That's intentional on my part, because I wanted to play
just with those two.

4. Then I uncommented those two "if (numa_procs_interleave)" checks
related to optional shm memory initialization - add_size() and so on
(the ones with the XXX comment above saying they cause bootstrap
issues)

5. initdb with numa_procs_interleave=on, huge_pages = on (!), start, it is ok

6. restart with numa_procs_interleave=off, which gets me every bg
worker crashing, e.g.:

(gdb) where
#0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at
./nptl/sem_waitcommon.c:136
#1  __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2  0x563e2d6e4d5c in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
#3  0x563e2d774d93 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:995
#4  0x563e2d6e9252 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
#5  0x563e2d6eb683 in CheckpointerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at
../src/backend/postmaster/checkpointer.c:190
#6  0x563e2d6ec363 in postmaster_child_launch
(child_type=child_type@entry=B_CHECKPOINTER, child_slot=249,
startup_data=startup_data@entry=

Re: restore_command return code behaviour

2025-07-25 Thread Jean-Christophe Arnu
On fri. 25 jul. 2025,  00:01, Jacob Champion <
jacob.champ...@enterprisedb.com> wrote :

>
> > If you agree, we can suggest a patch for the documentation.
>
> If you've tripped over it, I think we should consider a docs patch.
>

We've spent time on it, guessing it was a return code-driven behaviour,
analysing the code, and finally using a debugger to confirm it. If we can
avoid that round trip (and any erroneous analyses stating "it does not
work") for other users, that would be ideal.


>
> To confirm -- you're happy with the behavior as-is?


I'm completely fine with it.
A restore_command returning over 128 can be wrapped in a shell script
to catch the given rc, if server shutdown is not the desired behaviour.
I agree that the docs should be as clear as possible about this
behaviour (which may also be seen as an API that should not be changed).
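
A minimal sketch of such a wrapper (the paths are just examples; the
point is only the remapping of exit codes above 128):

restore_command = '/usr/local/bin/restore_wrapper.sh "%f" "%p"'

#!/bin/sh
# restore_wrapper.sh: run the real copy, then remap rc > 128 (command
# terminated by a signal) to a plain failure, so the server treats it
# as "file not available" instead of shutting down.
cp /mnt/archive/"$1" "$2"
rc=$?
if [ "$rc" -gt 128 ]; then
    exit 1
fi
exit "$rc"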

Here are the places where I think the details should be added:
- GUC documentation [1]
- Backup and Restore [2]

The other mention of restore_command does not involve (or require) return
code details.
If there are no objections, I'll start writing a patch proposal on Monday.


[1]
https://www.postgresql.org/docs/18/runtime-config-wal.html#GUC-RESTORE-COMMAND
: file doc/src/sgml/config.sgml
[2]
https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-PITR-RECOVERY
file: doc/src/sgml/backup.sgml


RE: 024_add_drop_pub.pl might fail due to deadlock

2025-07-25 Thread Hayato Kuroda (Fujitsu)
Dear Ajin,

Thanks for the patches! While checking, I recalled the backpatch
policy [1]. We must not modify the signatures of exposed functions on
the back branches, but this patch does. Can you define another
function, like UpdateSubscriptionRelStateEx or something, on the back
branches?

Another comment:
```
@@ -328,9 +328,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char 
state,
Datum   values[Natts_pg_subscription_rel];
boolreplaces[Natts_pg_subscription_rel];
 
-   LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
-   rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+   if (already_locked)
+   rel = table_open(SubscriptionRelRelationId, NoLock);
```

Can we assert that RowExclusiveLock for pg_subscription_rel has already been
acquired, by using CheckRelationOidLockedByMe() family?
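
Something like this sketch, assuming the NoLock path is taken only when
the caller already holds the lock:

+   if (already_locked)
+   {
+       Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+                                         RowExclusiveLock, true));
+       rel = table_open(SubscriptionRelRelationId, NoLock);
+   }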

Also, I'm now a bit confused about which branches are really affected.
Can you create patches for all the branches, attach them to the same
e-mail, and add a summary of what you want to fix?
E.g., you provided a patch for HEAD/PG15/PG14, what about PG18, 17, 16 and 13?
If not needed, why?

[1]: 
https://www.postgresql.org/docs/devel/xfunc-c.html#XFUNC-API-ABI-STABILITY-GUIDANCE

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Conflict detection for update_deleted in logical replication

2025-07-25 Thread Amit Kapila
On Fri, Jul 25, 2025 at 12:37 PM shveta malik  wrote:
>
> On Thu, Jul 24, 2025 at 9:12 AM shveta malik  wrote:
> >
> >
> > 2)
> > +   if (MySubscription->retaindeadtuples &&
> > +   FindMostRecentlyDeletedTupleInfo(localrel, 
> > remoteslot,
> > +
> >   &conflicttuple.xmin,
> > +
> >   &conflicttuple.origin,
> > +
> >   &conflicttuple.ts) &&
> > +   conflicttuple.origin != replorigin_session_origin)
> > +   type = CT_UPDATE_DELETED;
> > +   else
> > +   type = CT_UPDATE_MISSING;
> >
> > Shall the conflict be detected as update_deleted irrespective of origin?
> >
>
> On thinking more here, I think that we may have the possibility of
> UPDATE after DELETE from the same origin only when a publication
> selectively publishes certain operations.
>
> 1)
> Consider a publication that only publishes UPDATE and DELETE
> operations. On the publisher, we may perform operations like DELETE,
> INSERT, and UPDATE. On the subscriber, only DELETE and UPDATE events
> are received. In this case, should we treat the incoming UPDATE as
> update_deleted or update_missing?
>

If the user is subscribing only to certain operations like Update or
Delete, she may not be interested in eventual consistency, as some of
the data may not be replicated, so conflict detection followed by any
resolution may not be helpful.

The other point is that if we report update_deleted in such cases, it
won't be reliable: sometimes it can be update_missing, as vacuum would
have removed the row. OTOH, if we report update_missing, it will
always be the same conflict, and we can document it.

-- 
With Regards,
Amit Kapila.




Re: Logical replication launcher did not automatically restart when got SIGKILL

2025-07-25 Thread Fujii Masao
On Fri, Jul 25, 2025 at 5:25 PM shveta malik  wrote:
>
> On Fri, Jul 25, 2025 at 7:17 AM Fujii Masao  wrote:
> >
> > On Thu, Jul 24, 2025 at 6:46 PM shveta malik  wrote:
> > > Sounds reasonable.
> > > Thinking out loud, when cleaning up after a backend or background
> > > worker crash, process_pm_child_exit() is invoked, which subsequently
> > > calls both CleanupBackend() and HandleChildCrash(). After the cleanup
> > > completes, process_pm_child_exit() calls PostmasterStateMachine() to
> > > move to the next state. As part of that, PostmasterStateMachine()
> > > invokes ResetBackgroundWorkerCrashTimes() (only in crash
> > > scenarios/FatalError), to reset a few things. Since it also resets
> > > rw_worker.bgw_notify_pid, it seems reasonable to reset the rw_pid as
> > > well there.
> >
> > Thanks!
> > Attached is a patch that fixes the issue by resetting rw_pid in
> > ResetBackgroundWorkerCrashTimes().
> >
>
> The patch LGTM.

Thanks to both ChangAo and Shveta for the review!
I've pushed the patch and back-patched it to v18.


> > We should probably add a regression test for this case,
> > but I'd prefer to commit the fix first and work on the test separately.
> > Andrey Rudometov proposed a test patch in thread [1],
> > which we might use as a starting point.
>
> Sounds good.

This proposed patch adds a new regression test file to verify background
worker restarts when restart_after_crash is enabled. However, since
we already have 013_crash_restart.pl for testing that scenario,
I’m thinking it might be sufficient to check that the logical replication
launcher, i.e., the background worker, restarts properly there.

For example, we could add a check like the following to 013_crash_restart.pl.
Thoughts?

-
is($node->safe_psql('postgres',
    "SELECT count(*) = 1 FROM pg_stat_activity
     WHERE backend_type = 'logical replication launcher'"),
   't',
   'logical replication launcher is running after crash');
-

Regards,

-- 
Fujii Masao




Re: Conflict detection for update_deleted in logical replication

2025-07-25 Thread Nisha Moond
Hi All,

We conducted performance testing of a bi-directional logical
replication setup, focusing on the primary use case of the
update_deleted feature.
To simulate a realistic scenario, we used a high workload with limited
concurrent updates, and well-distributed writes among servers.

Used source
===
pgHead commit 62a17a92833 + v47 patch set

Machine details
===
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz CPU(s) :88 cores, - 503 GiB RAM

Test-1: Distributed Write Load
==
Highlight:
---
 - In a bi-directional logical replication setup, with
well-distributed write workloads and a thoughtfully tuned
configuration to minimize lag (e.g., through row filters), TPS
regression is minimal or even negligible.
 - Performance can be sustained with significantly fewer apply workers
compared to the number of client connections on the publisher.

Setup:

 - 2 Nodes(node1 and node2) are created(on same machine) of same
configurations -
autovacuum = false
shared_buffers = '30GB'
-- Also, worker and logical replication related parameters were
increased as per requirement (see attached scripts for details).
 - Both nodes have two set of pgbench tables initiated with *scale=300*:
   -- set1: pgbench_pub_accounts, pgbench_pub_tellers,
pgbench_pub_branches, and pgbench_pub_history
   -- set2: pgbench_accounts, pgbench_tellers, pgbench_branches, and
pgbench_history
 - Node1 is publishing all changes for set1 tables and Node2 has
subscribed for the same.
 - Node2 is publishing all changes for set2 tables and Node1 has
subscribed for the same.
Note: In all the tests, subscriptions are created with (origin=NONE)
as it is a bi-directional replication.

Workload Run:
---
 - On node1, pgbench(read-write) with option "-b simple-update" is run
on set1 tables.
 - On node2, pgbench(read-write) with option "-b simple-update" is run
on set2 tables.
 - #clients = 40
 - pgbench run duration = 10 minutes.
 - results were measured for 3 runs of each case.

Test Runs:
- Six tests were done with varying #pub-sub pairs and below is TPS
reduction in both nodes for all the cases:

| Case | # Pub-Sub Pairs | TPS Reduction  |
|  | --- | -- |
| 01   | 30  | 0–1%   |
| 02   | 15  | 6–7%   |
| 03   | 5   | 7–8%   |
| 04   | 3   | 0-1%   |
| 05   | 2   | 14–15% |
| 06   | 1 (no filters)  | 37–40% |

 - With appropriate row filters and distribution of load across apply
workers, the performance impact of update_deleted patch can be
minimized.
 - Just 3 pub-sub pairs are enough to keep TPS close to the baseline
for the given workload.
 - Poor distribution of replication workload (e.g., only 1–2 pub-sub
pairs) leads to higher overhead due to increased apply worker
contention.


Detailed results for all the above cases:

case-01:
-
 - Created 30 pub-sub pairs to distribute the replication load between
30 apply workers on each node.

Results:
#run   pgHead_Node1_TPS   patched_Node1_TPS   pgHead_Node2_TPS   patched_Node2_TPS
1   5633.377165   5579.244492   6385.839585   6482.775975
2   5926.328644   5947.035275   6216.045707   6416.113723
3   5522.804663   5542.380108   6541.031535   6190.123097
median   5633.377165   5579.244492   6385.839585   6416.113723
regression  -1%   0%

 - No regression


case-02:
-
 - #pub-sub pairs = 15

Results:
#run   pgHead_Node1_TPS   patched_Node1_TPS   pgHead_Node2_TPS   patched_Node2_TPS
1   8207.708475   7584.288026   8854.017934   8204.301497
2   8120.979334   7404.735801   8719.451895   8169.697482
3   7877.859139   7536.762733   8542.896669   8177.853563
median   8120.979334   7536.762733   8719.451895   8177.853563
regression   -7%   -6%

 - There was a 6-7% TPS reduction on both nodes, which seems to be in an acceptable range.
~~~

case-03:
-
 - #pub-sub pairs = 5

Results:
#run   pgHead_Node1_TPS   patched_Node1_TPS   pgHead_Node2_TPS   patched_Node2_TPS
1   12325.90315   11664.7445   12997.47104   12324.025
2   12060.38753   11370.52775   12728.41287   12127.61208
3   12390.3677   11367.10255   13135.02558   12036.71502
median   12325.90315   11370.52775   12997.47104   12127.61208
regression   -8%   -7%

 - There was a 7-8% TPS reduction on both nodes, which seems to be in an acceptable range.
~~~

case-04:
-
 -  #pub-sub pairs = 3

Results:
#run   pgHead_Node1_TPS   patched_Node1_TPS   pgHead_Node2_TPS   patched_Node2_TPS
1   13186.22898   12464.42604   13973.8394   13370.45596
2   13038.15817   13014.03906   13866.51966   13866.47395
3   13881.10513   13868.71971   14687.67444   14516.33854
median   13186.22898   13014.03906   13973.8394   13866.47395
regression   -1%   -1%

 - No regression observed


case-05:
-
 -  #pub-sub pairs = 2

Results:
#run   pgHead_Node1_TPS   patched_Node1_TPS   pgHead_Node2_TPS   patched_Node2_TPS
1   15936.98792   13563.98476   16734.35292   14527.229

Re: More protocol.h replacements this time into walsender.c

2025-07-25 Thread Álvaro Herrera
On 2025-Jul-24, Dave Cramer wrote:

> On Thu, 24 Jul 2025 at 17:05, Jacob Champion <
> jacob.champ...@enterprisedb.com> wrote:
> 
> > On Thu, Jul 24, 2025 at 12:04 PM Dave Cramer  wrote:

> > +/* Replication Protocol sent by the primary */
> > +
> > +#define PqMsg_XlogData  'w'
> > +#define PqMsg_PrimaryKeepAlive  'k'
> > +#define PqMsg_PrimaryStatusUpdate   's'
> > +
> > +
> > +/* Replication Protocol sent by the standby */
> > +
> > +#define PqMsg_StandbyStatus 'r'
> > +#define PqMsg_HotStandbyFeedback'h'
> > +#define PqMsg_RequestPrimaryStatus  'p'
> >
> > Since these are part of the replication subprotocol (i.e. tunneled,
> > via CopyData) rather than the top-level wire protocol, do they deserve
> > their own prefix? PqReplMsg_* maybe?
>
> I'm going to wait to see if there are any other opinions. Last time I did
> this there were quite a few opinions before finally settling on the naming

Count me in.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"The Postgresql hackers have what I call a "NASA space shot" mentality.
 Quite refreshing in a world of "weekend drag racer" developers."
(Scott Marlowe)




Re: 024_add_drop_pub.pl might fail due to deadlock

2025-07-25 Thread vignesh C
On Thu, 24 Jul 2025 at 17:45, Ajin Cherian  wrote:
>
> On Wed, Jul 23, 2025 at 8:01 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > > Dear Ajin,
> > >
> > > Thanks for the patch. Firstly let me confirm my understanding. While 
> > > altering the
> > > subscription, locks are acquired with below ordering:
> > >
> >
> > I forgot to confirm one point. For which branch should be backpatch? 
> > Initially
> > it was reported only on PG15 [1], but I found 021_alter_sub_pub could fail 
> > on PG14.
>
> Yes, here's a patch for PG14 as well, based on REL_14_STABLE.
>

I believe the patch is trying to address the following issues reported:
1) 024_add_drop_pub.pl test failure reported on REL_16_STABLE at [1]
2) Different variation of the above issue on head with the script
attached at [2]
3) Amit reported different variant of it for PG15 with the patch at [3]

I felt these issues are not applicable to the PG13 branch as
Replication origin creation for table sync is not there in the PG13
branch. So the fix is required from master to PG14 branches.

The patch does not apply on the PG16 branch.

In PG15 you have the following code:
+   /* Close table if opened */
+   if (rel)
+   {
+   table_close(rel, NoLock);
+   }

In master branch you have the following code:
+   /* Close table if opened */
+   if (rel)
+   table_close(rel, NoLock);
+
+

We can keep the fix consistent in both cases; the additional newlines
are not required in the master branch.

[1] - 
https://www.postgresql.org/message-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297%40gmail.com
[2] - 
https://www.postgresql.org/message-id/CALDaNm3PrTkVc2uxMyQTkqw0sg7O6i0EXe1jJo9CzOyW2gFS%2BQ%40mail.gmail.com
[3] - 
https://www.postgresql.org/message-id/CAA4eK1KPa1dJrcd%3DXfOWx-r37eZudKQRqct0tY1R7vnUw0OabQ%40mail.gmail.com

Regards,
Vignesh




Re: Retail DDL

2025-07-25 Thread Hannu Krosing
A related improvement would be to also support

CREATE EXTENSION psql;

To make at least the `\d ...` commands available to any client

And while we are at it, why not also

CREATE EXTENSION pgbench;

To make the fancy random distribution functions (at least) from
pgbench available from inside the database.
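
e.g. something like this (entirely hypothetical function names,
mirroring pgbench's expression-language functions):

SELECT pgbench_random_gaussian(1, 100, 2.5);
SELECT pgbench_random_exponential(1, 100, 10.0);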




On Fri, Jul 25, 2025 at 10:43 AM Hannu Krosing  wrote:
>
> I have been thinking of this from a little different direction. We
> already have all the needed functionality in pg_dump so why not just
> have an option to do
>
> CREATE EXTENSION pg_dump;
>
> Which would wrap and expose whatever the current version of pg_dump is doing.
>
> It still would need to resolve the difficult question of naming
> things, but otherwise it looks like just a certain amount of
> mechanical work.
>
> We could even have just one set of functions with a few possible argument 
> types
>
> pg_dump(object oid, options text);
> pg_dump(object text, options text);
>
> ---
> Hannu
>
> On Fri, Jul 25, 2025 at 10:35 AM Álvaro Herrera  wrote:
> >
> > On 2025-Jul-24, Andrew Dunstan wrote:
> >
> > > Obviously we already have some functions for things like views and
> > > triggers, but most notably we don't have one for tables, something users
> > > have long complained about. I have been trying to think of a reasonable
> > > interface for a single function, where we would pass in, say, a catalog 
> > > oid
> > > plus an object oid, and maybe some optional extra arguments.
> >
> > Reproducing a table might need multiple commands.  Do you intend to
> > return a single string containing multiple semicolon-separated commands,
> > or are you thinking in a RETURNS SETOF where each row contains a single
> > command?
> >
> > What about schema-qualification needed for elements in the commands?  We
> > have the option to schema-qualify everything, _or_ to depend on whether
> > the schemas are in search_path, _or_ to schema-qualify nothing (which
> > gives the user the chance to recreate in any schema by changing
> > search_path).
> >
> >
> > > That seems a bit fragile, though. The alternative is that we have a
> > > separate function for each object type, e.g. pg_get_{objecttype}_ddl.
> > > I'm kinda leaning that way, but I'd like some sort of consensus before
> > > any work gets done.
> >
> > It looks like the discussion is leaning this way too.  I think it's
> > a reasonable choice.
> >
> > --
> > Álvaro HerreraBreisgau, Deutschland  —  
> > https://www.EnterpriseDB.com/
> >
> >




Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options

2025-07-25 Thread Tatsuo Ishii
Attached are the v17 patches for adding RESPECT/IGNORE NULLS options
defined in the standard to some window functions. FROM FIRST/LAST
options are not considered in the patch (yet).

This time I split the patch into 6 patches for reviewers' convenience.
Also, each patch has a short commit message to explain the patch.

0001: parse and analysis
0002: rewriter
0003: planner
0004: executor
0005: documents
0006: tests
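
For reference, the standard syntax added by the patches looks like this
(a sketch; the table and column names are just examples):

SELECT x,
       lag(x) IGNORE NULLS OVER (ORDER BY i) AS prev_nonnull,
       lag(x) RESPECT NULLS OVER (ORDER BY i) AS prev
FROM tbl;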

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



v17-0001-Modify-parse-analysis-modules-to-accept-RESPECT-.patch
Description: Binary data


v17-0002-Modify-get_windowfunc_expr_helper-to-handle-IGNO.patch
Description: Binary data


v17-0003-Modify-eval_const_expressions_mutator-to-handle-.patch
Description: Binary data


v17-0004-Modify-executor-and-window-functions-to-handle-I.patch
Description: Binary data


v17-0005-Modify-documents-to-add-null-treatment-clause.patch
Description: Binary data


v17-0006-Modify-window-function-regression-tests-to-test-.patch
Description: Binary data


Re: Retail DDL

2025-07-25 Thread Álvaro Herrera
On 2025-Jul-24, Andrew Dunstan wrote:

> Obviously we already have some functions for things like views and
> triggers, but most notably we don't have one for tables, something users
> have long complained about. I have been trying to think of a reasonable
> interface for a single function, where we would pass in, say, a catalog oid
> plus an object oid, and maybe some optional extra arguments.

Reproducing a table might need multiple commands.  Do you intend to
return a single string containing multiple semicolon-separated commands,
or are you thinking in a RETURNS SETOF where each row contains a single
command?

What about schema-qualification needed for elements in the commands?  We
have the option to schema-qualify everything, _or_ to depend on whether
the schemas are in search_path, _or_ to schema-qualify nothing (which
gives the user the chance to recreate in any schema by changing
search_path).


> That seems a bit fragile, though. The alternative is that we have a
> separate function for each object type, e.g. pg_get_{objecttype}_ddl.
> I'm kinda leaning that way, but I'd like some sort of consensus before
> any work gets done.

It looks like the discussion is leaning this way too.  I think it's
a reasonable choice.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




RE: Conflict detection for update_deleted in logical replication

2025-07-25 Thread Zhijie Hou (Fujitsu)
On Thursday, July 24, 2025 11:42 AM shveta malik  wrote:
> 
> On Wed, Jul 23, 2025 at 12:53 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Wednesday, July 23, 2025 12:08 PM Amit Kapila
>  wrote:
> > >
> > > On Wed, Jul 23, 2025 at 3:51 AM Masahiko Sawada
> > >  wrote:
> > > >
> > > > I've reviewed the 0001 patch and it looks good to me.
> > > >
> > >
> > > Thanks, I have pushed the 0001 patch.
> >
> > Thanks for pushing. I have rebased the remaining patches.

Thanks for the comments!

> >
> > I have reordered the patches to prioritize the detection of
> > update_deleted as the initial patch. This can give us more time to
> > consider the new GUC, since the performance-related aspects have been
> documented.
> >
> 
> 2)
> +   if (MySubscription->retaindeadtuples &&
> +   FindMostRecentlyDeletedTupleInfo(localrel,
> + remoteslot,
> +
>   &conflicttuple.xmin,
> +
>   &conflicttuple.origin,
> +
>   &conflicttuple.ts) &&
> +   conflicttuple.origin != replorigin_session_origin)
> +   type = CT_UPDATE_DELETED;
> +   else
> +   type = CT_UPDATE_MISSING;
> 
> Shall the conflict be detected as update_deleted irrespective of origin?

According to the discussion[1], I kept the current behavior.

> 
> 
> 5)
> monitoring.sgml:
> +  
> +   Number of times the tuple to be updated was deleted by another origin
> +   during the application of changes. See  linkend="conflict-update-deleted"/>
> +   for details about this conflict.
> +  
> 
> Here we are using the term 'by another origin', while in the rest of the doc 
> (see
> confl_update_origin_differs, confl_delete_origin_differs) we use the term 'by
> another source'. Shall we keep it the same?
> OTOH, I think using 'origin' is better but the rest of the page  is using 
> source.
> So perhaps changing source to origin everywhere is better. Thoughts?
> This can be changed if needed once we decide on point 2 above.

Yes, origin may be better. But for now, I have changed to 'source' here to be
consistent with the descriptions around it, and we can improve it in a separate
patch if needed.

Other comments have been addressed in the V53 patch set.

[1] 
https://www.postgresql.org/message-id/CAA4eK1L09u_A0HFRydA4xc%3DHpPkCh%2B7h-%2B_WRhKw1Cksp5_5zQ%40mail.gmail.com

Best Regards,
Hou zj


RE: Conflict detection for update_deleted in logical replication

2025-07-25 Thread Zhijie Hou (Fujitsu)
On Friday, July 25, 2025 2:33 PM Amit Kapila  wrote:
> 
> On Wed, Jul 23, 2025 at 12:53 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Thanks for pushing. I have rebased the remaining patches.
> >
> 
> + * This function performs a full table scan instead of using indexes
> + because
> + * index scans could miss deleted tuples if an index has been
> + re-indexed or
> + * re-created during change applications.
> 
> IIUC, once the tuple is not found during update, the patch does an additional
> scan with SnapshotAny to find the DEAD tuple, so that it can report
> update_deleted conflict, if one is found. The reason in the comments to do
> sequential scan in such cases sound reasonable but I was thinking if we can
> do index scans if the pg_conflict_* slot's xmin is ahead of the RI (or any 
> usable
> index that can be used during scan) index_tuple's xmin? Note, we use a similar
> check with the indcheckxmin parameter in pg_index though the purpose of
> that is different. If this can happen then still in most cases the index scan 
> will
> happen.

Right, I think it makes sense to do the index scan when the index's xmin is
less than the conflict detection xmin, as that can ensure that all the tuples
deleted before the index creation or re-indexing are irrelevant for conflict
detection.
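
Something like the following sketch (the variable names are made up,
not the actual patch code):

    /*
     * Use the index scan only if the index was (re)created before the
     * oldest transaction whose deleted tuples are still relevant for
     * conflict detection; otherwise fall back to a sequential scan.
     */
    if (TransactionIdIsValid(index_xmin) &&
        TransactionIdPrecedes(index_xmin, conflict_detection_xmin))
        use_index_scan = true;
    else
        use_index_scan = false;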

I have implemented this in the V53 patch set and improved the test to verify
both index and seq scans for dead tuples.

The V53-0001 also includes Shveta's comments in [1].

Apart from the above issue,
I'd like to clarify why we scan all matching dead tuples in the relation to
find the most recently deleted one in the patch, and I will share an example for
the same.

The main reason is that only the latest deletion information is relevant for
resolving conflicts. If the first tuple retrieved is antiquated while a newer
deleted tuple exists, users may incorrectly resolve the remote change by
applying a last-update-win strategy. Here is an example:

1. In a BI-cluster setup, if both nodes initially contain empty data:
 
Node A: tbl (empty)
Node B: tbl (empty)
 
2. Then the user does the following operations on Node A and waits for them to be
replicated to Node B:
 
INSERT (pk, 1)
DELETE (pk, 1) @9:00
INSERT (pk, 1)
 
The data on both nodes looks like:
 
Node A: tbl (pk, 1) - live tuple
   (pk, 1) - dead tuple - @9:00
Node B: tbl (pk, 1) - live tuple
   (pk, 1) - dead tuple - @9:00
 
3. If the user does DELETE (pk) on Node B @9:02, and UPDATE (pk, 1)->(pk, 2) on
   Node A @9:01.
 
When applying the UPDATE on Node B, it cannot find the target tuple, so it will
search the dead tuples, but there are two dead tuples:
 
Node B: tbl (pk, 1) - live tuple
   (pk, 1) - dead tuple - @9:00
   (pk, 1) - dead tuple - @9:02
 
If we only fetch the first tuple in the scan, it could be either a) the tuple
deleted @9:00, which is older than the remote UPDATE, or b) the tuple deleted
@9:02, which is newer than the remote UPDATE @9:01. The user may choose to
apply the UPDATE for case a), which can cause data inconsistency between
nodes (under a last-update-win strategy).

Ideally, we should give the resolver the newer dead tuple @9:02, so the
resolver can choose to ignore the remote UPDATE, keeping the data consistent.

[1] 
https://www.postgresql.org/message-id/CAJpy0uDiyjDzLU-%3DNGO7PnXB4OLy4%2BRxJiAySdw%3Da%2BYO62JO2g%40mail.gmail.com

Best Regards,
Hou zj


v53-0003-Re-create-the-replication-slot-if-the-conflict-r.patch
Description:  v53-0003-Re-create-the-replication-slot-if-the-conflict-r.patch


v53-0001-Support-the-conflict-detection-for-update_delete.patch
Description:  v53-0001-Support-the-conflict-detection-for-update_delete.patch


v53-0002-Introduce-a-new-GUC-max_conflict_retention_durat.patch
Description:  v53-0002-Introduce-a-new-GUC-max_conflict_retention_durat.patch


Re: Document transition table triggers are not allowed on views/foreign tables

2025-07-25 Thread Ashutosh Bapat
On Fri, Jul 25, 2025 at 3:22 PM Etsuro Fujita  wrote:
>
> Hi Ashutosh,
>
> On Thu, Jul 24, 2025 at 12:06 AM Ashutosh Bapat
>  wrote:
> > On Wed, Jul 23, 2025 at 4:16 PM Etsuro Fujita  
> > wrote:
> > > On Tue, Jul 15, 2025 at 4:55 PM Etsuro Fujita  
> > > wrote:
> > > > Another thing I noticed about transition tables is that while we
> > > > prohibit transition tables on views/foreign tables, there is no
> > > > description about that in the user-facing documentation.  So I would
> > > > like to propose to do $SUBJECT in create_trigger.sgml.  Attached is a
> > > > patch for that.
> >
> > I think the restriction should be specified in a manner similar to how
> > restriction on CONSTRAINT option for foreign tables is specified i.e.
> > in " This option is only allowed for an AFTER trigger that is not a
> > constraint trigger; also, if the trigger is an UPDATE trigger, it must
> > not specify a column_name list.". But that sentence is already a bit
> > complex because of ; also, ... part. How about splitting the sentence
> > into two and mentioning restriction like below?
> >
> > "This option is only allowed for an AFTER trigger on tables other than
> > views or foreign tables. The trigger should not be a constraint
> > trigger. If the trigger is an UPDATE trigger, it must not specify a
> > column_name list when using this option."
>
> Good idea!  This might be nitpicking, but one thing I noticed is this
> part of the first sentence: "an AFTER trigger on tables other than
> views or foreign tables".  Like the CONSTRAINT-restrictions
> description above, how about just saying "an AFTER trigger on a plain
> table (not a foreign table)"?  No need to mention views, so I removed
> that.

I was actually going to suggest that, but I wasn't sure why you wanted
to mention "views" explicitly.

> I also changed to singular because that sounds natural.  My
> first language is not English, though.  Other than that the change
> looks good to me.

+1.

-- 
Best Wishes,
Ashutosh Bapat




Re: Proposal: Limitations of palloc inside checkpointer

2025-07-25 Thread Xuneng Zhou
Hi, Alexander

Thanks for reviewing and feedback!

> The v8 is attached.  It's basically the same as v6, but has revised
> commit message and comments.
>
> The patch for stable branches is also attached: it just limits the
> maximum size of the checkpointer requests queue.

LGTM.

Best,
Xuneng




Re: Memory consumed by paths during partitionwise join planning

2025-07-25 Thread Ashutosh Bapat
On Tue, Jul 22, 2025 at 7:31 PM Andrei Lepikhov  wrote:
>
> On 18/7/2025 13:48, Ashutosh Bapat wrote:
> > On Mon, Jul 7, 2025 at 8:43 PM Andrei Lepikhov  wrote:
> > if (!IsA(new_path, IndexPath))
> > - pfree(new_path);
> > + free_path(new_path, 0, false);
> >
> > Why don't we free the subpaths if they aren't referenced anymore?
> During testing, I discovered that we sometimes encounter unusual cases.
> For example, imagine pathlist:
> {SeqScan, IndexOnlyScan1, IndexScan2}
> Someone may decide that Sort+SeqScan may be cheaper than IndexScan2. Or,
> IncrementalSort+IndexOnlyScan1 is cheaper than IndexScan2 ... And add
> this path to the same pathlist. I am unsure which exact combinations may
> arise in the planner, but without strict rules, someone may forget to
> increment the refcounter.

When a sort + seqscan path is created, seqscan's refcount will be
incremented, and if that path dominates indexScan2, which in turn is removed
from the pathlist by add_path, indexScan2 should be freed if no other path
references it. One of the objectives of this patch is to make it
easy to maintain refcount. That's why the refcount is incremented in a
central place like path creation time or when adding to pathlist.  The
case where it is not incremented should be studied and fixed.

>
> >> To address this complexity, I propose the following solutions:
> >> 1. Localise reference management within the add_path function.
> >> 2. Transition from a 'strict' approach, which removes all unused paths,
> >> to a more straightforward 'pinning' method. In this approach, we would
> >> simply mark a path as 'used' when someone who was added to the upper
> >> path list references it. Removing less memory, we leave the code much
> >> simpler.
> > Yes. This was one of the ideas Tom had proposed earlier in another
> > thread to manage paths better and avoid dangling pointers. May be it's
> > worth starting with that first. Get rid of special handling of index
> > paths and then improve the memory situation. However, even with that,
> > I think we should free more paths than less.
> It seems like one more trade-off: more eager cleaning means more
> resources spent.

Sorry, I wasn't clear. Even when using simple method to keep track of
whether a path has been referenced or not, we should be able to get
rid of special handling of IndexPath in add_path*
    /* Reject and recycle the new path */
    if (!IsA(new_path, IndexPath))
        pfree(new_path);

That way, we will be able to free even the unreferenced index paths,
thus saving more memory than now.
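
As a rough sketch of the discipline I have in mind (the refcount field is
assumed from the patch; free_path() is the helper from the quoted hunk):

    /*
     * Recycle a path rejected by add_path only when nothing references it,
     * instead of special-casing IndexPath.
     */
    static void
    recycle_rejected_path(Path *new_path)
    {
        if (new_path->refcount == 0)    /* field assumed added by the patch */
            free_path(new_path, 0, false);
    }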

>
> P.S. path_walker makes sense to implement.

+1

-- 
Best Wishes,
Ashutosh Bapat




Re: Retail DDL

2025-07-25 Thread Andrew Dunstan



On 2025-07-25 Fr 4:34 AM, Álvaro Herrera wrote:

On 2025-Jul-24, Andrew Dunstan wrote:


Obviously we already have some functions for things like views and
triggers, but most notably we don't have one for tables, something users
have long complained about. I have been trying to think of a reasonable
interface for a single function, where we would pass in, say, a catalog oid
plus an object oid, and maybe some optional extra arguments.

Reproducing a table might need multiple commands.  Do you intend to
return a single string containing multiple semicolon-separated commands,
or are you thinking in a RETURNS SETOF where each row contains a single
command?



probably SETOF TEXT, but I'm open to persuasion. What would be best for 
using it in some psql meta-commands?




What about schema-qualification needed for elements in the commands?  We
have the option to schema-qualify everything, _or_ to depend on whether
the schemas are in search_path, _or_ to schema-qualify nothing (which
gives the user the chance to recreate in any schema by changing
search_path).



Good question. Maybe that needs to be a function argument, say 
defaulting to depending on the current search path.




That seems a bit fragile, though. The alternative is that we have a
separate function for each object type, e.g. pg_get_{objecttype}_ddl.
I'm kinda leaning that way, but I'd like some sort of consensus before
any work gets done.

It looks like the discussion is leaning this way too.  I think it's
a reasonable choice.



Thanks. The only reason in my head against it was that it would expand 
the number of visible functions, but I think that loses out against the 
straightforwardness of this approach.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Proposal: QUALIFY clause

2025-07-25 Thread Matheus Alcantara
On Mon Jul 21, 2025 at 7:11 PM -03, Vik Fearing wrote:
> That is my preferred grammar, thank you.  I have not looked at the C 
> code but this can be obtained with a syntax transformation. To wit:
>
>
> SELECT a, b, c
> FROM tab
> QUALIFY wf() OVER () = ?
>
>
> can be rewritten as:
>
>
> SELECT a, b, c
> FROM (
>      SELECT a, b, c, wf() OVER () = ? AS qc
>      FROM tab
> ) AS q
> WHERE qc
>
>
> and then let the optimizer take over.  The standard does this kind of 
> thing all over the place; I don't know what the postgres project's 
> position on doing things like this are.
>

With this transformation users will see a Subquery plan node even though
it's not present in the original query; is that expected, or could it be
confusing to users?

--
Matheus Alcantara





Re: Add os_page_num to pg_buffercache

2025-07-25 Thread Bertrand Drouvot
Hi,

On Thu, Jul 24, 2025 at 10:30:06PM +0800, Mircea Cadariu wrote:
> I tried v5 and it returns the expected results on my
> laptop, same as before.

Thanks for the review and testing.

> 
> Just two further remarks for your consideration.
> 
> > +  
> > +   number of OS memory page for this buffer
> > +  
> Let's capitalize the first letter here.

It's copy/pasted from pg_buffercache_numa, but I agree that both (the one
in pg_buffercache_numa and the new one) should be capitalized (for consistency
with the other views).

Done in the attached.

> > +-- Check that the functions / views can't be accessed by default. To avoid
> > +-- having to create a dedicated user, use the pg_database_owner 
> > pseudo-role.
> > +SET ROLE pg_database_owner;
> > +SELECT count(*) > 0 FROM pg_buffercache_os_pages;
> > +RESET role;
> > +
> > +-- Check that pg_monitor is allowed to query view / function
> > +SET ROLE pg_monitor;
> > +SELECT count(*) > 0 FROM pg_buffercache_os_pages;
> > +RESET role;
> In the existing pg_buffercache.sql there are sections similar to the above
> (SET ROLE pg_database_owner/pg_monitor ... RESET role), with a couple of
> different SELECT statements within. Should we rather add the above new
> SELECTs there, instead of in the new pg_buffercache_os_pages.sql?

Yeah, that probably makes more sense, done in the attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From f54304ff1967748319d61145c160df8eaa362d24 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 1 Jul 2025 11:38:37 +
Subject: [PATCH v6 1/2] Introduce GET_MAX_BUFFER_ENTRIES and
 get_buffer_page_boundaries

Those new macro and function are extracted from pg_buffercache_numa_pages().

Currently, this is used by pg_buffercache_numa_pages() only but will be
used by a new function in a following commit.
---
 contrib/pg_buffercache/pg_buffercache_pages.c | 50 ++-
 1 file changed, 37 insertions(+), 13 deletions(-)
 100.0% contrib/pg_buffercache/

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index ae0291e6e96..8ef13d74186 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -28,6 +28,12 @@
 
 #define NUM_BUFFERCACHE_NUMA_ELEM	3
 
+/*
+ * Get the maximum buffer cache entries needed.
+ */
+#define GET_MAX_BUFFER_ENTRIES(nbuffers, os_page_size) \
+	((nbuffers) * (Max(1, BLCKSZ / (os_page_size)) + 1))
+
 PG_MODULE_MAGIC_EXT(
 	.name = "pg_buffercache",
 	.version = PG_VERSION
@@ -105,6 +111,33 @@ PG_FUNCTION_INFO_V1(pg_buffercache_evict_all);
 /* Only need to touch memory once per backend process lifetime */
 static bool firstNumaTouch = true;
 
+/*
+ * Helper function to get buffer page boundaries.
+ *
+ * Given a buffer pointer and OS page size, calculates the start/end
+ * pointers and first page number.
+ */
+static void
+get_buffer_page_boundaries(char *buffptr, Size os_page_size, char *startptr,
+		   char **startptr_buff, char **endptr_buff,
+		   int32 *page_num)
+{
+	char	   *start_ptr;
+	char	   *end_ptr;
+
+	/* start of the first page of this buffer */
+	start_ptr = (char *) TYPEALIGN_DOWN(os_page_size, buffptr);
+
+	/* end of the buffer (no need to align to memory page) */
+	end_ptr = buffptr + BLCKSZ;
+
+	Assert(start_ptr < end_ptr);
+
+	/* calculate ID of the first page for this buffer */
+	*page_num = (start_ptr - startptr) / os_page_size;
+	*startptr_buff = start_ptr;
+	*endptr_buff = end_ptr;
+}
 
 Datum
 pg_buffercache_pages(PG_FUNCTION_ARGS)
@@ -318,7 +351,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 		void	  **os_page_ptrs;
 		int		   *os_page_status;
 		uint64		os_page_count;
-		int			pages_per_buffer;
 		int			max_entries;
 		char	   *startptr,
    *endptr;
@@ -426,8 +458,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 		 * number as we're walking buffers. That way we can do it in one pass,
 		 * without reallocating memory.
 		 */
-		pages_per_buffer = Max(1, BLCKSZ / os_page_size) + 1;
-		max_entries = NBuffers * pages_per_buffer;
+		max_entries = GET_MAX_BUFFER_ENTRIES(NBuffers, os_page_size);
 
 		/* Allocate entries for BufferCachePagesRec records. */
 		fctx->record = (BufferCacheNumaRec *)
@@ -473,16 +504,9 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 			bufferid = BufferDescriptorGetBuffer(bufHdr);
 			UnlockBufHdr(bufHdr, buf_state);
 
-			/* start of the first page of this buffer */
-			startptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr);
-
-			/* end of the buffer (no need to align to memory page) */
-			endptr_buff = buffptr + BLCKSZ;
-
-			Assert(startptr_buff < endptr_buff);
-
-			/* calculate ID of the first page for this buffer */
-			page_num = (startptr_buff - startptr) / os_page_size;
+			/* Get page boundaries for this buffer. */
+			get_buffer_page_boundaries(buffptr, os_page_size, startptr,
+	   &startptr_buff, 

Re: HASH_FIXED_SIZE flag gets lost when attaching to existing hash table

2025-07-25 Thread Aidar Imamov

On 2025-07-24 20:24, Tom Lane wrote:

Aidar Imamov  writes:

Recently, while working with hash tables in dynahash.c, I noticed
something weird.
When a hash table is already created in shared memory, and another
process
calls hash_create attempting to attach to it, it seems like the
HASH_FIXED_SIZE
flag gets lost.


Yeah, you are right.  This seems to be an oversight in 7c797e719
which introduced that flag.  It only affects predicate-lock tables
because we don't use HASH_FIXED_SIZE anywhere else, and it'd only
matter in EXEC_BACKEND builds, so it's not that surprising that
nobody noticed.  But we ought to fix it going forward.

I don't really like your solution though.  ISTM the intent of the
code is that if the shared hashtable already exists, we adhere to the
properties it has, we don't rely on the current caller to specify the
exact same values.  So relying on the caller to get HASH_FIXED_SIZE
correct here seems like the wrong thing.  I think we ought to add
an isfixed flag to the shared hashtable header and copy that.
(IOW, isfixed ought to act more like keysize/ssize/sshift, and should
perhaps be grouped with them.)

regards, tom lane


Thanks, I agree with you.
Attaching an edited version of the patch.
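
For the archives, here is a minimal sketch of the failure mode (illustrative
only; the tag/entry types and sizes are placeholders):

    HASHCTL     info;
    HTAB       *htab;

    memset(&info, 0, sizeof(info));
    info.keysize = sizeof(MyTag);       /* placeholder key type */
    info.entrysize = sizeof(MyEntry);   /* placeholder entry type */

    /* The creating process requests a fixed-size shared hash table. */
    htab = ShmemInitHash("my fixed shared table", 128, 128,
                         &info, HASH_ELEM | HASH_BLOBS | HASH_FIXED_SIZE);

    /*
     * Under EXEC_BACKEND, a process attaching to the existing table takes
     * the early-return path in hash_create() and, before this fix, ended
     * up with isfixed = false, so the supposedly fixed-size table could
     * still be enlarged.
     */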

regards,
Aidar Imamov
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 1ad155d446e..50069d20fb5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -195,6 +195,8 @@ struct HASHHDR
 	long		ssize;			/* segment size --- must be power of 2 */
 	int			sshift;			/* segment shift = log2(ssize) */
 	int			nelem_alloc;	/* number of entries to allocate at once */
+	bool		isfixed;		/* if true, don't enlarge */
+
 
 #ifdef HASH_STATISTICS
 
@@ -496,6 +498,12 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 			hashp->ssize = hctl->ssize;
 			hashp->sshift = hctl->sshift;
 
+			/*
+			 * make a local copy of the value to determine if the initial
+			 * table's size was defined as a hard limit
+			 */
+			hashp->isfixed = hctl->isfixed;
+
 			return hashp;
 		}
 	}
@@ -517,6 +525,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 errmsg("out of memory")));
 	}
 
+	hashp->isfixed = false;
 	hashp->frozen = false;
 
 	hdefault(hashp);
@@ -619,7 +628,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	}
 
 	if (flags & HASH_FIXED_SIZE)
-		hashp->isfixed = true;
+		hashp->isfixed = hctl->isfixed = true;
+
 	return hashp;
 }
 
@@ -644,6 +654,8 @@ hdefault(HTAB *hashp)
 	hctl->ssize = DEF_SEGSIZE;
 	hctl->sshift = DEF_SEGSIZE_SHIFT;
 
+	hctl->isfixed = false;		/* can be enlarged */
+
 #ifdef HASH_STATISTICS
 	hctl->accesses = hctl->collisions = 0;
 #endif


Re: Retail DDL

2025-07-25 Thread Tom Lane
=?utf-8?Q?=C3=81lvaro?= Herrera  writes:
> Reproducing a table might need multiple commands.  Do you intend to
> return a single string containing multiple semicolon-separated commands,
> or are you thinking in a RETURNS SETOF where each row contains a single
> command?

In the same vein: would we expect this command to also build the
table's indexes?  What about foreign key constraints, which might
well reference tables that don't exist yet?

Once you start crawling down this rabbit-hole, you soon realize
why pg_dump is as complicated as it is.

regards, tom lane




Re: Retail DDL

2025-07-25 Thread Zhang Mingli
Hi,


On Jul 25, 2025 at 21:35 +0800, Tom Lane , wrote:
> =?utf-8?Q?=C3=81lvaro?= Herrera  writes:
> > Reproducing a table might need multiple commands. Do you intend to
> > return a single string containing multiple semicolon-separated commands,
> > or are you thinking in a RETURNS SETOF where each row contains a single
> > command?
>
> In the same vein: would we expect this command to also build the
> table's indexes? What about foreign key constraints, which might
> well reference tables that don't exist yet?
>
> Once you start crawling down this rabbit-hole, you soon realize
> why pg_dump is as complicated as it is.

First of all, +1 to this suggestion.
I've long believed there should be a standard way to get a table's DDL (like 
MySQL and Oracle have), especially when our DBAs encounter issues in customer
environments or when we need to cross-validate problems across different 
cluster versions.
This would make problem reproduction much more convenient. Currently, we're 
using pg_dump as our workaround.

Regarding the complexity you mentioned - absolutely, it's a real challenge.
MySQL's approach is to include all of a table's indexes in the DDL output. But 
this becomes problematic when dealing with foreign key dependencies between 
tables.

I think we could start with implementing basic table DDL and index generation 
first, as these are the most commonly needed features in practice.
For other objects related to the table, we can clearly document them.


Additionally, I have another suggestion - could we have a quick backslash 
command to display DDL? Something like \d+ t1, or perhaps \dddl? Looking at the 
code,
it seems there aren't many available command slots remaining.

--
Zhang Mingli
HashData


[PATCH] Avoid unnecessary code execution in Instrument.c when TIMING is FALSE

2025-07-25 Thread Hironobu SUZUKI

Hi,

Even when using EXPLAIN ANALYZE with TIMING=FALSE, the functions 
InstrStopNode(), InstrEndLoop(), and InstrAggNode() in Instrument.c 
still execute code related to the "starttime", "counter", "firsttuple", 
"startup",  and "total" fields within the Instrumentation structure.
These operations are unnecessary when timing is disabled, and since 
these functions are called very frequently, I have created a patch to 
address this.


As far as I can tell, this change has no side effects and clarifies the 
intent of each line, but please let me know if you notice any issues.


Best regards,
H.S.

diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f4700..e4274a28525 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -114,7 +114,8 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 	if (!instr->running)
 	{
 		instr->running = true;
-		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
+		if (instr->need_timer)
+			instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
 	}
 	else
 	{
@@ -122,7 +123,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 		 * In async mode, if the plan node hadn't emitted any tuples before,
 		 * this might be the first tuple
 		 */
-		if (instr->async_mode && save_tuplecount < 1.0)
+		if (instr->need_timer && instr->async_mode && save_tuplecount < 1.0)
 			instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
 	}
 }
@@ -149,18 +150,24 @@ InstrEndLoop(Instrumentation *instr)
 		elog(ERROR, "InstrEndLoop called on running node");
 
 	/* Accumulate per-cycle statistics into totals */
-	totaltime = INSTR_TIME_GET_DOUBLE(instr->counter);
+	if (instr->need_timer)
+	{
+		totaltime = INSTR_TIME_GET_DOUBLE(instr->counter);
 
-	instr->startup += instr->firsttuple;
-	instr->total += totaltime;
+		instr->startup += instr->firsttuple;
+		instr->total += totaltime;
+	}
 	instr->ntuples += instr->tuplecount;
 	instr->nloops += 1;
 
 	/* Reset for next cycle (if any) */
 	instr->running = false;
-	INSTR_TIME_SET_ZERO(instr->starttime);
-	INSTR_TIME_SET_ZERO(instr->counter);
-	instr->firsttuple = 0;
+	if (instr->need_timer)
+	{
+		INSTR_TIME_SET_ZERO(instr->starttime);
+		INSTR_TIME_SET_ZERO(instr->counter);
+		instr->firsttuple = 0;
+	}
 	instr->tuplecount = 0;
 }
 
@@ -171,16 +178,21 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 	if (!dst->running && add->running)
 	{
 		dst->running = true;
-		dst->firsttuple = add->firsttuple;
+		if (dst->need_timer && add->need_timer)
+			dst->firsttuple = add->firsttuple;
 	}
-	else if (dst->running && add->running && dst->firsttuple > add->firsttuple)
+	else if (dst->need_timer && add->need_timer && dst->running && add->running
+			 && dst->firsttuple > add->firsttuple)
 		dst->firsttuple = add->firsttuple;
 
-	INSTR_TIME_ADD(dst->counter, add->counter);
+	if (dst->need_timer && add->need_timer)
+	{
+		INSTR_TIME_ADD(dst->counter, add->counter);
 
+		dst->startup += add->startup;
+		dst->total += add->total;
+	}
 	dst->tuplecount += add->tuplecount;
-	dst->startup += add->startup;
-	dst->total += add->total;
 	dst->ntuples += add->ntuples;
 	dst->ntuples2 += add->ntuples2;
 	dst->nloops += add->nloops;


Re: track generic and custom plans in pg_stat_statements

2025-07-25 Thread Sami Imseih
> Perhaps CachedPlanType is
> misnamed, though, would it be more suited to name that as a sort of
> "origin" or "source" field concept?  We want to know which which
> source we have retrieved a plan that a PlannedStmt refers to.

Hmm, I’m not sure I see this as an improvement. In my opinion,
CachedPlanType is a clear name that describes its purpose.

--
Sami




Re: Logical replication launcher did not automatically restart when got SIGKILL

2025-07-25 Thread Fujii Masao
On Fri, Jul 25, 2025 at 10:53 PM cca5507  wrote:
>
> > For example, we could add a check like the following to 
> > 013_crash_restart.pl.
> > Thoughts?
> >
> > -
> > is($node->safe_psql('postgres',
> > "SELECT count(*) = 1 FROM pg_stat_activity WHERE backend_type =
> > 'logical replication launcher'"),
> > 't',
> > 'logical replication launcher is running after crash');
> > -

Patch attached.
I confirmed that the test fails when applied to a version before
commit b5d084c5353 (i.e., before the bug was fixed), and it passes
on HEAD with the patch applied.

> Agree, but we need to note that the logical replication launcher won't be
> registered in some cases (see ApplyLauncherRegister()).

Since 013_crash_restart.pl doesn't set any parameters that would
prevent the logical replication launcher from starting, I think
we don't need to worry about that. No??

Regards,

-- 
Fujii Masao


v1-0001-Add-regression-test-for-background-worker-restart.patch
Description: Binary data


Re: PoC: adding CustomJoin, separate from CustomScan

2025-07-25 Thread Paul A Jungwirth
On Thu, Jul 24, 2025 at 12:09 PM Robert Haas  wrote:
> I think that the motivating use case for the existing support was
> wanting to offload work to something that looks altogether unlike the
> normal PostgreSQL executor, rather than (as you want to do) implement
> a new type of join within the PostgreSQL executor. The latter might
> merit different infrastructure, although I cringe a little bit at the
> idea of having two ways of implementing custom joins when the total
> number of projects using this infrastructure is probably less than
> five.

I'm interested in using this. It sounds like it would let me implement
temporal outer/semi/anti joins very nicely. I didn't realize I could
do that with CustomScan as well---I'll investigate that---but
CustomJoin does sound like a better fit.

I'm not sure how to let users *ask* for a temporal join though. . . .
How do other CustomScans "get there"? Perhaps I could use a noop
function you include in the ON section that is declarative
pseudo-syntax, e.g. naming the tsrange columns, and a planner hook
that looks for that. I'm also not completely sure that temporal joins
have the same algebraic properties (associative, commutative, etc) as
non-temporal, or when mixed with non-temporal. Perhaps I can find a
research paper about that. Can the planner freely re-order
CustomScans? Anyway, I just wanted to say that there might be six
projects. :-)

Yours,

-- 
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: Use PqMsg_* macros in basebackup_copy.c

2025-07-25 Thread Nathan Bossart
On Fri, Jul 25, 2025 at 11:47:52AM -0300, Fabrízio de Royes Mello wrote:
> Attached patch for $SUBJECT.

Could we move this to the existing thread on the topic [0]?  I see one more
CopyData character in this file, plus some others that probably need their
own characters in protocol.h:

./basebackup_copy.c:146:	mysink->msgbuffer[0] = 'd'; /* archive or manifest data */
./basebackup_copy.c:173:	pq_sendbyte(&buf, 'n'); /* New archive */
./basebackup_copy.c:224:	pq_sendbyte(&buf, 'p'); /* Progress report */
./basebackup_copy.c:250:	pq_sendbyte(&buf, 'p'); /* Progress report */
./basebackup_copy.c:265:	pq_sendbyte(&buf, 'm'); /* Manifest */

[0] https://postgr.es/m/aIOW5_jI8YaBuBo0%40nathan
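
If we go that route, the additions could look something like this (a sketch
only; the names and placement are up for debate):

    /* Sub-protocol message types used inside base-backup CopyData payloads */
    #define PqBackupMsg_NewArchive      'n'
    #define PqBackupMsg_ProgressReport  'p'
    #define PqBackupMsg_Manifest        'm'

The 'd' at line 146 is the CopyData type byte itself, so PqMsg_CopyData
already covers that one.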

-- 
nathan




Re: PoC: adding CustomJoin, separate from CustomScan

2025-07-25 Thread Tomas Vondra



On 7/25/25 17:34, Paul A Jungwirth wrote:
> On Thu, Jul 24, 2025 at 12:09 PM Robert Haas  wrote:
>> I think that the motivating use case for the existing support was
>> wanting to offload work to something that looks altogether unlike the
>> normal PostgreSQL executor, rather than (as you want to do) implement
>> a new type of join within the PostgreSQL executor. The latter might
>> merit different infrastructure, although I cringe a little bit at the
>> idea of having two ways of implementing custom joins when the total
>> number of projects using this infrastructure is probably less than
>> five.
> 
> I'm interested in using this. It sounds like it would let me implement
> temporal outer/semi/anti joins very nicely. I didn't realize I could
> do that with CustomScan as well---I'll investigate that---but
> CustomJoin does sound like a better fit.
> 

I managed to make it work with the CustomScan by manually building the
tuple based on information extracted from target lists while creating
the plan (so prior to setrefs and all that).

I didn't have time to experiment with that more, so I don't know what
limitations that implies. E.g. I'm not sure if this makes evaluating
expressions (like join clauses) harder, or something like that.

The CustomJoin sure was easier / more convenient to use, though.

> I'm not sure how to let users *ask* for a temporal join though. . . .
> How do other CustomScans "get there"? Perhaps I could use a noop
> function you include in the ON section that is declarative
> pseudo-syntax, e.g. naming the tsrange columns, and a planner hook
> that looks for that. I'm also not completely sure that temporal joins
> have the same algebraic properties (associative, commutative, etc) as
> non-temporal, or when mixed with non-temporal. Perhaps I can find a
> research paper about that. Can the planner freely re-order
> CustomScans? Anyway, I just wanted to say that there might be six
> projects. :-)
> 

I don't think CustomScans help with parser/grammar at all. It's just a
planner/executor node; it has no way to interact with the parser. Maybe some
sort of "custom operator" would work, not sure. Not sure what exactly
you need in the planner.

For planning, CustomScan joins are simply part of the regular join
search. We generate all the various paths for a joinrel, and then give
the set_join_pathlist_hook hook a chance to add some more. AFAIK it
doesn't affect the join order search, or anything like that. At least
not directly.
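
The usual shape of such an extension is roughly this (a sketch; the hook
signature is the real one, the body is illustrative):

    #include "postgres.h"
    #include "fmgr.h"
    #include "optimizer/paths.h"

    PG_MODULE_MAGIC;

    static set_join_pathlist_hook_type prev_set_join_pathlist_hook = NULL;

    static void
    temporal_join_pathlist_hook(PlannerInfo *root, RelOptInfo *joinrel,
                                RelOptInfo *outerrel, RelOptInfo *innerrel,
                                JoinType jointype, JoinPathExtraData *extra)
    {
        if (prev_set_join_pathlist_hook)
            prev_set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
                                        jointype, extra);

        /*
         * Inspect extra->restrictlist here; if the clauses mark a temporal
         * join, build a CustomPath for joinrel and add_path() it.
         */
    }

    void
    _PG_init(void)
    {
        prev_set_join_pathlist_hook = set_join_pathlist_hook;
        set_join_pathlist_hook = temporal_join_pathlist_hook;
    }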


regards

-- 
Tomas Vondra





Re: Test instability when pg_dump orders by OID

2025-07-25 Thread Robert Haas
On Thu, Jul 24, 2025 at 10:27 PM Noah Misch  wrote:
> I regret missing those in v1.  I've attached v2, including branch-specific
> patches.  I'll first need to back-patch 350e6b8, which fixed sorting of CREATE
> RULE, to v17 and earlier.  Since 350e6b8 is conflict-free all the way back to
> v13, I'm not attaching it.

Back-patching 350e6b8 to facilitate back-patching this seems OK. I did
a read-through of dobjcomp20-disambiguate-v2.patch and have no further
review comments. I did not read through the back-patched versions on
the assumption that back-porting was straightforward enough that a
separate review was not required.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Regression with large XML data input

2025-07-25 Thread Robert Treat
On Thu, Jul 24, 2025 at 8:08 PM Michael Paquier  wrote:
> On Fri, Jul 25, 2025 at 01:25:48AM +0200, Jim Jones wrote:
> > On 24.07.25 21:23, Tom Lane wrote:
> >> However, when testing on RHEL8 with libxml2 2.9.7, indeed
> >> I get "Huge input lookup" with our current code but no
> >> failure with f68d6aabb7e2^.
> >>
> >> The way I interpret these results is that in older libxml2 versions,
> >> xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
> >> that does exist in newer versions.  So even if we were to do some kind
> >> of reversion, it would only prevent the error in libxml2 versions that
> >> lack that check.  And in those versions we'd probably be exposing
> >> ourselves to resource-exhaustion problems.
>
> Linux distributions may not seem very eager to add this check, though?
> The top of Debian sid uses a version of libxml2 where the difference
> shows up, so it means that we have years ahead with the old code.
>
> If it were discussing things from the perspective where this new code
> was added after a major version bump of Postgres, I would not argue
> much about that: breakages happen every year and users adapt their
> applications to it.  Here, however, we are talking about a change in a
> stable branch, across a minor version, which should be a bit more
> flawless from a user perspective?  I may be influenced by the point of
> seeing a customer impacted by that, of course, there is no denying
> that.  The point is that this enforces one behavior that's part of
> 2.13 and onwards.  Versions of PG before f68d6aabb7e2 were still OK
> with that and the new code of Postgres closes the door completely.
> Even if the behavior that Postgres had when linking with libxml2 2.12
> or older was kind of "accidental" because
> xmlParseBalancedChunkMemory() lacked the XML_ERR_RESOURCE_LIMIT check,
> it was there, and users relied on that.
>
> One possibility that I could see here for stable branches would be to
> make the code a bit smarter depending on LIBXML_VERSION, where we
> could keep the new code for 2.13 onwards, but keep a compatible
> behavior with 2.12 and older, based on xmlParseBalancedChunkMemory().
>

While I am pretty sympathetic to the idea that we hang our hats on
"Postgres doesn't break things in minor version updates", and this
seems to betray that, one scenario where we would break things is if
it were the only reasonable option wrt a bug / security fix, which
this seems potentially close to. If we can come up with a work around
like the above though, it would certainly be nice to give people a
path forward even if it ends up with a breaking major version change.
This at least eliminates the "surprise!" factor.

> >> On the whole I'm thinking more and more that we don't want to
> >> touch this.  Our recommendation for processing multi-megabyte
> >> chunks of XML should be "don't".  Unless we want to find or
> >> write a replacement for libxml2 ... which we have discussed,
> >> but so far nothing's happened.
> >
> > I also believe that addressing this limitation may not be worth the
> > associated risks. Moreover, a 10MB text node is rather large and
> > probably exceeds the needs of most users.
>
> Yeah, still some people use it, so while I am OK to accept this as a
> conclusion at the end and send back to this thread that workarounds
> are required in applications to split the inputs, that was really
> surprising.  (Aka from the point of view of the customer whose
> application suddenly fails after what should have been a "simple"
> minor update.)

There are a lot of public data sets that provide xml dumps as a
generic format for "non-commercial databases", and those can often be
quite large. I suspect we don't see those use cases a lot because
historically users have been forced to resort to perl/python/etc
scripts to convert the data prior to ingesting. Which is to say, I
think these use cases are more common than we think, and if there were
ever a stable implementation that supported these large use cases,
we'll start to see more of it.


Robert Treat
https://xzilla.net




Re: Non-text mode for pg_dumpall

2025-07-25 Thread Noah Misch
On Thu, Jul 24, 2025 at 04:33:15PM -0400, Andrew Dunstan wrote:
> On 2025-07-21 Mo 8:53 PM, Noah Misch wrote:
> > I suspect this is going to end with a structured dump like we use on the
> > pg_dump (per-database) side.  It's not an accident that v17 pg_restore 
> > doesn't
> > lex text files to do its job.  pg_dumpall deals with a more-limited set of
> > statements than pg_dump deals with, but they're not _that much_ more 
> > limited.
> > I won't veto a lexing-based approach if it gets the behaviors right, but I
> > don't have high hopes for it getting the behaviors right and staying that 
> > way.
> 
> I have been talking offline with Mahendra about this. I agree that we would
> be better off with a structured object for globals. But the thing that's
> been striking me all afternoon as I have pondered it is that we should not
> be designing such an animal at this stage of the cycle. Whatever we do we're
> going to be stuck supporting, so I have very reluctantly come to the
> conclusion that it would probably be better to back the feature out and have
> another go for PG 19.

That makes sense to me.  It would be quite a sprint to get this done in time,
and that wouldn't leave much room for additional testing and feedback before
the final release.  I agree with the reluctance and with the conclusion.




Re: track generic and custom plans in pg_stat_statements

2025-07-25 Thread Tom Lane
Sami Imseih  writes:
>> Perhaps CachedPlanType is
>> misnamed, though, would it be more suited to name that as a sort of
>> "origin" or "source" field concept?  We want to know which which
>> source we have retrieved a plan that a PlannedStmt refers to.

> Hmm, I’m not sure I see this as an improvement. In my opinion,
> CachedPlanType is a clear name that describes its purpose.

I think Michael's got a point.  As of HEAD there are seven different
places that are setting this to PLAN_CACHE_NONE; who's to say that
pg_stat_statements or some other extension might not wish to
distinguish some of those sources?  At the very least, user-submitted
versus internally-generated queries might be an interesting
distinction.  I don't have a concrete proposal for a different
categorization than what we've got, but it seems worth considering
while we still have the flexibility to change it easily.
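
For instance, something along these lines (names purely illustrative):

    typedef enum PlannedStmtOrigin
    {
        PLAN_STMT_ORIGIN_UNKNOWN,
        PLAN_STMT_ORIGIN_PLANNER,        /* planned directly, no plan cache */
        PLAN_STMT_ORIGIN_INTERNAL,       /* internally-generated query */
        PLAN_STMT_ORIGIN_CACHE_GENERIC,  /* generic plan from plancache */
        PLAN_STMT_ORIGIN_CACHE_CUSTOM    /* custom plan from plancache */
    } PlannedStmtOrigin;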

regards, tom lane




Re: IPC/MultixactCreation on the Standby server

2025-07-25 Thread Andrey Borodin



> On 21 Jul 2025, at 19:58, Andrey Borodin  wrote:
> 
> I'm planning to prepare tests and fixes for all supported branches

This is a status update message. I've reproduced problem on REL_13_STABLE and 
verified that proposed fix works there.

Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging multi, any heap tuple that uses 
this multi will become unreadable. Any attempt to read it will hang forever.

I've reproduced the problem and now I'm working on scripting this scenario. 
Basically, I modify the code to hang forever after assigning multi number 2.
Then I execute in the first psql:

create table x as select i,0 v from generate_series(1,10) i;
create unique index on x(i);

\set id 1
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 1
commit;

\set id 2
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 2 -- will hang
commit;

Then in second psql:

create table y as select i,0 v from generate_series(1,10) i;
create unique index on y(i);

\set id 1
begin;
select * from y where i = :id for no key update;
savepoint s1;
update y set v = v+1 where i = :id;
commit;

After this I pkill -9 postgres. The recovered installation cannot execute
select * from x; because multi 1 cannot be read without recovery of multi 2,
which was never logged.


Luckily the fix is the same: just restore the offset of multi 2 when multi 1 is
recovered.


Best regards, Andrey Borodin.



Re: Logical replication launcher did not automatically restart when got SIGKILL

2025-07-25 Thread shveta malik
On Fri, Jul 25, 2025 at 7:17 AM Fujii Masao  wrote:
>
> On Thu, Jul 24, 2025 at 6:46 PM shveta malik  wrote:
> > Sounds reasonable.
> > Thinking out loud, when cleaning up after a backend or background
> > worker crash, process_pm_child_exit() is invoked, which subsequently
> > calls both CleanupBackend() and HandleChildCrash(). After the cleanup
> > completes, process_pm_child_exit() calls PostmasterStateMachine() to
> > move to the next state. As part of that, PostmasterStateMachine()
> > invokes ResetBackgroundWorkerCrashTimes() (only in crash
> > scenarios/FatalError), to reset a few things. Since it also resets
> > rw_worker.bgw_notify_pid, it seems reasonable to reset the rw_pid as
> > well there.
>
> Thanks!
> Attached is a patch that fixes the issue by resetting rw_pid in
> ResetBackgroundWorkerCrashTimes().
>

The patch LGTM.

> We should probably add a regression test for this case,
> but I'd prefer to commit the fix first and work on the test separately.
> Andrey Rudometov proposed a test patch in thread [1],
> which we might use as a starting point.

Sounds good.

thanks
Shveta




Re: Custom pgstat support performance regression for simple queries

2025-07-25 Thread Bertrand Drouvot
Hi,

On Fri, Jul 25, 2025 at 10:38:59AM +0900, Michael Paquier wrote:
> On Thu, Jul 24, 2025 at 07:38:46AM +, Bertrand Drouvot wrote:
> > What about?
> > 
> > - create a global variable but that should be used only by extensions? Say,
> > pgstat_report_custom_fixed
> > 
> > - for builtin fixed-sized, just rely on have_iostats, have_slrustats and 
> > friends
> > 
> > Then in pgstat_report_stat():
> > 
> > - do the check on have_iostats, have_slrustats and friends and on
> > pgstat_report_custom_fixed
> > 
> > - reset pgstat_report_custom_fixed
> > 
> > That way I think it's less error prone for the builtin stats and more clear
> > for the extensions (as at least "custom" is also part of the global variable
> > name and the check in pgstat_report_stat() would make it clear that we rely 
> > on
> > the custom global variable).
> > 
> > Thoughts?
> 
> FWIW, I see more benefits in the simplicity of a single flag to govern
> both builtin and custom kinds, but I can see that this may come down
> to personal taste.  A single flag is simpler here, and cheap.
> 
> What have_static_pending_cb was originally aiming for is the removal
> of any knowledge specific to stats kinds in the central pgstats APIs,
> which is why the flags specific to IO, SLRU and WAL have been moved
> out into their own files.

Yes, with my idea we'd need to move them back. 

> Letting custom stats manipulate pgstat_report_fixed or invent a new
> pgstat_report_custom_fixed would be logically the same thing, but 
> this comes down to the fact that we still want to decide if a report
> should be triggered if any of the fixed-numbered stats want to let
> pgstat_report_stat() do one round of pending stats flush.

Yes.

> Having a second flag would keep this abstraction level intact.

Not a second, just one global "pgstat_report_custom_fixed" and the specifics
flags for IO, SLRU, something like:

    if (dlist_is_empty(&pgStatPending) &&
        !have_iostats &&
        !have_slrustats &&
        !pgstat_report_custom_fixed &&
        !pgstat_have_pending_wal())


> Now
> that leads to overcomplicating the API

Not sure.

> for a small gain in
> debuggability to know which part of the subsystem wants the report to 
> happen at early stages of pgstat_report_stat().  If we want to know
> exactly which stats kind wants the flush to happen, it may be better
> to reuse your idea of one single uint32 or uint64 with one bit
> allocated for each stats kind to have this information available in
> pgstat_report_fixed().  However, we already know that with the
> flush_static_cb callback, as dedicated paths can be taken for each
> stats kinds.  Once again, this would require stats kind to set their
> bit to force a round of reports, gaining more information before we
> try to call each flush_static_cb.

Yeah. I'm fine with one single flag as you proposed, I just have the feeling
it's more error prone.

As an example:

@@ -1091,6 +1092,9 @@ XLogInsertRecord(XLogRecData *rdata,
 		pgWalUsage.wal_bytes += rechdr->xl_tot_len;
 		pgWalUsage.wal_records++;
 		pgWalUsage.wal_fpi += num_fpi;
+
+		/* Required for the flush of pending stats WAL data */
+		pgstat_report_fixed = true;
 	}
 
 	return EndPos;
@@ -2108,6 +2112,12 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 				LWLockRelease(WALWriteLock);
 				pgWalUsage.wal_buffers_full++;
 				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
+
+				/*
+				 * Required for the flush of pending stats WAL data, per
+				 * update of pgWalUsage.
+				 */
+				pgstat_report_fixed = true;
 			}

This was not needed before fc415edf8ca8, and I think it was better to use
pgstat_have_pending_wal() so as not to have to set a flag in every place
pgWalUsage.XX changes.

Having said that, I think the single global flag patch is pretty straightforward
and LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Document transition table triggers are not allowed on views/foreign tables

2025-07-25 Thread Etsuro Fujita
Hi Ashutosh,

On Thu, Jul 24, 2025 at 12:06 AM Ashutosh Bapat
 wrote:
> On Wed, Jul 23, 2025 at 4:16 PM Etsuro Fujita  wrote:
> > On Tue, Jul 15, 2025 at 4:55 PM Etsuro Fujita  
> > wrote:
> > > Another thing I noticed about transition tables is that while we
> > > prohibit transition tables on views/foreign tables, there is no
> > > description about that in the user-facing documentation.  So I would
> > > like to propose to do $SUBJECT in create_trigger.sgml.  Attached is a
> > > patch for that.
>
> I think the restriction should be specified in a manner similar to how
> restriction on CONSTRAINT option for foreign tables is specified i.e.
> in " This option is only allowed for an AFTER trigger that is not a
> constraint trigger; also, if the trigger is an UPDATE trigger, it must
> not specify a column_name list.". But that sentence is already a bit
> complex because of ; also, ... part. How about splitting the sentence
> into two and mentioning restriction like below?
>
> "This option is only allowed for an AFTER trigger on tables other than
> views or foreign tables. The trigger should not be a constraint
> trigger. If the trigger is an UPDATE trigger, it must not specify a
> column_name list when using this option."

Good idea!  This might be nitpicking, but one thing I noticed is this
part of the first sentence: "an AFTER trigger on tables other than
views or foreign tables".  Like the CONSTRAINT-restrictions
description above, how about just saying "an AFTER trigger on a plain
table (not a foreign table)"?  No need to mention views, so I removed
that.  I also changed to singular because that sounds natural.  My
first language is not English, though.  Other than that the change
looks good to me.

Thanks for the review!

Best regards,
Etsuro Fujita




Re: More protocol.h replacements this time into walsender.c

2025-07-25 Thread Dave Cramer
On Fri, 25 Jul 2025 at 06:23, Álvaro Herrera  wrote:

> On 2025-Jul-25, Dave Cramer wrote:
>
> > FYI, the reason I used XLogData is because the term is used multiple
> times
> > here https://www.postgresql.org/docs/devel/protocol-replication.html
>
> Yeah, we could rename it, as in the attached.  It doesn't harm anything.
>

Consistency is good. If your patch were applied, then it would be
consistent to use WALData

Dave


Re: Enhance pg_createsubscriber to create required standby.

2025-07-25 Thread Amit Kapila
On Wed, Jun 18, 2025 at 10:43 AM David G. Johnston
 wrote:
>
> On Tue, Jun 17, 2025 at 9:22 PM Amit Kapila  wrote:
>>
>>
>> As shown in Vignesh's email [1] (point 4), there could be multiple
>> additional parameters required for the first option suggested by you,
>> which will make it longer. Additionally, there are some other benefits
>> of having the second option (pg_createsubscriber --create-standby),
>> like better cleanup of contents during failures and better progress
>> reporting. Are you still against adding such an option?
>>
>> [1]: 
>> https://www.postgresql.org/message-id/CALDaNm1biZBMOzFMfHYzqrAeosJSD5YRG%3D82-pp6%2BJhALsfe6w%40mail.gmail.com
>>
>
> None of those benefits convince me that "let's write a shell script in C and 
> put it under an annual feature release policy" is the way to go here.
>

I don't think it is equivalent to writing a simple script as you are
imagining; for example, future enhancements in this tool could make
such scripts require constant updates. Also, not sure providing Ctrl+C
or progress reporting would be any easier with scripting.

I feel this is a basic requirement for creating a subscriber node from
scratch, and if we refer users to external scripts for that, it would
be inconvenient for them.

-- 
With Regards,
Amit Kapila.




Re: Adding basic NUMA awareness

2025-07-25 Thread Tomas Vondra



On 7/25/25 12:27, Jakub Wartak wrote:
> On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra  wrote:
>>
>> On 7/4/25 20:12, Tomas Vondra wrote:
>>> On 7/4/25 13:05, Jakub Wartak wrote:
 ...

 8. v1-0005 2x + /* if (numa_procs_interleave) */

Ha! it's a TRAP! I've uncommented it because I wanted to try it out
 without it (just by setting the GUC off), but "MyProc->sema" is NULL:

 2025-07-04 12:31:08.103 CEST [28754] LOG:  starting PostgreSQL
 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
 [..]
 2025-07-04 12:31:08.109 CEST [28754] LOG:  io worker (PID 28755)
 was terminated by signal 11: Segmentation fault
 2025-07-04 12:31:08.109 CEST [28754] LOG:  terminating any other
 active server processes
 2025-07-04 12:31:08.114 CEST [28754] LOG:  shutting down because
 "restart_after_crash" is off
 2025-07-04 12:31:08.116 CEST [28754] LOG:  database system is shut down

 [New LWP 28755]
 [Thread debugging using libthread_db enabled]
 Using host libthread_db library 
 "/lib/x86_64-linux-gnu/libthread_db.so.1".
 Core was generated by `postgres: io worker '.
 Program terminated with signal SIGSEGV, Segmentation fault.
 #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
 at ./nptl/sem_waitcommon.c:136
 136 ./nptl/sem_waitcommon.c: No such file or directory.
 (gdb) where
 #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
 at ./nptl/sem_waitcommon.c:136
 #1  __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
 #2  0x5561918e0cac in PGSemaphoreReset (sema=0x0) at
 ../src/backend/port/posix_sema.c:302
 #3  0x556191970553 in InitAuxiliaryProcess () at
 ../src/backend/storage/lmgr/proc.c:992
 #4  0x5561918e51a2 in AuxiliaryProcessMainCommon () at
 ../src/backend/postmaster/auxprocess.c:65
 #5  0x556191940676 in IoWorkerMain (startup_data=<optimized out>,
 startup_data_len=<optimized out>) at
 ../src/backend/storage/aio/method_worker.c:393
 #6  0x5561918e8163 in postmaster_child_launch
 (child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
 startup_data=startup_data@entry=0x0,
 startup_data_len=startup_data_len@entry=0,
 client_sock=client_sock@entry=0x0) at
 ../src/backend/postmaster/launch_backend.c:290
 #7  0x5561918ea09a in StartChildProcess
 (type=type@entry=B_IO_WORKER) at
 ../src/backend/postmaster/postmaster.c:3973
 #8  0x5561918ea308 in maybe_adjust_io_workers () at
 ../src/backend/postmaster/postmaster.c:4404
 [..]
 (gdb) print *MyProc->sem
 Cannot access memory at address 0x0

>>>
>>> Yeah, good catch. I'll look into that next week.
>>>
>>
>> I've been unable to reproduce this issue, but I'm not sure what settings
>> you actually used for this instance. Can you give me more details how to
>> reproduce this?
> 
> Better late than never; feel free to partially ignore me. I've
> missed that it is a known issue, as per the FIXME there, but I would just rip
> out that commented out `if(numa_proc_interleave)` from
> FastPathLockShmemSize() and PGProcShmemSize() unless you want to save
> those memory pages of course (in case of no-NUMA). If you do want to
> save those pages I think we have problem:
> 
> For complete picture, steps:
> 
> 1. patch -p1 < v2-0001-NUMA-interleaving-buffers.patch
> 2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch
> 
> BTW the pgbench accidental indent is still there (part of v2-0001 patch):
> 14 out of 14 hunks FAILED -- saving rejects to file
> src/bin/pgbench/pgbench.c.rej
> 
> 3. As I'm just applying 0001 and 0006, I've got two simple rejects,
> but fixed it (due to not applying missing numa_ freelist patches).
> That's intentional on my part, because I wanted to play just with
> those two.
> 
> 4. Then I uncomment those two "if (numa_procs_interleave)" related for
> optional memory shm initialization - add_size() and so on (that have
> XXX comment above that it is causing bootstrap issues)
> 

Ah, I didn't realize you uncommented these "if" conditions. In that case
the crash is not very surprising, because the actual initialization in
InitProcGlobal ignores the GUCs and just assumes it's enabled. But
without the extra padding that likely messes up something, or something
allocated later "overwrites" some of the memory.

I need to clean this up, to actually consider the GUC properly.

FWIW I do have a new patch version that I plan to share in a day or two,
once I get some numbers. It didn't change this particular part, though,
it's more about the buffers/freelists/clocksweep. I'll work on PGPROC
next, I think.


regards

-- 
Tomas Vondra





Re: [Proposal] Expose internal MultiXact member count function for efficient monitoring

2025-07-25 Thread Ashutosh Bapat
On Tue, Jun 10, 2025 at 7:50 PM Naga Appani  wrote:
>
> On Tue, Mar 11, 2025 at 4:48 AM Kirill Reshke  wrote:
> >
> > On Tue, 11 Mar 2025 at 14:37, Naga Appani  wrote:
> > >
> > >
> > >
> > > On Mon, Mar 10, 2025 at 10:43 AM Naga Appani  wrote:
> > >>
> > >> Hi,
> > >>
> >
> > Hi
> >
> > > =
> > > Proposal
> > > =
> > > The internal ReadMultiXactCounts() function, implemented in multixact.c, 
> > > directly calculates the number of MultiXact members by reading live state 
> > > from shared memory. This approach avoids the performance issues of the 
> > > current filesystem-based estimation methods.
> >
> > This proposal looks sane. It is indeed helpful to keep an eye out for
> > multixact usage in systems that are heavily loaded.
> >
> > > By exposing ReadMultiXactCounts() for external use, we can provide 
> > > PostgreSQL users with an efficient way to monitor MultiXact member usage. 
> > > This could be particularly useful for integrating with tools like Amazon 
> > > RDS Performance Insights and Amazon CloudWatch to provide enhanced 
> > > database insights and proactive managed monitoring for users.
> > >
> > > Please let me know if this approach is acceptable, so I’ll go ahead and 
> > > submit a patch.
> >
> > Let's give it a try!
>
> Hi,
>
> As a follow-up, I’m submitting a patch that introduces a SQL-callable
> function to retrieve MultiXact usage metrics. Although the motivation
> has been discussed earlier in this thread, I’m including a brief recap
> below to provide context for the patch itself.
>
> While wraparound due to MultiXacts (MXID) is less frequent than XID
> wraparound, it can still lead to aggressive/wraparound vacuum behavior
> or downtime in certain workloads — especially those involving foreign
> keys, shared row locks, or long-lived transactions. Currently, users
> have no SQL-level visibility into MultiXact member consumption, which
> makes it hard to proactively respond before issues arise.

I see mxid_age() will just give mxid consumption but not members
consumption. So just that function is not enough.

>
> Sample output
> -
>  multixacts |  members
> +
>  182371396  | 2826221174
> (1 row)
>
> Performance comparison
> --
> While performance is not the primary motivation for this patch, it
> becomes important in monitoring scenarios where frequent polling is
> expected. The proposed function executes in sub-millisecond time and
> avoids any filesystem I/O, making it well-suited for lightweight,
> periodic monitoring.
>
> Implementation                       | Used size | MultiXact members | Time (ms) | Relative cost
> -------------------------------------+-----------+-------------------+-----------+----------------
> Community (pg_ls_multixactdir)       | 8642 MB   | 1.8 billion       | 96.879    | 1.00 (baseline)
> Linux (du command)                   | 8642 MB   | 1.8 billion       | 96        | 1.00
> Proposal (ReadMultiXactCounts-based) | N/A       | 1.99 billion      | 0.167     | ~580x faster
>
> Documentation
> -
> - A new section is added to func.sgml to group multixact-related functions
> - A reference to this new function is included in the "Multixacts and
> Wraparound" subsection of maintenance.sgml
>
> To keep related functions grouped together, we can consider moving
> mxid_age() into the new section as well unless there are objections to
> relocating it from the current section.

In [1], we decided to document pg_get_multixact_member() in section
"Transaction ID and Snapshot Information Functions". I think the
discussion in the email thread applies to this function as well.


+
+   MultiXact Information Functions
+
+
+     pg_get_multixact_count
+     pg_get_multixact_count () record
+
+     Returns a record with the fields multixacts and members:
+
+       multixacts: Number of MultiXacts assigned.
+        PostgreSQL initiates aggressive autovacuum when this value grows
+        beyond the threshold defined by autovacuum_multixact_freeze_max_age,
+        which is based on the age of datminmxid. For more details, see
+        https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND:
+        Routine Vacuuming: Multixact Wraparound.
+
+       members: Number of MultiXact member entries created.
+        These are stored in files under the pg_multixact/members
+        subdirectory. Wraparound occurs after approximately 4.29 billion
+        entries (~20 GiB). PostgreSQL initiates aggressive autovacuum when
+        the number of members created exceeds approximately 2.145 billion,
+        or when storage consumption in pg_multixact/members approaches 10 GiB.
+

The description here doesn't follow the format of the other functions
in this section.

Re: Proposal: Limitations of palloc inside checkpointer

2025-07-25 Thread Alexander Korotkov
On Fri, Jul 25, 2025 at 3:23 PM Xuneng Zhou  wrote:
>
> Thanks for reviewing and feedback!
>
> > The v8 is attached.  It's basically the same as v6, but has revised
> > commit message and comments.
> >
> > The patch for stable branches is also attached: it just limits the
> > maximum size of the checkpointer requests queue.
>
> LGTM.

Good, thank you.
I'm going to push this if no objections.

--
Regards,
Alexander Korotkov
Supabase




Re: Draft for basic NUMA observability

2025-07-25 Thread Patrick Stählin

Hi Jakub

On 7/24/25 10:01 AM, Jakub Wartak wrote:

On Tue, Jul 22, 2025 at 11:30 AM Patrick Stählin  wrote:


Hi!

On 4/7/25 11:27 PM, Tomas Vondra wrote:


I've pushed all three parts of v29, with some additional corrections
(picked lower OIDs, bumped catversion, fixed commit messages).


While building the PG18 beta1/2 packages I noticed that in our build
containers the selftest for pg_buffercache_numa and numa failed. It
seems that although libnuma was available and pg_numa_init/numa_available
returned no errors, we still fail in pg_numa_query_pages/move_pages with EPERM,
yielding the following error when accessing
pg_buffercache_numa/pg_shmem_allocations_numa:

ERROR: failed NUMA pages inquiry: Operation not permitted

The man page of move_pages led me to believe that this is because of
the missing capability CAP_SYS_NICE on the process but I couldn't prove
that theory with the attached patch.
The patch did make the tests pass but also disabled NUMA permanently on
a vanilla Debian VM and that is certainly not wanted. It may well be
that my understanding of checking capabilities and how they work is
incomplete. I also think that adding a new dependency for the reason of
just checking the capability is probably a bit of overkill; maybe
can check if we can access move_pages once without an error before
treating it as one?

I'd be happy to debug this further but I have limited access to our
build-infra, I should be able to sneak in commands during the build though.



Hi Patrick,

So is it because the container was started without CAP_SYS_NICE so
even root -> postgres is not having this cap? In my book container
would be rather small and certainly single container wouldn't be
spanning multiple CPU sockets, so I would just disable libnuma, anyway
if I do on regular VM:

> [...]

This is just the build env, but it runs the selftest, which then fails.
The containers this runs in for prod are a totally different setup, and
there the NUMA calls actually work. Disabling it may be an option, but it
would be nice to detect at runtime that we can't access it.
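
Something like this rough probe is what I had in mind (untested sketch;
needs <numaif.h> and -lnuma, error handling elided):

    #include <stdbool.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <numaif.h>

    /* Probe once whether move_pages() works for our own process. */
    static bool
    numa_query_pages_usable(void)
    {
        static char probe;
        long        pagesize = sysconf(_SC_PAGESIZE);
        void       *pages[1];
        int         status[1];

        /* move_pages() wants page-aligned addresses */
        pages[0] = (void *) ((uintptr_t) &probe & ~((uintptr_t) pagesize - 1));

        /* nodes == NULL means "query the node of each page" */
        return move_pages(0, 1, pages, NULL, status, 0) == 0;
    }

If this returns false (e.g. EPERM in a capability-restricted container), the
NUMA views could degrade gracefully instead of erroring.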



Can you provide exact details about this container technology?


We use podman to set everything up.


Can you provide /usr/sbin/capsh --print just before starting PG there?
Maybe this is more cgroup/cpuset somehow related too?


Here is the output; it seems that cap_sys_nice is missing from the
bounding set:


+ /usr/sbin/capsh --print
Current: =
Bounding set 
=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap

Ambient set =
Current IAB: 
!cap_dac_read_search,!cap_linux_immutable,!cap_net_broadcast,!cap_net_admin,!cap_net_raw,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_mknod,!cap_lease,!cap_audit_write,!cap_audit_control,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore

Securebits: 00/0x0/1'b0 (no-new-privs=0)
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=2000(buildkite-agent) euid=2000(buildkite-agent)
gid=2000(buildkite-agent)
groups=2000(buildkite-agent)
Guessed mode: HYBRID (4)


Anyway, there is a simpler way to make the tests pass, if that's what
you are after. We do have
contrib/pg_buffercache/sql/pg_buffercache_numa.sql, which is expected
to match the output in pg_buffercache_numa.out OR (!)
pg_buffercache_numa_1.out. We could probably handle this edge case by
just adding a pg_buffercache_numa_2.out too (which would simply contain
the expected output for the "ERROR: failed NUMA pages inquiry:
Operation not permitted" scenario).


Ah, I didn't know that was a possibility. As long as this sees no more
usage than just querying the state, that may be a nice workaround. If it
becomes more widespread we'll probably need something more robust for the
detection. I already patch out the tests for our build env, so for me
it's "solved", but that is certainly not a proper solution.


Just FYI, I'll be on PTO, so I won't have access to the build env for the
next two weeks.


Thanks,
Patrick




Use PqMsg_* macros in basebackup_copy.c

2025-07-25 Thread Fabrízio de Royes Mello
Attached patch for $SUBJECT.

-- 
Fabrízio de Royes Mello
From 37a119f5c4c01dbf71efe32180e00b5fa0add517 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Fabr=C3=ADzio=20de=20Royes=20Mello?=
 
Date: Fri, 25 Jul 2025 11:45:48 -0300
Subject: [PATCH] Use PqMsg_* macros in basebackup_copy.c

---
 src/backend/backup/basebackup_copy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index 18b0b5a52d3..bc5b3882bf9 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -191,7 +191,7 @@ bbsink_copystream_archive_contents(bbsink *sink, size_t len)
 	if (mysink->send_to_client)
 	{
 		/* Add one because we're also sending a leading type byte. */
-		pq_putmessage('d', mysink->msgbuffer, len + 1);
+		pq_putmessage(PqMsg_CopyData, mysink->msgbuffer, len + 1);
 	}
 
 	/* Consider whether to send a progress report to the client. */
@@ -277,7 +277,7 @@ bbsink_copystream_manifest_contents(bbsink *sink, size_t len)
 	if (mysink->send_to_client)
 	{
 		/* Add one because we're also sending a leading type byte. */
-		pq_putmessage('d', mysink->msgbuffer, len + 1);
+		pq_putmessage(PqMsg_CopyData, mysink->msgbuffer, len + 1);
 	}
 }
 
-- 
2.34.1



Re: Proposal: QUALIFY clause

2025-07-25 Thread Vik Fearing



On 25/07/2025 14:55, Matheus Alcantara wrote:

On Mon Jul 21, 2025 at 7:11 PM -03, Vik Fearing wrote:

SELECT a, b, c
FROM tab
QUALIFY wf() OVER () = ?


can be rewritten as:


SELECT a, b, c
FROM (
      SELECT a, b, c, wf() OVER () = ? AS qc
      FROM tab
) AS q
WHERE qc


and then let the optimizer take over.  The standard does this kind of
thing all over the place; I don't know what the postgres project's
position on doing things like this is.

With this transformation users will see a Subquery plan node even though
it's not present in the original query; is that expected, or could it be
confusing to users?



This is a definition technique; it does not need to be implemented as a
subquery.
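
To make that concrete with a real window function (emp is an illustrative
table), a query such as:

SELECT dept, empno, salary
FROM emp
QUALIFY row_number() OVER (PARTITION BY dept ORDER BY salary DESC) = 1;

would be *defined* as equivalent to:

SELECT dept, empno, salary
FROM (
      SELECT dept, empno, salary,
             row_number() OVER (PARTITION BY dept ORDER BY salary DESC) = 1
               AS qc
      FROM emp
) AS q
WHERE qc;

and an implementation remains free to flatten the subquery away rather
than instantiate it.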


--

Vik Fearing





Re: HASH_FIXED_SIZE flag gets lost when attaching to existing hash table

2025-07-25 Thread Tom Lane
Aidar Imamov  writes:
> Thanks, I agree with you.
> Attaching an edited version of the patch.

Re-reading this in the light of morning, I realized that it can be
done even more simply: let's just move isfixed to the shared struct,
rather than keep two copies.  The argument for two copies of keysize
et al is that they're used in hot code paths.  But element_alloc()
had better not be a hot code path for a shared hash table, so I
don't think that argument has force for isfixed.
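
To spell that out, the shape of the adjustment is roughly the following
(a simplified sketch, not the committed diff; real definitions are in
src/backend/utils/hash/dynahash.c):

/* isfixed moves from each backend's HTAB into the shared HASHHDR */
struct HASHHDR
{
	/* ... keysize, ssize, and the other shared parameters ... */
	bool		isfixed;	/* if true, don't enlarge the table */
};

/* element_alloc() is cold for a shared table, so reading the shared
 * copy there costs nothing: */
static bool
element_alloc(HTAB *hashp, int nelem, int freelist_idx)
{
	if (hashp->hctl->isfixed)
		return false;
	/* ... allocate and link new entries ... */
	return true;
}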

Pushed with that adjustment.  I didn't back-patch, because we've
not heard complaints about people running out of shmem on Windows.
If there is anybody running a workload where this would matter,
they might be less happy not more happy about the behavior
changing in a minor release.

regards, tom lane




Re: Logical replication launcher did not automatically restart when got SIGKILL

2025-07-25 Thread cca5507
> For example, we could add a check like the following to 013_crash_restart.pl.
> Thoughts?
>
> -
> is($node->safe_psql('postgres',
>     "SELECT count(*) = 1 FROM pg_stat_activity WHERE backend_type =
>      'logical replication launcher'"),
>    't',
>    'logical replication launcher is running after crash');
> -


Agreed, but we need to note that the logical replication launcher won't be
registered in some cases (see ApplyLauncherRegister()).
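
For reference, the case in question, paraphrased from
ApplyLauncherRegister() in src/backend/replication/logical/launcher.c:

void
ApplyLauncherRegister(void)
{
	/* no launcher bgworker is registered at all in this case */
	if (max_logical_replication_workers == 0)
		return;

	/* ... otherwise register the launcher as a background worker ... */
}

So the proposed test would need to be skipped (or adjusted) when
max_logical_replication_workers is 0.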


--
Regards,
ChangAo Chen

Re: Conflict detection for update_deleted in logical replication

2025-07-25 Thread shveta malik
On Fri, Jul 25, 2025 at 2:31 PM Amit Kapila  wrote:
>
> On Fri, Jul 25, 2025 at 12:37 PM shveta malik  wrote:
> >
> > On Thu, Jul 24, 2025 at 9:12 AM shveta malik  wrote:
> > >
> > >
> > > 2)
> > > +   if (MySubscription->retaindeadtuples &&
> > > +       FindMostRecentlyDeletedTupleInfo(localrel, remoteslot,
> > > +                                        &conflicttuple.xmin,
> > > +                                        &conflicttuple.origin,
> > > +                                        &conflicttuple.ts) &&
> > > +       conflicttuple.origin != replorigin_session_origin)
> > > +       type = CT_UPDATE_DELETED;
> > > +   else
> > > +       type = CT_UPDATE_MISSING;
> > >
> > > Shall the conflict be detected as update_deleted irrespective of origin?
> > >
> >
> > On thinking more here, I think that we may have the possibility of
> > UPDATE after DELETE from the same origin only when a publication
> > selectively publishes certain operations.
> >
> > 1)
> > Consider a publication that only publishes UPDATE and DELETE
> > operations. On the publisher, we may perform operations like DELETE,
> > INSERT, and UPDATE. On the subscriber, only DELETE and UPDATE events
> > are received. In this case, should we treat the incoming UPDATE as
> > update_deleted or update_missing?
> >
>
> If the user is doing subscription only for certain operations like
> Update or Delete, she may not be interested in eventual consistency as
> some of the data may not be replicated, so a conflict detection
> followed by any resolution may not be helpful.
>
> The other point is that if we report update_deleted in such cases, it
> won't be reliable: sometimes it could be update_missing because vacuum
> would have removed the row. OTOH, if we report update_missing, it will
> always be the same conflict, and we can document it.
>

Agree with both the points. We can keep the current behaviour as it is.
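
(For readers following along, case (1) above corresponds to a setup like
the following; table and publication names are illustrative:)

CREATE PUBLICATION pub_ud FOR TABLE t
    WITH (publish = 'update, delete');

With such a publication, a row that is deleted and re-inserted on the
publisher leaves the subscriber with no live tuple, so a later UPDATE
arrives with nothing to apply to.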

thanks
Shveta




Re: More protocol.h replacements this time into walsender.c

2025-07-25 Thread Nathan Bossart
On Fri, Jul 25, 2025 at 06:28:28AM -0400, Dave Cramer wrote:
> On Fri, 25 Jul 2025 at 06:23, Álvaro Herrera  wrote:
>> Yeah, we could rename it, as in the attached.  It doesn't harm anything.
> 
> Consistency is good. If your patch were applied, then it would be
> consistent to use WALData.

+1

-- 
nathan




Re: More protocol.h replacements this time into walsender.c

2025-07-25 Thread Nathan Bossart
On Thu, Jul 24, 2025 at 05:39:14PM -0400, Dave Cramer wrote:
> On Thu, 24 Jul 2025 at 17:05, Jacob Champion <
> jacob.champ...@enterprisedb.com> wrote:
>> Since these are part of the replication subprotocol (i.e. tunneled,
>> via CopyData) rather than the top-level wire protocol, do they deserve
>> their own prefix? PqReplMsg_* maybe?
>>
> I'm going to wait to see if there are any other opinions. Last time I did
> this there were quite a few opinions before we finally settled on the naming.

+1 to a new prefix.  I don't have any strong opinions on the exact choice,
though.  PqReplMsg, ReplMsg, PqMsg_Repl, etc. seem like some obvious
options.

-- 
nathan




Re: [PATCH] Avoid unnecessary code execution in Instrument.c when TIMING is FALSE

2025-07-25 Thread Michael Paquier
On Sat, Jul 26, 2025 at 12:26:21AM +0900, Hironobu SUZUKI wrote:
> Even when using EXPLAIN ANALYZE with TIMING=FALSE, the functions
> InstrStopNode(), InstrEndLoop(), and InstrAggNode() in Instrument.c still
> execute code related to the "starttime", "counter", "firsttuple", "startup",
> and "total" fields within the Instrumentation structure.
> These operations are unnecessary when timing is disabled, and since these
> functions are called very frequently, I have created a patch to address
> this.
> 
> As far as I can tell, this change has no side effects and clarifies the
> intent of each line, but please let me know if you notice any issues.
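
For context, the shape of the change being described is roughly the
following (a simplified sketch based on the Instrumentation fields in
src/include/executor/instrument.h, not the submitted patch itself):

void
InstrStopNode(Instrumentation *instr, double nTuples)
{
	/* tuple counts are cheap and always wanted */
	instr->tuplecount += nTuples;

	if (instr->need_timer)
	{
		instr_time	endtime;

		if (INSTR_TIME_IS_ZERO(instr->starttime))
			elog(ERROR, "InstrStopNode called without start");

		INSTR_TIME_SET_CURRENT(endtime);
		INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
		INSTR_TIME_SET_ZERO(instr->starttime);
	}

	/* first-tuple bookkeeping is only meaningful when timing */
	if (!instr->running)
	{
		instr->running = true;
		if (instr->need_timer)
			instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
	}
}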

Spoiler: this was discussed during a meetup attended by most of the
PostgreSQL hackers based in Tokyo and surroundings on the 18th of July,
where Suzuki-san noticed, while hacking on an extension that relies at
least on the explain hook, that the backend was doing pointless work when
instrumentation was used with timers disabled. One of the remarks was
that this seemed worth a submission upstream. Performance really took a
hit even with the timer disabled, as the extension did a lot of
unnecessary work for nothing.
--
Michael


signature.asc
Description: PGP signature


Re: Use PqMsg_* macros in basebackup_copy.c

2025-07-25 Thread Fabrízio de Royes Mello
On Fri, Jul 25, 2025 at 12:34 PM Nathan Bossart wrote:
>
> On Fri, Jul 25, 2025 at 11:47:52AM -0300, Fabrízio de Royes Mello wrote:
> > Attached patch for $SUBJECT.
>
> Could we move this to the existing thread on the topic [0]?  I see one more
> CopyData character in this file, plus some others that probably need their
> own characters in protocol.h:
>

Absolutely

> ./basebackup_copy.c:146:mysink->msgbuffer[0] = 'd'; /* archive or manifest data */

Missed that one.

> ./basebackup_copy.c:173:pq_sendbyte(&buf, 'n'); /* New archive */
> ./basebackup_copy.c:224:pq_sendbyte(&buf, 'p'); /* Progress report */
> ./basebackup_copy.c:250:pq_sendbyte(&buf, 'p'); /* Progress report */
> ./basebackup_copy.c:265:pq_sendbyte(&buf, 'm'); /* Manifest */
>

I was doing a separate patch, but agree it's better to do everything in
the same thread/patch.
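
For illustration only, the additions being discussed might end up looking
something like this (the names are hypothetical; the prefix question is
still open on the walsender thread):

/* hypothetical protocol.h additions -- not from any posted patch */
#define PqReplMsg_NewArchive		'n'
#define PqReplMsg_ProgressReport	'p'
#define PqReplMsg_Manifest			'm'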

Regards,

--
Fabrízio de Royes Mello


Re: Non-text mode for pg_dumpall

2025-07-25 Thread Andrew Dunstan


On 2025-07-25 Fr 12:21 PM, Noah Misch wrote:

On Thu, Jul 24, 2025 at 04:33:15PM -0400, Andrew Dunstan wrote:

On 2025-07-21 Mo 8:53 PM, Noah Misch wrote:

I suspect this is going to end with a structured dump like we use on the
pg_dump (per-database) side.  It's not an accident that v17 pg_restore doesn't
lex text files to do its job.  pg_dumpall deals with a more-limited set of
statements than pg_dump deals with, but they're not _that much_ more limited.
I won't veto a lexing-based approach if it gets the behaviors right, but I
don't have high hopes for it getting the behaviors right and staying that way.

I have been talking offline with Mahendra about this. I agree that we would
be better off with a structured object for globals. But the thing that's
been striking me all afternoon as I have pondered it is that we should not
be designing such an animal at this stage of the cycle. Whatever we do, we're
going to be stuck supporting it, so I have very reluctantly come to the
conclusion that it would probably be better to back the feature out and have
another go for PG 19.

That makes sense to me.  It would be quite a sprint to get this done in time,
and that wouldn't leave much room for additional testing and feedback before
the final release.  I agree with the reluctance and with the conclusion.




Before we throw the baby out with the bathwater, how about this 
suggestion? pg_dumpall would continue to produce globals.dat, but it 
wouldn't be processed by pg_restore, which would only restore the 
individual databases. Or else we just don't produce globals.dat at all. 
Then we could introduce a structured object that pg_restore could safely 
use for release 19, and I think we'd still have something useful for 
release 18.


cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: track generic and custom plans in pg_stat_statements

2025-07-25 Thread Sami Imseih
> Sami Imseih  writes:
> >> Perhaps CachedPlanType is misnamed, though; would it be better to name
> >> that as a sort of "origin" or "source" field concept?  We want to know
> >> from which source we have retrieved the plan that a PlannedStmt refers to.
>
> > Hmm, I’m not sure I see this as an improvement. In my opinion,
> > CachedPlanType is a clear name that describes its purpose.
>
> I think Michael's got a point.  As of HEAD there are seven different
> places that are setting this to PLAN_CACHE_NONE; who's to say that
> pg_stat_statements or some other extension might not wish to
> distinguish some of those sources?  At the very least, user-submitted
> versus internally-generated queries might be an interesting
> distinction.  I don't have a concrete proposal for a different
> categorization than what we've got, but it seems worth considering
> while we still have the flexibility to change it easily.

Sure, I get it now, I think. An example would be the cached plan from SQL
inside a utility statement of a prepared statement, right?

I can see that being useful, if I understood that correctly.
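
For illustration, one purely hypothetical broadening of the
categorization (none of these names come from a posted patch):

typedef enum PlannedStmtOrigin
{
	PLAN_STMT_INTERNAL,			/* internally-generated query */
	PLAN_STMT_STANDARD,			/* user-submitted, not from plan cache */
	PLAN_STMT_CACHE_GENERIC,	/* retrieved as a generic cached plan */
	PLAN_STMT_CACHE_CUSTOM		/* retrieved as a custom cached plan */
} PlannedStmtOrigin;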

--
Sami




Re: Fixing memory leaks in postgres_fdw

2025-07-25 Thread Tom Lane
Etsuro Fujita  writes:
> On Fri, May 30, 2025 at 2:02 AM Tom Lane  wrote:
>> Finally, here's a minimalistic version of the original v1-0001
>> patch that I think we could safely apply to fix the DirectModify
>> problem in the back branches.  I rejiggered it to not depend on
>> inventing MemoryContextUnregisterResetCallback, so that there's
>> not hazards of minor-version skew between postgres_fdw and the
>> main backend.

> Seems reasonable.

Pushed the larger patchset now.  I had to do a little more work
to get it to play with 112faf137, but it wasn't hard.

regards, tom lane




Re: Non-text mode for pg_dumpall

2025-07-25 Thread Tom Lane
Andrew Dunstan  writes:
> Before we throw the baby out with the bathwater, how about this 
> suggestion? pg_dumpall would continue to produce globals.dat, but it 
> wouldn't be processed by pg_restore, which would only restore the 
> individual databases. Or else we just don't produce globals.dat at all. 
> Then we could introduce a structured object that pg_restore could safely 
> use for release 19, and I think we'd still have something useful for 
> release 18.

I dunno ... that seems like a pretty weird behavior.  People would
have to do a separate text-mode "pg_dumpall -g" and remember to
restore that too.  Admittedly, this could be more convenient than
"pg_dumpall -g" plus separately pg_dump'ing each database, which is
what people have to do today if they want anything smarter than a flat
text dumpfile.  But it still seems like a hack --- and it would not be
compatible with v19, where presumably "pg_dumpall | pg_restore"
*would* restore globals.  I think that the prospect of changing
dump/restore scripts and then having to change them again in v19
isn't too appetizing.
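
(For reference, the status quo workflow mentioned above looks roughly
like this; database names are illustrative:)

pg_dumpall -g > globals.sql          # roles, tablespaces, other globals
pg_dump -Fc -f db1.dump db1          # one archive per database
pg_dump -Fc -f db2.dump db2

psql -f globals.sql postgres         # restore globals first
pg_restore -C -d postgres db1.dump   # then recreate and restore each DB
pg_restore -C -d postgres db2.dump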

regards, tom lane




Re: Parallel heap vacuum

2025-07-25 Thread Robert Haas
On Mon, Jul 21, 2025 at 7:49 PM Andres Freund  wrote:
> That is not to say you can't have callbacks or such, it just doesn't make
> sense for those callbacks to be at the level of tableam. If you want to make
> vacuumparallel support parallel table vacuuming for multiple table AMs (I'm
> somewhat doubtful that's a good idea), you could do that by having a
> vacuumparallel.c specific callback struct.

I'm not doubtful that this is a good idea. There are a number of
tableams around these days that are "heap except whatever", where (I
suspect) the size of "whatever" ranges from quite small to moderately
large. I imagine that such efforts end up duplicating a lot of heapam
code and probably always will; but if we can avoid increasing that
amount, I think it's a good idea.
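
To illustrate the quoted suggestion, a vacuumparallel.c-specific callback
struct might look something like this (every name here is hypothetical):

typedef struct ParallelVacuumCallbacks
{
	/* estimate shared memory the AM needs for a parallel vacuum */
	Size		(*estimate_shmem) (Relation rel, int nworkers);

	/* per-worker scan of a block range, recording dead items */
	void		(*scan_range) (Relation rel, BlockNumber start,
							   BlockNumber end, void *state);
} ParallelVacuumCallbacks;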

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

2025-07-25 Thread Greg Burd
On 7/22/25 14:43, Greg Burd wrote:
> On 7/21/25 14:35, Andres Freund wrote:
>> Hi,
>>
>> On 2025-07-21 13:37:04 -0400, Greg Burd wrote:
>>> On 7/18/25 13:03, Andres Freund wrote:
>>> Hello.  Thanks again for taking the time to review the email and patch,
>>> I think we're onto something good here.
>>>
>>>> I'd be curious if anybody wants to argue for keeping the clock sweep.
>>>> Except for the have_free_buffer() use in autoprewarm, it's a rather
>>>> trivial patch. And I really couldn't measure regressions above the
>>>> noise level, even in absurdly extreme use cases.
>>> Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
>>> for keeping the freelist"?
>> Err, yes :(
> Phew.  :)  No worries.
>
>>>> On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
>>>>> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
>>>>>> I think we'll likely need something to replace it.
>>>>> Fair, this (v5) patch doesn't yet try to address this.
>>>>>
>>>>>> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
>>>>>> right. The goal of the use of have_free_buffer() is obviously to stop
>>>>>> prewarming shared buffers if doing so would just evict buffers.  But it's
>>>>>> not clear to me that we should just stop when there aren't any free
>>>>>> buffers - what if the previous buffer contents aren't the right ones?
>>>>>> It'd make more sense to me to stop autoprewarm once NBuffers have been
>>>>>> prewarmed...
>>>>> I had the same high level reaction, that autoprewarm was leveraging
>>>>> something convenient but not necessarily required or even correct.  I'd
>>>>> considered using NBuffers as you describe due to similar intuitions; I'll
>>>>> dig into that idea for the next revision after I get to know autoprewarm
>>>>> a bit better.
>>>> Cool. I do think that'll be good enough.
>>> I re-added the have_free_buffer() function only now it returns false
>>> once nextVictimBuffer > NBuffers signaling to autoprewarm that the clock
>>> has made its first complete pass.  With that I reverted my changes in
>>> the autoprewarm module.  The net should be the same behavior as before
>>> at startup when using that module.
>> I don't think we should have a have_free_buffer() that doesn't actually test
>> whether we have a free buffer, that seems too likely to cause
>> misunderstandings down the line.  What if we instead just limit the amount of
>> buffers we load in apw_load_buffers()? apw_load_buffers() knows NBuffers and
>> the number of to-be-loaded buffers, so that shouldn't be hard.
> I'm glad you said that, I wasn't thrilled with that either and I'm not
> sure why I didn't just correct for that in the last patch set.  I'm now
> capping num_elements to NBuffers at most.
>
> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
> I'll dig into the Windows issues next week as well.
 FWIW, there are backtraces generated on windows. E.g.

 https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt

 00cd`827fdea0 7ff7`6ad82f88 ucrtbased!abort(void)+0x5a 
 [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
 00cd`827fdee0 7ff7`6aae2b7c postgres!ExceptionalCondition(
char * conditionName = 0x7ff7`6b2a4cb8 "result < 
 NBuffers",
char * fileName = 0x7ff7`6b2a4c88 
 "../src/backend/storage/buffer/freelist.c",
int lineNumber = 0n139)+0x78 
 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
 00cd`827fdf20 7ff7`6aae272c postgres!clock_modulo(
unsigned int64 counter = 0x101)+0x6c 
 [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
 00cd`827fdf60 7ff7`6aad8647 postgres!StrategySyncStart(
unsigned int * complete_passes = 0x00cd`827fdfc0,
unsigned int * num_buf_alloc = 
 0x00cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c 
 @ 300]
 00cd`827fdfa0 7ff7`6aa254a3 postgres!BgBufferSync(
struct WritebackContext * wb_context = 
 0x00cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 
 3649]
 00cd`827fe030 7ff7`6aa278a7 postgres!BackgroundWriterMain(
void * startup_data = 0x`,
unsigned int64 startup_data_len = 0)+0x243 
 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
 00cd`827ff5a0 7ff7`6a8daf19 postgres!SubPostmasterMain(
int argc = 0n3,
char ** argv = 0x028f`e75d24d0)+0x2f7 
 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
 00cd`827ff620 7ff7`6af0f5a9 postgres!main(
int argc = 0n3,
char ** argv