Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2025-03-25 Thread Andrei Lepikhov

On 3/24/25 23:45, Tom Lane wrote:

Lukas Fittl  writes:

The main argument I had initially when proposing this, is that Memoize is
different from other plan nodes, in that it makes the child node costs
"cheaper". Clearly seeing the expected cache hit/ratio (that drives that
costing modification) helps interpret why the planner came up with a given
plan.

As I said, I'm not necessarily averse to showing these numbers
somehow.  But I don't think they belong in the default output,
and I'm not even convinced that VERBOSE is the right place.
pg_overexplain seems like it could be an ideal home for this
sort of detail.
I prefer not to see these numbers in the default EXPLAIN output, not 
only because they can fluctuate but mainly because I like to view the 
basic query structure without additional details.
As I mentioned earlier, most of the information we typically need to 
explore query plans lives in path nodes and is not transferred to the 
Plan node. I believe this should stay as is, since we deal with numerous 
cases and a vast amount of data.
It would be beneficial to expose extra data in an external extension. By 
implementing a `create_plan` hook and an extensible list node in both 
Path and Plan structures, we could provide the machinery for writing an 
extension that can expose any planning-stage information in EXPLAIN on 
demand.
This could encourage developers to create a "pg_extended_explain" 
extension, which would address most user needs without requiring 
additional changes to the core system.
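For illustration, the hook-based design described above follows PostgreSQL's usual function-pointer hook convention. Below is a rough, stdlib-only sketch of that general pattern; note that the `create_plan` hook, the `ext_data` field, and every name here are hypothetical stand-ins for the proposal, not an existing core API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified mock: neither this hook nor ext_data exists
 * in core PostgreSQL today -- this only sketches the proposed shape. */
typedef struct Plan
{
	const char *node_name;
	void	   *ext_data;		/* extensible slot for planning-stage info */
} Plan;

/* Hook type "core" would call after building each Plan node (hypothetical). */
typedef void (*create_plan_hook_type) (Plan *plan);
static create_plan_hook_type create_plan_hook = NULL;

/* Core side: invoke the hook if an extension installed one. */
static void
finish_plan_node(Plan *plan)
{
	if (create_plan_hook)
		create_plan_hook(plan);
}

/* Extension side: attach extra detail for EXPLAIN to print on demand. */
static const char *memoize_note = "estimated cache hit ratio (placeholder)";

static void
pg_extended_explain_hook(Plan *plan)
{
	plan->ext_data = (void *) memoize_note;
}

/* Analogue of the extension's _PG_init(): install the hook. */
static void
extension_init(void)
{
	create_plan_hook = pg_extended_explain_hook;
}
```

The point of the pattern is that core pays only a NULL-pointer check when no extension is loaded, while a loaded extension can annotate any node it cares about.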


--
regards, Andrei Lepikhov




Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

2025-03-25 Thread Ashutosh Bapat
On Tue, Mar 25, 2025 at 12:58 PM Amit Langote  wrote:
>
> -/* Updates ec1's ec_derives_list and ec_derives_hash if present. */
> +/*
> + * Appends ec2's derived clauses to ec1->ec_derives_list and adds
> + * them to ec1->ec_derives_hash if present.
> + */

WFM.

>
> > @@ -396,7 +400,7 @@ process_equivalence(PlannerInfo *root,
> > /* just to avoid debugging confusion w/ dangling pointers: */
> > ec2->ec_members = NIL;
> > ec2->ec_sources = NIL;
> > - clear_ec_derived_clauses(ec2);
> > + ec_clear_derived_clauses(ec2);
> >
> > I pondered this naming convention when naming the functions. But
> > it seems it's not used everywhere in this file, or I am not able to see
> > the underlying naming rule, if any. So I used mixed naming. Let's
> > keep your names though; I think they are better.
>
> Got it -- I went with the ec_ prefix mainly to make the new additions
> self-consistent, since the file doesn’t seem to follow a strict naming
> pattern. Glad the names work for you. While at it, I also applied the
> same naming convention to two new functions I hadn’t touched earlier
> for some reason.

WFM.


>
> + * Derived equality clauses are stored in ec_derives_list. For small queries,
> + * this list is scanned directly during lookup. For larger queries -- e.g.,
> + * with many partitions or joins -- a hash table (ec_derives_hash) is built
> + * when the list grows beyond a threshold, for faster lookup. When present,
> + * the hash table contains the same RestrictInfos and is maintained alongside
> + * the list. We retain the list even when the hash is used to simplify
> + * serialization (e.g., in _outEquivalenceClass()) and support
> + * EquivalenceClass merging.
>
> I've merged my delta + your suggested changes as discussed above into 0002.
>

LGTM.
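The lookup scheme the quoted comment describes -- a linear scan of ec_derives_list for small queries, with a hash table built once a threshold is crossed -- can be mocked in a few lines. This is a simplified, stdlib-only sketch: the real code keys lookups on EquivalenceMembers and stores RestrictInfos, and only the 32-entry threshold is taken from the patch; all structure names here are stand-ins.

```c
#include <assert.h>
#include <stddef.h>

#define EC_DERIVES_HASH_THRESHOLD 32	/* same threshold as join_rel_hash */
#define HASH_BUCKETS 64

/* Simplified stand-in for a derived clause keyed by its member pair. */
typedef struct DerivedClause
{
	int			left_id, right_id;
	struct DerivedClause *list_next;	/* ec_derives_list analogue */
	struct DerivedClause *hash_next;	/* ec_derives_hash chain */
} DerivedClause;

typedef struct EquivClass
{
	int			nclauses;
	DerivedClause *list;				/* always maintained */
	DerivedClause *hash[HASH_BUCKETS];	/* built lazily past threshold */
	int			hash_valid;
} EquivClass;

static unsigned
hash_key(int l, int r)
{
	return ((unsigned) l * 31u + (unsigned) r) % HASH_BUCKETS;
}

/* Build the hash over the existing list entries. */
static void
build_hash(EquivClass *ec)
{
	for (DerivedClause *c = ec->list; c; c = c->list_next)
	{
		unsigned	h = hash_key(c->left_id, c->right_id);

		c->hash_next = ec->hash[h];
		ec->hash[h] = c;
	}
	ec->hash_valid = 1;
}

static void
add_clause(EquivClass *ec, DerivedClause *c)
{
	c->list_next = ec->list;	/* the list is retained either way */
	ec->list = c;
	ec->nclauses++;
	if (ec->hash_valid)			/* keep the hash in sync once built */
	{
		unsigned	h = hash_key(c->left_id, c->right_id);

		c->hash_next = ec->hash[h];
		ec->hash[h] = c;
	}
	else if (ec->nclauses > EC_DERIVES_HASH_THRESHOLD)
		build_hash(ec);
}

static DerivedClause *
lookup_clause(EquivClass *ec, int l, int r)
{
	if (ec->hash_valid)
	{
		for (DerivedClause *c = ec->hash[hash_key(l, r)]; c; c = c->hash_next)
			if (c->left_id == l && c->right_id == r)
				return c;
		return NULL;
	}
	for (DerivedClause *c = ec->list; c; c = c->list_next)
		if (c->left_id == l && c->right_id == r)
			return c;
	return NULL;
}
```

Keeping the list alongside the hash is what lets serialization and EC merging stay simple, at the cost of the dual maintenance shown in add_clause().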

> Btw, about ec_clear_derived_clauses():
>
> @@ -749,7 +749,7 @@ remove_rel_from_eclass(EquivalenceClass *ec,
> SpecialJoinInfo *sjinfo,
>   * drop them.  (At this point, any such clauses would be base restriction
>   * clauses, which we'd not need anymore anyway.)
>   */
> -ec->ec_derives = NIL;
> +ec_clear_derived_clauses(ec);
>  }
>
>  /*
> @@ -1544,8 +1544,7 @@ update_eclasses(EquivalenceClass *ec, int from, int to)
>  list_free(ec->ec_members);
>  ec->ec_members = new_members;
>
> -list_free(ec->ec_derives);
> -ec->ec_derives = NULL;
> +ec_clear_derived_clauses(ec);
>
> We're losing that list_free() in the second hunk, aren't we?
>
> There's also this comment:
>
> + * XXX: When thousands of partitions are involved, the list can become
> + * sizable. It might be worth freeing it explicitly in such cases.
>
> So maybe ec_clear_derived_clauses() should take a free_list parameter,
> to preserve the original behavior? What do you think?

Well spotted. How about just calling list_free() in
ec_clear_derived_clauses() to simplify things? list_free() might spend
some cycles under remove_rel_from_eclass() and process_equivalence()
freeing the array, but that should be OK. Just setting it to NIL by
itself looks fine too, but if we bundle it in a function with a flag,
we will need to explain why and when to free the list and when not to.
That's unnecessary complexity, I feel. In other places where the
structures have the potential to grow in size, we have resorted to
freeing them rather than just forgetting them. For example, we free
appinfos and child_relids in try_partitionwise_join().

The list shouldn't be referenced anywhere else, so it should be safe
to free it. Note that I thought list_concat() as used by
process_equivalence() would reuse the memory allocated to
ec2->ec_derives_list, but it doesn't. I verified that by setting the
threshold to 0, thus always forcing the hash table, and running the
regression suite: it runs without any segfaults, and I don't see any
change in the time required to run the regression tests.
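The behavior converged on here -- ec_clear_derived_clauses() unconditionally freeing the list before resetting it -- can be sketched with simplified, stdlib-only stand-ins for PostgreSQL's List API (not the real implementation, which is array-based and also has ec_derives_hash to reset):

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal linked-list stand-ins for PostgreSQL's List; names only. */
typedef struct ListCell
{
	struct ListCell *next;
	void	   *ptr;
} ListCell;

typedef struct List
{
	ListCell   *head;
} List;

#define NIL ((List *) NULL)

static List *
lappend(List *list, void *datum)
{
	ListCell   *cell = malloc(sizeof(ListCell));

	cell->ptr = datum;
	cell->next = NULL;
	if (list == NIL)
	{
		list = malloc(sizeof(List));
		list->head = cell;
	}
	else
	{
		ListCell   *tail = list->head;

		while (tail->next)
			tail = tail->next;
		tail->next = cell;
	}
	return list;
}

static void
list_free(List *list)			/* a no-op on NIL, like the real one */
{
	if (list == NIL)
		return;
	for (ListCell *c = list->head; c;)
	{
		ListCell   *next = c->next;

		free(c);
		c = next;
	}
	free(list);
}

typedef struct EquivalenceClass
{
	List	   *ec_derives_list;	/* ec_derives_hash omitted here */
} EquivalenceClass;

/*
 * Always free the clause list.  Order matters: the free must happen
 * before the pointer is reset, or list_free() sees NIL and the old
 * list leaks.
 */
static void
ec_clear_derived_clauses(EquivalenceClass *ec)
{
	list_free(ec->ec_derives_list);
	ec->ec_derives_list = NIL;
}
```

Because list_free(NIL) is a no-op, the unconditional free is safe on call sites that never built a list, which is what makes the flag-free design workable.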

PFA the patchset.
0001 and 0002 are the same as your patchset except for some edits of
mine to the commit messages. Please feel free to accept or reject them.
0003 adds list_free() to ec_clear_derived_clauses().
--
Best Wishes,
Ashutosh Bapat
From 3196b8f306f8ac647c86d4acc945798ef6118c2e Mon Sep 17 00:00:00 2001
From: Amit Langote 
Date: Mon, 24 Mar 2025 21:18:02 +0900
Subject: [PATCH 2/3] Make derived clause lookup in EquivalenceClass more
 efficient

Previously, derived clauses were stored in ec_derives, a List of
RestrictInfos. These clauses are looked up later using the left and
right EquivalenceMembers and the clause's parent EC, typically during
join clause generation.

This lookup becomes expensive in queries with many partitions or
joins, where ec_derives may contain thousands of entries. In
particular, create_join_clause() can spend significant time scanning
this list.

To improve performance, a hash table (ec_derives_hash) is now built
when ec_derives_list exceeds 32 entries. This is the same threshold used
for join_rel_hash. To reflect this dual structure, the ec_derives
field is renamed to ec_derives_list. Th

Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread Shubham Khanna
On Tue, Mar 25, 2025 at 4:31 PM Ashutosh Bapat
 wrote:
>
> On Tue, Mar 25, 2025 at 4:28 PM Shubham Khanna
>  wrote:
>
> >
> > The attached patches contain the suggested changes.
> > [1] - 
> > https://www.postgresql.org/message-id/CAA4eK1KUDEO0t6i16_CcEpg33sgcgEddHcdVC_q8j4tVUb5FWw%40mail.gmail.com
>
> Forgot to attach patches?
>
> --

Apologies for the oversight. I have attached the patches now.

Thanks and regards,
Shubham Khanna.


v21-0002-Synopsis-for-all-option.patch
Description: Binary data


v21-0001-Enhance-pg_createsubscriber-to-fetch-and-append-.patch
Description: Binary data


v21-0003-Additional-test-cases.patch
Description: Binary data


Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread Ashutosh Bapat
On Tue, Mar 25, 2025 at 4:28 PM Shubham Khanna
 wrote:

>
> The attached patches contain the suggested changes.
> [1] - 
> https://www.postgresql.org/message-id/CAA4eK1KUDEO0t6i16_CcEpg33sgcgEddHcdVC_q8j4tVUb5FWw%40mail.gmail.com

Forgot to attach patches?

-- 
Best Wishes,
Ashutosh Bapat




[PATCH] Split varlena.c into varlena.c and bytea.c

2025-03-25 Thread Aleksander Alekseev
Hi,

The proposed patch extracts the code dealing with bytea from varlena.c
into a separate file, as proposed previously [1]. It merely changes
the location of the existing functions. There are no other changes.

[1]: http://postgr.es/m/Zy2UqcZS2XAXBibq%40paquier.xyz

-- 
Best regards,
Aleksander Alekseev


v1-0001-Split-varlena.c-into-varlena.c-and-bytea.c.patch
Description: Binary data


Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread Ashutosh Bapat
On Tue, Mar 25, 2025 at 5:15 PM Amit Kapila  wrote:
>
> On Tue, Mar 25, 2025 at 5:08 PM Ashutosh Bapat
>  wrote:
> >
> > This looks mostly ready except the test changes. I believe when
> > committing, we are going to squash all three into a single commit. Is
> > that correct?
> >
>
> I would prefer not to commit 0003 as it is, primarily because of the test
> "+# run pg_createsubscriber with '--all' option without '--dry-run'". I
> am not so sure whether it is really worth adding test cycles for this
> option except for dry-run mode tests, but we can discuss that after
> committing the core patch.

I am worried that without that test, we won't catch bugs in creating
slots, publications and subscriptions, causing problems in a setup with
--all. But as long as we address that concern, I am OK with deferring it
until after the core patch is committed.

-- 
Best Wishes,
Ashutosh Bapat




Re: Support for NO INHERIT to INHERIT state change with named NOT NULL constraints

2025-03-25 Thread Alvaro Herrera
Hello

On 2025-Mar-25, Peter Eisentraut wrote:

> A patch in the NOT ENFORCED constraints patch series proposes to refactor
> some of the code added by this patch series ([0] patch v18-0001).  I noticed
> that the code paths from this patch series do not call
> InvokeObjectPostAlterHook() or CacheInvalidateRelcache() when a constraint
> is altered.  Was this intentional?  If not, I can fix it as part of that
> other patch, just wanted to check here.

It was not, I'm good with you fixing it, with thanks.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132...@pgmasters.net




Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread Amit Kapila
On Tue, Mar 25, 2025 at 5:08 PM Ashutosh Bapat
 wrote:
>
> This looks mostly ready except the test changes. I believe when
> committing, we are going to squash all three into a single commit. Is
> that correct?
>

I would prefer not to commit 0003 as it is, primarily because of the test
"+# run pg_createsubscriber with '--all' option without '--dry-run'". I
am not so sure whether it is really worth adding test cycles for this
option except for dry-run mode tests, but we can discuss that after
committing the core patch.

-- 
With Regards,
Amit Kapila.




Re: Add semi-join pushdown to postgres_fdw

2025-03-25 Thread Alexander Korotkov
On Mon, Mar 24, 2025 at 6:56 PM Alexander Pyhalov
 wrote:
> Alexander Korotkov писал(а) 2025-03-24 11:49:
> > On Mon, Mar 24, 2025 at 9:07 AM Alexander Pyhalov
> >  wrote:
> >> Alexander Korotkov писал(а) 2025-03-24 04:21:
> >> > Hi, Alexander!
> >> >
> >> > On Tue, Mar 18, 2025 at 6:04 PM Alexander Pyhalov
> >> >  wrote:
> >> >> This shouldn't. When semi-join is found below left/right join, it's
> >> >> deparsed as subquery.
> >> >> Interestingly enough, this mechanism (deparsing as a subquery) is used
> >> >> 1) for semi-joins under left/right join,
> >> >> 2) for full outer joins when inner or outer part has some
> >> >> remote_conds.
> >> >>
> >> >> The issue here is that after the subquery is deparsed, we don't
> >> >> consider if its target attributes are available to the upper-level
> >> >> join. As for the semi-join itself, all conditions are still deparsed
> >> >> at the left/right join boundary; they are just not propagated further.
> >> >> This shouldn't be a problem, as they are evaluated in the subquery.
> >> >> As for a left/right join without a semi-join beneath it, its behavior
> >> >> is not affected (as hidden_subquery_rels is empty).
> >> >
> >> > Thank you for the explanation.  But I have another question.  Aren't
> >> > the checks you've proposed too strict?  hidden_subquery_rels are
> >> > propagated all the way to the join tree.  So, pulling conditions would
> >> > be disables all the way to the join tree too.  Is it enough to just
> >> > disable pulling conditions directly from semi-joins, then their
> >> > further pulls will be disabled automatically?  See the attached patch.
> >> > It also contains other (mostly cosmetic improvements).
> >> >
> >> > --
> >> > Regards,
> >> > Alexander Korotkov
> >> > Supabase
> >>
> >> Hi. No, they are not too strict. Look at the following example
> >>
> >> EXPLAIN (verbose, costs off)
> >> SELECT x1.c1 FROM
> >> (SELECT * FROM ft2 WHERE EXISTS (SELECT 1 FROM ft4
> >> WHERE
> >> ft4.c1 = ft2.c1 AND ft2.c2 < 10)) x1
> >> RIGHT JOIN
> >> (SELECT * FROM ft2 WHERE EXISTS (SELECT 1 FROM ft4
> >> WHERE
> >> ft4.c1 = ft2.c1 AND ft2.c2 < 10)) x2
> >> ON (x1.c1 = x2.c1)
> >> LEFT JOIN
> >> (SELECT * FROM ft2 WHERE c2 < 11) x3
> >> ON (x1.c1 = x3.c1)
> >> ORDER BY x1.c1 LIMIT 10;
> >>
> >> With the patch you suggest, we'll deparse the left part of the left
> >> join as a subquery, but will try to pop the c2 < 10 condition up from
> >> the (8) LEFT JOIN ((6) SEMI JOIN (7)) subquery. When we look at the
> >> left join of this subquery and ft2, we still deparse the left part as
> >> a subquery, so we can't pop conditions up from it.
> >
> > I've checked, and this query seems to result in exactly the same remote
> > SQL with your patch and mine.  Could you elaborate more on the
> > difference?  Do you think foreign_join_ok() can give different results
> > on this query?
>
> Hi.
> With your patch this example gives the same
> ERROR:  unexpected expression in subquery output
>
> This happens because we don't keep the knowledge that we have deparsed all
> semi-joins below this left join. As long as a left/right join has a
> semi-join in its left or right part, that part will be deparsed as a
> subquery (look at the following lines in foreign_join_ok()):
>
>  else if (jointype == JOIN_LEFT || jointype == JOIN_RIGHT ||
> jointype == JOIN_FULL)
>  {
>  /*
>   * Conditions, generated from semi-joins, should be
> evaluated before
>   * LEFT/RIGHT/FULL join.
>   */
>  if (!bms_is_empty(fpinfo_o->hidden_subquery_rels))
>  {
>  fpinfo->make_outerrel_subquery = true;
>  fpinfo->lower_subquery_rels =
> bms_add_members(fpinfo->lower_subquery_rels, outerrel->relids);
>  }
>
>  if (!bms_is_empty(fpinfo_i->hidden_subquery_rels))
>  {
>  fpinfo->make_innerrel_subquery = true;
>  fpinfo->lower_subquery_rels =
> bms_add_members(fpinfo->lower_subquery_rels, innerrel->relids);
>  }
>  }
>
>
> So, we still can't refer to its remote_conds from upper-level queries
> (as not all Vars are available from the subquery after the subquery is
> created in one part of a left or right join). A semi-join need not be the
> immediate inner or outer of the left/right join for that inner or outer
> to be deparsed as a subquery. But it shouldn't be an issue - we've already
> used remote_conds when creating this subquery.
> What I'm trying to say is that the logic of 'making a subquery' and of
> extracting conditions should match (or we need a more sophisticated way
> of forming the subquery targetlist, so that extracted conditions could
> be used above subqueries).

Thank you for the explanation.  Pushed.  However, it would be nice in
future to rework this in a way that semi-joins on lower levels of join
tree don

Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

2025-03-25 Thread Amit Langote
On Tue, Mar 25, 2025 at 6:36 PM Ashutosh Bapat
 wrote:
> On Tue, Mar 25, 2025 at 12:58 PM Amit Langote  wrote:
> > Btw, about ec_clear_derived_clauses():
> >
> > @@ -749,7 +749,7 @@ remove_rel_from_eclass(EquivalenceClass *ec,
> > SpecialJoinInfo *sjinfo,
> >   * drop them.  (At this point, any such clauses would be base 
> > restriction
> >   * clauses, which we'd not need anymore anyway.)
> >   */
> > -ec->ec_derives = NIL;
> > +ec_clear_derived_clauses(ec);
> >  }
> >
> >  /*
> > @@ -1544,8 +1544,7 @@ update_eclasses(EquivalenceClass *ec, int from, int 
> > to)
> >  list_free(ec->ec_members);
> >  ec->ec_members = new_members;
> >
> > -list_free(ec->ec_derives);
> > -ec->ec_derives = NULL;
> > +ec_clear_derived_clauses(ec);
> >
> > We're losing that list_free() in the second hunk, aren't we?
> >
> > There's also this comment:
> >
> > + * XXX: When thousands of partitions are involved, the list can become
> > + * sizable. It might be worth freeing it explicitly in such cases.
> >
> > So maybe ec_clear_derived_clauses() should take a free_list parameter,
> > to preserve the original behavior? What do you think?
>
> Well spotted. How about just calling list_free() in
> ec_clear_derived_clauses() to simplify things? list_free() might spend
> some cycles under remove_rel_from_eclass() and process_equivalence()
> freeing the array, but that should be OK. Just setting it to NIL by
> itself looks fine too, but if we bundle it in a function with a flag,
> we will need to explain why and when to free the list and when not to.
> That's unnecessary complexity, I feel. In other places where the
> structures have the potential to grow in size, we have resorted to
> freeing them rather than just forgetting them. For example, we free
> appinfos and child_relids in try_partitionwise_join().
>
> The list shouldn't be referenced anywhere else, so it should be safe
> to free it. Note that I thought list_concat() as used by
> process_equivalence() would reuse the memory allocated to
> ec2->ec_derives_list, but it doesn't. I verified that by setting the
> threshold to 0, thus always forcing the hash table, and running the
> regression suite: it runs without any segfaults, and I don't see any
> change in the time required to run the regression tests.
>
> PFA the patchset.
> 0001 and 0002 are the same as your patchset except for some edits of
> mine to the commit messages. Please feel free to accept or reject them.

Thanks, I've noted your suggestions.

> 0003 adds list_free() to ec_clear_derived_clauses()

Thanks, I've merged it into 0002, with this blurb in its commit
message to describe it:

The new ec_clear_derived_clauses() always frees ec_derives_list, even
though some of the original code paths that cleared the old ec_derives
field did not. This ensures consistent cleanup and avoids leaking
memory when the list grows large.

I needed to do this though ;)

-ec->ec_derives_list = NIL;
 list_free(ec->ec_derives_list);
+ec->ec_derives_list = NIL;

-- 
Thanks, Amit Langote


v4-0002-Make-derived-clause-lookup-in-EquivalenceClass-mo.patch
Description: Binary data


v4-0001-Add-assertion-to-verify-derived-clause-has-consta.patch
Description: Binary data


Re: CREATE DATABASE with filesystem cloning

2025-03-25 Thread Nazir Bilal Yavuz
Hi,

Rebased version of the patch is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From d2eb46fb18258931f65e2b01ac6b406255f3c575 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 8 Aug 2024 15:01:48 +0300
Subject: [PATCH v10] Introduce file_copy_method GUC

This GUC can be set to either COPY (default) or CLONE (if the system
supports it).

If the CLONE method is chosen, the behavior is similar to COPY, but
efficient file-copying system calls are used.  The kernel has the
opportunity to share block ranges in copy-on-write file systems, or
maybe push down the copy to network file systems and storage devices.

Currently this works on Linux, FreeBSD and macOS.  More systems could
be supported.

Author: Thomas Munro 
Author: Nazir Bilal Yavuz 
Reviewed-by: Robert Haas 
Reviewed-by: Ranier Vilela 
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/include/storage/copydir.h |  9 ++
 src/backend/storage/file/copydir.c| 86 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/backend/utils/misc/guc_tables.c   | 19 
 src/backend/utils/misc/postgresql.conf.sample |  4 +
 doc/src/sgml/config.sgml  | 46 ++
 doc/src/sgml/ref/alter_database.sgml  |  3 +-
 doc/src/sgml/ref/create_database.sgml |  4 +-
 src/tools/pgindent/typedefs.list  |  1 +
 9 files changed, 170 insertions(+), 3 deletions(-)

diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index cf60e63f4e2..940d74462d1 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -13,6 +13,15 @@
 #ifndef COPYDIR_H
 #define COPYDIR_H
 
+typedef enum FileCopyMethod
+{
+	FILE_COPY_METHOD_COPY,
+	FILE_COPY_METHOD_CLONE,
+} FileCopyMethod;
+
+/* GUC parameters */
+extern PGDLLIMPORT int file_copy_method;
+
 extern void copydir(const char *fromdir, const char *todir, bool recurse);
 extern void copy_file(const char *fromfile, const char *tofile);
 
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index c335b60a367..0fa529dfd7d 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -21,17 +21,30 @@
 #include 
 #include 
 
+#ifdef HAVE_COPYFILE_H
+#include 
+#endif
+
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 
+/* GUCs */
+int			file_copy_method = FILE_COPY_METHOD_COPY;
+
+static void clone_file(const char *fromfile, const char *tofile);
+
 /*
  * copydir: copy a directory
  *
  * If recurse is false, subdirectories are ignored.  Anything that's not
  * a directory or a regular file is ignored.
+ *
+ * This function uses the file_copy_method GUC to determine the copy method.
+ * Uses of this function must be documented in the list of places
+ * affected by this GUC.
  */
 void
 copydir(const char *fromdir, const char *todir, bool recurse)
@@ -71,7 +84,12 @@ copydir(const char *fromdir, const char *todir, bool recurse)
 copydir(fromfile, tofile, true);
 		}
 		else if (xlde_type == PGFILETYPE_REG)
-			copy_file(fromfile, tofile);
+		{
+			if (file_copy_method == FILE_COPY_METHOD_CLONE)
+clone_file(fromfile, tofile);
+			else
+copy_file(fromfile, tofile);
+		}
 	}
 	FreeDir(xldir);
 
@@ -214,3 +232,69 @@ copy_file(const char *fromfile, const char *tofile)
 
 	pfree(buffer);
 }
+
+/*
+ * clone one file
+ */
+static void
+clone_file(const char *fromfile, const char *tofile)
+{
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(fromfile, tofile, NULL, COPYFILE_CLONE_FORCE) < 0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not clone file \"%s\" to \"%s\": %m",
+		fromfile, tofile)));
+#elif defined(HAVE_COPY_FILE_RANGE)
+	int			srcfd;
+	int			dstfd;
+	ssize_t		nbytes;
+
+	srcfd = OpenTransientFile(fromfile, O_RDONLY | PG_BINARY);
+	if (srcfd < 0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", fromfile)));
+
+	dstfd = OpenTransientFile(tofile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (dstfd < 0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", tofile)));
+
+	do
+	{
+		/* If we got a cancel signal during the copy of the file, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Don't copy too much at once, so we can check for interrupts from
+		 * time to time if this falls back to a slow copy.
+		 */
+		pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_COPY);
+		nbytes = copy_file_range(srcfd, NULL, dstfd, NULL, 1024 * 1024, 0);
+		if (nbytes < 0 && errno != EINTR)
+			ereport(ERROR,
+	(errcode_for_file_access(),
+	 errmsg("could not clone file \"%s\" to \"%s\": %m",
+			fromfile, tofile)));
+		pgstat_report_wait_end();
+	}
+	while (nbytes != 0);
+
+	if (CloseTransientFile(dstfd) != 0)
+		ereport(ERROR,
+(errcode_for_file_acc

Re: Expanding HOT updates for expression and partial indexes

2025-03-25 Thread Burd, Greg
Matthias, 

Rebased patch attached.

Changes in v14:
* UpdateContext now the location I've stored estate, resultRelInfo, etc.
* Reuse the result from the predicate on the partial index.

 -greg


> On Mar 7, 2025, at 5:47 PM, Matthias van de Meent 
>  wrote:
> 
> On Thu, 6 Mar 2025 at 13:40, Burd, Greg  wrote:
> 
>> 
>> 
>>> On Mar 5, 2025, at 6:39 PM, Matthias van de Meent 
>>>  wrote:
>>> 
>>> On Wed, 5 Mar 2025 at 18:21, Burd, Greg  wrote:
>>> 
 * augments IndexInfo only when needed for testing expressions and only once
>>> 
>>> 
>>> ExecExpressionIndexesUpdated seems to always loop over all indexes,
>>> always calling AttributeIndexInfo which always updates the fields in
>>> the IndexInfo when the index has only !byval attributes (e.g. text,
>>> json, or other such varlena types). You say it happens only once, have
>>> I missed something?
>> 
>> 
>> There's a test that avoids doing it more than once, [...]
> 
> 
> Is this that one?
> 
> +if (indexInfo->ii_IndexAttrByVal)
> +return indexInfo;
> 
> I think that test doesn't work consistently: a bitmapset * is NULL
> when no bits are set; and for some indexes no attribute will be byval,
> thus failing this early-exit even after processing.
> 
> Another small issue with this approach is that it always calls and
> tests in EEIU(), while it's quite likely we would do better if we
> pre-processed _all_ indexes at once, so that we can have a path that
> doesn't repeatedly get into EEIU only to exit immediately after. It'll
> probably be hot enough to not matter much, but it's still cycles spent
> on something that we can optimize for in code.

I've changed this a bit: now in ExecOpenIndices(), when there are expressions
or predicates, I augment the IndexInfo with the information necessary to
perform the tests in EEIU().  I've debated adding another bool to
ExecOpenIndices() to indicate that we're opening the indexes for the purpose
of an update, to avoid building that information in cases where we are not,
similar to the bool `speculative` on ExecOpenIndices() today.  Thoughts?

>> I might widen this patch a bit to include support for testing equality of 
>> index tuples using custom operators when they exist for the index.  In the 
>> use case I'm solving for we use a custom operator for equality that is not 
>> the same as a memcmp().  Do you have thoughts on that?
> 
> 
> I don't think that's a very great idea. From a certain point of view,
> you can see HOT as "deduplicating multiple tuple versions behind a
> single TID". Btree doesn't support deduplication for types that can
> have more than one representation of the same value so that e.g.
> '0.0'::numeric and '0'::numeric are both displayed correctly, even
> when they compare as equal according to certain equality operators.

Interesting, good point.  Seems like it would require a new index AM function:
bool indexed_tuple_would_change()

I'll drop this for now, it seems out of scope for this patch set.
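The indexed_tuple_would_change() idea above is only a sketch of a possible AM callback, so here is a rough, stdlib-only mock of how such an optional entry in an index AM's routine table could look. Everything here is hypothetical (the real IndexAmRoutine is a much larger struct, and no such callback exists today); the custom callback stands in for semantic equality like numeric's '0.0' vs '0':

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock of an index AM routine table with one optional callback. */
typedef struct IndexAmRoutine
{
	/* Hypothetical: NULL means "fall back to byte-wise comparison". */
	int			(*indexed_tuple_would_change) (const void *old_val,
											   const void *new_val,
											   size_t len);
} IndexAmRoutine;

/* Core-side check: use the AM callback when the AM provides one. */
static int
tuple_would_change(const IndexAmRoutine *am,
				   const void *old_val, const void *new_val, size_t len)
{
	if (am->indexed_tuple_would_change)
		return am->indexed_tuple_would_change(old_val, new_val, len);
	return memcmp(old_val, new_val, len) != 0;	/* default: memcmp */
}

/*
 * Example AM callback: treat values as equal when their first byte
 * matches -- a toy stand-in for equality that ignores display detail.
 */
static int
custom_would_change(const void *old_val, const void *new_val, size_t len)
{
	(void) len;
	return ((const char *) old_val)[0] != ((const char *) new_val)[0];
}
```

The default path preserves today's memcmp-style behavior, while an AM that opts in can declare two representations "unchanged" -- which is exactly the deduplication hazard Matthias points out.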



  

v14-0001-Expand-HOT-update-path-to-include-expression-and.patch
Description: v14-0001-Expand-HOT-update-path-to-include-expression-and.patch


RE: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Ashutosh,

> The new description doesn't mention the link between the source and
> target database.

Yes, I intentionally removed it.

> And I think it's incorrect. Not all databases on the
> target server will receive a subscription. Only those which have the
> same name as a database on the source server. Am I correct?

My assumption is that the target is a streaming standby and that all
differences between the two would eventually be resolved. IIUC, any
differences between the instances would be resolved, or the command would
raise an ERROR. Thus, users do not have to consider the differences between
the nodes, and we can simplify the description.

Analysis
===
If some databases are missing on the source, they would not be listed as
targets. pg_createsubscriber can wait until all changes are replicated before
the promotion, so these databases would be removed even on the target. OK, it
could work.
If some databases exist only on the source, they would be listed. This command
may fail if the database written in dbinfo[0].subconninfo does not exist on
the target. Otherwise they would eventually be replicated. Looks OK.

> I prefer
> the previous wording, even if it's a bit complex, over a simpler but
> incorrect description.

Definitely, and I know the description in v20 is correct. Let's keep the
current one if my idea above is wrong.

Best regards,
Hayato Kuroda
FUJITSU LIMITED




Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread Ashutosh Bapat
On Thu, Mar 20, 2025 at 5:54 PM Amit Kapila  wrote:
>
> *  
> +  
> +   pg_createsubscriber
> +   option
> +   
> +
> + -a
> + --all
> +
> +
> + -D 
> + --pgdata
> +
> +datadir
> +
> + -P
> + --publisher-server
> +
> +connstr
>
> Most of this is unrelated to this patch. I suggest making a top-up
> patch, we can commit it separately.

Isn't this documenting the -a/--all option in the synopsis? Why do you think
it's unrelated?

-- 
Best Wishes,
Ashutosh Bapat




Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

2025-03-25 Thread David Rowley
On Tue, 25 Mar 2025 at 21:14, Bykov Ivan  wrote:
> As I can see, your patch has the same idea as my 
> v2-0001-Query-ID-Calculation-Fix-Variant-B.patch from [1].
> I think it would be better to extract the jumble buffer update with hash 
> calculation into a function
> (CompressJumble in my patch). This will result in fewer changes to the source 
> code.
> We can also simply dump the null count to the buffer when we encounter a 
> non-null field
> (which will be processed in AppendJumble, indeed).
>
> What do you think about my patch
> (v2-0001-Query-ID-Calculation-Fix-Variant-B.patch)?

Seemingly, I missed that one. I agree with the general method being
suitable, which might not be surprising since I ended up at the same
conclusion independently.

I quite like your CompressJumble() idea, but I don't know that it
amounts to any compiled code changes.

I think it's a bit strange that you're flushing the pending nulls to
the jumble buffer after jumbling the first non-null. I think it makes
much more sense to do it before, as I did; otherwise you're
effectively jumbling fields out of order.

I see you opted to only jumble the low-order byte of the pending null
counter. I used the full 4 bytes as I don't think the extra 3 bytes is
a problem. With v7, the memcpy to copy the pending_nulls into the
buffer is inlined into a single instruction. It's semi-conceivable
that you could have more than 255 NULLs given that a List can contain
Nodes. I don't know if we ever have any NULL items in Lists at the
moment, but I don't think that that's disallowed. I'd feel much safer
taking the full 4 bytes of the counter to help prevent future
surprises.
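For illustration, here is a stdlib-only mock of the pending-null scheme under discussion (greatly simplified: the real jumbling code hashes the buffer with a proper hash function as it fills, and the names here are only loosely modeled on the actual code). Consecutive NULL fields only bump a counter, and the full 4-byte counter is flushed into the buffer *before* the next non-null field is appended, which keeps fields in order:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define JUMBLE_SIZE 1024

typedef struct JumbleState
{
	unsigned char jumble[JUMBLE_SIZE];
	size_t		jumble_len;
	uint32_t	pending_nulls;	/* run of NULL fields not yet flushed */
} JumbleState;

static void
append_bytes(JumbleState *js, const void *item, size_t size)
{
	/* simplified: the real code hashes and resets when the buffer fills */
	memcpy(js->jumble + js->jumble_len, item, size);
	js->jumble_len += size;
}

/* Flush the pending-null counter before the next non-null field. */
static void
flush_pending_nulls(JumbleState *js)
{
	if (js->pending_nulls > 0)
	{
		append_bytes(js, &js->pending_nulls, sizeof(uint32_t));
		js->pending_nulls = 0;
	}
}

static void
jumble_null(JumbleState *js)
{
	js->pending_nulls++;		/* cheap: no buffer traffic per NULL */
}

static void
jumble_value(JumbleState *js, uint32_t value)
{
	flush_pending_nulls(js);
	append_bytes(js, &value, sizeof(value));
}

static void
jumble_finish(JumbleState *js)
{
	flush_pending_nulls(js);	/* account for trailing NULLs too */
}
```

Flushing before the next non-null field (rather than after) means the byte stream reflects the true field order, so (1, NULL, 2) and (1, 2, NULL) jumble differently -- the original bug was precisely that NULL placement could become indistinguishable.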

Aside from that, v2b still has the performance regression issue that
we're trying to solve. I did a fresh run of master as of 19c6eb06c and
tested v7 and v2b. Here are the average times it took to jumble all
the make check queries over 50 runs;

master: 76.052 ms
v7: 75.422 ms
v2b: 81.391 ms

I also attached a graph.

I'm happy to proceed with the v7 version. I'm also happy to credit you
as the primary author of it, given that you came up with the v2b
patch.  It might be best to extract the performance stuff that's in v7
and apply that separately from the bug fix. If you're still interested
and have time in the next 7 hours to do it, would you be able to split
the v7 as described, making v8-0001 as the bug fix and v8-0002 as the
performance stuff? If so, could you also add the total_jumble_len to
v8-0001 like Michael's last patch had, but adjust so it's only
maintained for Assert builds. Michael is quite keen on having more
assurance that custom jumble functions don't decide to forego any
jumbling.

If you don't have time, I'll do it in my morning (UTC+13).

Thanks

David

> [1] 
> https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain


RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

2025-03-25 Thread Bykov Ivan
Hello, David!

As far as I can see, your patch has the same idea as my
v2-0001-Query-ID-Calculation-Fix-Variant-B.patch from [1].
I think it would be better to extract the jumble buffer update with hash 
calculation into a function
(CompressJumble in my patch). This will result in fewer changes to the source 
code. 
We can also simply dump the null count to the buffer when we encounter a 
non-null field
(which will be processed in AppendJumble, indeed).

What do you think of my patch
(v2-0001-Query-ID-Calculation-Fix-Variant-B.patch)?

[1] 
https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain



RE: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Shubham,

> The attached patches contain the suggested changes.

Thanks for updating the patch. I reviewed only 0001 because they would be 
committed separately.
Few comments:

01.
```
+   For every non-template database on the source server, create one
+   subscription on the target server in the database with the same name.
```

This is quite confusing to me; we do not have to tell users that this command
checks the databases of the source server. How about something like:
"Create one subscription per non-template database on the target server."

02.
```
+# run pg_createsubscriber with '--database' and '--all' without '--dry-run'
+# and verify the failure
+command_fails_like(
+   [
+   'pg_createsubscriber',
+   '--verbose',
+   '--pgdata' => $node_s->data_dir,
+   '--publisher-server' => $node_p->connstr($db1),
+   '--socketdir' => $node_s->host,
+   '--subscriber-port' => $node_s->port,
+   '--database' => $db1,
+   '--all',
+   ],
+   qr/--database cannot be used with --all/,
+   'fail if --database is used with --all');
+
+# run pg_createsubscriber with '--publication' and '--all' and verify
+# the failure
+command_fails_like(
+   [
+   'pg_createsubscriber',
+   '--verbose',
+   '--dry-run',
+   '--pgdata' => $node_s->data_dir,
+   '--publisher-server' => $node_p->connstr($db1),
+   '--socketdir' => $node_s->host,
+   '--subscriber-port' => $node_s->port,
+   '--all',
+   '--publication' => 'pub1',
+   ],
+   qr/--publication cannot be used with --all/,
+   'fail if --publication is used with --all');
```

You seem to have moved most of the validation checks to 0002. Can you tell me
why these remain here?

03.
```
+# Verify that the required logical replication objects are created. The
+# expected count 3 refers to postgres, $db1 and $db2 databases.
```

Hmm, but since we did a dry run, no objects are created, right?

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

2025-03-25 Thread Amit Langote
On Tue, Mar 25, 2025 at 3:17 PM Ashutosh Bapat
 wrote:
> On Mon, Mar 24, 2025 at 7:23 PM Amit Langote  wrote:
> >
> > Ok, thanks for that analysis.  I don't think there's anything about
> > the patch that makes it particularly less suitable for pwj=on.
> >
>
> I agree.
>
> > I read patch 0002 in detail last week and wrote a follow-up patch
> > (0003), mostly for cosmetic improvements, which I plan to squash into
> > 0002.
>
> For some reason v2-0003 didn't apply cleanly on my local branch.
> Possibly v2-0002 is a modified version of my 0002.

Ah, right -- I probably tweaked your 0002 a bit before splitting out
my changes into a separate patch.

> In order to not
> disturb your patchset, I have attached my edits as a diff. If you find
> those useful, apply them to 0003 and then squash it into 0002.
>
> Here are some more cosmetic comments on 0003. A bit of discussion
> might be needed. Hence did not edit myself.
>
> -/* Hash table entry in ec_derives_hash. */
> +/* Hash table entry used in ec_derives_hash. */
>
I think we don't need "used" here; "entry in hash table" is good enough. But I
am OK even if it stays that way.

Ok, let's drop "used".

> - add_derived_clauses(ec1, ec2->ec_derives_list);
> + /* Updates ec1's ec_derives_list and ec_derives_hash if present. */
> + ec_add_derived_clauses(ec1, ec2->ec_derives_list);
>
> Suggestion in the attached diff.

Makes sense, though I added it after a small wording tweak to avoid
implying that the hash tables are merged in some special way. So now
it reads:

-/* Updates ec1's ec_derives_list and ec_derives_hash if present. */
+/*
+ * Appends ec2's derived clauses to ec1->ec_derives_list and adds
+ * them to ec1->ec_derives_hash if present.
+ */

> @@ -396,7 +400,7 @@ process_equivalence(PlannerInfo *root,
> /* just to avoid debugging confusion w/ dangling pointers: */
> ec2->ec_members = NIL;
> ec2->ec_sources = NIL;
> - clear_ec_derived_clauses(ec2);
> + ec_clear_derived_clauses(ec2);
>
I pondered this naming convention when naming the functions. But
it seems it's not used everywhere in this file, or I am not able to see
the underlying naming rule, if any. So I used mixed naming. Let's
keep your names though. I think they are better.

Got it -- I went with the ec_ prefix mainly to make the new additions
self-consistent, since the file doesn’t seem to follow a strict naming
pattern. Glad the names work for you. While at it, I also applied the
same naming convention to two new functions I hadn’t touched earlier
for some reason.

> @@ -1403,6 +1403,17 @@ typedef struct JoinDomain
> * entry: consider SELECT random() AS a, random() AS b ... ORDER BY b,a.
> * So we record the SortGroupRef of the originating sort clause.
> *
> + * Derived equality clauses between EC members are stored in ec_derives_list.
>
> "clauses between EC members" doesn't read correctly - "clauses
> equating EC members" seems better. We actually don't need to mention
> "between EC members" at all - it's not relevant to the "size of the
> list" which is the subject of this paragraph. What do you think?

OK, I agree -- we don't need to mention EC members here. I've updated
the comment to keep the focus on the list itself.

> + * For small queries, this list is scanned directly during lookup. For larger
> + * queries -- e.g., with many partitions or joins -- a hash table
> + * (ec_derives_hash) is built for faster lookup. Both structures contain the
> + * same RestrictInfos and are maintained in parallel.
>
> If only the list exists, there is nothing to maintain in parallel. The
> sentence seems to indicate that both the hash table and the list are
> always present (and maintained in parallel). How about dropping "Both
> ... parallel" and modifying the previous sentence as " ... faster
> lookup when the list grows beyond a threshold" or something similar.
> The next sentence anyway mentions that both the structures are
> maintained.

Agree here too.  Here's the revised comment addressing these two points:

+ * Derived equality clauses are stored in ec_derives_list. For small queries,
+ * this list is scanned directly during lookup. For larger queries -- e.g.,
+ * with many partitions or joins -- a hash table (ec_derives_hash) is built
+ * when the list grows beyond a threshold, for faster lookup. When present,
+ * the hash table contains the same RestrictInfos and is maintained alongside
+ * the list. We retain the list even when the hash is used to simplify
+ * serialization (e.g., in _outEquivalenceClass()) and support
+ * EquivalenceClass merging.
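As a standalone illustration of the scheme that comment describes (hypothetical types and threshold, not the actual planner code): the list is always maintained, a hash table is built lazily once the list crosses a threshold, and both are kept in sync afterwards:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DERIVES_HASH_THRESHOLD 32	/* hypothetical cutoff */

typedef struct Clause
{
	int			leftid;
	int			rightid;
	struct Clause *next;		/* ec_derives_list-style linkage */
} Clause;

typedef struct EC
{
	Clause	   *derives_list;	/* always present */
	int			nderives;
	Clause	  **derives_hash;	/* lazily built; NULL while list is small */
	int			hash_size;
} EC;

static unsigned
clause_hash(int l, int r, int size)
{
	return ((unsigned) (l * 1000003 + r)) % size;
}

static void
hash_insert(EC *ec, Clause *c)
{
	unsigned	i = clause_hash(c->leftid, c->rightid, ec->hash_size);

	while (ec->derives_hash[i])	/* linear probing; table never fills */
		i = (i + 1) % ec->hash_size;
	ec->derives_hash[i] = c;
}

static void
add_derived(EC *ec, Clause *c)
{
	c->next = ec->derives_list;
	ec->derives_list = c;
	ec->nderives++;
	if (ec->derives_hash == NULL && ec->nderives > DERIVES_HASH_THRESHOLD)
	{
		/* build the hash from the full list, including the new entry */
		ec->hash_size = ec->nderives * 4;
		ec->derives_hash = calloc(ec->hash_size, sizeof(Clause *));
		for (Clause *lc = ec->derives_list; lc; lc = lc->next)
			hash_insert(ec, lc);
	}
	else if (ec->derives_hash)
		hash_insert(ec, c);		/* keep both structures in sync */
}

static Clause *
find_derived(EC *ec, int l, int r)
{
	if (ec->derives_hash)
	{
		unsigned	i = clause_hash(l, r, ec->hash_size);

		for (; ec->derives_hash[i]; i = (i + 1) % ec->hash_size)
			if (ec->derives_hash[i]->leftid == l &&
				ec->derives_hash[i]->rightid == r)
				return ec->derives_hash[i];
		return NULL;
	}
	/* small queries: scan the list directly */
	for (Clause *c = ec->derives_list; c; c = c->next)
		if (c->leftid == l && c->rightid == r)
			return c;
	return NULL;
}
```

Retaining the list even after the hash exists mirrors the design point above: serialization and EC merging can walk the list without caring whether a hash was built.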

I've merged my delta + your suggested changes as discussed above into 0002.

Btw, about ec_clear_derived_clauses():

@@ -749,7 +749,7 @@ remove_rel_from_eclass(EquivalenceClass *ec,
SpecialJoinInfo *sjinfo,
  * drop them.  (At this point, any such clauses would be base restriction
  * clauses, which we'd not need anymore anyway.)
  */
-ec->ec_derives = NIL;
+ec_clear_derived_clauses(ec);

Re: Proposal - Allow extensions to set a Plan Identifier

2025-03-25 Thread Lukas Fittl
On Tue, Mar 25, 2025 at 12:13 AM Michael Paquier 
wrote:

> On Mon, Mar 24, 2025 at 06:47:59PM -0500, Sami Imseih wrote:
> > I know it was mentioned above by both Michael and Andrei that
> > AppendJumble should not be exposed. I am not sure I agree with
> > that. I think it should be exposed, along with
> > JUMBLE_FIELD, JUMBLE_FIELD_SINGLE and JUMBLE_STRING
> > and _jumbleList.
> >
> > I am afraid that if we don't expose these routines/macros, the
> > extension will have
> > to re-invent those wheels. This is not included in v1-0001, but If
> > there is no objection,
> > I can update the patch. Thoughts?
>
> Hmm, yes, the macros could prove useful by allowing extensions to
> compile into their own code the manipulations they want to perform on
> the node structures they'd like to touch, but wouldn't that mean that
> they would copy in some way a portion of what gen_node_support.pl does?
>

I agree with Sami that we should expose AppendJumble (it'd be necessary for
any extension that wants to re-use the jumbling logic), and I don't see a
reason to skip over the convenience macros, but also not hard to recreate
those. AFAIK you could get by without having access to _jumbleList, since
you can just call JumbleNode which can jumble lists.

The initial patch I had posted over at the other thread [0], (patch 0003 to
be precise in the initial set, which was partially based on work Sami had
done previously), showed what I think is the minimum we should enable:

case T_IndexScan:
{
  IndexScan  *splan = (IndexScan *) plan;
  ...
  JUMBLE_VALUE(splan->indexid);
  JumbleNode(jstate, (Node *) splan->indexqual);
  ...

i.e. allow someone to defer to core logic for expressions, but require them
to implement their own (manual) way of writing the jumble
functions/conditions for the more limited set of expression-containing
nodes they care about (like plan nodes).

Of course, thinking about other use cases, I could imagine someone would
potentially be interested to change core jumbling logic for only a specific
node (e.g. implementing a query ID that considers constant values to be
significant), but I'd argue that making that level of customization work is
out of scope / hard to do in general.

> Perhaps we should think more carefully about this part, and refactor
> this script to work as a perl module so as it could be reused by some
> external code, at least for the parsing of the headers and the
> attributes assigned to each nodes and their attributes?
>

I've been thinking about how to rework things for a PG18-requiring
pg_stat_plans extension I'd like to publish, and if I were to choose a node
attribute-based approach I'd probably:

1. Check out the upstream Postgres source for the given version I'm
generating jumble code for
2. Modify the headers as needed to have the node attributes I want
3. Call the gen_node_support.pl via the build process
4. Copy out the resulting jumble node funcs/conds

Even with the state currently committed this is already possible, but (4)
ends up with a lot of duplicated code in the extension. Assuming we have a
way to call jumbleNode + AppendJumble and macros, that step could only keep
the jumble conds/funcs that are not defined in core.

So based on that workflow I'm not sure turning gen_node_support.pl into a
re-usable module would help, but that's just my perspective without having
built this out (yet).

Thanks,
Lukas

[0]:
https://www.postgresql.org/message-id/flat/CAP53Pkyow59ajFMHGpmb1BK9WHDypaWtUsS_5DoYUEfsa_Hktg%40mail.gmail.com

-- 
Lukas Fittl


Re: Options to control remote transactions’ access/deferrable modes in postgres_fdw

2025-03-25 Thread Etsuro Fujita
On Mon, Mar 3, 2025 at 4:49 PM Tom Lane  wrote:
> Ashutosh Bapat  writes:
> > On Sun, Mar 2, 2025 at 5:14 PM Etsuro Fujita  
> > wrote:
> >> To avoid that, I would like to propose a server option,
> >> inherit_read_only, to open the remote transactions in read-only mode
> >> if the local transaction is read-only.
>
> > Why do we need a server option. Either we say that a local READ ONLY
> > transaction causing modifications on the foreign server is problematic
> > or it's expected. But what's the point in giving that choice to the
> > user? If we deem the behaviour problematic it should be considered as
> > a bug and we should fix it. Otherwise not fix it.
>
> I tend to agree with Ashutosh's position here.  Reasoning about
> issues like this is hard enough already.  Having to figure out an
> application's behavior under more than one setting makes it harder.
> You may argue that "then the application can choose the behavior it
> likes, so there's no need to figure out both behaviors".  But for a
> lot of bits of code, that's not the situation; rather, they have to
> be prepared to work under both settings, because someone else is
> in charge of what the setting is.  (I don't know if either of you
> recall our disastrous attempt at server-side autocommit, back around
> 7.3.  The reason that got reverted was exactly that there was too
> much code that had to be prepared to work under either setting,
> and it was too hard to make that happen.  So now I look with great
> suspicion at anything that complicates our transactional behavior.)
>
> >> I would also like to propose a server option, inherit_deferrable, to
> >> open the remote transactions in deferrable mode if the local
> >> transaction is deferrable.
>
> > The documentation about deferrable is quite confusing. It says "The
> > DEFERRABLE transaction property has no effect unless the transaction
> > is also SERIALIZABLE and READ ONLY." But it doesn't tell what's the
> > effect of deferrable transaction. But probably we don't need a server
> > option here as well.
>
> Yeah, same with this: we should either change it or not.  Multiple
> possible transactional behaviors don't do anyone any favors.

I thought some users might rely on the current behavior that remote
transactions can write even if the local transaction is READ ONLY, so
it would be a good idea to add a server option for backwards
compatibility, but I agree that such an option would cause another,
more difficult problem you mentioned above.

I think the current behavior is not easy to understand, causing
unexpected results in a READ ONLY transaction as shown upthread, so I
would like to instead propose to fix it (for HEAD only as there are no
reports on this issue if I remember correctly).  I think those relying
on the current behavior should just re-declare their transactions as
READ WRITE.  Attached is an updated version of the patch, which fixes
the deferrable case as well.

In the patch I also fixed a bug; I trusted XactReadOnly to see if the
local transaction is READ ONLY, but I noticed that that is not 100%
correct, because a transaction which started as READ WRITE can show as
READ ONLY later within subtransactions, so I modified the patch so
that postgres_fdw opens remote transactions in READ ONLY mode if the
local transaction has been declared READ ONLY at the top level.

Thank you both!

Best regards,
Etsuro Fujita


Inherit-xact-properties-in-postgres-fdw-v2.patch
Description: Binary data


Re: Options to control remote transactions’ access/deferrable modes in postgres_fdw

2025-03-25 Thread Etsuro Fujita
On Mon, Mar 3, 2025 at 1:51 PM Ashutosh Bapat
 wrote:
> On Sun, Mar 2, 2025 at 5:14 PM Etsuro Fujita  wrote:
> > postgres_fdw opens remote transactions in read/write mode in a local
> > transaction even if the local transaction is read-only.  I noticed
> > that this leads to surprising behavior like this:

> I am having a hard time deciding whether this is problematic behaviour
> or not. Maybe the way example is setup - it's querying a view on a
> remote database which doesn't return anything but modified data. If
> there is no modification happening on the foreign server it won't
> return any data. Thus we have no way to verify that the table changed
> because of a READ ONLY transaction which is not expected to change any
> data. Probably some other example which returns all the rows from test
> while modifying some of it might be better.

How about something like this?

CREATE TABLE loct (f1 int, f2 text);
CREATE FUNCTION locf() RETURNS SETOF loct LANGUAGE SQL AS
  'UPDATE public.loct SET f2 = f2 || f2 RETURNING *';
CREATE VIEW locv AS SELECT t.* FROM locf() t;
CREATE FOREIGN TABLE remt (f1 int, f2 text)
  SERVER loopback OPTIONS (table_name 'locv');
INSERT INTO loct VALUES (1, 'foo'), (2, 'bar');
SELECT * FROM loct;
 f1 | f2
----+-----
  1 | foo
  2 | bar
(2 rows)

SELECT * FROM remt;  -- should work
 f1 |   f2
----+--------
  1 | foofoo
  2 | barbar
(2 rows)

SELECT * FROM loct;
 f1 |   f2
----+--------
  1 | foofoo
  2 | barbar
(2 rows)

I added this test case to the updated patch [1].

Thanks for the comments!

Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/CAPmGK14Btk0odkH6vwBhBGjCexmmWcM_D3DG0pJtObj8k_Unag%40mail.gmail.com




Re: AIO v2.5

2025-03-25 Thread Thomas Munro
On Tue, Mar 25, 2025 at 2:18 PM Andres Freund  wrote:
> Attached v2.12, with the following changes:

Here's a tiny fixup to make io_concurrency=0 turn on
READ_BUFFERS_SYNCHRONOUSLY as mooted in a FIXME.  Without this, AIO
will still run at level 1 even if you asked for 0.  Feel free to
squash, or ignore and I'll push it later, whatever suits... (tested on
the tip of your public aio-2 branch).


0001-fixup-to-v2.12-0010.fixup
Description: Binary data


Re: Support for NO INHERIT to INHERIT state change with named NOT NULL constraints

2025-03-25 Thread Peter Eisentraut

On 05.03.25 13:56, Alvaro Herrera wrote:

On 2025-Mar-03, Suraj Kharage wrote:


Thanks Alvaro for the review and fixup patch.

I agree with your changes and merged that into the main patch along with a
couple of other changes.

Please find attached v6 for further review.


Thanks, I have pushed this.  I made some changes to the tests, first by
renaming the tables to avoid too generic names, and second to try and
exercise everything about once.


A patch in the NOT ENFORCED constraints patch series proposes to 
refactor some of the code added by this patch series ([0] patch 
v18-0001).  I noticed that the code paths from this patch series do not 
call InvokeObjectPostAlterHook() or CacheInvalidateRelcache() when a 
constraint is altered.  Was this intentional?  If not, I can fix it as 
part of that other patch, just wanted to check here.



[0]: 
https://www.postgresql.org/message-id/caaj_b97ahsjgwhaurqi1jdwsjzd_ygwejqqvq_ddo8dycnn...@mail.gmail.com





Re: Fix 035_standby_logical_decoding.pl race conditions

2025-03-25 Thread Amit Kapila
On Fri, Mar 21, 2025 at 9:48 PM Bertrand Drouvot
 wrote:
>
> So, I'm not sure I like the idea that much, but thinking out loud: I wonder if
> we could bypass the "active" slot checks in 16 and 17 and use injection 
> points as
> proposed as of 18 (as we need the injection points changes proposed in 0001
> up-thread). Thoughts?
>

The key point is that snapshotConflictHorizon should always be greater
than or equal to oldestRunningXid for this test to pass. The challenge
is that vacuum logs the safest xid to be removed as
snapshotConflictHorizon, which I think will always be one or more
less than oldestRunningXid. So, we can't make it pass unless we
ensure that no running_xact record gets logged after the last
successful transaction (in this case the SQL passed to the function
wait_until_vacuum_can_remove) and until the vacuum is replayed on
the standby. I see that even check_for_invalidation('pruning_', $logstart,
'with on-access pruning') failed [1].

Seeing all these failures, I wonder whether we can reliably test
active slots apart from the wal_level change test (aka Scenario 6:
incorrect wal_level on primary). Sure, we can try by adding some
injection-point kind of tests, but is it really worth it, because the
active slots won't get invalidated anyway in the row-removal scenarios
we are testing here. The other possibility is to add a
developer-level debug_disable_running_xact GUC to test this and
similar cases; or can't we have an injection point to control logging
this WAL record? I have seen the need to control logging the
running_xact record in other cases as well.

[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-19%2007%3A08%3A16

-- 
With Regards,
Amit Kapila.




Re: Proposal - Allow extensions to set a Plan Identifier

2025-03-25 Thread Sami Imseih
> On 3/25/25 00:47, Sami Imseih wrote:
> > I know it was mentioned above by both Michael and Andrei that
> > AppendJumble should not be exposed. I am not sure I agree with
> > that. I think it should be exposed, along with
> > JUMBLE_FIELD, JUMBLE_FIELD_SINGLE and JUMBLE_STRING
> > and _jumbleList.
> It would be helpful if you could provide an example to support the
> reasoning behind exposing this stuff. I assume you want to influence the
> logic of jumbling, correct?

I was actually thinking more about the case Lukas mentions above [0],
which is jumbling a Plan. An extension having access to AppendJumble
(and the macros) will allow it to create a routine similar to
JumbleNode, but one that can deal with the plan nodes.

Of course, the same pattern can be applied to other types of nodes.

> Hmm, yes, the macros could prove useful by allowing extensions to
> compile into their own code the manipulations they want to perform on
> the node structures they'd like to touch, but wouldn't that mean that
> they would copy in some way a portion of what gen_node_support.pl does?

I feel like gen_node_support.pl should just deal with generating functions
for core jumbling logic, and extensions should be able to build out their
jumbling funcs, and do it simply if they have access to AppendJumble and
friends.

> So based on that workflow I'm not sure turning gen_node_support.pl into a 
> re-usable module
> would help, but that's just my perspective without having built this out 
> (yet).

> 1. Check out the upstream Postgres source for the given version I'm 
> generating jumble code for
> 2. Modify the headers as needed to have the node attributes I want
> 3. Call the gen_node_support.pl via the build process
> 4. Copy out the resulting jumble node funcs/conds

I like this approach, and the artifacts from #4 will be packed with
the extension.


[0] 
https://www.postgresql.org/message-id/CAP53Pkw7HD%2B3sssn5UcASWLn%2Ba9oRJOM9hXteDCg64JJ662bHQ%40mail.gmail.com

--
Sami Imseih
Amazon Web Services (AWS)




Re: Squash constant lists in query jumbling by default

2025-03-25 Thread Laurenz Albe
On Tue, 2025-03-25 at 17:28 +0100, Christoph Berg wrote:
> The "jumble names of temp tables" thread was briefly touching this [1],
> I'm starting a new thread since the others are already very long.
> 
> [1] 
> https://www.postgresql.org/message-id/flat/CAA5RZ0uNofEXfEfNw3uRN3D3oXkFPQ_s%2BhuLLHMKR_%2BMCk8RPQ%40mail.gmail.com#c357c56c3924642e8ef73cc1c8a0286e
> 
> Two points were made:
> 
> 1) this should be on by default
> 2) there should be no GUC for it.

+1 on both

Yours,
Laurenz Albe




Re: Allow default \watch interval in psql to be configured

2025-03-25 Thread Daniel Gustafsson
> On 24 Mar 2025, at 13:42, Ashutosh Bapat  wrote:

> LGTM. I think this is RFC. Updated CF entry.

Thanks all for review, committed.

--
Daniel Gustafsson





Re: AIO v2.5

2025-03-25 Thread Andres Freund
Hi,

On 2025-03-25 08:58:08 -0700, Noah Misch wrote:
> While having nagging thoughts that we might be releasing FDs before io_uring
> gets them into kernel custody, I tried this hack to maximize FD turnover:
> 
> static void
> ReleaseLruFiles(void)
> {
> #if 0
>   while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
>   {
>   if (!ReleaseLruFile())
>   break;
>   }
> #else
>   while (ReleaseLruFile())
>   ;
> #endif
> }
> 
> "make check" with default settings (io_method=worker) passes, but
> io_method=io_uring in the TEMP_CONFIG file got different diffs in each of two
> runs.  s/#if 0/#if 1/ (restore normal FD turnover) removes the failures.
> Here's the richer of the two diffs:

Yikes. That's a very good catch.

I spent a bit of time debugging this. I think I see what's going on - it turns
out that the kernel does *not* open the FDs during io_uring_enter() if
IOSQE_ASYNC is specified [1].  Which we do add heuristically, in an attempt to
avoid a small but measurable slowdown for sequential scans that are fully
buffered (c.f. pgaio_uring_submit()).  If I disable that heuristic, your patch
above passes all tests here.


I don't know if that's an intentional or unintentional behavioral difference.

There are 2 1/2 ways around this:

1) Stop using IOSQE_ASYNC heuristic
2a) Wait for all in-flight IOs when any FD gets closed
2b) Wait for all in-flight IOs using FD when it gets closed

Given that we have clear evidence that io_uring doesn't completely support
closing FDs while IOs are in flight, be it a bug or intentional, it seems
clearly better to go for 2a or 2b.
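A toy model of option 2b (all names hypothetical; real code would hook into fd.c and the AIO subsystem): track which fd each in-flight IO references, and drain just those IOs before the fd is actually handed back to the OS, so io_uring never works on a stale fd:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_IOS 8

/* Toy in-flight IO table; real code would use the AIO handle state */
static int	inflight_fd[MAX_IOS];
static bool inflight[MAX_IOS];
static int	completed;

static void
io_start(int slot, int fd)
{
	inflight[slot] = true;
	inflight_fd[slot] = fd;
}

static void
io_wait(int slot)				/* stand-in for waiting on one IO */
{
	if (inflight[slot])
	{
		inflight[slot] = false;
		completed++;
	}
}

static void
drain_ios_for_fd(int fd)
{
	for (int i = 0; i < MAX_IOS; i++)
		if (inflight[i] && inflight_fd[i] == fd)
			io_wait(i);
}

static int
close_checked(int fd)
{
	drain_ios_for_fd(fd);		/* option 2b: wait only for IOs using this fd */
	return 0;					/* real code would call close(fd) here */
}
```

Option 2a is the same shape minus the fd filter: drain everything in flight on any close. 2b is more surgical but needs per-IO fd tracking.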

Greetings,

Andres Freund


[1] Instead, files are opened when the queue entry is being worked on.
Interestingly, that only happens when the IO is *explicitly*
requested to be executed in the workqueue with IOSQE_ASYNC, not when it's
put there because it couldn't be done in a non-blocking way.




Re: Statistics Import and Export

2025-03-25 Thread Corey Huinker
>
> At this point, I feel I've demonstrated the limit of what can be made into
> WARNINGs, giving us a range of options for now and into the beta. I'll
> rebase and move the 0002 patch to be in last position so as to tee up
> 0003-0004 for consideration.
>

And here's the rebase (after bde2fb797aaebcbe06bf60f330ba5a068f17dda7).

The order of the patches is different, but the purpose of each is the same
as before.
From 1f9b2578f55fa1233121bcf5949a6f69d6cf8cee Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Fri, 14 Mar 2025 03:54:26 -0400
Subject: [PATCH v10 2/4] Batching getAttributeStats().

The prepared statement getAttributeStats() is fairly heavyweight and
could greatly increase pg_dump/pg_upgrade runtime. To alleviate this,
create a result set buffer of all of the attribute stats fetched for a
batch of 100 relations that could potentially have stats.

The query ensures that the order of results exactly matches the needs of
the code walking the TOC to print the stats calls.
---
 src/bin/pg_dump/pg_dump.c | 554 ++
 1 file changed, 383 insertions(+), 171 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 224dc8c9330..e3f2dac33ec 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -143,6 +143,25 @@ typedef enum OidOptions
 	zeroAsNone = 4,
 } OidOptions;
 
+typedef enum StatsBufferState
+{
+	STATSBUF_UNINITIALIZED = 0,
+	STATSBUF_ACTIVE,
+	STATSBUF_EXHAUSTED
+}			StatsBufferState;
+
+typedef struct
+{
+	PGresult   *res;			/* results from most recent
+ * getAttributeStats() */
+	int			idx;			/* first un-consumed row of results */
+	TocEntry   *te;/* next TOC entry to search for statistics
+ * data */
+
+	StatsBufferState state;		/* current state of the buffer */
+}			AttributeStatsBuffer;
+
+
 /* global decls */
 static bool dosync = true;		/* Issue fsync() to make dump durable on disk. */
 
@@ -209,6 +228,18 @@ static int	nbinaryUpgradeClassOids = 0;
 static SequenceItem *sequences = NULL;
 static int	nsequences = 0;
 
+static AttributeStatsBuffer attrstats =
+{
+	NULL, 0, NULL, STATSBUF_UNINITIALIZED
+};
+
+/*
+ * The maximum number of relations that should be fetched in any one
+ * getAttributeStats() call.
+ */
+
+#define MAX_ATTR_STATS_RELS 100
+
 /*
  * The default number of rows per INSERT when
  * --inserts is specified without --rows-per-insert
@@ -222,6 +253,8 @@ static int	nsequences = 0;
  */
 #define MAX_BLOBS_PER_ARCHIVE_ENTRY 1000
 
+
+
 /*
  * Macro for producing quoted, schema-qualified name of a dumpable object.
  */
@@ -399,6 +432,9 @@ static void setupDumpWorker(Archive *AH);
 static TableInfo *getRootTableInfo(const TableInfo *tbinfo);
 static bool forcePartitionRootLoad(const TableInfo *tbinfo);
 static void read_dump_filters(const char *filename, DumpOptions *dopt);
+static void appendNamedArgument(PQExpBuffer out, Archive *fout,
+const char *argname, const char *argtype,
+const char *argval);
 
 
 int
@@ -10520,7 +10556,286 @@ statisticsDumpSection(const RelStatsInfo *rsinfo)
 }
 
 /*
- * printDumpRelationStats --
+ * Fetch next batch of rows from getAttributeStats()
+ */
+static void
+fetchNextAttributeStats(Archive *fout)
+{
+	ArchiveHandle *AH = (ArchiveHandle *) fout;
+	PQExpBufferData schemas;
+	PQExpBufferData relations;
+	int			numoids = 0;
+
+	Assert(AH != NULL);
+
+	/* free last result set, if any */
+	if (attrstats.state == STATSBUF_ACTIVE)
+		PQclear(attrstats.res);
+
+	/* If we have looped around to the start of the TOC, restart */
+	if (attrstats.te == AH->toc)
+		attrstats.te = AH->toc->next;
+
+	initPQExpBuffer(&schemas);
+	initPQExpBuffer(&relations);
+
+	/*
+	 * Walk ahead looking for relstats entries that are active in this
+	 * section, adding the names to the schemas and relations lists.
+	 */
+	while ((attrstats.te != AH->toc) && (numoids < MAX_ATTR_STATS_RELS))
+	{
+		if (attrstats.te->reqs != 0 &&
+			strcmp(attrstats.te->desc, "STATISTICS DATA") == 0)
+		{
+			RelStatsInfo *rsinfo = (RelStatsInfo *) attrstats.te->createDumperArg;
+
+			Assert(rsinfo != NULL);
+
+			if (numoids > 0)
+			{
+appendPQExpBufferStr(&schemas, ",");
+appendPQExpBufferStr(&relations, ",");
+			}
+			appendPQExpBufferStr(&schemas, fmtId(rsinfo->dobj.namespace->dobj.name));
+			appendPQExpBufferStr(&relations, fmtId(rsinfo->dobj.name));
+			numoids++;
+		}
+
+		attrstats.te = attrstats.te->next;
+	}
+
+	if (numoids > 0)
+	{
+		PQExpBufferData query;
+
+		initPQExpBuffer(&query);
+		appendPQExpBuffer(&query,
+		  "EXECUTE getAttributeStats('{%s}'::pg_catalog.text[],'{%s}'::pg_catalog.text[])",
+		  schemas.data, relations.data);
+		attrstats.res = ExecuteSqlQuery(fout, query.data, PGRES_TUPLES_OK);
+		attrstats.idx = 0;
+	}
+	else
+	{
+		attrstats.state = STATSBUF_EXHAUSTED;
+		attrstats.res = NULL;
+		attrstats.idx = -1;
+	}
+
+	termPQExpBuffer(&schemas);
+	termPQExpBuffer(&relations);
+}
+
+/*
+ * Prepare the getAttributeStats() statement

Re: SQL function which allows to distinguish a server being in point in time recovery mode and an ordinary replica

2025-03-25 Thread m . litsarev

Hi!


> I felt you might have missed attaching the test patches added at [1].

Well, the tests were written for the initial proposal, which (after
Michael's review and advice) has been fixed and updated. The original
tests became irrelevant; that is why I dropped them.

> This change is not required:

Placed back the empty line. The v7 patch is attached.


> Currently we have the following commitfest entries for this thread:
> [1] - https://commitfest.postgresql.org/patch/5611/
> [2] - https://commitfest.postgresql.org/patch/5513/
> I have closed the second entry at [2].

I mistakenly pushed the patch twice. Thank you for managing that.

Respectfully,

Mikhail Litsarev,
Postgres Professional: https://postgrespro.com
From 0e7f61589c282a2e9b82cdbf8eadc2159301c3e8 Mon Sep 17 00:00:00 2001
From: Mikhail Litsarev 
Date: Tue, 25 Mar 2025 23:02:23 +0300
Subject: [PATCH v7 2/2] Wrapper function to extract whole text array from
 recovery flags bitset.

It returns SX_PROMOTE_IS_TRIGGERED, SX_STANDBY_MODE_REQUESTED flags for recovery states.
---
 src/backend/access/transam/xlogfuncs.c| 31 +++
 src/backend/access/transam/xlogrecovery.c | 16 
 src/include/access/xlogrecovery.h |  1 +
 src/include/catalog/pg_proc.dat   |  5 
 4 files changed, 53 insertions(+)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 8c3090165f0..9b71ff7583c 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -30,6 +30,7 @@
 #include "storage/fd.h"
 #include "storage/latch.h"
 #include "storage/standby.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
@@ -748,3 +749,33 @@ pg_promote(PG_FUNCTION_ARGS)
 		   wait_seconds)));
 	PG_RETURN_BOOL(false);
 }
+
+Datum
+pg_get_recovery_flags(PG_FUNCTION_ARGS)
+{
+/*
+ * Currently supported number of recovery flags is equal to two:
+ * {SX_PROMOTE_IS_TRIGGERED, SX_STANDBY_MODE_REQUESTED}.
+ * The SX_STANDBY_MODE_REQUESTED is valid only in the startup process.
+ */
+#define MAX_RECOVERY_FLAGS 2
+
+	bits32			recovery_flags;
+	intcnt = 0;
+	Datum			flags[MAX_RECOVERY_FLAGS];
+	ArrayType	   *txt_arr;
+
+	recovery_flags = GetXLogRecoveryFlags();
+
+	if (recovery_flags & SX_PROMOTE_IS_TRIGGERED)
+		flags[cnt++] = CStringGetTextDatum("PROMOTE_IS_TRIGGERED");
+
+	if (recovery_flags & SX_STANDBY_MODE_REQUESTED)
+		flags[cnt++] = CStringGetTextDatum("STANDBY_MODE_REQUESTED");
+
+	Assert(cnt <= MAX_RECOVERY_FLAGS);
+
+	/* Returns bit array as Datum */
+	txt_arr = construct_array_builtin(flags, cnt, TEXTOID);
+	PG_RETURN_ARRAYTYPE_P(txt_arr);
+}
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9d8c1cdde43..bd5bdeb5db8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -4370,6 +4370,22 @@ StartupRequestWalReceiverRestart(void)
 	}
 }
 
+/*
+ * Return SX_PROMOTE_IS_TRIGGERED, SX_STANDBY_MODE_REQUESTED flags for
+ * recovery states.
+ */
+bits32 GetXLogRecoveryFlags(void)
+{
+	bits32		flags = 0;
+
+	SpinLockAcquire(&XLogRecoveryCtl->info_lck);
+	flags = XLogRecoveryCtl->sharedRecoveryFlags;
+	SpinLockRelease(&XLogRecoveryCtl->info_lck);
+
+	flags |= (localRecoveryFlags & SX_STANDBY_MODE_REQUESTED);
+
+	return flags;
+}
 
 /*
  * Has a standby promotion already been triggered?
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 2d44905ea36..faa6c666ede 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -172,6 +172,7 @@ extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
 extern XLogRecPtr GetCurrentReplayRecPtr(TimeLineID *replayEndTLI);
 
+extern bits32 GetXLogRecoveryFlags(void);
 extern bool PromoteIsTriggered(void);
 extern bool CheckPromoteSignal(void);
 extern void WakeupRecovery(void);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3f7b82e02bb..8b775f0bbd9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6714,6 +6714,11 @@
   proname => 'pg_is_in_recovery', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_is_in_recovery' },
 
+{ oid => '8439',
+  descr => 'return flags for recovery states',
+  proname => 'pg_get_recovery_flags', provolatile => 'v', prorettype => '_text',
+  proargtypes => '', prosrc => 'pg_get_recovery_flags' },
+
 { oid => '3820', descr => 'current wal flush location',
   proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
   prorettype => 'pg_lsn', proargtypes => '',
-- 
2.34.1

From 1667c30af357943036359139c1f3442f576413d4 Mon Sep 17 00:00:00 2001
From: Mikhail Litsarev 
Date: Tue, 25 Mar 2025 22:56:09 +0300
Subject: [PATCH v7 1/2] Replace recovery boolean flags with a bits32 set.

Move local booleans ArchiveRecoveryRequested, InArc

Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

2025-03-25 Thread David Rowley
On Wed, 26 Mar 2025 at 01:11, David Rowley  wrote:
> I'm happy to proceed with the v7 version. I'm also happy to credit you
> as the primary author of it, given that you came up with the v2b
> patch.  It might be best to extract the performance stuff that's in v7
> and apply that separately from the bug fix. If you're still interested
> and have time in the next 7 hours to do it, would you be able to split
> the v7 as described, making v8-0001 as the bug fix and v8-0002 as the
> performance stuff? If so, could you also add the total_jumble_len to
> v8-0001 like Michael's last patch had, but adjust so it's only
> maintained for Assert builds. Michael is quite keen on having more
> assurance that custom jumble functions don't decide to forego any
> jumbling.
>
> If you don't have time, I'll do it in my morning (UTC+13).

Here is the v8 version with the bug fix and performance stuff
separated out. I also added the Assert code to ensure we always add
something to the jumble buffer when jumbling a node.

I didn't include the regression tests I saw in the v2b patch [1]. I
felt these were just marking the fact that there used to be a bug
here.

David

[1] https://postgr.es/m/5ac172e0b77a4baba50671cd1a15285f%40localhost.localdomain


v8-0001-Fix-query-jumbling-to-account-for-NULL-nodes.patch
Description: Binary data


v8-0002-Optimize-Query-jumble.patch
Description: Binary data


Re: Cannot find a working 64-bit integer type on Illumos

2025-03-25 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Mar 26, 2025 at 10:34 AM Nathan Bossart
>  wrote:
>> In file included from pg_regress.c:34:
>> /usr/local/include/libpq-fe.h:623:8: error: unknown type name 'pg_int64'

> Looks like it's mixing up /usr/local/include and our source tree...

Yeah.  That's because the compile command for pg_regress.c has
-I../../../src/include/libpq too late, after -I switches added
for other things:

ccache cc -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new 
-Wendif-labels -Wmissing-format-attribute -Wcast-function-type 
-Wformat-security -Wmissing-variable-declarations -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -Wno-unused-command-line-argument 
-Wno-compound-token-split-by-macro -Wno-format-truncation 
-Wno-cast-function-type-strict -g -O2 -fPIC -DPIC -fvisibility=hidden 
-I../../../src/include -I/usr/local/include -I/usr/local/include/libxml2  
-I/usr/local/include -I../../../src/port -I../../../src/interfaces/libpq 
'-DHOST_TUPLE="x86_64-unknown-freebsd15.0"' '-DSHELLPROG="/bin/sh"'  -c -o 
pg_regress.o pg_regress.c

How did that work before?  Perhaps somebody just now added a libpq
dependency to pg_regress.c?

regards, tom lane




Re: query_id: jumble names of temp tables for better pg_stat_statement UX

2025-03-25 Thread Sami Imseih


> If I get the difference right, semantically would apply to concepts
> related to linguistics, but that's not what we have here, so you are
> using a more general term.

FWIW, the pg_stat_statements docs in a few places refer to 
queries that may look different but have the same meaning 
as “semantically equivalent”, which is why I used the same 
terminology here.  But I have no issue with the simplified
rewrite either.

The patch LGTM as well. 

Regards,


Sami Imseih 
Amazon Web Services (AWS)



Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

2025-03-25 Thread Michael Paquier
On Wed, Mar 26, 2025 at 11:56:50AM +1300, David Rowley wrote:
> Here is the v8 version with the bug fix and performance stuff
> separated out.

Why not.  I assume that you would merge these together?

> I also added the Assert code to ensure we always add
> something to the jumble buffer when jumbling a node.

Thanks.  This part looks good with its USE_ASSERT_CHECKING.

> I didn't include the regression tests I saw in the v2b patch [1]. I
> felt these were just marking the fact that there used to be a bug
> here.

I disagree about the lack of tests; these are cheap to add.  I'd
suggest not using the ones posted as part of v2 variant B, since tests
that do not require the creation of a relation can be made cheaper,
and instead using the set I posted in patch 0001 here, which covers
most of the main scenarios from the parser with these nodes in the
Query:
https://www.postgresql.org/message-id/z9z85ui5lpkpn...@paquier.xyz

It is a bit disappointing that we require JumbleQuery() to jumble the
NULLs.  There are discussions about making this code available not
only for Querys, but for Nodes, and the NULL computations would be
missed.
--
Michael


signature.asc
Description: PGP signature


Re: AIO v2.5

2025-03-25 Thread Noah Misch
On Tue, Mar 25, 2025 at 08:17:17PM -0400, Andres Freund wrote:
> On 2025-03-25 09:15:43 -0700, Noah Misch wrote:
> > On Tue, Mar 25, 2025 at 11:57:58AM -0400, Andres Freund wrote:
> > > FWIW, I prototyped this, it's not hard.
> > > 
> > > But it can't replace the current WARNING with 100% fidelity: If we read 60
> > > blocks in a single smgrreadv, we today would would emit 60 WARNINGs.  But 
> > > we
> > > can't encode that many block offset in single PgAioResult, there's not 
> > > enough
> > > space, and enlarging it far enough doesn't seem to make sense either.
> > > 
> > > 
> > > What we *could* do is to emit one WARNING for each bufmgr.c 
> > > smgrstartreadv(),
> > > with that warning saying that there were N zeroed blocks in a read from 
> > > block
> > > N to block Y and a HINT saying that there are more details in the server 
> > > log.
> 
> It should probably be DETAIL, not HINT...

Either is fine with me.  I would go for HINT if referring to the server log,
given the precedent of errhint("See server log for query details.").  DETAIL
fits for block counts, though:

> Could use some input on the framing of the message/detail. Right now it's:
> 
> ERROR:  invalid page in block 8 of relation base/5/16417
> DETAIL: Read of 8 blocks, starting at block 7, 1 other pages in the same read 
> are invalid.
> 
> But that doesn't seem great. Maybe:
> 
> DETAIL: Read of blocks 7..14, 1 other pages in the same read were also 
> invalid.
> 
> But that still isn't really a sentence.

How about this for the multi-page case:

WARNING: zeroing out %u invalid pages among blocks %u..%u of relation %s
DETAIL:  Block %u held first invalid page.
HINT: See server log for the other %u invalid blocks.


For the one-page case, the old message can stay:

WARNING:  invalid page in block %u of relation %s; zeroing out page




Re: AIO v2.5

2025-03-25 Thread Noah Misch
On Mon, Mar 24, 2025 at 09:18:06PM -0400, Andres Freund wrote:
> Attached v2.12, with the following changes:

> TODO:

>   Wonder if it's worth adding some coverage for when checksums are disabled?
>   Probably not necessary?

Probably not necessary, agreed.  Orthogonal to AIO, it's likely worth a CI
"SPECIAL" and/or buildfarm animal that runs all tests w/ checksums disabled.


> Subject: [PATCH v2.12 01/28] aio: Be more paranoid about interrupts

Ready for commit


> Subject: [PATCH v2.12 02/28] aio: Pass result of local callbacks to
>  ->report_return

Ready for commit w/ up to one cosmetic change:

> @@ -296,7 +299,9 @@ pgaio_io_call_complete_local(PgAioHandle *ioh)
>  
>   /*
>* Note that we don't save the result in ioh->distilled_result, the 
> local
> -  * callback's result should not ever matter to other waiters.
> +  * callback's result should not ever matter to other waiters. However, 
> the
> +  * local backend does care, so we return the result as modified by local
> +  * callbacks, which then can be passed to ioh->report_return->result.
>*/
>   pgaio_debug_io(DEBUG3, ioh,
>  "after local completion: distilled result: 
> (status %s, id %u, error_data %d, result %d), raw_result: %d",

Should this debug message remove the word "distilled", since this commit
solidifies distilled_result as referring to the complete_shared result?


> Subject: [PATCH v2.12 03/28] aio: Add liburing dependency

Ready for commit


> Subject: [PATCH v2.12 04/28] aio: Add io_method=io_uring

Ready for commit w/ open_fd.fixup


> Subject: [PATCH v2.12 05/28] aio: Implement support for reads in smgr/md/fd

Ready for commit w/ up to two cosmetic changes:

> +/*
> + * AIO error reporting callback for mdstartreadv().
> + *
> + * Errors are encoded as follows:
> + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0

I recommend replacing "errno != 0" with either "that errno" or "errno ==
error_data".

Second, the aio_internal.h comment changes discussed in
postgr.es/m/20250325155808.f7.nmi...@google.com and earlier.


> Subject: [PATCH v2.12 06/28] aio: Add README.md explaining higher level design

Ready for commit

(This and the previous patch have three spots that would change with the
s/prep/start/ renames.  No opinion on whether to rename before or rename
after.)


> Subject: [PATCH v2.12 07/28] localbuf: Track pincount in BufferDesc as well

The plan here looks good:
postgr.es/m/dbeeaize47y7esifdrinpa2l7cqqb67k72exvuf3appyxywjnc@7bt76mozhcy2


> Subject: [PATCH v2.12 08/28] bufmgr: Implement AIO read support

See review here and later discussion:
postgr.es/m/20250325022037.91.nmi...@google.com


> Subject: [PATCH v2.12 09/28] bufmgr: Use AIO in StartReadBuffers()

Ready for commit after a batch of small things, all but one of which have no
implications beyond code cosmetics.  This is my first comprehensive review of
this patch.  I like the test coverage (by the end of the patch series).  For
anyone else following, I found "diff -w" helpful for the bufmgr.c changes.
That's because a key part is former WaitReadBuffers() code moving up an
indentation level to its home in new subroutine AsyncReadBuffers().

>   Assert(*nblocks == 1 || allow_forwarding);
>   Assert(*nblocks > 0);
>   Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
> + Assert(*nblocks == 1 || allow_forwarding);

Duplicates the assert three lines back.

> + nblocks = aio_ret->result.result;
> +
> + elog(DEBUG3, "partial read, will retry");
> +
> + }
> + else if (aio_ret->result.status == PGAIO_RS_ERROR)
> + {
> + pgaio_result_report(aio_ret->result, &aio_ret->target_data, 
> ERROR);
> + nblocks = 0;/* silence compiler */
> + }
>  
>   Assert(nblocks > 0);
>   Assert(nblocks <= MAX_IO_COMBINE_LIMIT);
>  
> + operation->nblocks_done += nblocks;

I struggled somewhat from the variety of "nblocks" variables: this local
nblocks, operation->nblocks, actual_nblocks, and *nblocks in/out parameters of
some functions.  No one of them is clearly wrong to use the name, and some of
these names are preexisting.  That said, if you see opportunities to push in
the direction of more-specific names, I'd welcome it.

For example, this local variable could become add_to_nblocks_done instead.

> + AsyncReadBuffers(operation, &nblocks);

I suggest renaming s/nblocks/ignored_nblocks_progress/ here.

> +  * If we need to wait for IO before we can get a handle, submit already
> +  * staged IO first, so that other backends don't need to wait. There

s/already staged/already-staged/.  Normally I'd skip this as nitpicking, but I
misread this particular sentence twice, as "submit" being the subject that
"staged" something.  (It's still nitpicking, alas.)

>   /*
>* How many neighboring-on-disk blocks can we scatter-read into 
> other
>

Re: query_id: jumble names of temp tables for better pg_stat_statement UX

2025-03-25 Thread Michael Paquier
On Tue, Mar 25, 2025 at 07:24:20PM -0400, Tom Lane wrote:
> fails to honor $query_jumble_ignore as the other if-branches do.
> Perhaps it's unlikely that a node would have both query_jumble_custom
> and query_jumble_ignore annotations, but still, the script would emit
> completely incorrect code if it did.  Also, the "# node type" comment
> should probably be moved down to within the first "elsif" block.

Oops, sorry about that.  Fixed both of these in 27ee6ede6bc9.

> I'd change "semantically similar queries" to "otherwise-similar
> queries"; I think writing "semantically" will just confuse people.

If I get the difference right, semantically would apply to concepts
related to linguistics, but that's not what we have here, so you are
using a more general term.

Thanks for the suggestions.
--
Michael
From 30f8066989eac6f8c72034d3fb5150368c2821a9 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 26 Mar 2025 09:17:41 +0900
Subject: [PATCH v7] Add custom query jumble function for RangeTblEntry.eref

---
 src/include/nodes/parsenodes.h| 11 +++---
 src/backend/nodes/queryjumblefuncs.c  | 19 ++
 doc/src/sgml/pgstatstatements.sgml|  9 +++--
 .../pg_stat_statements/expected/select.out| 20 ---
 4 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 23c9e3c5abf2..df331b1c0d99 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1050,8 +1050,13 @@ typedef struct RangeTblEntry
 	 */
 	/* user-written alias clause, if any */
 	Alias	   *alias pg_node_attr(query_jumble_ignore);
-	/* expanded reference names */
-	Alias	   *eref pg_node_attr(query_jumble_ignore);
+
+	/*
+	 * Expanded reference names.  This uses a custom query jumble function so
+	 * that the table name is included in the computation, but not its list of
+	 * columns.
+	 */
+	Alias	   *eref pg_node_attr(custom_query_jumble);
 
 	RTEKind		rtekind;		/* see above */
 
@@ -1094,7 +1099,7 @@ typedef struct RangeTblEntry
 	 * tables to be invalidated if the underlying table is altered.
 	 */
 	/* OID of the relation */
-	Oid			relid;
+	Oid			relid pg_node_attr(query_jumble_ignore);
 	/* inheritance requested? */
 	bool		inh;
 	/* relation kind (see pg_class.relkind) */
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index f8b0f91704ba..62d6cfb7ac15 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -67,6 +67,9 @@ static void _jumbleElements(JumbleState *jstate, List *elements);
 static void _jumbleA_Const(JumbleState *jstate, Node *node);
 static void _jumbleList(JumbleState *jstate, Node *node);
 static void _jumbleVariableSetStmt(JumbleState *jstate, Node *node);
+static void _jumbleRangeTblEntry_eref(JumbleState *jstate,
+	  RangeTblEntry *rte,
+	  Alias *expr);
 
 /*
  * Given a possibly multi-statement source string, confine our attention to the
@@ -516,3 +519,19 @@ _jumbleVariableSetStmt(JumbleState *jstate, Node *node)
 	JUMBLE_FIELD(is_local);
 	JUMBLE_LOCATION(location);
 }
+
+/*
+ * Custom query jumble function for RangeTblEntry.eref.
+ */
+static void
+_jumbleRangeTblEntry_eref(JumbleState *jstate,
+		  RangeTblEntry *rte,
+		  Alias *expr)
+{
+	JUMBLE_FIELD(type);
+
+	/*
+	 * This includes only the table name, the list of column names is ignored.
+	 */
+	JUMBLE_STRING(aliasname);
+}
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index f4e384e95aea..625b84ebfefd 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -675,8 +675,13 @@ calls | 2
things, the internal object identifiers appearing in this representation.
This has some counterintuitive implications.  For example,
pg_stat_statements will consider two apparently-identical
-   queries to be distinct, if they reference a table that was dropped
-   and recreated between the executions of the two queries.
+   queries to be distinct, if they reference for example a function that was
+   dropped and recreated between the executions of the two queries.
+   Conversely, if a table is dropped and recreated between the
+   executions of queries, two apparently-identical queries may be
+   considered the same. However, if the alias for a table is different
+   for otherwise-similar queries, these queries will be considered
+   distinct.
The hashing process is also sensitive to differences in
machine architecture and other facets of the platform.
Furthermore, it is not safe to assume that queryid
diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out
index 708c6b0e9c41..1eebc2898ab9 100644
--- a/contrib/pg_stat_statements/expected/select.out
+++ b/contrib/pg_stat_statements/expected/select.out
@@ -433,11 +433,10 @@ COMMIT;
 SELECT calls, query FROM p

Re: Proposal: Progressive explain

2025-03-25 Thread Rafael Thofehrn Castro
Hello Robert,

Fixed most of the recommendations. Going over one at a time.

> The documentation for the progressive_explain = { off | explain |
> analyze } option seems like it should go into more detail about how
> the "explain" and "analyze" values are different. I'm not 100% sure I
> know the answer, and I'm not the least-experienced person who will
> ever read this documentation.

Added details about behavior of each option. In the doc for that GUC
there is a link to another section that explains in detail how progressive
explains work.

> WrapMultiExecProcNodesWithExplain seems like a poor choice of name. It
> invites confusion with MultiExecProcNode, to which it is unrelated.

Renamed that function (and the unwrap equivalent) to
WrapMemberNodesExecProcNodesWithExplain
as MemberNodes is the name used by explain when parsing those types of
nodes.

> I do notice that WrapExecProcNodeWithExplain does not walk
> the ps->initPlan list, which I think is an oversight.

Fixed.

> I just went to some trouble to start breaking up the monolith that is
> explain.c, so I'm not that happy about seeing this patch dump another
> 800+ lines of source code into that file. Probably we should have a
> new source file for some or this, or maybe even more than one.

The whole progressive explain code was moved to a new set of files
explain_progressive.h and explain_progressive.c.

> The changes to explain.h add three new data types. Two of them begin
> with an uppercase letter and one with a lowercase letter. That seems
> inconsistent. I also don't think that the declaration of char plan[]
> is per PostgreSQL coding style. I believe we always write char
> plan[FLEXIBLE_ARRAY_MEMBER]. Also, maybe it should be named something
> other than plan, since it's really a string-ified explain-y kind of
> thing, not literally a Plan. Also, can we please not have structure
> members with single letter names? "h" and "p" are going to be
> completely ungreppable, and I like grepping

Done. Adjusted all data types to be uppercase. Added FLEXIBLE_ARRAY_MEMBER
and renamed "h" and "p".

> It looks very strange to me that ProgressiveExplainPrint() seems to
> have a theoretical possibility of generating the output and then
> throwing it away if we end up with entry == NULL. I guess maybe that
> case is not supposed to happen because ProgressiveExplainInit() is
> supposed to create the entry, but then why isn't this an elog(ERROR)
> or something instead of a no-op?

Changed to throw ERROR, which shouldn't happen.

> It seems like when we replace a longer entry with a shorter one, we
> forget that it was originally longer. Hence, if the length of a
> progressive EXPLAIN is alternately 2922 characters and 2923
> characters, we'll reallocate on every other progressive EXPLAIN
> instead of at most once.

Adjusted this logic. Structure ProgressiveExplainHashEntry now contains
a field to store the allocated size, which is used to compare with
the new plan being printed.

> Thanks for the pointer. I'm a bit skeptical about what's going on here
> in ProgressiveExplainReleaseFunc(). It seems like we're depending on
> shared memory to tell us whether we need to do purely backend-local
> cleanup, like calling disable_timeout() and resetting
> ProgressiveExplainPending and activeQueryDesc. I would have expected
> us to keep track in local memory of whether this kind of work needs to
> be done. It seems roundabout to take an LWLock, do a hash table lookup
> to see if there's an entry there, release the LWLock, and then very
> shortly thereafter take the lock a second time to release the entry
> that we now know is there.

> The comment in ProgressiveExplainCleanup about only detaching from the
> DSA if the query ended gracefully is not ideal from my point of view
> because it says what the code does instead of why the code does that
> thing. Also, the function is seemingly called with queryDesc as an
> argument not because you need it for anything but because you're going
> to test whether it's null. In that case, you could just pass a
> Boolean. Even then, something seems odd about this: why do we have to
> be holding ProgressiveExplainHashLock to dsa_detach the
> somewhat-inscrutably named area pe_a? And why are we not detaching it
> in case of error?

> I am wondering why you chose this relatively unusual error cleanup
> strategy. What I would have done is invent AtEOXact_ProgressiveExplain
> and AtSubEOXact_ProgressiveExplain. In some sense this looks simpler,
> because it doesn't need separate handling for transactions and
> subtransactions, but it's so unlike what we do in most places that
> it's hard for me to tell whether it's correct. I feel like the
> approach you've chosen here would make sense if what we wanted to do
> was basically release memory or some memory-like resource associated
> closely with the context -- e.g. expandedrecord.c releases a
> TupleDesc, but this is doing more than that.

> I think the effect of this choice is that clea

Squash constant lists in query jumbling by default

2025-03-25 Thread Christoph Berg
Re: Álvaro Herrera
> Introduce squashing of constant lists in query jumbling
> 
> pg_stat_statements produces multiple entries for queries like
> SELECT something FROM table WHERE col IN (1, 2, 3, ...)
> 
> depending on the number of parameters, because every element of
> ArrayExpr is individually jumbled.  Most of the time that's undesirable,
> especially if the list becomes too large.
> 
> Fix this by introducing a new GUC query_id_squash_values which modifies
> the node jumbling code to only consider the first and last element of a
> list of constants, rather than each list element individually.  This
> affects both the query_id generated by query jumbling, as well as
> pg_stat_statements query normalization so that it suppresses printing of
> the individual elements of such a list.
> 
> The default value is off, meaning the previous behavior is maintained.

The "jumble names of temp tables" thread briefly touched on this [1];
I'm starting a new thread since the others are already very long.

[1] 
https://www.postgresql.org/message-id/flat/CAA5RZ0uNofEXfEfNw3uRN3D3oXkFPQ_s%2BhuLLHMKR_%2BMCk8RPQ%40mail.gmail.com#c357c56c3924642e8ef73cc1c8a0286e

Two points were made:

1) this should be on by default
2) there should be no GUC for it.

For 1), Sami said "Why would anyone not want to squash the IN list?"
to which I can only agree. Michael agreed as well, that's already +3.

For 2), Tom said that configurability is 1) often much less useful
than originally planned, and 2) tools have to cope with both settings
anyway, making implementing them harder. Plus, switching at run-time
makes the result even less predictable.

So, I would propose that we drop the GUC and make it the default.
Opinions?

Christoph




Re: Squash constant lists in query jumbling by default

2025-03-25 Thread Sami Imseih
> > If this GUC sticks around, it should be at least PGC_SUSET (on
> > the analogy of compute_query_id) to make it harder to break
> > pg_stat_statements that way.
>
> I have no problem making it superuser-only, and I can see making "on" be
> the default.  I am not opposed to removing it completely either, if we
> really think that the current behavior is no longer useful for anybody.

I am in favor of complete removal. [1] will change the behavior of table
jumbling without introducing a GUC, and I don't think we should introduce
a GUC for the squash-values case either. Why should one behavior change be
configurable while the other is not? That seems confusing, IMO.

Also, as a matter of principle, it seems most are in favor of not
introducing GUCs to configure queryId behavior. I agree.

[1] 
https://www.postgresql.org/message-id/flat/CAA5RZ0uNofEXfEfNw3uRN3D3oXkFPQ_s%2BhuLLHMKR_%2BMCk8RPQ%40mail.gmail.com#c357c56c3924642e8ef73cc1c8a0286e

--
Sami Imseih
Amazon Web Services (AWS)




Re: Statistics Import and Export

2025-03-25 Thread Corey Huinker
>
> The original reason we wanted to issue warnings was to allow ourselves
> a chance to change the meaning of parameters, add new parameters, or
> even remove parameters without causing restore failures. If there are
> any ERRORs that might limit our flexibility I think we should downgrade
> those to WARNINGs.
>

+1


> Also, out of a sense of paranoia, it might be good to downgrade some
> other ERRORs to WARNINGs, like in 0002. I don't think it's quite as
> important as you seem to think, however. It doesn't make a lot of
> difference unless the user is running restore with --single-transaction
> or --exit-on-error, in which case they probably don't want the restore
> to continue if something unexpected happens. I'm fine having the
> discussion, though, or we can wait until beta to see what kinds of
> problems people encounter.
>

At this point, I feel I've demonstrated the limit of what can be made into
WARNINGs, giving us a range of options for now and into the beta. I'll
rebase and move the 0002 patch to be in last position so as to tee up
0003-0004 for consideration.


Re: optimize file transfer in pg_upgrade

2025-03-25 Thread Nathan Bossart
On Thu, Mar 20, 2025 at 03:23:13PM -0500, Nathan Bossart wrote:
> I'm still aiming to commit this sometime early next week.

Committed.  Thanks to everyone who chimed in on this thread.

While writing the attributions, I noticed that nobody seems to have
commented specifically on 0001.  The closest thing to a review I see is
Greg's note upthread [0].  This patch is a little bigger than what I'd
ordinarily feel comfortable with committing unilaterally, but it's been
posted in its current form since February 28th, this thread has gotten a
decent amount of traffic since then, and it's not a huge change ("9 files
changed, 96 insertions(+), 37 deletions(-)").  I'm happy to address any
post-commit feedback that folks have.  As noted earlier [1], I'm not wild
about how it's implemented, but this is the nicest approach I've thought of
thus far.

I also wanted to draw attention to this note in 0003:

/*
 * XXX: The below line is a hack to deal with the fact that we
 * presently don't have an easy way to find the corresponding 
new
 * tablespace's path.  This will need to be fixed if/when we add
 * pg_upgrade support for in-place tablespaces.
 */
new_tablespace = old_tablespace;

I intend to address this in v19, primarily to enable same-version
pg_upgrade testing with non-default tablespaces.  My current thinking is
that we should have pg_upgrade also gather the new cluster tablespace
information and map them to the corresponding tablespaces on the old
cluster.  This might require some refactoring in pg_upgrade.  In any case,
I didn't feel this should block the feature for v18.

[0] 
https://postgr.es/m/CAKAnmm%2Bi3Q1pZ05N_b8%3DS3B%3DrztQDn--HoW8BRKVtCg53r8NiQ%40mail.gmail.com
[1] https://postgr.es/m/Z9h5Spp76EBygyEL%40nathan

-- 
nathan




Re: Squash constant lists in query jumbling by default

2025-03-25 Thread Dmitry Dolgov
On Tue, Mar 25, 2025, 6:28 PM Álvaro Herrera wrote:
> On 2025-Mar-25, Tom Lane wrote:
> > If this GUC sticks around, it should be at least PGC_SUSET (on
> > the analogy of compute_query_id) to make it harder to break
> > pg_stat_statements that way.
>
> I have no problem making it superuser-only, and I can see making "on" be
> the default.  I am not opposed to removing it completely either, if we
> really think that the current behavior is no longer useful for anybody.


I'm in favor of removing the GUC of course, but if memory serves there
were some folks in the patch discussion thread who claimed they would
need to be able to keep the non-squashed behavior. I don't recall whether
there were particular arguments to support that; I'll try to find those
messages again. But overall, as long as nobody objects, I think it's fine
to get rid of the GUC.

> Earlier in the discussion, other possible values for the option were
> suggested, such as a way to distinguish arrays that had "lots" (say,
> hundreds or more) of entries from arrays that were "small".  That could
> be selected by the user (or site admin) using this GUC, though there was
> no agreement on exactly what that would be.  During the FOSDEM 2024
> development meeting there was a general dislike of this idea, which
> AFAIR was mostly predicated on the displayed query no longer being valid
> SQL.  But now that we've chosen a format that uses SQL comments, this is
> no longer a problem, so I think we haven't closed that door yet.  But we
> may still find out that no user cares about this.


Agreed, the way things work now brings this option back on the table.
I can refresh the patch to do this, but I'm afk for about a week, so it
will take some time. At the same time, the proposal to do squashing by
default does not seem to depend strictly on that, so maybe they could be
considered as separate ideas.



Re: Statistics Import and Export

2025-03-25 Thread Jeff Davis
On Tue, 2025-03-25 at 10:53 -0400, Corey Huinker wrote:
> 
> So this patch swings the pendulum a bit back towards accepting some
> things as errors.

Not exactly. I see patch 0001 as a change to the function signatures
from regclass to schemaname/relname, both for usability as well as
control over ERROR vs WARNING.

There's agreement to do so, so I went ahead and committed that part.

> the best we can do is to draw the error-versus-warning line at a
> place that:
> 
> * doesn't mess up flawed restores that we would otherwise expect to
> complete at least partially
> * is easy for us to understand
> * is easy for us to explain
> * we can live with for the next couple of decades

The original reason we wanted to issue warnings was to allow ourselves
a chance to change the meaning of parameters, add new parameters, or
even remove parameters without causing restore failures. If there are
any ERRORs that might limit our flexibility I think we should downgrade
those to WARNINGs.

Also, out of a sense of paranoia, it might be good to downgrade some
other ERRORs to WARNINGs, like in 0002. I don't think it's quite as
important as you seem to think, however. It doesn't make a lot of
difference unless the user is running restore with --single-transaction
or --exit-on-error, in which case they probably don't want the restore
to continue if something unexpected happens. I'm fine having the
discussion, though, or we can wait until beta to see what kinds of
problems people encounter.

Regards,
Jeff Davis





Re: AIO v2.5

2025-03-25 Thread Andres Freund
Hi,

On 2025-03-25 13:18:50 -0700, Noah Misch wrote:
> On Tue, Mar 25, 2025 at 04:07:35PM -0400, Andres Freund wrote:
> > On 2025-03-25 12:39:56 -0700, Noah Misch wrote:
> > > On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> > > > There are 2 1/2 ways around this:
> > > > 
> > > > 1) Stop using IOSQE_ASYNC heuristic
> > > > 2a) Wait for all in-flight IOs when any FD gets closed
> > > > 2b) Wait for all in-flight IOs using FD when it gets closed
> > > > 
> > > > Given that we have clear evidence that io_uring doesn't completely support
> > > > closing FDs while IOs are in flight, be it a bug or intentional, it seems
> > > > clearly better to go for 2a or 2b.
> > > 
> > > Agreed.  If a workload spends significant time on fd.c closing files, I
> > > suspect that workload already won't have impressive benchmark numbers.
> > > Performance-seeking workloads will already want to tune FD usage high enough
> > > to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
> > > clearly beats the other.  I'd try (2b) first but, if complicated, quickly
> > > abandon it in favor of (2a).  What other considerations could be important?
> > 
> > The only other consideration I can think of is whether this should happen for
> > all io_methods or not.
> 
> Either way is fine, I think.

Here's a draft incremental patch (attached as a .fixup to avoid triggering
cfbot) implementing 2b).


> > I'm inclined to do it via a bool in IoMethodOps, but I guess one could argue
> > it's a bit weird to have a bool in a struct called *Ops.
> 
> That wouldn't bother me.  IndexAmRoutine has many bools, and "Ops" is
> basically a synonym of "Routine".

Cool. Done that way.

The repeated-iteration approach taken in pgaio_closing_fd() isn't the
prettiest, but it's hard to imagine it ever being noticeable.


This survives a testrun where I use your torture patch and where I force all
IOs to use ASYNC. Previously that did not get very far.  I also did verify
that, if I allow a small number of FDs, we do not wrongly wait for all IOs.

Greetings,

Andres Freund
commit 290de30c0e4a87ed3bd6cdef478376d39a922912
Author: Andres Freund 
Date:   2025-03-25 16:50:27 -0400

wip: aio: io_uring wait for in-flight IOs

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:

diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index a46d4393ebc..352610f60a8 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -287,6 +287,15 @@ typedef struct PgAioCtl
  */
 typedef struct IoMethodOps
 {
+   /* properties */
+
+   /*
+* If an FD is about to be closed, do we need to wait for all in-flight
+* IOs referencing that FD?
+*/
+   boolwait_on_fd_before_close;
+
+
/* global initialization */
 
/*
@@ -368,6 +377,7 @@ extern PgAioResult pgaio_io_call_complete_local(PgAioHandle *ioh);
 /* aio_io.c */
 extern void pgaio_io_perform_synchronously(PgAioHandle *ioh);
 extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
+extern bool pgaio_io_uses_fd(PgAioHandle *ioh, int fd);
 
 /* aio_target.c */
 extern bool pgaio_io_can_reopen(PgAioHandle *ioh);
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index c512495dce3..8816ccfa0d7 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -1303,6 +1303,41 @@ pgaio_closing_fd(int fd)
 * it's probably not worth it.
 */
pgaio_submit_staged();
+
+   /*
+    * If requested by the IO method, wait for all IOs that use the
+    * to-be-closed FD.
+    */
+   if (pgaio_method_ops->wait_on_fd_before_close)
+   {
+       /*
+        * As waiting for one IO to complete may complete multiple IOs, we
+        * can't just use a mutable list iterator. The maximum number of
+        * in-flight IOs is fairly small, so just restart the loop after
+        * waiting for an IO.
+        */
+       while (!dclist_is_empty(&pgaio_my_backend->in_flight_ios))
+       {
+           dlist_iter  iter;
+           PgAioHandle *ioh = NULL;
+
+           dclist_foreach(iter, &pgaio_my_backend->in_flight_ios)
+           {
+               ioh = dclist_container(PgAioHandle, node, iter.cur);
+
+               if (pgaio_io_uses_fd(ioh, fd))
+                   break;
+               else
+                   ioh = NULL;
+           }
+
+           if (!ioh)
+               break;
+
+           /* see comment in pgaio_io_wait_for_free() about raciness */
+           pgaio_io_wait(ioh, ioh->generation);
+       }
+   }
 }
 
 /*
diff --git a/src/backe

Re: Cannot find a working 64-bit integer type on Illumos

2025-03-25 Thread Thomas Munro
On Wed, Mar 26, 2025 at 10:34 AM Nathan Bossart
 wrote:
> anaconda seems to be upset about this one [0].  I've spent all of 30
> seconds looking at it so far, but it appears to be using an old version of
> the header file.
>
> In file included from pg_regress.c:34:
> /usr/local/include/libpq-fe.h:623:8: error: unknown type name 'pg_int64'

Looks like it's mixing up /usr/local/include and our source tree...




Re: Squash constant lists in query jumbling by default

2025-03-25 Thread Dmitry Dolgov
On Tue, Mar 25, 2025, 7:40 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:

>
>>> On Tue, Mar 25, 2025, 6:28 PM Álvaro Herrera 
> wrote:
>
> On 2025-Mar-25, Tom Lane wrote:
>
> > If this GUC sticks around, it should be at least PGC_SUSET (on
>
> > the analogy of compute_query_id) to make it harder to break
>
> > pg_stat_statements that way.
>
>
> I have no problem making it superuser-only, and I can see making "on" be
>
> the default.  I am not opposed to removing it completely either, if we
>
> really think that the current behavior is no longer useful for anybody.
>
>
> I'm in favor of removing the GUC of course, but if memory serves there
> were some folks in the patch discussion thread, who claimed they would
> need to be able to keep non-squashed behavior. I don't recall if there were
> particular arguments to support that, will try to find those messages
> again.
>

Nevermind, I've checked it out -- I think the case I had in mind [1] in fact
supports GUC removal:

> If anyone subtly changes jumbling logic when the extension is
active, the instance could get huge performance issues.

[1]:
https://www.postgresql.org/message-id/b8721722-a73e-0ee9-6513-425e9c88d92f%40postgrespro.ru



Re: Cannot find a working 64-bit integer type on Illumos

2025-03-25 Thread Daniel Gustafsson
> On 26 Mar 2025, at 00:01, Tom Lane  wrote:

> How did that work before?  Perhaps somebody just now added a libpq
> dependency to pg_regress.c?

I believe the libpq dependency came in 66d6086cbcbfc8 which wasn't all that
recent.

--
Daniel Gustafsson





Re: query_id: jumble names of temp tables for better pg_stat_statement UX

2025-03-25 Thread Michael Paquier
On Tue, Mar 25, 2025 at 11:58:21AM -0500, Sami Imseih wrote:
> "Since the queryid hash value is computed on the post-parse-analysis
> representation of the queries, the opposite is also possible: queries with
> identical texts might appear as separate entries, if they have different
> meanings as a result of factors such as different search_path settings."
> 
> I think this text could remain as is, because search_path still
> matters for things like functions, etc.

Yeah, I think that's OK as-is.  I'm open to more improvements,
including more tests for these function patterns.  It's one of these
areas where we should be able to tweak RangeTblFunction and apply a
custom function to its funcexpr, and please note that I have no idea
how complex it could become as this is a Node expression.  :D

Functions in a temporary schema is not something as common as temp
tables, I guess, so these matter less, but they would still be a cause
of bloat for monitoring in very specific workloads.

> 2/
> "For example, pg_stat_statements will consider two apparently-identical
> queries to be distinct, if they reference a table that was dropped and
> recreated between the executions of the two queries."
> 
> This is no longer true for relations, but is still true for functions. I think
> we should mention the caveats in a bit more detail as this change
> will have impact on the most common case. What about something
> like this?
> 
> "For example, pg_stat_statements will consider two apparently-identical
> queries to be distinct, if they reference a function that was dropped and
> recreated between the executions of the two queries.

That's a bit larger than functions, but we could remain a bit more
evasive, with "if they reference *for example* a function that was
dropped and recreated between the executions of the two queries".

Note that for DDLs, like CREATE TABLE, we also group entries with
identical relation names, so we are kind of in line with the patch,
not with the current docs.

> Conversely, if a table is dropped and recreated between the
> executions of queries, two apparently-identical queries may be
> considered the same. However, if the alias for a table is different
> for semantically similar queries, these queries will be considered distinct"

This addition sounds like an improvement here.

As this thread has proved, we had little coverage of these cases in pgss,
so I've applied the tests as an independent change.  It is also useful
to track how things change in the commit history depending on how the
computation is tweaked.  I've also included your doc suggestions.  I
feel that we could do better here, but that's a common statement
anyway when it comes to the documentation.
--
Michael
From 8ea61bb0d7d6c4dbbb40dbaedb5e751027163cfe Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 26 Mar 2025 08:07:59 +0900
Subject: [PATCH v6] Add custom query jumble function for RangeTblEntry.eref

---
 src/include/nodes/parsenodes.h| 11 +++---
 src/backend/nodes/queryjumblefuncs.c  | 19 ++
 doc/src/sgml/pgstatstatements.sgml|  9 +++--
 .../pg_stat_statements/expected/select.out| 20 ---
 4 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 23c9e3c5abf2..a87f949b389e 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1050,8 +1050,13 @@ typedef struct RangeTblEntry
 	 */
 	/* user-written alias clause, if any */
 	Alias	   *alias pg_node_attr(query_jumble_ignore);
-	/* expanded reference names */
-	Alias	   *eref pg_node_attr(query_jumble_ignore);
+
+	/*
+	 * Expanded reference names.  This uses a custom query jumble function so
+	 * as the table name is included in the computation, not its list of
+	 * columns.
+	 */
+	Alias	   *eref pg_node_attr(custom_query_jumble);
 
 	RTEKind		rtekind;		/* see above */
 
@@ -1094,7 +1099,7 @@ typedef struct RangeTblEntry
 	 * tables to be invalidated if the underlying table is altered.
 	 */
 	/* OID of the relation */
-	Oid			relid;
+	Oid			relid pg_node_attr(query_jumble_ignore);
 	/* inheritance requested? */
 	bool		inh;
 	/* relation kind (see pg_class.relkind) */
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index f8b0f91704ba..62d6cfb7ac15 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -67,6 +67,9 @@ static void _jumbleElements(JumbleState *jstate, List *elements);
 static void _jumbleA_Const(JumbleState *jstate, Node *node);
 static void _jumbleList(JumbleState *jstate, Node *node);
 static void _jumbleVariableSetStmt(JumbleState *jstate, Node *node);
+static void _jumbleRangeTblEntry_eref(JumbleState *jstate,
+	  RangeTblEntry *rte,
+	  Alias *expr);
 
 /*
  * Given a possibly multi-statement source string, confine our attention to the
@@ -516,3 +519,19 @@ _jumbleVariableSetSt

Re: query_id: jumble names of temp tables for better pg_stat_statement UX

2025-03-25 Thread Tom Lane
Michael Paquier  writes:
> [ v6-0001-Add-custom-query-jumble-function-for-RangeTblEntr.patch ]

Couple of minor review comments ...

* In 5ac462e2b, this bit:

# node type
-   if (($t =~ /^(\w+)\*$/ or $t =~ /^struct\s+(\w+)\*$/)
+   if ($query_jumble_custom)
+   {
+   # Custom function that applies to one field of a node.
+   print $jff "\tJUMBLE_CUSTOM($n, $f);\n";
+   }
+   elsif (($t =~ /^(\w+)\*$/ or $t =~ /^struct\s+(\w+)\*$/)

fails to honor $query_jumble_ignore as the other if-branches do.
Perhaps it's unlikely that a node would have both query_jumble_custom
and query_jumble_ignore annotations, but still, the script would emit
completely incorrect code if it did.  Also, the "# node type" comment
should probably be moved down to within the first "elsif" block.

* In the v6 patch, this doesn't read very smoothly:

+ * Expanded reference names.  This uses a custom query jumble function so
+ * as the table name is included in the computation, not its list of
+ * columns.

Perhaps better

+ * Expanded reference names.  This uses a custom query jumble function so
+ * that the table name is included in the computation, but not its list of
+ * columns.

* Also, here:

+   considered the same. However, if the alias for a table is different
+   for semantically similar queries, these queries will be considered
+   distinct.

I'd change "semantically similar queries" to "otherwise-similar
queries"; I think writing "semantically" will just confuse people.

Otherwise LGTM.

regards, tom lane




Re: Advanced Patch Feedback Session / pgconf.dev 2025

2025-03-25 Thread Tomas Vondra
Hi,

I'd like to remind people planning to attend pgconf.dev the registration
for Advanced Patch Feedback Session closes on Friday (in ~2 days). If
you're interested in participating (as a contributor or a committer),
please sign up using the forms linked in the announcement.

kind regards
Tomas

On 3/13/25 17:37, Tomas Vondra wrote:
> Hi all,
> 
> pgconf.dev 2025 will host "Advanced Patch Feedback Session", with the
> same format as in 2024 [1]:
> 
>   Participants will work in small groups with a Postgres committer to
>   analyze a past contribution or attempted contribution, discussing
>   what they could have done better both in the patch and communication
>   about the patch. Applications are now closed and applicants have been
>   notified.
> 
> Per the schedule [2], the APFS is on Tuesday, May 13, 14:00 - 17:30. The
> APFS session is organized by me, Robert Haas and Amit Kapila.
> 
> [1]:
> https://www.pgevents.ca/events/pgconfdev2024/schedule/session/224-advanced-patch-feedback-session-invite-only/
> 
> [2]: https://2025.pgconf.dev/schedule.html
> 
> 
> At this point we're looking for those who would like to participate in
> this session, and get feedback for some of their patches.
> 
> The session would be similar to [1] in that we construct groups each
> containing one or more committer volunteers and several non-committer
> contributors who have patches about which they want feedback.  The
> committers look at the patches from the non-committers in their group
> in advance of the session, and then give whatever feedback they have.
> This can be about technical details of the patches, general technical
> matters, or tactical/communication aspects.
> 
> We strongly recommend non-committers to also review each other's patches
> in a similar manner, so that they can benefit from feedback given to
> other members of the group, or (even better) to participate in the
> discussion and share their feedback.
> 
> All this can be for long-term general education but also for short-term
> tactical purposes, depending on what the group thinks is beneficial.
> 
> 
> If you'd like to participate in this session as either a patch author
> who wants their patch reviewed or as a reviewer/committer/group
> leader, please use one of these forms to register:
> 
>   - patch author:
> 
> https://docs.google.com/forms/d/1DqXh5k6kp2ksVmbMOFkw8lybCNRMsGAsjP7PSTM52YA
> 
>   - committer:
> 
> https://docs.google.com/forms/d/1aEyvHErVZsy-QJRTsYtyxlz3mp1q9Vh7wa2g-PXpSsw
> 
> If applying as a patch author, please also include a commitfest link
> or two about what patches you have pending and would like to focus on.
> To be considered, a patch should have some nontrivial level of size or
> complexity or difficulty so that it's worth having a meeting about it.
> If you have specific questions / areas for feedback, please describe
> those in the patch comment.
> 
> If you have any logistical caveats, like you are still waiting for
> travel approval, please include that.  Beyond those kinds of things,
> please don't sign up unless you really can and want to come and make
> your own way there. It would be very wasteful and disappointing if a
> bunch of people signed up and then didn't show up.
> 
> Given that pgconf.dev is just about 8 weeks away, we would like to get
> this first round of feedback in the next 2 weeks (by Friday, March 28),
> so that we can proceed with further steps (forming groups, etc.)
> 
> If needed, feel free to ask question - either here, or get in touch
> directly with either of the organizers (me, Robert or Amit).
> 
> 
> kind regards
> 
> --
> Tomas Vondra
> 
> 

-- 
Tomas Vondra





Re: AIO v2.5

2025-03-25 Thread Noah Misch
On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> On 2025-03-25 08:58:08 -0700, Noah Misch wrote:
> > While having nagging thoughts that we might be releasing FDs before io_uring
> > gets them into kernel custody, I tried this hack to maximize FD turnover:
> > 
> > static void
> > ReleaseLruFiles(void)
> > {
> > #if 0
> > while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
> > {
> > if (!ReleaseLruFile())
> > break;
> > }
> > #else
> > while (ReleaseLruFile())
> > ;
> > #endif
> > }
> > 
> > "make check" with default settings (io_method=worker) passes, but
> > io_method=io_uring in the TEMP_CONFIG file got different diffs in each of
> > two runs.  s/#if 0/#if 1/ (restore normal FD turnover) removes the failures.
> > Here's the richer of the two diffs:
> 
> Yikes. That's a very good catch.
> 
> I spent a bit of time debugging this. I think I see what's going on - it turns
> out that the kernel does *not* open the FDs during io_uring_enter() if
> IOSQE_ASYNC is specified [1].  Which we do add heuristically, in an attempt to
> avoid a small but measurable slowdown for sequential scans that are fully
> buffered (c.f. pgaio_uring_submit()).  If I disable that heuristic, your patch
> above passes all tests here.

Same result here.  As an additional data point, I tried adding this so every
reopen gets a new FD number (leaks FDs wildly):

--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1304,5 +1304,5 @@ LruDelete(File file)
 	 * to leak the FD than to mess up our internal state.
 	 */
-	if (close(vfdP->fd) != 0)
+	if (dup2(2, vfdP->fd) != vfdP->fd)
 		elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),
 			 "could not close file \"%s\": %m", vfdP->fileName);

The same "make check" w/ TEMP_CONFIG io_method=io_uring passes with the
combination of that and the max-turnover change to ReleaseLruFiles().

> I don't know if that's an intentional or unintentional behavioral difference.
> 
> There are 2 1/2 ways around this:
> 
> 1) Stop using IOSQE_ASYNC heuristic
> 2a) Wait for all in-flight IOs when any FD gets closed
> 2b) Wait for all in-flight IOs using FD when it gets closed
> 
> Given that we have clear evidence that io_uring doesn't completely support
> closing FDs while IOs are in flight, be it a bug or intentional, it seems
> clearly better to go for 2a or 2b.

Agreed.  If a workload spends significant time on fd.c closing files, I
suspect that workload already won't have impressive benchmark numbers.
Performance-seeking workloads will already want to tune FD usage high enough
to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
clearly beats the other.  I'd try (2b) first but, if complicated, quickly
abandon it in favor of (2a).  What other considerations could be important?




Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread Euler Taveira
On Tue, Mar 25, 2025, at 8:07 AM, Shubham Khanna wrote:
> Apologies for the oversight. I have attached the patches now. Please
> find them included here.

I started looking at this patch. When I started reading the commit message, I
didn't understand why the options that provide names for objects are
incompatible with the --all option. I agree that --all and --database should
be incompatible, but the others shouldn't be. We already have a check saying
that the number of specified objects cannot differ from the number of
databases. Why not rely on it instead of forbidding these options?

    /* Number of object names must match number of databases */
    if (num_pubs > 0 && num_pubs != num_dbs)
    {
        pg_log_error("wrong number of publication names specified");
        pg_log_error_detail("The number of specified publication names (%d) must match the number of specified database names (%d).",
                            num_pubs, num_dbs);
        exit(1);
    }

The following documentation is inaccurate, since it doesn't mention that the
database must also allow connections.

+  
+   Create one subscription per all non-template databases on the target
+   server. Automatically generated names for subscriptions, publications,

My suggestion is:

Create one subscription per database on the target server. Exceptions are
template databases and databases that don't allow connections.

You are mixing the short (-a) and long (--all) forms of the option in the
same paragraph. It might confuse the reader.

+   switches. This option cannot be used together with --all.
+   If -d option is not provided, the database name will be
+   obtained from -P option. If the database name is not
+   specified in either the -d option, or the
+   -P option, and -a option is not
+   specified, an error will be reported.

Since the surrounding text uses only short options, you should use the short
form here as well.

+   bool        all_dbs;        /* --all option was specified */
    SimpleStringList objecttypes_to_remove; /* list of object types to remove */

This comment doesn't follow the same pattern as the other members: it uses a
full sentence. Maybe just '--all option', although that would differ from the
last added option, enable-two-phase.

I'm already thinking about translation, so it would be better to rephrase
this description, even if it ends up slightly imprecise. (Some imprecision is
expected here: if you cannot connect to a database, it won't be possible to
set up logical replication for it.)

+   printf(_("  -a, --all   create subscriptions for all non-template source databases\n"));

Suggestion:

  create subscriptions for all databases except template databases

The following comment is not accurate: the postgres database is not created
by the user, yet it will still be fetched.

+/*
+ * If --all is specified, fetch a list of all user-created databases from the
+ * source server. Internally, this is treated as if the user specified multiple
+ * --database options, one for each source database.
+ */

+static void
+fetch_source_databases(struct CreateSubscriberOptions *opt)

There is no 'source' terminology in the other function names; they use
publisher, subscriber, primary, and standby. There are also other functions
using 'get', so I would use get_publisher_databases as the function name.

It is just a matter of style but since the columns you are using are booleans,
it sounds natural to not specify the equality operator. You should also test
datconnlimit to avoid invalid databases. I would add ORDER BY for
predictability.

+   res = PQexec(conn, "SELECT datname FROM pg_database WHERE datistemplate = false AND datallowconn = true");

Something like:

SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn AND datconnlimit <> -2 ORDER BY 1

What happens if you don't specify the dbname in the -P option?

+   /* Establish a connection to the source server */
+   conn = connect_database(opt->pub_conninfo_str, true);

It defaults to the user name. If you are using another superuser (such as
admin), the connection might fail if there is no database named 'admin'.
vacuumdb, which has a similar --all option, uses another option
(--maintenance-db) to discover which databases should be vacuumed. I'm not
suggesting adding a --maintenance-db option, and I wouldn't like to hardcode
a specific database (template1, for example) for when there is no dbname in
the connection string. Instead, suggest that the user specify dbname in the
-P option. An error message should be provided stating the exact reason: no
dbname was specified.

I don't think either comment adds anything; you already explained this before
the function body.

+   /* Process the query result */
+   for (int i = 0; i < PQntuples(res); i++)
+   {
+   const char *dbname = PQgetvalue(res, i, 0);
+
+   simple_string_list_append(&opt->database_names, dbname);
+
+   /* Increment num_dbs to reflect multiple --database options */
+   num_dbs++;
+   }

I wouldn't add an error here.

+   /* Error 

Re: why there is not VACUUM FULL CONCURRENTLY?

2025-03-25 Thread Robert Haas
On Sat, Mar 22, 2025 at 5:43 AM Antonin Houska  wrote:
> Can you please give me an example? I don't recall seeing a lock upgrade in the
> tree. That's the reason I tried rather hard to avoid that.

VACUUM has to upgrade the lock in order to truncate away pages at the
end of the table.

Or just:
BEGIN;
SELECT * FROM sometable;
VACUUM FULL sometable;
COMMIT;

I don't think we should commit something that handles locking the way
this patch does. I mean, it would be one thing if you had a strategy
for avoiding erroring out when a deadlock would otherwise occur by
doing something clever. But it seems like you would just need to
detect the same problem in a different way. Doing something
non-standard doesn't make sense unless we get a clear benefit from it.
(Even then it might be unsafe, of course, but at least then you have a
motivation to take the risk.)

> > - On what basis do you make the statement in the last paragraph that
> > the decoding-related lag should not exceed one WAL segment? I guess
> > logical decoding probably keeps up pretty well most of the time but
> > this seems like a very strong guarantee for something I didn't know we
> > had any kind of guarantee about.
>
> The patch itself does guarantee that by checking the amount of unprocessed WAL
> regularly when it's copying the data into the new table. If too much WAL
> appears to be unprocessed, it enforces the decoding before the copying is
> resumed.

Hmm. If the source table is not locked against writes, it seems like
we could always get into a situation where this doesn't converge --
you just need to modify the table faster than those changes can be
decoded and applied. Maybe that's different from what we're talking
about here, though.

> > - What happens if we crash?
>
> The replication slot we create is RS_TEMPORARY, so it disappears after
> restart. Everything else is as if the current implementation of CLUSTER ends
> due to crash.

Cool.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [fixed] Trigger test

2025-03-25 Thread Lilian Ontowhee
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Hi Dmitrii,

Paul Jungwirth and I reviewed this patch, and here are our comments:

1. The patch applies and tests pass.
2. The patch fixes a bug in contrib/spi, which is not really a practical 
extension, but rather examples of how to use SPI. The contrib/spi directory 
actually has four extensions: refint, autoinc, insert_username, and 
moddatetime. The patch is for refint, which is a way you could implement 
foreign keys if it weren't built in to Postgres.
3. Consider updating documentation for doc/src/contrib-spi.sgml, or any file as 
appropriate, to reflect the changes.
4. Are there any cases where check_primary_key() and check_foreign_key() should 
be called using a BEFORE trigger? Will this change break backwards 
compatibility? Consider adding a test with a BEFORE trigger to ensure the error 
"must be fired by AFTER trigger" is raised.

Thank you!

The new status of this patch is: Waiting on Author


Re: Squash constant lists in query jumbling by default

2025-03-25 Thread Sami Imseih
> At the same time the proposal to do squashing by default
> does not seem to be strictly dependent on that, so maybe they could be
> considered as isolated ideas.

Here is a patch to remove the GUC, if we settle on doing so.

--
Sami Imseih
Amazon Web Services (AWS)


v1-0001-Remove-the-query_id_squash_values-GUC.patch
Description: Binary data


Re: AIO v2.5

2025-03-25 Thread Noah Misch
On Tue, Mar 25, 2025 at 04:07:35PM -0400, Andres Freund wrote:
> On 2025-03-25 12:39:56 -0700, Noah Misch wrote:
> > On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> > > There are 2 1/2 ways around this:
> > > 
> > > 1) Stop using IOSQE_ASYNC heuristic
> > > 2a) Wait for all in-flight IOs when any FD gets closed
> > > 2b) Wait for all in-flight IOs using FD when it gets closed
> > > 
> > > Given that we have clear evidence that io_uring doesn't completely support
> > > closing FDs while IOs are in flight, be it a bug or intentional, it seems
> > > clearly better to go for 2a or 2b.
> > 
> > Agreed.  If a workload spends significant time on fd.c closing files, I
> > suspect that workload already won't have impressive benchmark numbers.
> > Performance-seeking workloads will already want to tune FD usage high enough
> > to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
> > clearly beats the other.  I'd try (2b) first but, if complicated, quickly
> > abandon it in favor of (2a).  What other considerations could be important?
> 
> The only other consideration I can think of is whether this should happen for
> all io_methods or not.

Either way is fine, I think.

> I'm inclined to do it via a bool in IoMethodOps, but I guess one could argue
> it's a bit weird to have a bool in a struct called *Ops.

That wouldn't bother me.  IndexAmRoutine has many bools, and "Ops" is
basically a synonym of "Routine".
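As an aside for readers following the 2a/2b options discussed upthread, the per-FD variant (2b) can be illustrated with a toy in-flight table. All names here are hypothetical; the real pgaio code tracks IO handles in shared memory, and this is only a sketch of the idea:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of option 2b: before closing an FD, drain only the
 * in-flight IOs that reference it.  Hypothetical names throughout;
 * this sketches the idea, not the pgaio implementation.
 */
#define TOY_MAX_IOS 8

typedef struct ToyIO
{
    int     fd;         /* file descriptor the IO targets */
    bool    in_flight;  /* true until "completed" */
} ToyIO;

static ToyIO toy_ios[TOY_MAX_IOS];
static int toy_waits;       /* how many IOs we had to wait for */

/* Stand-in for waiting on one IO's completion. */
static void
toy_wait_for_io(ToyIO *io)
{
    io->in_flight = false;
    toy_waits++;
}

/*
 * Option 2b: wait only for in-flight IOs using this FD.  Re-scan until
 * none remain; in a concurrent system completions can retire IOs in
 * any order, so one pass may not be enough.
 */
static void
toy_closing_fd(int fd)
{
    bool        found;

    do
    {
        found = false;
        for (int i = 0; i < TOY_MAX_IOS; i++)
        {
            if (toy_ios[i].in_flight && toy_ios[i].fd == fd)
            {
                toy_wait_for_io(&toy_ios[i]);
                found = true;
            }
        }
    } while (found);
}

/* Demo: two IOs on fd 3, one on fd 4; closing fd 3 drains only two. */
static int
toy_run_demo(void)
{
    toy_ios[0] = (ToyIO) {.fd = 3, .in_flight = true};
    toy_ios[1] = (ToyIO) {.fd = 4, .in_flight = true};
    toy_ios[2] = (ToyIO) {.fd = 3, .in_flight = true};
    toy_waits = 0;
    toy_closing_fd(3);
    /* encode both facts: waits performed, and whether fd 4 survived */
    return toy_waits * 10 + (toy_ios[1].in_flight ? 1 : 0);
}
```

Option 2a is then the degenerate case where the `io->fd == fd` test is dropped and every in-flight IO is drained on any close.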




Re: Cannot find a working 64-bit integer type on Illumos

2025-03-25 Thread Nathan Bossart
On Tue, Mar 25, 2025 at 09:41:21PM +1300, Thomas Munro wrote:
> On Wed, Mar 5, 2025 at 6:00 AM Peter Eisentraut  wrote:
>> I agree with your patch 0001-Deprecate-pg_int64.patch.  I don't see a
>> reason to keep the current arrangement around pg_int64.
> 
> Thanks for looking!  Pushed.

anaconda seems to be upset about this one [0].  I've spent all of 30
seconds looking at it so far, but it appears to be using an old version of
the header file.

In file included from pg_regress.c:34:
/usr/local/include/libpq-fe.h:623:8: error: unknown type name 'pg_int64'
  623 | extern pg_int64 lo_lseek64(PGconn *conn, int fd, pg_int64 offset, int 
whence);
  |^
/usr/local/include/libpq-fe.h:623:50: error: unknown type name 'pg_int64'
  623 | extern pg_int64 lo_lseek64(PGconn *conn, int fd, pg_int64 offset, int 
whence);
  |  ^
/usr/local/include/libpq-fe.h:627:8: error: unknown type name 'pg_int64'
  627 | extern pg_int64 lo_tell64(PGconn *conn, int fd);
  |^
/usr/local/include/libpq-fe.h:629:48: error: unknown type name 'pg_int64'
  629 | extern int  lo_truncate64(PGconn *conn, int fd, pg_int64 len);
  | ^
4 errors generated.

[0] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anaconda&dt=2025-03-25%2021%3A22%3A33

-- 
nathan




Re: AIO v2.5

2025-03-25 Thread Noah Misch
On Tue, Mar 25, 2025 at 04:56:53PM -0400, Andres Freund wrote:
> The repeated-iteration approach taken in pgaio_closing_fd() isn't the
> prettiest, but it's hard to imagine that ever being noticeable.

Yep.  I've reviewed the fixup code, and it looks all good.

> This survives a testrun where I use your torture patch and where I force all
> IOs to use ASYNC. Previously that did not get very far.  I also did verify
> that, if I allow a small number of FDs, we do not wrongly wait for all IOs.

I, too, see the test diffs gone.




Re: Improve CRC32C performance on SSE4.2

2025-03-25 Thread John Naylor
On Mon, Mar 24, 2025 at 6:37 PM John Naylor  wrote:
>
> I'll take a look at the configure
> checks soon, since I had some questions there.

One other thing I forgot to mention: The previous test function had
local constants that the compiler was able to fold, resulting in no
actual vector instructions being emitted:

movabs  rdx, 12884901891
xor eax, eax
crc32   rax, rdx
crc32   rax, rdx
ret

That may be okay for practical purposes, but in the spirit of commit
fdb5dd6331e30 I changed it in v15 to use global variables and made
sure it emits what the function attributes are intended for:

vmovdqu64   zmm3, ZMMWORD PTR x[rip]
xor eax, eax
vpclmulqdq  zmm0, zmm3, ZMMWORD PTR y[rip], 0
vextracti32x4   xmm2, zmm0, 1
vmovdqa64   xmm1, xmm0
vmovdqu64   ZMMWORD PTR y[rip], zmm0
vextracti32x4   xmm0, zmm0, 2
vpternlogq  xmm1, xmm2, xmm0, 150
vmovq   rdx, xmm1
crc32   rax, rdx
vzeroupper
ret

--
John Naylor
Amazon Web Services




Re: Improve CRC32C performance on SSE4.2

2025-03-25 Thread John Naylor
On Mon, Mar 24, 2025 at 6:37 PM John Naylor  wrote:
> I'll take a look at the configure
> checks soon, since I had some questions there.

I'm leaning towards a length limit for v15-0001 so that inlined
instructions are likely to be unrolled. Aside from lack of commit
message, I think that one is ready for commit soon-ish.

I'm feeling pretty good about 0002, but since there is still room for
cosmetic fiddling, I want to let it sit for a bit longer.

I felt the previous proposals for configure.ac were unnecessarily
invasive, and the message looked out of place, so I made configure.ac
more similar to master, using the AVX popcount stuff as a model. I
also went the extra step and added a separate AC_MSG_CHECKING for
vectorized CRC. I'm not sure we really need that, but this algorithm
is trivially adaptable to Arm, so it might be welcome for visibility.

For Meson, I just made the CRC-checking comment a bit more general,
since keeping up this level of detail would result in a loss of
readability.

0003 is just to demonstrate on CI that we are in fact computing the
same answer as master. An earlier patch had some additional tests in
strings.sql but I have yet to dig those out.

--
John Naylor
Amazon Web Services
From cebf6a4b6ecec7fdc30678828ad9883149d9378b Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 28 Feb 2025 16:27:30 +0700
Subject: [PATCH v15 1/3] Inline CRC computation for fixed-length input

Use a simplified copy of the loop in pg_crc32c_sse42.c to avoid
moving code to a separate header.
---
 src/include/port/pg_crc32c.h | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 65ebeacf4b1..0ab7513f523 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -43,12 +43,42 @@ typedef uint32 pg_crc32c;
 
 #if defined(USE_SSE42_CRC32C)
 /* Use Intel SSE4.2 instructions. */
+
+#include 
+
 #define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
+	((crc) = pg_comp_crc32c_dispatch((crc), (data), (len)))
#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
+pg_attribute_no_sanitize_alignment()
+static inline
+pg_crc32c
+pg_comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
+{
+	if (__builtin_constant_p(len) && len < 32)
+	{
+		const unsigned char *p = data;
+
+		/*
+		 * For small constant inputs, inline the computation to avoid a
+		 * function call and allow the compiler to unroll loops.
+		 */
+#if SIZEOF_VOID_P >= 8
+		for (; len >= 8; p += 8, len -= 8)
+			crc = _mm_crc32_u64(crc, *(const uint64 *) p);
+#endif
+		for (; len >= 4; p += 4, len -= 4)
+			crc = _mm_crc32_u32(crc, *(const uint32 *) p);
+		for (; len > 0; --len)
+			crc = _mm_crc32_u8(crc, *p++);
+		return crc;
+	}
+	else
+		return pg_comp_crc32c_sse42(crc, data, len);
+}
+
 #elif defined(USE_ARMV8_CRC32C)
 /* Use ARMv8 CRC Extension instructions. */
 
-- 
2.48.1

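As a cross-check aside: any specialized implementation, including the inlined loop above, must agree with the portable bit-at-a-time definition of CRC-32C below. This is illustrative code, not from the patch; the well-known check value for the ASCII string "123456789" is 0xE3069283:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bit-at-a-time CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Reference definition only; hardware-accelerated paths must produce
 * identical results for all inputs.
 */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len-- > 0)
    {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Mirrors the INIT_CRC32C / COMP_CRC32C / FIN_CRC32C call sequence. */
static uint32_t
crc32c_check(const void *data, size_t len)
{
    uint32_t    crc = 0xFFFFFFFFu;          /* INIT */

    crc = crc32c_bitwise(crc, data, len);   /* COMP */
    return crc ^ 0xFFFFFFFFu;               /* FIN */
}
```

A comparison of this kind is essentially what a debug-only consistency assert can verify against the accelerated code path.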
From c519bd870fbb719fe147984b1a2aeff81316172c Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 25 Mar 2025 15:19:16 +0700
Subject: [PATCH v15 3/3] Add debug for CI XXX not for commit

---
 src/port/pg_crc32c_sse42.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index f392eb5b236..3b9c448486f 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -19,6 +19,8 @@
 
 #include "port/pg_crc32c.h"
 
+#define DEBUG_CRC    /* XXX not for commit, or at least comment out */
+
 pg_attribute_no_sanitize_alignment()
 pg_attribute_target("sse4.2")
 pg_crc32c
@@ -87,6 +89,9 @@ pg_comp_crc32c_pclmul(pg_crc32c crc, const void *data, size_t length)
 	pg_crc32c	crc0 = crc;
 	size_t		len = length;
 	const char *buf = data;
+#ifdef DEBUG_CRC
+	const size_t orig_len PG_USED_FOR_ASSERTS_ONLY = len;
+#endif
 
 	/* Align on cacheline boundary. WIP: The threshold needs testing. */
 	if (unlikely(len > 256))
@@ -139,7 +144,13 @@ pg_comp_crc32c_pclmul(pg_crc32c crc, const void *data, size_t length)
 		len = end - buf;
 	}
 
-	return pg_comp_crc32c_sse42(crc0, buf, len);
+	crc0 = pg_comp_crc32c_sse42(crc0, buf, len);
+
+#ifdef DEBUG_CRC
+	Assert(crc0 == pg_comp_crc32c_sse42(crc, data, orig_len));
+#endif
+
+	return crc0;
 }
 
 #endif
-- 
2.48.1

From 1fc0fd60062446307bda6c60619344d8588c8125 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 25 Mar 2025 19:22:32 +0700
Subject: [PATCH v15 2/3] Improve CRC32C performance on x86_64

The current SSE4.2 implementation of CRC32C relies on the native
CRC32 instruction, which operates on 8 bytes at a time. We can get a
substantial speedup on longer inputs by using carryless multiplication
on SIMD registers, processing 64 bytes per loop iteration.

The VPCLMULQDQ instruction on 512-bit registers has been available
on Intel hardware since 2019 and AMD since 2022. There is an older
variant

Re: Expanding HOT updates for expression and partial indexes

2025-03-25 Thread Burd, Greg
Apologies for the noise, I overlooked a compiler warning.

fixed. 

-greg



> On Mar 25, 2025, at 7:47 AM, Burd, Greg  wrote:
> 
> Matthias, 
> 
> Rebased patch attached.
> 
> Changes in v14:
> * UpdateContext now the location I've stored estate, resultRelInfo, etc.
> * Reuse the result from the predicate on the partial index.
> 
> -greg
> 
> 
> 
>> On Mar 7, 2025, at 5:47 PM, Matthias van de Meent 
>>  wrote:
>> 
>> On Thu, 6 Mar 2025 at 13:40, Burd, Greg  wrote:
>> 
>> 
>>> 
>>> 
>>> 
 On Mar 5, 2025, at 6:39 PM, Matthias van de Meent 
  wrote:
 
 On Wed, 5 Mar 2025 at 18:21, Burd, Greg  wrote:
 
 
> * augments IndexInfo only when needed for testing expressions and only 
> once
 
 
 
 ExecExpressionIndexesUpdated seems to always loop over all indexes,
 always calling AttributeIndexInfo which always updates the fields in
 the IndexInfo when the index has only !byval attributes (e.g. text,
 json, or other such varlena types). You say it happens only once, have
 I missed something?
>>> 
>>> 
>>> 
>>> There's a test that avoids doing it more than once, [...]
>> 
>> 
>> 
>> Is this that one?
>> 
>> +if (indexInfo->ii_IndexAttrByVal)
>> +return indexInfo;
>> 
>> I think that test doesn't work consistently: a bitmapset * is NULL
>> when no bits are set; and for some indexes no attribute will be byval,
>> thus failing this early-exit even after processing.
>> 
>> Another small issue with this approach is that it always calls and
>> tests in EEIU(), while it's quite likely we would do better if we
>> pre-processed _all_ indexes at once, so that we can have a path that
>> doesn't repeatedly get into EEIU only to exit immediately after. It'll
>> probably be hot enough to not matter much, but it's still cycles spent
>> on something that we can optimize for in code.
> 
> 
> I've changed this a bit, now in ExecOpenIndices() when there are expressions 
> or predicates I augment the IndexInfo with information necessary to perform 
> the tests in EEIU().  I've debated adding another bool to ExecOpenIndices() 
> to indicate that we're opening indexes for the purpose of an update to avoid 
> building that information in cases where we are not.  Similar to the bool 
> `speculative` on ExecOpenIndices() today.  Thoughts?
> 
> 
>>> I might widen this patch a bit to include support for testing equality of 
>>> index tuples using custom operators when they exist for the index.  In the 
>>> use case I'm solving for we use a custom operator for equality that is not 
>>> the same as a memcmp().  Do you have thoughts on that?
>> 
>> 
>> 
>> I don't think that's a very great idea. From a certain point of view,
>> you can see HOT as "deduplicating multiple tuple versions behind a
>> single TID". Btree doesn't support deduplication for types that can
>> have more than one representation of the same value so that e.g.
>> '0.0'::numeric and '0'::numeric are both displayed correctly, even
>> when they compare as equal according to certain equality operators.
> 
> 
> Interesting, good point.  Seems like it would require a new index AM function:
> bool indexed_tuple_would_change()
> 
> I'll drop this for now, it seems out of scope for this patch set.
> 
> 
> 
> 

  

v15-0001-Expand-HOT-update-path-to-include-expression-and.patch
Description: v15-0001-Expand-HOT-update-path-to-include-expression-and.patch


Re: a pool for parallel worker

2025-03-25 Thread Kirill Reshke
On Tue, 11 Mar 2025 at 17:38, Andy Fan  wrote:
>
>
>
> Hi,
>

Hi!

> Currently when a query needs some parallel workers, postmaster spawns
> some backend for this query and when the work is done, the backend
> exit.  there are some wastage here, e.g. syscache, relcache, smgr cache,
> vfd cache and fork/exit syscall itself.
>
> I am thinking if we should preallocate (or create lazily) some backends
> as a pool for parallel worker. The benefits includes:
>
> (1) Make the startup cost of a parallel worker lower in fact.
> (2) Make the core more suitable for cases where the executor needs a
> new worker to run a piece of a plan. I think this is needed in some
> data-redistribution-related executors in a distributed database.
>
> I guess both cases can share some well-designed code, like costing or
> transferring data between worker and leader.

Surely forking from the postmaster is costly.

> The awkward thing about the pool is that it is [dbid + userId] based,
> by which I mean that if the dbid or userId differs from the connection
> in the pool, it can't be reused.  To reduce the effect of userId, we
> could start the pool with a superuser and then switch the user with
> 'SET ROLE xxx'; the pool can be created lazily.

I don't think this is secure. Currently, if your PostgreSQL process
was started under a superuser role, there is no way to undo that.
Consider a worker in a pool running a user query that uses a UDF. In
this UDF, one can simply RESET SESSION AUTHORIZATION and proceed with
anything under superuser rights.

> Any comments on this idea?
>
> --
> Best Regards
> Andy Fan
>
>
>


-- 
Best regards,
Kirill Reshke




Re: [Patch] remove duplicated smgrclose

2025-03-25 Thread Masahiko Sawada
On Thu, Mar 13, 2025 at 1:37 AM Steven Niu  wrote:
>
>
>
> 在 2025/3/12 6:31, Masahiko Sawada 写道:
> > On Mon, Mar 10, 2025 at 3:08 AM Steven Niu  wrote:
> >>
> >>
> >>
> >>   Hi, Masahiko
> >>
> >> Thanks for your comments! I understand your concern as you stated.
> >> However, my initial patch was split into two parts as Kirill suggested.
> >> This thread is about the first part. Another part is here: 
> >> https://commitfest.postgresql.org/patch/5149/
> >> Or you can take a look at the v2-0001-remove-duplicated-smgrclose.patch in 
> >> this thread for the complete change.
> >>
> >> I think either we review the v2-patch, or review the both 5149 and 5196 
> >> CFs, for my complete change.
> >> There should be no missing operations.
> >>
> >> Please let me know if you have more comments.
> >
> > What is the main point of removing this duplication? While I can see
> > that in smgrDoPendingDeletes(), we do
> > 'smgrsw[reln->smgr_which].smgr_close(reln, forknum);' twice for each
> > relation: in smgrdounlinkall() and in smgrrelease(), I'm not sure what
> > benefit this change brings us. In particular, we quickly return from
> > the second mdclose() call as all segment files are already closed.
> > Have you experienced a case where the system got stuck or became much
> > slower due to these duplicated calls?
> >
> > Also, we might need to pay attention to the fact that with the patch
> > smgrdounlinkall() came to depend on smgrclose(). For example, the more
> > work we add to smgrclose() in the future, the more work
> > smgrdounlinkall() will do, which doesn't happen with the current code
> > as smgrdounlinkall() depends on mdclose().
> >
> > Given that the patched code doesn't do exactly the same things as
> > before (e.g., smgrdounlinkall() would end up resetting
> > reln->smgr_cached_nblocks[forknum] too), I think we need some reasons
> > to justify this change.
> >
> > Regards,
> >
>
>
> Hi, Masahiko
>
> Thank you for your detailed review and valuable insights. I understand
> your concern about the immediate benefits of removing the duplicate
> smgr_close() call, especially since the second call effectively becomes
> a no-op due to the prior file closures. However, the primary intent of
> my change is not driven by performance improvements or addressing a
> specific issue, but rather by enhancing the code's structure and
> clarity. Having two "Close the forks at smgr level" operations might
> lead to confusion for readers of the code.
>
>
> Additionally, the smgrclose() function not only closes the smgr but also
> resets smgr_cached_nblocks and smgr_targblock, making it a comprehensive
> operation. In the current implementation, the smgr is closed inside
> smgrdounlinkall(), while smgr_cached_nblocks and smgr_targblock are
reset outside of it. This fragments the code logic, which could be
streamlined.

I've investigated the code further. It seems that smgrclose() became
just an alias of smgrrelease() in commit 21d9c3ee4ef, which is what
makes mdclose() get called twice for each relation and fork, once via
smgrclose() and once via smgrunlinkall(). I'm hesitant about the first
proposal of removing the smgrclose() calls that follow smgrunlinkall()
because, as you also pointed out, we would miss some operations (e.g.
resetting smgr_cached_nblocks), and it would create another coding
convention, that smgrclose() can be omitted only after smgrunlinkall(),
causing code divergence that might introduce further confusion.

> Would it make sense to remove the smgr closing operation within
> smgrdounlinkall() and leave the rest of the code unchanged?

If I understand your idea correctly, smgrdounlinkall() would end up
calling mdunlink() for each relation without mdclose(). I don't think
it's okay.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Proposal - Allow extensions to set a Plan Identifier

2025-03-25 Thread Sami Imseih
>
> So this comes down to forking the Postgres code to do the job.  What I
> had in mind was a slightly different flow, where we would be able to
> push custom node attributes between the header parsing and the
> generation of the extension code.  If we do that, there would be no
> need to fork the Postgres code: extensions could force the definitions
> they want to the nodes they want to handle, as long as these do not
> need to touch the in-core query jumble logic, of course.


Allowing extensions to generate code for custom node attributes could be
taken up in 19. What about, for 18, thinking about exposing AppendJumble,
JUMBLE_FIELD, JUMBLE_FIELD_SINGLE, and JUMBLE_STRING?


—
Sami Imseih
Amazon Web Services (AWS)

>


Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread vignesh C
On Wed, 26 Mar 2025 at 02:05, Euler Taveira  wrote:
>
> On Tue, Mar 25, 2025, at 8:07 AM, Shubham Khanna wrote:
>
> The following code is not accurate. If I specify --all, --database and
> --subscription, it will report only --database. The user will remove it and
> run again. At this time, --subscription will be reported. My expectation is
> to have all reports at once.

I noticed that pg_dump reports conflicts incrementally rather than listing
all incompatible options at once, as shown below:
### Specifying the --data-only, --schema-only, --statistics-only,
--no-data and --clean options together
Example1:  ./pg_dump --data-only --schema-only --statistics-only
--no-data --clean -d postgres -f dump.txt
pg_dump: error: options -s/--schema-only and -a/--data-only cannot be
used together

### After removing --schema-only option, another error arises
Example2:  ./pg_dump --data-only  --statistics-only --no-data --clean
-d postgres -f dump.txt
pg_dump: error: options -a/--data-only and --statistics-only cannot be
used together

### After removing --no-data option, another error arises
Example3:  ./pg_dump --data-only   --no-data --clean -d postgres -f dump.txt
pg_dump: error: options -a/--data-only and --no-data cannot be used together

### After removing --clean option,  another error arises
Example4: ./pg_dump --data-only --clean -d postgres -f dump.txt
pg_dump: error: options -c/--clean and -a/--data-only cannot be used together

I believe the current patch follows this approach and also simplifies
the code handling, making it easier to maintain.

Regards,
Vignesh




Re: UUID v7

2025-03-25 Thread Masahiko Sawada
On Sun, Feb 9, 2025 at 9:07 AM Andrey Borodin  wrote:
>
> I've taken into account Sergey's note that "offset" is a better name
> for the uuidv7() argument than "shift".
>
> > On 5 Feb 2025, at 03:02, Masahiko Sawada  wrote:
> >
> >>
> >> I was thinking about incorporating test like this.
> >>
>  With this patch we can generate correct UUIDs in a very distant future.
>  postgres=# select x, 
>  
>  uuid_extract_timestamp(uuidv7((x::text || ' year'::text)::interval)),
>  (x::text || ' year'::text)::interval
>  from generate_series(1,9000,1000) x;
>  x   |   uuid_extract_timestamp|  interval
>  --+-+
>    1 | 2026-01-31 12:00:53.084+05  | 1 year
>  1001 | 3026-01-31 12:00:53.084+05  | 1001 years
>  2001 | 4026-01-31 12:00:53.084+05  | 2001 years
>  3001 | 5026-01-31 12:00:53.084+05  | 3001 years
>  4001 | 6026-01-31 12:00:53.084+05  | 4001 years
>  5001 | 7026-01-31 12:00:53.085+05  | 5001 years
>  6001 | 8026-01-31 12:00:53.085+05  | 6001 years
>  7001 | 9026-01-31 12:00:53.085+05  | 7001 years
>  8001 | 10026-01-31 12:00:53.085+05 | 8001 years
>  (9 rows)
> >>
> >
> > Something like following queries might be workable for example?
> >
> > create table test (c serial, d uuid, t timestamptz generated always as
> > (uuid_extract_timestamp(d)) stored);
> > insert into test (d) select uuidv7((n || 'years')::interval) from
> > generate_series(1, 2000) n;
> > select count(*) from (select t - lag(t) over (order by c) as diff from
> > test) where diff > '10 year' ;
>
> Yeah, makes sense. I reduced the tolerance to 366+1 days. It must be
> stable if we've done all the time-offset business right.
>
> > Here are some review comments:
> >
> > #define NS_PER_S   INT64CONST(1000000000)
> > #define NS_PER_MS  INT64CONST(1000000)
> > +#define US_PER_MS  INT64CONST(1000)
> > #define NS_PER_US  INT64CONST(1000)
> >
> > I think it's clear if we put US_PER_MS below NS_PER_US.
>
> OK.
>
> > ---
> >  *
> > - * ns is a number of nanoseconds since start of the UNIX epoch. This value 
> > is
> > + * unix_ts_ms is a number of milliseconds since start of the UNIX epoch,
> > + * ns_in_ms is a number of nanoseconds within millisecond. These values are
> >  * used for time-dependent bits of UUID.
> >
> > I think we can mention that the RFC describes that stored unix
> > timestamp as an unsigned integer.
>
> Done. Feel free to adjust my wording; I have no sense of idiomatic English.
>
> >
> > ---
> > static pg_uuid_t *
> > -generate_uuidv7(int64 ns)
> > +generate_uuidv7(uint64 unix_ts_ms, uint32 ns_in_ms)
> >
> > How about renaming ns_in_ms with sub_ms?
>
> OK.
>
> >
> > ---
> > +/* 64 bits is enough for real time, but not for a time range of 
> > UUID */
> >
> > I could not understand the point of this comment. It seems to say that
> > 64-bits is not enough for a time range of UUID, but doesn't the time
> > range of UUIDv7 use only 48 bits? It seems to need more comments.
>
> I've tried to say that acquiring the current time as int64 nanoseconds since
> the UNIX epoch is still viable for the code (until the year 2262).
>
>
> > ---
> > -ns = (ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
> > SECS_PER_DAY * USECS_PER_SEC)
> > -* NS_PER_US + ns % NS_PER_US;
> > +us = (ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
> > SECS_PER_DAY * USECS_PER_SEC);
> >
> > /* Generate an UUIDv7 */
> > -uuid = generate_uuidv7(ns);
> > +uuid = generate_uuidv7(us / US_PER_MS, (us % US_PER_MS) *
> > NS_PER_US + ns % NS_PER_US);
> >
> > Need to update comments in uuidv7_internval() such as:
> >
> >/*
> > * Shift the current timestamp by the given interval. To calculate time
> > * shift correctly, we convert the UNIX epoch to TimestampTz and use
> > * timestamptz_pl_interval(). Since this calculation is done with
> > * microsecond precision, we carry nanoseconds from original ns value to
> > * shifted ns value.
> > */
> >
> > and
> >
> >/*
> > * Convert a TimestampTz value back to an UNIX epoch and back 
> > nanoseconds.
> > */
>
> I've tried. I'm not very satisfied with the comments, but could not come up
> with an easier description.
>

Thank you for updating the patch. I had lost track of this patch.

I've updated the patch from your v4 patch. In this version, I excluded
the argument name change (from 'shift' to 'offset') as it's not
related to the bug fix and simplified the regression test case.

Please review it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v5-0001-Fix-timestamp-overflow-in-UUIDv7-implementation.patch
Description: Binary data
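For readers who want the layout in front of them: the RFC 9562 UUIDv7 structure that the patch's generate_uuidv7(unix_ts_ms, sub_ms) fills in can be sketched as follows. The helper names below are illustrative, not the patch's code; because the 48 millisecond bits are stored as an unsigned integer, the representable range extends to roughly the year 10889.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch of the RFC 9562 UUIDv7 layout: 48-bit big-endian Unix
 * millisecond timestamp, version nibble 7, 12 bits of sub-millisecond
 * fraction (rand_a), variant bits '10', then 62 random bits (rand_b).
 * Helper names are illustrative, not the patch's.
 */
static void
sketch_uuidv7(uint64_t unix_ts_ms, uint16_t sub_ms12,
              const unsigned char rand_b[8], unsigned char out[16])
{
    for (int i = 0; i < 6; i++)         /* unix_ts_ms, big-endian */
        out[i] = (unsigned char) (unix_ts_ms >> (40 - 8 * i));
    out[6] = 0x70 | ((sub_ms12 >> 8) & 0x0F);  /* version 7 + rand_a high */
    out[7] = (unsigned char) sub_ms12;         /* rand_a low byte */
    out[8] = 0x80 | (rand_b[0] & 0x3F);        /* variant '10' */
    memcpy(out + 9, rand_b + 1, 7);
}

/*
 * Inverse of the timestamp part, as uuid_extract_timestamp() must do
 * before converting milliseconds to TimestampTz.
 */
static uint64_t
sketch_extract_ms(const unsigned char uuid[16])
{
    uint64_t    ms = 0;

    for (int i = 0; i < 6; i++)
        ms = (ms << 8) | uuid[i];
    return ms;
}
```

Carrying the sub-millisecond remainder separately, as the patch does, keeps the 48-bit field an unsigned count of milliseconds, so shifting by large intervals cannot overflow the way a single signed 64-bit nanosecond count does after the year 2262.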


Re: AIO v2.5

2025-03-25 Thread Andres Freund
Hi,

On 2025-03-25 09:15:43 -0700, Noah Misch wrote:
> On Tue, Mar 25, 2025 at 11:57:58AM -0400, Andres Freund wrote:
> > FWIW, I prototyped this, it's not hard.
> > 
> > But it can't replace the current WARNING with 100% fidelity: If we read 60
> > blocks in a single smgrreadv, we today would would emit 60 WARNINGs.  But we
> > can't encode that many block offset in single PgAioResult, there's not 
> > enough
> > space, and enlarging it far enough doesn't seem to make sense either.
> > 
> > 
> > What we *could* do is to emit one WARNING for each bufmgr.c 
> > smgrstartreadv(),
> > with that warning saying that there were N zeroed blocks in a read from 
> > block
> > N to block Y and a HINT saying that there are more details in the server 
> > log.

It should probably be DETAIL, not HINT...


> Sounds fine.

I got that working. To make it readable, it required changing the division of
labor between buffer_readv_complete() and buffer_readv_complete_one() a bit,
but I think it's actually easier to understand now.

Still need to beef up the test infrastructure a bit to make the multi-block
cases more easily testable.


Could use some input on the framing of the message/detail. Right now it's:

ERROR:  invalid page in block 8 of relation base/5/16417
DETAIL: Read of 8 blocks, starting at block 7, 1 other pages in the same read 
are invalid.

But that doesn't seem great. Maybe:

DETAIL: Read of blocks 7..14, 1 other pages in the same read were also invalid.

But that still isn't really a sentence.

Greetings,

Andres Freund




Re: a pool for parallel worker

2025-03-25 Thread Andy Fan


Hi, 
>> The awkward thing about the pool is that it is [dbid + userId] based,
>> by which I mean that if the dbid or userId differs from the connection
>> in the pool, it can't be reused.  To reduce the effect of userId, we
>> could start the pool with a superuser and then switch the user with
>> 'SET ROLE xxx'; the pool can be created lazily.
>
> I don't think this is secure. Currently, if your PostgreSQL process
> was started under a superuser role, there is no way to undo that.
> Consider a worker in a pool running a user query that uses a UDF. In
> this UDF, one can simply RESET SESSION AUTHORIZATION and proceed with
> anything under superuser rights.

oh.. yes! "RESET SESSION AUTHORIZATION" is very bad news for me. Per
my testing, a UDF is not even essential; a plain SQL command can unset
the role. I am not sure why we designed it like this. Currently I
would like to know:

(a) Is there any well-known usage of "RESET SESSION AUTHORIZATION"? I
wasn't aware of this command before, and if it is not commonly used, I
would even want to forbid it in our internal version.

(b) Is there any other way to let different users of the same database
share the same connection? The current "SET ROLE x" is exciting, but
"RESET SESSION AUTHORIZATION" is disappointing.

-- 
Best Regards
Andy Fan





Re: Cannot find a working 64-bit integer type on Illumos

2025-03-25 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 26 Mar 2025, at 00:01, Tom Lane  wrote:
>> How did that work before?  Perhaps somebody just now added a libpq
>> dependency to pg_regress.c?

> I believe the libpq dependency came in 66d6086cbcbfc8 which wasn't all that
> recent.

It looks like this has been broken for a very long time, but it must
never have mattered before because libpq-fe.h is so stable, and
pg_regress doesn't use any new-ish APIs from it.  So pulling in
whatever version the platform had still worked.

I think this should work to fix it:

-pg_regress.o: override CPPFLAGS += -I$(top_builddir)/src/port -I$(libpq_srcdir) $(EXTRADEFS)
+pg_regress.o: override CPPFLAGS := -I$(top_builddir)/src/port -I$(libpq_srcdir) $(EXTRADEFS) $(CPPFLAGS)

but I haven't tested yet.

regards, tom lane




Doc: Add Table of Contents to psql Reference Page

2025-03-25 Thread David G. Johnston
Hi!

IMO, the psql reference has too many non-standard subsections (and too much
material in each) to strictly conform to our guidelines for application
documentation.  It needs a Table of Contents of its own to improve
usability.

Attached.

David J.
From c419cd7d32f0b60671cfa652c312c889875031bc Mon Sep 17 00:00:00 2001
From: "David G. Johnston" 
Date: Tue, 25 Mar 2025 23:03:41 -0700
Subject: [PATCH] doc: add local TOC to psql reference page

---
 doc/src/sgml/ref/psql-ref.sgml | 63 --
 1 file changed, 52 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index f083dba49a..60ba5c6ba7 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -45,7 +45,48 @@ PostgreSQL documentation
 
  
 
- 
+ 
+  Table of Contents
+   
+
+
+
+ 
+ 
+  
+   
+   
+   
+
+
+ 
+  
+ 
+
+   
+   
+
+
+ 
+  
+  
+  
+  
+ 
+
+   
+  
+ 
+
+
+
+
+
+
+   
+ 
+
+ 
   Options
 
   
@@ -623,7 +664,7 @@ EOF
  
 
 
- 
+ 
   Exit Status
 
   
@@ -636,10 +677,10 @@ EOF
  
 
 
- 
+ 
   Usage
 
-  
+  
 Connecting to a Database
 
 
@@ -712,7 +753,7 @@ $ psql postgresql://dbmaster:5433/mydb?sslmode=require
 
   
 
-  
+  
 Entering SQL Commands
 
 
@@ -764,7 +805,7 @@ testdb=>
 
   
 
-  
+  
 Meta-Commands
 
 
@@ -4080,7 +4121,7 @@ select 1\; select 2\; select 3;
   
  
 
- 
+ 
   Advanced Features
 

@@ -5076,7 +5117,7 @@ testdb=> \set PROMPT1 '%[%033[1;33;40m%]%n@%/%R%[%033[0m%]%# '
 

 
-   
+   
 Command-Line Editing
 

@@ -5319,7 +5360,7 @@ PSQL_EDITOR_LINENUMBER_ARG='--line '
  
 
 
- 
+ 
   Files
 
  
@@ -5390,7 +5431,7 @@ PSQL_EDITOR_LINENUMBER_ARG='--line '
  
 
 
- 
+ 
   Notes
 
 
@@ -5437,7 +5478,7 @@ PSQL_EDITOR_LINENUMBER_ARG='--line '
  
 
 
- 
+ 
   Notes for Windows Users
 
  
-- 
2.34.1



Possibly hard-to-read message

2025-03-25 Thread Kyotaro Horiguchi
Hello,

I came across the following help message added in commit 1a759c83278:

+   HELP0("  WATCH_INTERVAL\n"
+ "number of seconds \\watch by default waits between 
executing the query buffer\n");

It took me a little while to understand it. Once I read "executing the
query buffer" as referring to executions of the query buffer, it became
clear that the wait occurs between each execution.

> number of seconds \\watch waits by default between executions of the query 
> buffer

I’m just wondering if the message might be worth revising. If it’s
already clear enough, please feel free to disregard this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 523d624f0b3efe94f58934a3c4508c4cc39735bd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 26 Mar 2025 11:58:37 +0900
Subject: [PATCH] Minor rewording of WATCH_INTERVAL help message

Improve phrasing for better readability of the recently added message.
---
 src/bin/psql/help.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/psql/help.c b/src/bin/psql/help.c
index fe96e3e1de9..62edfb121c7 100644
--- a/src/bin/psql/help.c
+++ b/src/bin/psql/help.c
@@ -461,7 +461,7 @@ helpVariables(unsigned short int pager)
 		  "  VERSION_NUM\n"
 		  "psql's version (in verbose string, short string, or numeric format)\n");
 	HELP0("  WATCH_INTERVAL\n"
-		  "number of seconds \\watch by default waits between executing the query buffer\n");
+		  "number of seconds \\watch waits by default between executions of the query buffer\n");
 
 	HELP0("\nDisplay settings:\n");
 	HELP0("Usage:\n");
-- 
2.43.5



Re: Possibly hard-to-read message

2025-03-25 Thread David G. Johnston
On Tue, Mar 25, 2025 at 8:07 PM Kyotaro Horiguchi 
wrote:

> Hello,
>
> I came across the following help message added in commit 1a759c83278:
>
> +   HELP0("  WATCH_INTERVAL\n"
> + "number of seconds \\watch by default waits between
> executing the query buffer\n");
>
> It took me a little while to understand it. I read "executing the
> query buffer" as referring to executions of the query buffer, which
> clarified that the wait occurs between each execution.
>
> > number of seconds \\watch waits by default between executions of the
> query buffer
>
> I’m just wondering if the message might be worth revising. If it’s
> already clear enough, please feel free to disregard this.
>
>
I'm on board with tweaking this.

default number of seconds \watch waits after each execution.

Requires changing \watch:

- Wait the specified number of seconds (default 2) between executions.
+Wait the specified number of seconds (default 2) after each execution.

The fragment "of the query buffer" can be omitted from the description for
watch_interval to match up with the wording in \watch.

The manual too would be updated:

- This variable sets the default interval which \watch waits between
executing the query.
+This variable sets the default interval \watch waits after each query
execution.

(Removing "which" should be considered regardless; tongue-twister and not
all that sure it is grammatically correct or necessary.)

I guess I'm OK with the second sentence there; but all it is doing is
explaining how defaults work...
Specifying an interval in the command overrides this variable.

David J.


Re: a pool for parallel worker

2025-03-25 Thread Andy Fan
Andy Fan  writes:

> Hi, 
>>> The boring thing for the pool is it is [dbid + userId] based, which
>>> I mean if the dbid or userId is different with the connection in pool,
>>> they can't be reused.  To reduce the effect of UserId, I think if we can
>>> start the pool with a superuser and then switch the user information
>>> with 'SET ROLE xxx'. and the pool can be created lazily.
>>
>> I don't think this is secure. Currently, if your postgresql process
>> had started under superuser role, there is no way to undo that.
>> Consider a worker in a pool running a user query, which uses UDF. In
>> this UDF, one can simply RESET SESSION AUTHORIZATION and process with
>> anything under superuser rights.
>
> oh.. yes! "RESET SESSION AUTHORIZATION" does the very bad thing for
> me. Per my testing, UDF is not essential, just a sql command can unset
> the role. I am not sure why do we design like this.  Currently I want
> some knowledge:
>
> (a). Is there any well known "RESET SESSION AUTHORIZATION" usage. I
> didn't realize this command before. and if it is not common used, I even
> want to forbid this command in our internal version.

Besides the "RESET SESSION AUTHORIZATION", "SET ROLE x" does some bad
thing as well. e.g. 

=# set role x;  -- as superuser.
=> set role y;  -- as role x. it switch to role y easily.

I can understand why we need the above behavior now. Just that I don't
want to discard this goal(sharing the same connection to the same
database amongo the different users) so quickly. I hope we have other
clean method to do it.  If no,  do you think will disable"re/reset
session authorization" "set role x" in the normal backend, and only
allowing them in some internal connections (like the connection in
parallel worker pool) work.

For now, it is unclear for me what is the purpose for "set role x"
command due to it can be undo so easily.

-- 
Best Regards
Andy Fan





Re: query_id: jumble names of temp tables for better pg_stat_statement UX

2025-03-25 Thread Michael Paquier
On Tue, Mar 25, 2025 at 07:56:29PM -0500, Sami Imseih wrote:
> FWIW, the pg_stat_statements docs in a few places refer to 
> queries that may look different but have the same meaning 
> as “semantically equivalent”, this is why I used the same 
> terminology here.  But, I have no issue with the simplified
> rewrite either.
> 
> The patch LGTM as well. 

If any adjustments are required, it would always be possible to do
these later.  Anyway, workloads with a lot of temporary tables are
going to benefit from this change, so let's see how it goes.

Applied, after eyeing the PGSS dumps from installcheck much more
closely before and after the patch, to make sure that I'm not missing
something.
--
Michael


signature.asc
Description: PGP signature


Re: Reduce "Var IS [NOT] NULL" quals during constant folding

2025-03-25 Thread Richard Guo
On Wed, Mar 26, 2025 at 3:06 PM Tender Wang  wrote:
> The comment about  notnullattnums in struct RangeTblEntry says that:
> * notnullattnums is zero-based set containing attnums of NOT NULL
> * columns.
>
> But in get_relation_notnullatts():
> rte->notnullattnums = bms_add_member(rte->notnullattnums,
> i + 1);
>
> The notnullattnums seem to be 1-based.

This corresponds to the attribute numbers in Var nodes; you can
consider zero as representing a whole-row Var.

Thanks
Richard




Re: NOT ENFORCED constraint feature

2025-03-25 Thread Álvaro Herrera
On 2025-Mar-26, Amul Sul wrote:

> The reason for the change is to revert to the behavior before commit
> #80d7f990496b1c, where recursion occurred regardless of the
> changed flags. This is also described in the header comment for
> ATExecAlterConstrDeferrability() (earlier it was for
> ATExecAlterConstraintInternal):
> 
>  * Note that we must recurse even when the values are correct, in case
>  * indirect descendants have had their constraints altered locally.
>  * (This could be avoided if we forbade altering constraints in partitions
>  * but existing releases don't do that.)

Umm, why?  Surely we should not allow a partition tree to become
inconsistent.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
#error "Operator lives in the wrong universe"
  ("Use of cookies in real-time system development", M. Gleixner, M. Mc Guire)




Re: Proposal - Allow extensions to set a Plan Identifier

2025-03-25 Thread Michael Paquier
On Tue, Mar 25, 2025 at 04:23:15PM -0500, Sami Imseih wrote:
> On 3/25/25 00:47, Sami Imseih wrote:
>> 1. Check out the upstream Postgres source for the given version I'm 
>> generating jumble code for
>> 2. Modify the headers as needed to have the node attributes I want
>> 3. Call the gen_node_support.pl via the build process
>> 4. Copy out the resulting jumble node funcs/conds
> 
> I like this approach, and the artifacts from #4 will be packed with
> the extension.

So this comes down to forking the Postgres code to do the job.  What I
had in mind was a slightly different flow, where we would be able to
push custom node attributes between the header parsing and the
generation of the extension code.  If we do that, there would be no
need to fork the Postgres code: extensions could force the definitions
they want to the nodes they want to handle, as long as these do not
need to touch the in-core query jumble logic, of course.

Perhaps we should give more time and thought to the concepts we want
to expose at this level of the code for extensions.  Hence I would
side with not rushing things, and consider our options more carefully
for v19, aiming at what we would consider a better design.
--
Michael


signature.asc
Description: PGP signature


Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread Amit Kapila
On Wed, Mar 26, 2025 at 2:05 AM Euler Taveira  wrote:
>
> On Tue, Mar 25, 2025, at 8:07 AM, Shubham Khanna wrote:
>
> Apologies for the oversight. I have attached the patches now. Please
> find them included here.
>
>
> I started looking at this patch. When I started reading the commit message, I
> didn't understand why the options that provide names to objects are
> incompatible with --all option. I agree that --all and --database should be
> incompatible but the others shouldn't. We already have a check saying that the
> number of specified objects cannot be different from the number of databases.
> Why don't rely on it instead of forbidding these options?
>

We must also ensure that the order of the objects specified matches the
order of the databases (the doc says: "The order of the multiple
publication name switches must match the order of database switches.").
That is easy for users to do when explicitly specifying databases with
-d switches, but not in the --all case. I could see extending the
current functionality to just fetch the databases from the publisher and
display them to the user; then we could expect the user to specify the
names of the other objects in the same order, but not otherwise.

>
> What happens if you don't specify the dbname in -P option?
>
> +   /* Establish a connection to the source server */
> +   conn = connect_database(opt->pub_conninfo_str, true);
>
> It defaults to user name. If you are using another superuser (such as admin)
> the connection might fail if there is no database named 'admin'. vacuumdb that
> has a similar --all option, uses another option (--maintenance-db) to discover
> which databases should be vacuumed. I'm not suggesting to add the
> --maintenance-db option. I wouldn't like to hardcode a specific database
> (template1, for example) if there is no dbname in the connection string.
> Instead, suggest the user to specify dbname into -P option. An error message
> should be provided saying the exact reason: no dbname was specified.
>

But why shouldn't we use the same specification as vacuumdb, which is:
"If not specified, the postgres database will be used, or if that does
not exist, template1 will be used."?

>
> I checked the applications that provide multiple synopsis using the following 
> command.
>
> grep ' -c
>
> There are just 3 applications that have multiple cmdsynopsis. pgbench has 2
> distinct tasks (load and run) so it makes sense to have 2 synopsis. pg_ctl has
> multiple tasks and also deserves multiple synopsis. The complexity introduced
> into vacuumdb (per table, per schema) seems to justify multiple synopsis too.
> However, the same logic doesn't apply here. IMO 0002 shouldn't be applied
> because the additional option (--all) doesn't justify multiple synopsis for
> syntax clarification.
>

Yeah, also, vacuumdb doesn't have a separate line for --all in
synopsis. So, I agree with you about not adding '--all' option in the
synopsis.

-- 
With Regards,
Amit Kapila.




Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread Amit Kapila
On Tue, Mar 25, 2025 at 5:30 PM Ashutosh Bapat
 wrote:
>
> On Thu, Mar 20, 2025 at 5:54 PM Amit Kapila  wrote:
> >
> > *  
> > +  
> > +   pg_createsubscriber
> > +   option
> > +   
> > +
> > + -a
> > + --all
> > +
> > +
> > + -D 
> > + --pgdata
> > +
> > +datadir
> > +
> > + -P
> > + --publisher-server
> > +
> > +connstr
> >
> > Most of this is unrelated to this patch. I suggest making a top-up
> > patch, we can commit it separately.
>
> Isn't this documenting -a/--all option in synopsis. Why do you think
> it's unrelated?
>

I was not clear at that stage whether to include it along with the
core patch, but today, I looked at it while responding to Euler and
found that it is not required. We can still discuss whether to include
it, but the main patch can be committed even without this.

-- 
With Regards,
Amit Kapila.




Re: Update LDAP Protocol in fe-connect.c to v3

2025-03-25 Thread Peter Eisentraut

On 23.03.25 04:05, Andrew Jackson wrote:

 > This is the first complaint I can recall hearing about that, so
exactly which ones are "many"?

I've tested 2 of them before figuring out the v3 issue: lldap [0] and the
docker image osixia/docker-openldap [1].
- lldap gives the following error message when I attempt to connect
without the patch: "Service Error: while handling incoming messages:
while receiving LDAP op: Bind request version is not equal to 3. This is
a serious client bug.". With the attached patch this error message does
not appear.
- osixia/docker-openldap gives the following error message without the
patch: "67df745e conn=1001 op=0 RESULT tag=97 err=2 text=historical
protocol version requested, use LDAPv3 instead".

 > Also, are we really sufficiently compliant with v3 that just adding 
this bit is enough?


I believe that this bit is all that is needed. Per the man page for 
ldap_set_option [2]: "The protocol version used by the library defaults 
to LDAPv2 (now historic), which corresponds to the LDAP_VERSION2 macro. 
Application developers are encouraged to explicitly set 
LDAP_OPT_PROTOCOL_VERSION to LDAPv3, using the LDAP_VERSION3 macro, or 
to allow users to select the protocol version."


 > src/test/ldap/ doesn't do it for you?

Looking through the tests here it seems like they are all tests for the 
serverside auth functionality that is configurable in pg_hba.conf. I 
don't see any tests that test the client side "LDAP Lookup of Connection 
Parameters" described in [3]


Ah yes.  There are two independent pieces of LDAP functionality.  One is 
the client authentication support in the backend, the other is the 
connection parameter lookup in libpq.  The former does set the LDAP 
protocol version, the latter does not.  This was clearly just forgotten. 
 Your patch makes sense.






Re: AIO v2.5

2025-03-25 Thread Andres Freund
Hi,

On 2025-03-22 19:09:55 -0700, Noah Misch wrote:
> On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > Attached v2.11
>
> > Subject: [PATCH v2.11 05/27] aio: Add io_method=io_uring
>
> Apart from some isolated cosmetic points, this is ready to commit:
>
> > +   ereport(ERROR,
> > +   errcode(err),
> > +   errmsg("io_uring_queue_init failed: 
> > %m"),
> > +   hint != NULL ? errhint("%s", hint) : 0);
>
> https://www.postgresql.org/docs/current/error-style-guide.html gives the 
> example:
>
> BAD:open() failed: %m
> BETTER: could not open file %s: %m
>
> Hence, this errmsg should change, perhaps to:
> "could not setup io_uring queues: %m".

You're right. I didn't intentionally "violate" the policy, but I do have to
admit I'm not a huge fan of that aspect; it just obfuscates what actually
failed, forcing one to look at the code or strace to figure out what
precisely failed.

(Changed)


> > +   pgaio_debug_io(DEBUG3, ioh,
> > +  "wait_one io_gen: %llu, ref_gen: 
> > %llu, cycle %d",
> > +  (long long unsigned) ref_generation,
> > +  (long long unsigned) ioh->generation,
>
> In the message string, io_gen appears before ref_gen.  In the subsequent args,
> the order is swapped relative to the message string.

Oops, you're right.


> > --- a/src/backend/utils/activity/wait_event_names.txt
> > +++ b/src/backend/utils/activity/wait_event_names.txt
> > @@ -192,6 +192,8 @@ ABI_compatibility:
> >
> >  Section: ClassName - WaitEventIO
> >
> > +AIO_IO_URING_SUBMIT"Waiting for IO submission via io_uring."
> > +AIO_IO_URING_COMPLETION"Waiting for IO completion via io_uring."
> >  AIO_IO_COMPLETION  "Waiting for IO completion."
>
> I'm wondering if there's an opportunity to enrich the last two wait event
> names and/or descriptions.  The current descriptions suggest to me more
> similarity than is actually there.  Inputs to the decision:
>
> - AIO_IO_COMPLETION waits for an IO in PGAIO_HS_DEFINED, PGAIO_HS_STAGED, or
>   PGAIO_HS_COMPLETED_IO to reach PGAIO_HS_COMPLETED_SHARED.  The three
>   starting states are the states where some other backend owns the next
>   action, so the current backend can only wait to be signaled.
>
> - AIO_IO_URING_COMPLETION waits for the kernel to do enough so we can move
>   from PGAIO_HS_SUBMITTED to PGAIO_HS_COMPLETED_IO.
>
> Possible names and descriptions, based on PgAioHandleState enum names and
> comments:
>
> AIO_IO_URING_COMPLETED_IO "Waiting for IO result via io_uring."
> AIO_COMPLETED_SHARED  "Waiting for IO shared completion callback."
>
> If "shared completion callback" is too internals-focused, perhaps this:
>
> AIO_IO_URING_COMPLETED_IO "Waiting for IO result via io_uring."
> AIO_COMPLETED_SHARED  "Waiting for IO completion to update shared memory."

Hm, right now AIO_IO_COMPLETION also covers the actual "raw" execution of the
IO with io_method=worker/sync. For that, AIO_COMPLETED_SHARED would be
inappropriate.

We could use a different wait event when waiting for an IO via CV in
PGAIO_HS_SUBMITTED, with a small refactoring of pgaio_io_wait().  But I'm not
sure that would get you that far - we don't broadcast the CV when
transitioning from PGAIO_HS_SUBMITTED -> PGAIO_HS_COMPLETED_IO, so the
now-wrong wait event would remain until the shared callback completes.
Obviously waking everyone up just so they can use a different wait event
doesn't make sense.

A more minimal change would be to narrow AIO_IO_URING_COMPLETION to
"execution" or something like that, to hint at a separation between the raw IO
being completed and the IO, including the callbacks completing.


> > --- a/doc/src/sgml/config.sgml
> > +++ b/doc/src/sgml/config.sgml
> > @@ -2710,6 +2710,12 @@ include_dir 'conf.d'
> >  worker (execute asynchronous I/O using 
> > worker processes)
> > 
> >
> > +  
> > +   
> > +io_uring (execute asynchronous I/O using
> > +io_uring, if available)
>
> I feel the "if available" doesn't quite fit, since we'll fail if unavailable.
> Maybe just "(execute asynchronous I/O using Linux io_uring)" with "Linux"
> there to reduce surprise on other platforms.

You're right, "if available" can be misunderstood. But not mentioning that
it's an optional dependency seems odd too. What about something like

   
io_uring (execute asynchronous I/O using
io_uring, requires postgres to have been built with
--with-liburing
 /
-Dliburing)
   

Should the docs for --with-liburing/-Dliburing mention it's linux only? We
don't seem to do that for things like systemd (linux), selinux (linux) and
only kinda for bonjour (macos).

Greetings,

Andres Freund




Re: pg_recvlogical requires -d but not described on the documentation

2025-03-25 Thread Fujii Masao



On 2025/03/21 10:12, Hayato Kuroda (Fujitsu) wrote:

Dear Fujii-san,


Thanks for the patch! It looks good to me.

I'm considering whether to back-patch these changes to older versions.
Since pg_recvlogical --drop-slot worked without --dbname in 9.4
but started failing unintentionally in 9.5, it could be considered a bug.
However, this behavior has existed for a long time without complaints or
bug reports, and there was no clear documentation stating that
--drop-slot should work without --dbname.

Given this, I think we could also treat it as not a bug and apply
the change only to the master branch. What do you think should we
back-patch it as a bug fix or apply it only to master?


Personally, I think such a long-standing but harmless bug can be regarded as
the specification. So, I vote to treat this as an enhancement and apply it
only to master.


+1

I've updated the commit messages for both patches and also revised
the code comments in the 0002 patch. The updated patches are attached.

Unless there are any objections, I'm thinking to commit them.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From 53360ffb7a908a881a94ff81934dca9a7825d8d1 Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Tue, 18 Mar 2025 10:07:08 +0900
Subject: [PATCH v4] doc: Clarify required options for each action in
 pg_recvlogical

Each pg_recvlogical action requires specific options. For example,
--slot, --dbname, and --file must be specified with the --start action.
Previously, the documentation did not clearly outline these requirements.

This commit updates the documentation to explicitly state
the necessary options for each action.

Author: Hayato Kuroda 
Co-authored-by: Fujii Masao 
Reviewed-by: Ashutosh Bapat 
Reviewed-by: Vignesh C 
Reviewed-by: David G. Johnston 
Discussion: 
https://postgr.es/m/oscpr01mb14966930b4357bae8c9d68a8af5...@oscpr01mb14966.jpnprd01.prod.outlook.com
---
 doc/src/sgml/ref/pg_recvlogical.sgml | 26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index 95eb14b6352..2946bdae1e5 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -73,6 +73,11 @@ PostgreSQL documentation
 by --dbname.

 
+   
+The --slot and --dbname are required
+for this action.
+   
+

 The --two-phase can be specified with
 --create-slot to enable decoding of prepared 
transactions.
@@ -87,6 +92,10 @@ PostgreSQL documentation
 Drop the replication slot with the name specified
 by --slot, then exit.

+
+   
+The --slot is required for this action.
+   
   
  
 
@@ -101,6 +110,11 @@ PostgreSQL documentation
 --no-loop is specified.

 
+   
+The --slot and --dbname,
+--file are required for this action.
+   
+

 The stream format is determined by the output plugin specified when
 the slot was created.
@@ -159,6 +173,9 @@ PostgreSQL documentation
 Write received and decoded transaction data into this
 file. Use - for stdout.

+   
+This parameter is required for --start.
+   
   
  
 
@@ -265,6 +282,9 @@ PostgreSQL documentation
 mode, create the slot with this name. In --drop-slot
 mode, delete the slot with this name.

+   
+This parameter is required for any of actions.
+   
   
  
 
@@ -305,7 +325,11 @@ PostgreSQL documentation
  The dbname can be a connection string.  If so,
  connection string parameters will override any conflicting
- command line options.  Defaults to the user name.
+ command line options.
+
+
+ This parameter is required for --create-slot
+ and --start.
 

   
-- 
2.48.1

From 44c686ae11cc0de8d0aaf70c23397f94b477a82a Mon Sep 17 00:00:00 2001
From: Hayato Kuroda 
Date: Wed, 19 Mar 2025 11:30:16 +0900
Subject: [PATCH v4 2/2] Allow pg_recvlogical --drop-slot to work without
 --dbname.

When pg_recvlogical was introduced in 9.4, the --dbname option was not
required for --drop-slot. Without it, pg_recvlogical --drop-slot connected
using a replication connection (not tied to a specific database) and
was able to drop both physical and logical replication slots, similar to
pg_receivewal --drop-slot.

However, commit 0c013e08cfb unintentionally changed this behavior in 9.5,
making pg_recvlogical always check whether it's connected to a specific
database and fail if it's not. This change was expected for --create-slot
and --start, which handle logical replication slots and require a database
connection, but it was unnecessary for --drop-slot, which should work with
any replication connection. As a result, --dbname became a require

Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2025-03-25 Thread Michael Paquier
On Mon, Mar 17, 2025 at 10:22:47AM -0700, Jacob Champion wrote:
> I looked into switching over to pgstat_report_activity(), but that
> wasn't designed to be called in the middle of backend initialization.
> It would take more work to make those calls safe/sane when `st_state
> == STATE_STARTING`. I plan to mark this patchset as Withdrawn for now.

Okay, fine by me.  I had the impression that it would have been
possible to salvage some of the wait event states, but at least the 
starting state showing up in pg_stat_activity will be able to provide
some information, so it's better than none.  Unfortunately, I don't
have any room until the feature freeze for that.

Outside the stat report activity calls, I've been wondering if we
should add some dynamic tracking of which hba/ident entry a backend
PID is working with.  For example, if we knew the file and the entry
line number, we would know on which auth method this backend is
bumping into.  That may of course be limited if someone modifies and
reloads the HBA file while a backend is stuck.  Now, these files are
mostly static, and we have system views that provide the contents of
the ident and HBA files as SQL, so with a JOIN..
--
Michael


signature.asc
Description: PGP signature


Re: Reduce "Var IS [NOT] NULL" quals during constant folding

2025-03-25 Thread Richard Guo
On Sun, Mar 23, 2025 at 6:25 PM Richard Guo  wrote:
> On Sat, Mar 22, 2025 at 2:21 AM Tom Lane  wrote:
> > The way to make this work is what I said before: move the planner's
> > collection of relation information to somewhere a bit earlier in
> > the planner.  But not to outside the planner.

> I'm considering moving the collection of attnotnull information before
> pull_up_sublinks, in hopes of leveraging this info to pull up NOT IN
> in the future, something like attached.

Here is an updated version of the patch with some cosmetic changes and
a more readable commit message.  I'm wondering if it's good enough to
be pushed.  Any comments?

Thanks
Richard


v3-0001-Reduce-Var-IS-NOT-NULL-quals-during-constant-folding.patch
Description: Binary data


Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-25 Thread Amit Kapila
On Tue, Mar 25, 2025 at 5:24 PM Ashutosh Bapat
 wrote:
>
> On Tue, Mar 25, 2025 at 5:15 PM Amit Kapila  wrote:
> >
> > On Tue, Mar 25, 2025 at 5:08 PM Ashutosh Bapat
> >  wrote:
> > >
> > > This looks mostly ready except the test changes. I believe when
> > > committing, we are going to squash all three into a single commit. Is
> > > that correct?
> > >
> >
> > I would not prefer to commit 0003 as it is primarily because of test
> > +# run pg_createsubscriber with '--all' option without '--dry-run'. I
> > am not so sure whether it is really worth adding test cycles for this
> > option except for dry-run mode tests but we can discuss after
> > committing the core patch.
>
> I am worried that without that test, we won't catch bugs in creating
> slots, publications and subscriptions and thus causing problems in a
> setup with --all.
>

The --all option internally maps to -d switches, so the current tests
with that option should suffice for the need you are expecting.

-- 
With Regards,
Amit Kapila.




Re: NOT ENFORCED constraint feature

2025-03-25 Thread Amul Sul
On Tue, Mar 25, 2025 at 10:18 PM Peter Eisentraut  wrote:
>
> On 21.03.25 06:58, Amul Sul wrote:
> >
> > []
> > Attached is the updated version, where the commit messages for patch
> > 0005 and 0006 have been slightly corrected. Additionally, a few code
> > comments have been updated to consistently use the ENFORCED/NOT
> > ENFORCED keywords. The rest of the patches and all the code are
> > unchanged.
>
> I have committed patches 0001 through 0003.  I made some small changes:
>

Thank you very much !

> In 0001, I renamed the function UpdateConstraintEntry() to
> AlterConstrUpdateConstraintEntry() so the context is clearer.
>
> In 0002, you had this change:
>
> @@ -12113,75 +12154,89 @@ ATExecAlterConstraintInternal(List **wqueue, 
> ATAlterConstraint *cmdcon,
>   * If the table at either end of the constraint is partitioned, we 
> need to
>   * handle every constraint that is a child of this one.
>   */
> -   if (recurse && changed &&
> +   if (recurse &&
>  (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE ||
> -(OidIsValid(refrelid) &&
> - get_rel_relkind(refrelid) == RELKIND_PARTITIONED_TABLE)))
> -   ATExecAlterChildConstr(wqueue, cmdcon, conrel, tgrel, rel, 
> contuple,
> -  recurse, 
> otherrelids, lockmode);
> +get_rel_relkind(refrelid) == RELKIND_PARTITIONED_TABLE))
> +   AlterConstrDeferrabilityRecurse(wqueue, cmdcon, conrel, 
> tgrel, rel,
> + 
>   contuple, recurse, otherrelids,
> + 
>   lockmode);
>
> AFAICT, dropping the "changed" from the conditional was not correct.  Or at
> least, it would do redundant work if nothing was "changed".  So I put that
> back.  Let me know if that change was intentional or there is something else
> going on.
>

Makes sense. This is intentional, but I must confess that this change
isn't part of the scope of this patch. I should have mentioned it when
posting, as it was something I intended to discuss with Álvaro, but it
slipped my mind.

The reason for the change is to revert to the behavior before commit
80d7f990496b1c, where recursion occurred regardless of the
changed flags. This is also described in the header comment for
ATExecAlterConstrDeferrability() (earlier it was for
ATExecAlterConstraintInternal):

 *
 * Note that we must recurse even when the values are correct, in case
 * indirect descendants have had their constraints altered locally.
 * (This could be avoided if we forbade altering constraints in partitions
 * but existing releases don't do that.)
 *

Regards,
Amul




ALTER COLUMN SET DATA TYPE does not change the generation expression's collation

2025-03-25 Thread jian he
hi.

ATPrepAlterColumnType forbids us from using ALTER COLUMN SET DATA TYPE
USING (expr) for generated columns.  However, we can still change a
generated column's type from non-text to text, or change a text column
from one collation to another.

In ATExecAlterColumnType, don't we also need to set the generation
expression's collation?

We can do this by adding exprSetCollation:

--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14115,6 +14115,7 @@ ATExecAlterColumnType(AlteredTableInfo *tab,
Relation rel,
 errmsg("default for
column \"%s\" cannot be cast automatically to type %s",

colName, format_type_be(targettype;
}
+   exprSetCollation(defaultexpr, targetcollid);


-
CREATE TABLE x1(a int,
b int GENERATED ALWAYS AS (a * 2) stored,
c text GENERATED ALWAYS AS ('1') stored );
ALTER TABLE x1 alter column b set data type text collate "C";
ALTER TABLE x1 alter column c set data type text collate "C";

SELECT pg_get_expr(d.adbin, d.adrelid) AS default_value, d.adbin
FROM   pg_catalog.pg_attribute a
JOIN   pg_catalog.pg_attrdef d ON (a.attrelid, a.attnum) = (d.adrelid, d.adnum)
ANDa.attrelid = 'x1'::regclass
ANDa.attname in ('b', 'c');
by adding exprSetCollation, the output is

default_value | (a * 2)
adbin | {COERCEVIAIO :arg {OPEXPR :opno 514 :opfuncid 141
:opresulttype 23 :opretset false :opcollid 0 :inputcollid 0 :args
({VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0
:varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1
:varattnosyn 1 :location -1} {CONST :consttype 23 :consttypmod -1
:constcollid 0 :constlen 4 :constbyval true :constisnull false
:location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]}) :location -1}
:resulttype 25 :resultcollid 950 :coerceformat 2 :location -1}
-[ RECORD 2 
]-+
default_value | '1'::text COLLATE "C"
adbin | {CONST :consttype 25 :consttypmod -1 :constcollid 950
:constlen -1 :constbyval false :constisnull false :location -1
:constvalue 5 [ 20 0 0 0 49 ]}


master behavior:

default_value | (a * 2)
adbin | {COERCEVIAIO :arg {OPEXPR :opno 514 :opfuncid 141
:opresulttype 23 :opretset false :opcollid 0 :inputcollid 0 :args
({VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0
:varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1
:varattnosyn 1 :location -1} {CONST :consttype 23 :consttypmod -1
:constcollid 0 :constlen 4 :constbyval true :constisnull false
:location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]}) :location -1}
:resulttype 25 :resultcollid 0 :coerceformat 2 :location -1}
-[ RECORD 2 
]-+--
default_value | '1'::text
adbin | {CONST :consttype 25 :consttypmod -1 :constcollid 100
:constlen -1 :constbyval false :constisnull false :location -1
:constvalue 5 [ 20 0 0 0 49 ]}




RE: BUG #18815: Logical replication worker Segmentation fault

2025-03-25 Thread Zhijie Hou (Fujitsu)
On Wed, Feb 19, 2025 at 4:38 AM Tom Lane wrote:

Hi,

> 
> Here's a completed set of patches for bug #18815 [1] (which, briefly, 
> is that the logrep worker opens/closes indexes multiple times and 
> thereby breaks brininsertcleanup).
> 
> 0001 adds a subscriber-side BRIN index to 013_partition.pl, which 
> replicates the crash Sergey reported.  I've got mixed feelings about 
> whether to actually commit this: it's certainly useful for testing 
> this fix, but without context it looks like a pretty random thing to 
> do.  It could be justified perhaps on the grounds of testing 
> aminsertcleanup callbacks within replication, since BRIN is the only core 
> index type that has such a callback.
> 
> 0002 fixes the crash by hardening brininsertcleanup against multiple 
> calls.  I think that this is patching a symptom not the root cause, 
> but it still seems like good defensive programming.
> 
> 0003 adds Asserts to ExecOpenIndices and ExecCloseIndices to check 
> that they're not called more than once per ResultRelInfo, and thereby 
> exposes what I consider the root cause: apply_handle_tuple_routing 
> opens the indexes twice and then closes them twice.  This somewhat 
> accidentally leaves us with zero refcounts and zero locks on the 
> indexes, so in the absence of aminsertcleanup callbacks the only real 
> reason to complain is the duplicative construction of the IndexInfo 
> structures.  But the double call of brininsertcleanup breaks the 
> existing coding of that function, and it might well break third-party 
> aminsertcleanup callbacks if there are any.  So I think we should
> institute a policy that this coding pattern is forbidden.
> 
> And finally, 0004 fixes worker.c to not do that.  This turned out 
> simpler than I thought, because I was misled by believing that the 
> ExecOpenIndices/ExecCloseIndices calls in apply_handle_tuple_routing 
> itself do something useful.  They don't, and should be nuked outright.

My colleague Nisha reported off-list a crash in logical replication that
occurs during unique constraint violations on leaf partitions. Upon
investigation, I confirmed that this crash started happening after commit
9ff6867.

The problem arises because unique key conflict detection relied on the
ExecOpenIndices call in apply_handle_insert_internal and
apply_handle_tuple_routing with the speculative parameter set to true to
construct necessary index information. However, these openings were redundant,
as indexes had already been opened during target partition searches via the
parent table (e.g., using ExecFindPartition). Hence, they were removed in
commit 9ff6867. Unfortunately, ExecFindPartition opens indexes without
constructing the needed index information for conflict detection, which leads
to the crash.

To fix it, I tried to add a detect_conflict flag in ModifyTableState, enabling
ExecFindPartition() to internally build the required index information in this
case, like the attachment.

An alternative approach may be to delay index information initialization until
immediately before executing the actual update or insert. E.g., do that in
InitConflictIndexes(). This approach can also avoid unnecessary construction of
index information when the tuple to be updated is not found, or during a
cross-partition update where index information for the source partition is not
required. However, it introduces an additional location for index
initialization, potentially increasing maintenance efforts.

Best Regards,
Hou zj


v1-0001-Fix-crashes-in-logical-replication-during-unique-.patch
Description:  v1-0001-Fix-crashes-in-logical-replication-during-unique-.patch


Re: AIO v2.5

2025-03-25 Thread Andres Freund
Hi,

On 2025-03-25 12:39:56 -0700, Noah Misch wrote:
> On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> > I don't know if that's an intentional or unintentional behavioral 
> > difference.
> > 
> > There are 2 1/2 ways around this:
> > 
> > 1) Stop using IOSQE_ASYNC heuristic
> > 2a) Wait for all in-flight IOs when any FD gets closed
> > 2b) Wait for all in-flight IOs using FD when it gets closed
> > 
> > Given that we have clear evidence that io_uring doesn't completely support
> > closing FDs while IOs are in flight, be it a bug or intentional, it seems
> > clearly better to go for 2a or 2b.
> 
> Agreed.  If a workload spends significant time on fd.c closing files, I
> suspect that workload already won't have impressive benchmark numbers.
> Performance-seeking workloads will already want to tune FD usage high enough
> to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
> clearly beats the other.  I'd try (2b) first but, if complicated, quickly
> abandon it in favor of (2a).  What other considerations could be important?

The only other consideration I can think of is whether this should happen for
all io_methods or not.

I'm inclined to do it via a bool in IoMethodOps, but I guess one could argue
it's a bit weird to have a bool in a struct called *Ops.

Greetings,

Andres Freund




Re: vacuum_truncate configuration parameter and isset_offset

2025-03-25 Thread Nikolay Shaplov
In a message of Tuesday, 25 March 2025 17:57:46 MSK, Nathan Bossart
wrote:

> In any case, AFAICT the votes are somewhat evenly divided at the moment, so
> I do not intend to proceed with this patch for now.

Counting votes does not lead anywhere, as I could, for example, ask friends and 
colleagues to vote for me. That is the wrong way.

There are people involved in supporting the reloptions engine. If Alvaro said 
"isset_offset is good, let's have it", I would keep my mouth shut, right after 
saying "As you say". Because he is the author of that part of the code, I 
respect that.

I am also someone who has dedicated a lot of effort to working with reloptions, 
and I expect to do so for many years ahead. If we do not revert it, I will have 
to support this isset_offset from now on. I do not like it; I see no logic in 
it, at least the way you suggest it.

If you want to dedicate part of your life to reloptions, you are welcome: I 
need help with code to review. If you really need isset_offset, let's redesign 
options together to use it instead of default_value. Go for it, not partially 
but totally.

But if you want to fix one option and leave, please do not leave us with 
isset_offset.

This is not about democracy; it is about who will deal with that part of the 
code later. I guess that will be me and Alvaro, not you. Please let us have the 
code the way we like it, since we are the ones maintaining it most of the time.

-- 
Nikolay Shaplov aka Nataraj
Fuzzing Engineer at Postgres Professional
Matrix IM: @dhyan:nataraj.su


signature.asc
Description: This is a digitally signed message part.


Re: vacuum_truncate configuration parameter and isset_offset

2025-03-25 Thread Nathan Bossart
On Mon, Mar 24, 2025 at 08:57:27PM -0700, David G. Johnston wrote:
> On Mon, Mar 24, 2025 at 11:45 AM Nathan Bossart 
> wrote:
>> * I don't think this matches the parse_bool() behavior exactly.  For
>>   example, parse_bool() appears to accept inputs like "t" to mean "true".
>>   This is also probably not a huge deal.
> 
> Fixable for sure.

I've attached an attempt at this.

>> That being said, I'm open to applying this patch if it seems like a
>> majority of folks prefer this implementation.  FWIW I'm still partial to
>> the isset_offset approach for the reasons I've already discussed.
>
> I'm preferable to turning the implementation into an enum.  Slight
> preference to documenting it as the boolean it presently is; pending any
> information regarding where in user-space, particularly in psql but also
> any client-side details, this internal detail might show through.  It
> avoids having to tell the reader that we choose enum in order to
> accommodate optionality.  But it isn't the end of the world if we do say
> that either.

The only other place I found that reveals this internal implementation
detail is parse_one_reloption.  In the attached patch, I've modified it to
emit the same error message as for RELOPT_TYPE_BOOL to further uphold the
illusion that it is a Boolean reloption.  I'm not sure I've patched up all
such holes, though.

TBH I find the attached to be even more of a hack than isset_offset.  We
have to try to match behavior in multiple places, and we need lengthy
commentary about why you'd want to use this enum.  Furthermore, it's only
useful for Boolean-style reloptions, unlike isset_offset which can be used
for any type.  But perhaps more importantly, I agree with Robert that it's
not clear what problem it's trying to solve.  The concrete arguments in
this thread range from avoiding extra bytes in the struct to maintaining
consistency with other options with backing GUCs, which I'd argue are more
style preferences than anything.

In any case, AFAICT the votes are somewhat evenly divided at the moment, so
I do not intend to proceed with this patch for now.

-- 
nathan
>From d9034c94d496ba497f65acfe3c80b57b08b4a95c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 24 Mar 2025 12:46:26 -0500
Subject: [PATCH v2 1/1] change vacuum_truncate relopt to enum

---
 src/backend/access/common/reloptions.c | 89 +++---
 src/backend/commands/vacuum.c  |  4 +-
 src/include/access/reloptions.h| 13 
 src/include/utils/rel.h| 15 -
 src/tools/pgindent/typedefs.list   |  1 +
 5 files changed, 82 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/common/reloptions.c 
b/src/backend/access/common/reloptions.c
index 645b5c00467..a011a932e7d 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -147,15 +147,6 @@ static relopt_bool boolRelOpts[] =
},
false
},
-   {
-   {
-   "vacuum_truncate",
-   "Enables vacuum to truncate empty pages at the end of 
this table",
-   RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
-   ShareUpdateExclusiveLock
-   },
-   true
-   },
{
{
"deduplicate_items",
@@ -524,6 +515,43 @@ static relopt_enum_elt_def viewCheckOptValues[] =
{(const char *) NULL}   /* list terminator */
 };
 
+/*
+ * The values are from StdRdOptBool.
+ *
+ * This enum is meant to be used for Boolean reloptions for which we need to
+ * be able to determine whether the value was explicitly set for the relation.
+ * There is a third unusable StdRdOptBool value called
+ * STDRD_OPTION_BOOL_NOT_SET that should be set as the default value for such
+ * options.  Then, code that uses the option can first test for the "not set"
+ * value and fall back to something else (e.g., a GUC) as needed.
+ *
+ * Note that the strings below have been carefully chosen to match the behavior
+ * of parse_bool().
+ */
+static relopt_enum_elt_def StdRdOptBoolValues[] =
+{
+   {"on", STDRD_OPTION_BOOL_ON},
+   {"of", STDRD_OPTION_BOOL_OFF},
+   {"off", STDRD_OPTION_BOOL_OFF},
+   {"t", STDRD_OPTION_BOOL_ON},
+   {"tr", STDRD_OPTION_BOOL_ON},
+   {"tru", STDRD_OPTION_BOOL_ON},
+   {"true", STDRD_OPTION_BOOL_ON},
+   {"f", STDRD_OPTION_BOOL_OFF},
+   {"fa", STDRD_OPTION_BOOL_OFF},
+   {"fal", STDRD_OPTION_BOOL_OFF},
+   {"fals", STDRD_OPTION_BOOL_OFF},
+   {"false", STDRD_OPTION_BOOL_OFF},
+   {"y", STDRD_OPTION_BOOL_ON},
+   {"ye", STDRD_OPTION_BOOL_ON},
+   {"yes", STDRD_OPTION_BOOL_ON},
+   {"n", STDRD_OPTION_BOOL_OFF},
+   {"no", STDRD_OPTION_BOOL_OFF},
+   {"1", STDRD_OPTION_BOOL_ON},
+   {"0", STDRD_OPTION_BOOL_OFF},
+   {(const char *) NULL}   /* list terminator */
+};
+
 static relopt_enum enumRelOpts[] 

Re: Add missing PQclear for StreamLogicalLog function

2025-03-25 Thread Daniel Gustafsson
> On 19 Mar 2025, at 11:03, Daniel Gustafsson  wrote:
> 
>> On 19 Mar 2025, at 06:38, Steven Niu  wrote:
> 
>> During browsing the code, I found one missing PQclear in function 
>> StreamLogicalLog(). It's a very small problem as it only happens in error 
>> condition. However since another similar patch was accepted,
>> maybe we should also fix this one.
> 
> Thanks for the report, the patch looks reasonable so I'll have a look at
> applying it once back from travelling.

Applied to head, thanks!

--
Daniel Gustafsson





Re: Squash constant lists in query jumbling by default

2025-03-25 Thread Tom Lane
Christoph Berg  writes:
> For 2), Tom said that configurability is 1) often much less useful
> than originally planned, and 2) tools have to cope with both settings
> anyway, making implementing them harder. Plus, switching at run-time
> makes the result even less predictable.

To clarify that last bit: if some clients run with the GUC on and some
with it off, you have a mess.  Even statements that are completely
identical will have different query IDs under the two settings.

If this GUC sticks around, it should be at least PGC_SUSET (on
the analogy of compute_query_id) to make it harder to break
pg_stat_statements that way.

regards, tom lane




Re: NOT ENFORCED constraint feature

2025-03-25 Thread Peter Eisentraut

On 21.03.25 06:58, Amul Sul wrote:

I think the next step here is that you work to fix Álvaro's concerns
about the recursion structure.


Yes, I worked on that in the attached version. I refactored
ATExecAlterConstraintInternal() and moved the code that updates the
pg_constraint entry into a separate function (see 0001), so it can be
called from the places where the entry needs to be updated, rather
than revisiting ATExecAlterConstraintInternal(). In 0002,
ATExecAlterConstraintInternal() is split into two functions:
ATExecAlterConstrDeferrability() and
ATExecAlterConstrInheritability(), which handle altering deferrability
and inheritability, respectively. These functions are expected to
recurse on themselves, rather than revisiting
ATExecAlterConstraintInternal() as before. This approach simplifies
things. Similarly, we can add ATExecAlterConstrEnforceability(), which
recurses on itself.


Yeah, I gave this a look and I think this code layout is good.  There
are more functions now, but the code flow is simpler.



Thank you !

Attached is the updated version, where the commit messages for patch
0005 and 0006 have been slightly corrected. Additionally, a few code
comments have been updated to consistently use the ENFORCED/NOT
ENFORCED keywords. The rest of the patches and all the code are
unchanged.


I have committed patches 0001 through 0003.  I made some small changes:

In 0001, I renamed the function UpdateConstraintEntry() to
AlterConstrUpdateConstraintEntry() so the context is clearer.

In 0002, you had this change:

@@ -12113,75 +12154,89 @@ ATExecAlterConstraintInternal(List **wqueue, ATAlterConstraint *cmdcon,
 	 * If the table at either end of the constraint is partitioned, we need to
 	 * handle every constraint that is a child of this one.
 	 */
-	if (recurse && changed &&
+	if (recurse &&
 		(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE ||
-		 (OidIsValid(refrelid) &&
-		  get_rel_relkind(refrelid) == RELKIND_PARTITIONED_TABLE)))
-		ATExecAlterChildConstr(wqueue, cmdcon, conrel, tgrel, rel, contuple,
-							   recurse, otherrelids, lockmode);
+		 get_rel_relkind(refrelid) == RELKIND_PARTITIONED_TABLE))
+		AlterConstrDeferrabilityRecurse(wqueue, cmdcon, conrel, tgrel, rel,
+										contuple, recurse, otherrelids,
+										lockmode);

AFAICT, dropping the "changed" from the conditional was not correct.  Or at
least, it would do redundant work if nothing was "changed".  So I put that
back.  Let me know if that change was intentional or there is something else
going on.

I will work through the remaining patches.  It looks good to me so far.





Re: query_id: jumble names of temp tables for better pg_stat_statement UX

2025-03-25 Thread Sami Imseih
> > Attached is the second one, with more tests coverage with attribute
> > aliases (these being ignored exists in stable branches, but why not
> > while on it) and table aliases, and the fixes for the issues pointed
> > out by Christoph.  I'll double-check all that again tomorrow. Please
> > find an updated version attached for now.

There are several parts of the doc that may no longer hold true.

1/
"Since the queryid hash value is computed on the post-parse-analysis
representation of the queries, the opposite is also possible: queries with
identical texts might appear as separate entries, if they have different
meanings as a result of factors such as different search_path settings."

I think this text could remain as is, because search_path still
matters for things like functions, etc.

"""
postgres=# SET SEARCH_PATH=a;
SET
postgres=# explain verbose select * from test();
QUERY PLAN
--
 Result  (cost=0.00..0.01 rows=1 width=4)
   Output: 1
 Query Identifier: -1813735303617154554
(3 rows)

postgres=# SET SEARCH_PATH=b;
SET
postgres=# explain verbose select * from test();
QUERY PLAN
--
 Result  (cost=0.00..0.01 rows=1 width=4)
   Output: 1
 Query Identifier: -3896107319863686763
(3 rows)
"""

2/
"For example, pg_stat_statements will consider two apparently-identical
queries to be distinct, if they reference a table that was dropped and
recreated between the executions of the two queries."

This is no longer true for relations, but is still true for functions. I think
we should mention the caveats in a bit more detail as this change
will have an impact on the most common case. What about something
like this?

"For example, pg_stat_statements will consider two apparently-identical
queries to be distinct, if they reference a function that was dropped and
recreated between the executions of the two queries. Conversely, if a table is
dropped and recreated between the executions of queries, two
apparently-identical
queries will be considered the same. However, if the alias for a table is
different for semantically similar queries, these queries will be
considered distinct"

--
Sami Imseih
Amazon Web Services (AWS)




Re: query_id: jumble names of temp tables for better pg_stat_statement UX

2025-03-25 Thread Christoph Berg
Re: Michael Paquier
> Attached is the second one, with more tests coverage with attribute
> aliases (these being ignored exists in stable branches, but why not
> while on it) and table aliases, and the fixes for the issues pointed
> out by Christoph.  I'll double-check all that again tomorrow. Please
> find an updated version attached for now.

Looks good to me.

Christoph




Re: AIO v2.5

2025-03-25 Thread Andres Freund
Hi,

On 2025-03-25 17:10:19 +1300, Thomas Munro wrote:
> On Tue, Mar 25, 2025 at 2:18 PM Andres Freund  wrote:
> > Attached v2.12, with the following changes:
> 
> Here's a tiny fixup to make io_concurrency=0 turn on
> READ_BUFFERS_SYNCHRONOUSLY as mooted in a FIXME.  Without this, AIO
> will still run at level 1 even if you asked for 0.  Feel free to
> squash, or ignore and I'll push it later, whatever suits... (tested on
> the tip of your public aio-2 branch).

Thanks! I squashed it into "aio: Basic read_stream adjustments for real AIO"
and updated the commit message to account for that.

Greetings,

Andres Freund




Re: AIO v2.5

2025-03-25 Thread Noah Misch
On Tue, Mar 25, 2025 at 11:57:58AM -0400, Andres Freund wrote:
> On 2025-03-25 07:11:20 -0700, Noah Misch wrote:
> > On Mon, Mar 24, 2025 at 10:52:19PM -0400, Andres Freund wrote:
> > > If we want to implement it, I think we could introduce PGAIO_RS_WARN, 
> > > which
> > > then could tell the stager to issue the WARNING. It would add a bit of
> > > distributed cost, both to callbacks and users of AIO, but it might not be 
> > > too
> > > bad.
> 
> FWIW, I prototyped this, it's not hard.
> 
> But it can't replace the current WARNING with 100% fidelity: If we read 60
> blocks in a single smgrreadv, we today would emit 60 WARNINGs.  But we
> can't encode that many block offsets in a single PgAioResult, there's not enough
> space, and enlarging it far enough doesn't seem to make sense either.
> 
> 
> What we *could* do is to emit one WARNING for each bufmgr.c smgrstartreadv(),
> with that warning saying that there were N zeroed blocks in a read from block
> N to block Y and a HINT saying that there are more details in the server log.

Sounds fine.

> > Another thought on complete_shared running on other backends: I wonder if we
> > should push an ErrorContextCallback that adds "CONTEXT: completing I/O of
> > other process" or similar, so people wonder less about how "SELECT FROM a" 
> > led
> > to a log message about IO on table "b".
> 
> I've been wondering about that as well, and yes, we probably should.
> 
> I'd add the pid of the backend that started the IO to the message - although
> need to check whether we're trying to keep PIDs of other processes from
> unprivileged users.

We don't.

> I think we probably should add a similar, but not equivalent, context in io
> workers. Maybe "I/O worker executing I/O on behalf of process %d".

Sounds good.




Re: AIO v2.5

2025-03-25 Thread Andres Freund
Hi,

On 2025-03-25 06:33:21 -0700, Noah Misch wrote:
> On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> > On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > > (We may be due for a test mode that does smgrreleaseall() at every
> > > CHECK_FOR_INTERRUPTS()?)
> >
> > I suspect we are. I'm a bit afraid of even trying...
> >
> > ...
> >
> > It's extremely slow - but at least the main regression as well as the aio 
> > tests pass!
>
> One less thing!

Unfortunately I'm now doubting the thoroughness of my check - while I made
every CFI() execute smgrreleaseall(), I didn't trigger CFI() in cases where we
trigger it conditionally. E.g. elog(DEBUGN, ...) only executes a CFI if
log_min_messages <= DEBUGN...

I'll try that in a bit.


> > Because the end state varies, depending on the number of previously staged
> > IOs, the IO method and whether batchmode is enabled, I think it's better if
> > the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> > *not* aligned with an internal state name.  It will just mislead readers to
> > think that there's a deterministic mapping when that does not exist.
>
> That's fair.  Could we provide the mapping in a comment, something like the
> following?

Yes!

I wonder if it should also be duplicated or referenced elsewhere, although I
am not sure where precisely.


> --- a/src/include/storage/aio_internal.h
> +++ b/src/include/storage/aio_internal.h
> @@ -34,5 +34,10 @@
>   * linearly through all states.
>   *
> - * State changes should all go through pgaio_io_update_state().
> + * State changes should all go through pgaio_io_update_state().  Its callers
> + * use these naming conventions:
> + *
> + * - A "start" function (e.g. FileStartReadV()) moves an IO from
> + *   PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
> + *   PGAIO_HS_COMPLETED_LOCAL.
>   */
>  typedef enum PgAioHandleState

One detail I'm not sure about: The above change is correct, but perhaps a bit
misleading, because we can actually go "back" to IDLE. Not sure how to best
phrase that though.


> > That's not an excuse for pgaio_io_prep* though, that's a pointlessly 
> > different
> > naming that I just stopped seeing.

I assume you're on board with renaming _io_prep* to _io_start_*?


> > I'll try to think more about this, perhaps I can make myself see your POV
> > more.
>
> > > As the patch stands, LockBufferForCleanup() can succeed when
> > > ConditionalLockBufferForCleanup() would have returned false.
> >
> > That's already true today, right? In master 
> > ConditionalLockBufferForCleanup()
> > for temp buffers checks LocalRefCount, whereas LockBufferForCleanup() 
> > doesn't.
>
> I'm finding a LocalRefCount check under LockBufferForCleanup:

I guess I should have stopped looking at code / replying before my last email
last night... Not sure how I missed that.



> CheckBufferIsPinnedOnce(Buffer buffer)
> {
>   if (BufferIsLocal(buffer))
>   {
>   if (LocalRefCount[-buffer - 1] != 1)
>   elog(ERROR, "incorrect local pin count: %d",
>LocalRefCount[-buffer - 1]);
>   }
>   else
>   {
>   if (GetPrivateRefCount(buffer) != 1)
>   elog(ERROR, "incorrect local pin count: %d",
>GetPrivateRefCount(buffer));
>   }
> }

Pretty random orthogonal thought that I was reminded of by the above code
snippet:

It sure seems we should at some point get rid of LocalRefCount[] and just use
the GetPrivateRefCount() infrastructure for both shared and local buffers.  I
don't think the GetPrivateRefCount() infrastructure cares about
local/non-local, leaving a few asserts aside.  If we do that, and start to use
BM_IO_IN_PROGRESS, combined with ResourceOwnerRememberBufferIO(), the set of
differences between shared and local buffers would be a lot smaller.


> > > Like the comment, I expect it's academic today.  I expect it will stay
> > > academic.  Anything that does a cleanup will start by reading the buffer,
> > > which will resolve any refcnt the AIO subsystems holds for a read.  If 
> > > there's
> > > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > > that.  How about just removing the ConditionalLockBufferForCleanup() 
> > > changes
> > > or replacing them with a comment (like the present paragraph)?
> >
> > I think we'll need an expanded version of what I suggest once we have 
> > writes -
> > but as you say, it shouldn't matter as long as we only have reads. So I 
> > think
> > moving the relevant changes, with adjusted caveats, to the bufmgr: write
> > change makes sense.
>
> Moving those changes works for me.  I'm not currently seeing the need under
> writes, but that may get clearer upon reaching those patches.

FWIW, I don't think it's currently worth looking at the write side in detail,
there's enough required changes to make that not necessarily the best use of
your time at this point. At leas

  1   2   >