Re: Reducing the log spam

2025-04-03 Thread Laurenz Albe
On Wed, 2025-04-02 at 15:23 -0400, Tom Lane wrote:
> Fujii Masao  writes:
> > Filtering log messages by SQLSTATE might be useful for some users,
> > but I'm unsure if it should belong in the core. There are also other
> > potential filtering needs, such as by application name, client host,
> > database, or roles. Adding only SQLSTATE filtering may not be a good idea,
> > while supporting all possible cases in the core wouldn't be practical 
> > either.
> 
> > Given that, I think implementing this as an extension using emit_log_hook
> > would be a better approach. Anyway, I'd like to hear more opinions from
> > other hackers on this.
> 
> I took a brief look and I concur with Fujii-san's conclusion: this'd
> be a fine extension a/k/a contrib module, but it seems a bit too
> opinionated about what sort of filtering is needed to be appropriate
> within elog.c itself.
> 
> Also, just as a minor coding thought, I'd suggest using -1 not 0
> to terminate the array of encoded SQLSTATEs, rather than assuming
> that nobody would write 0.  Compare commit 58fdca220.
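(For illustration only: a hedged sketch of what such an extension hook plus the
suggested -1 sentinel could look like. emit_log_hook, ErrorData, and MAKE_SQLSTATE
exist in core; the array contents, names, and the _PG_init wiring are hypothetical
and not the withdrawn patch.)

#include "postgres.h"
#include "utils/elog.h"

static emit_log_hook_type prev_emit_log_hook = NULL;

/* encoded SQLSTATEs to suppress, terminated by -1 rather than 0 */
static const int suppressed_sqlstates[] = {
	MAKE_SQLSTATE('2', '3', '5', '0', '5'),		/* unique_violation, for example */
	-1
};

static void
filter_emit_log_hook(ErrorData *edata)
{
	for (int i = 0; suppressed_sqlstates[i] != -1; i++)
	{
		if (edata->sqlerrcode == suppressed_sqlstates[i])
		{
			edata->output_to_server = false;
			break;
		}
	}

	if (prev_emit_log_hook)
		prev_emit_log_hook(edata);
}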

I agree with these assessments and will withdraw the patch.

Yours,
Laurenz Albe




Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-04-03 Thread Peter Eisentraut
It occurred to me that we will also want to have NOT NULL NOT ENFORCED 
constraints eventually.  As we have discussed elsewhere, the NOT 
ENFORCED state is closely related to the NOT VALID state.  So that 
should probably be considered in the design here.


Reading up on this again now, I'm confused about putting the NOT VALID 
state for not-null constraints into pg_attribute.  We have catalogued 
not-null constraints now, so we can put metadata for them into 
pg_constraint!  And we have NOT VALID and NOT ENFORCED flags in 
pg_constraint already.


So what is the purpose of the attnotnullvalid field?  In the latest 
posted patch, I don't see this column used in the executor for the 
actual constraint checking.  So is this all merely for clients to 
understand the constraint metadata?  If we add more metadata for 
not-null constraints, do we need to add a new pg_attribute flag for each 
one?  That doesn't seem right.






Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-04-03 Thread Alvaro Herrera
On 2025-Apr-03, Peter Eisentraut wrote:

> It occurred to me that we will also want to have NOT NULL NOT ENFORCED
> constraints eventually.  As we have discussed elsewhere, the NOT
> ENFORCED state is closely related to the NOT VALID state.  So that
> should probably be considered in the design here.

Yeah, I don't think there's time to shoehorn in NOT ENFORCED status for
not-null constraints.  I'd guess that it'd take at least a couple of
weeks to make that work.

> Reading up on this again now, I'm confused about putting the NOT VALID
> state for not-null constraints into pg_attribute.  We have catalogued
> not-null constraints now, so we can put metadata for them into
> pg_constraint!  And we have NOT VALID and NOT ENFORCED flags in
> pg_constraint already.
> 
> So what is the purpose of the attnotnullvalid field?  In the latest posted
> patch, I don't see this column used in the executor for the actual
> constraint checking.  So is this all merely for clients to understand the
> constraint metadata?  If we add more metadata for not-null constraints, do
> we need to add a new pg_attribute flag for each one?  That doesn't seem
> right.

The new flag is there for quick access by get_relation_info.  We could
easily not have it otherwise, because clients don't need it, but its
lack would probably make planning measurably slower because it'd have to
do syscache access for every single not-null constraint to figure out if
it's valid or not.
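(For context, a hedged sketch of the kind of per-column catalog probe the flag
avoids during planning; findNotNullConstraintAttnum() exists in core, while the
surrounding loop, "relation", "attno" and "notnullattnums" are assumed names.)

/* Sketch only: one catalog lookup per not-null column, per planned query. */
HeapTuple	tup = findNotNullConstraintAttnum(RelationGetRelid(relation), attno);

if (HeapTupleIsValid(tup))
{
	if (((Form_pg_constraint) GETSTRUCT(tup))->convalidated)
		notnullattnums = bms_add_member(notnullattnums, attno);
	heap_freetuple(tup);
}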

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)




Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

2025-04-03 Thread Bertrand Drouvot
Hi,

On Thu, Apr 03, 2025 at 09:25:02AM +0530, vignesh C wrote:
> On Mon, 31 Mar 2025 at 18:35, Bertrand Drouvot
>  wrote:
> >
> Couple of suggestions:

Thanks for looking at it!

> 1) I felt we can include a similar verification for one of the logical
> replication tests too:
> +# Wait for the walsender to update its IO statistics.
> 
> +# Has to be done before the next restart and far enough from the
> +# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
> +$node_primary->poll_query_until(
> + 'postgres',
> + qq[SELECT sum(reads) > 0
> +   FROM pg_catalog.pg_stat_io
> +   WHERE backend_type = 'walsender'
> +   AND object = 'wal']
> +  )
> +  or die
> +  "Timed out while waiting for the walsender to update its IO statistics";
> +

Initially ([1]) it was added in 035_standby_logical_decoding.pl. But this
test (035) is already racy enough that I felt it was better to move it to
001_stream_rep.pl (see [2]) instead.

I don't think that's a big issue as the code path being changed in the patch is
shared between logical and physical walsenders. That said, it would not hurt to
add a logical walsender test, but I would prefer it to be outside
035_standby_logical_decoding.pl. Do you have a suggestion for the location of
such a test?

> 2) We can comment this in a single line itself:
> + /*
> + * Report IO statistics
> + */

Right, but:

- it's already done that way in walsender.c (see "Try to flush any pending
  output to the client" for example).
- the exact same comment is written that way in pgstat_bgwriter.c and
  pgstat_checkpointer.c

So I just copied/pasted it.

[1]: 
https://www.postgresql.org/message-id/Z73IsKBceoVd4t55%40ip-10-97-1-34.eu-west-3.compute.internal
[2]: 
https://www.postgresql.org/message-id/Z77jgvhwOu9S0a5r%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade: Support for upgrading to checksums enabled

2025-04-03 Thread Peter Eisentraut

On 11.03.25 11:42, Peter Eisentraut wrote:
Here is an updated patch that works more along those lines.  It adds a 
pg_upgrade option --update-checksums, which activates the code to 
rewrite the checksums.  You must specify this option if the source and 
target clusters have different checksum settings.


Note that this also works to hypothetically upgrade between future 
different checksum versions (hence "--update-*", not "--enable-*"). 
Also, as the patch is currently written, it is also required to specify 
this option to downgrade from checksums to no-checksums.  (It will then 
write a zero into the checksum place, as it would look if you had never 
used checksums.)  Also, you can optionally specify this option even if
the checksum settings are the same; in that case it will recalculate the
checksums.  Probably not all of this is useful, but perhaps a subset of
it.  Thoughts?
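(Illustrative only: a hedged sketch of the per-block rewrite this option implies,
using the existing pg_checksum_page() helper. The surrounding variables, such as
buffer, nblocks, segment_blkno and target_has_checksums, are hypothetical and not
the patch's actual copyFile() code.)

/* Hedged sketch, not the actual patch code. */
for (BlockNumber i = 0; i < nblocks; i++)
{
	char	   *page = buffer + (size_t) i * BLCKSZ;

	if (target_has_checksums)
		((PageHeader) page)->pd_checksum =
			pg_checksum_page(page, segment_blkno + i);
	else
		((PageHeader) page)->pd_checksum = 0;	/* as if checksums were never enabled */
}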


Also, I still don't know what to do about the Windows code path in 
copyFile().  We could just not support this feature on Windows?  Or 
maybe the notionally correct thing to do would be to move that code into 
copyFileByRange().  But then we'd need a different default on Windows 
and it would require more documentation.  I don't know what to do here 
and I don't have enough context to make a suggestion.  But if we don't 
answer this, I don't think we can move ahead with this feature.


I'm not sensing much enthusiasm for this feature or for working out the 
remaining problems, so I'm closing this commitfest entry.






Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-04-03 Thread jian he
On Thu, Apr 3, 2025 at 2:24 AM Alvaro Herrera  wrote:
>
> create table singlepp (id bigint default 1) partition by list (id);
> alter table singlepp add constraint dummy_constr not null id not valid;
> create table singlepp_1 (id bigint default 1);
> alter table singlepp_1 add constraint dummy_constr not null id;
> alter table singlepp attach partition singlepp_1 for values in ('1');
>
> Here, conislocal for the constraint on singlepp_1 is false.
>
> select conislocal from pg_constraint where conrelid = 'singlepp_1'::regclass;
>  conislocal
> 
>  f
>
> if I run pg_dump and restore in a different database, it emits this:
>
> CREATE TABLE public.singlepp (
> id bigint DEFAULT 1
> )
> PARTITION BY LIST (id);
> CREATE TABLE public.singlepp_1 (
> id bigint DEFAULT 1 CONSTRAINT dummy_constr NOT NULL
> );
> ALTER TABLE ONLY public.singlepp ATTACH PARTITION public.singlepp_1 
> FOR VALUES IN ('1');
> ALTER TABLE public.singlepp
> ADD CONSTRAINT dummy_constr NOT NULL id NOT VALID;
>

Thanks for mentioning flagInhAttrs!
For table partitioning, the V6 pg_dump output is correct.
The conislocal discrepancy between before and after pg_dump can be
fixed (adjusted) in AdjustNotNullInheritance.

Per the above quoted example, the main idea is that
ALTER TABLE public.singlepp
ADD CONSTRAINT dummy_constr NOT NULL id NOT VALID;
will cascade to table singlepp_1.
However, since singlepp_1 already has a valid NOT NULL constraint,
merging occurs:
singlepp_1's coninhcount value increases from 0 to 1.
While at it, we can also set conislocal to false.
With the same idea, the pg_constraint.convalidated discrepancy before
and after pg_dump is also resolved.


But we need to change the pg_dump output for table inheritance.
For table inheritance:
CREATE TABLE inhnn (a INTEGER);
ALTER TABLE inhnn ADD CONSTRAINT cc not null a NOT VALID;
CREATE TABLE inhnn_cc(a INTEGER) INHERITS(inhnn);

the V6 output is
CREATE TABLE public.inhnn (a integer);
CREATE TABLE public.inhnn_cc ( a integer) INHERITS (public.inhnn);
ALTER TABLE public.inhnn ADD CONSTRAINT cc NOT NULL a NOT VALID;

we need to change it to
CREATE TABLE public.inhnn (a integer);
CREATE TABLE public.inhnn_cc (a integer CONSTRAINT cc NOT NULL)
INHERITS (public.inhnn);
ALTER TABLE public.inhnn ADD CONSTRAINT cc NOT NULL a NOT VALID;

so that after pg_dump we can still have a state where the parent's
constraint is invalid and the child's is valid.



Summary:
For the case of an invalid parent constraint with valid child constraints,
pg_dump's output changes the convalidated and conislocal column values.

To resolve this issue:
For table partitioning: the V6 pg_dump output works fine, but we need to
change the function AdjustNotNullInheritance.
For table inheritance: we need to change the pg_dump output and also change
MergeWithExistingConstraint.


Needless to say, the attached scratch96.sql is used to test the pg_dump
differences before and after.
You can compare V6 and my changes.


scratch96.sql
Description: application/sql


v6-0001-ensure-pg_dump-table-constraint-info-remain-th.no-cfbot
Description: Binary data


Re: Test to dump and restore objects left behind by regression

2025-04-03 Thread Alvaro Herrera
On 2025-Apr-03, Ashutosh Bapat wrote:

> Looks like the problem is in the test itself as pointed out by Jeff in
> [1]. PFA patch fixing the test and enabling statistics back.

Thanks, pushed.

> A note about variable name changes and introduction of new variables.
> We run step 2 between 1 and 3 so that autovacuum gets a chance to run
> on the old cluster and update statistics. Autovacuum run is not
> necessary but useful here. Before these changes all the cluster
> initializations were using the same variables @initdb_params and
> %node_params. However with these changes, we modify the variable in
> step 2 and then again require original values in step 4. So I have
> used two sets of variables prefixed with old_ and new_ for clusters
> created in 1st step and 2nd step respectively. 4th step uses the
> variables with prefix old_. I think this change eliminates confusion
> caused by using same variables with different values.

This was a good change, thanks.

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No es bueno caminar con un hombre muerto"




Re: Draft for basic NUMA observability

2025-04-03 Thread Bertrand Drouvot
Hi,

On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra  wrote:
> 
> Hi Tomas,
> 
> > OK, so you agree the commit messages are complete / correct?
> 
> Yes.

Not 100% sure on my side. 

=== v21-0002

Says:

"
This introduces three new functions:

- pg_buffercache_init_entries
- pg_buffercache_build_tuple
- get_buffercache_tuple
"

While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple()
is).

About v21-0002:

=== 1

I can see that the pg_buffercache_init_entries() helper comments are added in
v21-0003 but I think that it would be better to add them in v21-0002
(where the helper is actually created).

About v21-0003:

=== 2

> I hear you, attached v21 / 0003 is free of float/double arithmetics
> and uses non-float point values.

+   if (buffers_per_page > 1)
+   os_page_query_count = NBuffers;
+   else
+   os_page_query_count = NBuffers * pages_per_buffer;

yeah, that's more elegant. I think that it properly handles the relationships
between buffer and page sizes without relying on float arithmetic.

=== 3

+   if (buffers_per_page > 1)
+   {

As buffers_per_page does not change, I think I'd put this check outside of the
for (i = 0; i < NBuffers; i++) loop, something like:

"
if (buffers_per_page > 1) {
/* BLCKSZ < PAGESIZE: one page hosts many Buffers */
for (i = 0; i < NBuffers; i++) {
.
.
.
.
} else {
/* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
for (i = 0; i < NBuffers; i++) {
.
.
.
"

=== 4

> That _numa_prepare_ptrs() is unused and will need to be removed,
> but we can still move some code there if necessary.

Yeah I think that it can be simply removed then.

=== 5

> @Bertrand: do you have anything against pg_shm_allocations_numa
> instead of pg_shm_numa_allocations? I don't mind changing it...

I think that pg_shm_allocations_numa() is better (given the examples you just
shared).

=== 6

> I find all of those non-user friendly and I'm afraid I won't be able
> to pull that alone in time...

Maybe we could add a few words in the doc to mention the "multiple nodes per
buffer" case? And try to improve it for, say, 19? Also maybe we should just
focus on up to v21-0003 (and discard v21-0004 for 18).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.

2025-04-03 Thread vignesh C
On Thu, 3 Apr 2025 at 09:07, Peter Smith  wrote:
>
> Hi,
>
> I was away for the last few weeks when this feature got committed, so
> I missed the chance to post my comments earlier.
>
> It seems the final SGML docs are mostly from this [1] suggestion, but
> I think there are one or two more improvements that can be made to it:
>
> ==
>
> 1.
> It is not at all obvious from the current Options syntax that
> -R/--remove must take a value.
> A objtype should be included here to address that.
>
> Also, putting the "publications" as a  renders the HTML
> better IMO, making it way easier to recognize that "publications" is
> an object type, and also making future object types easier to add.
>
> ~
>
> 2. markup
> Use SGML  markup for --dry-run
> Use SGML  markup for pg_dump
> Use SGML  markup for the specific object type value "publications"
>
> ~
>
> 3.
> Instead of "all tables" publications, we can call these FOR ALL TABLES
> publications. That will be consistent with the Notes later on this
> page.
>
> ~
>
> 4.
> "are individually logged and do show up in a --dry-run"
>
> I felt that "and do show up" is really just another way of saying "logged".
> So, maybe reword this to say "are individually logged, including
> during a --dry-run"
>
> ~~~
>
> Please find attached a small patch that implements all these changes.

Thanks for the patch, the changes look good to me.

Regards,
Vignesh




Re: AIX support

2025-04-03 Thread wenhui qiu
Hi Srirama,
It's getting close to code freeze. Any updates from your end?


Thanks

On Tue, Mar 18, 2025 at 12:58 AM Srirama Kucherlapati 
wrote:

> Hi Team,
>
> Here are the updates on the meson on AIX.  Our team had pushed the fixes
> meson github here …
>
>
>
> https://github.com/mesonbuild/meson/pull/14335
>
>
>
> #14335 Enhance AIX shared library build to use an export List.
> 


Quote-less file names in error messages

2025-04-03 Thread Kyotaro Horiguchi
Hello,

The recent commit 2da74d8d640 added the following new messages:

+   libpq_append_conn_error(conn, "could not open ssl keylog file %s: %s",
+   libpq_append_conn_error(conn, "could not write to ssl keylog file %s: %s

However, I believe our convention is to quote file names in such messages.

The attached patch 0001 fixes this issue.

While working on this, I found a few other instances of the same
issue:

- A WARNING message in `fd.c` seems worth fixing (0002).

- Two DEBUG2 messages in `xlog.c` seem less important to fix (0003).

- I don't think it's worth bothering with those in developer tools (0004).

Please find the attached patches.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 408e4b89f7457dba3f96732addea2ee03a09ae5f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Apr 2025 11:30:12 +0900
Subject: [PATCH 1/4] Quote file names in recently introduced error messages

Add missing quotes around file names in the messages added by
recent commit 2da74d8d640.
---
 src/interfaces/libpq/fe-secure-openssl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index 4bfd8e0447c..78f9e84eb35 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -711,7 +711,7 @@ SSL_CTX_keylog_cb(const SSL *ssl, const char *line)
 
 	if (fd == -1)
 	{
-		libpq_append_conn_error(conn, "could not open ssl keylog file %s: %s",
+		libpq_append_conn_error(conn, "could not open ssl keylog file \"%s\": %s",
 conn->sslkeylogfile, pg_strerror(errno));
 		return;
 	}
@@ -719,7 +719,7 @@ SSL_CTX_keylog_cb(const SSL *ssl, const char *line)
 	/* line is guaranteed by OpenSSL to be NUL terminated */
 	rc = write(fd, line, strlen(line));
 	if (rc < 0)
-		libpq_append_conn_error(conn, "could not write to ssl keylog file %s: %s",
+		libpq_append_conn_error(conn, "could not write to ssl keylog file \"%s\": %s",
 conn->sslkeylogfile, pg_strerror(errno));
 	else
 		rc = write(fd, "\n", 1);
-- 
2.43.5

>From d761188a30b9543e7e1ecb71b6f3c5a64cb41dcf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Apr 2025 11:34:10 +0900
Subject: [PATCH 2/4] Quote file names in an existing WARNING message

Add a missing pair of quotes around a file name in an existing WARNING
message in fd.c.
---
 src/backend/storage/file/fd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 0e8299dd556..7f74d3be258 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3293,7 +3293,7 @@ CleanupTempFiles(bool isCommit, bool isProcExit)
 else if (fdstate & FD_CLOSE_AT_EOXACT)
 {
 	elog(WARNING,
-		 "temporary file %s not closed at end-of-transaction",
+		 "temporary file \"%s\" not closed at end-of-transaction",
 		 VfdCache[i].fileName);
 	FileClose(i);
 }
-- 
2.43.5

>From 19d6bfdc9399c2e1dc0c933fdc6b4491851c55ca Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Apr 2025 11:42:23 +0900
Subject: [PATCH 3/4] Quote file names in existing DEBUG2 messages

Add missing quotes around file names in existing DEBUG2 messages in
xlog.c.
---
 src/backend/access/transam/xlog.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ec40c0b7c42..083d1a17272 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4021,7 +4021,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr,
 	 */
 	XLogFileName(lastoff, 0, segno, wal_segment_size);
 
-	elog(DEBUG2, "attempting to remove WAL segments older than log file %s",
+	elog(DEBUG2, "attempting to remove WAL segments older than log file \"%s\"",
 		 lastoff);
 
 	xldir = AllocateDir(XLOGDIR);
@@ -4098,7 +4098,7 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
 	 */
 	XLogFileName(switchseg, newTLI, switchLogSegNo, wal_segment_size);
 
-	elog(DEBUG2, "attempting to remove WAL segments newer than log file %s",
+	elog(DEBUG2, "attempting to remove WAL segments newer than log file \"%s\"",
 		 switchseg);
 
 	xldir = AllocateDir(XLOGDIR);
-- 
2.43.5

>From fd283b2a8ad28f25054c86fa8cadba4f1b982a12 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Apr 2025 11:45:54 +0900
Subject: [PATCH 4/4] Quote file names in developer tool messages

Add missing quotes around file names in some messages in developer tools.
---
 src/interfaces/ecpg/test/pg_regress_ecpg.c | 10 +-
 src/tools/pg_bsd_indent/args.c |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/interfaces/ecpg/test/pg_regress_ecpg.c b/src/interfaces/ecpg/test/pg_regress_ecpg.c
index ba3477f9dd8..88ccb58433d 100644
--- a/src/interfaces/ecpg/test/pg_regress_ecpg.c
+++ b/s

Re: [PoC] Reducing planning time when tables have many partitions

2025-04-03 Thread David Rowley
On Fri, 4 Apr 2025 at 00:34, David Rowley  wrote:
> I've attached 2 patches, which I think addresses most of this, aside
> from the last point.
>
> These do need more work. I've just attached what I have so far before
> I head off for the day. I am planning on running some performance
> tests tomorrow and doing a round on the comments.

I've done some further work on this, mostly relating to the code
comments. I also removed the now-empty
dispose_eclass_member_iterator() function.

A couple of things which I'm still uncertain of:

1. How to handle the ec_childmembers array in _outEquivalenceClass().
There's no field to know the size of the array. Maybe I should add one
and then print out the non-empty lists.
2. When processing RELOPT_OTHER_JOINREL in add_child_eq_member(), I'm
adding the member to each List for all individual relid mentioned in
child_relids.  This will result in the member going on multiple Lists
and cause the iterator to possibly return the member multiple times.
That might matter in a few places, e.g.
generate_join_implied_equalities_normal() keeps some scoring based on
the number of members.

For #2, Yuya's Bitmapset approach didn't suffer from this issue as the
Bitmapsets would be unioned to get the non-duplicative members. I
wondered about doing list_append_unique() instead of lappend() in
generate_join_implied_equalities_normal(). Unsure. The only other
thing I can think of is to do something else with members for
RELOPT_OTHER_JOINREL and store them elsewhere.

I also did some benchmarking using the attached script. I've attached
the results of running that on my AMD Zen2 machine. See the end of the
script for the CREATE TABLE statement for loading that into postgres.

The results look pretty good. v37 came out slightly faster than v36,
either noise or because of dispose_eclass_member_iterator() removal.

-- overall plan time.
select testname,sum(plan_time)::int as plan_ms from bench_results
group by 1 order by 2;
     testname     | plan_ms
------------------+---------
 v37_patch        |    6806
 v36_patch        |    6891
 v35_patch        |    6917
 master_1aff1dc8d |   21113

-- plan time by number of joins for 1024 parts
select testname,joins,sum(plan_time)::int as "plan_ms" from
bench_results where parts=1024 group by 1,2 order by 2,1;
     testname     | joins | plan_ms
------------------+-------+---------
 master_1aff1dc8d |     0 |     239
 v35_patch        |     0 |     120
 v36_patch        |     0 |     120
 v37_patch        |     0 |     119
 master_1aff1dc8d |     1 |     485
 v35_patch        |     1 |     181
 v36_patch        |     1 |     184
 v37_patch        |     1 |     180
 master_1aff1dc8d |     2 |     832
 v35_patch        |     2 |     252
 v36_patch        |     2 |     253
 v37_patch        |     2 |     249
 master_1aff1dc8d |     3 |    1284
 v35_patch        |     3 |     342
 v36_patch        |     3 |     338
 v37_patch        |     3 |     337
 master_1aff1dc8d |     4 |    1909
 v35_patch        |     4 |     427
 v36_patch        |     4 |     435
 v37_patch        |     4 |     435
 master_1aff1dc8d |     5 |    2830
 v35_patch        |     5 |     530
 v36_patch        |     5 |     540
 v37_patch        |     5 |     535
 master_1aff1dc8d |     6 |    4759
 v35_patch        |     6 |     685
 v36_patch        |     6 |     691
 v37_patch        |     6 |     681

-- The memory used is about the same as before:
select testname,joins,sum(mem_alloc)::int as mem_alloc from
bench_results group by 1,2 order by 2,1;
     testname     | joins | mem_alloc
------------------+-------+-----------
 master_1aff1dc8d |     0 |    231110
 v35_patch        |     0 |    233662
 v36_patch        |     0 |    233662
 v37_patch        |     0 |    233662
 master_1aff1dc8d |     1 |    432685
 v35_patch        |     1 |    435369
 v36_patch        |     1 |    435369
 v37_patch        |     1 |    435369
 master_1aff1dc8d |     2 |    476916
 v35_patch        |     2 |    476300
 v36_patch        |     2 |    476300
 v37_patch        |     2 |    476300
 master_1aff1dc8d |     3 |    801834
 v35_patch        |     3 |    801372
 v36_patch        |     3 |    801372
 v37_patch        |     3 |    801372
 master_1aff1dc8d |     4 |    917312
 v35_patch        |     4 |    917015
 v36_patch        |     4 |    917015
 v37_patch        |     4 |    917015
 master_1aff1dc8d |     5 |   1460833
 v35_patch        |     5 |   1460701
 v36_patch        |     5 |   1460701
 v37_patch        |     5 |   1460701
 master_1aff1dc8d |     6 |   2550570
 v35_patch        |     6 |   2639395
 v36_patch        |     6 |   2639395
 v37_patch        |     6 |   2639395

David
#!/bin/bash

seconds=60
dbname=postgres
testname=$1
loops=10

psql -c "drop table if exists lp;" $dbname
psql -c "create table lp (a int not null) partition by list(a);" $dbname
psql -c "alter system set max_parallel_workers_per_gather = 0;" $dbname
psql -c "select pg_reload_conf();" $dbname

rm /tmp/partbench_re

Re: Conflict detection for multiple_unique_conflicts in logical replication

2025-04-03 Thread Amit Kapila
On Fri, Apr 4, 2025 at 10:41 AM Peter Smith  wrote:
>
> My point is, if it is deemed useful for a user to know if a *single*
> conflict was caused by an INSERT or by an UPDATE, then why is it not
> equally useful to know if *multiple* conflicts were caused by an
> INSERT or by an UPDATE?
>

The reason is that users can resolve a single insert_exists or
update_exists by using a last_update_wins kind of resolution strategy,
either manually or automatically (in the future, after we get that
patch committed). The same may not be true for the multiple-rows case,
or at least it won't be as simple as the single-row case: one may
need to consider removing multiple rows, which can lead to data
inconsistency, so we are planning an ERROR as the default resolution
strategy. This could be the reason that even BDR has a single
conflict_type for this case [1][2]. I don't deny the possibility that
in the future someone comes up with a case where separate conflict types
for these make sense, and at that time we can consider adding more
conflict types, but for now there is no solid case for it, which is why
we kept a single conflict type.

[1] - 
https://www.enterprisedb.com/docs/pgd/4/bdr/conflicts/#insert-operations-that-violate-multiple-unique-constraints
[2] - 
https://www.enterprisedb.com/docs/pgd/4/bdr/conflicts/#update-operations-that-violate-multiple-unique-constraints

-- 
With Regards,
Amit Kapila.




Re: New criteria for autovacuum

2025-04-03 Thread Konstantin Knizhnik



On 03/04/2025 6:50 pm, Aleksander Alekseev wrote:

Hi,


... and it is claimed that autovacuum will never be triggered in order
to set hint bits, assuming we never modify the table again.

Actually I waited a bit and got a better EXPLAIN:

```
  Index Only Scan using humidity_idx on humidity  (cost=0.42..181.36
rows=1970 width=4) (actual time=0.372..16.869 rows=2904.00 loops=1)
Index Cond: ((ts >= '2022-01-01 00:00:00'::timestamp without time
zone) AND (ts < '2022-05-02 00:00:00'::timestamp without time zone)
AND (city = 'M
oscow'::text))
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=44
  Planning Time: 0.520 ms
  Execution Time: 17.980 ms
(7 rows)
```

This happens when CHECKPOINT starts:

```
2025-04-03 18:36:23.435 MSK [209952] LOG:  checkpoint starting: time
```

Interestingly it takes unusually long for my toy database:

```
2025-04-03 18:40:53.082 MSK [209952] LOG:  checkpoint complete: wrote
3522 buffers (5.4%), wrote 1 SLRU buffers; 0 WAL file(s) added, 0
removed, 5 recycled; write=269.463 s, sync=0.029 s, total=269.647 s;
sync files=32, longest=0.004 s, average=0.001 s; distance=68489 kB,
estimate=68489 kB; lsn=0/F4F3870, redo lsn=0/F167DD0
```

There is nothing in between these two lines.

To my humble knowledge, CHECKPOINT shouldn't set hint bits and shouldn't
take that long. At this point I don't know what's going on.

This is `master` branch, b82e7eddb023.


Checkpoint is not setting hint bits and not updating the VM;
it just writes dirty pages to disk.

My patch includes a simple test reproducing the problem. You can check
that the VM is never updated in this case and that an explicit checkpoint
doesn't solve the issue.
Certainly a manual vacuum will help. But the user would somehow have to
notice that an index-only scan is not used (or is used but performs a
large number of heap fetches), understand that the problem is the
not-updated VM, and that a vacuum is needed to fix it.


In your example both sessions are inserting data into the table. A vacuum
performed in one session doesn't take into account records created by an
uncommitted transaction in another session.
So I do not think that the plan in your case improved because of the
checkpoint. Most likely autovacuum was just triggered for this table and
updated the VM.


What is needed to reproduce the problem?
1. A table with populated data
2. The presence of a transaction with an assigned XID which prevents vacuum
from marking pages of this table as all-visible
3. A vacuum or autovacuum that processed this table (to eliminate dead
tuples and reset the number of inserted tuples since the last vacuum).


After these 3 steps autovacuum will never be called for this table (at
least until a forced vacuum caused by wraparound).

And IOS will not be used, or will be very inefficient, for this table.












Re: Make COPY format extendable: Extract COPY TO format implementations

2025-04-03 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Mon, 31 Mar 2025 10:05:34 -0700,
  "David G. Johnston"  wrote:

>  The CopyFromInFunc API allows for each attribute to somehow
> have its I/O format individualized.  But I don't see how that is practical
> or useful, and it adds burden on API users.

If an extension wants to use I/O routines, it can use the
CopyFromInFunc API. Otherwise it can provide an empty
function.

For example,
https://github.com/MasahikoSawada/pg_copy_jsonlines/blob/master/copy_jsonlines.c
uses the CopyFromInFunc API but
https://github.com/kou/pg-copy-arrow/blob/main/copy_arrow.cc
uses an empty function for the CopyFromInFunc API.

The "it adds burden" means that "defining an empty function
is inconvenient", right? See also our past discussion for
this design:

https://www.postgresql.org/message-id/ZbijVn9_51mljMAG%40paquier.xyz

> Keeping empty options does not strike as a bad idea, because this
> forces extension developers to think about this code path rather than
> just ignore it.


> I suggest we remove both .CopyFromInFunc and .CopyFromStart/End and add a
> property to CopyFromRoutine (.ioMode?) with values of either Copy_IO_Text
> or Copy_IO_Binary and then just branch to either:
> 
> CopyFromTextLikeInFunc & CopyFromTextLikeStart/End
> or
> CopyFromBinaryInFunc & CopyFromStart/End
> 
> So, in effect, the only method an extension needs to write is converting
> to/from the 'serialized' form to the text/binary form (text being near
> unanimous).

I object to this API. If we choose this API, we can only create
custom COPY formats that are compatible with PostgreSQL's
text/binary form. For example, the above jsonlines format
and the Apache Arrow format couldn't be implemented. It's meaningless
to introduce this custom COPY format mechanism with the
suggested API.

> It seems to me that CopyFromOneRow could simply produce a *string
> collection,
> one cell per attribute, and NextCopyFrom could do all of the above on a
> for-loop over *string

You suggest that we use a string collection instead of a
Datum collection in CopyFromOneRow() and convert a string
collection to a Datum collection in NextCopyFrom(), right?

I object to this API because it has needless string <-> Datum
conversion overhead. For example,
https://github.com/MasahikoSawada/pg_copy_jsonlines/blob/master/copy_jsonlines.c
parses a JSON value to a Datum. If we use this API, we need to
convert the parsed Datum to a string in the extension, and
NextCopyFrom() then re-converts that string to a
Datum. It will slow down custom COPY formats.
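(A hedged illustration of the overhead being described; the variable names are
hypothetical, while InputFunctionCall() and Int32GetDatum() are existing core
functions.)

/* With a Datum-based callback, the extension returns the parsed value directly: */
values[attnum] = Int32GetDatum(parsed_int);

/*
 * With a string-based callback, the already-parsed value would have to be
 * re-serialized and pushed through the type's input function a second time:
 */
char		buf[32];

snprintf(buf, sizeof(buf), "%d", parsed_int);
values[attnum] = InputFunctionCall(&in_func, buf, typioparam, typmod);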

I want this custom COPY format feature for performance. So
APIs that require needless overhead for non text/csv/binary
formats aren't acceptable to me.


Thanks,
-- 
kou




Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-04-03 Thread Rushabh Lathia
On Thu, Apr 3, 2025 at 8:33 PM Peter Eisentraut 
wrote:

> On 03.04.25 10:07, Alvaro Herrera wrote:
> > The new flag is there for quick access by get_relation_info.  We could
> > easily not have it otherwise, because clients don't need it, but its
> > lack would probably make planning measurably slower because it'd have to
> > do syscache access for every single not-null constraint to figure out if
> > it's valid or not.
>
> In the v6 patch, you are adding a attnullability field to the
> CompactAttribute in the tuple descriptor and use that in
> get_relation_info().  That seems like the right approach, because then
> you're doing all that preprocessing about which constraint is active in
> the relcache.  So I don't see where the extra pg_attribute field
> attnotnullvalid is getting used.'
>

attnotnullvalid is getting used to populate the CompactAttribute
(populate_compact_attribute_internal).
The primary reason for adding a new field to pg_attribute is to avoid the
need for an additional scan of pg_constraint when populating
CompactAttribute, as this extra scan introduces performance overhead
while retrieving catalog information for a relation.



-- 
Rushabh Lathia


Re: Historic snapshot doesn't track txns committed in BUILDING_SNAPSHOT state

2025-04-03 Thread Masahiko Sawada
On Mon, Aug 12, 2024 at 9:25 PM cca5507  wrote:
>
> Hi,
>
> I refactor the code and fix the git apply warning according to [1].
>
> Here are the new version patches.
>

I've investigated the reported issue and reviewed the patch. IIUC, to
fix this issue, what we need to do is to track catalog-change
transactions even in the BUILDING_SNAPSHOT phase so that we can correctly
decode transactions that started in FULL_SNAPSHOT. My question is:
in order to track just catalog-change transactions, is it
sufficient to check whether XLOG_XACT_COMMIT[_PREPARED] has the
XACT_XINFO_HAS_INVALS flag? If yes, we probably should change only
xact_decode() to check the commit records even in BUILDING_SNAPSHOT.
Otherwise, we would need to change nearly all paths where we mark the
transaction as catalog-change, as the patch does.
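(To make the question concrete, a hedged sketch of the narrower check, not a
patch; SnapBuildCurrentState(), SNAPBUILD_BUILDING_SNAPSHOT, XLOG_XACT_COMMIT and
XACT_XINFO_HAS_INVALS are existing symbols, while the surrounding control flow in
xact_decode() is assumed.)

/* Sketch only: while handling a commit record in xact_decode(). */
if (SnapBuildCurrentState(builder) == SNAPBUILD_BUILDING_SNAPSHOT &&
	(parsed.xinfo & XACT_XINFO_HAS_INVALS) != 0)
{
	/* remember this xid as a catalog-modifying transaction */
}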

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Draft for basic NUMA observability

2025-04-03 Thread Bertrand Drouvot
Hi,

On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
> On 4/3/25 15:12, Jakub Wartak wrote:
> > On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra  wrote:
> > 
> >> ...
> >>
> >> So unless someone can demonstrate a use case where this would matter,
> >> I'd not worry about it too much.
> > 
> > OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
> > it's just that I don't have cycles left today and probably lack skills
> > (i've never dealt with arrays so far) thus it would be slow to get it
> > right... but I can pick up anything tomorrow morning.
> > 
> 
> OK, I took a stab at reworking/simplifying this the way I proposed.
> Here's v24 - needs more polishing, but hopefully enough to show what I
> had in mind.
> 
> It does these changes:
> 
> 1) Drops 0002 with the pg_buffercache refactoring, because the new view
> is not "extending" the existing one.

I think that makes sense. One would just need to join on the pg_buffercache
view to get more information about the buffer if needed.

The pg_buffercache_numa_pages() doc needs an update though as I don't think that
"+  The pg_buffercache_numa_pages() provides the same
information as pg_buffercache_pages()" is still true.

> 2) Reworks pg_buffercache_num to return just three columns, bufferid,
> page_num and node_id. page_num is a sequence starting from 0 for each
> buffer.

+1 on the idea

> 3) It now builds an array of records, with one record per buffer/page.
> 
> 4) I realized we don't really need to worry about buffers_per_page very
> much, except for logging/debugging. There's always "at least one page"
> per buffer, even if an incomplete one, so we can do this:
> 
>os_page_count = NBuffers * Max(1, pages_per_buffer);
> 
> and then
> 
>   for (i = 0; i < NBuffers; i++)
>   {
>   for (j = 0; j < Max(1, pages_per_buffer); j++)

That's a nice simplification as we always need to take care of at least one page
per buffer.
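(A standalone toy check of that arithmetic, assuming a 4 kB OS page: integer
division yields 0 when BLCKSZ is smaller than the OS page, and Max(1, ...) still
gives one entry per buffer.)

#include <stdio.h>

#define Max(x, y) ((x) > (y) ? (x) : (y))

int
main(void)
{
	int		os_page_size = 4096;
	int		blcksz[] = {2048, 8192, 32768};

	for (int i = 0; i < 3; i++)
	{
		int		pages_per_buffer = blcksz[i] / os_page_size;
		int		entries = Max(1, pages_per_buffer);

		printf("BLCKSZ=%d -> %d entries per buffer\n", blcksz[i], entries);
	}
	return 0;
}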

> and everything just works fine, I think.

I think the same.

> Opinions? I personally find this much cleaner / easier to understand.

I agree that's easier to understand and that that looks correct.

A few random comments:

=== 1

It looks like firstNumaTouch is not set to false anymore.

=== 2

+   pfree(os_page_status);
+   pfree(os_page_ptrs);

Not sure that's needed, we should be in a short-lived memory context here
(ExprContext or such).

=== 3

+   ro_volatile_var = *(uint64 *)ptr

space missing before "ptr"?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Detach partition with constraint test

2025-04-03 Thread Amul Sul
On Thu, Apr 3, 2025 at 5:52 PM Ashutosh Bapat
 wrote:
>
> Hi,
> I tested the "not enforced" constraint feature extensively today
> especially the cases of partitioned table. Everything seems to be
> working fine.
>
> While doing that, I found that foreign_key.sql does not have a test to
> make sure that a partition continues to have the constraints in the
> same state after detaching. Here's a 0001 patch adding those tests in
> the same block as the not enforced constraints. Probably we could add
> similar test in other blocks as well, but I haven't done that yet.
> Checking if something like this would be acceptable.
>
> 0002 fixes a comment in the same block.
>

Thanks for the testing; both patches look quite reasonable to me.

Regards,
Amul




RE: Some codes refer slot()->{'slot_name'} but it is not defined

2025-04-03 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

> I've pushed the patches. Thanks!

This is a closing post. I confirmed that at least one BF animal for each version
has run and said OK. IIUC there are no threads to be forked.

Thanks for pushing!

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: [fixed] Trigger test

2025-04-03 Thread Dmitrii Bondar


On 04/04/2025 01:11, Tom Lane wrote:

So that's a long laundry list and we haven't even dug hard.
Is it worth it?  If you feel like doing the legwork then
I'm willing to support the project, but I really wonder if
we shouldn't cut our losses and just remove the module.

(I hesitate now to look at the rest of contrib/spi/ :-()


You wrote a note that I decided to omit. As I mentioned, the patch does 
not even fix the cascade update problem—there are still broken 
cases—because it seems impossible to address it in a gentle way (the 
code was patched 20 years ago; it's truly legacy).


I considered removing it entirely, but that seemed too drastic a
solution (and, at the very least, I don't have enough expertise to make
that decision). If everything looks acceptable, I would prefer to cut
the module. The check_primary_key and check_foreign functions are
clearly unused, are buggy, and no one has reported any obvious
problems—so refint.c can be safely removed. Autoinc.c also looks
problematic.


There are some questions. When should we remove the module? Should we
mark it as deprecated for now and remove it later? Should we handle it 
in another thread? Should we apply this patch in that case?


Best regards,

Dmitrii


Re: AIO v2.5

2025-04-03 Thread Ranier Vilela
Hi.

On Wed, Apr 2, 2025 at 08:58, Andres Freund
wrote:

> Hi,
>
> I've pushed fixes for 1) and 2) and am working on 3).
>
Coverity has one report about this.

CID 1596092: (#1 of 1): Uninitialized scalar variable (UNINIT)
13. uninit_use_in_call: Using uninitialized value result_one. Field
result_one.result is uninitialized when calling pgaio_result_report.

Below is not a fix, but a suggestion:

diff --git a/src/backend/storage/buffer/bufmgr.c
b/src/backend/storage/buffer/bufmgr.c
index 1c37d7dfe2..b0f9ce452c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6786,6 +6786,8 @@ buffer_readv_encode_error(PgAioResult *result,
  else
  result->status = PGAIO_RS_WARNING;

+ result->result = 0;
+
  /*
  * The encoding is complicated enough to warrant cross-checking it against
  * the decode function.
@@ -6868,8 +6870,6 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8
buf_off, Buffer buffer,
  /* Check for garbage data. */
  if (!failed)
  {
- PgAioResult result_one;
-
  if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
  failed_checksum))
  {
@@ -6904,6 +6904,8 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8
buf_off, Buffer buffer,
  */
  if (*buffer_invalid || *failed_checksum || *zeroed_buffer)
  {
+ PgAioResult result_one;
+
  buffer_readv_encode_error(&result_one, is_temp,
   *zeroed_buffer,
   *ignored_checksum,


1. I couldn't find the correct value to initialize the *result* field.
2. result_one can be reduced scope.

best regards,
Ranier Vilela


psql suggestion "select " offers nothing, can we get functions like "\df "

2025-04-03 Thread Kirk Wolak
Hackers,
  "call " works.  Obviously it was a trivial case.

  But "select " does nothing.

  Worse, "select pg_stat_st" has no clue.  I was looking for ... _reset

  It's not that difficult to add; I am suggesting that we use the same
logic as \df at that point.

  Is there any such support for this?
  Is it bad form to offer a bounty for someone to do this in this list? (I
simply won't have time for 2-3 months).

Thanks!


Re: pg_recvlogical cannot create slots with failover=true

2025-04-03 Thread Masahiko Sawada
On Wed, Apr 2, 2025 at 11:57 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Michael,
>
> > Maybe we don't need a short option at all for this, at least initially?
>
> Indeed, updated.

Thank you for updating the patch. Here are some minor comments:

 The --two-phase can be specified with
 --create-slot to enable decoding of prepared
transactions.
+Also, --falover can be specified with
+--create-slot to create replication slot which can be
+synchronized to the standby.

How about rewording the whole sentence like:

+The --two-phase and
--failover options
+can be specified with --create-slot.

Explaining both options here seems redundant to me.

---
+   
+Allows created replication slot to be synchronized to the standbys.
+This option may only be specified with --create-slot.
+   

How about slightly rewording to like:

+Enables the slot to be synchronized to the standbys. This
option may only be
+specified with --create-slot.

Also, the descriptions of pg_recvlogical options are written in
alphabetical order. Please put the description for --failover option
after -E/--endpos.

The rest looks good to me.

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: SQLFunctionCache and generic plans

2025-04-03 Thread Tom Lane
Alexander Lakhin  writes:
> I've discovered that starting from 0dca5d68d, the following query:
> CREATE FUNCTION f(x anyelement) RETURNS anyarray AS '' LANGUAGE SQL;
> SELECT f(0);

> triggers:
> TRAP: failed Assert("fcache->func->rettype == VOIDOID"), File: "functions.c", 
> Line: 1737, PID: 3784779

Drat.  I thought I'd tested the empty-function-body case, but
evidently that was a few changes too far back.  Will fix,
thanks for catching it.

regards, tom lane




Re: AIO v2.5

2025-04-03 Thread Ranier Vilela
On Thu, Apr 3, 2025 at 15:35, Andres Freund
wrote:

> Hi,
>
> On 2025-04-03 13:46:39 -0300, Ranier Vilela wrote:
> > On Wed, Apr 2, 2025 at 08:58, Andres Freund
> > wrote:
> >
> > > Hi,
> > >
> > > I've pushed fixes for 1) and 2) and am working on 3).
> > >
> > Coverity has one report about this.
> >
> > CID 1596092: (#1 of 1): Uninitialized scalar variable (UNINIT)
> > 13. uninit_use_in_call: Using uninitialized value result_one. Field
> > result_one.result is uninitialized when calling pgaio_result_report.
>
> Isn't this a rather silly thing to warn about for coverity?

Personally, I consider every warning to be important.


>   The field isn't
> used in pgaio_result_report().  It can't be a particularly rare thing to
> have
> struct fields that aren't always used?
>
I always consider it a risk; someone may start using it.


>
>
> > Below not is a fix, but some suggestion:
> >
> > diff --git a/src/backend/storage/buffer/bufmgr.c
> > b/src/backend/storage/buffer/bufmgr.c
> > index 1c37d7dfe2..b0f9ce452c 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -6786,6 +6786,8 @@ buffer_readv_encode_error(PgAioResult *result,
> >   else
> >   result->status = PGAIO_RS_WARNING;
> >
> > + result->result = 0;
> > +
>
> That'd be completely wrong - and the tests indeed fail if you do that. The
> read might succeed with a warning (e.g. due to zero_damaged_pages) in which
> case the result still carries important information about how many blocks
> were
> successfully read.
>
That's exactly why it's not a patch.


>
>
> >   /*
> >   * The encoding is complicated enough to warrant cross-checking it
> against
> >   * the decode function.
> > @@ -6868,8 +6870,6 @@ buffer_readv_complete_one(PgAioTargetData *td,
> uint8
> > buf_off, Buffer buffer,
> >   /* Check for garbage data. */
> >   if (!failed)
> >   {
> > - PgAioResult result_one;
> > -
> >   if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
> >   failed_checksum))
> >   {
> > @@ -6904,6 +6904,8 @@ buffer_readv_complete_one(PgAioTargetData *td,
> uint8
> > buf_off, Buffer buffer,
> >   */
> >   if (*buffer_invalid || *failed_checksum || *zeroed_buffer)
> >   {
> > + PgAioResult result_one;
> > +
> >   buffer_readv_encode_error(&result_one, is_temp,
> >*zeroed_buffer,
> >*ignored_checksum,
> >
> >
> > 1. I couldn't find the correct value to initialize the *result* field.
>
> It is not accessed in this path.  I guess we can just zero-initialize
> result_one to shut up coverity.
>
Very good.


>
>
> > 2. result_one can be reduced scope.
>
> True.
>
Ok.

best regards,
Ranier Vilela


Re: Using read stream in autoprewarm

2025-04-03 Thread Melanie Plageman
On Thu, Apr 3, 2025 at 11:17 AM Heikki Linnakangas  wrote:
>
> I had a quick look at this. Looks good overall, some small remarks:

Thanks for taking a look!

> v12-0001-Autoprewarm-global-objects-separately.patch
>
> > Instead, modify apw_load_buffers() to prewarm the shared objects in one
> > invocation of autoprewarm_database_main() while connected to the first
> > valid database.
>
> So it effectively treats "global objects" as one extra database,
> launching a separate worker process to handle global objects. It took me
> a while to understand that. From the commit message, I understood that
> it still does that within the first worker process invocation, but no. A
> comment somewhere would be good.

Yea, I could have been more explicit about that.

Actually, I was chatting about this with Andres off-list and he was
like, why do you need to check the database at all? Won't
prewarm_stop_idx already have that built in? And I think he's right.
In attached v13, I've added a separate patch (0002) which turns this
check into an assert. And I removed the check from all of the other
loops in the later patches.

> v12-0003-Use-streaming-read-I-O-in-autoprewarm.patch
>
> I wonder if the have_free_buffer() calls work correctly with read
> streams? Or will you "overshoot", prewarming a few more pages after the
> buffer cache is already full? I guess that depends on when exactly the
> read stream code allocates the buffer.

It does have some overshoot -- but a max of io_combine_limit blocks
will be evicted. The read stream code builds up an IO of up to
io_combine_limit blocks before calling StartReadBuffer(). So you could
be in a situation where you weren't quite out of buffers on the
freelist while you are building up the IO and then when you go to pin
those buffers, there aren't enough on the freelist. But I think that's
okay.

> While reviewing this, I noticed a pre-existing bug: The code ignores
> 'tablespace' when deciding if it's reached the end of the current
> relation. I believe it's possible to have two different relations with
> the same relnumber, in different tablespaces.

Good catch. I've included a fix for this in the attached set (0001)

- Melanie
From 9d5e4bf6a05d0a3f2838dd9efa8715b05be77423 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 3 Apr 2025 12:47:19 -0400
Subject: [PATCH v13 1/4] Fix autoprewarm neglect of tablespaces

While prewarming blocks from a dump file, autoprewarm_database_main()
mistakenly ignored tablespace when detecting the beginning of the next
relation to prewarm. Because RelFileNumbers are only unique within a
tablespace, autoprewarm could miss prewarming blocks from a
relation with the same RelFileNumber in a different tablespace.

Though this situation is likely rare in practice, it's best to make the
code correct. Do so by explicitly checking for the RelFileNumber when
detecting a new relation.

Reported-by: Heikki Linnakangas 
Discussion: https://postgr.es/m/97c36982-603b-494a-95f4-aaf2a12ac27e%40iki.fi
---
 contrib/pg_prewarm/autoprewarm.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 73485a2323c..760b1548eff 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -472,10 +472,15 @@ autoprewarm_database_main(Datum main_arg)
 
 		/*
 		 * As soon as we encounter a block of a new relation, close the old
-		 * relation. Note that rel will be NULL if try_relation_open failed
-		 * previously; in that case, there is nothing to close.
+		 * relation. RelFileNumbers are only guaranteed to be unique within a
+		 * tablespace, so check that too.
+		 *
+		 * Note that rel will be NULL if try_relation_open failed previously;
+		 * in that case, there is nothing to close.
 		 */
-		if (old_blk != NULL && old_blk->filenumber != blk->filenumber &&
+		if (old_blk != NULL &&
+			(old_blk->tablespace != blk->tablespace ||
+			 old_blk->filenumber != blk->filenumber) &&
 			rel != NULL)
 		{
 			relation_close(rel, AccessShareLock);
@@ -487,7 +492,9 @@ autoprewarm_database_main(Datum main_arg)
 		 * Try to open each new relation, but only once, when we first
 		 * encounter it. If it's been dropped, skip the associated blocks.
 		 */
-		if (old_blk == NULL || old_blk->filenumber != blk->filenumber)
+		if (old_blk == NULL ||
+			old_blk->tablespace != blk->tablespace ||
+			old_blk->filenumber != blk->filenumber)
 		{
 			Oid			reloid;
 
@@ -508,6 +515,7 @@ autoprewarm_database_main(Datum main_arg)
 
 		/* Once per fork, check for fork existence and size. */
 		if (old_blk == NULL ||
+			old_blk->tablespace != blk->tablespace ||
 			old_blk->filenumber != blk->filenumber ||
 			old_blk->forknum != blk->forknum)
 		{
-- 
2.34.1

From 8879473d355dba5c9397c398240db2e1efb81a4c Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 3 Apr 2025 14:54:09 -0400
Subject: [PATCH v13 2/4] Remove superfluous autoprewarm check

autoprewarm_d

Re: RFC: Logging plan of the running query

2025-04-03 Thread Robert Haas
On Thu, Apr 3, 2025 at 1:32 AM torikoshia  wrote:
> I think tracking execution progress involves more challenges to solve
> compared to simply outputting the plan.
> For this reason, I believe an incremental approach -- first completing
> the basic plan output functionality in this thread and then extending it
> to support progress tracking -- would be the good way forward.

I think you might be right, although I am not totally certain yet.

> However, for the next patch, I'm considering introducing a GUC to allow
> prior setup before outputting the plan, in response to the previously
> quoted comment:
>
> One way in which this proposal seems safer than previous proposals is
> that previous proposals have involved session A poking session B and
> trying to get session B to emit an EXPLAIN on the fly with no prior
> setup. That would be very useful, but I think it's more difficult and
> more risky than this proposal, where all the configuration happens in
> the session that is going to emit the EXPLAIN output.

When I wrote this comment, I was unaware of Andres's proposal to use
ProcessInterrupts() to install the ExecProcNode() wrapper. With that
approach, which you have already implemented, I don't see a reason to
require prior configuration.

> With this change, it should be possible to use a file-level variable
> instead.

I think the question of whether ActiveQueryDesc can be file-level is
separate from whether prior configuration is needed. If it is
important to touch this from multiple source files, then it is fine
for it to be global. However, if we have a new source file, say
dynamic_explain.c, then you could have functions
ProcessDynamicExplainInterrupt() and DynamicExplainCleanup() in that
file to set, use, clear ActiveQueryDesc, and the rest of the system
might not need to know about it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: AIO v2.5

2025-04-03 Thread Andres Freund
Hi,

On 2025-04-03 13:46:39 -0300, Ranier Vilela wrote:
> On Wed, Apr 2, 2025 at 08:58, Andres Freund
> wrote:
> 
> > Hi,
> >
> > I've pushed fixes for 1) and 2) and am working on 3).
> >
> Coverity has one report about this.
> 
> CID 1596092: (#1 of 1): Uninitialized scalar variable (UNINIT)
> 13. uninit_use_in_call: Using uninitialized value result_one. Field
> result_one.result is uninitialized when calling pgaio_result_report.

Isn't this a rather silly thing to warn about for coverity?  The field isn't
used in pgaio_result_report().  It can't be a particularly rare thing to have
struct fields that aren't always used?


> Below not is a fix, but some suggestion:
> 
> diff --git a/src/backend/storage/buffer/bufmgr.c
> b/src/backend/storage/buffer/bufmgr.c
> index 1c37d7dfe2..b0f9ce452c 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -6786,6 +6786,8 @@ buffer_readv_encode_error(PgAioResult *result,
>   else
>   result->status = PGAIO_RS_WARNING;
> 
> + result->result = 0;
> +

That'd be completely wrong - and the tests indeed fail if you do that. The
read might succeed with a warning (e.g. due to zero_damaged_pages) in which
case the result still carries important information about how many blocks were
successfully read.


>   /*
>   * The encoding is complicated enough to warrant cross-checking it against
>   * the decode function.
> @@ -6868,8 +6870,6 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8
> buf_off, Buffer buffer,
>   /* Check for garbage data. */
>   if (!failed)
>   {
> - PgAioResult result_one;
> -
>   if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags,
>   failed_checksum))
>   {
> @@ -6904,6 +6904,8 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8
> buf_off, Buffer buffer,
>   */
>   if (*buffer_invalid || *failed_checksum || *zeroed_buffer)
>   {
> + PgAioResult result_one;
> +
>   buffer_readv_encode_error(&result_one, is_temp,
>*zeroed_buffer,
>*ignored_checksum,
> 
> 
> 1. I couldn't find the correct value to initialize the *result* field.

It is not accessed in this path.  I guess we can just zero-initialize
result_one to shut up coverity.


> 2. result_one can be reduced scope.

True.
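(Taken together, a hedged sketch of what those two changes might look like at the
call site; surrounding code elided.)

if (*buffer_invalid || *failed_checksum || *zeroed_buffer)
{
	/* narrower scope, zero-initialized so Coverity sees no uninitialized read */
	PgAioResult result_one = {0};

	/* ... buffer_readv_encode_error(&result_one, is_temp, ...) as before ... */
}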


Greetings,

Andres Freund




Re: Modern SHA2- based password hashes for pgcrypto

2025-04-03 Thread Alvaro Herrera
On 2025-Mar-11, Bernd Helmle wrote:

> Please find attached v4 of this patch. I added the following changes:
> 
> - Check for non-supported characters in the salt like passlib does.
> - Check for reserved tokens when parsing the salt string (i find this
> very strict, but it covers the cases Japin Li pointed out).
> 
> I didn't implement being more strict when it comes to the salt length:
> the salt might be provided from external sources and the password hash
> is just used for validating within the database.
> 
> When hashing the password within postgres, users always should use
> gen_salt(), which generates a reasonable salt string. 

This sounds reasonable to me.

> Maybe, in case of empty salts, we should issue a WARNING instead of
> erroring out and put additional documentation on how to use it right.

I don't know, that doesn't seem ideal to me, because it's very easy to
run stuff and never see the warnings.  If we find that people are
desperate to use empty salts, we can relax that later (turn the error to
a warning), but I'd rather not have it in the first cut.

> Hmm, i didn't understand that passlib does decode them first, i thought
> they use it encoded... at least, in our current form we're pretty much
> compatible with Drepper, passlib and OpenSSL, as far as i tested:

I am ready to believe that I misinterpreted what I read.

> So the hashes we produce are identical, but with being more strict we
> differ in the handling of the provided salt.

Okay.


I can offer a few cosmetic changes.  0001 is a pgindent run, and 0002 is
some manual adjustments after that.  There are only two nontrivial
changes

1. the calculation for rounds was using type long, which is confusing
because the range is different according to the platform.  Since it's
limited by the macro definitions to no more than 999,999,999, we can make
it an int32 instead.  So we use strtoint() instead of strtoul() to parse
the value, and remove the "l" suffixes from the macros that define the
limits and default, which were bugging me a bit when used in the
gen_list struct.

2. I changed "if (block % 3)" to "if ((block % 3) != 0)".

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El destino baraja y nosotros jugamos" (A. Schopenhauer)
>From 864395b6468d62f17d2e8f0e593408849564edc1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= 
Date: Wed, 2 Apr 2025 17:25:20 +0200
Subject: [PATCH 1/2] pgindent

---
 contrib/pgcrypto/crypt-gensalt.c | 14 +--
 contrib/pgcrypto/crypt-sha.c | 42 +---
 contrib/pgcrypto/px-crypt.c  |  6 ++---
 contrib/pgcrypto/px-crypt.h  |  2 +-
 4 files changed, 33 insertions(+), 31 deletions(-)

diff --git a/contrib/pgcrypto/crypt-gensalt.c b/contrib/pgcrypto/crypt-gensalt.c
index b9aded897d1..922f9eea6fa 100644
--- a/contrib/pgcrypto/crypt-gensalt.c
+++ b/contrib/pgcrypto/crypt-gensalt.c
@@ -190,9 +190,9 @@ static char *
 _crypt_gensalt_sha(unsigned long count,
   const char *input, int size, char *output, 
int output_size)
 {
-   char * s_ptr = output;
+   char   *s_ptr = output;
unsigned int result_bufsize = PX_SHACRYPT_SALT_BUF_LEN;
-   int rc;
+   int rc;
 
/* output buffer must be allocated with PX_MAX_SALT_LEN bytes */
if (PX_MAX_SALT_LEN < result_bufsize)
@@ -203,8 +203,8 @@ _crypt_gensalt_sha(unsigned long count,
}
 
/*
-* Care must be taken to not exceed the buffer size allocated for
-* the input character buffer.
+* Care must be taken to not exceed the buffer size allocated for the
+* input character buffer.
 */
if ((PX_SHACRYPT_SALT_MAX_LEN != size)
|| (output_size < size))
@@ -229,8 +229,8 @@ _crypt_gensalt_sha(unsigned long count,
/*
 * Normalize salt string
 *
-* size of input buffer was checked above to
-* not exceed PX_SHACRYPT_SALT_LEN_MAX.
+* size of input buffer was checked above to not exceed
+* PX_SHACRYPT_SALT_LEN_MAX.
 */
for (int i = 0; i < size; i++)
{
@@ -268,4 +268,4 @@ _crypt_gensalt_sha256_rn(unsigned long count,
output[2] = '$';
 
return _crypt_gensalt_sha(count, input, size, output, output_size);
-}
\ No newline at end of file
+}
diff --git a/contrib/pgcrypto/crypt-sha.c b/contrib/pgcrypto/crypt-sha.c
index ec51e865a5e..13ee8913052 100644
--- a/contrib/pgcrypto/crypt-sha.c
+++ b/contrib/pgcrypto/crypt-sha.c
@@ -77,24 +77,25 @@ px_crypt_shacrypt(const char *pw, const char *salt, char 
*passwd, unsigned dstle
int err;
 
const char *dec_salt_binary;/* pointer into the real salt string */
-   StringInfo decoded_salt = NULL; /* decoded salt string */
+   StringInfo  decoded_salt = NULL;/* decoded salt string */
 
unsigned char sha_buf[PX_SHACRYPT_DIGEST_MAX_LEN];
   

Re: Draft for basic NUMA observability

2025-04-03 Thread Tomas Vondra
On 4/3/25 15:12, Jakub Wartak wrote:
> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra  wrote:
> 
>> ...
>>
>> So unless someone can demonstrate a use case where this would matter,
>> I'd not worry about it too much.
> 
> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
> it's just that I don't have cycles left today and probably lack skills
> (i've never dealt with arrays so far) thus it would be slow to get it
> right... but I can pick up anything tomorrow morning.
> 

OK, I took a stab at reworking/simplifying this the way I proposed.
Here's v24 - needs more polishing, but hopefully enough to show what I
had in mind.

It does these changes:

1) Drops 0002 with the pg_buffercache refactoring, because the new view
is not "extending" the existing one.

2) Reworks pg_buffercache_numa to return just three columns, bufferid,
page_num and node_id. page_num is a sequence starting from 0 for each
buffer.

3) It now builds an array of records, with one record per buffer/page.

4) I realized we don't really need to worry about buffers_per_page very
much, except for logging/debugging. There's always "at least one page"
per buffer, even if an incomplete one, so we can do this:

   os_page_count = NBuffers * Max(1, pages_per_buffer);

and then

  for (i = 0; i < NBuffers; i++)
  {
  for (j = 0; j < Max(1, pages_per_buffer); j++)
  {
  ..
  }
  }

and everything just works fine, I think.


Opinions? I personally find this much cleaner / easier to understand.


regards

-- 
Tomas VondraFrom 70018899b698b186ffebb03c7336022ae83fbfb8 Mon Sep 17 00:00:00 2001
From: Jakub Wartak 
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v24 1/3] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.

A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.

The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.

On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).

Author: Jakub Wartak 
Co-authored-by: Bertrand Drouvot 
Reviewed-by: Andres Freund 
Reviewed-by: Álvaro Herrera 
Reviewed-by: Tomas Vondra 
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
 .cirrus.tasks.yml   |   2 +
 configure   | 187 
 configure.ac|  14 +++
 doc/src/sgml/func.sgml  |  13 ++
 doc/src/sgml/installation.sgml  |  21 
 meson.build |  23 
 meson_options.txt   |   3 +
 src/Makefile.global.in  |   6 +-
 src/backend/utils/misc/guc_tables.c |   2 +-
 src/include/catalog/pg_proc.dat |   4 +
 src/include/pg_config.h.in  |   3 +
 src/include/port/pg_numa.h  |  40 ++
 src/include/storage/pg_shmem.h  |   1 +
 src/makefiles/meson.build   |   3 +
 src/port/Makefile   |   1 +
 src/port/meson.build|   1 +
 src/port/pg_numa.c  | 120 ++
 17 files changed, 442 insertions(+), 2 deletions(-)
 create mode 100644 src/include/port/pg_numa.h
 create mode 100644 src/port/pg_numa.c

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
 --enable-cassert --enable-injection-points --enable-debug \
 --enable-tap-tests --enable-nls \
 --with-segsize-blocks=6 \
+--with-libnuma \
 --with-liburing \
 \
 ${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
 -Dllvm=disabled \
 --pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
 -DPERL=perl5.36-i386-linux-gnu \
+-Dlibnuma=disabled \
 build-32
 EOF
 
diff --git a/configure b/configure
index 11615d1122d..e27badd83c3 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
 XML2_CFLAGS
 XML2_CONFIG
 with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
 LIBCURL_LIBS
 LIBCURL_CFLAGS
 with_libcurl
@@ -872,6 +875,7 @@ with_liburing
 with_uuid
 with_ossp_uuid
 with_libcurl
+with_libnuma
 with_libxml
 with_libxslt
 with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
 LIBURING_LIBS
 LIBCURL_CFLAGS
 LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
 XML2_CONFIG
 XML2_CFLAGS
 XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
   --with-uuid=LIB build contrib/uuid

Re: psql suggestion "select " offers nothing, can we get functions like "\df "

2025-04-03 Thread Pavel Stehule
Hi

On Thu, Apr 3, 2025 at 20:17, Tom Lane  wrote:

> Kirk Wolak  writes:
> >   But "select " does nothing.
>
> What would you have it offer?  Every single column and function name
> in the database?  Seems unlikely to be very helpful.
>
> postgres=# select count(distinct attname) from pg_attribute;
>  count
> ---
>   1218
> (1 row)
>
> postgres=# select count(distinct proname) from pg_proc;
>  count
> ---
>   2787
> (1 row)
>
> That's with zero user-defined objects.
>

you can try https://www.pgcli.com/

The tab completion there is better (or different) than in psql. I don't use
it, but I know people who use it well.

I agree that tab completion for functions could be a nice feature. It is a
question whether it can work with readline. In the end it is always faster for
me to write \df *reset* and use the mouse to copy.

Regards

Pavel





>
> regards, tom lane
>
>
>


Re: New criteria for autovacuum

2025-04-03 Thread Sami Imseih
> Interestingly it takes unusually long for my toy database:

> There is nothing in between these two lines.
>
> To my humble knowledge, CHECKPOINT shouldn't set hint bits and should
> take that long. At this point I don't know what's going on.
>

From what I can tell in your example, you ran the manual vacuum (session 1)
while you had an open transaction (session 2), so vacuum could not remove
the dead tuples or update the visibility map. Once you committed session 2,
autovacuum came in and did its job after the autovacuum_naptime passed
(default 1 minute). Checkpoint does not update the visibility map; only
vacuum does.

IMO, I don't think we need this patch for vacuum, as simply making sure
autovacuum runs more frequently on tables that are often accessed via
index-only scans is already a way to deal with this, i.e.,
lowering autovacuum_vacuum_scale_factor. Maybe others have
a different opinion?

PostgreSQL 13 also introduced autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor to deal with append-only tables, which
previously only saw their first vacuum at wraparound prevention time
(200 million transactions by default). That made index-only scans slow
because of an outdated visibility map, and it also made the wraparound vacuum
more disruptive.

As far as extra metrics go for the scenario in which an index-only scan must
visit the table, pg_stat_all_indexes and pg_stat_all_tables do have an
idx_tup_fetch counter which increases anytime an index scan visits the table,
i.e. an index-only scan with heap fetches or a regular index scan. I think
having a counter specifically for heap fetches due to index-only scans could
be valuable.

--
Sami Imseih
Amazon Web Services (AWS)




Re: BTScanOpaqueData size slows down tests

2025-04-03 Thread Peter Geoghegan
On Wed, Apr 2, 2025 at 12:43 PM Álvaro Herrera  wrote:
> I'm just saying that this is the reason to have the field be last in the
> struct.

Right. And I'm just saying that I don't think that it would take very
much effort to change it. That aspect isn't performance critical.

--
Peter Geoghegan




Re: New criteria for autovacuum

2025-04-03 Thread Melanie Plageman
On Thu, Apr 3, 2025 at 4:37 PM Sami Imseih  wrote:
>
> From what I can tell in your example, you ran the manual vacuum ( session 1)
> while you had an open transaction (session 2), so vacuum could not remove
> the dead tuples or update the visibility map. Once you committed session 2,
> autovacuum came in and did its job after the autovacuum_naptime passed
> (default 1 minute). Checkpoint does not update the visibility map, only
> vacuum does.
>
> IMO, I don't think we need this patch for vacuum, as simply making sure
> autovacuum runs more frequently on the table that is accessed via
> index-only scans often is a way to deal with this already, i.e
> lowering autovacuum_vacuum_scale_factor. Maybe others have
> a different opinion?

Yea, I mean if you have a table that isn't meeting the threshold for
being vacuumed by autovacuum based on inserted/modified tuples that
you are querying a lot, then you should probably change the autovacuum
table storage options for that table.

I don't fully understand if the example provided here is in that
situation (i.e. autovac not being triggered because they were below
the thresholds).

Historically, we haven't updated the VM when doing on-access pruning
because we wanted vacuum to have to vacuum those pages and freeze them
before an aggressive vacuum was required. This release, in
052026c9b903, I added functionality to vacuum to scan more all-visible
pages during regular vacuums. I think this enables us to update the VM
during on-access pruning. This is something I plan to work on in 19.
It seems like it would alleviate situations like this.

> As far as extra metrics go for the scenario in which and index only scan must
> visit a table, pg_stat_all_indexes and pg_stat_all_tables do have a
> idx_tup_fetch
> counter which increases anytime an index scan visits the table, i.e.
> index-only scan
> with heap fetches or a regular index scan. I think having a counter
> specifically for
> heap fetches due to index-only scans could be valuable.

What do you imagine using the heap fetches during index-only scans counter for?

- Melanie




Re: dblink query interruptibility

2025-04-03 Thread Noah Misch
On Wed, Mar 12, 2025 at 11:03:41AM +0100, Andreas Karlsson wrote:
> On 3/12/25 12:48 AM, Noah Misch wrote:
> > Overall, in the absence of objections, I will queue a task to back-patch the
> > non-postgres_fdw portion of commit d3c5f37 to v13-v16.

Pushed (e.g. v16 has commit 82a8f0f).  Only v16 had libpq-be-fe-helpers.h at
all, so I also back-patched 28a5917 to add it.  The original use case for
libpq-be-fe-helpers.h was interrupting PQconnectdbParams(), commit e460248.  I
decided not to back-patch that one, since connection-time delays are often
limited in ways query runtime is not.  We could change that decision.




Re: [fixed] Trigger test

2025-04-03 Thread Tom Lane
Dmitrii Bondar  writes:
> [ v6-0001-Triggers-test-fix-with-the-invalid-cache-in-refin.patch ]

I spent a little bit of time looking over this patch.  My first
instinct was "we can't really change the recommended method of
installing these triggers" --- but that would only matter if we
thought there were actual production users of these triggers,
which surely there are not (anymore).  The only reason we're
carrying refint.c at all is as an example of C-coded triggers.
So changing the example seems fine, and you're right that this
sort of change is far better done from an AFTER trigger.

However ... as an example refint.c is pretty darn awful :-(.
I'd never looked hard at it before, and now that I have,
I'm rather horrified, particularly by the shoddy quoting practices.
None of the identifiers inserted into constructed queries receive
quote_identifier() protection, and the insertion of data values
around line 500 is beyond awful.

So while I think your v6 patch fixes the problem(s) it set out
to fix, it still feels a lot like putting lipstick on a pig.
I wonder if we'd be better off to nuke refint.c altogether.
If we don't, I feel like we're morally obliged to spend more
effort trying to make it less of an example of bad practices.
Some of the things I think need to be cleaned up:

* It's ridiculous that the update-cascade case is inserting
data values textually at all.  Even if it were quoting them
properly, that's expensive and risks round-trip-conversion
problems.  That should be handled as an additional Param value
passed into the statement.

* Worse yet, that code doesn't work if used more than once,
because the first value that needs to be updated gets baked
into the plan that will be re-used later.  So the Param
approach is really essential even aside from quoting concerns.

* String comparisons to detect value equality (around line 400)
are not terribly cool either.  Proper code would be looking up
the default equality operator for the datatypes and applying that.

* Some thought needs to be given to table and column names that
require quoting.  I guess in a way there's an advantage to not
quoting the table names that come from trigger arguments: it
lets the user get around the module's failure to think about
schema-qualified names, by writing 'foo.bar' rather than just
'bar'.  But that's not documented.  If we did quote everything
then we'd really have to go over to providing separate schema
and name arguments for each of the other tables.  In any case,
not quoting the column names has nothing to recommend it.

* I'll slide gently past the question of whether this should
be expected to react on-the-fly to DDL changes in the tables.
SPI will do some of that under the hood, but it can't fix
cases where the query string would need to change (eg.
table or column renames).
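
To make the Param-based approach from the first two points concrete, a minimal
SPI sketch might look like this (table, column, and function names are purely
illustrative; plan caching across calls and error handling are omitted):

#include "postgres.h"
#include "executor/spi.h"

/*
 * Hypothetical cascade-update helper: the old and new key values are
 * passed as parameters instead of being pasted into the query text.
 */
static void
cascade_update_sketch(Oid keytype, Datum newkey, Datum oldkey)
{
	Oid			argtypes[2] = {keytype, keytype};
	Datum		values[2] = {newkey, oldkey};
	SPIPlanPtr	plan;

	plan = SPI_prepare("UPDATE fktable SET fkcol = $1 WHERE fkcol = $2",
					   2, argtypes);
	if (plan == NULL)
		elog(ERROR, "SPI_prepare failed: %s",
			 SPI_result_code_string(SPI_result));

	/* the same plan stays correct no matter which values it is run with */
	if (SPI_execute_plan(plan, values, NULL, false, 0) < 0)
		elog(ERROR, "SPI_execute_plan failed");
}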

So that's a long laundry list and we haven't even dug hard.
Is it worth it?  If you feel like doing the legwork then
I'm willing to support the project, but I really wonder if
we shouldn't cut our losses and just remove the module.

(I hesitate now to look at the rest of contrib/spi/ :-()

regards, tom lane




Re: Update Unicode data to Unicode 16.0.0

2025-04-03 Thread Peter Eisentraut

On 17.03.25 19:54, Jeff Davis wrote:

On Thu, 2025-03-13 at 14:49 +0100, Peter Eisentraut wrote:

I think these test result changes are incorrect.  AFAICT, nothing has
changed in the upstream data that would explain such a change.

I didn't get such test differences in my original patch.  Did you
rebase
the patch or regenerate it?  Maybe something went wrong there.


Right. The previous version was improperly rebased.

I went back to your original version and rebased over master (including
the latest optimizations for case mapping), and then rebased my changes
on top of that. Attached.

One annoyance with the recent case mapping optimizations is that the
Unicode update diff is larger, similar to the diff for
unicode_norm_hashfunc.h. Previously, it was binary search, so the only
differences were logical differences.


I have committed the update to the Unicode tables.  I suggest you commit 
your pg_upgrade patch on top of that now.






Re: Statistics Import and Export

2025-04-03 Thread Jeff Davis
On Wed, 2025-04-02 at 21:26 -0500, Nathan Bossart wrote:

> Okay, here is an updated patch set.


> * Besides custom format calling WriteToc() twice to update the data
>   offsets, tar format ... even if it did, the worst case is that
> the
>   content of restore.sql (which isn't used by pg_restore) would be
>   different.  I noticed some past discussion that seems to suggest
> that
>   this format might be a candidate for deprecation [0], so I'm not
> sure
>   it's worth doing anything fancier.

I agree that the risk for tar format seems much lower.

> * Our batching code assumes that stats entries are dumped in TOC
> order,
> 

...

>  I propose moving all
>   stats entries in SECTION_POST_DATA to RESTORE_PASS_POST_ACL, which
>   ensures that we always dump stats in TOC order.  One convenient
> side
>   effect of this change is that we can revert a decent chunk of
> commit
>   a0a4601765.  It might be possible to do better via smarter
> lookahead code
>   or a more sophisticated cache, but it's a bit late in the game for
> that.

This simplifies commit a0a4601765. I'd break out that simplification as
a separate commit to make it easier to understand what happened. 


In patch 0003, there are quite a few static function-scoped variables,
which is not a style that I'm used to. One idea is to bundle them into
a struct representing the cache state (including enough information to
fetch the next batch), and have a single static variable that points to
that.

Also in 0003, the "next_te" variable is a bit confusing, because it's
actually the last TocEntry, until it's advanced to point to the current
one.

Other than that, looks good to me.

Regards,
Jeff Davis





Re: Expand applicability of aggregate's sortop optimization

2025-04-03 Thread Tom Lane
Andrei Lepikhov  writes:
> Rebase on current master.

I started to look at v5, and was immediately put off by the fact
that its documentation of the new support request type consists of
precisely zero words.  You have the wrong idea of what is required
when adding a support request.  The specification of what the request
does and what it returns is your PRIMARY work product, not something
you can just blow off and expect people to reverse-engineer over and
over again in the future.  Please look at the existing entries in
supportnodes.h and note that they all have quite specific and detailed
comments.

minmax_support() is pretty short on comments as well.  The one comment
that's there is unintelligible because it starts out by talking about
an index ... the reader is likely to say "what index?"  Also, I am
far from convinced that the code is right: the fact that the aggsortop
shares a btree opfamily with orderClause->sortop doesn't seem to be
enough to justify removing the aggorder clause.  What if they are
opposite sort directions?

Actually ... even if those operators do match, is it okay to just
remove the aggorder clause?  What happens if we do not end up using
an index-based implementation?  The code really needs to include 
a defense of why it's okay to do what it's doing.

Also, that one comment says "So, we can still use the index IFF the
aggregated expression equals the expression used in the ordering
operation".  But I see no check that they match.

I kind of wonder whether the case that this is attempting to optimize
actually occurs in the field often enough to justify expending this
many cycles looking for it.  MIN(x ORDER BY x) does not seem like
something that a rational person would write.

BTW, it looks to me like that Assert would fail on MIN(x ORDER BY x,y)
which is even less sensible to write, but that doesn't make an
assertion failure okay.

I wonder whether you picked the best spot to insert the hook call
in preprocess_aggref.  In particular, the hook can do nothing that
would affect the polymorphic-result-type resolution step, which
is maybe not a restriction in practice but I'm not entirely sure.
I'm a little inclined to call it before we start digging information
out of the Aggref, rather than after.

regards, tom lane




Re: Test to dump and restore objects left behind by regression

2025-04-03 Thread Andres Freund
Hi,

On 2025-04-03 10:20:09 +0200, Alvaro Herrera wrote:
> On 2025-Apr-03, Ashutosh Bapat wrote:
>
> > Looks like the problem is in the test itself as pointed out by Jeff in
> > [1]. PFA patch fixing the test and enabling statistics back.
>
> Thanks, pushed.

Since then the pg_upgrade tests have been failing on skink/valgrind, due to
exceeding the already substantially increased timeout.

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-04-03%2007%3A06%3A19&stg=pg_upgrade-check
(note that there are other issues in that run)

284/333 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade   
TIMEOUT1.66s   killed by signal 15 SIGTERM


[10:38:19.815](16.712s) ok 20 - check that locales in new cluster match 
original cluster
...
# Running: pg_dumpall --no-sync --dbname port=15114 host=/tmp/bh_AdT5uvQ 
dbname='postgres' --file 
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/tmp_test_gp2G/dump2.sql
death by signal at 
/home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
 line 181.
...
[10:44:11.720](351.905s) # Tests were run but no plan was declared and 
done_testing() was not seen.


I've increased the timeout even further, but I can't say that I am happy about
the slowest test getting even slower. Adding test time in the serially slowest
test is way worse than adding the same time in a concurrent test.


I suspect that the test would go a bit faster if log_statement weren't forced
on; printing that many log lines, with context, does make valgrind slower,
IME. But Cluster.pm forces it on, and I suspect that putting a global
log_statement=false into TEMP_CONFIG would have its own disadvantages.


/me checks prices for increasing the size of skink's host.


Greetings,

Andres




Re: AIO v2.5

2025-04-03 Thread Noah Misch
On Thu, Apr 03, 2025 at 02:19:43PM -0400, Andres Freund wrote:
> 4b)
> 
> That's not all though, after getting past this failure, I see uninitialized
> memory errors for reads into temporary buffers:
> 
> ==3334031== VALGRINDERROR-BEGIN
> ==3334031== Conditional jump or move depends on uninitialised value(s)
> ==3334031==at 0xD7C859: PageIsVerified (bufpage.c:108)
> ==3334031==by 0xD381CA: buffer_readv_complete_one (bufmgr.c:6876)
> ==3334031==by 0xD385D1: buffer_readv_complete (bufmgr.c:7002)
> ==3334031==by 0xD38D2E: local_buffer_readv_complete (bufmgr.c:7210)
> ==3334031==by 0xD265FA: pgaio_io_call_complete_local (aio_callback.c:306)
> ==3334031==by 0xD24720: pgaio_io_reclaim (aio.c:644)
> ==3334031==by 0xD24400: pgaio_io_process_completion (aio.c:521)
> ==3334031==by 0xD28D3D: pgaio_uring_drain_locked (method_io_uring.c:382)
> ==3334031==by 0xD2905F: pgaio_uring_wait_one (method_io_uring.c:461)
> ==3334031==by 0xD245E0: pgaio_io_wait (aio.c:587)
> ==3334031==by 0xD24FFE: pgaio_wref_wait (aio.c:900)
> ==3334031==by 0xD2F471: WaitReadBuffers (bufmgr.c:1695)
> ==3334031==by 0xD2BCF4: read_stream_next_buffer (read_stream.c:898)
> ==3334031==by 0x8B4861: heap_fetch_next_buffer (heapam.c:654)
> ==3334031==by 0x8B4FFA: heapgettup_pagemode (heapam.c:1016)
> ==3334031==by 0x8B594F: heap_getnextslot (heapam.c:1375)
> ==3334031==by 0xB28AA4: table_scan_getnextslot (tableam.h:1031)
> ==3334031==by 0xB29177: SeqNext (nodeSeqscan.c:81)
> ==3334031==by 0xB28F75: ExecScanFetch (execScan.h:126)
> ==3334031==by 0xB28FDD: ExecScanExtended (execScan.h:170)
> 
> 
> The reason for this one is, I think, that valgrind doesn't understand io_uring
> sufficiently. Which isn't surprising, io_uring's nature of an in-memory queue
> of commands is somewhat hard to intercept by tools like valgrind and rr.
> 
> The best fix for that one would, I think, be to have method_io_uring() iterate
> over the IOV and mark the relevant regions as defined?  That does fix the
> issue at least and does seem to make sense?

Makes sense.  Valgrind knows that read() makes its target bytes "defined".  It
probably doesn't have an io_uring equivalent for that.
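
For illustration, the kind of client request being discussed might look roughly
like this in the io_uring completion path (a sketch only; "iov"/"iovcnt" stand
in for however the handle actually exposes the submitted vector):

#include "utils/memdebug.h"		/* no-op wrappers unless USE_VALGRIND */

/*
 * After the kernel reports a completed read, tell Valgrind the target
 * buffers now hold defined bytes, as it would have inferred for a plain
 * readv().  Marking the whole iovec, not just the bytes actually read,
 * keeps the code simple.
 */
for (int i = 0; i < iovcnt; i++)
	VALGRIND_MAKE_MEM_DEFINED(iov[i].iov_base, iov[i].iov_len);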

I expect we only need this for local buffers, and it's unclear to me how the
fix for (4a) didn't fix this.  Before bufmgr Valgrind integration (1e0dfd1 of
2020-07) there was no explicit handling of shared_buffers.  I suspect that
worked because the initial mmap() of shared memory was considered "defined"
(zeros), and steps like PageAddItem() copy only defined bytes into buffers.
Hence, shared_buffers remained defined without explicit Valgrind client
requests.  This example uses local buffers.  Storage for those comes from
MemoryContextAlloc() in GetLocalBufferStorage().  That memory starts
undefined, but it becomes defined at PageInit() or read().  Hence, I expected
the fix for (4a) to make the buffer defined after io_uring read.  What makes
the outcome different?

In the general case, we could want client requests as follows:

- If completor==definer and has not dropped pin:
  - Make defined before verifying page.  That's all.  It might be cleaner to
do this when first retrieving a return value from io_uring, since this
just makes up for what Valgrind already does for readv().

- If completor!=definer or has dropped pin:
  - Make NOACCESS in definer when definer cedes its own pin.
  - For io_method=worker, make UNDEFINED before starting readv().  It might be
cleanest to do this when the worker first acts as the owner of the AIO
subsystem pin, if that's a clear moment earlier than readv().
  - Make DEFINED in completor before verifying page.  It might be cleaner to
do this when the completor first retrieves a return value from io_uring,
since this just makes up for what Valgrind already does for readv().
  - Make NOACCESS in completor after verifying page.  Similarly, it might be
cleaner to do this when the completor releases the AIO subsystem pin.

> Not quite sure if we should mark
> the entire IOV is efined or just the portion that was actually read - the
> latter is additional fiddly code, and it's not clear it's likely to be 
> helpful?

Seems fine to do the simpler way if that saves fiddly code.

> 4c)
> 
> Unfortunately, once 4a) is addressed, the VALGRIND_MAKE_MEM_NOACCESS() after
> PageIsVerified() causes the *next* read into the same buffer in an IO worker
> to fail:
> 
> ==3339904== Syscall param pread64(buf) points to unaddressable byte(s)
> ==3339904==at 0x5B3B687: __internal_syscall_cancel (cancellation.c:64)
> ==3339904==by 0x5B3B6AC: __syscall_cancel (cancellation.c:75)
> ==3339904==by 0x5B93C83: pread (pread64.c:25)
> ==3339904==by 0xD274F4: pg_preadv (pg_iovec.h:56)
> ==3339904==by 0xD2799A: pgaio_io_perform_synchronously (aio_io.c:137)
> ==3339904==by 0xD2A6D7: IoWorkerMain (method_worker.c:538)
> ==3339904==by 0xC91E26: postmaster_child_launch 

Re: Non-text mode for pg_dumpall

2025-04-03 Thread Andrew Dunstan


On 2025-04-01 Tu 1:59 AM, Mahendra Singh Thalor wrote:

On Mon, 31 Mar 2025 at 23:43, Álvaro Herrera  wrote:

Hi

FWIW I don't think the on_exit_nicely business is in final shape just
yet.  We're doing something super strange and novel about keeping track
of an array index, so that we can modify it later.  Or something like
that, I think?  That doesn't sound all that nice to me.  Elsewhere it
was suggested that we need some way to keep track of the list of things
that need cleanup (a list of connections IIRC?) -- perhaps in a
thread-local variable or a global or something -- and we install the
cleanup function once, and that reads from the variable.  The program
can add things to the list, or remove them, at will; and we don't need
to modify the cleanup function in any way.

--
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/

Thanks Álvaro for the feedback.

I removed the old handling of on_exit_nicely_list from the last patch
set and added one simple function to just update the archive handle in
shutdown_info.  (shutdown_info.AHX = AHX;)

For the first database, we add an entry into the on_exit_nicely_list array;
for the remaining databases, we only update shutdown_info, as we have already
closed the connection for the previous database. With this fix, we no longer
touch an on_exit_nicely_list entry for each database.
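
A rough sketch of that arrangement, modeled on the existing
archive_close_connection machinery in parallel.c (the names here are
illustrative rather than the patch's):

typedef struct ShutdownInformation
{
	Archive    *AHX;			/* currently open archive/connection, if any */
} ShutdownInformation;

static ShutdownInformation shutdown_info;

/* registered with on_exit_nicely() exactly once, at startup */
static void
shutdown_callback(int code, void *arg)
{
	ShutdownInformation *si = (ShutdownInformation *) arg;

	if (si->AHX)
		DisconnectDatabase(si->AHX);
}

/* called per database: just repoint the state, no new exit-handler entries */
static void
set_current_archive(Archive *AHX)
{
	shutdown_info.AHX = AHX;
}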

Here, I am attaching updated patches.




OK, looks good. Here's my latest. I'm currently working on tidying up 
docco and comments.



cheers


andrew




--
Andrew Dunstan
EDB: https://www.enterprisedb.com
From aea4ab40f4d461141ba92b50986b68f036d044cb Mon Sep 17 00:00:00 2001
From: Mahendra Singh Thalor 
Date: Wed, 19 Mar 2025 01:18:46 +0530
Subject: [PATCH v20250403 1/4] Move common pg_dump code related to connections
 to a new file

ConnectDatabase is used by pg_dumpall, pg_restore and pg_dump so move
common code to new file.

new file name: connectdb.c

Author:Mahendra Singh Thalor 
---
 src/bin/pg_dump/Makefile |   5 +-
 src/bin/pg_dump/connectdb.c  | 294 +++
 src/bin/pg_dump/connectdb.h  |  26 +++
 src/bin/pg_dump/meson.build  |   1 +
 src/bin/pg_dump/pg_backup.h  |   6 +-
 src/bin/pg_dump/pg_backup_archiver.c |   6 +-
 src/bin/pg_dump/pg_backup_db.c   |  79 +--
 src/bin/pg_dump/pg_dump.c|   2 +-
 src/bin/pg_dump/pg_dumpall.c | 278 +
 9 files changed, 352 insertions(+), 345 deletions(-)
 create mode 100644 src/bin/pg_dump/connectdb.c
 create mode 100644 src/bin/pg_dump/connectdb.h

diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index 233ad15ca75..fa795883e9f 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	compress_lz4.o \
 	compress_none.o \
 	compress_zstd.o \
+	connectdb.o \
 	dumputils.o \
 	filter.o \
 	parallel.o \
@@ -50,8 +51,8 @@ pg_dump: pg_dump.o common.o pg_dump_sort.o $(OBJS) | submake-libpq submake-libpg
 pg_restore: pg_restore.o $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils
 	$(CC) $(CFLAGS) pg_restore.o $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
 
-pg_dumpall: pg_dumpall.o dumputils.o filter.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
-	$(CC) $(CFLAGS) pg_dumpall.o dumputils.o filter.o $(WIN32RES) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+pg_dumpall: pg_dumpall.o $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils
+	$(CC) $(CFLAGS) pg_dumpall.o $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
 
 install: all installdirs
 	$(INSTALL_PROGRAM) pg_dump$(X) '$(DESTDIR)$(bindir)'/pg_dump$(X)
diff --git a/src/bin/pg_dump/connectdb.c b/src/bin/pg_dump/connectdb.c
new file mode 100644
index 000..9e593b70e81
--- /dev/null
+++ b/src/bin/pg_dump/connectdb.c
@@ -0,0 +1,294 @@
+/*-
+ *
+ * connectdb.c
+ *This is a common file connection to the database.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *src/bin/pg_dump/connectdb.c
+ *
+ *-
+ */
+
+#include "postgres_fe.h"
+
+#include "common/connect.h"
+#include "common/logging.h"
+#include "common/string.h"
+#include "connectdb.h"
+#include "dumputils.h"
+#include "fe_utils/string_utils.h"
+
+static char *constructConnStr(const char **keywords, const char **values);
+
+/*
+ * ConnectDatabase
+ *
+ * Make a database connection with the given parameters.  An
+ * interactive password prompt is automatically issued if required.
+ *
+ * If fail_on_error is false, we return NULL without printing any message
+ * on failure, but preserve any prompted password for the next try.
+ *
+ * On success, the 'connstr' is set to a connection string containing
+ * the options used and 'server_version' is set 

Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-04-03 Thread Daniel Gustafsson
> On 3 Apr 2025, at 20:02, Jacob Champion  
> wrote:
> 
> On Tue, Mar 18, 2025 at 9:09 PM Thomas Munro  wrote:
>> All pushed (wasn't sure if Daniel was going to but once I got tangled
>> up in all that kqueue stuff he probably quite reasonably assumed that
>> I would :-)).
> 
> Attached are two more followups, separate from the libcurl split:
> - 0001 is a patch originally proposed at [1]. Christoph pointed out
> that the build fails on a platform that tries to enable Curl without
> having either epoll() or kqueue(), due to a silly mistake I made in
> the #ifdefs.
> - 0002 should fix some timeouts in 002_client.pl reported by Andres on
> Discord. I allowed a short connect_timeout to propagate into tests
> which should not have it.

Thanks, both LGTM so pushed.

--
Daniel Gustafsson





RE: Fix 035_standby_logical_decoding.pl race conditions

2025-04-03 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Bertrand,

> The other idea to simplify the changes for backbranches:
> sub reactive_slots_change_hfs_and_wait_for_xmins
> {
> ...
> +  my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;
> 
>   # create the logical slots
> -  create_logical_slots($node_standby, $slot_prefix);
> +  create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
> 
> ...
> +  if ($needs_active_slot)
> +  {
> +$handle =
> +  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
> +  }
> 
> What if this function doesn't take input parameter needs_active_slot
> and rather removes the call to make_slot_active? We will call
> make_slot_active only at the required places. This should make the
> changes much less because after that, we don't need to make changes
> related to drop and create. Sure, in some cases, we will test two
> inactive slots instead of one, but I guess that would be the price to
> keep the tests simple and more like HEAD.

Actually, I could not decide which one is better, so let me share both drafts.
V5-PG17-1 uses the previous approach, and v5-PG17-2 uses the newly proposed one.
Bertrand, which one do you like?

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v5-PG17-1-0001-Stabilize-035_standby_logical_decoding.pl.patch
Description:  v5-PG17-1-0001-Stabilize-035_standby_logical_decoding.pl.patch


v5-PG17-2-0001-Stabilize-035_standby_logical_decoding.pl.patch
Description:  v5-PG17-2-0001-Stabilize-035_standby_logical_decoding.pl.patch


New criteria for autovacuum

2025-04-03 Thread Konstantin Knizhnik

Hi hackers,

Some time ago I investigated a slow query performance case for one customer
and noticed that an index-only scan had made a lot of heap fetches.



->  Index Only Scan using ix_client_objects_vendor_object_id on client_objects client_objects_1
        (cost=0.56..2.78 rows=1 width=0) (actual time=0.006..0.006 rows=1 loops=208081)
      Index Cond: (vendor_object_id = vendor_objects.id)
      Heap Fetches: 208081
      Buffers: shared hit=1092452 read=156034


So almost every index entry requires a visibility check, and the index-only
scan is actually a normal index scan.

It certainly has a bad impact on performance.
I do not know what happened in this particular case, why the pages were not
marked as all-visible and why an index-only scan plan was chosen by the optimizer.
But the problem can be quite easily reproduced. We can just populate a
table with some data while some other transaction with an assigned XID is active.

Then explicitly vacuum this table or wait until autovacuum does it.
At this moment the table has no more dead or inserted tuples, so autovacuum
will not be called for it. But the heap pages of this table are still not
marked as all-visible,
and will never be marked as all-visible unless the table is updated or is
explicitly vacuumed.


This is why I think that it may be useful to add more columns to the
pg_stat_all_tables and pg_stat_all_indexes views providing information
about heap visibility checks performed by index-only scans, and, in
addition to the number of dead/inserted tuples, to add the number of such
visibility checks as a criterion for performing autovacuum on the
particular table.


The proposed patch is attached.
I am not quite happy with the test - it is intended to check whether
autovacuum is really triggered by this new criterion, but it depends on
the autovacuum activation frequency and may take several seconds.


Will be glad to receive any feedback.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fea683cb49c..f8dc1d04deb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8929,6 +8929,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;

   
 
+  
+   autovacuum_vacuum_check_threshold 
(integer)
+   
+autovacuum_vacuum_check_threshold
+configuration parameter
+   
+   
+   
+
+ Specifies the number of heap tuple visibility checks by index-only 
scan needed to trigger a
+ VACUUM in any one table.
+ The default is 1000 tuples.  If -1 is specified, autovacuum will not
+ trigger a VACUUM operation on any tables based on
+ the number of visibility checks.
+ This parameter can only be set in the 
postgresql.conf
+ file or on the server command line;
+ but the setting can be overridden for individual tables by
+ changing table storage parameters.
+
+   
+  
+
   
autovacuum_analyze_threshold 
(integer)

@@ -8990,6 +9012,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;

   
 
+  
+   autovacuum_vacuum_check_scale_factor 
(floating point)
+   
+
autovacuum_vacuum_check_scale_factor
+configuration parameter
+   
+   
+   
+
+ Specifies a fraction of the table size to add to
+ autovacuum_vacuum_check_threshold
+ when deciding whether to trigger a VACUUM.
+ The default is 0.2 (20% of table size).
+ This parameter can only be set in the 
postgresql.conf
+ file or on the server command line;
+ but the setting can be overridden for individual tables by
+ changing table storage parameters.
+
+   
+  
+
   
autovacuum_analyze_scale_factor 
(floating point)

diff --git a/src/backend/access/common/reloptions.c 
b/src/backend/access/common/reloptions.c
index 46c1dce222d..f93cf1671ed 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1869,6 +1869,8 @@ default_reloptions(Datum reloptions, bool validate, 
relopt_kind kind)
offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
vacuum_max_threshold)},
{"autovacuum_vacuum_insert_threshold", RELOPT_TYPE_INT,
offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
vacuum_ins_threshold)},
+   {"autovacuum_vacuum_check_threshold", RELOPT_TYPE_INT,
+   offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
vacuum_check_threshold)},
{"autovacuum_analyze_threshold", RELOPT_TYPE_INT,
offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
analyze_threshold)},
{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
@@ -1895,6 +1897,8 @@ default_reloptions(Datum reloptions, bool validate, 
relopt_kind kind)
offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
vacuum_scale_factor)},
{"autovacuum_vacuum_insert_scale_facto

Re: SQLFunctionCache and generic plans

2025-04-03 Thread Alexander Lakhin

Hello Tom,

02.04.2025 21:09, Tom Lane wrote:

Since feature freeze is fast approaching, I did a tiny bit more
cosmetic work on this patchset and then pushed it.  (There's still
plenty of time for adjustments if you have further comments.)


I've discovered that starting from 0dca5d68d, the following query:
CREATE FUNCTION f(x anyelement) RETURNS anyarray AS '' LANGUAGE SQL;
SELECT f(0);

triggers:
TRAP: failed Assert("fcache->func->rettype == VOIDOID"), File: "functions.c", 
Line: 1737, PID: 3784779

On 0dca5d68d~1, it raises:
ERROR:  return type mismatch in function declared to return integer[]
DETAIL:  Function's final statement must be SELECT or 
INSERT/UPDATE/DELETE/MERGE RETURNING.
CONTEXT:  SQL function "f" during startup

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Re: Statistics Import and Export

2025-04-03 Thread Nathan Bossart
Thanks for reviewing.

On Thu, Apr 03, 2025 at 03:23:40PM -0700, Jeff Davis wrote:
> This simplifies commit a0a4601765. I'd break out that simplification as
> a separate commit to make it easier to understand what happened. 

Done.

> In patch 0003, there are quite a few static function-scoped variables,
> which is not a style that I'm used to. One idea is to bundle them into
> a struct representing the cache state (including enough information to
> fetch the next batch), and have a single static variable that points to
> that.

As discussed off-list, I didn't take this suggestion for now.  Corey did
this originally, and I converted it to static function-scoped variables 1)
to reduce patch size and 2) because I noticed that each of the state
variables were only needed in one function.  I agree that a struct might be
slightly more readable, but we can always change this in the future if
desired.

> Also in 0003, the "next_te" variable is a bit confusing, because it's
> actually the last TocEntry, until it's advanced to point to the current
> one.

I've renamed it to expected_te.

> Other than that, looks good to me.

Great.  I'm planning to commit the attached patch set tomorrow morning.

For the record, I spent most of today trying very hard to fix the layering
violations in 0002.  While I was successful, the result was awkward,
complicated, and nigh unreadable.  This is now the second time I've
attempted to fix this and have felt the result was worse than where I
started.  So, I added extremely descriptive comments instead.  I'm hoping
that it will be possible to clean this up with some additional work in v19.
I have a few ideas, but if anyone has suggestions, I'm all ears.

-- 
nathan
>From a5a31f2754bf69a81fdc48769c1ee950317a2cf0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 3 Apr 2025 13:20:28 -0500
Subject: [PATCH v12n6 1/4] Skip second WriteToc() call for custom-format dumps
 without data.

Presently, "pg_dump --format=custom" calls WriteToc() twice.  The
second call updates the data offset information, which allegedly
makes parallel pg_restore significantly faster.  However, if we're
not dumping any data, there are no data offsets to update, so we
can skip this step.

Reviewed-by: Jeff Davis 
Discussion: https://postgr.es/m/Z9c1rbzZegYQTOQE%40nathan
---
 src/bin/pg_dump/pg_backup_custom.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_custom.c 
b/src/bin/pg_dump/pg_backup_custom.c
index e44b887eb29..f7c3af56304 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -755,9 +755,11 @@ _CloseArchive(ArchiveHandle *AH)
 * If possible, re-write the TOC in order to update the data 
offset
 * information.  This is not essential, as pg_restore can cope 
in most
 * cases without it; but it can make pg_restore significantly 
faster
-* in some situations (especially parallel restore).
+* in some situations (especially parallel restore).  We can 
skip this
+* step if we're not dumping any data; there are no offsets to 
update
+* in that case.
 */
-   if (ctx->hasSeek &&
+   if (ctx->hasSeek && AH->public.dopt->dumpData &&
fseeko(AH->FH, tpos, SEEK_SET) == 0)
WriteToc(AH);
}
-- 
2.39.5 (Apple Git-154)

>From 0244b3f02e083e6c37a2c282292d4d8fa0a69fed Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 3 Apr 2025 17:15:51 -0500
Subject: [PATCH v12n6 2/4] pg_dump: Reduce memory usage of dumps with
 statistics.

Right now, pg_dump stores all generated commands for statistics in
memory.  These commands can be quite large and therefore can
significantly increase pg_dump's memory footprint.  To fix, wait
until we are about to write out the commands before generating
them, and be sure to free the commands after writing.  This is
implemented via a new defnDumper callback that works much like the
dataDumper one but is specifically designed for TOC entries.

Custom dumps that include data might write the TOC twice (to update
data offset information), which would ordinarily cause pg_dump to
run the attribute statistics queries twice.  However, as a hack, we
save the length of the written-out entry in the first pass and skip
over it in the second.  While there is no known technical issue
with executing the queries multiple times and rewriting the
results, it's expensive and feels risky, so let's avoid it.

As an exception, we _do_ execute the queries twice for the tar
format.  This format does a second pass through the TOC to generate
the restore.sql file.  pg_restore doesn't use this file, so even if
the second round of queries returns different results than the
first, it won't corrupt the output; the archive and restore.sql
file will just have different content.

Author: Corey Huinker 
Co-authored-by: Nathan Bossa

Re: pg_stat_statements: improve loading and saving routines for the dump file

2025-04-03 Thread Tom Lane
m.litsa...@postgrespro.ru writes:
>> What does this patch give on a global scale? Does it save much memory or 
>> increase performance? How many times?

> This patch makes the code semantically more correct and we don't lose 
> anything. It is obviously not about performance or memory optimisation.

Indeed not: on my machine I see sizeof(pgssEntry) = 432.  It's full of
int64 fields, so the alignment requirement is 8 bytes, meaning that
the mutex field accounts for 8 bytes even though it's likely just 1
or 4 bytes wide.  Still, that means what you suggest is only going
to save 8/432 = 1.8% of the on-disk size of the struct.  Given that
we also store the SQL query text for each entry, the net savings
fraction is even smaller.

I don't really agree that this is adding any semantic correctness
either.  If we were reading directly into a live hashtable entry,
overwriting the mutex field could indeed be bad.  But the fread()
call is reading into a local variable "temp" and we only copy
selected fields out of that.  So it's just some bytes in a
transient stack entry.

On the whole therefore, I'm inclined to reject this on several
grounds:

* The savings is not worth the costs of bumping the stats file format
magic number (and thereby invalidating everyone's stored statistics).

* I'd rather keep the flexibility to put the mutex wherever we want in
the struct.  Right now that isn't worth much either, but depending on
what fields get added in future, we might have the opportunity to move
the mutex to someplace where it fits into padding space and hence adds
nothing to sizeof(pgssEntry).

* Adding the requirement on where the mutex is makes the code more
fragile, since somebody might ignore the comment and place things
incorrectly.  I'd like to think that such an error would be spotted
immediately, but we've committed sillier mistakes.  The consequences
would likely be that some fields don't get stored/reloaded correctly,
which might escape notice for awhile.

regards, tom lane




Re: Removing unneeded self joins

2025-04-03 Thread Richard Guo
On Fri, Apr 4, 2025 at 1:02 AM Alexander Korotkov  wrote:
> I've got an off-list bug report from Alexander Lakhin involving a
> placeholder variable.  Alena and Andrei proposed a fix.  It is fairly
> simple: we just shouldn't remove PHVs during self-join elimination, as
> they might still be referenced from other parts of a query.  The patch
> is attached.  I'm going to fix this if no objections.

Hmm, I'm not sure about the fix.  It seems to me that it simply
prevents removing any PHVs in the self-join removal case.  My concern
is that this might result in PHVs that could actually be removed not
being removed in many cases.

Besides, there's the specific comment above this code explaining the
logic behind the removal of PHVs.  Shouldn't that comment be updated
to reflect the changes?

Thanks
Richard




RE: pg_recvlogical cannot create slots with failover=true

2025-04-03 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> Thank you for updating the patch. Here are some minor comments:
> 
>  The --two-phase can be specified with
>  --create-slot to enable decoding of prepared
> transactions.
> +Also, --falover can be specified with
> +--create-slot to create replication slot which can
> be
> +synchronized to the standby.
> 
> How about rewording the whole sentence like:
> 
> +The --two-phase and
> --failover options
> +can be specified with --create-slot.
> 
> Explaining both options here seems redundant to me.

Fixed.

> ---
> +   
> +Allows created replication slot to be synchronized to the standbys.
> +This option may only be specified with
> --create-slot.
> +   
> 
> How about slightly rewording to like:
> 
> +Enables the slot to be synchronized to the standbys. This
> option may only be
> +specified with --create-slot.

Fixed. The description in usage() is adjusted based on this.

> Also, the descriptions of pg_recvlogical options are written in
> alphabetical order. Please put the description for --failover option
> after -E/--endpos.

Right. I put it there because it had the short option '-o' in an older version, but it was removed.
Fixed.

PSA new version.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v3-0001-Allow-pg_recvlogical-to-create-slots-with-failove.patch
Description:  v3-0001-Allow-pg_recvlogical-to-create-slots-with-failove.patch


RE: Improve CRC32C performance on SSE4.2

2025-04-03 Thread Devulapalli, Raghuveer
> Thanks for looking, I plan to commit this over the weekend unless there are
> objections.

LGTM. 

Raghuveer


Re: speedup COPY TO for partitioned table.

2025-04-03 Thread jian he
On Tue, Apr 1, 2025 at 1:38 PM vignesh C  wrote:
>
> On Tue, 1 Apr 2025 at 06:31, jian he  wrote:
> >
> > On Mon, Mar 31, 2025 at 4:05 PM Kirill Reshke  
> > wrote:
> >
> > Thanks for doing the benchmark.
>
> Few comments to improve the comments, error message and remove
> redundant assignment:
> 1) How about we change below:
> /*
>  * partition's rowtype might differ from the root table's.  We must
>  * convert it back to the root table's rowtype as we are export
>  * partitioned table data here.
> */
> To:
> /*
>  * A partition's row type might differ from the root table's.
>  * Since we're exporting partitioned table data, we must
>  * convert it back to the root table's row type.
>  */
>
I changed it to
/*
 * A partition's rowtype might differ from the root table's.
 * Since we are exporting partitioned table data here,
 * we must convert it back to the root table's rowtype.
 */
Since many places use "rowtype",
using "rowtype" instead of "row type" should be fine.


> 2) How about we change below:
> +   if (relkind == RELKIND_FOREIGN_TABLE)
> +   ereport(ERROR,
> +
> errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +   errmsg("cannot
> copy from foreign table \"%s\"",
> +
>  RelationGetRelationName(rel)),
> +
> errdetail("partition \"%s\" is a foreign table",
> RelationGetRelationName(rel)),
> +   errhint("Try
> the COPY (SELECT ...) TO variant."));
>
> To:
> +   if (relkind == RELKIND_FOREIGN_TABLE)
> +   ereport(ERROR,
> +
> errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +   errmsg("cannot
> copy from a partitioned table having foreign table partition \"%s\"",
> +
>  RelationGetRelationName(rel)),
> +
> errdetail("partition \"%s\" is a foreign table",
> RelationGetRelationName(rel)),
> +   errhint("Try
> the COPY (SELECT ...) TO variant."));
>
I am not so sure, since in the surrounding code we have:

else if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 errmsg("cannot copy from foreign table \"%s\"",
RelationGetRelationName(rel)),
 errhint("Try the COPY (SELECT ...) TO variant.")));

let's see what others think.
> 3) How about we change below:
> /*
>  * rel: the relation to be copied to.
>  * root_rel: if not NULL, then the COPY partitioned relation to destination.
>  * processed: number of tuples processed.
> */
> To:
> /*
>  * rel: the relation from which data will be copied.
>  * root_rel: If not NULL, indicates that rel's row type must be
>  *   converted to root_rel's row type.
>  * processed: number of tuples processed.
>  */
>
I changed it to

/*
 * rel: the relation from which the actual data will be copied.
 * root_rel: if not NULL, it indicates that we are copying partitioned relation
 * data to the destination, and "rel" is the partition of "root_rel".
 * processed: number of tuples processed.
*/

>  4)  You can initialize processed to 0 along with declaration in
> DoCopyTo function and remove the below:
>  +   if (cstate->rel && cstate->rel->rd_rel->relkind ==
> RELKIND_PARTITIONED_TABLE)
> {
> ...
> ...
> processed = 0;
> -   while (table_scan_getnextslot(scandesc,
> ForwardScanDirection, slot))
> ...
> ...
> -
> -   ExecDropSingleTupleTableSlot(slot);
> -   table_endscan(scandesc);
> +   }
> +   else if (cstate->rel)
> +   {
> +   processed = 0;
> +   CopyThisRelTo(cstate, cstate->rel, NULL, &processed);
> }
ok.
From 374f38d7187e92882e5fa4fe6e8b9dd7d7785ed9 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Fri, 4 Apr 2025 11:09:04 +0800
Subject: [PATCH v9 1/1] support COPY partitioned_table TO

CREATE TABLE pp (id  INT, val int ) PARTITION BY RANGE (id);
create table pp_1 (val int, id int);
create table pp_2 (val int, id int);
ALTER TABLE pp ATTACH PARTITION pp_1 FOR VALUES FROM (1) TO (5);
ALTER TABLE pp ATTACH PARTITION pp_2 FOR VALUES FROM (5) TO (10);
insert into pp select g, 10 + g from generate_series(1,9) g;
copy pp to stdout(header);

The above case is much slower (around 25% in some cases) than
``COPY (select * from pp) to stdout(header);``,
because of column remapping.  But this is still a new
feature, since master does not support ``COPY (partitioned_table)``.

Reviewed-by: vignesh C 
Reviewed-by: David Rowley 
Reviewed-by: Melih Mutlu 
Reviewed-by: Kirill Reshke 

discussion: https://postgr.es/m/CACJufxEZt+G19Ors3bQUq-42-61__C=y5k2wk=sHEFRusu7=iQ@mail.gmail.com
commitfest entry: https://commitfest.postgresql.org/patch/5467/
---
 doc/src/sgml/ref/copy.sgml  |   8 +-
 src/backend/command

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

2025-04-03 Thread Alexander Korotkov
Hi, Vitaly!

On Mon, Mar 3, 2025 at 5:12 PM Vitaly Davydov  wrote:
> The slot data is flushed to the disk at the beginning of checkpoint. If
> an existing slot is advanced in the middle of checkpoint execution, its
> advanced restart LSN is taken to calculate the oldest LSN for WAL
> segments removal at the end of checkpoint. If the node is restarted just
> after the checkpoint, the slots data will be read from the disk at
> recovery with the oldest restart LSN which can refer to removed WAL
> segments.
>
> The patch introduces a new in-memory state for slots -
> flushed_restart_lsn which is used to calculate the oldest LSN for WAL
> segments removal. This state is updated every time with the current
> restart_lsn at the moment, when the slot is saving to disk.

Thank you for your work on this subject.  I think your approach is
generally correct.  When we're truncating the WAL, we need to rely on
the position that would be used in the case of a server crash, that is,
the position flushed to disk.

While your patch generally looks good, I'd like to make the following
notes (a rough illustration follows after them):

1) As ReplicationSlotsComputeRequiredLSN() is called each time we need
to advance the position of WAL needed by replication slots, the usage
pattern probably could be changed.  Thus, we probably need to call
ReplicationSlotsComputeRequiredLSN() somewhere after a change of
restart_lsn_flushed while restart_lsn is not changed.  And we can
probably skip ReplicationSlotsComputeRequiredLSN() in some cases when
only restart_lsn is changed.
2) I think it's essential to include in the patch test cases which fail
without the patch.  You could start by integrating the test from [1]
into your patch, and then add more similar tests for different
situations.
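
For what it's worth, here is a minimal, self-contained sketch of the
flushed-LSN idea (the type and function names are invented for
illustration; this is not the actual PostgreSQL code): the slot keeps the
last restart_lsn that reached disk next to the in-memory value, updates it
only when the slot is saved, and the required LSN for WAL removal is
computed from the flushed value only, since that is all that survives a
crash.

#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct Slot
{
	XLogRecPtr	restart_lsn;			/* may advance at any time */
	XLogRecPtr	restart_lsn_flushed;	/* value as of the last save to disk */
} Slot;

static void
slot_save_to_disk(Slot *slot)
{
	/* ... write the slot state durably here ... */
	slot->restart_lsn_flushed = slot->restart_lsn;
	/* a natural point to recompute the required LSN, per note 1 above */
}

static XLogRecPtr
compute_required_lsn(const Slot *slots, int nslots)
{
	XLogRecPtr	oldest = UINT64_MAX;

	for (int i = 0; i < nslots; i++)
	{
		/* use the on-disk value, not the possibly newer in-memory one */
		if (slots[i].restart_lsn_flushed < oldest)
			oldest = slots[i].restart_lsn_flushed;
	}
	return oldest;
}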

Links.
1.  
https://www.postgresql.org/message-id/e3ac0535-e7a2-4a96-9b36-9f765e9cfec5%40vondra.me

--
Regards,
Alexander Korotkov
Supabase




Re: [PATCH] Add sortsupport for range types and btree_gist

2025-04-03 Thread Andrey Borodin


> On 3 Apr 2025, at 16:13, Heikki Linnakangas  wrote:
> 
> Committed

Cool! Thank you!!!


Best regards, Andrey Borodin.




Re: Improve error reporting for few options in pg_createsubscriber

2025-04-03 Thread Amit Kapila
On Sat, Mar 22, 2025 at 7:26 PM vignesh C  wrote:
>
> Hi,
>
> Currently, error reports for database, publication, subscription, and
> replication slots do not include the option name. This has been
> addressed by including the option name in the error messages, ensuring
> consistency similar to remove option. Additionally, pg_log_error and
> exit(1) have been replaced with a single pg_fatal statement which
> means the same. The attached patch implements these changes.
>

A few comments:
*
pg_createsubscriber.exe -D ..\..\data_standby -d db1 -d db2 -d db1
--publication pub1 --publication pub1 -P "dbname=postgres port=5432"
-p 5444 --dry-run
pg_createsubscriber: error: database "db1" specified more than once
for --database

Even when the user has used the short form, we are displaying the long form
for database. I suggest using both the short and long forms where applicable.

*
"invalid object type \"%s\" specified for --remove". Isn't it better
to use the short form as well in this error message?

-- 
With Regards,
Amit Kapila.




Re: Some codes refer slot()->{'slot_name'} but it is not defined

2025-04-03 Thread Fujii Masao




On 2025/04/04 12:06, Hayato Kuroda (Fujitsu) wrote:

Dear Fujii-san,


I think this fix should be backpatched to all supported versions.
Since the issue you found and the one I found seem different,
I'd prefer committing them separately.


I have no objections.


I've pushed the patches. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Improve error reporting for few options in pg_createsubscriber

2025-04-03 Thread vignesh C
On Fri, 4 Apr 2025 at 09:36, Amit Kapila  wrote:
>
> A few comments:
> *
> pg_createsubscriber.exe -D ..\..\data_standby -d db1 -d db2 -d db1
> --publication pub1 --publication pub1 -P "dbname=postgres port=5432"
> -p 5444 --dry-run
> pg_createsubscriber: error: database "db1" specified more than once
> for --database
>
> Even when user has used short form, we are displaying long form for
> database. I suggest to use both short and long form where applicable.

Fixed this.

> *
> "invalid object type \"%s\" specified for --remove". Isn't it better
> to use the short form as well in this error message?

Fixed this.

The attached v2 version patch has the changes for the same.

Regards,
Vignesh
From 40ef017e6e5f22fb8650495935c186b5a4a56878 Mon Sep 17 00:00:00 2001
From: Vignesh 
Date: Sat, 22 Mar 2025 18:56:15 +0530
Subject: [PATCH v2] Improve error reporting consistency in pg_createsubscriber

Ensure consistent error reporting in pg_createsubscriber
by including the option name in error messages. Additionally,
replace pg_log_error and exit calls with pg_fatal.
---
 src/bin/pg_basebackup/pg_createsubscriber.c   | 26 +--
 .../t/040_pg_createsubscriber.pl  |  4 +--
 2 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index eed3793c816..1d2f39e3926 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -2124,10 +2124,7 @@ main(int argc, char **argv)
 	num_dbs++;
 }
 else
-{
-	pg_log_error("database \"%s\" specified more than once", optarg);
-	exit(1);
-}
+	pg_fatal("database \"%s\" specified more than once for -d/--database", optarg);
 break;
 			case 'D':
 subscriber_dir = pg_strdup(optarg);
@@ -2146,7 +2143,7 @@ main(int argc, char **argv)
 if (!simple_string_list_member(&opt.objecttypes_to_remove, optarg))
 	simple_string_list_append(&opt.objecttypes_to_remove, optarg);
 else
-	pg_fatal("object type \"%s\" is specified more than once for --remove", optarg);
+	pg_fatal("object type \"%s\" is specified more than once for -R/--remove", optarg);
 break;
 			case 's':
 opt.socket_dir = pg_strdup(optarg);
@@ -2174,10 +2171,7 @@ main(int argc, char **argv)
 	num_pubs++;
 }
 else
-{
-	pg_log_error("publication \"%s\" specified more than once", optarg);
-	exit(1);
-}
+	pg_fatal("publication \"%s\" specified more than once for --publication", optarg);
 break;
 			case 3:
 if (!simple_string_list_member(&opt.replslot_names, optarg))
@@ -2186,10 +2180,7 @@ main(int argc, char **argv)
 	num_replslots++;
 }
 else
-{
-	pg_log_error("replication slot \"%s\" specified more than once", optarg);
-	exit(1);
-}
+	pg_fatal("replication slot \"%s\" specified more than once for --replication-slot", optarg);
 break;
 			case 4:
 if (!simple_string_list_member(&opt.sub_names, optarg))
@@ -2198,10 +2189,7 @@ main(int argc, char **argv)
 	num_subs++;
 }
 else
-{
-	pg_log_error("subscription \"%s\" specified more than once", optarg);
-	exit(1);
-}
+	pg_fatal("subscription \"%s\" specified more than once for --subscription", optarg);
 break;
 			default:
 /* getopt_long already emitted a complaint */
@@ -2226,7 +2214,7 @@ main(int argc, char **argv)
 
 		if (bad_switch)
 		{
-			pg_log_error("%s cannot be used with --all", bad_switch);
+			pg_log_error("%s cannot be used with -a/--all", bad_switch);
 			pg_log_error_hint("Try \"%s --help\" for more information.", progname);
 			exit(1);
 		}
@@ -2352,7 +2340,7 @@ main(int argc, char **argv)
 			dbinfos.objecttypes_to_remove |= OBJECTTYPE_PUBLICATIONS;
 		else
 		{
-			pg_log_error("invalid object type \"%s\" specified for --remove", cell->val);
+			pg_log_error("invalid object type \"%s\" specified for -R/--remove", cell->val);
 			pg_log_error_hint("The valid option is: \"publications\"");
 			exit(1);
 		}
diff --git a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
index 80153f7d77e..2d532fee567 100644
--- a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
+++ b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
@@ -399,7 +399,7 @@ command_fails_like(
 		'--database' => $db1,
 		'--all',
 	],
-	qr/--database cannot be used with --all/,
+	qr/--database cannot be used with -a\/--all/,
 	'fail if --database is used with --all');
 
 # run pg_createsubscriber with '--publication' and '--all' and verify
@@ -416,7 +416,7 @@ command_fails_like(
 		'--all',
 		'--publication' => 'pub1',
 	],
-	qr/--publication cannot be used with --all/,
+	qr/--publication cannot be used with -a\/--all/,
 	'fail if --publication is used with --all');
 
 # run pg_createsubscriber with '--all' option
-- 
2.43.0



Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

2025-04-03 Thread Bertrand Drouvot
Hi,

On Thu, Apr 03, 2025 at 03:23:31PM +0530, vignesh C wrote:
> Can we add it to one of the subscription tests, such as 001_rep_changes.pl?

Yeah that sounds like a good place for it. Done in the attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 49d9c73a29f5c05567d01f57424b7b8fd4df4eee Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 25 Feb 2025 10:18:05 +
Subject: [PATCH v6] Flush the IO statistics of active walsenders

The walsender does not flush its IO statistics until it exits.
The issue has been there since pg_stat_io was introduced in a9c70b46dbe.
This commit:

1. ensures the walsender does not wait until exit to flush its IO statistics
2. flushes its IO statistics periodically so as not to overload the walsender
3. adds a test for a physical walsender and a test for a logical walsender
---
 src/backend/replication/walsender.c| 63 +++---
 src/test/recovery/t/001_stream_rep.pl  | 16 ++
 src/test/subscription/t/001_rep_changes.pl | 13 +
 3 files changed, 74 insertions(+), 18 deletions(-)
  65.8% src/backend/replication/
  20.0% src/test/recovery/t/
  14.0% src/test/subscription/t/

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1028919aecb..9eed37b5de9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -91,10 +91,14 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
+#include "utils/pgstat_internal.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
 
+/* Minimum interval walsender IO stats flushes */
+#define MIN_IOSTATS_FLUSH_INTERVAL 1000
+
 /*
  * Maximum data payload in a WAL data message.  Must be >= XLOG_BLCKSZ.
  *
@@ -2742,6 +2746,8 @@ WalSndCheckTimeOut(void)
 static void
 WalSndLoop(WalSndSendDataCallback send_data)
 {
+	TimestampTz last_flush = 0;
+
 	/*
 	 * Initialize the last reply timestamp. That enables timeout processing
 	 * from hereon.
@@ -2836,30 +2842,51 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		 * WalSndWaitForWal() handle any other blocking; idle receivers need
 		 * its additional actions.  For physical replication, also block if
 		 * caught up; its send_data does not block.
+		 *
+		 * When the WAL sender is caught up or has pending data to send, we
+		 * also periodically report I/O statistics. It's done periodically to
+		 * not overload the WAL sender.
 		 */
-		if ((WalSndCaughtUp && send_data != XLogSendLogical &&
-			 !streamingDoneSending) ||
-			pq_is_send_pending())
+		if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending())
 		{
-			long		sleeptime;
-			int			wakeEvents;
+			TimestampTz now;
 
-			if (!streamingDoneReceiving)
-wakeEvents = WL_SOCKET_READABLE;
-			else
-wakeEvents = 0;
+			now = GetCurrentTimestamp();
 
-			/*
-			 * Use fresh timestamp, not last_processing, to reduce the chance
-			 * of reaching wal_sender_timeout before sending a keepalive.
-			 */
-			sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
+			if (TimestampDifferenceExceeds(last_flush, now, MIN_IOSTATS_FLUSH_INTERVAL))
+			{
+/*
+ * Report IO statistics
+ */
+pgstat_flush_io(false);
+(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+last_flush = now;
+			}
 
-			if (pq_is_send_pending())
-wakeEvents |= WL_SOCKET_WRITEABLE;
+			if (send_data != XLogSendLogical || pq_is_send_pending())
+			{
+long		sleeptime;
+int			wakeEvents;
+
+if (!streamingDoneReceiving)
+	wakeEvents = WL_SOCKET_READABLE;
+else
+	wakeEvents = 0;
+
+/*
+ * Use fresh timestamp, not last_processing, to reduce the
+ * chance of reaching wal_sender_timeout before sending a
+ * keepalive.
+ */
+sleeptime = WalSndComputeSleeptime(now);
+
+if (pq_is_send_pending())
+	wakeEvents |= WL_SOCKET_WRITEABLE;
+
+/* Sleep until something happens or we time out */
+WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
+			}
 
-			/* Sleep until something happens or we time out */
-			WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
 		}
 	}
 }
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index ccd8417d449..2508295eca6 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -42,6 +42,9 @@ $node_standby_2->init_from_backup($node_standby_1, $backup_name,
 	has_streaming => 1);
 $node_standby_2->start;
 
+# To check that an active walsender updates its IO statistics below.
+$node_primary->safe_psql('postgres', "SELECT pg_stat_reset_shared('io')");
+
 # Create some content on primary and check its presence in standby nodes
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
@@ -333,6 +336,19 @@ $node_primary->psql(
 
 note "switching to physical

Re: psql suggestion "select " offers nothing, can we get functions like "\df "

2025-04-03 Thread Tom Lane
Kirk Wolak  writes:
>   But "select " does nothing.

What would you have it offer?  Every single column and function name
in the database?  Seems unlikely to be very helpful.

postgres=# select count(distinct attname) from pg_attribute;
 count 
---
  1218
(1 row)

postgres=# select count(distinct proname) from pg_proc;
 count 
---
  2787
(1 row)

That's with zero user-defined objects.

regards, tom lane




Re: Restrict publishing of partitioned table with a foreign table as partition

2025-04-03 Thread Sergey Tatarintsev

01.04.2025 21:48, Shlok Kyal пишет:

On Fri, 28 Mar 2025 at 16:35, Álvaro Herrera  wrote:

On 2025-Mar-28, Shlok Kyal wrote:


On Mon, 24 Mar 2025 at 21:17, Álvaro Herrera  wrote:

Also, surely we should document this restriction in the SGML docs
somewhere.

I have added comment in create_publication.sgml

Hmm, okay, but "We cannot" is not the style used in the documentation.
In addition, I think this mechanism should be mentioned in
logical-replication.sgml; currently there's a note in the Restrictions
section about foreign tables, which should be expanded to explain this.


I have modified the comment in create_publication.sgml and also added
comment in the restrictions section of logical-replication.sgml.
I have also added a more detailed explanation in comment of
'check_foreign_tables'

I have attached the updated v11 patch.


Thanks and Regards,
Shlok Kyal


Hi!

I looked at the latest version of the patch, and I think that we should
free the puboids list here:


diff --git a/src/backend/commands/tablecmds.c 
b/src/backend/commands/tablecmds.c

index 6a128f7bd4e..4254654cc24 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -20122,6 +20122,7 @@ ATExecAttachPartition(List **wqueue, Relation 
rel, PartitionCmd *cmd,

relname)));
    }
    }
+   list_free(puboids);
    }

    /*


--
With best regards,
Sergey Tatarintsev,
PostgresPro





Re: Eliminating SPI / SQL from some RI triggers - take 3

2025-04-03 Thread Amit Langote
On Fri, Dec 20, 2024 at 1:23 PM Amit Langote  wrote:
> We discussed $subject at [1] and [2] and I'd like to continue that
> work with the hope to commit some part of it for v18.

I did not get a chance to do any further work on this in this cycle,
but plan to start working on it after beta release, so moving this to
the next CF.  I will post a rebased patch after the freeze to keep the
bots green for now.

-- 
Thanks, Amit Langote




Re: Extensible user mapping handler for FDW

2025-04-03 Thread Daniel Gustafsson
> On 2 Apr 2025, at 03:36, Peifeng Qiu  wrote:

> A major downside is the need to modify catalog table
> pg_user_mapping. So we can't apply this change to existing
> installation of prior stable versions.

A feature like this would never be considered for backpatching anyways so
that's not really a concern.

--
Daniel Gustafsson





Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding

2025-04-03 Thread Shlok Kyal
On Thu, 20 Mar 2025 at 18:09, Ajin Cherian  wrote:
>
> On Thu, Mar 13, 2025 at 5:49 PM Ajin Cherian  wrote:
> >
> > Moving this patch to the next CF as this patch needs more design level
> > inputs which may not be feasible in this CF but do continue to review
> > the patch.
> >
> > regards,
> > Ajin Cherian
> > Fujitsu Australia
>
> Rebased the patch as it no longer applied.
>
Hi Ajin,

I have reviewed patch v16-0001, here are my comments:

1. There are some places where comments exceed the 80-column window.
I think they should be <= 80 columns as per [1].
a. initial comment in reorderbuffer.c
+ *   Filtering is temporarily suspended for a sequence of changes
(set to 100
+ *   changes) when an unfilterable change is encountered. This
strategy minimizes
+ *   the overhead of hash searching for every record, which is
beneficial when the
+ *   majority of tables in an instance are published and thus
unfilterable. The
+ *   threshold of 100 was determined to be the optimal balance
based on performance
+ *   tests.
+ *
+ *   Additionally, filtering is paused for transactions
containing WAL records
+ *   (INTERNAL_SNAPSHOT, COMMAND_ID, or INVALIDATION) that modify
the historical
+ *   snapshot constructed during logical decoding. This pause is
necessary because
+ *   constructing a correct historical snapshot for searching publication
+ *   information requires processing these WAL records.

b.
+ if (!has_tuple)
+ {
+ /*
+ * Mapped catalog tuple without data, emitted while catalog table was in
+ * the process of being rewritten. We can fail to look up the
+ * relfilenumber, because the relmapper has no "historic" view, in
+ * contrast to the normal catalog during decoding. Thus repeated rewrites
+ * can cause a lookup failure. That's OK because we do not decode catalog
+ * changes anyway. Normally such tuples would be skipped over below, but
+ * we can't identify whether the table should be logically logged without
+ * mapping the relfilenumber to the oid.
+ */
+ return NULL;
+ }

2. I think, 'rb->unfiltered_changes_count' should be initialised in
the function 'ReorderBufferAllocate'. While debugging I found that the
initial value of rb->unfiltered_changes_count is 127. I think it
should be set to '0' inside 'ReorderBufferAllocate'. Am I missing
something here?
I had also raised the same comment as point 1 in [2].

Also, please ignore point 2 in email [2]; the crash happened because I
was testing without doing a clean build. Sorry for the
inconvenience.

[1]: https://www.postgresql.org/docs/devel/source-format.html
[2]: 
https://www.postgresql.org/message-id/CANhcyEUF1HzDRj_fJAGCqBqNg7xGVoATR7XgR311xq8UvBg9tg%40mail.gmail.com

Thanks and Regards,
Shlok Kyal




Re: libpq support for NegotiateProtocolVersion

2025-04-03 Thread Ranier Vilela
Hi.

Per Coverity.

CID 1596094: (#1 of 1): Structurally dead code (UNREACHABLE)
unreachable: Since the loop increment i++; is unreachable, the loop body
will never execute more than once.

The code of the function *pqGetNegotiateProtocolVersion3* is a little
confusing.

I believe that the Coverity alert is valid.
The function never returns 0.
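
As a minimal, compilable illustration of the pattern Coverity is flagging
(this is deliberately not the libpq code; it only shows why the loop
increment is structurally dead):

#include <stdio.h>

/*
 * Every path through the loop body returns, so "i++" can never execute
 * and the loop can never reach a second iteration.
 */
static int
first_sign(const int *values, int n)
{
	for (int i = 0; i < n; i++)
	{
		if (values[i] > 0)
			return 1;
		return -1;
	}
	return 0;					/* reachable only when n == 0 */
}

int
main(void)
{
	int			v[] = {-3, 5, 7};

	/* prints -1: the later elements are never examined */
	printf("%d\n", first_sign(v, 3));
	return 0;
}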

best regards,
Ranier Vilela


Re: Draft for basic NUMA observability

2025-04-03 Thread Tomas Vondra
On 4/3/25 10:23, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
>> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra  wrote:
>>
>> Hi Tomas,
>>
>>> OK, so you agree the commit messages are complete / correct?
>>
>> Yes.
> 
> Not 100% sure on my side. 
> 
> === v21-0002
> 
> Says:
> 
> "
> This introduces three new functions:
> 
> - pg_buffercache_init_entries
> - pg_buffercache_build_tuple
> - get_buffercache_tuple
> "
> 
> While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple()
> is).
> 

Ah, OK. Jakub, can you correct (and double-check) this in the next
version of the patch?

> About v21-0002:
> 
> === 1
> 
> I can see that the pg_buffercache_init_entries() helper comments are added in
> v21-0003 but I think that it would be better to add them in v21-0002
> (where the helper is actually created).
> 

+1 to that

> About v21-0003:
> 
> === 2
> 
>> I hear you, attached v21 / 0003 is free of float/double arithmetics
>> and uses non-float point values.
> 
> +   if (buffers_per_page > 1)
> +   os_page_query_count = NBuffers;
> +   else
> +   os_page_query_count = NBuffers * pages_per_buffer;
> 
> yeah, that's more elegant. I think that it properly handles the relationships
> between buffer and page sizes without relying on float arithmetic.
> 

In the review I just submitted, I changed this to

os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer);

but maybe it's less clear. Feel free to undo my change.
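
As a quick self-contained check of that equivalence (illustrative values
only; both sizes are assumed to be powers of two, so one of the two ratios
always collapses to 1):

#include <stdio.h>

int
main(void)
{
	int			blcksz = 8192;
	int			os_page_size = 4096;	/* try 2097152 to simulate huge pages */
	int			nbuffers = 16;

	int			pages_per_buffer = (blcksz >= os_page_size) ? blcksz / os_page_size : 1;
	int			buffers_per_page = (os_page_size > blcksz) ? os_page_size / blcksz : 1;

	/* the two-branch form ... */
	int			branch = (buffers_per_page > 1) ? nbuffers : nbuffers * pages_per_buffer;

	/*
	 * ... and the Max() form give the same answer, because pages_per_buffer
	 * collapses to 1 exactly when buffers_per_page > 1.
	 */
	int			max_form = (nbuffers > nbuffers * pages_per_buffer) ?
		nbuffers : nbuffers * pages_per_buffer;

	printf("%d %d\n", branch, max_form);
	return 0;
}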

> === 3
> 
> +   if (buffers_per_page > 1)
> +   {
> 
> As buffers_per_page does not change, I think I'd put this check outside of the
> for (i = 0; i < NBuffers; i++) loop, something like:
> 
> "
> if (buffers_per_page > 1) {
> /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
> for (i = 0; i < NBuffers; i++) {
> .
> .
> .
> .
> } else {
> /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
> for (i = 0; i < NBuffers; i++) {
> .
> .

I don't know. It's a matter of opinion, but I find the current code
fairly understandable. Maybe if it meaningfully reduced the code
nesting, but even with the extra branch we'd still need the for loop.
I'm not against doing this differently, but I'd have to see how that
looks. Until then I think it's fine to have the code as is.

> .
> "
> 
> === 4
> 
>> That _numa_prepare_ptrs() is unused and will need to be removed,
>> but we can still move some code there if necessary.
> 
> Yeah I think that it can be simply removed then.
> 

I'm not particularly attached on having the _ptrs() function, but why
couldn't it build the os_page_ptrs array as before?

> === 5
> 
>> @Bertrand: do you have anything against pg_shm_allocations_numa
>> instead of pg_shm_numa_allocations? I don't mind changing it...
> 
> I think that pg_shm_allocations_numa() is better (given the examples you just
> shared).
> 
> === 6
> 
>> I find all of those non-user friendly and I'm afraid I won't be able
>> to pull that alone in time...
> 
> Maybe we could add a few words in the doc to mention the "multiple nodes per
> buffer" case? And try to improve it for say 19? Also maybe we should just 
> focus
> till v21-0003 (and discard v21-0004 for 18).
> 

IMHO it's not enough to paper this over by mentioning it in the docs.

I'm a bit confused about which patch you suggest to leave out. I think
0004 is pg_shmem_allocations_numa, but I think that part is fine, no?
We've been discussing how to represent nodes for buffers in 0003.

I understand it'd require a bit of coding, but AFAICS adding an array
would be fairly trivial amount of code. Something like

  values[i++]
= PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID));

would do, I think. But if we decide to do the 3-column view, that'd be
even simpler, I think. AFAICS it means we could mostly ditch the
pg_buffercache refactoring in patch 0002, and 0003 would get much
simpler too I think.

If Jakub doesn't have time to work on this, I can take a stab at it
later today.


regards

-- 
Tomas Vondra





Re: Small optimization set tuple block/tableOid once

2025-04-03 Thread Ranier Vilela
rebased heapam.c and heapam_handler.c

best regards,
Ranier Vilela


v1-heapam_set_tuple_block_once.patch
Description: Binary data


v1-heapam_handler_set_tuple_block_once.patch
Description: Binary data


Re: Making sslrootcert=system work on Windows psql

2025-04-03 Thread Sandeep Thakkar
On Wed, Apr 2, 2025 at 2:35 AM George MacKerron 
wrote:

> I was very pleased to see the sslrootcert=system connection option added
> in Postgres 16 (I even blogged about it:
> https://neon.tech/blog/avoid-mitm-attacks-with-psql-postgres-16). But
> sslrootcert=system has not been widely supported by psql installations,
> perhaps because people compiling Postgres haven’t always been aware of the
> requirement to point OpenSSL in the direction of the system’s root CA
> certificates.
>
> I’ve recently been trying to get it more widely supported, with some
> success (details at end of this message).
>
> However, psql via the EnterpriseDB Windows installer still doesn’t support
> sslrootcert=system, and I think a tiny patch is needed. The diff is
> attached, and can be seen in context here:
> https://github.com/postgres/postgres/compare/master...jawj:postgres:jawj-sslrootcert-system-windows
>
> Essentially, on Windows with OpenSSL 3.2+, it replaces
> SSL_CTX_set_default_verify_paths(SSL_context) with
> SSL_CTX_load_verify_store(SSL_context, "org.openssl.winstore:”).
>
> Please note the EDB Windows installers for PostgreSQL versions (upto v17)
use OpenSSL v3.0, which is an LTS version. PostgreSQL 18 installer may use
OpenSSL v3.5 depending on the release timeframe.


> I’m not a Windows or OpenSSL expert, but so far the patched code seems to
> work in theory and in practice (sources below, and I’ve compiled and tested
> it working on Windows 11 x64).
>
>
> # Sources
>
> https://stackoverflow.com/a/79461864/338196
> https://docs.openssl.org/master/man7/OSSL_STORE-winstore/
> https://docs.openssl.org/master/man3/SSL_CTX_load_verify_locations/
>
>
> # Status of sslrootcert=system in various packages providing psql
>
> ## Mac
> Postgres.app — now fixed (
> https://github.com/PostgresApp/PostgresApp/issues/801)
> MacPorts — now fixed (https://trac.macports.org/ticket/72080)
> EDB installer — now fixed (
> https://github.com/EnterpriseDB/edb-installers/issues/264)
> homebrew — was working already
>
> ## Linux
> Debian/Ubuntu — now Recommends ca-certificates (
> https://salsa.debian.org/postgresql/postgresql/-/commit/96077ad61c36386646cdd9b5ce0e423a357ce73b
> )
>
> ## Windows
> EDB installer — in progress
> WSL1, WSL2 (Ubuntu, openSUSE) — was working already
>
>

-- 
Sandeep Thakkar


Re: Some codes refer slot()->{'slot_name'} but it is not defined

2025-04-03 Thread Fujii Masao




On 2025/04/03 12:15, Hayato Kuroda (Fujitsu) wrote:

Dear hackers,

Cluster.pm defines a function slot() which requires a slot_name as a key
and returns the attributes of the given slot as a hash-ref. ISTM the hash
does not contain 'slot_name'.

However, I found that some code accesses it using the key 'slot_name'.
ISTM it always becomes 'undef', thus any such tests are meaningless.

It looks like the existing code wants to check the existence of the given
logical slots. So, it is enough to search with the key 'plugin': a valid
value is set if the slot exists, otherwise ''.

What do you think?


I think you're right. The patch looks good to me.

-is($node_primary->slot('dropme_slot')->{'slot_name'},
-   undef, 'logical slot was actually dropped on standby');
+is($node_primary->slot('dropme_slot')->{'plugin'},
+   '', 'logical slot was actually dropped on standby');

This seems like a separate issue from what your patch is addressing,
but since this test is meant to confirm that the slot was dropped
on the standby, shouldn't node_primary be node_replica instead?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.

2025-04-03 Thread Amit Kapila
On Thu, Apr 3, 2025 at 2:20 PM vignesh C  wrote:
>
> On Thu, 3 Apr 2025 at 09:07, Peter Smith  wrote:
> >
> > Hi,
> >
> > I was away for the last few weeks when this feature got committed, so
> > I missed the chance to post my comments earlier.
> >
> > It seems the final SGML docs are mostly from this [1] suggestion, but
> > I think there are one or two more improvements that can be made to it:
> >
>
> Thanks for the patch, the changes look good to me.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-03 Thread Amit Kapila
On Thu, Apr 3, 2025 at 12:29 PM Bertrand Drouvot
 wrote:
>
> On Thu, Apr 03, 2025 at 05:34:10AM +, Hayato Kuroda (Fujitsu) wrote:
> > Dear Bertrand, Amit,
> >
> > > > I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think 
> > > > that we should
> > > > keep the slots active and only avoid doing the checks for them (they are
> > > invalidated
> > > > that's fine, they are not that's fine too).
> > > >
> > >
> > > I don't mind doing that, but there is no benefit in making slots
> > > active unless we can validate them. And we will end up adding some
> > > more checks, as in function check_slots_conflict_reason without any
> > > advantage.
>
> I think that there is an advantage. The pros are:
>
> - the test would be closer to HEAD from a behavioural point of view
> - it's very rare to hit the corner cases: so the test would behave the same
> as on HEAD most of the time (and when it does not, that would not hurt as the
> checks are not done)
> - Kuroda-San's patch removes the "or die "replication slot stats of
> vacuum_full_activeslot not updated"" check, while keeping the slot active
> would allow us to keep it (whether the slot gets invalidated or not). But
> more on that in the comment === 1 below.
>
> > I feel Kuroda-San's second patch is simple, and we have
> > > fewer chances to make mistakes and easy to maintain in the future as
> > > well.
>
> Yeah maybe but the price to pay is to discard the pros above. That said, I'm 
> also
> fine with Kuroda-San's patch if both of you feel that it's better.
>
> > I have concerns for Bertrand's patch that it could introduce another timing
> > issue. E.g., if the activated slots are not invalidated, dropping slots is 
> > keep
> > being activated so the dropping might be fail.
>
> Yeah, but the drop is done with "$node_standby->psql" so that the test does 
> not
> produce an error. It would produce an error should we use 
> "$node_standby->safe_psql"
> instead.
>
> > I did not reproduce this but
> > something like this can happen if we activate slots.
>
> You can see it that way (+ reproducer.txt):
>
> "
> +   my $bdt = $node_standby->safe_psql('postgres', qq[SELECT * from 
> pg_replication_slots]);
> +   note "BDT: $bdt";
> +
> $node_standby->psql('postgres',
> qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
> "
>
> You'd see the slot being active and the "$node_standby->psql" not reporting
> any error.
>

Hmm, but adding some additional smarts also makes this test less easy
to backpatch. I see your points related to the benefits, but I still
mildly prefer to go with the lesser-changes approach for the
backbranches patch. Normally, we don't enhance backbranch code without
making equivalent changes in HEAD, so this way there is less chance of
introducing new bugs only in the backbranches.

-- 
With Regards,
Amit Kapila.




Re: Draft for basic NUMA observability

2025-04-03 Thread Bertrand Drouvot
Hi Jakub,

On Thu, Apr 03, 2025 at 02:36:57PM +0200, Jakub Wartak wrote:
> On Thu, Apr 3, 2025 at 10:23 AM Bertrand Drouvot
>  wrote:
> Right, we could also put it as a limitation. I would be happy to leave
> it as it must be a rare condition, but Tomas stated he's not.
> 
> > Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18).
> 
> Do you mean discard pg_buffercache_numa (0002+0003) and instead go
> with pg_shm_allocations_numa (0004) ?

No, I meant the opposite: focus on 0001, 0002 and 0003 for 18. But if Tomas is
confident enough to also focus on 0004 in addition, that's fine too.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Draft for basic NUMA observability

2025-04-03 Thread Jakub Wartak
On Thu, Apr 3, 2025 at 2:15 PM Tomas Vondra  wrote:

> Ah, OK. Jakub, can you correct (and double-check) this in the next
> version of the patch?

Done.

> > About v21-0002:
> >
> > === 1
> >
> > I can see that the pg_buffercache_init_entries() helper comments are added 
> > in
> > v21-0003 but I think that it would be better to add them in v21-0002
> > (where the helper is actually created).
> >
>
> +1 to that

Done.

> > About v21-0003:
> >
> > === 2
> >
> >> I hear you, attached v21 / 0003 is free of float/double arithmetics
> >> and uses non-float point values.
> >
> > +   if (buffers_per_page > 1)
> > +   os_page_query_count = NBuffers;
> > +   else
> > +   os_page_query_count = NBuffers * pages_per_buffer;
> >
> > yeah, that's more elegant. I think that it properly handles the 
> > relationships
> > between buffer and page sizes without relying on float arithmetic.
> >
>
> In the review I just submitted, I changed this to
>
> os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer);
>
> but maybe it's less clear. Feel free to undo my change.

Cool, thanks, will send shortly v23 with this applied.

>
> > === 3
> >
> > +   if (buffers_per_page > 1)
> > +   {
> >
> > As buffers_per_page does not change, I think I'd put this check outside of 
> > the
> > for (i = 0; i < NBuffers; i++) loop, something like:
> >
> > "
> > if (buffers_per_page > 1) {
> > /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
> > for (i = 0; i < NBuffers; i++) {
> > .
> > .
> > .
> > .
> > } else {
> > /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
> > for (i = 0; i < NBuffers; i++) {
> > .
> > .
>
> I don't know. It's a matter of opinion, but I find the current code
> fairly understandable. Maybe if it meaningfully reduced the code
> nesting, but even with the extra branch we'd still need the for loop.
> I'm not against doing this differently, but I'd have to see how that
> looks. Until then I think it's fine to have the code as is.

v23 will incorporate Bertrand's idea shortly. No hard feelings, but
it's kind of painful to switch like that ;)

> > === 4
> >
> >> That _numa_prepare_ptrs() is unused and will need to be removed,
> >> but we can still move some code there if necessary.
> >
> > Yeah I think that it can be simply removed then.
> >
>
> I'm not particularly attached on having the _ptrs() function, but why
> couldn't it build the os_page_ptrs array as before?

I've removed it in v23; the code just didn't flow for me...

> > === 5
> >
> >> @Bertrand: do you have anything against pg_shm_allocations_numa
> >> instead of pg_shm_numa_allocations? I don't mind changing it...
> >
> > I think that pg_shm_allocations_numa() is better (given the examples you 
> > just
> > shared).

Done.

> > === 6
> >
> >> I find all of those non-user friendly and I'm afraid I won't be able
> >> to pull that alone in time...
> >
> > Maybe we could add a few words in the doc to mention the "multiple nodes per
> > buffer" case? And try to improve it for say 19? Also maybe we should just 
> > focus
> > till v21-0003 (and discard v21-0004 for 18).
> >
>
> IMHO it's not enough to paper this over by mentioning it in the docs.

OK

> I'm a bit confused about which patch you suggest to leave out. I think
> 0004 is pg_shmem_allocations_numa, but I think that part is fine, no?
> We've been discussing how to represent nodes for buffers in 0003.

IMHO 0001 + 0004 is good. 0003 is probably the last troublemaker, but
we settled on arrays right?

> I understand it'd require a bit of coding, but AFAICS adding an array
> would be fairly trivial amount of code. Something like
>
>   values[i++]
> = PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID));
>
> would do, I think. But if we decide to do the 3-column view, that'd be
> even simpler, I think. AFAICS it means we could mostly ditch the
> pg_buffercache refactoring in patch 0002, and 0003 would get much
> simpler too I think.
>
> If Jakub doesn't have time to work on this, I can take a stab at it
> later today.

I won't be able to even start on this today, so if you have cycles for
it, please do so...

-J.




Re: Making sslrootcert=system work on Windows psql

2025-04-03 Thread George MacKerron
> On 3 Apr 2025, at 11:41, Daniel Gustafsson  wrote:
> 
> org.openssl.winstore isn't by OpenSSL defined as the default even on Windows,
> but a future version might change that.

Right — there’s definitely an argument that OpenSSL should in future make it 
possible to have this be the default via a compile-time option, at least.


> I don't think we need to be more specific regarding what OpenSSL support, but
> in hindsight I wonder if we should be more specific around that "system"
> actually means.  The attached (untested) small diff tries to make that more
> clear. (Line reflow omitted for review ease.)

I guess my issue here was twofold: 

(1) sslrootcert=system on Windows doesn’t do what it says on the tin. In other 
words, it doesn’t do (a) what it sounds like it does or (b) what it says it 
does in the docs.

(2) sslrootcert=system on Windows doesn’t do a thing that would be extremely 
useful in some common situations. Namely: connecting securely to servers that 
present a certificate signed by a public CA.

Your diff certainly fixes (1b), so it’s definitely an improvement. But of 
course it does nothing for (2). :(


> Right now sslrootcert can have two different values, a filename or "system".  
> I
> don't think altering what "system" means is a good idea, but I also don't 
> think
> limiting ourselves to those two values is helpful.  We either need to make a
> new param.  to over time replace sslrootcert with, which can handle multiple
> different values; or we need to retrofit a DSL/syntax to sslrootcert for
> differentiating.  Both have in common that the coding task is magnitudes 
> easier
> than figuring out the user experience.
> 
> Something to consider for v19 work.

To give a bit of context here, my feeling is that the widespread use of 
sslmode=require is a pretty serious security problem in the Postgres community. 
I strongly suspect many users don’t realise that it offers no protection _at 
all_ against MITM attacks. I know it took me a while to figure that point out, 
because sslmode=require just _sounds_ reassuringly secure.

That’s why I was so pleased to read about sslrootcert=system in Postgres 16: I 
thought it was going to improve this situation. But sslrootcert=system (or 
similar) isn’t going to be widely used until Postgres providers put it in their 
connection strings, and Postgres providers aren’t going to put it in their 
connection strings until it has a damn good chance of just working.

On Linux and Mac, I would say the ‘damn good chance of just working’ bar has 
now been reached. But on Windows, I suspect a _lot_ of devs are using psql as 
installed by the EDB Installer (it’s the only option listed at 
https://www.postgresql.org/download/windows/, after all). So until that works, 
sslrootcert=system (or similar) is going to remain a no-go.

What I am saying is: it would be _really_ nice not to have to wait another 
whole release cycle to get a level of security on many people’s Postgres 
connections that’s simply on par with the security of visiting some random web 
page.

So: what can be done?

(1) I could ask the EDB installer guys if they’re willing to apply my patch to 
the Postgres source as part of their build process, so as to use the Windows 
store in this one case. Personally, I think that would be a clear improvement; 
but I don’t know if they’ll like the idea. Based on Sandeep’s comment, it seems 
this is also dependent on OpenSSL 3.5 (LTS) becoming available prior to the 
Postgres 18 release.

(2) Your idea of a new parameter, or a new value of sslrootcert, is what I was 
also starting to mull this morning. Is there any chance at all this could be 
done for Postgres 18 or, failing that, 18.1?

I quite like sslrootcert=os: it’s snappy, and it implies that the Operating 
System root certs are going to be used (which is what I would have liked 
sslrootcert=system to mean). Some possible alternatives might be 
sslrootcert=public-cas or sslrootcert=os-default.

The key thing is that it should work out-of-the-box basically everywhere, so my 
preferred behaviour for it would be:

* On Windows, use the Windows built-in cert store (per my original patch).

* On Mac and Linux, just do the exact same thing sslrootcert=system currently 
does.
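
For concreteness, a hedged sketch of that behaviour, based on the OpenSSL
functions and the winstore URI already mentioned earlier in this thread
(not on any committed PostgreSQL code), assuming OpenSSL 3.2+ on Windows:

#include <openssl/ssl.h>

/* Returns 0 on success, -1 on failure; ssl_ctx is assumed to be set up. */
static int
load_default_root_certs(SSL_CTX *ssl_ctx)
{
#ifdef _WIN32
	/* Windows: use the OS certificate store via OpenSSL 3.2+'s winstore URI */
	if (SSL_CTX_load_verify_store(ssl_ctx, "org.openssl.winstore:") != 1)
		return -1;
#else
	/* elsewhere: OpenSSL's default CA paths, as sslrootcert=system does today */
	if (SSL_CTX_set_default_verify_paths(ssl_ctx) != 1)
		return -1;
#endif
	return 0;
}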

Is there any realistic prospect of this? I appreciate that it’s not the result 
of a lengthy, thorough and principled UX evaluation. On the other hand, it’s a 
few lines of code that could enable a pretty big improvement in security for 
many users’ Postgres connections much sooner.

(3) Any other ideas?

--
George MacKerron





Re: Making sslrootcert=system work on Windows psql

2025-04-03 Thread Christoph Berg
Re: George MacKerron
> (3) Any other ideas?

I'm not a fan of "security by adding more connection parameters".

What are the chances of making "use the system/os default CA store"
the default? "sslmode=require" would then already actually "require" a
certificate if I'm reading the docs right. This would match user
expectations, per POLA.

This default could then be pointed at the correct locations (plural)
on all operating systems. (sslrootcert=system:wincert:otherlocation?)

The "default default" would still be sslmode=prefer so it wouldn't
break today's normal case. Users of sslmode=require will understand
that supplying a CA certificate is no longer optional.

Perhaps sslmode=require-weak could be added as a workaround.

Christoph




Re: Update LDAP Protocol in fe-connect.c to v3

2025-04-03 Thread Peter Eisentraut

On 22.03.25 22:22, Andrew Jackson wrote:

Apologies, forgot to attach the patch in the prior email.

On Sat, Mar 22, 2025 at 4:10 PM Andrew Jackson 
mailto:andrewjackson...@gmail.com>> wrote:


Currently the LDAP usage in fe-connect.c does not explicitly set the
protocol version to v3. This causes issues with many LDAP servers as
they will often require clients to use the v3 protocol and disallow
any use of the v2 protocol. Further the other usage of LDAP in
postgres (in `backend/libpq/auth.c`) uses the v3 protocol.

This patch changes fe-connect.c so that it uses the v3 protocol
similar to `backend/libpq/auth.c`.

One further note is that I do not currently see any test coverage
over the LDAP functionality in `fe-connect.c`. I am happy to add
that to this patch if needed.


Here is a slightly polished version of this patch.  I added an error 
message, and changed the return code, but it's a bit confusing which one 
might be the right one.


I also looked over the test file that you sent in a separate message. 
That also looks generally ok, but I'm not so deep into LDAP right now 
that I can give a detailed review.


My hunch right now is that we should probably take the patch that sets 
the version option and consider it for backpatching.  The patch with the 
tests can be held for detailed review later.
From c1e85711e1b0c7efcd1fa55cc46db959e12d6cfb Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 3 Apr 2025 15:06:13 +0200
Subject: [PATCH v2] libpq: Set LDAP protocol version 3

Some LDAP servers reject the default version 2 protocol.  So set
version 3 before starting the connection.  This matches how the
backend LDAP code has worked all along.

Co-authored-by: Andrew Jackson 
Discussion: 
https://www.postgresql.org/message-id/flat/CAKK5BkHixcivSCA9pfd_eUp7wkLRhvQ6OtGLAYrWC%3Dk7E76LDQ%40mail.gmail.com
---
 src/interfaces/libpq/fe-connect.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/interfaces/libpq/fe-connect.c 
b/src/interfaces/libpq/fe-connect.c
index 715b5d5aff4..d45fd6bdcf9 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -5475,6 +5475,7 @@ ldapServiceLookup(const char *purl, PQconninfoOption 
*options,
   *entry;
struct berval **values;
LDAP_TIMEVAL time = {PGLDAP_TIMEOUT, 0};
+   const int   ldapversion = LDAP_VERSION3;
 
if ((url = strdup(purl)) == NULL)
{
@@ -5606,6 +5607,15 @@ ldapServiceLookup(const char *purl, PQconninfoOption 
*options,
return 3;
}
 
+   if ((rc = ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &ldapversion)) 
!= LDAP_SUCCESS)
+   {
+   libpq_append_error(errorMessage, "could not set LDAP protocol 
version: %s",
+  ldap_err2string(rc));
+   free(url);
+   ldap_unbind(ld);
+   return 3;
+   }
+
/*
 * Perform an explicit anonymous bind.
 *
-- 
2.49.0



Re: Update LDAP Protocol in fe-connect.c to v3

2025-04-03 Thread Tom Lane
Peter Eisentraut  writes:
> Here is a slightly polished version of this patch.  I added an error 
> message, and changed the return code, but it's a bit confusing which one 
> might be the right one.

I'm kind of -0.5 on declaring the variable as "const", because none of
our existing calls of ldap_set_option do that.  I do see that the
Linux man page for ldap_set_option claims that that argument can be
const, but I think you're risking a portability gotcha for no large
gain.  LGTM otherwise.

> My hunch right now is that we should probably take the patch that sets 
> the version option and consider it for backpatching.  The patch with the 
> tests can be held for detailed review later.

+1 for that plan.

regards, tom lane




RE: AIX support

2025-04-03 Thread Srirama Kucherlapati
Hi Wenhui Qiu,

May I know the freeze dates?

We are nearly ready to deliver the patch. Currently, we have compiled the
source using Meson and are investigating one test case issue. Once we pinpoint
the cause, we will send you the patch. Notably, this test case behaves
differently on AIX, opting for a Bitmap Index Scan instead of an Index Scan.
Not sure if this is really specific to AIX?

>> meson test

396 Ok: 81
397 Expected Fail:  0
398 Fail:   1  (regex)
399 Unexpected Pass:0
400 Skipped:219
401 Timeout:0


>>
  7  explain (costs off) select * from pg_proc where proname ~ '^abc';
  8 -  QUERY PLAN
  9 ---
10 - Index Scan using pg_proc_proname_args_nsp_index on pg_proc
11 -   Index Cond: ((proname >= 'abc'::text) AND (proname < 'abd'::text))
12 + QUERY PLAN
13 +
14 + Bitmap Heap Scan on pg_proc
15 Filter: (proname ~ '^abc'::text)
16 -(3 rows)
17 +   ->  Bitmap Index Scan on pg_proc_proname_args_nsp_index
18 + Index Cond: ((proname >= 'abc'::text) AND (proname < 'abd'::text))
19 +(4 rows)

As of now, below are the changes with respect to meson.

modified:   meson.build
modified:   src/backend/meson.build
modified:   src/bin/initdb/initdb.c
modified:   src/include/c.h
modified:   src/include/port/aix.h
modified:   src/include/utils/float.h
modified:   src/interfaces/ecpg/compatlib/meson.build
modified:   src/interfaces/ecpg/ecpglib/meson.build
modified:   src/interfaces/ecpg/pgtypeslib/meson.build
modified:   src/interfaces/libpq/meson.build
modified:   src/test/regress/expected/txid.out
modified:   src/test/regress/expected/xid.out
modified:   src/test/regress/sql/sanity_check.sql



Warm regards,
Sriram.


Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

2025-04-03 Thread vignesh C
On Thu, 3 Apr 2025 at 12:54, Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Thu, Apr 03, 2025 at 09:25:02AM +0530, vignesh C wrote:
> > On Mon, 31 Mar 2025 at 18:35, Bertrand Drouvot
> >  wrote:
> > >
> > Couple of suggestions:
>
> Thanks for looking at it!
>
> > 1) I felt we can include a similar verification for one of the logical
> > replication tests too:
> > +# Wait for the walsender to update its IO statistics.
> >
> > +# Has to be done before the next restart and far enough from the
> > +# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
> > +$node_primary->poll_query_until(
> > + 'postgres',
> > + qq[SELECT sum(reads) > 0
> > +   FROM pg_catalog.pg_stat_io
> > +   WHERE backend_type = 'walsender'
> > +   AND object = 'wal']
> > +  )
> > +  or die
> > +  "Timed out while waiting for the walsender to update its IO statistics";
> > +
>
> Initially ([1]) it was added in 035_standby_logical_decoding.pl. But this
> test (035) is already racy enough that I felt better to move it to 
> 001_stream_rep.pl
> (see [2]) instead.
>
> I don't think that's a big issue as the code path being changed in the patch 
> is
> shared between logical and physical walsenders. That said that would not hurt 
> to
> add a logical walsender test, but I would prefer it to be outside
> 035_standby_logical_decoding.pl. Do you have a suggestion for the location of
> such a test?

Can we add it to one of the subscription tests, such as 001_rep_changes.pl?

Regards,
Vignesh




Re: Make query cancellation keys longer

2025-04-03 Thread Heikki Linnakangas

On 03/04/2025 14:29, Daniel Gustafsson wrote:

On 2 Apr 2025, at 15:48, Heikki Linnakangas  wrote:



And committed, with a bunch more minor cleanups.


0001 in commit 09be39112654 seems to have missed (omitted?) adding the HAVE_
macros to pg_config.h.in, which doesn't matter as of now since they're not used
but it will be in the way for anyone running autoheader.  Any objection to me
pushing the attached?


Sounds good, thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)




Re: [PATCH] Add sortsupport for range types and btree_gist

2025-04-03 Thread Heikki Linnakangas

On 02/04/2025 22:41, Heikki Linnakangas wrote:

On 02/04/2025 20:18, Heikki Linnakangas wrote:
So I added it for the btree opfamily too, and moved the function to 
rangetypes.c since it's not just for gist anymore. I committed that 
part, and will start looking more closely at the remaining btree_gist 
parts now.


Here's an updated version of the remaining parts. Found a couple of bugs:


Committed with some further cosmetic fixes. Thanks for the patience, 
this has been lingering for a long time, especially if you count the 
earlier attempt back in 2021!


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Make query cancellation keys longer

2025-04-03 Thread Daniel Gustafsson
> On 2 Apr 2025, at 15:48, Heikki Linnakangas  wrote:

> And committed, with a bunch more minor cleanups.

0001 in commit 09be39112654 seems to have missed (omitted?) adding the HAVE_
macros to pg_config.h.in, which doesn't matter as of now since they're not used
but it will be in the way for anyone running autoheader.  Any objection to me
pushing the attached?

--
Daniel Gustafsson



timingsafe.diff
Description: Binary data


Re: [PoC] Reducing planning time when tables have many partitions

2025-04-03 Thread David Rowley
On Tue, 25 Mar 2025 at 06:49, Tom Lane  wrote:
> I finally made some time to look at this patchset, and I'm pretty
> disappointed, because after 35 versions I'd expect to see something
> that looks close to committable.  This doesn't really.  I like the
> basic idea of taking child EC members out of ECs' main ec_members
> lists, but there are too many weird details and
> underexplained/overcomplicated/unmaintainable data structures.
>
> One thing I don't love is putting the children into RelOptInfos.
> That seems like an unrelated data structure.  Have you thought
> about instead having, in each EC that needs it, an array indexed
> by RTI of per-relation child-member lists?  I think this might
> net out as less storage because there typically aren't that many
> ECs in a query.  But the main thing is to not have so many
> interconnections between ECs and RelOptInfos.

I think that's quite a good idea. One drawback of that method is that
we'd need to duplicate the EquivalenceMembers into each relid making
up the joinrels in add_child_join_rel_equivalences(). That could mean
finding the same EM multiple times when iterating over the set. I
don't think that causes issues other than wasted effort.
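
To make the idea concrete, here is a rough sketch of the data structure
being discussed (field names are invented; this is not the actual planner
code):

typedef struct List List;		/* stand-in for the planner's List type */

typedef struct EquivalenceClassSketch
{
	List	   *ec_members;			/* parent members only */
	List	  **ec_child_members;	/* per-RTI lists of child members,
									 * allocated lazily; NULL if none */
	int			ec_child_members_size;	/* allocated length of the array */
} EquivalenceClassSketch;

An iterator would then visit ec_members plus the per-RTI lists for the
relids of interest, which is also where the duplicate visits for join
rels mentioned above would come from.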

> Another thing I really don't like is the back-link from EMs to ECs:
>
> +   EquivalenceClass *em_ec;/* EquivalenceClass which has this 
> member */
>
> That makes the data structure circular, which will cause pprint to
> recurse infinitely.  (The fact that you hadn't noticed that makes
> me wonder how you debugged any of these data structure changes.)
> We could prevent the recursion with suitable annotation on this field,
> but I'd really rather not have the field in the first place.  Circular
> pointers are dangerous and best avoided.  Also, it's bloating a node
> type that you are concerned about supporting a lot of.  Another point
> is that I don't see any code to take care of updating these links
> during an EC merge.
>
> Some thoughts about the iterator stuff:
>
> * setup_eclass_member_iterator_with_children is a carpal-tunnel-inducing
> name.  Could we drop the "_with_children" part?  It doesn't seem to
> add much, since there's no variant for "without children".
>
> * The root parameter should be first; IMO there should be no
> exceptions to that within the planner.  Perhaps putting the target
> iterator parameter last would make it read more nicely.  Or you could
> rely on struct assignment:
>
> it = setup_eclass_member_iterator(root, ec, relids);
>
> * Why did you define the iterator as possibly returning irrelevant
> members?  Doesn't that mean that every caller has to double-check?
> Wouldn't it make for less code and fewer bugs for the iterator to
> have that responsibility?  If there is a good reason to do it like
> that, the comments should explain why.

I've attached 2 patches, which I think address most of this, aside
from the last point.

These do need more work. I've just attached what I have so far before
I head off for the day. I am planning on running some performance
tests tomorrow and doing a round on the comments.

> I don't really like the concept of 0004 at all.  Putting *all*
> the EC-related RelOptInfos into a root-stored list seems to be
> doubling down very hard on the assumption that no performance-critical
> operations will ever need to search that whole list.  Is there a good
> reason to do it like that, rather than say using the bitmap-index
> concept separately within each EC?  That might also alleviate the
> problem you're having with the bitmapsets getting too big.

I've dropped this patch out of the set for now. There's other work
going on that might solve the issue that patch was aiming to solve.

> Given that we've only got a week left, I see little hope of getting
> any of this into v18.

I am keen on not giving up quite yet. I'd very much value any further
input you have. The changes don't seem excessively complex given the quite
large impact they have on planner performance here.

David


v36-0001-Add-the-PlannerInfo-context-to-the-parameter-of-.patch
Description: Binary data


v36-0002-Speed-up-searches-for-child-EquivalenceMembers.patch
Description: Binary data


Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query).

2025-04-03 Thread Fujii Masao



On 2025/04/03 17:53, jian he wrote:

On Wed, Apr 2, 2025 at 11:20 PM Fujii Masao  wrote:


  if (!RelationIsPopulated(rel))
  ereport(ERROR,
  errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("cannot copy from unpopulated
materialized view \"%s\"",
  RelationGetRelationName(rel)),
  errhint("Use the REFRESH MATERIALIZED VIEW
command to populate the materialized view first."));


I think it's better to use the same hint message as the one output by
"COPY (SELECT * FROM ) TO",
specifically: "Use the REFRESH MATERIALIZED VIEW command," for consistency.


ok.




The copy.sgml documentation should clarify that COPY TO can
be used with a materialized view only if it is populated.


"COPY TO can be used only with plain tables, not views, and does not
copy rows from child tables or child partitions"
I changed it to:
"COPY TO can be used with plain tables and materialized views, not
regular views, and does not copy rows from child tables or child
partitions"


It would be clearer to specify that "COPY TO" applies to *populated*
materialized views rather than just "materialized views"?



Another alternative wording I came up with:
"COPY TO can only be used with plain tables and materialized views,
not regular views. It also does not copy rows from child tables or
child partitions."


If we split the first description into two as you suggested,
I'm tempted to propose the following improvements to enhance
the overall descriptions:

-
"COPY TO" can be used with plain tables and populated materialized views. For example, "COPY table 
TO" copies the same rows as "SELECT * FROM ONLY table." However, it doesn't directly support other 
relation types, such as partitioned tables, inheritance child tables, or views. To copy all rows from these relations, 
use "COPY (SELECT * FROM table) TO."
-



Your wording makes sense to me.
I tried to ensure that the changed part in copy.sgml keeps the line width
under 80 characters, but I also want to make sure each markup tag and its
content stay on the same line.
So after the change it becomes:


 COPY TO can be used with plain
 tables and populated materialized views.
 For example,
 COPY table TO
 copies the same rows as
 SELECT * FROM ONLY table.
 However, it doesn't directly support other relation types,
 such as partitioned tables, inheritance child tables, or views.




The tests seem to have been placed under the category "COPY FROM ... DEFAULT",
which feels a bit misaligned. How about adding them to the end of copy.sql 
instead?


ok.


Thanks for updating the patch! I made some minor cosmetic changes
and updated the commit log. The revised patch is attached.

Unless there are any objections, I'll proceed with committing it.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From 18bdeab3d74e539835d59c9a86da8f4ccf2243be Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Thu, 3 Apr 2025 18:41:44 +0900
Subject: [PATCH v4] Allow "COPY table TO" command to copy rows from
 materialized views.

Previously, "COPY table TO" command worked only with plain tables and
did not support materialized views, even when they were populated and
had physical storage. To copy rows from materialized views,
"COPY (query) TO" command had to be used, instead.

This commit extends "COPY table TO" to support populated materialized
views directly, improving usability and performance, as "COPY table TO"
is generally faster than "COPY (query) TO". Note that copying from
unpopulated materialized views will still result in an error.

Author: jian he 
Reviewed-by: Kirill Reshke 
Reviewed-by: David G. Johnston 
Reviewed-by: Vignesh C 
Reviewed-by: Fujii Masao 
Discussion: 
https://postgr.es/m/CACJufxHVxnyRYy67hiPePNCPwVBMzhTQ6FaL9_Te5On9udG=y...@mail.gmail.com
---
 doc/src/sgml/ref/copy.sgml | 20 ++--
 src/backend/commands/copyto.c  | 13 -
 src/test/regress/expected/copy.out | 12 
 src/test/regress/sql/copy.sql  |  9 +
 4 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index df093da97c5..d6859276bed 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -520,16 +520,16 @@ COPY count
   Notes
 

-COPY TO can be used only with plain
-tables, not views, and does not copy rows from child tables
-or child partitions.  For example, COPY table TO copies
-the same rows as SELECT * FROM ONLY table.
-The syntax COPY (SELECT * FROM table) TO ... can be used to
-dump all of the rows in an inheritance hierarchy, partitioned table,
-or view.
+COPY TO can be used with plain
+tables and populated materialized views.
+For example,
+COPY table
+TO copies the same rows as
+SELECT * F

Re: Draft for basic NUMA observability

2025-04-03 Thread Tomas Vondra
On 4/3/25 09:01, Jakub Wartak wrote:
> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra  wrote:
> 
> Hi Tomas,
> 
>> OK, so you agree the commit messages are complete / correct?
> 
> Yes.
> 
>> OK. FWIW if you disagree with some of my proposed changes, feel free to
>> push back. I'm sure some may be more a matter of personal preference.
> 
> No, it's all fine. I will probably have lots of questions about
> setting up a proper development environment that takes care of style
> itself, but that's for another day.
> 
>> [..floats..]
>> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
>> confusing and difficult to reason about ...
>>
>> I agree we don't want special cases for each possible combination of
>> page sizes (I'm not sure we even know all the combinations). What I was
>> thinking about is two branches, one for (block >= page) and another for
>> (block < page). AFAICK both values have to be 2^k, so this would
>> guarantee we have either (block/page) or (page/block) as integer.
>>
>> I wonder if you could even just calculate both, and have one loop that
>> deals with both.
>>
> [..]
>> When I say "integer arithmetic" I don't mean it should use 32-bit ints,
>> or any other data type. I mean that it works with non-floating point
>> values. It could be int64, Size or whatever is large enough to not
>> overflow. I really don't see how changing stuff to double makes this
>> easier to understand.
> 
> I hear you, attached v21 / 0003 is free of float/double arithmetic
> and uses non-floating-point values. It should be more readable too with
> those comments. I have not put it into its own function, because now
> it fits the whole screen, so hopefully one can follow visually. Please
> let me know if that code solves the doubts or feel free to reformat
> it. That _numa_prepare_ptrs() is unused and will need to be removed,
> but we can still move some code there if necessary.
> 

IMHO the code in v21 is much easier to understand. It's not quite clear
to me why it's done outside pg_buffercache_numa_prepare_ptrs(), though.

>>> 12) You have also raised "why not pg_shm_allocations_numa" instead of
>>> "pg_shm_numa_allocations"
>>>
>>> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
>>> naming things in general), I can change if you want.
>>>
>>
>> Me neither. I wonder if there's some precedent when adding similar
>> variants for other catalogs ... can you check? I've been thinking about
>> pg_stats and pg_stats_ext, but maybe there's a better example?
> 
> Hm, it seems we always go with suffix "_somethingnew":
> 
> * pg_stat_database -> pg_stat_database_conflicts
> * pg_stat_subscription -> pg_stat_subscription_stats
> * even here: pg_buffercache -> pg_buffercache_numa
> 
> @Bertrand: do you have anything against pg_shm_allocations_numa
> instead of pg_shm_numa_allocations? I don't mind changing it...
> 

+1 to pg_shmem_allocations_numa

>>> 13) In the patch: "review: What if we get multiple pages per buffer
>>> (the default). Could we get multiple nodes per buffer?"
>>>
>>> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
>>> output multiple rows per single buffer (with "page_no") then we could
>>> get this:
>>> buffer1:..:page0:numanodeID1
>>> buffer1:..:page1:numanodeID2
>>> buffer2:..:page0:numanodeID1
>>>
>>> Should we add such functionality?
>>
>> When you say "today no" does that mean we know all pages will be on the
>> same node, or that there may be pages from different nodes and we can't
>> display that? That'd not be great, IMHO.
>>
>> I'm not a huge fan of returning multiple rows per buffer, with one row
>> per page. So for 8K blocks and 4K pages we'd have 2 rows per buffer. The
>> rest of the fields are for the whole buffer, so it'd be wrong to duplicate
>> that for each page.
> 
> OPEN_QUESTION: With v21 we have all the information available, we are
> just unable to display this in pg_buffercache_numa right now. We could
> trim the view so that it has 3 columns (and user needs to JOIN to
> pg_buffercache for more details like the relation oid), but then what was
> the whole refactor (0002) for, if we would just return bufferId like
> below:
> 
> buffer1:page0:numanodeID1
> buffer1:page1:numanodeID2
> buffer2:page0:numanodeID1
> buffer2:page1:numanodeID1
> 
> There's also the problem that reading/joining could be inconsistent
> and even slower.
> 

I think a view with just 3 columns would be a good solution. It's what
pg_shmem_allocations_numa already does, so it'd be consistent with that
part too.

I'm not too worried about the cost of the extra join - it's going to be
a couple dozen milliseconds at worst, I guess, and that's negligible in
the bigger scheme of things (e.g. compared to how long the move_pages is
expected to take). Also, it's not like having everything in the same
view is free - people would have to do some sort of post-processing, and
that has a cost too.
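
To illustrate, a sketch of what that join could look like against a
3-column view - the column names here are assumptions based on this
thread, not a final definition:

    SELECT n.numa_node, count(*) AS buffers
      FROM pg_buffercache_numa n   -- assumed columns: bufferid, os_page_num, numa_node
      JOIN pg_buffercache b USING (bufferid)
     WHERE b.reldatabase = (SELECT oid FROM pg_database
                            WHERE datname = current_database())
     GROUP BY n.numa_node
     ORDER BY n.numa_node;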

So unless someone can demonstrate a use case where this would matter,
I'd not worry about it too much.

Re: libpq support for NegotiateProtocolVersion

2025-04-03 Thread Heikki Linnakangas

On 03/04/2025 14:55, Ranier Vilela wrote:

Hi.

Per Coverity.

CID 1596094: (#1 of 1): Structurally dead code (UNREACHABLE)
unreachable: Since the loop increment i++; is unreachable, the loop body 
will never execute more than once.


That is true, and I think we should just silence it. The loop is
written that way in anticipation of future changes where libpq would 
actually request some protocol extensions, but would be happy to 
continue without server supporting them. But since we don't have any 
yet, it's always an error if the server responded with any unsupported 
protocol extensions.


The code of the function *pqGetNegotiateProtocolVersion3* is a little 
confusing.


I believe that the Coverity alert is valid.
The function never returns 0.


It certainly does. It returns 0 when connecting to a pre-v18 server, 
with the "max_protocol_version=3.2" option. In that case, the local 
variable "num" is 0, the loop never executes, and the function returns 0.


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Draft for basic NUMA observability

2025-04-03 Thread Jakub Wartak
On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra  wrote:

> On 4/3/25 09:01, Jakub Wartak wrote:
> > On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra  wrote:

Hi Tomas,

Here's v23 attached (I had to rework it because you sent v22 just at
the moment I wanted to send mine). Changes include:
- your review comments should be incorporated
- name change for the shm view
- pg_buffercache_numa refactored a little to keep the ifs outside the
loops (as Bertrand wanted initially)

So let's continue in this subthread.

> I think a view with just 3 columns would be a good solution. It's what
> pg_shmem_allocations_numa already does, so it'd be consistent with that
> part too.
>
> I'm not too worried about the cost of the extra join - it's going to be
> a couple dozen milliseconds at worst, I guess, and that's negligible in
> the bigger scheme of things (e.g. compared to how long the move_pages is
> expected to take). Also, it's not like having everything in the same
> view is free - people would have to do some sort of post-processing, and
> that has a cost too.
>
> So unless someone can demonstrate a use case where this would matter,
> I'd not worry about it too much.

OK - just 3 cols for pg_buffercache_numa is fine for me; it's just that
I don't have cycles left today and probably lack the skills (I've never
dealt with arrays so far), so it would be slow to get it right... but I
can pick up anything tomorrow morning.

> I'm -1 on JSON, I don't see how would that solve anything better than
> e.g. a regular array, and it's going to be harder to work with. So if we
> don't want to go with the 3-column view proposed earlier, I'd stick to a
> simple array. I don't think there's a huge difference between those two
> approaches, it should be easy to convert between those approaches using
> unnest() and array_agg().
>
> Attached is v22, with some minor review comments:
>
> 1) I suggested we should just use "libnuma support" in configure,
> instead of talking about "NUMA awareness support", and AFAICS you
> agreed. But I still see the old text in configure ... is that
> intentional or a bit of forgotten text?

It was my bad, too many rebases and mishaps with git voodoo..

> 2) I added a couple asserts to pg_buffercache_numa_pages() and comments,
> and simplified a couple lines (but that's a matter of preference).

Great, thanks.

> 3) I don't think it's correct for pg_get_shmem_numa_allocations to just
> silently ignore nodes outside the valid range. I suggest we simply do
> elog(ERROR), as it's an internal error we don't expect to happen.

It's there too.

-J.


v23-0001-Add-support-for-basic-NUMA-awareness.patch
Description: Binary data


v23-0004-Add-new-pg_shmem_allocations_numa-view.patch
Description: Binary data


v23-0002-pg_buffercache-split-pg_buffercache_pages-into-p.patch
Description: Binary data


v23-0003-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
Description: Binary data


Re: Using read stream in autoprewarm

2025-04-03 Thread Heikki Linnakangas

On 03/04/2025 17:31, Melanie Plageman wrote:

Attached v12 fixes a bug Bilal found off-list in 0002 related to
handling invalid blocks.


I had a quick look at this. Looks good overall, some small remarks:

v12-0001-Autoprewarm-global-objects-separately.patch


Instead, modify apw_load_buffers() to prewarm the shared objects in one
invocation of autoprewarm_database_main() while connected to the first
valid database.


So it effectively treats "global objects" as one extra database, 
launching a separate worker process to handle global objects. It took me 
a while to understand that. From the commit message, I understood that 
it still does that within the first worker process invocation, but no. A 
comment somewhere would be good.


One extra worker process invocation is obviously not an improvement 
performance-wise, but seems acceptable.


v12-0002-Refactor-autoprewarm_database_main-in-preparatio.patch

Yes, I agree this makes the logic more clear

v12-0003-Use-streaming-read-I-O-in-autoprewarm.patch

I wonder if the have_free_buffer() calls work correctly with read 
streams? Or will you "overshoot", prewarming a few more pages after the 
buffer cache is already full? I guess that depends on when exactly the 
read stream code allocates the buffer.



While reviewing this, I noticed a pre-existing bug: The code ignores 
'tablespace' when deciding if it's reached the end of the current 
relation. I believe it's possible to have two different relations with 
the same relnumber, in different tablespaces.


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: [PATCH] Add sortsupport for range types and btree_gist

2025-04-03 Thread Heikki Linnakangas

On 03/04/2025 18:00, Bernd Helmle wrote:

Note that the original coding was by Christoph Heiss and i picked up
his work and rewrote/fixed especially the part with the direct use of
sortsupport functions and many other issues spotted by testing and by
Andrey.


Sorry Christoph I missed you in the commit message!

--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Extend ALTER DEFAULT PRIVILEGES for large objects

2025-04-03 Thread Fujii Masao




On 2025/04/03 23:04, Yugo NAGATA wrote:

On Wed, 2 Apr 2025 02:35:35 +0900
Fujii Masao  wrote:




On 2025/01/23 19:22, Yugo NAGATA wrote:

On Wed, 22 Jan 2025 13:30:17 +0100
Laurenz Albe  wrote:


On Fri, 2024-09-13 at 16:18 +0900, Yugo Nagata wrote:

I've attached an updated patch. The test is rewritten using the
has_largeobject_privilege() function instead of calling loread & lowrite,
which makes the test a bit simpler. There are no other changes.


When I tried to apply this patch, I found that it doesn't apply any
more since commit f391d9dc93 renamed tab-complete.c to tab-complete.in.c.

Attached is a rebased patch.


Thank you for updating the patch!


I agree that large objects are a feature that should fade out (alas,
the JDBC driver still uses it for BLOBs).  But this patch is not big
or complicated and is unlikely to create a big maintenance burden.

So I am somewhat for committing it.  It works as advertised.
If you are fine with my rebased patch, I can mark it as "ready for
committer".  If it actually gets committed depends on whether there
is a committer who thinks it worth the effort or not.


I confirmed the patch and I am fine with it.


I've started reviewing this patch since it's marked as "ready for committer".


Thank you for your reviewing this patch!


I know of several systems that use large objects, and I believe
this feature would be beneficial for them. Overall, I like the idea.


The latest patch looks good to me. I just have one minor comment:


   only the privileges for schemas, tables (including views and foreign
   tables), sequences, functions, and types (including domains) can be
   altered.


In alter_default_privileges.sgml, this part should also mention large objects?


Agreed. I attached an updated patch.


Thanks for updating the patch!

If there are no objections, I'll proceed with committing it using the following 
commit log.


Extend ALTER DEFAULT PRIVILEGES to define default privileges for large objects.

Previously, ALTER DEFAULT PRIVILEGES did not support large objects.
This meant that to grant privileges to users other than the owner,
permissions had to be manually assigned each time a large object
was created, which was inconvenient.

This commit extends ALTER DEFAULT PRIVILEGES to allow defining default
access privileges for large objects. With this change, specified privileges
will automatically apply to newly created large objects, making privilege
management more efficient.

As a side effect, this commit introduces the new keyword OBJECTS
since it's used in the syntax of ALTER DEFAULT PRIVILEGES.

Original patch by Haruka Takatsuka, with some fixes and tests by Yugo Nagata,
and rebased by Laurenz Albe.

Author: Takatsuka Haruka 
Co-authored-by: Yugo Nagata 
Co-authored-by: Laurenz Albe 
Reviewed-by: Masao Fujii 
Discussion: 
https://postgr.es/m/20240424115242.236b499b2bed5b7a27f7a...@sraoss.co.jp
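
For readers unfamiliar with the feature, a minimal sketch of the usage it
enables (role names are made up; the syntax follows the patch):

    -- default privileges for large objects created by alice from now on
    ALTER DEFAULT PRIVILEGES FOR ROLE alice
        GRANT SELECT, UPDATE ON LARGE OBJECTS TO reporting;

    -- a large object alice creates afterwards is then accessible to reporting
    SELECT lo_create(0);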


Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: duplicated comments on get_relation_constraints

2025-04-03 Thread Richard Guo
On Wed, Apr 2, 2025 at 11:39 AM Richard Guo  wrote:
> The change looks good to me.  I plan to push it soon, barring any
> objections.

Pushed.

Thanks
Richard




Re: Replace IN VALUES with ANY in WHERE clauses during optimization

2025-04-03 Thread Alexander Korotkov
On Thu, Apr 3, 2025 at 5:18 PM Alena Rybakina  wrote:
> Okay, I agree with you.

Good.  I've reflected this limitation in the comments and the commit
message.  Also, I've adjusted the regression tests by removing excessive
ones and adding more important cases.  I'm going to push this if there
are no objections.
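
For context, a rough sketch of the rewrite the patch performs (table and
column names are made up; the exact plan shape is up to the optimizer):

    -- before: the IN (VALUES ...) list is planned as a separate subquery
    SELECT * FROM t WHERE x IN (VALUES (1), (2), (3));

    -- after the transformation it is treated roughly as if written
    SELECT * FROM t WHERE x = ANY (ARRAY[1, 2, 3]);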

--
Regards,
Alexander Korotkov
Supabase


v9-0001-Extract-make_SAOP_expr-function-from-match_orclau.patch
Description: Binary data


v9-0002-Convert-x-IN-VALUES-.-to-x-ANY-.-then-appropriate.patch
Description: Binary data


Re: New criteria for autovacuum

2025-04-03 Thread Aleksander Alekseev
Hi Konstantin,

> But the problem can be quite easily reproduced. We can just populate a table
> with some data while some other transaction with an assigned XID is active.
> Then explicitly vacuum this table or wait until autovacuum does it.
> At this moment the table has no more dead or inserted tuples, so autovacuum
> will not be called for it. But the heap pages of this table are still not
> marked as all-visible, and will never be marked as all-visible unless the
> table is updated or explicitly vacuumed.

I decided to experiment with the scenario you are describing. For
those who likes to have exact steps to reproduce the issue, as I do,
here they are:

Session 1:

```
CREATE TABLE humidity(
  ts TIMESTAMP NOT NULL,
  city TEXT NOT NULL,
  humidity INT NOT NULL);

CREATE INDEX humidity_idx ON humidity (ts, city) INCLUDE (humidity);

INSERT INTO humidity (ts, city, humidity)
SELECT ts + (INTERVAL '60 minutes' * random()), city, 100*random()
FROM generate_series('2010-01-01' :: TIMESTAMP,
 '2020-12-31', '1 hour') AS ts,
 unnest(array['Moscow', 'Berlin']) AS city;
```

Session 2:

```
BEGIN;

INSERT INTO humidity (ts, city, humidity)
SELECT ts + (INTERVAL '60 minutes' * random()), city, 100*random()
FROM generate_series('2022-01-01' :: TIMESTAMP,
 '2025-12-31', '1 hour') AS ts,
 unnest(array['Moscow', 'Berlin']) AS city;

-- no COMMIT
```

Session 1:

```
VACUUM humidity;
```

Session 2:

```
COMMIT;
```

Session 1:

```
EXPLAIN (ANALYZE, BUFFERS ON) SELECT humidity FROM humidity WHERE ts
>= '2022-01-01' AND ts < '2022-05-02' AND city = 'Moscow';
```

The result is:

```
 Index Only Scan using humidity_idx on humidity  (cost=0.42..58.75
rows=67 width=4) (actual time=0.060..7.490 rows=2904.00 loops=1)
   Index Cond: ((ts >= '2022-01-01 00:00:00'::timestamp without time
zone) AND (ts < '2022-05-02 00:00:00'::timestamp without time zone)
AND (city = 'Moscow'::text))
   Heap Fetches: 2904
   Index Searches: 1
   Buffers: shared hit=123
 Planning:
   Buffers: shared hit=10
 Planning Time: 0.840 ms
 Execution Time: 7.964 ms
```

... and it is claimed that autovacuum will never be triggered in order
to set hint bits, assuming we never modify the table again.
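
For anyone reproducing this, one way to observe the visibility-map state
directly is the pg_visibility contrib module (a sketch, run in session 1):

```
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- how many pages the visibility map marks all-visible / all-frozen
SELECT * FROM pg_visibility_map_summary('humidity');

-- after another manual VACUUM the all_visible count should cover the table
VACUUM humidity;
SELECT * FROM pg_visibility_map_summary('humidity');
```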

I have mixed feelings about addressing this. Although this behavior is
somewhat counterintuitive, if the user has a read-only lookup table
he/she can always execute VACUUM manually. In order to relieve him of
this unbearable burden we are going to need to introduce some overhead
that will affect all the users (not to mention people maintaining the
code). This would be convenient for sure but I'm not entirely sure if
it's worth it.

Thoughts?

-- 
Best regards,
Aleksander Alekseev




Re: Making sslrootcert=system work on Windows psql

2025-04-03 Thread Daniel Gustafsson
> On 3 Apr 2025, at 03:21, Jacob Champion  
> wrote:
> 
> On Wed, Apr 2, 2025 at 7:15 AM George MacKerron  
> wrote:
>>> But happily, I don’t think we need to choose. Can’t we just use the Windows 
>>> system store if neither of the relevant environment variables is set?

The env vars are only overrides for the default; they are not required to be
set, so their presence (or lack thereof) really can't be used as the basis for
any decision.

>> Thinking about this a little more, I guess the remaining concern is about 
>> people on Windows compiling their own psql from source, using an OpenSSL 
>> build that has a meaningful OPENSSLDIR baked in.
> 
> Right. In a past life I shipped client stacks on Windows that looked
> kind of like that; I would have been less than happy if a client
> suddenly stopped using the certificate bundle I'd set up.

Agreed.  I don't think it's productive to assume that OPENSSLDIR is bogus as a
general rule, OpenSSL sure doesn't.  Also, remember that the API we use sets
OpenSSL to use the default dir, file or *store*, not just file/dir:

SSL_CTX_set_default_verify_paths() specifies that the default locations
from which CA certificates are loaded should be used.  There is one default
directory, one default file and one default store.  The default CA
certificates directory is called certs in the default OpenSSL directory,
and this is also the default store.

org.openssl.winstore isn't defined by OpenSSL as the default even on Windows,
but a future version might change that.

>> My preference would be for "org.openssl.winstore:" to be the compile-time 
>> default, though, because the option is called sslrootcert=system and it’s 
>> documented as using “the system’s trusted CA roots” (not 
>> sslrootcert=openssldir or documented as using OpenSSL’s default CA roots).
> 
> If we'd decided to do that from the beginning, maybe... but it looks
> like the winstore URI wasn't released yet when we designed that, so
> "the system" couldn't have meant anything except for OpenSSL. Maybe
> the documentation needs to be more specific now that OpenSSL is
> supporting more stuff.

I don't think we need to be more specific regarding what OpenSSL supports, but
in hindsight I wonder if we should be more specific about what "system"
actually means.  The attached (untested) small diff tries to make that more
clear. (Line reflow omitted for review ease.)

> Even if we want to change that definition sometime in the future, we'd
> still have to wait at least until OpenSSL 3.1 was no longer supported;
> I don't think it would be very helpful for our definition of "system"
> to change abruptly when upgrading OpenSSL past the 3.2 boundary. All
> this to say, I'd like to support the winstore, but I'm not convinced
> it should take over the existing meaning of "system". Even just adding
> it as a fallback has some risk to any packagers who have gotten it
> working.

Right now sslrootcert can have two different kinds of values, a filename or
"system".  I don't think altering what "system" means is a good idea, but I
also don't think limiting ourselves to those two values is helpful.  We either
need to add a new parameter to replace sslrootcert over time, one which can
handle multiple different values; or we need to retrofit a DSL/syntax onto
sslrootcert for differentiating them.  Both have in common that the coding
task is orders of magnitude easier than figuring out the user experience.

Something to consider for v19 work.

--
Daniel Gustafsson



rootcertsystem.diff
Description: Binary data



Re: two occurrences of assign print_notnull within pg_dump.c

2025-04-03 Thread Ashutosh Bapat
On Thu, Apr 3, 2025 at 4:31 PM jian he  wrote:
>
> hi.
>
> in src/bin/pg_dump/pg_dump.c
> within function dumpTableSchema:
> there are two occurrences of:
> print_notnull = (tbinfo->notnull_constrs[j] != NULL &&
>  (tbinfo->notnull_islocal[j] ||
>   dopt->binary_upgrade ||
>   tbinfo->ispartition));

The same commit 14e87ffa5c543b5f30ead7413084c25f7735039f modified the
existing assignment of print_notnull and added another. I wonder why -
probably just an oversight, or an attempt to define it closer to its
usage. But we don't usually do the latter.

-- 
Best Wishes,
Ashutosh Bapat




Re: Draft for basic NUMA observability

2025-04-03 Thread Jakub Wartak
On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra  wrote:
>
> Hi,
>
> I've spent a bit of time reviewing this. In general I haven't found
> anything I'd call a bug, but here's a couple comments for v18 ... Most
> of this is in separate "review" commits, with a couple exceptions.

Hi, thank you very much for the help on this; I did not anticipate this
patch growing organically like that...
I've squashed those review findings into v20 and provided answers for
the "review:".

> 1) Please update the commit messages, with proper formatting, etc. I
> tried to do that in the attached v19, but please go through that, add
> relevant details, update list of reviewers, etc. The subject should not
> be overly long, etc.

Fixed by you.

> 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
> use just "libnuma support" similar to other libraries.

Fixed by you.

> 3) I don't think we need pg_numa.h to have this:
>
> extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
>
> AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
> would it be necessary here? It's enough to have a prototype in .c file.

Right, probably the result of ENOTENOUGHCOFFEE and copy/paste.

> 4) Improved .sgml to have acronym/productname in a couple places.

Great.

> 5) I don't think the comment for pg_buffercache_init_entries() is very
> useful. That it's helper for pg_buffercache_pages() tells me nothing
> about how to use it, what the arguments are, etc.

I've added an explanation (in 0003 though), so that this is covered.
I've always assumed that 'static' functions don't need that much
documentation(?)

> 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
> comment too. I mean, there's no info about what the arguments are, which
> arguments are input or output, etc. And it only discussed one option
> (block page < memory page), but what happens in the other case? The
> formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
> saying it's wrong).
>
> However, it seems rather suspicious that pages_per_blk is calculated as
> float, and pg_buffercache_numa_prepare_ptrs() then does this:
>
> for (size_t j = 0; j < pages_per_blk; j++)
> { ... }
>
> I mean, isn't this vulnerable to rounding errors, which might trigger
> some weird behavior? If not, it'd be good to add a comment why this is
> fine, it confuses me a lot. I personally would probably prefer doing
> just integer arithmetic here.

Please bear with me: if you set client_min_messages to debug1, then
pg_buffercache_numa will dump:
a) without HP,  DEBUG:  NUMA: os_page_count=32768 os_page_size=4096
pages_per_blk=2.00
b) with HP (2M) DEBUG:  NUMA: os_page_count=64 os_page_size=2097152
pages_per_blk=0.003906

so we need to support the two cases you mention (BLCKSZ > PAGESIZE and
BLCKSZ < PAGESIZE). BLCKSZ can be 2..32kB and the page size 4kB..1GB, so
that float can take the following sample values:

BLCKSZ  pagesize
2kB     4kB                 = 0.5
2kB     2048kB              = 0.0009765625
2kB     1024*1024kB # 1GB   = 0.0000019073486328125  # worst-case?
8kB     4kB                 = 2
8kB     2048kB              = 0.00390625  # example from above (x86_64, 2M HP)
8kB     1024*1024kB # 1GB   = 0.00000762939453125
32kB    4kB                 = 8
32kB    2048kB              = 0.015625
32kB    1024*1024kB # 1GB   = 0.000030517578125

So that loop:
for (size_t j = 0; j < pages_per_blk; j++)
is quite generic and runs in both cases. I've somehow failed to come up
with integer-based math and generic code for this (without special cases,
which seem to be a no-go here?). So, that loop will:
a) iterate many times to support BLCKSZ > pagesize, that is when a single
DB block spans multiple memory pages
b) iterate once when BLCKSZ < pagesize (because 0.00390625 > 0 in the
example above)

The loop touches and stores addresses into os_page_ptrs[] as input to one
big move_pages(2) query. So we basically ask for all memory pages backing
NBuffers. Once we get our NUMA information we then use blk2page =
up_to_NBuffers * pages_per_blk to resolve memory pointers back to buffers;
if there is a problem anywhere, it would be there.

So let's say we have s_b=4TB (it wont work for sure for other reasons,
let's assume we have it), let's also assume we have no huge
pages(pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824
which multiplied by 2 = INT_MAX (integer overflow bug), so I think
that int is not big enough there in pg_buffercache_numa_pages() (it
should be "size_t blk2page" there as in
pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20)

Another angle is s_b=4TB RAM with 2MB HP, BLKSZ=8kB =>
NBuffers=2097152 * 0.003906250 = 8192.0 .

OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with
float f = 2097152 * 0.003906250;
 under clang  -Weverything I got "implicit conversion increases
floating-point precision: 'float' to 'double'", so either it is:
- we somehow rewrite all o

Re: [PATCH] Add sortsupport for range types and btree_gist

2025-04-03 Thread Bernd Helmle
Am Mittwoch, dem 02.04.2025 um 22:41 +0300 schrieb Heikki Linnakangas:
> * The stuff to save the FmgrInfo for gbt_enum_sortsupport() was
> broken. 
> It saved the address of FmgrInfo struct that was allocated in a local
> variable, on the stack, in SortSupport->ssup_extra, and expected it
> to 
> be valid on subsequent call to gbt_enum_ssup_cmp(). It went unnoticed
> because enum_cmp() only accesses the FmgrInfo struct if it encounters
> odd enum values, i.e. enum values that have been added with ALTER
> TYPE 
> ADD VALUE. I fixed that, and modified the existing enum test to cover
> that case.
> 

Ugh, i obviously didn't pay enough attention here. 

> * The gist_bpchar_ops opfamily was using the built-in 
> bpchar_sortsupport() function directly. That's not right, you need to
> have a shim that extracts and compares just the 'lower' value, just
> like 
> for all other datatypes. I think that was just a simple oversight,
> but 
> it happened to pass the tests because bpchar_sortsupport() would not 
> outright crash, even though we were passing it garbage. It's
> interesting 
> that because the gist sortsupport function is used just to order the 
> input during build, everything still works if it sorts the input to a
> completely bogus ordering, it just gets slower. At one point while 
> fixing that, I also accidentally used "btcharcmp" instead of 
> "bpcharcmp", and all the tests passed with that too.
> 

Yes, I missed that. Originally the code did this all over the place
with many other types, too. I seem to have overlooked this somehow.

> Those are now fixed. I also harmonized the comments to use the same 
> phrasing for all the datatypes, marked all the sortsupport functions
> as 
> PARALLEL SAFE, and reformatted slightly.

Note that the original coding was by Christoph Heiss and i picked up
his work and rewrote/fixed especially the part with the direct use of
sortsupport functions and many other issues spotted by testing and by
Andrey.

Many thanks for picking this up. There are users out there which are
going to be very happy with this if no further issues arise. :)






Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-04-03 Thread Peter Eisentraut

On 03.04.25 10:07, Alvaro Herrera wrote:

The new flag is there for quick access by get_relation_info.  We could
easily not have it otherwise, because clients don't need it, but its
lack would probably make planning measurably slower because it'd have to
do syscache access for every single not-null constraint to figure out if
it's valid or not.


In the v6 patch, you are adding an attnullability field to the
CompactAttribute in the tuple descriptor and using that in
get_relation_info().  That seems like the right approach, because then 
you're doing all that preprocessing about which constraint is active in 
the relcache.  So I don't see where the extra pg_attribute field 
attnotnullvalid is getting used.
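
For reference, the user-visible shape of the feature under discussion, as
proposed by the patch (a sketch; the final syntax may still change):

    ALTER TABLE t ADD CONSTRAINT t_a_not_null NOT NULL a NOT VALID;
    -- existing rows are not checked at ADD time; later:
    ALTER TABLE t VALIDATE CONSTRAINT t_a_not_null;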






Re: AIX support

2025-04-03 Thread Andres Freund
Hi,

On 2025-04-03 13:45:01 +, Srirama Kucherlapati wrote:
> We are nearly ready to deliver the patch. Currently, we have compiled the
> source using Meson and are investigating one test case issue. Once we pinpoint
> the cause, we will send you the patch. Notably, this test case behaves
> differently on AIX, opting for a Bitmap Index Scan instead of an Index Scan.
> Not sure if this is really specific to AIX?
> 
> >> meson test
> 
> 396 Ok: 81
> 397 Expected Fail:  0
> 398 Fail:   1  (regex)
> 399 Unexpected Pass:0
> 400 Skipped:219

FWIW, the skipped tests indicate that you are building without tap tests or
such - that means most of the testsuite is not run. That certainly is not good
enough.

Greetings,

Andres Freund




Re: pull-up subquery if JOIN-ON contains refs to upper-query

2025-04-03 Thread Ilia Evdokimov



On 02.04.2025 19:39, Alena Rybakina wrote:


I see that I need to add a walker that, while traversing the tree,
determines whether there are conditions under which pull-up is
impossible - such as volatile functions and other restrictions - and
leaves the transformation to the Var objects that I added before; I
described it here.




I have some concerns about pulling up every clause from the subquery 
with one column. In particular, not every clause is safe or beneficial 
to pull up: OR-clauses, CASE expressions, and nested sublinks could 
significantly change how the planner estimates the number of rows or 
applies filters, especially when they are not true join predicates. 
Pulling them up might lead to worse plans, or even change the semantics 
in subtle ways. I think before applying such transformations, we should 
make sure they are not only safe but actually improve the resulting plan.
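
To make that concrete, a made-up example of the kind of construct discussed
here (names and shapes are purely illustrative):

    -- a single-column correlated sublink in JOIN ON, referencing the outer query
    SELECT *
      FROM t1
      JOIN t2 ON t2.id = (SELECT s.id FROM s WHERE s.x = t1.x);

    -- the same shape, but with OR and CASE inside - the sort of clause argued
    -- above to be unsafe or unprofitable to pull up
    SELECT *
      FROM t1
      JOIN t2 ON t2.id = (SELECT CASE WHEN s.y > 0 THEN s.id END
                            FROM s
                           WHERE s.x = t1.x OR s.y = t1.y);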





