Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-03-27 Thread Ashutosh Bapat
On Thu, Mar 20, 2025 at 3:25 PM Rushabh Lathia  wrote:
>
> Hi Alvaro,
>
> Thank you for the offline discussion.
>
> As we all agree, changing the attnotnull datatype would not be a good idea
> since it is a commonly used catalog column, and many applications and
> extensions depend on it.
>
> Attached is another version of the patch (WIP), where I have introduced a
> new catalog column, pg_attribute.attinvalidnotnull (boolean). This column
> will default to FALSE but will be set to TRUE when an INVALID NOT NULL
> constraint is created.  With this approach, we can avoid performing extra
> scans on the catalog table to identify INVALID NOT NULL constraints,
> ensuring there is no performance impact.
>
> Also updated the pg_dump implementation patch and attaching the same here.
>

These patches do not address comments discussed in [1]. Since there
was a change in design, I am assuming that those will be addressed
once the design change is accepted.

[1] 
https://www.postgresql.org/message-id/202503121157.3zabg6m3anwp@alvherre.pgsql


-- 
Best Wishes,
Ashutosh Bapat




Re: Adding skip scan (including MDAM style range skip scan) to nbtree

2025-03-27 Thread Peter Geoghegan
On Tue, Mar 18, 2025 at 1:01 PM Matthias van de Meent
 wrote:
> My comments on 0004:
>
> > _bt_skiparray_strat_decrement
> > _bt_skiparray_strat_increment
>
> In both functions the generated value isn't used when the in/decrement
> overflows (and thus invalidates the qual), or when the opclass somehow
> doesn't have a <= or >= operator, respectively.
> For byval types that's not much of an issue, but for by-ref types
> (such as uuid, or bigint on 32-bit systems) that's not great, as btree
> explicitly allows no leaks for the in/decrement functions, and now we
> use those functions and leak the values.

We don't leak any memory here. We temporarily switch over to using
so->arrayContext, for the duration of these calls. And so the memory
will be freed on rescan -- there's a MemoryContextReset call at the
top of _bt_preprocess_array_keys to take care of this, which is hit on
rescan.
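
To be concrete, the pattern in question is roughly the following fragment;
so->arrayContext comes from the existing code, the rest is an illustrative
sketch rather than the actual patch:

    /*
     * Sketch: build the temporary datum in so->arrayContext so nothing is
     * leaked into the caller's context; the array context is reset on
     * rescan, which releases this memory.
     */
    MemoryContext oldcontext = MemoryContextSwitchTo(so->arrayContext);

    /* ... construct the incremented/decremented value copy here ... */

    MemoryContextSwitchTo(oldcontext);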

> Additionally, the code is essentially duplicated between the
> functions, with as only differences which sksup function to call;
> which opstrategies to check, and where to retrieve/put the value. It's
> only 2 instances total, but if you figure out how to make a nice
> single function from the two that'd be appreciated, as it reduces
> duplication and chances for divergence.

I'll see if I can do something like that for the next revision, but I
think that it might be more awkward than it's worth.

The duplication isn't quite duplication, so much as it's two functions
that are mirror images of each other. The details/direction of things
is flipped in a number of places. The strategy number differs, even
though the same function is called.

-- 
Peter Geoghegan




Re: getting "shell command argument contains a newline or carriage return:" error with pg_dumpall when db name have new line in double quote

2025-03-27 Thread Mahendra Singh Thalor
On Thu, 27 Mar 2025 at 18:33, Álvaro Herrera 
wrote:
>
> On 2025-Mar-27, Andrew Dunstan wrote:
>
> > I don't think we can backpatch this. It's a behaviour change.
>
> I agree, we can't.
>
> Also, if we're going to enforce this rule, then pg_upgrade --check needs
> to alert users that they have a database name that's no longer valid.
> That needs to be part of this patch as well.  We should also ensure that
> these are the only problem characters, which IMO means it should add a
> test for pg_dumpall that creates a database whose name has all possible
> characters and ensures that it is dumpable.
>
> --
> Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
> "La virtud es el justo medio entre dos defectos" (Aristóteles)

Thanks Álvaro.

Yes, I also agree with you. "pg_upgrade --check" should alert users
regarding database names, because pg_upgrade fails if the old cluster has a
database name containing \n or \r.

pg_upgrade --check passes, but pg_upgrade itself fails:

Checking for new cluster tablespace directories   ok
Clusters are compatible

pg_upgrade:
...
Creating dump of global objects   ok
Creating dump of database schemas

shell command argument contains a newline or carriage return:
"dbname='invaliddb'"

As a part of this patch, we can teach pg_upgrade to alert users about
database names with invalid characters, or we can keep the old behavior,
since the upgrade is already failing without this patch anyway.

I will try to make a separate patch to teach "pg_upgrade --check" to
alert users about databases/users/roles with \n or \r in the old cluster.

Here, I am attaching an updated patch for review.

This patch has changes for CREATE DATABASE, CREATE ROLE, CREATE USER, and
RENAME DATABASE/USER/ROLE, and also includes some tests (except for a RENAME
test case).

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com


v03_don-t-allow-newline-or-carriage-return-in-db-user-role-names.patch
Description: Binary data


Re: psql \dh: List High-Level (Root) Tables and Indexes

2025-03-27 Thread Sadeq Dousti
> I think it's fine the way it is, with regards to v10 check. Can you post a
> rebased patch?
>

Hi Greg,
I just checked here: https://commitfest.postgresql.org/patch/5594/
Seems the patch is OK with the latest master, and no rebase is needed.

Do you mean that instead of a diff, I post a patch?

Best regards,
Sadeq


Re: Add pg_buffercache_evict_all() and pg_buffercache_mark_dirty[_all]() functions

2025-03-27 Thread Aidar Imamov

Hi!

I've been looking at the patches v3 and there are a few things I want to talk
about.

EvictUnpinnedBuffer():
I think we should change the paragraph with "flushed" of the comment to
something more like this: "If the buffer was dirty at the time of the call it
will be flushed by calling FlushBuffer() and if 'flushed' is not NULL
'*flushed' will be set to true."

I also think it'd be a good idea to add null-checks for "flushed" before
dereferencing it:

*flushed = false;
*flushed = true;
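
A rough sketch of the kind of guard meant here (illustrative only, not the
patch itself; buffer_was_dirty is just a placeholder name):

    /* Only touch the optional out-parameter when the caller provided one. */
    if (flushed)
        *flushed = false;

    if (buffer_was_dirty)
    {
        /* FlushBuffer() would happen here in the real function. */
        if (flushed)
            *flushed = true;
    }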


If the (!buf_state) clause is entered then we assume that the header is not
locked. Maybe edit the comment: "Lock the header if it is not already locked."
-> "Lock the header"?

 if (!buf_state)
 {
   /* Make sure we can pin the buffer. */
   ResourceOwnerEnlarge(CurrentResourceOwner);
   ReservePrivateRefCountEntry();

   /* Lock the header if it is not already locked. */
   buf_state = LockBufHdr(desc);
 }



EvictUnpinnedBuffersFromSharedRelation():
Maybe it would be more accurate to name the function EvictRelUnpinnedBuffers()?


I think the comment will seem more correct in the following manner:
"Try to evict all the shared buffers containing provided relation's pages.

This function is intended for testing/development use only!

Before calling this function, it is important to acquire AccessExclusiveLock on
the specified relation to avoid replacing the current block of this relation
with another one during execution.

If not null, buffers_evicted and buffers_flushed are set to the total number of
buffers evicted and flushed respectively."

I also think it'd be a good idea to add null-checks for "buffers_evicted" and
"buffers_flushed" before dereferencing them:

*buffers_evicted = *buffers_flushed = 0;


I think we don't need to check this clause again if AccessExclusiveLock was
acquired before the function call, do we?

if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator))
{
if (EvictUnpinnedBuffer(buf, buf_state, &flushed))
(*buffers_evicted)++;
if (flushed)
(*buffers_flushed)++;
}



MarkUnpinnedBufferDirty():
I think the comment will seem more correct in the following manner:
"Try to mark provided shared buffer as dirty.

This function is intended for testing/development use only!

Same as EvictUnpinnedBuffer() but with MarkBufferDirty() call inside.

Returns true if the buffer was already dirty or it has successfully been
marked as dirty."

And also I think the function should return true in the case where the buffer
was already dirty. What do you think?


pg_buffercache_evict_relation():
"pg_buffercache_evict_relation function is intended to" - 'be' is missing here.



pg_buffercache_mark_dirty():
Maybe edit the comment to: "Try to mark a shared buffer as dirty."?

Maybe edit the elog text to: "bad shared buffer ID" - just to clarify the case
when the provided buffer number is negative (a local buffer number).


pg_buffercache_mark_dirty_all():
Maybe also edit the comment to: "Try to mark all the shared buffers as dirty."?



bufmgr.h:
I think it might be a good idea to follow the Postgres formatting style and
move the function's arguments to the next line if they exceed 80 characters.


regards,
Aidar Imamov




Re: Improve monitoring of shared memory allocations

2025-03-27 Thread Tomas Vondra


On 3/27/25 13:56, Tomas Vondra wrote:
> ...
> 
> OK, I don't have any other comments for 0001 and 0002.  I'll do some
> more review and polishing on those, and will get them committed soon.
> 

Actually ... while polishing 0001 and 0002, I noticed a couple more
details that I'd like to ask about. I also ran pgindent and tweaked the
formatting, so some of the diff is caused by that.


1) alignment

There was a comment with a question whether we need to MAXALIGN the
chunks in dynahash.c, which were originally allocated by ShmemAlloc, but
now it's part of one large allocation, which is then cut into pieces
(using pointer arithmetics).

I was not sure whether we need to enforce some alignment, we briefly
discussed that off-list. I realize you chose to add the alignment, but I
haven't noticed any comment in the patch why it's needed, and it seems
to me it may not be quite correct.

Let me explain what I had in mind, and why I think v5 doesn't actually do
that. It took me a while before I understood what alignment is about, and
for a while it was haunting my patches, so hopefully this will help others ...

The "alignment" is about pointers (or addresses), and when a pointer is
aligned it means the address is a multiple of some number. For example
4B-aligned pointer is a multiple of 4B, so 0x0100 is 4B-aligned,
while 0x0101 is not. Sometimes we use data types to express the
alignment, e.g. int-aligned is 4B-aligned, but that's a detail. AFAIK
the alignment is always 2^k, so 1, 2, 4, 8, ...
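
To make that concrete, here is a tiny standalone illustration of the usual
round-up arithmetic behind such macros (the name and definition are a sketch
in the spirit of TYPEALIGN/MAXALIGN/CACHELINEALIGN, not copied from c.h):

    #include <stdint.h>
    #include <stdio.h>

    /* Round LEN up to the next multiple of ALIGNVAL (ALIGNVAL must be 2^k). */
    #define ALIGN_UP(ALIGNVAL, LEN) \
        (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

    int
    main(void)
    {
        /* 0x0100 is already 4B-aligned, 0x0101 is rounded up to 0x0104. */
        printf("%#lx\n", (unsigned long) ALIGN_UP(4, 0x0100));  /* 0x100 */
        printf("%#lx\n", (unsigned long) ALIGN_UP(4, 0x0101));  /* 0x104 */

        /* Cache-line alignment is the same idea with a larger power of two. */
        printf("%#lx\n", (unsigned long) ALIGN_UP(128, 200));   /* 0x100 */
        return 0;
    }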

The primary reason for alignment is that some architectures require the
pointers to be well-aligned for a given data type. For example (int*)
needs to be int-aligned. If you have a pointer that's not 4B-aligned,
it'll trigger SIGBUS or maybe SIGSEGV. This was true for architectures
like powerpc, I don't think x86/arm64 have this restriction, i.e. it'd
work, even if there might be a minor performance impact. Anyway, we
still enforce/expect correct alignment, because we may still support
some of those alignment-sensitive platforms, and it's "tidy".

The other reason is that we sometimes use alignment to add padding, to
reduce contention when accessing elements in hot arrays. We want to
align to cacheline boundaries, so that a struct does not require
accessing more cachelines than really necessary. And also to reduce
contention - the more cachelines, the higher the risk of contention.

Now, back to the patch. The code originally did this in ShmemInitStruct

hashp = ShmemInitStruct(...)

to allocate the hctl, and then

firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);

in element_alloc(). But this means the "elements" allocation is aligned
to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:

size = CACHELINEALIGN(size);

So it distributes memory in multiples of 128B, and I believe it starts
at a multiple of 128B.

But the patch reworks this to allocate everything at once, and thus it
won't get this alignment automatically. AFAIK that's not intentional,
because no one explicitly mentioned this. And it may not be quite
desirable, judging by the comment in ShmemAllocRaw().

I mentioned v5 adds alignment, but I think it does not do that
quite correctly. It adds alignment by changing the macros from:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+   (sizeof(HASHHDR) + \
+((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

to

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+   (MAXALIGN(sizeof(HASHHDR)) + \
+((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET

First, it uses MAXALIGN, but that's mostly my fault, because my comment
suggested that - the comment in ShmemAllocRaw(), however, makes the case for
using CACHELINEALIGN.

But more importantly, it adds alignment to all hctl fields, and to every
element of those arrays. But that's not what the alignment was supposed
to do - it was supposed to align arrays, not individual elements. Not
only would this waste memory, it would actually break direct access to
those array elements.

So something like this might be more correct:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+   (MAXALIGN(sizeof(HASHHDR)) + \
+MAXALIGN((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+MAXALIGN((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

But there's another detail - even before this patch, most of the stuff
was allocated at once by ShmemInitStruct(). Everything except for the
elements, so to replicate the alignment we only need to worry about that
last part. So I think this should do:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+CACHELINEALIGN(sizeof(HASHHDR) + \
+ ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+ ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

This is what the 0003 patch does. There's still one minor difference, in
that we used to align each segment independently - each eleme

Re: Add “FOR UPDATE NOWAIT” lock details to the log.

2025-03-27 Thread Fujii Masao




On 2025/03/21 15:21, Yuki Seino wrote:

Pushed the patch. Thanks!


Thank you. I'm very happy !!



Using the newly introduced mechanism, we can now easily extend
the log_lock_failure GUC to support additional NOWAIT lock failures,
such as LOCK TABLE ... NOWAIT, ALTER TABLE ... NOWAIT,
ALTER MATERIALIZED VIEW ... NOWAIT, and ALTER INDEX ... NOWAIT.

I've attached a patch implementing this.

That's a very good suggestion.

There is just one thing that bothers me.
ConditionalLockRelationOid() seems to be used by autovacuum as well.
Wouldn't this information be noise to the user?


Yes, logging every autovacuum lock failure would be too noisy.
I was thinking that log_lock_failure should apply only to LOCK TABLE ... NOWAIT,
but per the current code, restricting it that way seems difficult.

Anyway, expanding log_lock_failure further requires more discussion and design
work, which isn't feasible within this CommitFest. So, I'm withdrawing the
patch for now.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: [PATCH] PGSERVICEFILE as part of a normal connection string

2025-03-27 Thread Michael Paquier
On Thu, Mar 27, 2025 at 06:31:14PM +0900, Michael Paquier wrote:
> I am not sure that I'll have the time to look at 0002 for this release
> cycle, could it be possible to get a rebase for it?

Here is a simple rebase that I have been able to assemble this
morning.  I won't have the space to review it for this release cycle
unfortunately, but at least it works in the CI.

I am moving this patch entry to the next CF for v19, as a result of
that.
--
Michael
From 83818caa5f5d5847ca99210446cc37fa8f8a1caf Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 28 Mar 2025 08:26:01 +0900
Subject: [PATCH v4] add servicefile connection option feature

---
 src/interfaces/libpq/fe-connect.c | 27 +++--
 src/interfaces/libpq/t/006_service.pl | 58 +++
 doc/src/sgml/libpq.sgml   | 16 +++-
 3 files changed, 97 insertions(+), 4 deletions(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index d5051f5e820f..f16468be7d14 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -195,6 +195,9 @@ static const internalPQconninfoOption PQconninfoOptions[] = {
 		"Database-Service", "", 20,
 	offsetof(struct pg_conn, pgservice)},
 
+	{"servicefile", "PGSERVICEFILE", NULL, NULL,
+	"Database-Service-File", "", 64, -1},
+
 	{"user", "PGUSER", NULL, NULL,
 		"Database-User", "", 20,
 	offsetof(struct pg_conn, pguser)},
@@ -5838,6 +5841,7 @@ static int
 parseServiceInfo(PQconninfoOption *options, PQExpBuffer errorMessage)
 {
 	const char *service = conninfo_getval(options, "service");
+	const char *service_fname = conninfo_getval(options, "servicefile");
 	char		serviceFile[MAXPGPATH];
 	char	   *env;
 	bool		group_found = false;
@@ -5857,11 +5861,18 @@ parseServiceInfo(PQconninfoOption *options, PQExpBuffer errorMessage)
 		return 0;
 
 	/*
-	 * Try PGSERVICEFILE if specified, else try ~/.pg_service.conf (if that
-	 * exists).
+	 * First, check servicefile option on connection string. Second, check
+	 * PGSERVICEFILE environment variable. Finally, check ~/.pg_service.conf
+	 * (if that exists).
 	 */
-	if ((env = getenv("PGSERVICEFILE")) != NULL)
+	if (service_fname != NULL)
+	{
+		strlcpy(serviceFile, service_fname, sizeof(serviceFile));
+	}
+	else if ((env = getenv("PGSERVICEFILE")) != NULL)
+	{
 		strlcpy(serviceFile, env, sizeof(serviceFile));
+	}
 	else
 	{
 		char		homedir[MAXPGPATH];
@@ -6023,6 +6034,16 @@ parseServiceFile(const char *serviceFile,
 	goto exit;
 }
 
+if (strcmp(key, "servicefile") == 0)
+{
+	libpq_append_error(errorMessage,
+	   "nested servicefile specifications not supported in service file \"%s\", line %d",
+	   serviceFile,
+	   linenr);
+	result = 3;
+	goto exit;
+}
+
 /*
  * Set the parameter --- but don't override any previous
  * explicit setting.
diff --git a/src/interfaces/libpq/t/006_service.pl b/src/interfaces/libpq/t/006_service.pl
index d3ecfa6b6fc8..752973d96534 100644
--- a/src/interfaces/libpq/t/006_service.pl
+++ b/src/interfaces/libpq/t/006_service.pl
@@ -136,6 +136,64 @@ local $ENV{PGSERVICEFILE} = "$srvfile_empty";
 	unlink($srvfile_default);
 }
 
+# Backslashes escaped path string for getting collect result at concatenation
+# for Windows environment
+my $srvfile_win_cared = $srvfile_valid;
+$srvfile_win_cared =~ s/\\//g;
+
+# Check that servicefile option works as expected
+{
+	$node->connect_ok(
+		q{service=my_srv servicefile='} . $srvfile_win_cared . q{'},
+		'service=my_srv servicefile=...',
+		sql => "SELECT 'connect4'",
+		expected_stdout => qr/connect4/);
+
+	# Encode slashes and backslash
+	my $encoded_srvfile = $srvfile_valid =~ s{([\\/])}{
+$1 eq '/' ? '%2F' : '%5C'
+}ger;
+
+	# Additionaly encode a colon in servicefile path of Windows
+	$encoded_srvfile =~ s/:/%3A/g;
+
+	$node->connect_ok(
+		'postgresql:///?service=my_srv&servicefile=' . $encoded_srvfile,
+		'postgresql:///?service=my_srv&servicefile=...',
+		sql => "SELECT 'connect5'",
+		expected_stdout => qr/connect5/);
+
+	local $ENV{PGSERVICE} = 'my_srv';
+	$node->connect_ok(
+		q{servicefile='} . $srvfile_win_cared . q{'},
+		'envvar: PGSERVICE=my_srv + servicefile=...',
+		sql => "SELECT 'connect6'",
+		expected_stdout => qr/connect6/);
+
+	$node->connect_ok(
+		'postgresql://?servicefile=' . $encoded_srvfile,
+		'envvar: PGSERVICE=my_srv + postgresql://?servicefile=...',
+		sql => "SELECT 'connect7'",
+		expected_stdout => qr/connect7/);
+}
+
+# Check that servicefile option takes precedence over PGSERVICEFILE environment variable
+{
+	local $ENV{PGSERVICEFILE} = 'non-existent-file.conf';
+
+	$node->connect_fails(
+		'service=my_srv',
+		'service=... fails with wrong PGSERVICEFILE',
+		expected_stderr =>
+		  qr/service file "non-existent-file\.conf" not found/);
+
+	$node->connect_ok(
+		q{service=my_srv servicefile='} . $srvfile_win_cared . q{'},
+		'servicefile= takes precedence over PGSERVICEFILE',
+		sql => "

Re: Remove restrictions in recursive query

2025-03-27 Thread Renan Alves Fonseca
On Thu, Mar 27, 2025 at 10:50 PM Tom Lane  wrote:
>
> Renan Alves Fonseca  writes:
> > The solution using GROUP BY in the recursive query is the following:
>
> > with recursive t1(node,nb_path) as
> >  (select 1,1::numeric
> >union all
> >   (select dag.target, sum(nb_path)
> >   from t1 join dag on t1.node=dag.source
> >   group by 1)
> >  ) select sum(nb_path) from t1 join sink_nodes using (node) ;
>
> This is not about the GROUP BY; it's about the SUM().

Cannot SUM() without GROUP BY, right? (I mean the SUM() in the recursive query)

> If you try this example you get
>
> regression=# with recursive t1(node,nb_path) as
>  (select 1,1::numeric
>union all
>   (select dag.target, sum(nb_path)
>   from t1 join dag on t1.node=dag.source
>   group by 1)
>  ) select sum(nb_path) from t1 join sink_nodes using (node) ;
> ERROR:  aggregate functions are not allowed in a recursive query's recursive 
> term
> LINE 4:   (select dag.target, sum(nb_path)
>   ^

Sorry, I forgot to mention that this query only works in my local hack
(I simply removed the check that raises this error message)

> The code says that that restriction is from the SQL spec, and
> it seems to be correct as of SQL:2021.  7.17 
> SR 3)j)ix)5)C) says
>
> C) WQEi shall not contain a  QS such that QS
>immediately contains a  TE that contains a
> referencing WQNX and either of the following is true:
>
>   I) TE immediately contains a  that contains a
>  .
>
>   II) QS immediately contains a  SL that contains
>   either a , or aspecification>, or both.
>
> ( is spec-ese for "aggregate function
> call"

I suspected that this restriction came straight from the specs. I
understand that while the proposed solution can help in some specific
use cases, it is not enough to justify an exception to the spec.

> I don't know the SQL committee's precise reasoning for this
> restriction, but I suspect it's because the recursive query
> expansion is not well-defined in the presence of an aggregate.
> The spec has an interesting comment at the bottom of sub-rule ix:
>
> NOTE 310 — The restrictions insure that each WLEi, viewed as a
> transformation of the query names of the stratum, is monotonically
> increasing. According to Tarski’s fixed point theorem, this
> insures that there is a fixed point. The General Rules use
> Kleene’s fixed point theorem to define a sequence that converges
> to the minimal fixed point.
>
> regards, tom lane




Re: Non-text mode for pg_dumpall

2025-03-27 Thread Andrew Dunstan


On 2025-03-19 We 2:41 AM, Mahendra Singh Thalor wrote:

On Wed, 12 Mar 2025 at 21:18, Andrew Dunstan  wrote:


On 2025-03-12 We 3:03 AM, jian he wrote:

On Wed, Mar 12, 2025 at 1:06 AM Álvaro Herrera  wrote:

Hello,

On 2025-Mar-11, Mahendra Singh Thalor wrote:


In the map.dat file, I tried to fix this issue by adding the number of
characters in the dbname, but as per the code comments, as of now we are not
supporting \n\r in dbnames, so I removed that handling.
I will do some more study to fix this issue.

Yeah, I think this is saying that you should not consider the contents
of map.dat as a shell string.  After all, you're not going to _execute_
that file via the shell.

Maybe for map.dat you need to escape such characters somehow, so that
they don't appear as literal newlines/carriage returns.


I am confused.
currently pg_dumpall plain format will abort when encountering dbname
containing newline.
the left dumped plain file does not contain all the cluster databases data.


if pg_dumpall non-text format aborts earlier,
it's aligned with pg_dumpall plain format?
it's also an improvement since aborts earlier, nothing will be dumped?


am i missing something?



I think we should fix that.

But for the current proposal, Álvaro and I were talking this morning,
and we thought the simplest thing here would be to have the one line
format and escape NL/CRs in the database name.


cheers


Okay. As per discussions, we will keep one line entry for each
database into map.file.

Thanks all for feedback and review.

Here, I am attaching updated patches for review and testing. These
patches can be applied on commit a6524105d20b.




I'm working through this patch set with a view to committing it.
Attached is some cleanup which is where I got to today, although there
is more to do. One thing I am wondering is why not put the
SimpleDatabaseOidList stuff in fe_utils/simple_list.{c,h}? That's where
all the similar stuff belongs, and it feels strange to have this inline
in pg_restore.c. (I also don't like the name much - SimpleOidStringList
or maybe SimpleOidPlusStringList might be better.)



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index a3dcc585ace..6aab1bfe831 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -434,13 +434,13 @@ main(int argc, char *argv[])
archDumpFormat = parseDumpFormat(formatName);
 
/*
-* If non-plain format is specified then we must provide the
-* file name to create one main directory.
+* If a non-plain format is specified, a file name is also required as 
the
+* path to the main directory.
 */
if (archDumpFormat != archNull &&
(!filename || strcmp(filename, "") == 0))
{
-   pg_log_error("options -F/--format=d|c|t requires option 
-f/--file with non-empty string");
+   pg_log_error("option -F/--format=d|c|t requires option 
-f/--file");
pg_log_error_hint("Try \"%s --help\" for more information.", 
progname);
exit_nicely(1);
}
@@ -513,14 +513,14 @@ main(int argc, char *argv[])
 */
if (archDumpFormat != archNull)
{
-   chartoc_path[MAXPGPATH];
+   charglobal_path[MAXPGPATH];
 
/* Create new directory or accept the empty existing directory. 
*/
create_or_open_dir(filename);
 
-   snprintf(toc_path, MAXPGPATH, "%s/global.dat", filename);
+   snprintf(global_path, MAXPGPATH, "%s/global.dat", filename);
 
-   OPF = fopen(toc_path, PG_BINARY_W);
+   OPF = fopen(global_path, PG_BINARY_W);
if (!OPF)
pg_fatal("could not open global.dat file: %s", 
strerror(errno));
}
@@ -1680,7 +1680,7 @@ dumpDatabases(PGconn *conn, ArchiveFormat archDumpFormat)
}
 
/*
-* If this is non-plain dump format, then append dboid and 
dbname to
+* If this is not a plain format dump, then append dboid and 
dbname to
 * the map.dat file.
 */
if (archDumpFormat != archNull)
@@ -1688,7 +1688,7 @@ dumpDatabases(PGconn *conn, ArchiveFormat archDumpFormat)
snprintf(dbfilepath, MAXPGPATH, "\"%s\"/\"%s\"", 
db_subdir, oid);
 
/* Put one line entry for dboid and dbname in map file. 
*/
-   fprintf(map_file, "%s %s\n", oid, pg_strdup(dbname));
+   fprintf(map_file, "%s %s\n", oid, dbname);
}
 
pg_log_info("dumping database \"%s\"", dbname);
@@ -1734,17 +1734,17 @@ dumpDatabases(PGconn *conn, ArchiveFormat 
archDumpFormat)
 
if (filename)
{
-   chartoc_path[MAXPGPATH];
+   charglo

Re: Remove restrictions in recursive query

2025-03-27 Thread Tom Lane
Renan Alves Fonseca  writes:
> I'll assume that the silence about allowing GROUP BY means it is not a
> great idea...

Well, you can do grouping, ordering, or whatever else you want to the
result of the recursive WITH in the outer query level.  I don't see
any advantage in allowing an additional level of that inside the WITH.

regards, tom lane




Re: Remove restrictions in recursive query

2025-03-27 Thread Renan Alves Fonseca
On Thu, Mar 27, 2025 at 7:32 PM Robert Haas  wrote:
>
> > I'll assume that the silence about allowing GROUP BY means it is not a
> > great idea...
>
> I don't think there's really anything to keep you from doing this --
> just put the grouping operation where you refer to the recursive CTE,
> instead of inside the recursive CTE itself. I think allowing it to
> appear inside the recursive CTE would be rather confusing -- it's
> probably best if the mandatory UNION operation is at the top level.
>

I don't think I was clear enough about the proposed improvements. I
hope this example will clarify.

Consider a table representing the edges of a directed acyclic graph.
The following lines will create an example of such a table.

-- create DAG example
\set output_degree 10
\set maxN 2

create temp table dag(source,target) as
   (with recursive t1(source,target) as
 (select 1,1
   union
 select target, target * i from t1, generate_series(2,:output_degree +1) i
 where target*i< :maxN
 ) select * from t1 where source!=target );

Then, suppose we want to count the number of distinct paths from the
root (1) to all sink nodes. We use the following auxiliary table for
the sink nodes:

create temp table sink_nodes(node) as
   (select target from dag except select source from dag);

The solution supported in PostgreSQL is the following:

with recursive t1(node,path) as
 (select 1,array[1]
   union
 select dag.target, t1.path||dag.target
 from t1 join dag on t1.node=dag.source
) select count(*) from t1 join sink_nodes using (node) ;

This query enumerates every path and probably has something like
exponential complexity in the number of edges. It gives these results
on my laptop:

  count
---------
 1114163
(1 row)
Time: 8121.044 ms (00:08.121)

The solution using GROUP BY in the recursive query is the following:

with recursive t1(node,nb_path) as
 (select 1,1::numeric
   union all
  (select dag.target, sum(nb_path)
  from t1 join dag on t1.node=dag.source
  group by 1)
 ) select sum(nb_path) from t1 join sink_nodes using (node) ;

At every iteration step, we sum the number of paths that arrived at a
given node. That is another class of algorithm complexity. The first
query cannot scale to larger inputs, while this one can scale. Here is
the result:

   sum
---------
 1114163
(1 row)
Time: 27.123 ms

I hope this example made clear that allowing the GROUP BY in the
recursive clause significantly increases the capabilities of doing
analytics on graph data, and it is not only syntax sugar. Please let
me know if there is another way of handling this problem efficiently
without the proposed change.

Thanks again for your attention,
Renan




Re: AIO v2.5

2025-03-27 Thread Andres Freund
Hi,

On 2025-03-26 21:07:40 -0400, Andres Freund wrote:
> TODO
> ...
> - Add an explicit test for the checksum verification in the completion 
> callback
>
>   There is an existing test for testing an invalid page due to page header
>   verification in test_aio, but not for checksum failures.
>
>   I think it's indirectly covered (e.g. in amcheck), but seems better to test
>   it explicitly.

Ah, for crying out loud.  As it turns out, no, we do not have *ANY* tests for
this for the server-side. Not a single one. I'm somewhat apoplectic,
data_checksums is a really complicated feature, which we just started *turning
on by default*, without a single test of the failure behaviour, when detecting
failures is the one thing the feature is supposed to do.


I now wrote some tests. And I both regret doing so (because it found problems,
which would have been apparent long ago, if the feature had come with *any*
tests, if I had gone the same way I could have just pushed stuff) and am glad
I did (because I dislike pushing broken stuff).

I have to admit, I was tempted to just ignore this issue and just not say
anything about tests for checksum failures anymore.


Problems:

1) PageIsVerifiedExtended() emits a WARNING, just like with ZERO_ON_ERROR, we
   don't want to emit it in a) io workers b) another backend if it completes
   the error.

   This isn't hard to address, we can add PIV_LOG_LOG (or something like that)
   to emit it at a different log level and an out-parameter to trigger sending
   a warning / adjust the warning/error message we already emit once the
   issuer completes the IO.


2) With IO workers (and "foreign completors", in rare cases), the checksum
   failures would be attributed wrongly, as it reports all stats to
   MyDatabaseId

   As it turns out, this is already borked on master for shared relations,
   since pg_stat_database.checksum_failures has existed, see [1].

   This isn't too hard to fix, if we adjust the signature to
   PageIsVerifiedExtended() to pass in the database oid. But see also 3)


3) We can't pgstat_report_checksum_failure() during the completion callback,
   as it *sometimes* allocates memory

   Aside from the allocation-in-critical-section asserts, I think this is
   *extremely* unlikely to actually cause a problem in practice. But we don't
   want to rely on that, obviously.



Addressing the first two is pretty simple and/or needs to be done anyway,
since it's a currently existing bug, as discussed in [1].


Addressing 3) is not at all trivial.  Here's what I've thought of so far:


Approach I)

My first thoughts were around trying to make the relevant pgstat
infrastructure either not need to allocate memory, or handle memory
allocation failures gracefully.

Unfortunately that seems not really viable:

The most successful approach I tried was to report stats directly to the
dshash table, and only report stats if there's already an entry (which
there just about always will, except for a very short period after stats
have been reset).

Unfortunately that fails because to access the shared memory with the stats
data we need to do dsa_get_address(), which can fail if the relevant dsm
segment wasn't already mapped in the current process (it allocates memory
in the process of mapping in the segment). There's no API to do that
without erroring out.

That aspect rules out a number of other approaches that sounded like they
could work - we e.g. could increase the refcount of the relevant pgstat
entry before issuing IO, ensuring that it's around by the time we need to
report. But that wouldn't get around the issue of needing to map in the dsm
segment.


Approach II)

Don't report the error in the completion callback.  The obvious place would be
to do it where we'll raise the warning/error in the issuing process.  The
big disadvantage is that that could lead to under-counting checksum
errors:

a) A read stream does 2+ concurrent reads for the same relation, and more than
one encounters checksum errors. When processing the results for the first
failed read, we raise an error and thus won't process the results of the
second+ reads with errors.

b) A read is started asynchronously, but before the backend gets around to
processing the result of the IO, it errors out during that other work
(possibly due to a cancellation). Because the backend never looked at the
results of the IOs, the checksum errors don't get accounted for.

b) doesn't overly bother me, but a) seems problematic.


Approach III)

Accumulate checksum errors in two backend local variables (one for database
specific errors, one for errors on shared relations), which will be flushed by
the backend that issued IO during the next pgstat_report_start().

Two disadvantages:

- Accumulation of errors will be delayed until the next
  pgstat_report_start(). That seems acceptable; after all, we already do so
  for a lot of other stats.

- We need to register a local callback for shared buffer reads, which don't
  need them today
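
For what it's worth, a minimal sketch of what such backend-local accumulation
could look like (every name here is illustrative, not from the patch; the
flush would hand the counters to the cumulative stats system from a safe
context such as the next stats report):

    #include <stdbool.h>
    #include <stdint.h>

    /* Backend-local pending counters, flushed outside the completion callback. */
    static int64_t pending_checksum_failures_db = 0;
    static int64_t pending_checksum_failures_shared = 0;

    /* Safe to call from a completion callback: no allocation, no locks. */
    static void
    note_checksum_failure(bool on_shared_rel)
    {
        if (on_shared_rel)
            pending_checksum_failures_shared++;
        else
            pending_checksum_failures_db++;
    }

    /* Called later from a safe context, e.g. around the next stats report. */
    static void
    flush_pending_checksum_failures(void)
    {
        if (pending_checksum_failures_db == 0 &&
            pending_checksum_failures_shared == 0)
            return;

        /* ... hand both counters to the cumulative stats system here ... */

        pending_checksum_failures_db = 0;
        pending_checksum_failures_shared = 0;
    }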

Re: psql \dh: List High-Level (Root) Tables and Indexes

2025-03-27 Thread Sadeq Dousti
> I think this is the wrong way round.
> It should be \dtN instead of \dNt.

Hi Christoph,
The order does not matter, the user can use \dNt or \dtN, as they do
exactly the same thing. Letters coming after \d can be freely permuted. If
you mean a change to the documentation or tests, I can apply whatever order
makes more sense, as I don't have any particular preference.

> Is the form \dN actually useful?
It's a shortcut for \dtiN. Similarly, \d is a shortcut for \dtvmsE. I'd
like to keep a default because otherwise we would have to raise an error if
the user types \dN without further qualifying it, and combining N with other
letters (s, m, E, v, ...) while not including t & i would also result in an
error.

> For partitioned tables, you'd only get the empty root indexes.
Your statement is correct, but bear with me to give the full version.
With \dN, you get all the tables and indexes that are not partitions.
* root tables
* non-partitioned tables
* indexes on root tables
* indexes on non-partitioned tables
For me, this is the top-level table and index structure in a database. It
can be further restricted by adding t (\dNt or \dtN) to limit it to tables,
and adding i (\dNi or \diN) to limit it to indexes.

Of course, other combinations like \dNvs are also supported.

Best regards,
Sadeq


Re: POC, WIP: OR-clause support for indexes

2025-03-27 Thread Alexander Korotkov
Hi!

On Mon, Mar 24, 2025 at 2:46 PM Alena Rybakina
 wrote:
> I agree with Andrey's changes and think we should fix this, because otherwise
> it might be inconvenient.
> For example, without these changes we will have to have different test output
> files for the same query for different versions of Postgres in extensions if
> the whole change is only related to the order of column output for a
> transformation that was not applied.

I agree with the problem spotted by Andrei: it should be preferred to
preserve the original order of clauses as much as possible.  The approach
implemented in Andrei's patch seems fragile to me.  The original order is
preserved if we didn't find any group.  But once we find a single
group, the original order might be destroyed completely.

The attached patch changes the reordering algorithm of
group_similar_or_args() in the following way.  We reorder each group
of similar clauses so that the first item of the group stays in place,
but all the other items are moved after it.  So, if there are no
similar clauses, the order of clauses stays the same.  When there are
some groups, only required reordering happens while the rest of the
clauses remain in their places.
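
As a rough illustration of that kind of regrouping, independent of the planner
data structures, a generic sketch might look like the following; the
array-of-group-ids representation and all names are purely assumptions for the
example, not code from the patch:

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Produce an output order where, the first time a group is encountered,
     * its remaining members are pulled in right after it, while everything
     * else keeps its original relative position.
     */
    static size_t
    regroup_stable(const int *group_of, size_t n, size_t *out_order)
    {
        bool    emitted[64] = {false};  /* assumes group ids < 64, sketch only */
        size_t  k = 0;

        for (size_t i = 0; i < n; i++)
        {
            int g = group_of[i];

            if (emitted[g])
                continue;           /* already pulled in after its first member */
            emitted[g] = true;

            out_order[k++] = i;     /* first member of the group stays in place */
            for (size_t j = i + 1; j < n; j++)
                if (group_of[j] == g)
                    out_order[k++] = j; /* later members follow immediately */
        }
        return k;
    }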

--
Regards,
Alexander Korotkov
Supabase


v1-0001-Make-group_similar_or_args-reorder-clause-list-as.patch
Description: Binary data


Re: psql \dh: List High-Level (Root) Tables and Indexes

2025-03-27 Thread Sadeq Dousti
Sorry Greg,

Just understood what you mean. Please find attached the v6 of the patch.

Best regards,
Sadeq


v6-0001-psql-acommand-for-non-partitioned-tables-indexes.patch
Description: Binary data


Re: Better HINT message for "unexpected data beyond EOF"

2025-03-27 Thread Christoph Berg
Re: Robert Haas
> I think that would be better than what we have now, but I still wonder
> if we should give some kind of a hint that an external process may be
> doing something to that file. Jakub and I may be biased by having just
> seen a case of exactly that in the field, but I wonder now how many
> 'data beyond EOF' messages are exactly that -- and it's not like the
> user is going to guess that 'data beyond EOF' might mean that such a
> thing occurred.

HINT:  Did anything besides PostgreSQL touch that file?

Christoph




libpq maligning postgres stability

2025-03-27 Thread Andres Freund
Hi,

We have several places in libpq where libpq says that a connection closing is
probably due to a server crash with a message like:

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.


I think this is rather unhelpful, at least these days. There are a lot of
reasons the connection could have failed, the server having terminated
abnormally is just one of them.

It's common to see this due to network issues, for example.  I've quite a few
times fielded worried questions from postgres users due to the message.

The reason I was looking at this message just now was a discussion of CI
failures on windows [1], which were likely caused by the known issue of
windows occasionally swallowing the server's last messages before the backend
exits (more detail e.g. in [2]).  It's easy to think that the failure was
wrongly caused by a postgres crash, due to the message, rather than due to not
receiving the expected FATAL.


And we don't even just add this message when the connection was actually
closed unexpectedly, we often do it even when we *did* get a FATAL, as in this
example:

psql -c 'select pg_terminate_backend(pg_backend_pid())'
FATAL:  57P01: terminating connection due to administrator command
LOCATION:  ProcessInterrupts, postgres.c:3351
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost


I think this one is mostly a weakness in how libpq tracks connection state,
but it kind of shows the silliness of claiming postgres probably crashed.


Greetings,

Andres Freund


[1] Via Bilal:

4 of the failures on the front page are related to Windows:
https://cirrus-ci.com/build/4878370632105984
https://cirrus-ci.com/build/5063665856020480
https://cirrus-ci.com/build/4636858312818688
https://cirrus-ci.com/build/6385762419081216

[2] 
https://postgr.es/m/CA%2BhUKGLR10ZqRCvdoRrkQusq75wF5%3DvEetRSs2_u1s%2BFAUosFQ%40mail.gmail.com




Re: pg_stat_database.checksum_failures vs shared relations

2025-03-27 Thread Robert Haas
On Thu, Mar 27, 2025 at 11:58 AM Andres Freund  wrote:
> So, today we have the weird situation that *some* checksum errors on shared
> relations get attributed to the current database (if they happen in a backend
> normally accessing a shared relation), whereas others get reported to the
> "shared relations" "database" (if they happen during a base backup).  That
> seems ... not optimal.
>
> One question is whether we consider this a bug that should be backpatched.

I think it would be defensible if pg_basebackup reported all errors
with OID 0 and backend connections reported all errors with OID
MyDatabaseId, but it seems hard to justify having pg_basebackup take
care to report things using the correct database OID and individual
backend connections not take care to do the same thing. So I think
this is a bug. If fixing it in the back-branches is too annoying, I
think it would be reasonable to fix it only in master, but
back-patching seems OK too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add Postgres module info

2025-03-27 Thread Tom Lane
Yurii Rashkovskii  writes:
> Would something like this work?

Works for me; pushed.

regards, tom lane




Re: psql \dh: List High-Level (Root) Tables and Indexes

2025-03-27 Thread Greg Sabino Mullane
I think it's fine the way it is, with regards to v10 check. Can you post a
rebased patch?

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


Re: psql \dh: List High-Level (Root) Tables and Indexes

2025-03-27 Thread Christoph Berg
Re: Sadeq Dousti
> Thanks Greg and others for the feedback!
> 
> Please find attached the patch for implementing \dN (including \dNt, \dNi,
> \dNit).

I don't care particularly about the choice of letter, but I think this
is the wrong way round. It should be \dtN instead of \dNt.

Is the form \dN actually useful? For partitioned tables, you'd only
get the empty root indexes. If you actually wanted that, there's still
\dtiN.

Christoph




pg_stat_database.checksum_failures vs shared relations

2025-03-27 Thread Andres Freund
Hi,

First - I find it rather shocking that we have absolutely *zero* tests of
checksum failures in the backend. Zero. As evidenced by [1].  I really can't
quite believe it.  Nor do we have tests of ignore_checksum_failure or
zero_damaged_pages.


I was trying to write some tests for checksums vs AIO, and one thing I noticed
is that our docs seem to rather strongly hint that checksum failures on shared
objects would be reported to the dbname IS NULL row in pg_stat_database:

   The pg_stat_database view will contain one row
   for each database in the cluster, plus one for shared objects, showing
   database-wide statistics.

   Name of this database, or NULL for shared
   objects.

   Number of data page checksum failures detected in this
   database (or on a shared object), or NULL if data checksums are
   disabled.

But that is not the case, they always get reported to MyDatabaseId by
PageIsVerifiedExtended().

The docs hint more clearly at checksum failures of shared rels reported this
way starting with:

commit 252b707bc41
Author: Magnus Hagander 
Date:   2019-04-17 13:51:48 +0200

Return NULL for checksum failures if checksums are not enabled

which made changes like:
  Number of data page checksum failures detected in this
- database
+  database (or on a shared object), or NULL if data checksums are not
+  enabled.
 


The stats tracking of checksum failures was introduced in

Author: Magnus Hagander 
Date:   2019-03-09 10:45:17 -0800

Track block level checksum failures in pg_stat_database

which already did report stats on shared objects that way:

@@ -151,6 +152,8 @@ PageIsVerified(Page page, BlockNumber blkno)
  errmsg("page verification failed, calculated checksum %u but 
expected %u",
 checksum, p->pd_checksum)));

+pgstat_report_checksum_failure();
+
 if (header_sane && ignore_checksum_failure)
 return true;
 }

In basebackup.c however, it didn't track stats on shared relations at that time:

@@ -1580,6 +1583,9 @@ sendFile(const char *readfilename, const char 
*tarfilename, struct stat *statbuf
 ereport(WARNING,
 (errmsg("file \"%s\" has a total of %d checksum verification "
 "failures", readfilename, checksum_failures)));
+
+if (dboid != InvalidOid)
+pgstat_report_checksum_failures_in_db(dboid, checksum_failures);
 }
 total_checksum_failures += checksum_failures;


that was changed in:

commit 77bd49adba4
Author: Magnus Hagander 
Date:   2019-04-12 14:04:50 +0200

Show shared object statistics in pg_stat_database

This adds a row to the pg_stat_database view with datoid 0 and datname
NULL for those objects that are not in a database. This was added
particularly for checksums, but we were already tracking more statistics
for these objects, just not returning it.
...

which made the call to pgstat_report_checksum_failures_in_db()
in basebackup.c unconditional. It did leave this comment though:

 * If dboid is anything other than InvalidOid then any checksum failures
 * detected will get reported to the cumulative stats system.


I think the above commit makes it pretty clear that the intent is for checksum
errors on shared database entries to be reported to the "shared" entry in
pg_stat_database.


So, today we have the weird situation that *some* checksum errors on shared
relations get attributed to the current database (if they happen in a backend
normally accessing a shared relation), whereas others get reported to the
"shared relations" "database" (if they happen during a base backup).  That
seems ... not optimal.


One question is whether we consider this a bug that should be backpatched.

To fix it we'd need to provide a bit more information to
PageIsVerifiedExtended(), it currently doesn't know what database the page it
is verifying is in and therefore can't report an error with
pgstat_report_checksum_failures_in_db() (rather than
pgstat_report_checksum_failure(), which attributes to MyDatabaseId).

Obviously having to change the signature of PageIsVerifiedExtended() makes it
harder to fix in the backbranches.

Greetings,

Andres Freund

[1] 
https://coverage.postgresql.org/src/backend/storage/page/bufpage.c.gcov.html#136




Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-03-27 Thread Alvaro Herrera
On 2025-Mar-24, jian he wrote:

> hi.
> you may like the attached. it's based on your idea: attnotnullvalid.

This is quite close to what I was thinking, yeah.  I noticed a couple of
bugs however, and ended up cleaning up the whole thing.  Here's what I
have so far.  I'm not sure the pg_dump bits are okay (apart from what
you report below) -- I think it's losing the constraint names, which is
of course unacceptable.

I think the warnings about creating constraints as valid when
originating as invalid are unnecessary at this point.  We should add
those, or not, for all constraint types, not just not-null.  That's IMO
a separate discussion.

> I came across a case, not sure if it's a bug.
> CREATE TABLE ttchk (a INTEGER);
> ALTER TABLE ttchk ADD CONSTRAINT cc check (a is NOT NULL) NOT VALID;
> CREATE TABLE ttchk_child(a INTEGER) INHERITS(ttchk);
> ttchk_child's constraint cc will default to valid,
> but pg_dump && pg_restore will make ttchk_child's constraint invalid.
> since it's an existing behavior, so not-null constraint will align with it.

Hmm, yes, such pg_dump behavior would be incorrect.  I'll give the
pg_dump code another look tomorrow.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index ac14c06c715..e9296e2d4bd 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -5549,7 +5549,15 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 			   "SELECT relname, "
 			   "  attname, "
 			   "  format_type(atttypid, atttypmod), "
-			   "  attnotnull, "
+			   "  attnotnull, ");
+
+		/* NOT VALID NOT NULL columns are supported since Postgres 18 */
+		if (PQserverVersion(conn) >= 18)
+			appendStringInfoString(&buf, "attnotnullvalid, ");
+		else
+			appendStringInfoString(&buf, "attnotnull AS attnotnullvalid, ");
+
+		appendStringInfoString(&buf,
 			   "  pg_get_expr(adbin, adrelid), ");
 
 		/* Generated columns are supported since Postgres 12 */
@@ -5651,6 +5659,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 char	   *attname;
 char	   *typename;
 char	   *attnotnull;
+char	   *attnotnullvalid;
 char	   *attgenerated;
 char	   *attdefault;
 char	   *collname;
@@ -5663,14 +5672,15 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 attname = PQgetvalue(res, i, 1);
 typename = PQgetvalue(res, i, 2);
 attnotnull = PQgetvalue(res, i, 3);
-attdefault = PQgetisnull(res, i, 4) ? NULL :
-	PQgetvalue(res, i, 4);
-attgenerated = PQgetisnull(res, i, 5) ? NULL :
+attnotnullvalid = PQgetvalue(res, i, 4);
+attdefault = PQgetisnull(res, i, 5) ? NULL :
 	PQgetvalue(res, i, 5);
-collname = PQgetisnull(res, i, 6) ? NULL :
+attgenerated = PQgetisnull(res, i, 6) ? NULL :
 	PQgetvalue(res, i, 6);
-collnamespace = PQgetisnull(res, i, 7) ? NULL :
+collname = PQgetisnull(res, i, 7) ? NULL :
 	PQgetvalue(res, i, 7);
+collnamespace = PQgetisnull(res, i, 8) ? NULL :
+	PQgetvalue(res, i, 8);
 
 if (first_item)
 	first_item = false;
@@ -5714,7 +5724,11 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 
 /* Add NOT NULL if needed */
 if (import_not_null && attnotnull[0] == 't')
+{
 	appendStringInfoString(&buf, " NOT NULL");
+	if (attnotnullvalid[0] == 'f')
+		appendStringInfoString(&buf, " NOT VALID");
+}
 			}
 			while (++i < numrows &&
    strcmp(PQgetvalue(res, i, 0), tablename) == 0);
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index fb050635551..f33f1ea1b57 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1260,7 +1260,16 @@
attnotnull bool
   
   
-   This column has a not-null constraint.
+   This column has a (possibly unvalidated) not-null constraint.
+  
+ 
+
+ 
+  
+   attnotnullvalid bool
+  
+  
+   Whether the not-null constraint, if one exists, has been validated.
   
  
 
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 11d1bc7dbe1..344387cd99d 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -243,6 +243,8 @@ WITH ( MODULUS numeric_literal, REM
   entire table; however, if a valid CHECK constraint is
   found which proves no NULL can exist, then the
   table scan is skipped.
+  If a column has a invalid not-null constraint,
+  SET NOT NULL validates it.
  
 
  
@@ -458,8 +460,8 @@ WITH ( MODULUS numeric_literal, REM
  
   This form adds a new constraint to a table using the same constraint
   syntax as CREATE TABLE, plus the option NOT
-  VALID, which is currently only allowed for foreign key
-  and CHECK constraints.
+  VALID, which is cu

Re: read stream on amcheck

2025-03-27 Thread Melanie Plageman
On Thu, Mar 27, 2025 at 2:46 PM Matheus Alcantara
 wrote:
>
> Just my 0.2 cents. I also like the first approach even though I prefer
> the v4 version, but anyway, thanks very much for reviewing and
> committing!

Thanks for the patch!

FWIW, I strongly disliked about v4 that two separate parts of the
callback were responsible for advancing current_blocknum, one to
advance it past blocks we chose to skip and the other to advance it to
the next block.

   for (; p->current_blocknum < p->last_exclusive; p->current_blocknum++)
and
 if (p->current_blocknum < p->last_exclusive)
  return p->current_blocknum++;

I found that alone to be undesirable, but once you add in
SKIP_PAGES_NONE, I think it was very hard to understand.

Besides that, when we implemented streaming read sequential scan, we
ended up making dedicated callbacks for the parallel and non-parallel
cases (see heap_scan_stream_read_next_parallel and
heap_scan_stream_read_next_serial) because it performed better than a
single combined callback with a branch. I didn't validate that amcheck
got the same performance benefit from the dedicated callbacks, but I
don't see why it would be any different.

- Melanie




Re: [PATCH] PGSERVICEFILE as part of a normal connection string

2025-03-27 Thread Michael Paquier
On Thu, Mar 27, 2025 at 06:31:14PM +0900, Michael Paquier wrote:
> With all that in mind and more documentation added to the test, I've
> applied 0001, so let's see what the buildfarm has to say.  The CI was
> stable, so it's a start.

The buildfarm (particularly the Windows members that worried me), have
reported back and I am not seeing any failures, so we should be good
with 72c2f36d5727.
--
Michael


signature.asc
Description: PGP signature


Re: Proposal: Progressive explain

2025-03-27 Thread Rafael Thofehrn Castro
Thanks Robert. Very thorough analysis there.

Things I don't comment on will be fixed without further discussion.

> There is a comment in ExplainOnePlan() that says "Handle progressive
> explain cleanup manually if enabled as it is not called as part of
> ExecutorFinish," but standard_ExecutorFinish does indeed call
> ProgressiveExplainFinish(), so either the comment is misleading or the
> code is wrong.

The comment is misleading. What I meant to say is that queries executed via
EXPLAIN (without analyze) don't call ExecutorFinish, so
ProgressiveExplainFinish
isn't called in that context. I will fix the comment.

> Calling ProgressiveExplainTrigger() directly from
> ProgressiveExplainStartupTimeoutHandler() seems extremely scary --
> we've tried hard to make our signal handlers do only very simple
> things like setting a flag, and this one runs around the entire
> PlanState tree modifying Very Important Things. I fear that's going to
> be hard to make robust. Like, what happens if we're going around
> trying to change ExecProcNode pointers while the calling code was also
> going around trying to change ExecProcNode pointers? I can't quite see
> how this won't result in chaos.

Agreed. In that other similar patch to log query plans after a signal is
sent
from another session there was the same discussion and concerns.

I don't see another way of doing it here. This patch became 100x more complex
after I added the GUC progressive_explain_min_duration, which required changing
the execProcNode wrapper on the fly.

I can see some alternatives here:

A) Use a temporary execProcNode wrapper set at query start here:

/*
 * If instrumentation is required, change the wrapper to one that just
 * does instrumentation.  Otherwise we can dispense with all wrappers and
 * have ExecProcNode() directly call the relevant function from now on.
 */
if (node->instrument)
{
  /*
   * Use wrapper for progressive explains if enabled and the node
   * belongs to the currently tracked query descriptor.
   */
  if (progressive_explain == PROGRESSIVE_EXPLAIN_ANALYZE &&
ProgressiveExplainIsActive(node->state->query_desc))
node->ExecProcNode = ExecProcNodeInstrExplain;
  else
node->ExecProcNode = ExecProcNodeInstr;

This wrapper will have additional logic that checks whether a boolean set by
the timeout function has changed. When that happens, the initial progressive
explain setup is done and execProcNode is unwrapped in place. This should be
safe.

The problem here is that all queries would always be using the custom
wrapper until timeout is triggered, and that can add unnecessary overhead.

B) We get rid of the idea of applying progressive explains to non-instrumented
runs (which was my original idea). My main use-case here is to see progress
of instrumented runs anyway. For that idea we have 2 possibilities:

B1) Keep the implementation as is, with all the existing GUCs to control
what is included in the printed plan. We just change the GUC
progressive_explain to be a boolean again. If true, instrumentation will be
enabled for the query being executed and the progressive explain will be
printed continuously.

B2) Get rid of all the new GUCs I added and change the progressive explain
feature to become an option of the EXPLAIN command. Something like:

EXPLAIN (ANALYZE, PROGRESSIVE) SELECT * FROM ...

(B1) would allow progressive explains in regular queries, but I'm not sure
people would be willing to enable it globally as it adds instrumentation
overhead.

What do you think of the options?

Rafael.


Re: Add support for EXTRA_REGRESS_OPTS for meson

2025-03-27 Thread Rustam ALLAKOV
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   tested, failed
Spec compliant:   tested, failed
Documentation:tested, failed

Hello everyone,
for the v2 patch I suggest refactoring from

> -sp = subprocess.Popen(args.test_command, env=env_dict, 
> stdout=subprocess.PIPE)
> +if args.testname in ['regress', 'isolation', 'ecpg'] and 
> 'EXTRA_REGRESS_OPTS' in env_dict:
> +test_command = args.test_command + 
> shlex.split(env_dict['EXTRA_REGRESS_OPTS'])
> +else:
> +test_command = args.test_command
> +
> +sp = subprocess.Popen(test_command, env=env_dict, stdout=subprocess.PIPE)
 
to

-sp = subprocess.Popen(args.test_command, env=env_dict, stdout=subprocess.PIPE)
+if args.testname in ['regress', 'isolation', 'ecpg']:
+test_command = args.test_command[:]
+if 'TEMP_CONFIG' in env_dict:
+test_command.insert(1, '--temp-config=' + env_dict['TEMP_CONFIG'])
+if 'EXTRA_REGRESS_OPTS' in env_dict:
+test_command += shlex.split(env_dict['EXTRA_REGRESS_OPTS'])
+else:
+test_command = args.test_command
+
+sp = subprocess.Popen(test_command, env=env_dict, stdout=subprocess.PIPE)

I double-checked whether the shlex module is built into Python, and it is,
so there is no need for an additional requirements.txt entry for pip to install.

In addition to the above, it might be worth adding some documentation like:

Environment Variables Supported:  
EXTRA_REGRESS_OPTS: Additional options to pass to regression, isolation, or 
ecpg tests.
TEMP_CONFIG: Specify a temporary configuration file for testing purposes.
Example Usage:
# Use EXTRA_REGRESS_OPTS to load an extension
  EXTRA_REGRESS_OPTS="--load-extension=pgcrypto" meson test
# Use TEMP_CONFIG to specify a temporary configuration file
  TEMP_CONFIG="path/to/test.conf" meson test
# Use both EXTRA_REGRESS_OPTS and TEMP_CONFIG together
  TEMP_CONFIG="path/to/test.conf" 
EXTRA_REGRESS_OPTS="--load-extension=pgcrypto" meson test

Should we cover these new lines with a test? I'm asking because I see
EXTRA_REGRESS_OPTS being tested for autotools in src/interfaces/ecpg/test/makefile.

Regards.

RE: Selectively invalidate caches in pgoutput module

2025-03-27 Thread Hayato Kuroda (Fujitsu)
Dear Andres,

> I think the new tests just failed in CI:
> https://cirrus-ci.com/task/5602950271205376?logs=test_world#L268

Thanks for reporting, I'll look into it.

Best regards,
Hayato Kuroda
FUJITSU LIMITED 



Re: Disabling vacuum truncate for autovacuum

2025-03-27 Thread Nathan Bossart
On Thu, Mar 20, 2025 at 09:59:45AM -0700, David G. Johnston wrote:
> I get the need for this kind of logic, since we used a boolean for the
> table option, but as a self-proclaimed hack it seems worth more comments
> than provided here.  Especially pertaining to whether this is indeed
> generic or vacuum_truncate specific.  It's unclear since both isset and
> vacuum_truncate_set have been introduced.

I'm happy to expand the comments, but...

> If it is now a general property
> applying to any setting then vacuum_truncate_set shouldn't be needed - we
> should just get the isset value of the existing vacuum_truncate reloption
> by name.

the reason I added this field is that I couldn't find any existing way
to get this information where it's needed.  So, I followed the existing
pattern of adding an offset to something we can access.  This can be used
for any reloption, but currently vacuum_truncate is the only one that uses it.

How does something like this look for the comment?

/*
 * isset_offset is an optional offset of a field in the result struct
 * that stores whether the value is explicitly set for the relation or
 * has just picked up the default value.  In most cases, this can be
 * deduced by giving the reloption a special default value (e.g., -2 is
 * a common one for integer reloptions), but this isn't always
 * possible.  One notable example is Boolean reloptions, where it's
 * difficult to discern the source of the value.  This offset is only
 * used if given a value greater than zero.
 */
int isset_offset;
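
For context, here is roughly the usage pattern I'm picturing; the struct and
table entry below are just an illustration (assuming relopt_parse_elt grows
the new isset_offset member), not the actual patch:

/* Illustrative result struct; the real one would be StdRdOptions. */
typedef struct DemoRelOptions
{
	int32		vl_len_;				/* varlena header */
	bool		vacuum_truncate;		/* the reloption's value */
	bool		vacuum_truncate_set;	/* filled in via isset_offset */
} DemoRelOptions;

/* Parse-table entry, with the assumed extra isset_offset column. */
static const relopt_parse_elt demo_relopt_tab[] = {
	{"vacuum_truncate", RELOPT_TYPE_BOOL,
	 offsetof(DemoRelOptions, vacuum_truncate),
	 offsetof(DemoRelOptions, vacuum_truncate_set)},
};

/* Callers can then tell "explicitly set" apart from "picked up the default". */
static bool
demo_should_truncate(DemoRelOptions *opts, bool guc_default)
{
	return opts->vacuum_truncate_set ? opts->vacuum_truncate : guc_default;
}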

-- 
nathan




Re: Update Unicode data to Unicode 16.0.0

2025-03-27 Thread Robert Haas
On Wed, Mar 19, 2025 at 5:47 PM Jeff Davis  wrote:
> Do you have a sketch of what the ideal Unicode version management
> experience might look like? Very high level, like "this is what happens
> by default during an upgrade" and "this is how a user discovers that
> that they might want to update Unicode", etc.
>
> What ways can/should we nudge users to update more quickly, if at all,
> so that they are less likely to have problems with newly-assigned code
> points?
>
> And, if possible, how we might extend this user experience to libc or
> ICU updates?

As I think you know, I don't consider myself an expert in this area,
just somebody who has seen a decent amount of user pain (although I am
sure that even there some other people have seen more). That said, for
me the ideal would probably include the following things:

* When the collation/ctype/whatever definitions upon which you are
relying change, you can either decide to switch to the new ones
without rebuilding your indexes and risk wrong results until you
reindex, or you can decide to create new indexes using the new
definitions and drop the old ones.

* You're never forced to adopt new definitions during a SPECIFIC major
or minor release upgrade or when making some other big change to the
system. It's fine, IMHO, if we eventually remove support for old
stuff, but there should be a multi-year window of overlap. For
example, if PostgreSQL 42 adds support for Unicode 95.0.0, we'd keep
that support for, I don't know, at least the next four or five major
versions. So upgrading PG can eventually force you to upgrade
collation defs, but you don't get into a situation where PG 41
supports only Unicode < 95 and PG 42 supports only Unicode >= 95.

* In an absolutely perfect world, we'd have strong versioning of every
type of collation from every provider. This is probably very difficult
to achieve in practice, so maybe the somewhat more realistic goal
might be to get to a point where most users, most of the time, are
relying on collations with strong versioning. For glibc, this seems
relatively hopeless unless upstream changes their policy in a big way.
For ICU, loading multiple library versions seems like a possible path
forward. Relying more on built-in collations seems like another
possible approach, but I think that would require us to have more than
just a code-point sort: we'd need to have built-in collations for
users of various languages. That sounds like it would be a lot of work
to develop, but even worse, it sounds like it would be a tremendous
amount of work to maintain. I expect Tom will opine that this is an
absolutely terrible idea that we should never do under any
circumstances, and I understand the sentiment, but I think it might be
worth considering if we're confident we will have people to do the
maintenance over the long term.

* I would imagine pg_upgrade either keeping the behavior unchanged for
any strongly-versioned collation, or failing. I don't see a strong
need to try to notify users about the availability of new versions
otherwise. People who want to stay current will probably figure out
how to do that, and people who don't will ignore any warnings we give
them. I'm not completely opposed to some other form of notification,
but I think it's OK if "we finally removed support for your extremely
old ICU version" is the driving force that makes people upgrade.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: speedup COPY TO for partitioned table.

2025-03-27 Thread jian he
hi.

I made a mistake.
The regress test sql file should have a new line at the end of the file.
From a4c643ac3a9f40bbdf07dcacc38527ef6e86f1bc Mon Sep 17 00:00:00 2001
From: jian he 
Date: Fri, 28 Mar 2025 11:05:53 +0800
Subject: [PATCH v5 1/1] support COPY partitioned_table TO

CREATE TABLE pp (id  INT, val int ) PARTITION BY RANGE (id);
create table pp_1 (val int, id int);
create table pp_2 (val int, id int);
ALTER TABLE pp ATTACH PARTITION pp_1 FOR VALUES FROM (1) TO (5);
ALTER TABLE pp ATTACH PARTITION pp_2 FOR VALUES FROM (5) TO (10);
insert into pp select g, 10 + g from generate_series(1,9) g;
copy pp to stdout(header);

the above case is much slower (around 25% in some cases) than
``COPY (select * from pp) to stdout(header);``,
because of column remapping.  But this is still a new
feature, since master does not support ``COPY (partitioned_table)``.

reviewed by: vignesh C 
reviewed by: David Rowley 
reviewed by: Melih Mutlu 

discussion: https://postgr.es/m/CACJufxEZt+G19Ors3bQUq-42-61__C=y5k2wk=sHEFRusu7=i...@mail.gmail.com
commitfest entry: https://commitfest.postgresql.org/patch/5467/
---
 doc/src/sgml/ref/copy.sgml  |  8 +--
 src/backend/commands/copyto.c   | 89 +++--
 src/test/regress/expected/copy2.out | 16 ++
 src/test/regress/sql/copy2.sql  | 11 
 4 files changed, 114 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index df093da97c5..f86e0b7ec35 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -521,15 +521,15 @@ COPY count
 

 COPY TO can be used only with plain
-tables, not views, and does not copy rows from child tables
-or child partitions.  For example, COPY COPY TO can be used with partitioned tables.
+For example, in a table inheritance hierarchy, COPY table TO copies
 the same rows as SELECT * FROM ONLY table.
 The syntax COPY (SELECT * FROM table) TO ... can be used to
-dump all of the rows in an inheritance hierarchy, partitioned table,
-or view.
+dump all of the rows in an inheritance hierarchy, or view.

 

diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index 84a3f3879a8..0973dc9c14b 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -19,6 +19,8 @@
 #include 
 
 #include "access/tableam.h"
+#include "access/table.h"
+#include "catalog/pg_inherits.h"
 #include "commands/copyapi.h"
 #include "commands/progress.h"
 #include "executor/execdesc.h"
@@ -82,6 +84,7 @@ typedef struct CopyToStateData
 	List	   *attnumlist;		/* integer list of attnums to copy */
 	char	   *filename;		/* filename, or NULL for STDOUT */
 	bool		is_program;		/* is 'filename' a program to popen? */
+	List	   *partitions;		/* oid list of partition oid for copy to */
 	copy_data_dest_cb data_dest_cb; /* function for writing data */
 
 	CopyFormatOptions opts;
@@ -643,6 +646,8 @@ BeginCopyTo(ParseState *pstate,
 		PROGRESS_COPY_COMMAND_TO,
 		0
 	};
+	List	   *children = NIL;
+	List	   *scan_oids = NIL;
 
 	if (rel != NULL && rel->rd_rel->relkind != RELKIND_RELATION)
 	{
@@ -670,11 +675,28 @@ BeginCopyTo(ParseState *pstate,
 	 errmsg("cannot copy from sequence \"%s\"",
 			RelationGetRelationName(rel;
 		else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-			ereport(ERROR,
-	(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-	 errmsg("cannot copy from partitioned table \"%s\"",
-			RelationGetRelationName(rel)),
-	 errhint("Try the COPY (SELECT ...) TO variant.")));
+		{
+			children = find_all_inheritors(RelationGetRelid(rel),
+		   AccessShareLock,
+		   NULL);
+			foreach_oid(childreloid, children)
+			{
+char		 relkind = get_rel_relkind(childreloid);
+
+if (relkind == RELKIND_FOREIGN_TABLE)
+	ereport(ERROR,
+			errcode(ERRCODE_WRONG_OBJECT_TYPE),
+			errmsg("cannot copy from foreign table \"%s\"",
+	RelationGetRelationName(rel)),
+			errdetail("partition \"%s\" is a foreign table", RelationGetRelationName(rel)),
+			errhint("Try the COPY (SELECT ...) TO variant."));
+
+if (RELKIND_HAS_PARTITIONS(relkind))
+	continue;
+
+scan_oids = lappend_oid(scan_oids, childreloid);
+			}
+		}
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_WRONG_OBJECT_TYPE),
@@ -710,6 +732,7 @@ BeginCopyTo(ParseState *pstate,
 		cstate->rel = rel;
 
 		tupDesc = RelationGetDescr(cstate->rel);
+		cstate->partitions = list_copy(scan_oids);
 	}
 	else
 	{
@@ -1066,7 +1089,61 @@ DoCopyTo(CopyToState cstate)
 
 	cstate->routine->CopyToStart(cstate, tupDesc);
 
-	if (cstate->rel)
+	/*
+	 * if COPY TO source table is a partitioned table, then open each
+	 * partition and process each individual partition.
+	 */
+	if (cstate->rel && cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		processed = 0;
+
+		foreach_oid(scan_oid, cstate->partitions)
+		{
+			TupleTableSlot *slot;
+			TableScanDesc scandesc;
+			Relation		scan_

Re: AIO v2.5

2025-03-27 Thread Noah Misch
On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote:
> I now wrote some tests. And I both regret doing so (because it found problems,
> which would have been apparent long ago, if the feature had come with *any*
> tests, if I had gone the same way I could have just pushed stuff) and am glad
> I did (because I dislike pushing broken stuff).
> 
> I have to admit, I was tempted to just ignore this issue and just not say
> anything about tests for checksum failures anymore.

I don't blame you.

> 3) We can't pgstat_report_checksum_failure() during the completion callback,
>as it *sometimes* allocates memory
> 
>Aside from the allocation-in-critical-section asserts, I think this is
>*extremely* unlikely to actually cause a problem in practice. But we don't
>want to rely on that, obviously.

> Addressing 3) is not at all trivial.  Here's what I've thought of so far:
> 
> 
> Approach I)

> Unfortunately that fails because to access the shared memory with the stats
> data we need to do dsa_get_address()


> Approach II)
> 
> Don't report the error in the completion callback.  The obvious place would be
> to do it where we'll raise the warning/error in the issuing process.  The
> big disadvantage is that that could lead to under-counting checksum
> errors:
> 
> a) A read stream does 2+ concurrent reads for the same relation, and more than
> one encounters checksum errors. When processing the results for the first
> failed read, we raise an error and thus won't process the results of the
> second+ reads with errors.
> 
> b) A read is started asynchronously, but before the backend gets around to
> processing the result of the IO, it errors out during that other work
> (possibly due to a cancellation). Because the backend never looked at the
> results of the IOs, the checksum errors don't get accounted for.
> 
> b) doesn't overly bother me, but a) seems problematic.

While neither are great, I could live with both.  I guess I'm optimistic that
clusters experiencing checksum failures won't lose enough reports to these
loss sources to make the difference in whether monitoring catches them.  In
other words, a cluster will report N failures without these losses and N-K
after these losses.  If N is large enough for relevant monitoring to flag the
cluster appropriately, N-K will also be large enough.


> Approach III)
> 
> Accumulate checksum errors in two backend local variables (one for database
> specific errors, one for errors on shared relations), which will be flushed by
> the backend that issued IO during the next pgstat_report_start().
> 
> Two disadvantages:
> 
> - Accumulation of errors will be delayed until the next
>   pgstat_report_start(). That seems acceptable, after all we do so far a lot
>   of other stats.

Yep, acceptable.

> - We need to register a local callback for shared buffer reads, which don't
>   need them today . That's a small bit of added overhead. It's a shame to do
>   so for counters that approximately never get incremented.

Fair concern.  An idea is to let the complete_shared callback change the
callback list associated with the IO, so it could change
PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW.  The
latter would differ from the former only in having the extra local callback.
Could that help?  I think the only overhead is using more PGAIO_HCB numbers.
We currently reserve 256 (uint8), but one could imagine trying to pack into
fewer bits.  That said, this wouldn't paint us into a corner.  We could change
the approach later.

pgaio_io_call_complete_local() starts a critical section.  Is that a problem
for this approach?


> Approach IV):
> 
> Embrace piercing abstractions / generic infrastructure and put two atomic
> variables (one for shared one for the backend's database) in some
> backend-specific shared memory (e.g. the backend's PgAioBackend or PGPROC) and
> update that in the completion callback. Flush that variable to the shared
> stats in pgstat_report_start() or such.

I could live with that.  I feel better about Approach III currently, though.
Overall, I'm feeling best about III long-term, but II may be the right
tactical choice.


> Does anybody have better ideas?

I think no, but here are some ideas I tossed around:

- Like your Approach III, but have the completing process store the count
  locally and flush it, instead of the staging process doing so.  Would need
  more than 2 slots, but we could have a fixed number of slots and just
  discard any reports that arrive with all slots full.  Reporting checksum
  failures in, say, 8 databases in quick succession probably tells the DBA
  there's "enough corruption to start worrying".  Missing the 9th database
  would be okay.

- Pre-warm the memory allocations and DSAs we could possibly need, so we can
  report those stats in critical sections, from the completing process.  Bad
  since there's an entry per database, hence no reasonable limit on how much
  memory a process mig

Re: pg_stat_database.checksum_failures vs shared relations

2025-03-27 Thread Julien Rouhaud
On Thu, Mar 27, 2025 at 09:02:02PM -0400, Andres Freund wrote:
> Hi,
>
> On 2025-03-28 09:44:58 +0900, Michael Paquier wrote:
> > On Thu, Mar 27, 2025 at 12:06:45PM -0400, Robert Haas wrote:
> > > On Thu, Mar 27, 2025 at 11:58 AM Andres Freund  wrote:
> > > > So, today we have the weird situation that *some* checksum errors on 
> > > > shared
> > > > relations get attributed to the current database (if they happen in a 
> > > > backend
> > > > normally accessing a shared relation), whereas others get reported to 
> > > > the
> > > > "shared relations" "database" (if they happen during a base backup).  
> > > > That
> > > > seems ... not optimal.
> > > >
> > > > One question is whether we consider this a bug that should be 
> > > > backpatched.
> > >
> > > I think it would be defensible if pg_basebackup reported all errors
> > > with OID 0 and backend connections reported all errors with OID
> > > MyDatabaseId, but it seems hard to justify having pg_basebackup take
> > > care to report things using the correct database OID and individual
> > > backend connections not take care to do the same thing. So I think
> > > this is a bug. If fixing it in the back-branches is too annoying, I
> > > think it would be reasonable to fix it only in master, but
> > > back-patching seems OK too.
> >
> > Being able to get better reporting for shared relations in back
> > branches would be nice, but that's going to require some invasive
> > surgery, isn't it?
>
> Yea, that's what I was worried about too.  I think we basically would need a
> PageIsVerifiedExtended2() that backs the current PageIsVerifiedExtended(),
> with optional arguments that the "fixed" callers would use.

While it would be nice, I'm not sure that it would really be worth the trouble.

Maybe that's just me, but if I hit a corruption failure, knowing whether it's a
global relation vs. a normal relation is definitely not something that will
radically change the following days / weeks of pain needed to fully resolve the
issue.  Instead, there are other improvements that I would welcome on top
of fixing those counters, which would impact such a new API.

For instance, one of the things you need to do in case of corruption is to
understand the reason for it, and for that, knowing the underlying
tablespace rather than the database seems like much more useful information to
track.  For the rest, the relfilelocator, forknum and blocknum should already
be reported in the logs, so you have the full details of what was intercepted
even if the pg_stat_database view is broken in the back branches.

But even if we had all that, there is still no guarantee (at least for now)
that we see all the corruption, as you might not read the "real" version of
the blocks if they are in shared buffers and/or in the OS cache, depending on
where the corruption actually happened.

And even if you could actually check what is physically stored on disk, that
probably won't give you any strong guarantee that the rest of the data is
actually ok anyway.  The biggest source of corruption I know is an old VMware
bug usually referred to as the SEsparse bug, where on some occasions blocks
would get written at the wrong location.  In that case, the checksum can tell
me which blocks received the wrong write, but not which blocks the write
should have gone to, and those are entirely inconsistent too.  That's clearly
out of postgres scope, but in my opinion it's just one out of probably a lot
more examples that make the current bug in back branches not worth spending
too much effort to fix.




Re: making EXPLAIN extensible

2025-03-27 Thread Robert Haas
On Thu, Mar 20, 2025 at 8:37 PM Sami Imseih  wrote:
> 1/ provides a good example for future extensions on how to
> use the new hooks. it definitely would be nice to add an example
> for the hook added to check options against each other as mentioned
> by Andrei here [0] but I could not find a good reason to do so with
> pg_overexplain.

Yeah, there may be a good idea lurking somewhere here but I have not
yet come up with it.

> 2/ I agree that this is better than debug_print_plan to avoid all the
> verbosity of that output emitting at once. But maybe it will be useful
> to provide options to show other parts like the target list, quals, etc.
> I wonder if something like this will be better?
>
> explain (query_tree_show range_table) select ..
> explain (query_tree_show qualification) select ..
> explain (query_tree_show target_list) select ..

But why would those options be exclusive of each other? Surely you
could want more than one of those things, which suggests that separate
options are better.

Also, the reason I didn't do the quals or the target list is because
regular EXPLAIN already shows that stuff. There could possibly be some
value in trying to show things about those node trees that regular
EXPLAIN doesn't, but that quickly gets into (a) speculation about what
people actually want to see and/or (b) extremely voluminous output
which is the whole reason debug_print_plan is hard to use in the first
place.

> 1/ trailing whitespace
> 2/ typos

Acknowledged, thanks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_stat_database.checksum_failures vs shared relations

2025-03-27 Thread Michael Paquier
On Thu, Mar 27, 2025 at 12:06:45PM -0400, Robert Haas wrote:
> On Thu, Mar 27, 2025 at 11:58 AM Andres Freund  wrote:
> > So, today we have the weird situation that *some* checksum errors on shared
> > relations get attributed to the current database (if they happen in a 
> > backend
> > normally accessing a shared relation), whereas others get reported to the
> > "shared relations" "database" (if they happen during a base backup).  That
> > seems ... not optimal.
> >
> > One question is whether we consider this a bug that should be backpatched.
> 
> I think it would be defensible if pg_basebackup reported all errors
> with OID 0 and backend connections reported all errors with OID
> MyDatabaseId, but it seems hard to justify having pg_basebackup take
> care to report things using the correct database OID and individual
> backend connections not take care to do the same thing. So I think
> this is a bug. If fixing it in the back-branches is too annoying, I
> think it would be reasonable to fix it only in master, but
> back-patching seems OK too.

Being able to get better reporting for shared relations in back
branches would be nice, but that's going to require some invasive
surgery, isn't it?

We don't know currently the OID of the relation whose block is
corrupted with only PageIsVerifiedExtended().  There are two callers
of PIV_REPORT_STAT on HEAD:
- The checksum reports from RelationCopyStorage() know the
SMgrRelation.
- ReadBuffersOperation() has an optional Relation and a
SMgrRelationData.

We could just refactor PageIsVerifiedExtended() so that it reports a
state describing why the verification failed, and let the callers report the
checksum failure with a relation OID, splitting the data for shared
and non-shared relations?
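
Something like the following is roughly what I have in mind; the enum, the
*_WithStatus() variant and the caller-side helper are hypothetical names,
only meant to illustrate the shape of the refactoring:

/* Hypothetical return values for the refactored verification routine. */
typedef enum PageVerifyStatus
{
	PAGE_VERIFY_OK,
	PAGE_VERIFY_HEADER_INVALID,
	PAGE_VERIFY_CHECKSUM_FAILED,
} PageVerifyStatus;

/* Hypothetical variant backing the existing PageIsVerifiedExtended(). */
extern PageVerifyStatus PageIsVerifiedWithStatus(Page page, BlockNumber blkno,
												 int flags);

/* Caller side, where the relation (and thus its shared-ness) is known. */
static void
report_checksum_failure_for_rel(Relation rel, Page page, BlockNumber blkno)
{
	if (PageIsVerifiedWithStatus(page, blkno, PIV_LOG_WARNING) ==
		PAGE_VERIFY_CHECKSUM_FAILED)
	{
		Oid			dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;

		pgstat_report_checksum_failures_in_db(dboid, 1);
	}
}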
--
Michael


signature.asc
Description: PGP signature


Re: Cross-type index comparison support in contrib/btree_gin

2025-03-27 Thread Tom Lane
v3 needed to rebase over 55527368b.  (I guess "git am" cannot
tolerate any fuzz at all?)  No changes of any significance
whatsoever.

regards, tom lane

From 7a7da88f29972209ef75063140cbf4706ca527e3 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Thu, 27 Mar 2025 21:01:40 -0400
Subject: [PATCH v3 1/6] Break out xxx2yyy_opt_overflow APIs for more datetime
 conversions.

Previous commits invented timestamp2timestamptz_opt_overflow,
date2timestamp_opt_overflow, and date2timestamptz_opt_overflow
functions to perform non-error-throwing conversions between
datetime types.  This patch completes the set by adding
timestamp2date_opt_overflow, timestamptz2date_opt_overflow,
and timestamptz2timestamp_opt_overflow.

In addition, adjust timestamp2timestamptz_opt_overflow so that it
doesn't throw error if timestamp2tm fails, but treats that as an
overflow case.  The situation probably can't arise except with an
invalid timestamp value, and I can't think of a way that that would
happen except data corruption.  However, it's pretty silly to have a
function whose entire reason for existence is to not throw errors for
out-of-range inputs nonetheless throw an error for out-of-range input.

The new APIs are not used in this patch, but will be needed by
btree_gin.

Discussion: https://postgr.es/m/262624.1738460...@sss.pgh.pa.us
---
 src/backend/utils/adt/date.c  | 86 ++-
 src/backend/utils/adt/timestamp.c | 81 -
 src/include/utils/date.h  |  2 +
 src/include/utils/timestamp.h |  3 ++
 4 files changed, 156 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index f279853deb8..07f40eba696 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -1317,10 +1317,35 @@ timestamp_date(PG_FUNCTION_ARGS)
 {
 	Timestamp	timestamp = PG_GETARG_TIMESTAMP(0);
 	DateADT		result;
+
+	result = timestamp2date_opt_overflow(timestamp, NULL);
+	PG_RETURN_DATEADT(result);
+}
+
+/*
+ * Convert timestamp to date.
+ *
+ * On successful conversion, *overflow is set to zero if it's not NULL.
+ *
+ * If the timestamp is finite but out of the valid range for date, then:
+ * if overflow is NULL, we throw an out-of-range error.
+ * if overflow is not NULL, we store +1 or -1 there to indicate the sign
+ * of the overflow, and return the appropriate date infinity.
+ *
+ * Note: given the ranges of the types, overflow is only possible at
+ * the minimum end of the range, but we don't assume that in this code.
+ */
+DateADT
+timestamp2date_opt_overflow(Timestamp timestamp, int *overflow)
+{
+	DateADT		result;
 	struct pg_tm tt,
 			   *tm = &tt;
 	fsec_t		fsec;
 
+	if (overflow)
+		*overflow = 0;
+
 	if (TIMESTAMP_IS_NOBEGIN(timestamp))
 		DATE_NOBEGIN(result);
 	else if (TIMESTAMP_IS_NOEND(timestamp))
@@ -1328,14 +1353,30 @@ timestamp_date(PG_FUNCTION_ARGS)
 	else
 	{
 		if (timestamp2tm(timestamp, NULL, tm, &fsec, NULL, NULL) != 0)
+		{
+			if (overflow)
+			{
+if (timestamp < 0)
+{
+	*overflow = -1;
+	DATE_NOBEGIN(result);
+}
+else
+{
+	*overflow = 1;	/* not actually reachable */
+	DATE_NOEND(result);
+}
+return result;
+			}
 			ereport(ERROR,
 	(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
 	 errmsg("timestamp out of range")));
+		}
 
 		result = date2j(tm->tm_year, tm->tm_mon, tm->tm_mday) - POSTGRES_EPOCH_JDATE;
 	}
 
-	PG_RETURN_DATEADT(result);
+	return result;
 }
 
 
@@ -1362,11 +1403,36 @@ timestamptz_date(PG_FUNCTION_ARGS)
 {
 	TimestampTz timestamp = PG_GETARG_TIMESTAMP(0);
 	DateADT		result;
+
+	result = timestamptz2date_opt_overflow(timestamp, NULL);
+	PG_RETURN_DATEADT(result);
+}
+
+/*
+ * Convert timestamptz to date.
+ *
+ * On successful conversion, *overflow is set to zero if it's not NULL.
+ *
+ * If the timestamptz is finite but out of the valid range for date, then:
+ * if overflow is NULL, we throw an out-of-range error.
+ * if overflow is not NULL, we store +1 or -1 there to indicate the sign
+ * of the overflow, and return the appropriate date infinity.
+ *
+ * Note: given the ranges of the types, overflow is only possible at
+ * the minimum end of the range, but we don't assume that in this code.
+ */
+DateADT
+timestamptz2date_opt_overflow(TimestampTz timestamp, int *overflow)
+{
+	DateADT		result;
 	struct pg_tm tt,
 			   *tm = &tt;
 	fsec_t		fsec;
 	int			tz;
 
+	if (overflow)
+		*overflow = 0;
+
 	if (TIMESTAMP_IS_NOBEGIN(timestamp))
 		DATE_NOBEGIN(result);
 	else if (TIMESTAMP_IS_NOEND(timestamp))
@@ -1374,14 +1440,30 @@ timestamptz_date(PG_FUNCTION_ARGS)
 	else
 	{
 		if (timestamp2tm(timestamp, &tz, tm, &fsec, NULL, NULL) != 0)
+		{
+			if (overflow)
+			{
+if (timestamp < 0)
+{
+	*overflow = -1;
+	DATE_NOBEGIN(result);
+}
+else
+{
+	*overflow = 1;	/* not actually reachable */
+	DATE_NOEND(result);
+}
+return result;
+			}
 			ereport(ERROR,
 	(errcode(ERRCOD

Re: Snapshot related assert failure on skink

2025-03-27 Thread Heikki Linnakangas

On 21/03/2025 12:28, Tomas Vondra wrote:

But it seems it changed in 952365cded6, which is:

 commit 952365cded635e54c4177399c0280cb7a5e34c11
 Author: Heikki Linnakangas 
 Date:   Mon Dec 23 12:42:39 2024 +0200

 Remove unnecessary GetTransactionSnapshot() calls

 In get_database_list() and get_subscription_list(), the
 GetTransactionSnapshot() call is not required because the catalog
 table scans use the catalog snapshot, which is held until the end of
 the scan. See table_beginscan_catalog(), which calls
 RegisterSnapshot(GetCatalogSnapshot(relid)).

 ...

I'm not claiming the commit is buggy - it might be, but I think it's
more likely it just made the preexisting bug easier to hit. I base this
on the observation that incrementing the xactCompletionCount made the
assert failures go away.


That commit removed three GetTransactionSnapshot() calls: one in 
autovacuum, one in logical replication launcher, and one in 
InitPostgres(). It must be the one in InitPostgres() that makes the 
difference, because the other two are not executed in a standby.


I was able to reproduce this and to catch it in action with 'rr'. After 
commit 952365cded6, the sequence of events looks like this:


1. At backend startup, GetCatalogSnapshot() sets 
CatalogSnapshotData.xmin=1000, xactCompletionCount=10
2. On first query, GetTransactionSnapshot() sets 
CurrentSnapshotData.xmin=1002, xactCompletionCount=10
3. GetCatalogSnapshot() is called again. It tries to reuse the snapshot 
with xmin=1000 and hits the assertion.


Before commit 952365cded6, which removed the GetTransactionSnapshot() 
call from InitPostgres, this usually happens instead:


0. At backend startup, GetTransactionSnapshot() sets 
CurrentSnapshotData.xmin=1000, xactCompletionCount=10
1. At backend startup, GetCatalogSnapshot() sets 
CatalogSnapshotData.xmin=1000, xactCompletionCount=10
2. On first query, GetTransactionSnapshot() reuses the snapshot with 
CurrentSnapshotData.xmin=1002, xactCompletionCount=10
3. GetCatalogSnapshot() is called again. It successfully reuses the 
snapshot  with xmin=1000


In other words, the GetTransactionSnapshot() call initializes the 
CurrentSnapshotData with the snapshot with an earlier xmin. The first 
GetCatalogSnapshot() call happens almost immediately after that, so 
they're very likely - but not guaranteed - to get the same snapshot. 
When I removed the GetTransactionSnapshot() call, the gap between the 
first GetTransactionSnapshot() and GetCatalogSnapshot() calls in the 
backend became much longer, making this more likely to happen.


In any case, the missing "xactCompletionCount++" in 
ExpireAllKnownAssignedTransactionIds() and 
ExpireOldKnownAssignedTransactionIds() is a clear bug and explains this. 
I will prepare and commit patches to fix that.
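
For the record, the kind of fix I have in mind looks roughly like this
(written against current master, where the counter lives in TransamVariables;
the committed patch may well end up looking different):

void
ExpireAllKnownAssignedTransactionIds(void)
{
	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
	KnownAssignedXidsRemovePreceding(InvalidTransactionId);

	/*
	 * Bump xactCompletionCount so that snapshots cached before this point
	 * are no longer considered reusable.
	 */
	TransamVariables->xactCompletionCount++;

	LWLockRelease(ProcArrayLock);
}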


Thanks for the testing!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Remove restrictions in recursive query

2025-03-27 Thread Tom Lane
Renan Alves Fonseca  writes:
> I suspected that this restriction came straight from the specs. I
> understand that while the proposed solution can help in some specific
> use cases, it is not enough to justify an exception to the spec.

Well, we extend the spec in lots of places.  I'd be okay with removing
this restriction if I were sure there were no bad consequences, but
it seems likely that there are some.  College math was way too long
ago for me to be sure about the "fixed-point" business ... but I think
what they may be on about is that rows produced by aggregation may not
be stable in the face of adding more and more rows via additions to
the recursion workspace.  In your example, I think it somewhat
accidentally doesn't matter: an underlying row might get picked up
and added to a dag.target-grouped row in one recursion step or
a different one, but then you sum all those sums in the top-level
query, so the top sum comes out the same regardless of just what sets
the intermediate grouped rows consisted of.  With a different query
construction though, that would matter.

regards, tom lane




Re: speedup COPY TO for partitioned table.

2025-03-27 Thread jian he
On Fri, Mar 21, 2025 at 6:13 PM vignesh C  wrote:
>
> I find an issue with the patch:
>
> -- Setup
> CREATE SERVER myserver FOREIGN DATA WRAPPER postgres_fdw OPTIONS
> (dbname 'testdb', port '5432');
> CREATE TABLE t1(id int) PARTITION BY RANGE(id);
> CREATE TABLE part1 PARTITION OF t1 FOR VALUES FROM (0) TO (5);
> CREATE TABLE part2 PARTITION OF t1 FOR VALUES FROM (5) TO (15)
> PARTITION BY RANGE(id);
> CREATE FOREIGN TABLE part2_1 PARTITION OF part2 FOR VALUES FROM (10)
> TO (15) SERVER myserver;
>
> -- Create table in testdb
> create table part2_1(id int);
>
> -- Copy partitioned table data
> postgres=#  copy t1 to stdout(header);
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>

I manually tested:
sequences, temp tables, materialized views, indexes, views,
composite types, partitioned indexes.
None of these can be attached to a partitioned table.

We should care about unlogged tables and foreign tables attached as
partitions.
An unlogged table should work just fine.
We should error out for foreign tables.

so:
copy t1 to stdout(header);
ERROR:  cannot copy from foreign table "t1"
DETAIL:  partition "t1" is a foreign table
HINT:  Try the COPY (SELECT ...) TO variant.
From a2db87abfe0e1a4dda0ace47c65a9778f29fe5f2 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Fri, 28 Mar 2025 11:01:52 +0800
Subject: [PATCH v4 1/1] support COPY partitioned_table TO

CREATE TABLE pp (id  INT, val int ) PARTITION BY RANGE (id);
create table pp_1 (val int, id int);
create table pp_2 (val int, id int);
ALTER TABLE pp ATTACH PARTITION pp_1 FOR VALUES FROM (1) TO (5);
ALTER TABLE pp ATTACH PARTITION pp_2 FOR VALUES FROM (5) TO (10);
insert into pp select g, 10 + g from generate_series(1,9) g;
copy pp to stdout(header);

the above case is much slower (around 25% in some cases) than
``COPY (select * from pp) to stdout(header);``,
because of column remapping.  But this is still a new
feature, since master does not support ``COPY (partitioned_table)``.

reviewed by: vignesh C 
reviewed by: David Rowley 
reviewed by: Melih Mutlu 

discussion: https://postgr.es/m/CACJufxEZt+G19Ors3bQUq-42-61__C=y5k2wk=sHEFRusu7=iQ@mail.gmail.com
commitfest entry: https://commitfest.postgresql.org/patch/5467/
---
 doc/src/sgml/ref/copy.sgml  |  8 +--
 src/backend/commands/copyto.c   | 89 +++--
 src/test/regress/expected/copy2.out | 16 ++
 src/test/regress/sql/copy2.sql  | 11 
 4 files changed, 114 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index df093da97c5..f86e0b7ec35 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -521,15 +521,15 @@ COPY count
 

 COPY TO can be used only with plain
-tables, not views, and does not copy rows from child tables
-or child partitions.  For example, COPY COPY TO can be used with partitioned tables.
+For example, in a table inheritance hierarchy, COPY table TO copies
 the same rows as SELECT * FROM ONLY table.
 The syntax COPY (SELECT * FROM table) TO ... can be used to
-dump all of the rows in an inheritance hierarchy, partitioned table,
-or view.
+dump all of the rows in an inheritance hierarchy, or view.

 

diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index 84a3f3879a8..0973dc9c14b 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -19,6 +19,8 @@
 #include 
 
 #include "access/tableam.h"
+#include "access/table.h"
+#include "catalog/pg_inherits.h"
 #include "commands/copyapi.h"
 #include "commands/progress.h"
 #include "executor/execdesc.h"
@@ -82,6 +84,7 @@ typedef struct CopyToStateData
 	List	   *attnumlist;		/* integer list of attnums to copy */
 	char	   *filename;		/* filename, or NULL for STDOUT */
 	bool		is_program;		/* is 'filename' a program to popen? */
+	List	   *partitions;		/* oid list of partition oid for copy to */
 	copy_data_dest_cb data_dest_cb; /* function for writing data */
 
 	CopyFormatOptions opts;
@@ -643,6 +646,8 @@ BeginCopyTo(ParseState *pstate,
 		PROGRESS_COPY_COMMAND_TO,
 		0
 	};
+	List	   *children = NIL;
+	List	   *scan_oids = NIL;
 
 	if (rel != NULL && rel->rd_rel->relkind != RELKIND_RELATION)
 	{
@@ -670,11 +675,28 @@ BeginCopyTo(ParseState *pstate,
 	 errmsg("cannot copy from sequence \"%s\"",
 			RelationGetRelationName(rel;
 		else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-			ereport(ERROR,
-	(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-	 errmsg("cannot copy from partitioned table \"%s\"",
-			RelationGetRelationName(rel)),
-	 errhint("Try the COPY (SELECT ...) TO variant.")));
+		{
+			children = find_all_inheritors(RelationGetRelid(rel),
+		   AccessShareLock,
+		   NULL);
+			foreach_oid(childreloid, children)
+			{
+char		 relkind = get_rel_relkind(childreloid);
+
+if (relkind == RELKIND_FOREIGN_T

Re: Selectively invalidate caches in pgoutput module

2025-03-27 Thread Amit Kapila
On Fri, Mar 28, 2025 at 6:03 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Andres,
>
> > I think the new tests just failed in CI:
> > https://cirrus-ci.com/task/5602950271205376?logs=test_world#L268
>
> Thanks for reporting, I'll look into it.
>

The problem here is that after ALTER SUBSCRIPTION tap_sub SET
PUBLICATION ..., we didn't wait for the new walsender on the publisher to
start. We must use wait_for_subscription_sync both after the "CREATE
SUBSCRIPTION ..." and the "ALTER SUBSCRIPTION ..." commands and keep
copy_data=true to ensure the initial replication is set up between
publisher and subscriber. This is how we use these commands in other
places.

-- 
With Regards,
Amit Kapila.




Re: Parallel heap vacuum

2025-03-27 Thread Masahiko Sawada
On Wed, Mar 26, 2025 at 1:00 PM Melanie Plageman
 wrote:
>
> On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada  wrote:
> >
> > You're right. I've studied the read stream code and figured out how to
> > use it. In the attached patch, we end the read stream at the end of
> > phase 1 and start a new read stream, as you suggested.
>
> I've started looking at this patch set some more.

Thank you for reviewing the patch!

>
> In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD
> codepath and run out of unskippable blocks in the current chunk and
> then go back to get another chunk (goto retry) but we are near the
> memory limit so we can't get another block
> (!dead_items_check_memory_limit()), could we get an infinite loop?
>
> Or even incorrectly encroach on another worker's block? Asking that
> because of this math
> end_block = next_block +
>
> vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1;

You're right. We should make sure that next_block is reset to
InvalidBlockNumber at the beginning of the retry loop.

>
> if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0
> and we are in the goto retry loop, it seems like we could keep
> incrementing next_block even when we shouldn't be.

Right. Will fix.

>
> I just want to make sure that the skip pages optimization works with
> the parallel block assignment and the low memory read stream
> wind-down.
>
> I also think you do not need to expose
> table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is
> only called in heap-specific code and the logic seems very
> heap-related. If table AMs want something to skip some blocks, they
> could easily implement it.

Agreed. Will remove it.

>
> On another topic, I think it would be good to have a comment above
> this code in parallel_lazy_scan_gather_scan_results(), stating why we
> are very sure it is correct.
> Assert(TransactionIdIsValid(data->NewRelfrozenXid));
> Assert(MultiXactIdIsValid(data->NewRelminMxid));
>
> if (TransactionIdPrecedes(data->NewRelfrozenXid,
> vacrel->scan_data->NewRelfrozenXid))
> vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid;
>
> if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid,
> vacrel->scan_data->NewRelminMxid))
> vacrel->scan_data->NewRelminMxid = data->NewRelminMxid;
>
> if (data->nonempty_pages < vacrel->scan_data->nonempty_pages)
> vacrel->scan_data->nonempty_pages = data->nonempty_pages;
>
> vacrel->scan_data->skippedallvis |= data->skippedallvis;
>
> Parallel relfrozenxid advancement sounds scary, and scary things are
> best with comments. Even though the way this works is intuitive, I
> think it is worth pointing out that this part is important to get
> right so future programmers know how important it is.
>
> One thing I was wondering about is if there are any implications of
> different workers having different values in their GlobalVisState.
> GlobalVisState can be updated during vacuum, so even if they start out
> with the same values, that could diverge. It is probably okay since it
> just controls what tuples are removable. Some workers may remove fewer
> tuples than they absolutely could, and this is probably okay.
>

Good point.

> And if it is okay for each worker to have different GlobalVisState
> then maybe you shouldn't have a GlobalVisState in shared memory. If
> you look at GlobalVisTestFor() it just returns the memory address of
> that global variable in the backend. So, it seems like it might be
> better to just let each parallel worker use their own backend local
> GlobalVisState and not try to put it in shared memory and copy it from
> one worker to the other workers when initializing them. I'm not sure.
> At the very least, there should be a comment explaining why you've
> done it the way you have done it.

Agreed. IIUC it's not a problem even if parallel workers use their own
GlobalVisState. I'll make that change and remove the 0004 patch which
exposes GlobalVisState.

I'll send the updated patch soon.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: making EXPLAIN extensible

2025-03-27 Thread Robert Haas
On Wed, Mar 19, 2025 at 11:38 AM Sami Imseih  wrote:
> ... as I made the hook signature match that of
> ParseExplainOptionList, so both pstate and the options list
> are now available to the hook.

This version LGTM, except it's not pgindent-clean.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Doc: Fixup misplaced filelist.sgml entities and add some commentary

2025-03-27 Thread Steven Niu

Hi, David,

In the file docguide.sgml, there is a typo.
"Within the book are parts, mostly defined within the same file, expect 
for the", the "expect" here should be "except".


Thanks,
Steven

On 2025/3/20 5:13, David G. Johnston wrote:

Hi.

Having been in filelist.sgml a bit recently I've noticed that the 
original alphabetical ordering of the entities therein hasn't been 
adhered to.  Partly, I suspect, because there is no guidance about these 
files and how they are organized.  The attached puts things back into 
alphabetical order (by section) and adds some commentary to this and 
related files, and the manual.


I made the choice to move the special %allfiles; reference to the top 
since placement doesn't matter and burying the one unique thing in the 
middle of the file didn't seem helpful.  Now both its immediate presence 
and the comment point out the existence and purpose of ref/allfiles.sgml.


David J.







Re: making EXPLAIN extensible

2025-03-27 Thread Man Zeng
I tried it out and the feature feels good.
But I found that the following information is incomplete.

postgres=# select * from pg_get_loaded_modules();
 module_name  | version | file_name 
--+-+---
  | | pg_overexplain.so
 auto_explain | 18devel | auto_explain.so
 plpgsql  | 18devel | plpgsql.so
(3 rows)

Some minor changes may be needed.

// PG_MODULE_MAGIC;
PG_MODULE_MAGIC_EXT(
.name = "pg_overexplain",
.version = PG_VERSION
);

Re: Use XLOG_CONTROL_FILE macro everywhere?

2025-03-27 Thread Anton A. Melnikov

Hi!


* Patch needs rebase by CFbot


Rebased the patches onto the current master and marked them as v4.


Best wishes,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From 689539285dd813a682507ce310b45c978d67b065 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 4 Sep 2024 16:15:16 +0900
Subject: [PATCH 1/3] Split file-related parts from xlog_internal.h

Some modules that require XLOG file manipulation had to include
xlog_internal.h, which contains excessively detailed internal
information. This commit separates the file and file path-related
components into xlogfilepaths.h, thereby avoiding the exposure of
unrelated details of the XLOG module.
---
 src/backend/access/transam/timeline.c |   1 -
 src/backend/postmaster/pgarch.c   |   1 -
 src/backend/utils/adt/genfile.c   |   1 -
 src/bin/initdb/initdb.c   |   1 +
 src/bin/pg_archivecleanup/pg_archivecleanup.c |   2 +-
 src/bin/pg_basebackup/pg_basebackup.c |   4 +-
 src/bin/pg_basebackup/pg_receivewal.c |   1 +
 src/bin/pg_basebackup/receivelog.c|   2 +-
 src/bin/pg_basebackup/streamutil.c|   2 +-
 src/bin/pg_controldata/pg_controldata.c   |   1 -
 src/bin/pg_rewind/pg_rewind.c |   2 +-
 src/include/access/xlog.h |  13 +-
 src/include/access/xlog_internal.h| 178 +-
 src/include/access/xlogfilepaths.h| 219 ++
 14 files changed, 230 insertions(+), 198 deletions(-)
 create mode 100644 src/include/access/xlogfilepaths.h

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index a27f27cc037..710e92297e3 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,6 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
-#include "access/xlogdefs.h"
 #include "pgstat.h"
 #include "storage/fd.h"
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 7e622ae4bd2..4db8a00b849 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -30,7 +30,6 @@
 #include 
 
 #include "access/xlog.h"
-#include "access/xlog_internal.h"
 #include "archive/archive_module.h"
 #include "archive/shell_archive.h"
 #include "lib/binaryheap.h"
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 80bb807fbe9..af4c148c809 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -21,7 +21,6 @@
 #include 
 
 #include "access/htup_details.h"
-#include "access/xlog_internal.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_tablespace_d.h"
 #include "catalog/pg_type.h"
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 22b7d31b165..6768ef84c7c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -65,6 +65,7 @@
 #endif
 
 #include "access/xlog_internal.h"
+#include "lib/stringinfo.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_collation_d.h"
diff --git a/src/bin/pg_archivecleanup/pg_archivecleanup.c b/src/bin/pg_archivecleanup/pg_archivecleanup.c
index c25348bcb85..edf1828bc73 100644
--- a/src/bin/pg_archivecleanup/pg_archivecleanup.c
+++ b/src/bin/pg_archivecleanup/pg_archivecleanup.c
@@ -15,7 +15,7 @@
 #include 
 #include 
 
-#include "access/xlog_internal.h"
+#include "access/xlogfilepaths.h"
 #include "common/logging.h"
 #include "getopt_long.h"
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1da4bfc2351..01f36e1c38e 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -25,7 +25,7 @@
 #include 
 #endif
 
-#include "access/xlog_internal.h"
+#include "access/xlogfilepaths.h"
 #include "astreamer_inject.h"
 #include "backup/basebackup.h"
 #include "common/compression.h"
@@ -35,6 +35,8 @@
 #include "fe_utils/option_utils.h"
 #include "fe_utils/recovery_gen.h"
 #include "getopt_long.h"
+#include "pgtime.h"
+#include "port/pg_bswap.h"
 #include "receivelog.h"
 #include "streamutil.h"
 
diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c
index de3584018b0..4facc35f845 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -27,6 +27,7 @@
 #include 
 #endif
 
+//#include "access/xlogfilepaths.h"
 #include "access/xlog_internal.h"
 #include "common/file_perm.h"
 #include "common/logging.h"
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 6b6e32dfbdf..4f72ed42e11 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -18,7 +18,7 @@
 #include 
 #include 
 
-#include "access/xlog_internal.h"
+#include "access/xlogfilepaths.h"
 #include "common/logging.h"
 #include "libpq-fe.h"
 #

Re: [PATCH] SVE popcount support

2025-03-27 Thread John Naylor
On Thu, Mar 27, 2025 at 10:38 AM Nathan Bossart
 wrote:
> I also noticed a silly mistake in 0003 that would cause us to potentially
> skip part of the tail.  That should be fixed now.

I'm not sure whether that meant it could return the wrong answer, or
just make more work for paths further down.
If the former, then our test coverage is not adequate.

Aside from that, I only found one more thing that may be important: I
tried copying the configure/meson checks into godbolt.org, and neither
gcc nor clang likes it, so unless there is something weird about
their setup (or my use of it), it's possible some other hosts won't
like it either:

```
:29:10: error: call to 'svwhilelt_b8' is ambiguous
pred = svwhilelt_b8(0, sizeof(buf));
   ^~~~
/opt/compiler-explorer/clang-16.0.0/lib/clang/16/include/arm_sve.h:15526:10:
note: candidate function
svbool_t svwhilelt_b8(uint64_t, uint64_t);
 ^
/opt/compiler-explorer/clang-16.0.0/lib/clang/16/include/arm_sve.h:15534:10:
note: candidate function
svbool_t svwhilelt_b8(int32_t, int32_t);
 ^

: In function 'autoconf_popcount_test':
:29:24: error: call to 'svwhilelt_b8' is ambiguous; argument 1
has type 'int32_t' but argument 2 has type 'uint64_t'
   29 | pred = svwhilelt_b8(0, sizeof(buf));
  |^~~~
Compiler returned: 1
```

...Changing it to

pred = svwhilelt_b8((uint64_t)0, sizeof(buf));

clears it up.

-- 
John Naylor
Amazon Web Services




Re: read stream on amcheck

2025-03-27 Thread Nazir Bilal Yavuz
Hi,

On Thu, 27 Mar 2025 at 03:48, Melanie Plageman
 wrote:
>
> On Mon, Mar 17, 2025 at 9:47 AM Matheus Alcantara
>  wrote:
> >
> > Sorry for the delay, attached v4 with the remaining fixes.
>
> Thanks for the patch.
>
> I started reviewing this with the intent to commit it. But, I decided
> while studying it that I want to separate the SKIP_PAGES_NONE case and
> the other cases into two callbacks. I think it is easier to read the
> skip pages callback this way. The SKIP_PAGES_NONE case is just read
> all blocks in the range, so we can use the existing default callback,
> block_range_read_cb(). Then the callback for the
> SKIP_PAGES_ALL_VISIBLE and SKIP_PAGES_ALL_FROZEN options can be clear
> and simple.
>
> I've attached two versions with this proposed structure.
>
> amcheck-readsteram-1callback.patch implements this with one callback
> and has the amcheck specific callback private data struct subclass
> BlockRangeReadStreamPrivate (I called it
> heapamcheck_rs_perblock_data).
>
> amcheck-readstream-2callbacks.patch wraps block_range_read_cb() in an
> amcheck specific callback and creates a BlockRangeReadStreamPrivate
> and fills it in from the heapamcheck_rs_perblock_data to pass as
> callback_private_data. Because this version is more explicit, it is
> more safe. We don't have any type checking facilities that will alert
> us if someone adds a member above the BlockRangeReadStreamPrivate in
> heapamcheck_rs_perblock_data. But, I'm open to feedback.

I liked the first approach more. We can solve the first approach's
problems by introducing a void pointer to pass to
read_stream_begin_relation(). We can set it to &rsdata.range for the
SKIP_PAGES_NONE case and &rsdata for the rest.

Example patch is attached, heapamcheck_rs_perblock_data is added to
typedefs.list too.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 642a01c6298742879cefac1197d466b6fec5df7f Mon Sep 17 00:00:00 2001
From: Matheus Alcantara 
Date: Fri, 29 Nov 2024 18:52:43 -0300
Subject: [PATCH] Use read stream on amcheck

ci-os-only:
---
 contrib/amcheck/verify_heapam.c  | 102 +++
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 79 insertions(+), 24 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 827312306f6..be031a1795e 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -25,6 +25,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/rel.h"
@@ -185,6 +186,43 @@ static XidBoundsViolation get_xid_status(TransactionId xid,
 		 HeapCheckContext *ctx,
 		 XidCommitStatus *status);
 
+typedef struct heapamcheck_rs_perblock_data
+{
+	BlockRangeReadStreamPrivate range;
+	SkipPages	skip_option;
+	Relation	rel;
+	Buffer	   *vmbuffer;
+} heapamcheck_rs_perblock_data;
+
+static BlockNumber
+heapam_read_stream_next_block(ReadStream *stream,
+			  void *callback_private_data,
+			  void *per_buffer_data)
+{
+	heapamcheck_rs_perblock_data *p = callback_private_data;
+
+	for (BlockNumber i; (i = p->range.current_blocknum++) < p->range.last_exclusive;)
+	{
+		int32		mapbits = visibilitymap_get_status(p->rel, i, p->vmbuffer);
+
+		if (p->skip_option == SKIP_PAGES_ALL_FROZEN)
+		{
+			if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
+continue;
+		}
+
+		if (p->skip_option == SKIP_PAGES_ALL_VISIBLE)
+		{
+			if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
+continue;
+		}
+
+		return i;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Scan and report corruption in heap pages, optionally reconciling toasted
  * attributes with entries in the associated toast table.  Intended to be
@@ -231,6 +269,11 @@ verify_heapam(PG_FUNCTION_ARGS)
 	BlockNumber last_block;
 	BlockNumber nblocks;
 	const char *skip;
+	ReadStream *stream;
+	int			read_stream_flags;
+	ReadStreamBlockNumberCB cb;
+	heapamcheck_rs_perblock_data rsdata;
+	void	   *stream_callback_private;
 
 	/* Check supplied arguments */
 	if (PG_ARGISNULL(0))
@@ -404,7 +447,34 @@ verify_heapam(PG_FUNCTION_ARGS)
 	if (TransactionIdIsNormal(ctx.relfrozenxid))
 		ctx.oldest_xid = ctx.relfrozenxid;
 
-	for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)
+	rsdata.range.current_blocknum = first_block;
+	rsdata.range.last_exclusive = last_block + 1;
+	rsdata.skip_option = skip_option;
+	rsdata.rel = ctx.rel;
+	rsdata.vmbuffer = &vmbuffer;
+
+	if (skip_option == SKIP_PAGES_NONE)
+	{
+		cb = block_range_read_stream_cb;
+		read_stream_flags = READ_STREAM_SEQUENTIAL | READ_STREAM_FULL;
+		stream_callback_private = &rsdata.range;
+	}
+	else
+	{
+		cb = heapam_read_stream_next_block;
+		read_stream_flags = READ_STREAM_DEFAULT;
+		stream_callback_private = &rsdata;
+	}
+
+	stream = read_stream_begin_relation(read_stream_flags,
+		ctx.bstrategy,
+		ctx.rel,
+		MAIN_FORKNUM,
+		cb,
+	

Re: Better HINT message for "unexpected data beyond EOF"

2025-03-27 Thread Jakub Wartak
On Wed, Mar 26, 2025 at 4:01 PM Robert Haas  wrote:
[..]
> > so how about:
> > -HINT:  This has been seen to occur with buggy kernels; consider
> > updating your system.
> > +HINT:  This has been observed with files being overwritten, buggy
> > kernels and potentially other external file system influence.
>
> I agree that we should emphasize the possibility of files being
> overwritten.

> I'm not sure we should even mention buggy kernels -- is
> there any evidence that's still a thing on still-running hardware?

No, I do not have any, other than comments in source code from Tom.

> I don't really like "other external file system influence" because that
> sounds like vague weasel-wording.

That was somewhat intended: I did not want to rule out any external
factor(s), and wanted to state it as vaguely as possible to stay
generic, because from the perspective of the core server itself it is
literally "paranormal" / "rogue" activity (another entity opening and
overwriting data files). But I suppose bugs or, in some cases,
filesystem corruption could cause it too?

E.g. I've tracked down that Pavan fixed something in the 2ndQ
fast_redo/pg_xlog_prefetch extension in 2016, where a concurrency bug
in that extension was causing a similar problem back then on at least
one occasion: ```...issue was caused because the prefetch worker
process reading back blocks that are being concurrently dropped by the
startup process (as a result of truncate operation). When the startup
process later tries to extend the relation, it finds an existing valid
block in the shared buffers and panics. ``` (sounds like it is related
to data beyond EOF).

Proposals:
1. HINT:  This has been observed with files being overwritten.
2. HINT:  This has been observed with files being overwritten, old
(2.6.x) buggy Linux kernels.
3. HINT:  This has been observed with files being overwritten, old
(2.6.x) buggy Linux kernels, corruption or other non-core PostgreSQL
bugs.
4. HINT:  This has been observed with files being overwritten, buggy
kernels and potentially other external file system influence.

TBH, anything that simply avoids blaming kernel folks directly is
better, but as a non-native speaker I'm finding it a little hard to
articulate.

-J.




Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-27 Thread Shubham Khanna
On Thu, Mar 27, 2025 at 12:16 PM Amit Kapila  wrote:
>
> On Wed, Mar 26, 2025 at 4:02 PM Shubham Khanna
>  wrote:
> >
>
> Let's combine 0001 and 0002.
>

Combined 0001 and 0002.

> A few minor comments:
> *
> +   If database name is not specified in publisher-server, it will try to
> +   connect with postgres/template1 database to fetch the databases from
> +   primary.
>
> Can we change the above to: "If the database name is not specified in
> publisher-server, the postgres database will be used, or if that does
> not exist, template1 will be used."
>

Fixed.

> *
> +   Create one subscription on the target server for each non-template
> +   database on the source server that allows connections, excluding
> +   template databases or databases with connection restrictions.
>
> The following text proposed by Euler seems better: "Create one
> subscription per database on the target server. Exceptions are
> template databases and databases that don't allow connections."
>
> Please use this one at the above and other places where required.
>

Fixed.

The attached patches contain the suggested changes. They also address
the comments provided by Kuroda-san (at [1]).

[1] - 
https://www.postgresql.org/message-id/OSCPR01MB149660B6D00665F47896CA812F5A12%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Thanks and regards,
Shubham Khanna.


v23-0001-Enhance-pg_createsubscriber-to-fetch-and-append-.patch
Description: Binary data


v23-0002-Synopsis-for-all-option.patch
Description: Binary data


v23-0003-Additional-test-cases.patch
Description: Binary data


Re: [fixed] Trigger test

2025-03-27 Thread Dmitrii Bondar


Hi, Paul,

Thanks for the suggestions.

> This looks good. I have a couple small grammar suggestions. This:

I have replaced the incorrect articles with the correct ones.

> We can put all the new lines inside the #ifdef, can't we?

You're right. I have done that.

Best regards,

Dmitrii
From f52d52fcc6c2febb54dfb6f3aa5e128bcea1 Mon Sep 17 00:00:00 2001
From: Bondar Dmitrii 
Date: Thu, 27 Mar 2025 14:58:49 +0700
Subject: [PATCH v6] Triggers test fix with the invalid cache in refint.c

---
 contrib/spi/refint.c   | 26 
 doc/src/sgml/contrib-spi.sgml  | 12 +-
 src/test/regress/expected/triggers.out | 57 +-
 src/test/regress/sql/triggers.sql  | 30 +++---
 4 files changed, 91 insertions(+), 34 deletions(-)

diff --git a/contrib/spi/refint.c b/contrib/spi/refint.c
index e1aef7cd2a3..b1e697eb935 100644
--- a/contrib/spi/refint.c
+++ b/contrib/spi/refint.c
@@ -81,6 +81,10 @@ check_primary_key(PG_FUNCTION_ARGS)
 		/* internal error */
 		elog(ERROR, "check_primary_key: must be fired for row");
 
+	if (!TRIGGER_FIRED_AFTER(trigdata->tg_event))
+		/* internal error */
+		elog(ERROR, "check_primary_key: must be fired by AFTER trigger");
+
 	/* If INSERTion then must check Tuple to being inserted */
 	if (TRIGGER_FIRED_BY_INSERT(trigdata->tg_event))
 		tuple = trigdata->tg_trigtuple;
@@ -264,7 +268,6 @@ check_foreign_key(PG_FUNCTION_ARGS)
 #ifdef DEBUG_QUERY
 	elog(DEBUG4, "check_foreign_key: Enter Function");
 #endif
-
 	/*
 	 * Some checks first...
 	 */
@@ -284,6 +287,10 @@ check_foreign_key(PG_FUNCTION_ARGS)
 		/* internal error */
 		elog(ERROR, "check_foreign_key: cannot process INSERT events");
 
+	if (!TRIGGER_FIRED_AFTER(trigdata->tg_event))
+		/* internal error */
+		elog(ERROR, "check_foreign_key: must be fired by AFTER trigger");
+
 	/* Have to check tg_trigtuple - tuple being deleted */
 	trigtuple = trigdata->tg_trigtuple;
 
@@ -335,10 +342,10 @@ check_foreign_key(PG_FUNCTION_ARGS)
 	kvals = (Datum *) palloc(nkeys * sizeof(Datum));
 
 	/*
-	 * Construct ident string as TriggerName $ TriggeredRelationId and try to
-	 * find prepared execution plan(s).
+	 * Construct ident string as TriggerName $ TriggeredRelationId $ OperationType
+	 * and try to find prepared execution plan(s).
 	 */
-	snprintf(ident, sizeof(ident), "%s$%u", trigger->tgname, rel->rd_id);
+	snprintf(ident, sizeof(ident), "%s$%u$%c", trigger->tgname, rel->rd_id, is_update ? 'U' : 'D');
 	plan = find_plan(ident, &FPlans, &nFPlans);
 
 	/* if there is no plan(s) then allocate argtypes for preparation */
@@ -570,7 +577,6 @@ check_foreign_key(PG_FUNCTION_ARGS)
 
 		relname = args[0];
 
-		snprintf(ident, sizeof(ident), "%s$%u", trigger->tgname, rel->rd_id);
 		plan = find_plan(ident, &FPlans, &nFPlans);
 		ret = SPI_execp(plan->splan[r], kvals, NULL, tcount);
		/* we have no NULLs - so we pass   ^^^^  here */
@@ -593,9 +599,15 @@ check_foreign_key(PG_FUNCTION_ARGS)
 		else
 		{
 #ifdef REFINT_VERBOSE
+			const char* operation;
+
+			if (action == 'c')
+operation = is_update ? "updated" : "deleted";
+			else
+operation = "set to null";
+
 			elog(NOTICE, "%s: " UINT64_FORMAT " tuple(s) of %s are %s",
- trigger->tgname, SPI_processed, relname,
- (action == 'c') ? "deleted" : "set to null");
+ trigger->tgname, SPI_processed, relname, operation);
 #endif
 		}
 		args += nkeys + 1;		/* to the next relation */
diff --git a/doc/src/sgml/contrib-spi.sgml b/doc/src/sgml/contrib-spi.sgml
index e7cae4e38dc..3a838199484 100644
--- a/doc/src/sgml/contrib-spi.sgml
+++ b/doc/src/sgml/contrib-spi.sgml
@@ -36,7 +36,7 @@
 
   
check_primary_key() checks the referencing table.
-   To use, create a BEFORE INSERT OR UPDATE trigger using this
+   To use, create an AFTER INSERT OR UPDATE trigger using this
function on a table referencing another table. Specify as the trigger
arguments: the referencing table's column name(s) which form the foreign
key, the referenced table name, and the column names in the referenced table
@@ -46,7 +46,7 @@
 
   
check_foreign_key() checks the referenced table.
-   To use, create a BEFORE DELETE OR UPDATE trigger using this
+   To use, create an AFTER DELETE OR UPDATE trigger using this
function on a table referenced by other table(s).  Specify as the trigger
arguments: the number of referencing tables for which the function has to
perform checking, the action if a referencing key is found
@@ -63,6 +63,14 @@
   
There are examples in refint.example.
   
+
+  
+   Note that if these triggers are executed from another BEFORE
+   trigger, they can fail unexpectedly. For example, if a user inserts row1 and then
+   the BEFORE trigger inserts row2 and calls a trigger with the
+   check_foreign_key(), the check_foreign_key()
+   function will not see row1 and will fail.
+  
  
 
  
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index 247c67c32ae..3b8bc558140

Re: simplifying grammar for ALTER CONSTRAINT .. SET [NO] INHERIT

2025-03-27 Thread Álvaro Herrera
On 2025-Mar-26, Suraj Kharage wrote:

>  Yes, I agree. As Peter said, it is now inline with other commands.
> 
> I have reviewed the patch and it looks good to me.

Thanks for reviewing!

> Since we are removing the SET keyword, how about removing that from the
> below comment as well.
> 
> /*
> * Propagate the change to children.  For SET NO INHERIT, we don't
> * recursively affect children, just the immediate level.
> */
> 
> This is the comment from ATExecAlterConstrInheritability().

Ah right.  Fixed that and pushed.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: why there is not VACUUM FULL CONCURRENTLY?

2025-03-27 Thread Antonin Houska
Alvaro Herrera  wrote:

> On 2025-Mar-22, Antonin Houska wrote:
> 
> > Alvaro Herrera  wrote:
> > 
> > > I rebased this patch series; here's v09.  No substantive changes from v08.
> > > I made sure the tree still compiles after each commit.
> 
> I rebased again, fixing a compiler warning reported by CI and applying
> pgindent to each individual patch.  I'm slowly starting to become more
> familiar with the whole of this new code.

I'm trying to reflect Robert's suggestions about locking [1]. The next version
should be a bit simpler, so maybe wait for it before you continue studying the
code.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmobUZ0g%3D%3DSZv-OSFCQTGFPis5Qz1UsiMn18HGOWzsiyOLQ%40mail.gmail.com

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Add Postgres module info

2025-03-27 Thread Tom Lane
Yurii Rashkovskii  writes:
> This recent patch is great but causes a small problem. It mixes designated
> and non-designated initializers, specifically in `PG_MODULE_MAGIC_DATA(0)`.

Ugh.  I felt a bit itchy about that, but my compiler wasn't
complaining...

Can you propose a specific change to clean it up?  I wanted to write
just "PG_MODULE_MAGIC_DATA()", but I'm not sure that's valid C either.

regards, tom lane




Re: Historic snapshot doesn't track txns committed in BUILDING_SNAPSHOT state

2025-03-27 Thread cca5507
Hi,


This change doesn't solve a bug. But I think it makes the code and comments 
more consistent. I currently have no idea about how to test it.


--
Regards,
ChangAo Chen

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-03-27 Thread Amit Kapila
On Wed, Mar 26, 2025 at 1:17 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > Seeing all these failures, I wonder whether we can reliably test
> > active slots apart from wal_level change test (aka Scenario 6:
> > incorrect wal_level on primary.). Sure, we can try by having some
> > injection point kind of tests, but is it really worth because, anyway
> > the active slots won't get invalidated in the scenarios for row
> > removal we are testing in this case. The other possibility is to add a
> > developer-level debug_disable_running_xact GUC to test this and
> > similar cases, or can't we have an injection point to control logging
> > this WAL record? I have seen the need to control logging running_xact
> > record in other cases as well.
>
> Based on the idea which controls generating RUNNING_XACTS, I prototyped a 
> patch.
> When the instance is attached the new injection point, all processes would 
> skip
> logging the record. This does not need to extend injection_point module.
>

Right, I think this is a better idea. I have a few comments:
1.
+ /*
+ * In 035_standby_logical_decoding.pl, RUNNING_XACTS could move slots's
+ * xmin forward and cause random failures.

No need to use test file name in code comments.

2. The comments atop wait_until_vacuum_can_remove can be changed to
indicate that we will avoid logging running_xact with the help of
injection points.

3.
+ # Note that from this point the checkpointer and bgwriter will wait before
+ # they write RUNNING_XACT record.
+ $node_primary->safe_psql('testdb',
+ "SELECT injection_points_attach('log-running-xacts', 'wait');");

Isn't it better to use 'error' as the second parameter as we don't
want to wait at this injection point?

4.
+ # XXX If the instance does not attach 'log-running-xacts', the bgwriter
+ # pocess would generate RUNNING_XACTS record, so that the test would fail.
+ sleep(20);

I think it is better to make a separate patch (as a first patch) for
this so that it can be used as a reproducer. I suggest to use
checkpoint as used by one of Bertrand's patches to ensure that the
issue reproduces in every environment.

> Sadly IS_INJECTION_POINT_ATTACHED() was introduced for PG18 so that the patch
> could not backport for PG17 as-is.
>

We can use 'wait' mode API in PG17 as used in one of the tests
(injection_points_attach('heap_update-before-pin', 'wait');) but I
think it may be better to just leave testing active slots in
backbranches because anyway the new development happens on HEAD and we
want to ensure that no breakage happens there.

-- 
With Regards,
Amit Kapila.




Re: Thread-safe nl_langinfo() and localeconv()

2025-03-27 Thread Peter Eisentraut

On 14.02.25 15:13, Peter Eisentraut wrote:
> On 09.02.25 15:52, Peter Eisentraut wrote:
> > This patch set is still desirable.  Here is a rebased version of the
> > v5 patches.  I haven't applied any corrections or review comments.
>
> Here is the same patch set with some review comments.
>
> Patch 0001 looks okay to me.  I'm just offering some cosmetic
> improvements in patch 0004.

I have committed this patch.

The original patch had a typo that prevented the BSD-ish branch (using
localeconv_l()) from being taken, so it actually used the fallback
uselocale() branch, which then failed on CI on NetBSD.  (But conversely,
this gave some testing on CI for the uselocale() branch.)

> Patch 0002 also looks okay, except that the error handling could be
> unified with existing code, as I had previously pointed out.  Patch 0005
> fixes that.

I plan to commit this one next, after the above had a bit of time to stew.

> About patch 0003:
>
> I had previously pointed out that the canonicalization might have been
> intentional, and that we have canonicalization of ICU locale names.  But
> we don't have to keep the setlocale()-based locale checking
> implementation just for that, I think.  (If this was meant to be a real
> feature offered by libc, there should be a get_locale_name(locale_t)
> function.)

POSIX 2024 actually has getlocalename_l(), but it doesn't appear to be
widely implemented.

Anyway, this patch fails tests on CI on NetBSD, so it will need some
further investigation.






Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2025-03-27 Thread Ilia Evdokimov


On 26.03.2025 22:37, Robert Haas wrote:

I also don't think that we should be too concerned about bloating the
EXPLAIN output in the context of a patch that only affects Memoize.
Memoize nodes are not incredibly common in the query plans that I see,
so even if we added another line or three to the output for each one,
I don't think that would create major problems. On the other hand,
maybe there's an argument that what this patch touches is only the tip
of the iceberg, and that we're eventually going to want the same kinds
of things for Nested Loop and Hash Joins and Merge Joins, and maybe
even more detail that can be displayed in say 3 lines. In that case,
there's a double concern. On the one hand, that really would make the
output a whole lot more verbose, and on the other hand, it might
generate a fair amount of work to maintain it across future planner
changes. I can see deciding to reject changes of that sort on the
grounds that we're not prepared to maintain it, or deciding to gate it
behind a new option on the grounds that it is so verbose that even
people who say EXPLAIN VERBOSE are going to be sad if they get all
that crap by default. I'm not saying that we SHOULD make those
decisions -- I think exposing more detail here could be pretty useful
to people trying to solve query plan problems, including me, so I hope
we don't just kick that idea straight to the curb without due thought
-- but I would understand them.

The part I'm least sure about with respect to the proposed patch is
the actual stuff being displayed. I don't have the experience to know
whether it's useful for tracking down issues. If it's not, then I
agree we shouldn't display it. If it is, then I'm tentatively in favor
of showing it in standard EXPLAIN, possibly only with VERBOSE, with
the caveats from the previous paragraph: if more-common node types are
also going to have a bunch of stuff like this, then we need to think
more carefully. If Memoize is exceptional in needing additional
information displayed, then I think it's fine.

--
Robert Haas
EDB: http://www.enterprisedb.com



I understand the concerns raised about the risk of opening the door to 
more diagnostic detail across various plan nodes. However, in Hash Join, 
Merge Join, and Nested Loop, EXPLAIN typically reveals at least some of 
the planner’s expectations. For example, Hash Join shows the number of 
batches and originally expected buckets, giving insight into whether the 
hash table fit in memory. Merge Join shows unexpected Sort nodes when 
presorted inputs were assumed. Nested Loop reflects planner assumptions 
via loops and row estimates. In other words, these nodes expose at least 
some information about what the planner thought would happen.


Memoize is unique in that it shows runtime statistics (hits, misses, 
evictions), but reveals nothing about the planner’s expectations. We 
don’t see how many distinct keys were estimated or how many entries the 
planner thought would fit in memory. This makes it very difficult to 
understand whether Memoize was a good choice or not, or how to fix it 
when it performs poorly.


--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.


Re: FileFallocate misbehaving on XFS

2025-03-27 Thread Andrea Gelmini
Il giorno lun 9 dic 2024 alle ore 10:47 Andrea Gelmini <
andrea.gelm...@gmail.com> ha scritto:

>
> Funny, i guess it's the same reason I see randomly complain of WhatsApp
> web interface, on Chrome, since I switched to XFS. It says something like
> "no more space on disk" and logout, with more than 300GB available.
>

To be fair, after months (changing filesystems and searching around), I
found it's not XFS related, and it happens with different browsers as
well. So my suspicion was wrong.

Ciao,
Gelma


Re: Amcheck verification of GiST and GIN

2025-03-27 Thread Mark Dilger
On Fri, Feb 21, 2025 at 6:29 AM Tomas Vondra  wrote:

> Hi,
>
> I see this patch didn't move since December :-( I still think these
> improvements would be useful, it certainly was very helpful when I was
> working on the GIN/GiST parallel builds (the GiST builds stalled, but I
> hope to push the GIN patches soon).
>
> So I'd like to get some of this in too. I'm not sure about the GiST
> bits, because I know very little about that AM (the parallel builds made
> me acutely aware of that).
>
> But I'd like to get the GIN parts in. We're at v34 already, and the
> recent changes were mostly cosmetic. Does anyone object to me polishing
> and pushing those parts?
>

Kirill may have addressed my concerns in the latest version.  I have not
had time for another review.  Tomas, would you still like to review and
push this patch?  I have no objection.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Support "make check" for PGXS extensions

2025-03-27 Thread Peter Eisentraut

On 20.03.25 18:20, David E. Wheeler wrote:

On Mar 20, 2025, at 09:06, Peter Eisentraut  wrote:


This is a quick follow-up to the extension_control_path patch.  With this little 
additional patch, you can now run "make check" in PGXS-using extensions 
(instead of having to do make install; make installcheck with a running instance).  I 
think this would be very convenient for extension development.


I LOVE this idea! But one thing to keep in mind is that not all files are in 
CURDIR. Might make sense to use `dirname` on all the entires in DATA and 
MODULES to figure out what to put in the search paths. I usually have my C 
files in `src` and SQL files in `sql`, and wrote the PGXN tutorial[1] back in 
2012 with that pattern (for better or worse). A simple example is the envvar 
extension[2]:

DATA = $(wildcard sql/*.sql)
MODULES  = $(patsubst %.c,%,$(wildcard src/*.c))


Interesting.  I think to support that, we would need to do a temp 
install kind of thing of the extension, and then point the path settings 
into that temp install directory.  Which is probably more sensible anyway.






Remove restrictions in recursive query

2025-03-27 Thread Renan Alves Fonseca
Hi,

I'm confused about what we should allow in a recursive query. For example,
in the following query:

WITH RECURSIVE t1 AS ( SELECT 1 UNION  SELECT generate_series(2,3) FROM t1
ORDER BY  1 DESC) SELECT * FROM t1 ;

The parser attaches the "order by" clause to the "union" operator, and then
we error out with the following message: "ORDER BY in a recursive query is
not implemented"

The comment in the code (parse_cte.c:900) says "Disallow ORDER BY and
similar decoration atop the UNION". Then, if we wrap the recursive clause
around parentheses:

WITH RECURSIVE t1 AS ( SELECT 1 UNION  (SELECT generate_series(2,3) FROM t1
ORDER BY  1 DESC)) SELECT * FROM t1 ;

It works as expected. So, do we support the ORDER BY in a recursive query
or not? If the answer is yes, I suggest one of the following modifications:
1. Change the error message to something like "ORDER BY at the top level of
a recursive query is not implemented. HINT: wrap the respective statement
in parentheses"
2. (preferred) Modify the parser or simply remove these checks to allow the
first query.

If the answer is no, then there is a minor bug that allows us to bypass the
check. Even though the ORDER BY happens inside the working table, I think
it can be a useful feature combined with LIMIT and OFFSET.

There is a similar restriction regarding GROUP BY. But in this case, the
error message is very clear and it is consistent with the comment in the
code. I suggest removing this restriction as well in order to improve
PostgreSQL's capabilities to process graph data. For example, the number
of paths in a DAG can be computed more efficiently using an aggregation
at each step.

I don't know what the standard says about this, but it certainly does not
allow DISTINCT ON in the recursive query, while PostgreSQL does support it.
So, we could eventually skip the specs in this case to be more consistent
since a DISTINCT ON has many similarities with a GROUP BY.

I did some tests, and it is enough to remove the check regarding the GROUP
BY. The machinery to perform the GROUP BY in a recursive clause is already
there.
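
For instance, something along these lines (a sketch using a hypothetical
edges(src, dst) table, with node 1 as the source; the parser currently
rejects the GROUP BY in the recursive term):

WITH RECURSIVE paths(node, npaths) AS (
    SELECT 1, 1::bigint
  UNION ALL
    -- aggregate per iteration: one row per node instead of one row per path
    SELECT e.dst, sum(p.npaths)
    FROM paths p
    JOIN edges e ON e.src = p.node
    GROUP BY e.dst
)
SELECT node, sum(npaths) AS total_paths
FROM paths
GROUP BY node;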

Of course, if it is the case, I would be glad to send a patch.

Best regards,
Renan Fonseca


Docs and tests for RLS policies applied by command type

2025-03-27 Thread Dean Rasheed
While looking at the INSERT ... ON CONFLICT DO SELECT patch, I noticed
that the "Policies Applied by Command Type" table on the CREATE POLICY
page doesn't fully or accurately describe all the policies that are
actually checked in all cases:

* INSERT ON CONFLICT checks the new row from the INSERT against SELECT
policy expressions, regardless of what ON CONFLICT action is
performed.

* If an ON CONFLICT DO UPDATE is executed, the new row from the
auxiliary UPDATE command is also checked against SELECT policy
expressions.

* MERGE always checks all candidate source and target rows against
SELECT policy expressions, even if no action is performed.

* MERGE ... THEN INSERT checks the new row against SELECT policy
expressions, if there is a RETURNING clause.

* MERGE ... THEN UPDATE always checks the new and existing rows
against SELECT policy expressions, even if there is no RETURNING
clause.

* MERGE ... THEN DELETE isn't mentioned at all. It always checks the
existing row against SELECT policy expressions.

I think having MERGE use the same row in the doc table as other
commands makes it harder to read, and it would be better to just list
each of the MERGE cases separately, even if that does involve some
repetition.

In addition, a paragraph above the table for INSERT policies says:

"""
Note that INSERT with ON CONFLICT DO UPDATE checks INSERT policies'
WITH CHECK expressions only for rows appended to the relation by the
INSERT path.
"""

Maybe that was once true, but it isn't true now, in any supported PG
version. The WITH CHECK expressions from INSERT policies are always
checked, regardless of which path it ends up taking.

I think it would be good to have regression tests specifically
covering all these cases. Yes, there are a lot of existing RLS
regression tests, but they tend to cover more complex scenarios, and
focus on whether the result of the command was what was expected,
rather than precisely which policies were checked in the process.
Thus, it's not obvious whether they provide complete coverage.

So patch 0001, attached, adds a new set of regression tests, near the
start of rowsecurity.sql, which specifically tests which policies are
applied for each command variant.

Patch 0002 updates the doc table to try to be clearer and more
accurate, and consistent with the test results from 0001, and fixes
the paragraph mentioned above.

Regards,
Dean
From c2c49cd10f001a5ee7a2d52083b2fcd3232fc53e Mon Sep 17 00:00:00 2001
From: Dean Rasheed 
Date: Thu, 27 Mar 2025 14:08:09 +
Subject: [PATCH v1 1/2] New RLS tests to test policies applied by command
 type.

The existing RLS tests focus on the outcome of various testing
scenarios, rather than the exact policies applied. These new tests
list out the policies applied for each command type, including the
different paths through INSERT ... ON CONFLICT and MERGE.
---
 src/test/regress/expected/rowsecurity.out | 226 ++
 src/test/regress/sql/rowsecurity.sql  | 111 +++
 2 files changed, 337 insertions(+)

diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 87929191d06..ce80cbde938 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -31,6 +31,232 @@ CREATE OR REPLACE FUNCTION f_leak(text) RETURNS bool
 COST 0.001 LANGUAGE plpgsql
 AS 'BEGIN RAISE NOTICE ''f_leak => %'', $1; RETURN true; END';
 GRANT EXECUTE ON FUNCTION f_leak(text) TO public;
+-- Test policies applied by command type
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE rls_test_src (a int PRIMARY KEY, b text);
+ALTER TABLE rls_test_src ENABLE ROW LEVEL SECURITY;
+INSERT INTO rls_test_src VALUES (1, 'src a');
+CREATE TABLE rls_test_tgt (a int PRIMARY KEY, b text, c text);
+ALTER TABLE rls_test_tgt ENABLE ROW LEVEL SECURITY;
+CREATE FUNCTION rls_test_tgt_set_c() RETURNS trigger AS
+  $$ BEGIN new.c = upper(new.b); RETURN new; END; $$
+  LANGUAGE plpgsql;
+CREATE TRIGGER rls_test_tgt_set_c BEFORE INSERT OR UPDATE ON rls_test_tgt
+  FOR EACH ROW EXECUTE FUNCTION rls_test_tgt_set_c();
+CREATE FUNCTION sel_using_fn(text, record) RETURNS bool AS
+  $$ BEGIN RAISE NOTICE 'SELECT USING on %.%', $1, $2; RETURN true; END; $$
+  LANGUAGE plpgsql;
+CREATE FUNCTION ins_check_fn(text, record) RETURNS bool AS
+  $$ BEGIN RAISE NOTICE 'INSERT CHECK on %.%', $1, $2; RETURN true; END; $$
+  LANGUAGE plpgsql;
+CREATE FUNCTION upd_using_fn(text, record) RETURNS bool AS
+  $$ BEGIN RAISE NOTICE 'UPDATE USING on %.%', $1, $2; RETURN true; END; $$
+  LANGUAGE plpgsql;
+CREATE FUNCTION upd_check_fn(text, record) RETURNS bool AS
+  $$ BEGIN RAISE NOTICE 'UPDATE CHECK on %.%', $1, $2; RETURN true; END; $$
+  LANGUAGE plpgsql;
+CREATE FUNCTION del_using_fn(text, record) RETURNS bool AS
+  $$ BEGIN RAISE NOTICE 'DELETE USING on %.%', $1, $2; RETURN true; END; $$
+  LANGUAGE plpgsql;
+CREATE POLICY sel_pol ON rls_test_src FOR SELECT
+  USING (sel_using_fn('rls_test_src',

Re: libpq maligning postgres stability

2025-03-27 Thread Christoph Berg
Re: Robert Haas
> I wonder if, in addition to removing the hint, we could also consider
> rewording the message. For example, a slight rewording to "server
> connection closed unexpectedly" would avoid implying that it was the

There is a lot of software doing string-parsing of this part of the
message, so it might be advisable to leave the first line alone.

https://sources.debian.org/src/php-laravel-framework/10.48.25+dfsg-2/src/Illuminate/Database/DetectsLostConnections.php/?hl=28#L28
https://sources.debian.org/src/python-taskflow/5.9.1-4/taskflow/persistence/backends/impl_sqlalchemy.py/?hl=87#L87
https://sources.debian.org/src/gnucash/1:5.10-0.1/libgnucash/backend/dbi/gnc-backend-dbi.cpp/?hl=798#L798
https://sources.debian.org/src/pgbouncer/1.24.0-3/test/test_misc.py/?hl=301#L301
https://sources.debian.org/src/icingaweb2-module-reporting/1.0.2-2/library/Reporting/RetryConnection.php/?hl=23#L23
https://sources.debian.org/src/storm/1.0-1/storm/databases/postgres.py/?hl=353#L353
https://sources.debian.org/src/timescaledb/2.19.0+dfsg-1/test/expected/loader-tsl.out/?hl=473#L473
https://sources.debian.org/src/odoo/18.0.0+dfsg-2/addons/web/tests/test_db_manager.py/?hl=277#L277

https://codesearch.debian.net/search?q=server+closed+the+connection+unexpectedly&literal=1

(There might be room for asking why this string parsing is being done,
is libpq missing "connection lost" detection vs. other errors?)

The remaining message lines are admittedly very pessimistic about
PostgreSQL's stability and should mention networking issues first.

Christoph




Re: Remove restrictions in recursive query

2025-03-27 Thread Renan Alves Fonseca
On Thu, Mar 27, 2025 at 7:10 PM David G. Johnston
 wrote:
>
> There is distinct behavior between group by and order by here.  You seem to 
> be mixing them up.
>
> From Select:
>
> select_statement is any SELECT statement without an ORDER BY, LIMIT, FOR NO 
> KEY UPDATE, FOR UPDATE, FOR SHARE, or FOR KEY SHARE clause. (ORDER BY and 
> LIMIT can be attached to a subexpression if it is enclosed in parentheses. 
> Without parentheses, these clauses will be taken to apply to the result of 
> the UNION, not to its right-hand input expression.)
>
> This is the exact same parsing precedence order by is being given in the 
> recursive CTE query situation presented earlier.
>
> David J.
>

You're right. I'm really mixing these 2 here. Thanks for the clarification.

I'll assume that the silence about allowing GROUP BY means it is not a
great idea...




Re: Remove restrictions in recursive query

2025-03-27 Thread Renan Alves Fonseca
On Thu, Mar 27, 2025 at 5:38 PM Robert Haas  wrote:
> It's not a problem if UNION ALL is used within the initial_query and
> it's not a problem if UNION ALL is used within the iterated_query. But
> you can't apply ORDER BY to the result of the UNION, because the UNION
> is kind of fake -- we're not running the UNION as a single query,
> we're running the two halves separately, the first once and the second
> as many times as needed.

I understand that we can only apply ORDER BY separately in the
initial/iterated query. What disturbs me here is that the UNION
operator has associativity precedence over the ORDER BY only when
inside a recursive CTE. Consider the following query:

SELECT 1 UNION SELECT 1 GROUP BY 1;

It returns 2 rows. The GROUP BY clause attaches to the second
selectStmt without the need to add parenthesis. I would expect the
same syntax inside a recursive CTE.




Re: Remove restrictions in recursive query

2025-03-27 Thread David G. Johnston
On Thu, Mar 27, 2025 at 11:03 AM Renan Alves Fonseca 
wrote:

> On Thu, Mar 27, 2025 at 5:38 PM Robert Haas  wrote:
> > It's not a problem if UNION ALL is used within the initial_query and
> > it's not a problem if UNION ALL is used within the iterated_query. But
> > you can't apply ORDER BY to the result of the UNION, because the UNION
> > is kind of fake -- we're not running the UNION as a single query,
> > we're running the two halves separately, the first once and the second
> > as many times as needed.
>
> I understand that we can only apply ORDER BY separately in the
> initial/iterated query. What disturbs me here is that the UNION
> operator has associativity precedence over the ORDER BY only when
> inside a recursive CTE. Consider the following query:
>
> SELECT 1 UNION SELECT 1 GROUP BY 1;
>
> It returns 2 rows. The GROUP BY clause attaches to the second
> selectStmt without the need to add parenthesis. I would expect the
> same syntax inside a recursive CTE.
>

There is distinct behavior between group by and order by here.  You seem to
be mixing them up.

>From Select:

select_statement is any SELECT statement without an ORDER BY, LIMIT, FOR NO
KEY UPDATE, FOR UPDATE, FOR SHARE, or FOR KEY SHARE clause. (ORDER BY and
LIMIT can be attached to a subexpression if it is enclosed in parentheses.
Without parentheses, these clauses will be taken to apply to the result of
the UNION, not to its right-hand input expression.)

This is the exact same parsing precedence order by is being given in the
recursive CTE query situation presented earlier.

David J.


Re: AIO v2.5

2025-03-27 Thread Noah Misch
On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote:
> On 2025-03-11 20:57:43 -0700, Noah Misch wrote:
> > - Like you say, "redefine max_files_per_process to be about the number of
> >   files each *backend* will additionally open".  It will become normal that
> >   each backend's actual FD list length is max_files_per_process + 
> > MaxBackends
> >   if io_method=io_uring.  Outcome is not unlike
> >   v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch +
> >   v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but we 
> > don't
> >   mutate max_files_per_process.  Benchmark results should not change beyond
> >   the inter-major-version noise level unless one sets io_method=io_uring.  
> > I'm
> >   feeling best about this one, but I've not been thinking about it long.
> 
> Yea, I think that's something probably worth doing separately from Jelte's
> patch.  I do think that it'd be rather helpful to have jelte's patch to
> increase NOFILE in addition though.

Agreed.

> > > > > +static void
> > > > > +maybe_adjust_io_workers(void)
> > > >
> > > > This also restarts workers that exit, so perhaps name it
> > > > start_io_workers_if_missing().
> > > 
> > > But it also stops IO workers if necessary?
> > 
> > Good point.  Maybe just add a comment like "start or stop IO workers to 
> > close
> > the gap between the running count and the configured count intent".
> 
> It's now
> /*
>  * Start or stop IO workers, to close the gap between the number of running
>  * workers and the number of configured workers.  Used to respond to change of
>  * the io_workers GUC (by increasing and decreasing the number of workers), as
>  * well as workers terminating in response to errors (by starting
>  * "replacement" workers).
>  */

Excellent.

> > > > > +{
> > > > ...
> > > > > + /* Try to launch one. */
> > > > > + child = StartChildProcess(B_IO_WORKER);
> > > > > + if (child != NULL)
> > > > > + {
> > > > > + io_worker_children[id] = child;
> > > > > + ++io_worker_count;
> > > > > + }
> > > > > + else
> > > > > + break;  /* XXX try 
> > > > > again soon? */
> > > >
> > > > Can LaunchMissingBackgroundProcesses() become the sole caller of this
> > > > function, replacing the current mix of callers?  That would be more 
> > > > conducive
> > > > to promptly doing the right thing after launch failure.
> > > 
> > > I'm not sure that'd be a good idea - right now IO workers are started 
> > > before
> > > the startup process, as the startup process might need to perform IO. If 
> > > we
> > > started it only later in ServerLoop() we'd potentially do a fair bit of 
> > > work,
> > > including starting checkpointer, bgwriter, bgworkers before we started IO
> > > workers.  That shouldn't actively break anything, but it would likely make
> > > things slower.
> > 
> > I missed that.  How about keeping the two calls associated with PM_STARTUP 
> > but
> > replacing the assign_io_workers() and process_pm_child_exit() calls with one
> > in LaunchMissingBackgroundProcesses()?
> 
> I think replacing the call in assign_io_workers() is a good idea, that way we
> don't need assign_io_workers().
> 
> Less convinced it's a good idea to do the same for process_pm_child_exit() -
> if IO workers errored out we'll launch backends etc before we get to
> LaunchMissingBackgroundProcesses(). That's not a fundamental problem, but
> seems a bit odd.

Works for me.

> I think LaunchMissingBackgroundProcesses() should be split into one that
> starts aux processes and one that starts bgworkers. The one maintaining aux
> processes should be called before we start backends, the latter not.

That makes sense, though I've not thought about it much.

> > > > > + /*
> > > > > +  * It's very unlikely, but possible, that 
> > > > > reopen fails. E.g. due
> > > > > +  * to memory allocations failing or file 
> > > > > permissions changing or
> > > > > +  * such.  In that case we need to fail the IO.
> > > > > +  *
> > > > > +  * There's not really a good errno we can 
> > > > > report here.
> > > > > +  */
> > > > > + error_errno = ENOENT;
> > > >
> > > > Agreed there's not a good errno, but let's use a fake errno that we're 
> > > > mighty
> > > > unlikely to confuse with an actual case of libc returning that errno.  
> > > > Like
> > > > one of EBADF or EOWNERDEAD.
> > > 
> > > Can we rely on that to be present on all platforms, including windows?
> > 
> > I expect EBADF is universal.  EBADF would be fine.
> 
> Hm, that's actually an error that could happen for other reasons, and IMO
> would be more confusing than ENOENT. The latter describes the issue to a
> reasonable extent, EBADFD seems like it would be more confusing

Re: Remove restrictions in recursive query

2025-03-27 Thread Robert Haas
On Thu, Mar 27, 2025 at 2:21 PM Renan Alves Fonseca
 wrote:
> You're right. I'm really mixing these 2 here. Thanks for the clarification.

It looks like GROUP BY binds to the particular UNION branch but ORDER
BY binds to the UNION as a whole:

robert.haas=# select 2 union all select 1;
 ?column?
--
2
1
(2 rows)

robert.haas=# select 2 union all select 1 order by 1;
 ?column?
--
1
2
(2 rows)

robert.haas=# select 2 union all select 1 group by 1;
 ?column?
--
2
1
(2 rows)

> I'll assume that the silence about allowing GROUP BY means it is not a
> great idea...

I don't think there's really anything to keep you from doing this --
just put the grouping operation where you refer to the recursive CTE,
instead of inside the recursive CTE itself. I think allowing it to
appear inside the recursive CTE would be rather confusing -- it's
probably best if the mandatory UNION operation is at the top level.
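
For example (a sketch with a hypothetical edges(src, dst) table), counting
paths per node of a DAG already works today if the aggregation is applied
where the CTE is referenced:

WITH RECURSIVE reachable(node) AS (
    SELECT 1
    UNION ALL
    SELECT e.dst
    FROM reachable r
    JOIN edges e ON e.src = r.node
)
SELECT node, count(*) AS paths   -- grouping happens outside the CTE
FROM reachable
GROUP BY node;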

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Proposal: Progressive explain

2025-03-27 Thread Robert Haas
On Wed, Mar 26, 2025 at 5:58 PM Rafael Thofehrn Castro
 wrote:
> So implementation was done based on transaction nested level. Cleanup is only
> performed when AbortSubTransaction() is called in the same transaction nested
> level as the one where the query is running. This covers both PL/pgSQL 
> exception
> blocks and savepoints.

Right. I think this is much closer to being correct. However, I
actually think it should look more like this:

void
AtEOSubXact_ProgressiveExplain(bool isCommit, int nestDepth)
{
if (activeQueryDesc != NULL &&
activeQueryXactNestLevel >= nestDepth)
{
if (isCommit)
elog(WARNING, "leaked progressive explain query descriptor");
ProgressiveExplainCleanup(NULL);
}
}

By including the is-commit case in there, we can catch any bugs where
things aren't cleaned up properly before a transaction is committed.
We generally want to test >= nestDepth instead of == nestDepth in case
multiple subtransaction levels abort all at once; I'm not sure it
matters here, but even if it isn't, it's best to be consistent with
the practice elsewhere. Having {Commit,Abort}SubTransaction pass the
nestDepth instead of calling GetCurrentTransactionNestLevel() also has
precedent e.g. see AtEOSubXact_HashTables.

As a further refinement, consider initializing
activeQueryXactNestLevel to -1 and resetting it to that value each
time you end a progressive EXPLAIN, so that activeQueryDesc != NULL if
and only if activeQueryXactNestLevel >= 0. Then the test above can be
simplified to just if (activeQueryXactNestLevel >= nestDepth).
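
Putting those two suggestions together, a sketch (using the patch's
function and variable names, and assuming ProgressiveExplainCleanup()
resets both statics) could look like:

static QueryDesc *activeQueryDesc = NULL;
static int activeQueryXactNestLevel = -1;   /* -1 while nothing is active */

void
AtEOSubXact_ProgressiveExplain(bool isCommit, int nestDepth)
{
    if (activeQueryXactNestLevel >= nestDepth)
    {
        if (isCommit)
            elog(WARNING, "leaked progressive explain query descriptor");
        ProgressiveExplainCleanup(NULL);
    }
}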

The comment for ProgressiveExplainIsActive() is a copy-paste from
another function that you forgot to update. Also, the function body
could be reduced to a one-liner: return queryDesc == activeQueryDesc;

There is a comment in ExplainOnePlan() that says "Handle progressive
explain cleanup manually if enabled as it is not called as part of
ExecutorFinish," but standard_ExecutorFinish does indeed call
ProgressiveExplainFinish(), so either the comment is misleading or the
code is wrong.

standard_ExecutorFinish() makes its call to ProgressiveExplainFinish()
dependent on the value of the progressive_explain GUC, but that GUC
could be changed in mid-query via set_config() or a plpgsql function
that calls SET, which could result in skipping the cleanup even when
it's needed. I think you should make the call unconditional and make
it entirely the job of ProgressiveExplainFinish() to decide whether
any cleanup is needed.

ProgressiveExplainFinish() calls ProgressiveExplainCleanup() in most
cases, but sometimes just disables the timeout instead. I think this
is weird. I think you should just always call
ProgressiveExplainCleanup() and make sure it's robust and cleans up
however much or little is appropriate in all cases.

On the flip side, I can't really see why
dsa_detach(queryDesc->pestate->pe_a) needs to be done while holding
ProgressiveExplainHashLock. Why not just have
ProgressiveExplainFinish() call ProgressiveExplainCleanup(), and then
afterwards it can do the dsa_detach() in the caller? Then
ProgressiveExplainCleanup() no longer needs an argument.

ProgressiveExplainPrint() can save a level of indentation in a large
chunk of the function by understanding that elog(ERROR) does not
return. You don't need to wrap everything that follows in else {}.

In the documentation table called pg-stat-progress-explain-view, some
descriptions end with a period and others do not.

Calling ProgressiveExplainTrigger() directly from
ProgressiveExplainStartupTimeoutHandler() seems extremely scary --
we've tried hard to make our signal handlers do only very simple
things like setting a flag, and this one runs around the entire
PlanState tree modifying Very Important Things. I fear that's going to
be hard to make robust. Like, what happens if we're going around
trying to change ExecProcNode pointers while the calling code was also
going around trying to change ExecProcNode pointers? I can't quite see
how this won't result in chaos.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: read stream on amcheck

2025-03-27 Thread Melanie Plageman
On Thu, Mar 27, 2025 at 4:45 AM Nazir Bilal Yavuz  wrote:
>
> I liked the first approach more. We can solve the first approach's
> problems by introducing a void pointer to pass to
> read_stream_begin_relation(). We can set it to &rsdata.range for the
> SKIP_PAGES_NONE case and &rsdata for the rest.

Cool. I've gone with this approach but have renamed all the functions
and structs to try and be more consistent and clear.
Committed in 043799fa08c2c and I marked the commitfest entry as such.

- Melanie




Re: read stream on amcheck

2025-03-27 Thread Matheus Alcantara
On Thu, Mar 27, 2025 at 3:35 PM Melanie Plageman
 wrote:
>
> On Thu, Mar 27, 2025 at 4:45 AM Nazir Bilal Yavuz  wrote:
> >
> > I liked the first approach more. We can solve the first approach's
> > problems by introducing a void pointer to pass to
> > read_stream_begin_relation(). We can set it to &rsdata.range for the
> > SKIP_PAGES_NONE case and &rsdata for the rest.
>
> Cool. I've gone with this approach but have renamed all the functions
> and structs to try and be more consistent and clear.
> Committed in 043799fa08c2c and I marked the commitfest entry as such.

Just my 0.2 cents. I also like the first approach even though I prefer
the v4 version, but anyway, thanks very much for reviewing and
committing!

(I was sending my anwer just when I received your reply)

-- 
Matheus Alcantara




Re: Change log level for notifying hot standby is waiting non-overflowed snapshot

2025-03-27 Thread torikoshia

On 2025-03-27 11:06, torikoshia wrote:

Hi,

I had another off-list discussion with Fujii-san, and according to the
following manual[1], it seems that a transaction with an overflowed
subtransaction is already considered inconsistent:

  Reaching a consistent state can also be delayed in the presence of
both of these conditions:

  - A write transaction has more than 64 subtransactions
  - Very long-lived write transactions

IIUC, the manual suggests that both conditions must be met -- recovery
reaching at least minRecoveryPoint and no overflowed subtransactions
-- for the standby to be considered consistent.

OTOH, the following log message is emitted even when subtransactions
have overflowed, which appears to contradict the definition of
consistency mentioned above:

  LOG:  consistent recovery state reached

This log message is triggered when recovery progresses beyond
minRecoveryPoint(according to CheckRecoveryConsistency()).
However, since this state does not satisfy 'consistency' defined in
the manual, I think it would be more accurate to log that it has
merely reached the "minimum recovery point".
Furthermore, it may be better to emit the above log message only when
recovery has progressed beyond minRecoveryPoint and there are no
overflowed subtransactions.

Attached patch does this.

Additionally, renaming variables such as reachedConsistency in
CheckRecoveryConsistency might also be appropriate.
However, in the attached patch, I have left them unchanged for now.


On 2025-03-25 00:55, Fujii Masao wrote:

-  case CAC_NOTCONSISTENT:
+  case CAC_NOTCONSISTENT_OR_OVERFLOWED:

This new name seems a bit too long. I'm OK to leave the name as it is.
Or, something like CAC_NOTHOTSTANDBY seems simpler and better to me.


Beyond just the length issue, given the understanding outlined above,
I now think CAC_NOTCONSISTENT does not actually need to be changed.


In high-availability.sgml, the "Administrator's Overview" section 
already

describes the conditions for accepting hot standby connections.
This section should also be updated accordingly.


Agreed.
I have updated this section to mention that the resolution is to close
the problematic transaction.
OTOH the changes made in v2 patch seem unnecessary, since the concept
of 'consistent' is already explained in the "Administrator's
Overview."


-        errdetail("Consistent recovery state has not been yet reached.")));
+        errdetail("Consistent recovery state has not been yet reached, or snappshot is pending because subtransaction is overflowed."),

Given the above understanding, "or" is not appropriate in this
context, so I left this message unchanged.
Instead, I have added an errhint. The phrasing in the hint message
aligns with the manual, allowing users to search for this hint and
find the newly added resolution instructions.


On second thought, it may not be appropriate to show this output to 
clients attempting to connect. This message should be notified not to 
clients but to administrators.


From this point of view, it'd be better to output a message indicating 
the status inside ProcArrayApplyRecoveryInfo(). However, a 
straightforward implementation would result in the same message being 
logged every time an XLOG_RUNNING_XACTS WAL is received, making it 
noisy.


Instead of directly outputting a log indicating that a hot standby 
connection cannot be established due to subtransaction overflow, the 
attached patch updates the manual so that administrators can determine 
whether a subtransaction overflow has occurred based on the modified log 
output.



What do you think?


--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.From 908db07687d9d4473ab86d93356b7d605329c0be Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Thu, 27 Mar 2025 23:50:08 +0900
Subject: [PATCH v4] Modify and log to align with the doc

There is an inconsistency between the documentation[1] and the log
messages regarding the definition of a 'consistent' recovery state.
The documentation states that a consistent state requires both that
recovery has progressed beyond minRecoveryPoint and that there are
no overflowed subtransactions, whereas the source code previously
considered only minRecoveryPoint.

This patch modifies and adds log messages to align with the
documentation. Specifically, messages are now logged both when
recovery progresses beyond minRecoveryPoint and when the system
reaches a consistent state in the updated definition. As a result,
the "consistent recovery state reached" message is now output
later than before.

As a side effect, previously, if hot standby remained inaccessible
due to overflowed subtransactions, it was difficult to determine
the cause. With this change in log messages and the corresponding
update to the documentation, the reason for the delay can now be
identified more clearly.

[1] https://www.postgresql.org/d

Re: Amcheck verification of GiST and GIN

2025-03-27 Thread Tomas Vondra



On 3/27/25 16:30, Mark Dilger wrote:
> 
> 
> On Fri, Feb 21, 2025 at 6:29 AM Tomas Vondra  > wrote:
> 
> Hi,
> 
> I see this patch didn't move since December :-( I still think these
> improvements would be useful, it certainly was very helpful when I was
> working on the GIN/GiST parallel builds (the GiST builds stalled, but I
> hope to push the GIN patches soon).
> 
> So I'd like to get some of this in too. I'm not sure about the GiST
> bits, because I know very little about that AM (the parallel builds made
> me acutely aware of that).
> 
> But I'd like to get the GIN parts in. We're at v34 already, and the
> recent changes were mostly cosmetic. Does anyone object to me polishing
> and pushing those parts?
> 
> 
> Kirill may have addressed my concerns in the latest version.  I have not
> had time for another review.  Tomas, would you still like to review and
> push this patch?  I have no objection.
> 

Thanks for reminding me. I think the patches are in good share, but I'll
take a look once more, and I hope to get it committed.


regards

-- 
Tomas Vondra





Re: Test to dump and restore objects left behind by regression

2025-03-27 Thread Alvaro Herrera
On 2025-Mar-27, Ashutosh Bapat wrote:

> On Thu, Mar 27, 2025 at 6:01 PM vignesh C  wrote:

> > Couple of minor thoughts:
> > 1) I felt this error message is not conveying the error message correctly:
> > +   if ($src_node->pg_version != $dst_node->pg_version
> > +   or defined $src_node->{_install_path})
> > +   {
> > +   fail("same version dump and restore test using default
> > installation");
> > +   return;
> > +   }
> >
> > how about something like below:
> > fail("source and destination nodes must have the same PostgreSQL
> > version and default installation paths");
> 
> The text in ok(), fail() etc. are test names and not error messages.
> See [1]. Your suggestion and other versions that I came up with became
> too verbose to be test names. So I think the text here is compromise
> between conveying enough information and not being too long. We
> usually have to pick the testname and lookup the test code to
> investigate the failure. This text serves that purpose.

Maybe
fail("roundtrip dump/restore of the regression database")


BTW another idea to shorten this tests's runtime might be to try and
identify which of parallel_schedule tests leave objects behind and
create a shorter schedule with only those (a possible implementation
might keep a list of the slow tests that don't leave any useful object
behind, then filter parallel_schedule to exclude those; this ensures
test files created in the future are still used.)
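
A minimal sketch of that filtering idea (the list of excluded tests is
purely illustrative):

# Build a reduced schedule, dropping slow tests known to leave nothing
# behind that is interesting to dump (illustrative list).
my %exclude = map { $_ => 1 } qw(stats join_hash brin tuplesort);

open my $in, '<', 'src/test/regress/parallel_schedule' or die $!;
open my $out, '>', 'dump_restore_schedule' or die $!;
while (my $line = <$in>)
{
	if ($line =~ /^test:\s*(.*)/)
	{
		my @tests = grep { !$exclude{$_} } split /\s+/, $1;
		next unless @tests;
		$line = "test: " . join(' ', @tests) . "\n";
	}
	print $out $line;
}
close $in;
close $out;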

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I love the Postgres community. It's all about doing things _properly_. :-)"
(David Garamond)




Re: Support "make check" for PGXS extensions

2025-03-27 Thread David E. Wheeler
On Mar 27, 2025, at 12:21, Peter Eisentraut  wrote:

> Interesting.  I think to support that, we would need to do a temp install 
> kind of thing of the extension, and then point the path settings into that 
> temp install directory.  Which is probably more sensible anyway.

If it runs against a temporary cluster anyway, I think that makes perfect sense.

D



signature.asc
Description: Message signed with OpenPGP


Re: Use CLOCK_MONOTONIC_COARSE for instr_time when available

2025-03-27 Thread Jianghua Yang
*I agree, so this patch only affects explain analyze.*

*1. This change to CLOCK_MONOTONIC_COARSE only affects EXPLAIN ANALYZE and
does not impact other modules.*

The patch introduces optional support for CLOCK_MONOTONIC_COARSE specifically
within the INSTR_TIME instrumentation framework. The modifications are
guarded by the compile-time macro USE_CLOCK_MONOTONIC_COARSE, and are only
used when gathering timing data for performance instrumentation. Given that
INSTR_TIME is mainly used in EXPLAIN ANALYZE, and there are no changes to
runtime or planner logic, this patch ensures that only diagnostic outputs
are affected, leaving core execution paths and other modules untouched.
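
For readers who cannot open the attached patch, a rough sketch of the kind
of guarded clock selection described above (simplified; PostgreSQL's actual
instr_time.h stores an int64 nanosecond count, this only shows the idea):

#include <stdint.h>
#include <time.h>

#if defined(USE_CLOCK_MONOTONIC_COARSE) && defined(CLOCK_MONOTONIC_COARSE)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC_COARSE   /* cheaper, kernel-tick resolution */
#elif defined(CLOCK_MONOTONIC)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC
#else
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif

static inline int64_t
instr_time_now_ns(void)
{
    struct timespec ts;

    clock_gettime(PG_INSTR_CLOCK, &ts);
    return (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}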

--

*2. With this modification, EXPLAIN ANALYZE produces timing results that
are closer to real-world wall-clock time, making performance debugging more
accurate.*


By using CLOCK_MONOTONIC_COARSE, which has lower overhead compared to
CLOCK_MONOTONIC, the patch improves the efficiency of timing
collection in EXPLAIN
ANALYZE. While it may slightly reduce precision, the resulting measurements
more closely reflect actual elapsed time observed by users, especially in
performance-sensitive environments. This makes EXPLAIN ANALYZE outputs more
reliable and helpful for developers diagnosing query performance
bottlenecks.

--- origin version

explain analyze select count(*) from t1;
Thu 27 Mar 2025 01:31:20 AM CST
(every 1s)

QUERY PLAN

---
 Aggregate  (cost=1852876.63..1852876.64 rows=1 width=8) (actual
time=4914.037..4914.038 rows=1 loops=1)
   ->  Seq Scan on t1  (cost=0.00..1570796.90 rows=112831890 width=0)
(actual time=0.039..3072.303 rows=1 loops=1)
 Planning Time: 0.132 ms
 Execution Time: 4914.072 ms
(4 rows)

Time: 4914.676 ms (00:04.915)


--- apply patch

postgres=# explain analyze select count(*) from t1;
QUERY PLAN

---
 Aggregate  (cost=1692478.40..1692478.41 rows=1 width=8) (actual
time=3116.164..3116.164 rows=1 loops=1)
   ->  Seq Scan on t1  (cost=0.00..1442478.32 rows=10032 width=0)
(actual time=0.000..2416.127 rows=1 loops=1)
 Planning Time: 0.000 ms
 Execution Time: 3116.164 ms
(4 rows)

Time: 3114.059 ms (00:03.114)
postgres=# select count(*) from t1;
   count
---
 1
(1 row)

Time: 2086.130 ms (00:02.086)

Andres Freund  于2025年3月27日周四 07:19写道:

> Hi,
>
> On 2025-03-26 23:09:42 -0400, Tom Lane wrote:
> > Jianghua Yang  writes:
> > > This patch modifies the instr_time.h header to prefer
> CLOCK_MONOTONIC_COARSE
> > > when available.
> >
> > As far as I know, our usage of instr_time really needs the highest
> > resolution available, because we are usually trying to measure pretty
> > short intervals.  You say that this patch reduces execution time,
> > and I imagine that's true ... but I wonder if it doesn't do so at
> > the cost of totally destroying the reliability of the output numbers.
>
> The reason, on x86, the timestamp querying has a somewhat high overhead is
> that the "accurate" "read the tsc" instruction serves as a barrier for
> out-of-order execution. With modern highly out-of-order execution that
> means
> we'll wait for all scheduled instructions to finish before determining the
> current time, multiple times for each tuple.  That of course slows things
> down
> substantially.
>
> There's a patch to use the version of rdtsc that does *not* have barrier
> semantics:
>
> https://postgr.es/m/CAP53PkzO2KpscD-tgFW_V-4WS%2BvkniH4-B00eM-e0bsBF-xUxg%40mail.gmail.com
>
> Greetings,
>
> Andres Freund
>


0001-Add-optional-support-for-CLOCK_MONOTONIC_COARSE-only.patch
Description: Binary data


Re: acronym, glossary and other related matters in the docs

2025-03-27 Thread David G. Johnston
On Fri, Mar 21, 2025 at 1:55 PM Andres Freund  wrote:

> Hi,
>
> I was trying to write some docs for an AIO related view. As part of that I
> noticed that we referenced I/O a lot, but never defined it as an acronym or
> had it in the glossary.  While adding those I got somewhat confused by
> things
> that have confused me in the past:
>
> 1) When are we supposed to use <acronym> and when not? We seem to
>    be extremely inconsistent. E.g.
>    git grep -P 'WAL(?=</acronym>)' | wc -l
>    42
>    git grep -P 'WAL(?!</acronym>)' | wc -l
>    904
>
>    Given that, at least for html, <acronym> doesn't change the display, that's
>    perhaps not too surprising. But if we don't do it consistently, perhaps we
>    should just stop doing it at all?
>

See recent discussion regarding TOAST.

https://www.postgresql.org/message-id/CAKFQuwaOXpOuYwN%3D8s3%2Bm-nRv5-Dgr_borfcRFmmjBxtg7kwXA%40mail.gmail.com

The vast majority of our uses of WAL can easily be considered using it as
part of a noun and not as an acronym.

I haven't pondered I/O.


> 2) We have glossary entries explaining acronyms, but we rarely reference
>them. Is there a reason for that?  It'd probably be silly to link
> glossary
>entries every time an acronym is used, often there are many uses
> close-by,
>but that doesn't mean we shouldn't ever do it?
>
>ISTM we should have a guideline for this.
>

I'm leaning toward not linking glossary entries in general.  Most concepts
that you'd want to link there have actually sections in the documentation
that would be better linked to instead.  And linking from those sections to
the glossary seems redundant.

The glossary is basically a description-augmented index.  You go there if
you see a term in the wild you are unfamiliar with to see its description
and get links into the main documentation for further discussion.  Or to
the index to see more completely all the places where it is mentioned.
Like the index it is an interface to the outside world.


> 4) We don't put the glossary entries in the index, despite that sometimes
>being the only place something is documented.
>
>
Extending from my observation above, the glossary and index should probably
cross-reference each other.

David J.


Re: Remove vardata parameters from eqjoinsel_inner

2025-03-27 Thread Richard Guo
On Fri, Feb 21, 2025 at 7:04 PM Ilia Evdokimov
 wrote:
> When calculating selectivity for an inner equijoin, we call
> eqjoinsel_inner, which uses unused parameters vardata1 and vardata2.
> These parameters might have been left behind accidentally when we moved
> getting sslots out of the function. I suggest removing them, as they can
> be added back at any time if needed. I attached patch with fixes.

Yeah, these parameters haven't been used since a314c3407, when we
moved get_variable_numdistinct and get_attstatsslot out of
eqjoinsel_inner and eqjoinsel_semi to avoid repetitive information
lookup when we call both eqjoinsel_inner and eqjoinsel_semi.

I'm wondering whether we should also remove parameter vardata1 from
eqjoinsel_semi.  vardata2 is still needed though to clamp nd2 to be
not more than the rel's row estimate.

Thanks
Richard




Re: Remove restrictions in recursive query

2025-03-27 Thread Robert Haas
On Thu, Mar 27, 2025 at 12:02 PM Renan Alves Fonseca
 wrote:
> WITH RECURSIVE t1 AS ( SELECT 1 UNION  SELECT generate_series(2,3) FROM t1 
> ORDER BY  1 DESC) SELECT * FROM t1 ;
>
> The parser attaches the "order by" clause to the "union" operator, and then 
> we error out with the following message: "ORDER BY in a recursive query is 
> not implemented"
>
> The comment in the code (parser_cte.c:900) says "Disallow ORDER BY and 
> similar decoration atop the UNION". Then, if we wrap the recursive clause 
> around parentheses:
>
> WITH RECURSIVE t1 AS ( SELECT 1 UNION  (SELECT generate_series(2,3) FROM t1 
> ORDER BY  1 DESC)) SELECT * FROM t1 ;
>
> It works as expected. So, do we support the ORDER BY in a recursive query or 
> not?

A recursive CTE effectively takes two queries that will be run as
arguments. For some reason, instead of choosing syntax like WITH
RECURSIVE t1 AS BASE_CASE (initial_query) RECURSIVE_CASE
(iterated_query), somebody chose WITH RECURSIVE t1 AS (initial_query
UNION iterated_query) which really doesn't make it very clear that we
need to be able to break it apart into two separate queries, one of
which will be run once and one of which will be iterated.

It's not a problem if UNION ALL is used within the initial_query and
it's not a problem if UNION ALL is used within the iterated_query. But
you can't apply ORDER BY to the result of the UNION, because the UNION
is kind of fake -- we're not running the UNION as a single query,
we're running the two halves separately, the first once and the second
as many times as needed.
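
As a minimal sketch (made-up table, just to show where ORDER BY can and
cannot legally attach):

```sql
WITH RECURSIVE t(n) AS (
    SELECT 1                                   -- initial_query: run once
    UNION ALL
    (SELECT n + 1 FROM t WHERE n < 3
     ORDER BY 1)                               -- ORDER BY inside the parenthesized
                                               -- iterated_query is accepted
    -- An ORDER BY here, attached to the UNION itself, would fail with
    -- "ORDER BY in a recursive query is not implemented".
)
SELECT * FROM t ORDER BY n;                    -- ordering the final result is always fine
```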

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Remove restrictions in recursive query

2025-03-27 Thread Robert Haas
On Thu, Mar 27, 2025 at 12:37 PM Robert Haas  wrote:
> It's not a problem if UNION ALL is used within the initial_query and
> it's not a problem if UNION ALL is used within the iterated_query. But
> you can't apply ORDER BY to the result of the UNION, because the UNION
> is kind of fake -- we're not running the UNION as a single query,
> we're running the two halves separately, the first once and the second
> as many times as needed.

Oops. The two UNION ALL references in the first part of this paragraph
should have said ORDER BY.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Remove restrictions in recursive query

2025-03-27 Thread David G. Johnston
On Thu, Mar 27, 2025 at 9:02 AM Renan Alves Fonseca 
wrote:

> So, do we support the ORDER BY in a recursive query or not?
>

Apparently we do.  The "in" recurses.

> If the answer is yes, I suggest one of the following modifications:
>
> 1. Change the error message to something like "ORDER BY at the top level of
> a recursive query is not implemented. HINT: wrap the respective statement
> around ()"
>

The error message is now correct but I'd lose the hint.  It assumes a typo
and that the query author meant to sort each iteration of the recursive
term; when maybe they did wish for the final result of the completed
recursive query to then be fed into a sort node and ordered, in which case
the parentheses wouldn't help.

David J.


Re: Test to dump and restore objects left behind by regression

2025-03-27 Thread Ashutosh Bapat
On Thu, Mar 27, 2025 at 10:45 PM Alvaro Herrera  wrote:
>
> On 2025-Mar-27, Ashutosh Bapat wrote:
>
> > On Thu, Mar 27, 2025 at 6:01 PM vignesh C  wrote:
>
> > > Couple of minor thoughts:
> > > 1) I felt this error message is not conveying the error message correctly:
> > > +   if ($src_node->pg_version != $dst_node->pg_version
> > > +   or defined $src_node->{_install_path})
> > > +   {
> > > +   fail("same version dump and restore test using default
> > > installation");
> > > +   return;
> > > +   }
> > >
> > > how about something like below:
> > > fail("source and destination nodes must have the same PostgreSQL
> > > version and default installation paths");
> >
> > The text in ok(), fail() etc. are test names and not error messages.
> > See [1]. Your suggestion and other versions that I came up with became
> > too verbose to be test names. So I think the text here is compromise
> > between conveying enough information and not being too long. We
> > usually have to pick the testname and lookup the test code to
> > investigate the failure. This text serves that purpose.
>
> Maybe
> fail("roundtrip dump/restore of the regression database")

No, that's losing some information like default installation and the
same version.

-- 
Best Wishes,
Ashutosh Bapat




RE: Selectively invalidate caches in pgoutput module

2025-03-27 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> The problem here is that after ALTER SUBSCRIPTION tap_sub SET
> PUBLICATION ..., we didn't wait for the new walsender on publisher to
> start. We must use wait_for_subscription_sync both after the "CREATE
> SUBSCRIPTION ..." and the "ALTER SUBSCRIPTION ..." commands and keep
> copy_data=true to ensure the initial replication is setup between
> publisher and subscriber. This is how we use these commands at other
> places.

Agreed. PSA the patch to fix the issue.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



0001-Stablize-tests-added-in-3abe9dc188.patch
Description: 0001-Stablize-tests-added-in-3abe9dc188.patch


Re: dblink: Add SCRAM pass-through authentication

2025-03-27 Thread Matheus Alcantara
On Fri, Mar 21, 2025 at 1:28 PM Jacob Champion
 wrote:
>
> Great, thank you! Looking over v10, I think I've run out of feedback
> at this point. Marked Ready for Committer.

Thanks for all the effort reviewing this patch!

-- 
Matheus Alcantara




Re: Statistics import and export: difference in statistics of materialized view dumped

2025-03-27 Thread Jeff Davis
On Thu, 2025-03-27 at 17:07 +0530, Ashutosh Bapat wrote:
> Pulled the latest sources but the test is still failing with the same
> differences.

The attached set of patches (your 0001 and 0002, combined with my patch
v2j-0003) applied on master (058b5152f0) are passing the pg_upgrade
test suite for me.

Are you saying that the tests don't work for you even when v2j-0003 is
applied? Or are you saying that your tests are failing on master, and
that v2j-0002 should be committed?

Regards,
Jeff Davis

From 6fc3b98dc9a2589b9943e075b492b4c31044c14e Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Thu, 27 Jun 2024 10:03:53 +0530
Subject: [PATCH v2j 1/3] Test pg_dump/restore of regression objects

002_pg_upgrade.pl tests pg_upgrade of the regression database left
behind by regression run. Modify it to test dump and restore of the
regression database as well.

Regression database created by regression run contains almost all the
database objects supported by PostgreSQL in various states. Hence the
new testcase covers dump and restore scenarios not covered by individual
dump/restore cases. Till now 002_pg_upgrade only tested dump/restore
through pg_upgrade which only uses binary mode. Many regression tests
mention that they leave objects behind for dump/restore testing but they
are not tested in a non-binary mode. The new testcase closes that
gap.

Testing dump and restore of regression database makes this test run
longer for a relatively smaller benefit. Hence run it only when
explicitly requested by user by specifying "regress_dump_test" in
PG_TEST_EXTRA.

Note for the reviewers:
The new test has uncovered many bugs so far in one year.
1. Introduced by 14e87ffa5c54. Fixed in fd41ba93e4630921a72ed5127cd0d552a8f3f8fc.
2. Introduced by 0413a556990ba628a3de8a0b58be020fd9a14ed0. Reverted in 74563f6b90216180fc13649725179fc119dddeb5.
3. Fixed by d611f8b1587b8f30caa7c0da99ae5d28e914d54f
4. Being discussed on hackers at https://www.postgresql.org/message-id/caexhw5s47kmubpbbrjzsm-zfe0tj2o3gbagb7yaye8rq-v2...@mail.gmail.com

Author: Ashutosh Bapat
Reviewed by: Michael Paquier, Daniel Gustafsson, Tom Lane, Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/CAExHW5uF5V=Cjecx3_Z=7xfh4rg2wf61pt+hfquzjbqourz...@mail.gmail.com
---
 doc/src/sgml/regress.sgml   |  12 ++
 src/bin/pg_upgrade/t/002_pg_upgrade.pl  | 144 -
 src/test/perl/Makefile  |   2 +
 src/test/perl/PostgreSQL/Test/AdjustDump.pm | 167 
 src/test/perl/meson.build   |   1 +
 5 files changed, 324 insertions(+), 2 deletions(-)
 create mode 100644 src/test/perl/PostgreSQL/Test/AdjustDump.pm

diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml
index 0e5e8e8f309..237b974b3ab 100644
--- a/doc/src/sgml/regress.sgml
+++ b/doc/src/sgml/regress.sgml
@@ -357,6 +357,18 @@ make check-world PG_TEST_EXTRA='kerberos ldap ssl load_balance libpq_encryption'
   
  
 
+
+
+ regress_dump_test
+ 
+  
+   When enabled, src/bin/pg_upgrade/t/002_pg_upgrade.pl
+   tests dump and restore of regression database left behind by the
+   regression run. Not enabled by default because it is time and resource
+   consuming.
+  
+ 
+

 
Tests for features that are not supported by the current build
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 00051b85035..d08eea6693f 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -12,6 +12,7 @@ use File::Path qw(rmtree);
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use PostgreSQL::Test::AdjustUpgrade;
+use PostgreSQL::Test::AdjustDump;
 use Test::More;
 
 # Can be changed to test the other modes.
@@ -35,8 +36,8 @@ sub generate_db
 		"created database with ASCII characters from $from_char to $to_char");
 }
 
-# Filter the contents of a dump before its use in a content comparison.
-# This returns the path to the filtered dump.
+# Filter the contents of a dump before its use in a content comparison for
+# upgrade testing. This returns the path to the filtered dump.
 sub filter_dump
 {
 	my ($is_old, $old_version, $dump_file) = @_;
@@ -262,6 +263,21 @@ else
 		}
 	}
 	is($rc, 0, 'regression tests pass');
+
+	# Test dump/restore of the objects left behind by regression. Ideally it
+	# should be done in a separate TAP test, but doing it here saves us one full
+	# regression run.
+	#
+	# This step takes several extra seconds and some extra disk space, so
+	# requires an opt-in with the PG_TEST_EXTRA environment variable.
+	#
+	# Do this while the old cluster is running before it is shut down by the
+	# upgrade test.
+	if (   $ENV{PG_TEST_EXTRA}
+		&& $ENV{PG_TEST_EXTRA} =~ /\bregress_dump_test\b/)
+	{
+		test_regression_dump_restore($oldnode, %node_params);
+	}
 }
 
 # Initialize a new node for the upgrade.
@@ -539,4 +555,128 @@ my $dump2_filtered = filte

Re: UUID v7

2025-03-27 Thread Masahiko Sawada
On Wed, Mar 26, 2025 at 12:32 PM Andrei Borodin  wrote:
>
>
>
> 26.03.2025, 21:06, "Masahiko Sawada" :
>
> Agreed. I've done this in the attached patch.
>
> Great! The patch looks good to me.

Thank you for reviewing it. I'm going to push the fix tomorrow,
barring further comments.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




duplicated comments on get_relation_constraints

2025-03-27 Thread jian he
hi.

in plancat.c, function: get_relation_constraints

```
for (i = 0; i < num_check; i++)
{
Node   *cexpr;
/*
  * If this constraint hasn't been fully validated yet, we must
  * ignore it here.  Also ignore if NO INHERIT and we weren't told
  * that that's safe.
  */
if (!constr->check[i].ccvalid)
  continue;

/*
  * NOT ENFORCED constraints are always marked as invalid, which
  * should have been ignored.
  */
Assert(constr->check[i].ccenforced);

/*
  * Also ignore if NO INHERIT and we weren't told that that's safe.
  */
if (constr->check[i].ccnoinherit && !include_noinherit)
  continue;
}
```

The first "Also ignore if NO INHERIT and we weren't told that that's
safe." is duplicated,
also it's in the wrong place?
The second one is fine.




Re: pg_stat_database.checksum_failures vs shared relations

2025-03-27 Thread Andres Freund
Hi,

On 2025-03-28 09:44:58 +0900, Michael Paquier wrote:
> On Thu, Mar 27, 2025 at 12:06:45PM -0400, Robert Haas wrote:
> > On Thu, Mar 27, 2025 at 11:58 AM Andres Freund  wrote:
> > > So, today we have the weird situation that *some* checksum errors on 
> > > shared
> > > relations get attributed to the current database (if they happen in a 
> > > backend
> > > normally accessing a shared relation), whereas others get reported to the
> > > "shared relations" "database" (if they happen during a base backup).  That
> > > seems ... not optimal.
> > >
> > > One question is whether we consider this a bug that should be backpatched.
> > 
> > I think it would be defensible if pg_basebackup reported all errors
> > with OID 0 and backend connections reported all errors with OID
> > MyDatabaseId, but it seems hard to justify having pg_basebackup take
> > care to report things using the correct database OID and individual
> > backend connections not take care to do the same thing. So I think
> > this is a bug. If fixing it in the back-branches is too annoying, I
> > think it would be reasonable to fix it only in master, but
> > back-patching seems OK too.
> 
> Being able to get a better reporting for shared relations in back
> branches would be nice, but that's going to require some invasive
> chirurgy, isn't it?

Yea, that's what I was worried about too.  I think we basically would need a
PageIsVerifiedExtended2() that backs the current PageIsVerifiedExtended(),
with optional arguments that the "fixed" callers would use.


> We don't know currently the OID of the relation whose block is
> corrupted with only PageIsVerifiedExtended().

I don't think the relation oid is really the relevant bit, it's the database
oid (or alternatively tablespace). But PageIsVerifiedExtended() doesn't know
that either, obviously.


> There are two callers of PIV_REPORT_STAT on HEAD:

> - The checksum reports from RelationCopyStorage() know the
> SMgrRelation.
> - ReadBuffersOperation() has an optional Relation and a
> SMgrRelationData.

An SMgrRelationData suffices, via ->smgr_rlocator.locator.dbOid.


FWIW, it turns out that there are more cases than just MyDatabaseId and
InvalidOid - ScanSourceDatabasePgClass() and RelationCopyStorageUsingBuffer()
read buffers in a different database than MyDatabaseId.


> We could just refactor PageIsVerifiedExtended() so as it reports a
> state about why the verification failed and let the callers report the
> checksum failure with a relation OID, splitting the data for shared
> and non-shared relations?

Yea, I think we basically need a *checksum_failed out argument, and then the
callers need to do

if (checksum_failure)
   pgstat_report_checksum_failures_in_db(src->smgr_rlocator.locator.dbOid, 1);

Or alternatively we can optionally pass in the rlocator to
PageIsVerifiedExtended2(), so it can do the above internally.

Btw, it seems somewhat odd that we accumulate stats for checksum failures but
not for invalid page headers - the latter seems even worse...

Greetings,

Andres Freund




Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-03-27 Thread Robert Haas
On Wed, Mar 12, 2025 at 2:20 PM Alvaro Herrera  wrote:
> Taking a step back after discussing this with some colleagues, I need to
> contradict what I said at the start of this thread.  There's a worry
> that changing pg_attribute.attnotnull in the way I initially suggested
> might not be such a great idea after all.  I did a quick search using
> codesearch.debian.net for applications reading that column and thinking
> about how they would react to this change; I think in the end it's going
> to be quite disastrous.  We would break a vast number of these apps, and
> there are probably countless other apps and frameworks that we would
> also break.  Everybody would hate us forever.  Upgrading to Postgres 18
> would become as bad an experience as the drastic change of implicit
> casts to text in 8.3.  Nothing else in the intervening 17 years strikes
> me as so problematic as this change would be.

I don't agree with this conclusion. The 8.3 casting changes were
problematic because any piece of SQL you'd ever written could have
problems. This change will only break queries that look at the
attnotnull column. While there may be quite a few of those, it can't
possibly be of the same order. I think it's routine that changing the
name or type of system catalog columns breaks things for a few people
(e.g. procpid->pid, or relistemp->relpersistence) and we sometimes get
complaints about that, but at least you can grep for it and it's
mostly going to affect admin tools rather than all the queries
everywhere.

That's not to say that adding a second bool column instead of changing
the existing column's data type is necessarily the wrong way to go.
But I think you're overestimating the blast radius by quite a lot.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making EXPLAIN extensible

2025-03-27 Thread Andrei Lepikhov

On 3/26/25 19:09, Robert Haas wrote:

On Sat, Mar 22, 2025 at 12:10 PM Robert Haas  wrote:

I'm not going
to insist on shipping this if I'm the ONLY one who would ever get any
use out of it, but I doubt that's the case.


Hearing no other votes against this, I have committed it, but now I
wonder if that is going to break the buildfarm, because I just noticed
that the changes I made in v9 seem not to have resolved the problem
with debug_parallel_query, for reasons I do not yet understand.
Investigating...
I’m afraid to sound like a bore, but I think pg_overexplain should 
include a call into the hook call chain (see attachment). Who knows, 
maybe this extension will be used someday in production?


--
regards, Andrei Lepikhov

From 03a88d291cfe085bcff10e0595a679bcfcf14fc8 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" 
Date: Thu, 27 Mar 2025 22:27:33 +0100
Subject: [PATCH v0] Add into pg_overexplain example on a hook call chain
 agreement.

---
 contrib/pg_overexplain/pg_overexplain.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/contrib/pg_overexplain/pg_overexplain.c b/contrib/pg_overexplain/pg_overexplain.c
index 4554c3abbbf..5f623e06019 100644
--- a/contrib/pg_overexplain/pg_overexplain.c
+++ b/contrib/pg_overexplain/pg_overexplain.c
@@ -135,6 +135,10 @@ overexplain_per_node_hook(PlanState *planstate, List *ancestors,
 	overexplain_options *options;
 	Plan	   *plan = planstate->plan;
 
+	if (prev_explain_per_node_hook)
+		(*prev_explain_per_node_hook) (planstate, ancestors, relationship,
+	   plan_name, es);
+
 	options = GetExplainExtensionState(es, es_extension_id);
 	if (options == NULL)
 		return;
@@ -251,6 +255,10 @@ overexplain_per_plan_hook(PlannedStmt *plannedstmt,
 {
 	overexplain_options *options;
 
+	if (prev_explain_per_plan_hook)
+		(*prev_explain_per_plan_hook) (plannedstmt, into, es, queryString,
+	   params, queryEnv);
+
 	options = GetExplainExtensionState(es, es_extension_id);
 	if (options == NULL)
 		return;
-- 
2.39.5



Re: libpq maligning postgres stability

2025-03-27 Thread Robert Haas
On Thu, Mar 27, 2025 at 11:19 AM Andres Freund  wrote:
> We have several places in libpq where libpq says that a connection closing is
> probably due to a server crash with a message like:
>
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing
>
> I think this is rather unhelpful, at least these days. There are a lot of
> reasons the connection could have failed, the server having terminated
> abnormally is just one of them.
>
> It's common to see this due to network issues, for example.  I've quite a few
> times fielded worried questions of postgres users due to the message.

Yeah, I agree. I used to think this hint was helpful, but it's gotten
less helpful as the years have passed, because the server is more
stable these days. Another thing that can cause this (as discussed in
Discord) is that the individual backend process can have died, but not
the server as a whole. In that case, the hint is only accurate if you
read "server" as meaning your individual server process.

I wonder if, in addition to removing the hint, we could also consider
rewording the message. For example, a slight rewording to "server
connection closed unexpectedly" would avoid implying that it was the
server that took action, which is correct, because it could be a
firewall in between the machines or even security software on the
client side.  Maybe there is some more dramatic rewording that is even
better, but there's probably some value in keeping it similar to what
people are used to seeing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2025-03-27 Thread Robert Haas
On Thu, Mar 27, 2025 at 6:12 AM Ilia Evdokimov
 wrote:
> I understand the concerns raised about the risk of opening the door to more 
> diagnostic detail across various plan nodes. However, in Hash Join, Merge 
> Join, and Nested Loop, EXPLAIN typically reveals at least some of the 
> planner’s expectations. For example, Hash Join shows the number of batches 
> and originally expected buckets, giving insight into whether the hash table 
> fit in memory. Merge Join shows unexpected Sort nodes when presorted inputs 
> were assumed. Nested Loop reflects planner assumptions via loops and row 
> estimates. In other words, these nodes expose at least some information about 
> what the planner thought would happen.
>
> Memoize is unique in that it shows runtime statistics (hits, misses, 
> evictions), but reveals nothing about the planner’s expectations. We don’t 
> see how many distinct keys were estimated or how many entries the planner 
> thought would fit in memory. This makes it very difficult to understand 
> whether Memoize was a good choice or not, or how to fix it when it performs 
> poorly.

Right. Without taking a strong position on this particular patch, I
think it's entirely reasonable, as a concept, to think of making
Memoize work similarly to what other nodes already do.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Test to dump and restore objects left behind by regression

2025-03-27 Thread Ashutosh Bapat
On Thu, Mar 27, 2025 at 6:01 PM vignesh C  wrote:
>
> On Tue, 25 Mar 2025 at 16:09, Ashutosh Bapat
>  wrote:
> >
> > On Mon, Mar 24, 2025 at 5:44 PM Alvaro Herrera  
> > wrote:
> > >
> > > On 2025-Mar-24, Ashutosh Bapat wrote:
> > >
> > > > One concern I have with directory format is the dumped database is not
> > > > readable. This might make investigating a bug identified by the test a
> > > > bit more complex.
> > >
> > > Oh, it's readable all right.  You just need to use `pg_restore -f-` to
> > > read it.  No big deal.
> > >
> > >
> > > So I ran this a few times:
> > > /usr/bin/time make -j8 -Otarget -C /pgsql/build/master check-world -s 
> > > PROVE_FLAGS="-c -j6" > /dev/null
> > >
> > > commenting out the call to test_regression_dump_restore() to test how
> > > much additional runtime does the new test incur.
> > >
> > > With test:
> > >
> > > 136.95user 116.56system 1:13.23elapsed 346%CPU (0avgtext+0avgdata 
> > > 250704maxresident)k
> > > 4928inputs+55333008outputs (114major+14784937minor)pagefaults 0swaps
> > >
> > > 138.11user 117.43system 1:15.54elapsed 338%CPU (0avgtext+0avgdata 
> > > 278592maxresident)k
> > > 48inputs+55333464outputs (80major+14794494minor)pagefaults 0swaps
> > >
> > > 137.05user 113.13system 1:08.19elapsed 366%CPU (0avgtext+0avgdata 
> > > 279272maxresident)k
> > > 48inputs+55330064outputs (83major+14758028minor)pagefaults 0swaps
> > >
> > > without the new test:
> > >
> > > 135.46user 114.55system 1:14.69elapsed 334%CPU (0avgtext+0avgdata 
> > > 145372maxresident)k
> > > 32inputs+55155256outputs (105major+14737549minor)pagefaults 0swaps
> > >
> > > 135.48user 114.57system 1:09.60elapsed 359%CPU (0avgtext+0avgdata 
> > > 148224maxresident)k
> > > 16inputs+55155432outputs (95major+14749502minor)pagefaults 0swaps
> > >
> > > 133.76user 113.26system 1:14.92elapsed 329%CPU (0avgtext+0avgdata 
> > > 148064maxresident)k
> > > 48inputs+55154952outputs (84major+14749531minor)pagefaults 0swaps
> > >
> > > 134.06user 113.83system 1:16.09elapsed 325%CPU (0avgtext+0avgdata 
> > > 145940maxresident)k
> > > 32inputs+55155032outputs (83major+14738602minor)pagefaults 0swaps
> > >
> > > The increase in duration here is less than a second.
> > >
> > >
> > > My conclusion with these numbers is that it's not worth hiding this test
> > > in PG_TEST_EXTRA.
> >
> > Thanks for the conclusion.
> >
> > On Mon, Mar 24, 2025 at 3:29 PM Daniel Gustafsson  wrote:
> > >
> > > > On 24 Mar 2025, at 10:54, Ashutosh Bapat  
> > > > wrote:
> > >
> > > > 0003 - same as 0002 in the previous patch set. It excludes statistics
> > > > from comparison, otherwise the test will fail because of bug reported
> > > > at [1]. Ideally we shouldn't commit this patch so as to test
> > > > statistics dump and restore, but in case we need the test to pass till
> > > > the bug is fixed, we should merge this patch to 0001 before
> > > > committing.
> > >
> > > If the reported bug isn't fixed before feature freeze I think we should 
> > > commit
> > > this regardless as it has clearly shown value by finding bugs (though 
> > > perhaps
> > > under PG_TEST_EXTRA or in some disconnected till the bug is fixed to 
> > > limit the
> > > blast-radius in the buildfarm).
> >
> > Combining Alvaro's and Daniel's recommendations, I think we should
> > squash all the three of my patches while committing the test if the
> > bug is not fixed by then. Otherwise we should squash first two patches
> > and commit it. Just attaching the patches again for reference.
>
> Couple of minor thoughts:
> 1) I felt this error message is not conveying the error message correctly:
> +   if ($src_node->pg_version != $dst_node->pg_version
> +   or defined $src_node->{_install_path})
> +   {
> +   fail("same version dump and restore test using default
> installation");
> +   return;
> +   }
>
> how about something like below:
> fail("source and destination nodes must have the same PostgreSQL
> version and default installation paths");

The text in ok(), fail() etc. are test names and not error messages.
See [1]. Your suggestion and other versions that I came up with became
too verbose to be test names. So I think the text here is compromise
between conveying enough information and not being too long. We
usually have to pick the testname and lookup the test code to
investigate the failure. This text serves that purpose.

>
> 2) Should "`" be ' or " here, we   generally use "`" to enclose commands:
> +# It is expected that regression tests, which create `regression` database, 
> are
> +# run on `src_node`, which in turn, is left in running state. A fresh node is
> +# created using given `node_params`, which are expected to be the
> same ones used
> +# to create `src_node`, so as to avoid any differences in the databases.

Looking at prologues or some other functions, I see that we don't add
any decoration around the name of the argument. Hence dropped ``
altogether. Will post it with the next set of patches.

[1] https:/

Re: Remove restrictions in recursive query

2025-03-27 Thread Tom Lane
Renan Alves Fonseca  writes:
> The solution using GROUP BY in the recursive query is the following:

> with recursive t1(node,nb_path) as
>  (select 1,1::numeric
>union all
>   (select dag.target, sum(nb_path)
>   from t1 join dag on t1.node=dag.source
>   group by 1)
>  ) select sum(nb_path) from t1 join sink_nodes using (node) ;

This is not about the GROUP BY; it's about the SUM().
If you try this example you get

regression=# with recursive t1(node,nb_path) as
 (select 1,1::numeric
   union all
  (select dag.target, sum(nb_path)
  from t1 join dag on t1.node=dag.source
  group by 1)
 ) select sum(nb_path) from t1 join sink_nodes using (node) ;
ERROR:  aggregate functions are not allowed in a recursive query's recursive 
term
LINE 4:   (select dag.target, sum(nb_path)
  ^

The code says that that restriction is from the SQL spec, and
it seems to be correct as of SQL:2021.  7.17 <query expression>
SR 3)j)ix)5)C) says

C) WQEi shall not contain a <query specification> QS such that QS
   immediately contains a <table expression> TE that contains a <query name>
referencing WQNX and either of the following is true:

  I) TE immediately contains a  that contains a
 .

  II) QS immediately contains a <select list> SL that contains
      either a <set function specification>, or a <window function>, or both.

(<set function specification> is spec-ese for "aggregate function
call")

I don't know the SQL committee's precise reasoning for this
restriction, but I suspect it's because the recursive query
expansion is not well-defined in the presence of an aggregate.
The spec has an interesting comment at the bottom of sub-rule ix:

NOTE 310 — The restrictions insure that each WLEi, viewed as a
transformation of the query names of the stratum, is monotonically
increasing. According to Tarski’s fixed point theorem, this
insures that there is a fixed point. The General Rules use
Kleene’s fixed point theorem to define a sequence that converges
to the minimal fixed point.

regards, tom lane




Re: Adding skip scan (including MDAM style range skip scan) to nbtree

2025-03-27 Thread Alena Rybakina

Hi!

On 26.03.2025 02:45, Peter Geoghegan wrote:

On Sat, Mar 22, 2025 at 1:47 PM Peter Geoghegan  wrote:

Attached is v30, which fully replaces the pstate.prechecked
optimization with the new _bt_skip_ikeyprefix optimization (which now
appears in v30-0002-Lower-nbtree-skip-array-maintenance-overhead.patch,
and not in 0003-*, due to my committing the primscan scheduling patch
just now).

Attached is v31, which has a much-improved _bt_skip_ikeyprefix (which
I've once again renamed, this time to _bt_set_startikey).

I reviewed your first patch and noticed that you added the ability to
define new quals if the first column isn't used in the query.


I tried an example like this:

CREATE TABLE sales (
    id SERIAL PRIMARY KEY,
    region TEXT,
    product TEXT,
    year INT
);


INSERT INTO sales (region, product, year)
SELECT
    CASE WHEN i % 4 <= 1 THEN 'North'
         WHEN i % 4 <= 2 THEN 'South'
         WHEN i % 4 <= 3 THEN 'East'
         ELSE 'West' END,
    CASE WHEN random() > 0.5 THEN 'Laptop' ELSE 'Phone' END,
    2023
FROM generate_series(1, 100000) AS i;


vacuum analyze;

CREATE INDEX sales_idx ON sales (region, product, year);

EXPLAIN ANALYZE SELECT * FROM sales WHERE product = 'Laptop' AND year = 
2023;


master gives the query plan with SeqScan:

                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Seq Scan on sales  (cost=0.00..2137.00 rows=49703 width=19) (actual time=0.035..31.438 rows=50212.00 loops=1)
   Filter: ((product = 'Laptop'::text) AND (year = 2023))
   Rows Removed by Filter: 49788
   Buffers: shared hit=637
 Planning:
   Buffers: shared hit=35
 Planning Time: 0.695 ms
 Execution Time: 33.942 ms
(8 rows)


Your patch sets fair costs for this case, which helps the planner take the
index scan path into account, and in my opinion it is perfect!)


                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sales  (cost=652.46..2031.86 rows=49493 width=19) (actual time=18.039..33.723 rows=49767.00 loops=1)
   Recheck Cond: ((product = 'Laptop'::text) AND (year = 2023))
   Heap Blocks: exact=637
   Buffers: shared hit=642 read=50
   ->  Bitmap Index Scan on sales_idx  (cost=0.00..640.09 rows=49493 width=0) (actual time=17.756..17.756 rows=49767.00 loops=1)
         Index Cond: ((product = 'Laptop'::text) AND (year = 2023))
         Index Searches: 4
         Buffers: shared hit=5 read=50
 Planning:
   Buffers: shared hit=55 read=1
 Planning Time: 0.984 ms
 Execution Time: 36.940 ms


I think it would be useful to show that we used an index scan while
skipping the "region" column, and I assume we should also output how many
distinct values the "region" column had.


For example, it could look like this: "Skip Scan on region (4 distinct
values)":


                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sales  (cost=652.46..2031.86 rows=49493 width=19) (actual time=18.039..33.723 rows=49767.00 loops=1)
   Recheck Cond: ((product = 'Laptop'::text) AND (year = 2023))
   Heap Blocks: exact=637
   Buffers: shared hit=642 read=50
   ->  Bitmap Index Scan on sales_idx  (cost=0.00..640.09 rows=49493 width=0) (actual time=17.756..17.756 rows=49767.00 loops=1)
         Index Cond: ((product = 'Laptop'::text) AND (year = 2023))
         Skip Scan on region (4 distinct values)
         Index Searches: 4
         Buffers: shared hit=5 read=50
 Planning:
   Buffers: shared hit=55 read=1
 Planning Time: 0.984 ms
 Execution Time: 36.940 ms


What do you think?

I didn't see any regression tests. Maybe we should add some tests? To be 
honest I didn't see it mentioned in the commit message but I might have 
missed something.


--
Regards,
Alena Rybakina
Postgres Professional


Re: read stream on amcheck

2025-03-27 Thread Matheus Alcantara
On Thu, Mar 27, 2025 at 4:42 PM Melanie Plageman
 wrote:
>
> On Thu, Mar 27, 2025 at 2:46 PM Matheus Alcantara
>  wrote:
> >
> > Just my 0.2 cents. I also like the first approach even though I prefer
> > the v4 version, but anyway, thanks very much for reviewing and
> > committing!
>
> Thanks for the patch!
>
> FWIW, I strongly disliked about v4 that two separate parts of the
> callback were responsible for advancing current_blocknum, one to
> advance it past blocks we chose to skip and the other to advance it to
> the next block.
>
>for (; p->current_blocknum < p->last_exclusive; p->current_blocknum++)
> and
>  if (p->current_blocknum < p->last_exclusive)
>   return p->current_blocknum++;
>
> I found that alone to be undesirable, but once you add in
> SKIP_PAGES_NONE, I think it was very hard to understand.
>
> Besides that, when we implemented streaming read sequential scan, we
> ended up making dedicated callbacks for the parallel and non-parallel
> cases (see heap_scan_stream_read_next_parallel and
> heap_scan_stream_read_next_serial) because it performed better than a
> single combined callback with a branch. I didn't validate that amcheck
> got the same performance benefit from the dedicated callbacks, but I
> don't see why it would be any different.

Yeah, it totally makes sense to me now, thanks very much for the details!
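
For illustration, a dedicated serial callback in that style can be this
small (a sketch only; the private-state struct here is assumed, mirroring
the two fields in the snippets quoted above):

```c
#include "postgres.h"

#include "storage/block.h"
#include "storage/read_stream.h"

typedef struct AmcheckReadStreamPrivate
{
	BlockNumber current_blocknum;
	BlockNumber last_exclusive;
} AmcheckReadStreamPrivate;

/* Serial, no-skipping case: exactly one place advances the position. */
static BlockNumber
amcheck_read_stream_next_serial(ReadStream *stream,
								void *callback_private_data,
								void *per_buffer_data)
{
	AmcheckReadStreamPrivate *p = callback_private_data;

	if (p->current_blocknum < p->last_exclusive)
		return p->current_blocknum++;

	return InvalidBlockNumber;
}
```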

-- 
Matheus Alcantara




Re: Adding skip scan (including MDAM style range skip scan) to nbtree

2025-03-27 Thread Peter Geoghegan
On Mon, Mar 17, 2025 at 6:51 PM Matthias van de Meent
 wrote:
> This hasn't changed meaningfully in this patch, but I noticed that
> pstate.finaltup is never set for the final page of the scan direction
> (i.e. P_RIGHTMOST or P_LEFTMOST for forward or backward,
> respectively). If it isn't used more than once after the first element
> of non-P_RIGHTMOST/LEFTMOST pages, why is it in pstate? Or, if it is
> used more than once, why shouldn't it be used in

We don't set pstate.finaltup on either the leftmost or rightmost page
(nothing new, this is all from the Postgres 17 SAOP patch) because we
cannot possibly need to start another primitive index scan from that
point. We only use pstate.finaltup to determine if we should start a
new primitive index scan, every time the scan's array keys advance
(except during "sktrig_required=false" array advancement, and barring
the case where we don't have to check pstate.finaltup because we see
that the new set of array keys are already an exact match for the
tuple passed to _bt_advance_array_keys). In short, pstate.finaltup is
used during a significant fraction of all calls to
_bt_advance_array_keys, and there is no useful limit on the number of
times that pstate.finaltup can be used per _bt_readpage.

Although you didn't say it in your email (you said it to me on our
most recent call), I believe that you're asking about pstate.finaltup
for reasons that are more related to 0003-* than to 0001-*. As I
recall, you asked me about why it was that the 0003-* patch avoids
_bt_readpage's call to _bt_skip_ikeyprefix on the leftmost or
rightmost leaf page (i.e. a page that lacks a pstate.finaltup). You
were skeptical about that -- understandably so.

To recap, 0003-* avoids calling _bt_skip_ikeyprefix on the rightmost
(and leftmost) page because it reasons that activating that
optimization is only okay when we know that we'll be able to "recover"
afterwards, during the finaltup _bt_checkkeys call. By "recover", I
mean restore the invariant for required array keys: that they track
our progress through the key space. It felt safer and easier for
0003-* to just not call _bt_skip_ikeyprefix on a page that has no
finaltup -- that way we won't have anything that we need to recover
from, when we lack the pstate.finaltup that we generally expect to use
for this purpose. This approach allowed me to add the following
assertions a little bit later on, right at the end of _bt_readpage
(quoting 0003-* here):

@@ -1993,6 +2031,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection
dir, OffsetNumber offnum,
so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
}

+   /*
+* As far as our caller is concerned, the scan's arrays always track its
+* progress through the index's key space.
+*
+* If _bt_skip_ikeyprefix told us to temporarily treat all scan keys as
+* nonrequired (during a skip scan), then we must recover afterwards by
+* advancing our arrays using finaltup (with !pstate.forcenonrequired).
+*/
+   Assert(pstate.ikey == 0 && !pstate.forcenonrequired);
+
return (so->currPos.firstItem <= so->currPos.lastItem);
 }

I've actually come around to your point of view on this (or what I
thought was your PoV from our call). That is, I now *think* that it
would be better if the code added by 0003-* called
_bt_skip_ikeyprefix, without regard for whether or not we'll have a
finaltup _bt_checkkeys call to "recover" (i.e. whether we're on the
leftmost or rightmost page shouldn't matter).

My change in perspective on this question is related to another change
of perspective, on the question of whether we actually need to call
_bt_start_array_keys as part of "recovering/restoring the array
invariant", just ahead of the finaltup _bt_checkkeys call. As you
know, 0003-* calls _bt_start_array_keys in this way, but that now
seems like overkill. It can have undesirable side-effects when the
array keys spuriously appear to advance, when in fact they were
restarted via the _bt_start_array_keys call, only to have their
original values restored via the finaltup call that immediately
follows.

Here's what I mean about side-effects:

We shouldn't allow _bt_advance_array_keys' primitive index scan
scheduling logic to falsely believe that the array keys have advanced,
when they didn't really advance in any practical sense. The commit
message of 0003-* promises that its optimization cannot influence
primitive index scan scheduling at all, which seems like a good
general principle for us to follow. But I recently discovered that
that promise was subtly broken, which I tied to the way that 0003-*
calls _bt_start_array_keys (this didn't produce wrong answers, and
didn't even really hurt performance, but it seems wonky and hard to
test). So I now think that I need to expect more from
_bt_skip_ikeyprefix and its pstate.forcenonrequired optimization, so
that my promise about primscan scheduling is honored.

Tying it back to your concern, once I do that (once I stop calling
_bt_start_array

Bringing non-atomic operation to a PL that hasn't had it

2025-03-27 Thread Chapman Flack
Hi,

If a maintainer of a 20-year-old PL that never supported non-atomic
operation wants to catch up, it is easy enough to find the documentation
on SPI_connect_ext and SPI_OPT_NONATOMIC, and learn how to determine
the atomic/non-atomic state on entry from the atomic field of
InlineCodeBlock for a DO, or of CallContext for a CALL. So for a start,
noting that flag and passing it on to SPI_connect_ext seems straightforward.
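
For a concrete starting point, the "pass the flag on" part can look roughly
like this in an inline (DO) handler. This is a sketch under assumptions: the
language name is made up, error handling is minimal, and the body only shows
where codeblock->atomic, SPI_OPT_NONATOMIC and allow_nonatomic meet.

```c
#include "postgres.h"

#include "executor/spi.h"
#include "fmgr.h"
#include "nodes/parsenodes.h"

PG_FUNCTION_INFO_V1(myplang_inline_handler);

Datum
myplang_inline_handler(PG_FUNCTION_ARGS)
{
	InlineCodeBlock *codeblock = (InlineCodeBlock *) PG_GETARG_POINTER(0);
	bool		nonatomic = !codeblock->atomic;

	/* Let SPI know whether SPI_commit/SPI_rollback will be allowed. */
	if (SPI_connect_ext(nonatomic ? SPI_OPT_NONATOMIC : 0) != SPI_OK_CONNECT)
		elog(ERROR, "SPI_connect_ext failed");

	/*
	 * When running nested query text, pass the state along so that e.g. a
	 * nested "DO LANGUAGE plpgsql ..." may itself commit or roll back.
	 */
	{
		SPIExecuteOptions options = {0};

		options.allow_nonatomic = nonatomic;
		SPI_execute_extended(codeblock->source_text, &options);
	}

	SPI_finish();
	PG_RETURN_VOID();
}
```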

The questions seem to quickly outrun the documentation though. I think I can
safely say a few things:

- Even in non-atomic mode, SPI only allows transaction control via its
  dedicated SPI_commit / SPI_rollback functions, and never by passing
  a transaction utility command as query text.

That seems unobjectionable once understood - avoid having many ways to
do the same thing, and require use of the dedicated functions for the purpose.

- SPI_execute_extended and SPI_execute_plan_extended *also* have an
  allow_nonatomic option. That option does not make them any more tolerant
  of a transaction command in query text: they still reject that. But they
  will pass the nonatomic state along to a nested PL invocation. So they
  still won't accept "ROLLBACK", but by passing true for allow_nonatomic
  you can get "DO LANGUAGE plpgsql 'BEGIN ROLLBACK; END;'" past them.

That naturally leads to the question of when I ought to pass allow_nonatomic
to those functions, which seems to be a thornier question than just "did
you pass SPI_OPT_NONATOMIC to SPI_connect_ext?". (Otherwise, why couldn't
SPI just do that bookkeeping itself?)

git blame on the allow_nonatomic option leads to a long discussion thread
on explicitly managing snapshots and such, which got me wondering how far
into those weeds a PL maintainer has to trek to do something that works.

Has anyone here blogged or written more extensively than what's in the docs
on just how to approach bringing old PL code up to date with this feature?


I can confirm one piece of unexpected thorny behavior. Where dosql is
a dead-simple toy language I cooked up[1]:

# do language dosql 'do language plpgsql ''begin rollback; end;''';
ERROR:  plancache reference 0x36e1a88 is not owned by resource owner
PL/pgSQL simple expressions
CONTEXT:  SQL statement "do language plpgsql 'begin rollback; end;'"

This error is reported after the rollback succeeded, PL/pgSQL is all done
and control has returned to _SPI_execute_plan. So why is anything referring
to a PL/pgSQL resource owner?

Turns out I had passed "do language plpgsql 'begin rollback; end;'" to SPI
as a saved plan. (Why? Because the 20-year-old code I started with saves
plans.) I didn't supply a ResourceOwner of my choice in the call, so
_SPI_execute_plan assigned plan_owner = CurrentResourceOwner before calling
GetCachedPlan.

The work then proceeds smoothly, rollback gets executed, lots of things
go away including the ResourceOwner that plan_owner points to, and in
a fabulous (but, for me, quite repeatable) coincidence, the next thing
to be allocated at that exact address is a new ResourceOwner for
PL/pgSQL simple expressions.

So when control gets back to _SPI_execute_plan and it uses the
technically-now-dangling plan_owner pointer to ask the owner to forget
the plan, instead of a crash it gets the PL/pgSQL simple expressions
owner complaining that it's never heard of the plan it's being asked
to forget.


So I suppose one moral I should take from that is "never pass
allow_nonatomic unless you're sure you're not passing a saved plan"?

Or is it "... you're sure you don't have any saved plans?"

And how many other morals like that lie ahead for me to discover?


Also, would it be worthwhile at all to make this behavior in
_SPI_execute_plan a little less perplexing? It does have plan->saved
and allow_nonatomic right there in scope, so it could easily ereport(
"you've passed allow_nonatomic and a saved plan, so you haven't thought
hard enough about this, come back when you've read document X"), and
that could simplify the learning process. Especially if there is
such a document X somewhere.

Regards,
-Chap


[1]
https://github.com/tada/pljava/blob/feature/REL1_7_STABLE/model/pljava-examples/src/main/java/org/postgresql/pljava/example/polyglot/DoSQL.java




Re: Reduce "Var IS [NOT] NULL" quals during constant folding

2025-03-27 Thread Richard Guo
On Wed, Mar 26, 2025 at 6:45 PM David Rowley  wrote:
> On Wed, 26 Mar 2025 at 19:31, Richard Guo  wrote:
> > On Wed, Mar 26, 2025 at 3:06 PM Tender Wang  wrote:
> > > The comment about  notnullattnums in struct RangeTblEntry says that:
> > > * notnullattnums is zero-based set containing attnums of NOT NULL
> > > * columns.
> > >
> > > But in get_relation_notnullatts():
> > > rte->notnullattnums = bms_add_member(rte->notnullattnums,
> > > i + 
> > > 1);
> > >
> > > The notnullattnums seem to be 1-based.

> > This corresponds to the attribute numbers in Var nodes; you can
> > consider zero as representing a whole-row Var.

> Yeah, and a negative number is a system attribute, which the Bitmapset
> can't represent... The zero-based comment is meant to inform the
> reader that they don't need to offset by
> FirstLowInvalidHeapAttributeNumber when indexing the Bitmapset. If
> there's some confusion about that then maybe the wording could be
> improved. I used "zero-based" because I wanted to state what it was
> and that was the most brief terminology that I could think of to do
> that. The only other way I thought about was to say that "it's not
> offset by FirstLowInvalidHeapAttributeNumber", but I thought it was
> better to say what it is rather than what it isn't.
>
> I'm open to suggestions if people are confused about this.

I searched the current terminology used in code and can find "offset
by FirstLowInvalidHeapAttributeNumber", but not "not offset by
FirstLowInvalidHeapAttributeNumber".  I think "zero-based" should be
sufficient to indicate that this bitmapset is offset by zero, not by
FirstLowInvalidHeapAttributeNumber.  So I'm fine to go with
"zero-based".

I'm planning to push this patch soon, barring any objections.

Thanks
Richard



