Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-14 Thread shveta malik
On Thu, Mar 14, 2024 at 10:57 AM Amit Kapila  wrote:
>
> On Thu, Mar 14, 2024 at 5:57 AM Masahiko Sawada  wrote:
> >
> > This fact makes me think that the slotsync worker might be able to
> > accept the primary_conninfo value even if there is no dbname in the
> > value. That is, if there is no dbname in the primary_conninfo, it uses
> > the username in accordance with the specs of the connection string.
> > Currently, the slotsync worker connects to the local database first
> > and then establishes the connection to the primary server. But if we
> > can reverse the two steps, it can get the dbname that has actually
> > been used to establish the remote connection and use it for the local
> > connection too. That way, the primary_conninfo generated by
> > pg_basebackup could work even without the patch. For example, if the
> > OS user executing pg_basebackup is 'postgres', the slotsync worker
> > would connect to the postgres database. Given the 'postgres' database
> > is created by default and 'postgres' OS user is used in common, I
> > guess it could cover many cases in practice actually.
> >
>
> I think this is worth investigating but I suspect that in most cases
> users will end up using a replication connection without specifying
> the user name and we may not be able to give a meaningful error
> message when slotsync worker won't be able to connect. The same will
> be true even when the dbname same as the username would be used.
>

I attempted the change as suggested by Sawada-San. Attached is the PoC
patch. For it to work, I had to expose a new "get" API in libpq-fe
which gets the dbname from the stream connection. Please have a look.
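
To illustrate the idea (this is not the PoC itself, just a standalone
sketch of the libpq behaviour it relies on): when no dbname is given,
libpq falls back to the user name, and the name that was actually used
can be read back from the established connection with the existing
PQdb() function. The host/port/user values below are only illustrative,
taken from the examples further down:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* connection options here are purely for illustration */
	PGconn	   *conn = PQconnectdb("host=127.0.0.1 port=5433 user=bckp_user");

	if (PQstatus(conn) == CONNECTION_OK)
		printf("dbname libpq resolved: %s\n", PQdb(conn));	/* "bckp_user" */
	else
		fprintf(stderr, "%s", PQerrorMessage(conn));

	PQfinish(conn);
	return 0;
}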

Without this PoC patch, the errors in the slot-sync worker are:

-
a) If dbname is missing:
[1230932] LOG:  slot sync worker started
[1230932] ERROR:  slot synchronization requires dbname to be specified
in primary_conninfo

b) If specified db does not exist
[1230913] LOG:  slot sync worker started
[1230913] FATAL:  database "postgres1" does not exist
-

Now with this patch:
-
a) If a database with the same name as the user does not exist:
[1232473] LOG:  slot sync worker started
[1232473] ERROR:  could not connect to the primary server: connection
to server at "127.0.0.1", port 5433 failed: FATAL: database
"bckp_user" does not exist

b) If the user itself is removed from primary_conninfo, libpq by
default takes the OS user who authenticated, and gives an error if a
database of the same name does not exist:
ERROR:  could not connect to the primary server: connection to server
at "127.0.0.1", port 5433 failed: FATAL:  database "shveta" does not
exist
-

The errors in the second case look slightly confusing to me.

thanks
Shveta


v1-0001-Use-user-as-dbname-for-slot-sync.patch
Description: Binary data


Re: Using the %m printf format more

2024-03-14 Thread Michael Paquier
On Wed, Mar 13, 2024 at 02:33:52PM +0100, Peter Eisentraut wrote:
> The 0002 patch looks sensible.  It would be good to fix that, otherwise it
> could have some confusing outcomes in the future.

You mean if we begin to use %m in future callers of
emit_tap_output_v(), hypothetically?  That's a fair argument.
--
Michael




Re: May be BUG. Periodic burst growth of the checkpoint_req counter on replica.

2024-03-14 Thread Anton A. Melnikov

On 14.03.2024 03:19, Alexander Korotkov wrote:


Pushed.



Thanks!


--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: LogwrtResult contended spinlock

2024-03-14 Thread Bharath Rupireddy
On Mon, Mar 4, 2024 at 9:15 PM Bharath Rupireddy
 wrote:
>
> > 0003:
> >
> > * We need to maintain the invariant that Copy >= Write >= Flush. I
> > believe that's always satisfied, because the
> > XLogWaitInsertionsToFinish() is always called before XLogWrite(). But
> > we should add an assert or runtime check of this invariant somewhere.
>
> Yes, that invariant is already maintained by the server. Although I
> don't fully agree, I added an assertion to WaitXLogInsertionsToFinish
> after updating XLogCtl->LogwrtResult.Copy. CF bot is happy with it -
> https://github.com/BRupireddy2/postgres/tree/atomic_LogwrtResult_v13.

I've now separated these invariants out into the 0004 patch.

With the assertions placed in WaitXLogInsertionsToFinish after
updating Copy ptr, I observed the assertion failing in one of the CF
bot machines - https://cirrus-ci.com/build/6202112288227328. I could
reproduce it locally with [1]. I guess the reason is that the Write
and Flush ptrs are now updated independently and atomically without
lock, they might drift and become out-of-order for a while if
concurrently they are accessed in WaitXLogInsertionsToFinish. So, I
guess the right place to verify the invariant Copy >= Write >= Flush
is in XLogWrite once Write and Flush ptrs in shared memory are updated
(note that only one process at a time can do this). Accordingly, I've
moved the assertions to XLogWrite in the attached v14-0004 patch.
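
As an aside for readers: a lock-free monotonic advance of a shared
64-bit position can be sketched with the existing atomics API roughly
as below. This is only an illustration of the mechanism, not the code
from the v14-0001 patch:

#include "postgres.h"
#include "port/atomics.h"

/*
 * Advance *ptr to target unless it is already at or past it.  Concurrent
 * callers may interleave, but the value never moves backwards, which is
 * what keeps each of the Copy/Write/Flush positions individually
 * monotonic even without a lock.
 */
static inline void
monotonic_advance_u64(pg_atomic_uint64 *ptr, uint64 target)
{
	uint64		currval = pg_atomic_read_u64(ptr);

	while (currval < target)
	{
		/* on failure, currval is refreshed with the current value */
		if (pg_atomic_compare_exchange_u64(ptr, &currval, target))
			break;
	}
}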

> Please see the attached v13 patch set for further review.

Earlier versions of the patches removed a piece of code ensuring
shared WAL 'request' values did not fall behind the 'result' values.
There's a good reason for us to have it. So, I restored it.

-   /*
-* Update shared-memory status
-*
-* We make sure that the shared 'request' values do not fall behind the
-* 'result' values.  This is not absolutely essential, but it saves some
-* code in a couple of places.
-*/

Please see the attached v14 patch set.

[1] for i in {1..100}; do make check
PROVE_TESTS="t/027_stream_regress.pl"; if [ $? -ne 0 ]; then echo "The
command failed on iteration $i"; break; fi; done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v14-0001-Add-monotonic-advancement-functions-for-atomics.patch
Description: Binary data


v14-0002-Make-XLogCtl-LogwrtResult-accessible-with-atomic.patch
Description: Binary data


v14-0003-Add-Copy-pointer-to-track-data-copied-to-WAL-buf.patch
Description: Binary data


v14-0004-Add-invariants-for-shared-LogwrtResult-members.patch
Description: Binary data


Re: type cache cleanup improvements

2024-03-14 Thread Michael Paquier
On Wed, Mar 13, 2024 at 04:40:38PM +0300, Teodor Sigaev wrote:
> Done, all patches should be applied consequentially.

One thing that first pops out to me is that we can do the refactor of
hash_initial_lookup() as an independent piece, without the extra paths
introduced.  But rather than returning the bucket hash and having the
bucket number as an in/out argument of hash_initial_lookup(), there is
an argument for reversing them: hash_search_with_hash_value() does not
care about the bucket number.

> 02-hash_seq_init_with_hash_value.v5.patch - introduces a
> hash_seq_init_with_hash_value() method. hash_initial_lookup() is marked as
> inline, but I suppose, modern compilers are smart enough to inline it
> automatically.

Likely so, though it does not hurt to show the intention to the
reader.

So I would like to suggest the attached patch for this first piece.
What do you think?

It may also be an idea to use `git format-patch` when generating a
series of patches.  That makes for easier reviews.
--
Michael
From 6b1fe126b9f72ff27aca08128948f4e617ba70dd Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 14 Mar 2024 16:35:40 +0900
Subject: [PATCH] Refactor initial hash lookup in dynahash.c

---
 src/backend/utils/hash/dynahash.c | 75 ++-
 1 file changed, 33 insertions(+), 42 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index a4152080b5..e1bd92a01c 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -273,6 +273,8 @@ static void hdefault(HTAB *hashp);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
 static void hash_corrupted(HTAB *hashp);
+static uint32 hash_initial_lookup(HTAB *hashp, uint32 hashvalue,
+  HASHBUCKET **bucketptr);
 static long next_pow2_long(long num);
 static int	next_pow2_int(long num);
 static void register_seq_scan(HTAB *hashp);
@@ -972,10 +974,6 @@ hash_search_with_hash_value(HTAB *hashp,
 	HASHHDR*hctl = hashp->hctl;
 	int			freelist_idx = FREELIST_IDX(hctl, hashvalue);
 	Size		keysize;
-	uint32		bucket;
-	long		segment_num;
-	long		segment_ndx;
-	HASHSEGMENT segp;
 	HASHBUCKET	currBucket;
 	HASHBUCKET *prevBucketPtr;
 	HashCompareFunc match;
@@ -1008,17 +1006,7 @@ hash_search_with_hash_value(HTAB *hashp,
 	/*
 	 * Do the initial lookup
 	 */
-	bucket = calc_bucket(hctl, hashvalue);
-
-	segment_num = bucket >> hashp->sshift;
-	segment_ndx = MOD(bucket, hashp->ssize);
-
-	segp = hashp->dir[segment_num];
-
-	if (segp == NULL)
-		hash_corrupted(hashp);
-
-	prevBucketPtr = &segp[segment_ndx];
+	(void) hash_initial_lookup(hashp, hashvalue, &prevBucketPtr);
 	currBucket = *prevBucketPtr;
 
 	/*
@@ -1159,14 +1147,10 @@ hash_update_hash_key(HTAB *hashp,
 	 const void *newKeyPtr)
 {
 	HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry);
-	HASHHDR*hctl = hashp->hctl;
 	uint32		newhashvalue;
 	Size		keysize;
 	uint32		bucket;
 	uint32		newbucket;
-	long		segment_num;
-	long		segment_ndx;
-	HASHSEGMENT segp;
 	HASHBUCKET	currBucket;
 	HASHBUCKET *prevBucketPtr;
 	HASHBUCKET *oldPrevPtr;
@@ -1187,17 +1171,8 @@ hash_update_hash_key(HTAB *hashp,
 	 * this to be able to unlink it from its hash chain, but as a side benefit
 	 * we can verify the validity of the passed existingEntry pointer.
 	 */
-	bucket = calc_bucket(hctl, existingElement->hashvalue);
-
-	segment_num = bucket >> hashp->sshift;
-	segment_ndx = MOD(bucket, hashp->ssize);
-
-	segp = hashp->dir[segment_num];
-
-	if (segp == NULL)
-		hash_corrupted(hashp);
-
-	prevBucketPtr = &segp[segment_ndx];
+	bucket = hash_initial_lookup(hashp, existingElement->hashvalue,
+ &prevBucketPtr);
 	currBucket = *prevBucketPtr;
 
 	while (currBucket != NULL)
@@ -1219,18 +1194,7 @@ hash_update_hash_key(HTAB *hashp,
 	 * chain we want to put the entry into.
 	 */
 	newhashvalue = hashp->hash(newKeyPtr, hashp->keysize);
-
-	newbucket = calc_bucket(hctl, newhashvalue);
-
-	segment_num = newbucket >> hashp->sshift;
-	segment_ndx = MOD(newbucket, hashp->ssize);
-
-	segp = hashp->dir[segment_num];
-
-	if (segp == NULL)
-		hash_corrupted(hashp);
-
-	prevBucketPtr = &segp[segment_ndx];
+	newbucket = hash_initial_lookup(hashp, newhashvalue, &prevBucketPtr);
 	currBucket = *prevBucketPtr;
 
 	/*
@@ -1741,6 +1705,33 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 	return true;
 }
 
+/*
+ * Do initial lookup of a bucket for the given hash value, retrieving its
+ * bucket number and its hash bucket.
+ */
+static inline uint32
+hash_initial_lookup(HTAB *hashp, uint32 hashvalue, HASHBUCKET **bucketptr)
+{
+	HASHHDR*hctl = hashp->hctl;
+	HASHSEGMENT segp;
+	long		segment_num;
+	long		segment_ndx;
+	uint32		bucket;
+
+	bucket = calc_bucket(hctl, hashvalue);
+
+	segment_num = bucket >> hashp->sshift;
+	segment_ndx = MOD(bucket, hashp->ssize);
+
+	segp = hashp->dir[segment_num];
+
+	if (segp == NULL)
+		hash_corrupted(hashp);
+
+	*bucketptr = &segp[segment_ndx];
+	return bucket;
+}

Re: Support json_errdetail in FRONTEND builds

2024-03-14 Thread Michael Paquier
On Wed, Mar 13, 2024 at 11:20:16AM -0700, Jacob Champion wrote:
> Sounds good, split into v2-0002. (That gives me room to switch other
> call sites to the new API, too.) Thanks both!

That looks OK to me.  I can see 7~8 remaining sites where StringInfo
data is freed, like in the syslogger, but we should not touch the
StringInfo.

>>  case JSON_EXPECTED_STRING:
>> -return psprintf(_("Expected string, but found \"%s\"."),
>> -extract_token(lex));
>> +appendStringInfo(lex->errormsg,
>> + _("Expected string, but found \"%.*s\"."),
>> + toklen, lex->token_start);
>>
>> Maybe this could be wrapped in a macro to avoid copying around
>> token_start and toklen for all the error cases.
> 
> I gave it a shot in 0001; see what you think.

That's cleaner, thanks.
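
(For anyone reading along without the patches: such a wrapper could look
roughly like the following. This is only a guess at its shape, not
necessarily the code in the attached 0001; it assumes the format string
contains a single "%.*s" placeholder and that the patch adds an errormsg
StringInfo to JsonLexContext, as shown in the hunk above.)

/* append a translated message for the current token to lex->errormsg */
#define token_error(lex, format) \
	appendStringInfo((lex)->errormsg, _(format), \
					 (int) ((lex)->token_terminator - (lex)->token_start), \
					 (lex)->token_start)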

Hmm, I think that it is cleaner to create the new API first, and then
rely on it, reversing the order of both patches (perhaps the extra
destroyStringInfo in freeJsonLexContext() should have been moved in
0001).  See the attached with a few tweaks to 0001, previously 0002 @-@.
I'd still need to take a more serious look at 0002, previously 0001.
--
Michael
From 78bfbfe3cbdfd4095bc9ef220dfa11dfdc404556 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 14 Mar 2024 16:58:06 +0900
Subject: [PATCH v3 1/2] Add a helper function for cleaning up StringInfos

destroyStringInfo() is a counterpart to makeStringInfo(): it frees a
palloc'd StringInfo and its data. API originally by Daniel Gustafsson,
with docs and tweaks added by Jacob Champion.

Co-authored-by: Daniel Gustafsson 
---
 src/include/lib/stringinfo.h|  9 -
 src/backend/backup/basebackup.c |  3 +--
 src/backend/commands/subscriptioncmds.c |  3 +--
 src/backend/utils/adt/jsonb.c   |  3 +--
 src/backend/utils/adt/xml.c |  6 ++
 src/common/stringinfo.c | 18 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  5 +
 src/test/regress/pg_regress.c   |  3 +--
 8 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/src/include/lib/stringinfo.h b/src/include/lib/stringinfo.h
index 2cd636b01c..cd9632e3fc 100644
--- a/src/include/lib/stringinfo.h
+++ b/src/include/lib/stringinfo.h
@@ -87,7 +87,8 @@ typedef StringInfoData *StringInfo;
  *		to be len + 1 in size.
  *
  * To destroy a StringInfo, pfree() the data buffer, and then pfree() the
- * StringInfoData if it was palloc'd.  There's no special support for this.
+ * StringInfoData if it was palloc'd.  For StringInfos created with
+ * makeStringInfo(), destroyStringInfo() is provided for this purpose.
  * However, if the StringInfo was initialized using initReadOnlyStringInfo()
  * then the caller will need to consider if it is safe to pfree the data
  * buffer.
@@ -233,4 +234,10 @@ extern void appendBinaryStringInfoNT(StringInfo str,
  */
 extern void enlargeStringInfo(StringInfo str, int needed);
 
+/*
+ * destroyStringInfo
+ * Frees a StringInfo and its buffer (opposite of makeStringInfo()).
+ */
+extern void destroyStringInfo(StringInfo str);
+
 #endif			/* STRINGINFO_H */
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 5fbbe5ffd2..5fea86ad0f 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -397,8 +397,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 		endtli = backup_state->stoptli;
 
 		/* Deallocate backup-related variables. */
-		pfree(tablespace_map->data);
-		pfree(tablespace_map);
+		destroyStringInfo(tablespace_map);
 		pfree(backup_state);
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(do_pg_abort_backup, BoolGetDatum(false));
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index a05d69922d..5a47fa984d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -506,8 +506,7 @@ check_publications(WalReceiverConn *wrconn, List *publications)
 	appendStringInfoChar(cmd, ')');
 
 	res = walrcv_exec(wrconn, cmd->data, 1, tableRow);
-	pfree(cmd->data);
-	pfree(cmd);
+	destroyStringInfo(cmd);
 
 	if (res->status != WALRCV_OK_TUPLES)
 		ereport(ERROR,
diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index a5e48744ac..a941654d5a 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -133,8 +133,7 @@ jsonb_send(PG_FUNCTION_ARGS)
 	pq_begintypsend(&buf);
 	pq_sendint8(&buf, version);
 	pq_sendtext(&buf, jtext->data, jtext->len);
-	pfree(jtext->data);
-	pfree(jtext);
+	destroyStringInfo(jtext);
 
 	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
 }
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index beecd0c2ac..3e4ca874d8 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -2163,8 +2163,7 @@ xml_errorHandler(void *data, PgXmlErrorPtr 

Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-14 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 2:27 PM Amit Kapila  wrote:
>
> On Thu, Mar 14, 2024 at 5:57 AM Masahiko Sawada  wrote:
> >
> > This fact makes me think that the slotsync worker might be able to
> > accept the primary_conninfo value even if there is no dbname in the
> > value. That is, if there is no dbname in the primary_conninfo, it uses
> > the username in accordance with the specs of the connection string.
> > Currently, the slotsync worker connects to the local database first
> > and then establishes the connection to the primary server. But if we
> > can reverse the two steps, it can get the dbname that has actually
> > been used to establish the remote connection and use it for the local
> > connection too. That way, the primary_conninfo generated by
> > pg_basebackup could work even without the patch. For example, if the
> > OS user executing pg_basebackup is 'postgres', the slotsync worker
> > would connect to the postgres database. Given the 'postgres' database
> > is created by default and 'postgres' OS user is used in common, I
> > guess it could cover many cases in practice actually.
> >
>
> I think this is worth investigating but I suspect that in most cases
> users will end up using a replication connection without specifying
> the user name and we may not be able to give a meaningful error
> message when slotsync worker won't be able to connect. The same will
> be true even when the dbname same as the username would be used.

What do you mean by not being able to give a meaningful error message?

If the slotsync worker uses the user name as the dbname, and such a
database doesn't exist, the error message the user will get is
"database "test_user" does not exist". ISTM the same is true when the
user specifies the wrong database in the primary_conninfo.

>
> > Having said that, even with (or without) the above change, we might
> > want to change the pg_basebackup so that it writes the dbname to the
> > primary_conninfo if -R option is specified. Since the database where
> > the slotsync worker connects cannot be dropped while the slotsync
> > worker is running, the user might want to change the database to
> > connect, and it would be useful if they can do that using
> > pg_basebackup instead of modifying the configuration file manually.
> >
> > While the current approach makes sense to me, I'm a bit concerned that
> > we might end up having the pg_basebackup search the actual database
> > name (e.g. 'dbname=template1') from the .pgpass file instead of
> > 'dbname=replication'. As far as I tested on my environment, suppose
> > that I execute:
> >
> > pg_basebackup -D tmp -d "dbname=testdb" -R
> >
> > The pg_basebackup established a replication connection but looked for
> > the password of the 'testdb' database. This could be another
> > inconvenience for the existing users who want to use the slot
> > synchronization.
> >
>
> This is true because it is internally using a logical replication
> connection (as we will set replication=database).

Did you mean the pg_basebackup is using a logical replication
connection in this case? As far as I tested, even if we specify dbname
to the -d option of pg_basebackup, it uses a physical replication
connection. For example, it can take a backup even if I specify a
non-existing database name.

> > A random idea I came up with is, we add a new option to the
> > pg_basebackup to overwrite the full or some portion of the connection
> > string that is eventually written in the primary_conninfo in
> > postgresql.auto.conf. For example, the command:
> >
> > pg_basebackup -D tmp -d "host=1.1.1.1 port=" -R
> > --primary-conninfo-ext "host=2.2.2.2 dbname=postgres"
> >
> > will produce the connection string that is based on -d option value
> > but is overwritten by --primary-conninfo-ext option value, which will
> > be like:
> >
> > host=2.2.2.2 dbname=postgres port=
> >
> > This option might help not only for users who want to use the slotsync
> > worker but also for users who want to take a basebackup from a standby
> > but have the new standby connect to the primary.
> >
>
> Agreed, this could be another way though it would be good to get some
> inputs from users or otherwise about the preferred way to specify
> dbname. One can also imagine using the Alter System for this purpose.

Agreed.

>
> > But it's still just an idea and I might be missing something. And
> > given we're getting closer to the feature freeze, it would be a PG18
> > item.
> >
>
> +1. At this stage, it is important to discuss whether we should allow
> pg_basebackup to write dbname (either a specified one or a default one)
> along with other parameters in primary_conninfo?
>

True. While I basically agree that pg_basebackup writes dbname in
primary_conninfo, I'm concerned that writing "dbname=replication"
could be problematic. Quoting the case 3) Vignesh summarized before:

3) ./pg_basebackup -d "user=vignesh"  -D data -R
-> primary_conninfo = "dbname=replication"  (In this case
primary_conninf

Re: Fix the synopsis of pg_md5_hash

2024-03-14 Thread Daniel Gustafsson
> On 14 Mar 2024, at 07:02, Tatsuro Yamada  wrote:

> So, I created a patch to fix them.

Thanks, applied.

--
Daniel Gustafsson





Re: Built-in CTYPE provider

2024-03-14 Thread Peter Eisentraut

On 14.03.24 09:08, Jeff Davis wrote:

On Wed, 2024-03-13 at 00:44 -0700, Jeff Davis wrote:

New series attached. I plan to commit 0001 very soon.


Committed the basic builtin provider, supporting only the "C" locale.


As you were committing this, I had another review of 
v23-0001-Introduce-collation-provider-builtin.patch in progress.  Some 
of the things I found you have already addressed in what you committed. 
Please check the remaining comments.



* doc/src/sgml/charset.sgml

I don't understand the purpose of this sentence:

"When using this locale, the behavior may depend on the database encoding."


* doc/src/sgml/ref/create_database.sgml

The new parameter builtin_locale is not documented.


* src/backend/commands/collationcmds.c

I think DefineCollation() should set collencoding = -1 for the
COLLPROVIDER_BUILTIN case.  -1 stands for any encoding.  Or at least
explain why not?


* src/backend/utils/adt/pg_locale.c

This part is a bit confusing:

+   cache_entry->collate_is_c = true;
+   cache_entry->ctype_is_c = (strcmp(colllocale, "C") == 0);

Is collate always C but ctype only sometimes?  Does this anticipate
future patches in this series?  Maybe in this patch it should always
be true?


* src/bin/initdb/initdb.c

+   printf(_("  --builtin-locale=LOCALE   set builtin locale name 
for new databases\n"));


Put in a line break so that the right "column" lines up.

This output should line up better:

The database cluster will be initialized with this locale configuration:
  default collation provider:  icu
  default collation locale:en
  LC_COLLATE:  C
  LC_CTYPE:C
  ...

Also, why are there two spaces after "provider:  "?

Also, we call these "locale providers" on input; why are they "collation
providers" on output?  What is a "collation locale"?


* src/bin/pg_upgrade/t/002_pg_upgrade.pl

+if ($oldnode->pg_version >= '17devel')

This is weird.  >= is a numeric comparison, so providing a string with
non-digits is misleading at best.


* src/test/icu/t/010_database.pl

-# Test that LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE
-# are specified

Why remove this test?

+my ($ret, $stdout, $stderr) = $node1->psql('postgres',
+   q{CREATE DATABASE dbicu LOCALE_PROVIDER builtin LOCALE 'C' TEMPLATE 
dbicu}

+);

Change the name of the new database to be different from the name of
the template database.





Re: POC, WIP: OR-clause support for indexes

2024-03-14 Thread Alexander Korotkov
On Wed, Mar 13, 2024 at 2:16 PM Andrei Lepikhov 
wrote:
> On 13/3/2024 18:05, Alexander Korotkov wrote:
> > On Wed, Mar 13, 2024 at 7:52 AM Andrei Lepikhov
> > Given all of the above, I think moving transformation to the
> > canonicalize_qual() would be the right way to go.
> Ok, I will try to move the code.
> I have no idea about the timings so far. I recall the last time I got
> bogged down in tons of duplicated code. I hope with an almost-ready
> sketch, it will be easier.

Thank you!  I'll be looking forward to the updated patch.

I also have notes about the bitmap patch.

/*
 * Building index paths over SAOP clause differs from the logic of OR
clauses.
 * Here we iterate across all the array elements and split them to SAOPs,
 * corresponding to different indexes. We must match each element to an
index.
 */

This covers the case I posted before.  But in order to fix all possible
cases we probably need to handle the SAOP clause in the same way as OR
clauses.  Check also this case.

Setup
create table t (a int not null, b int not null, c int not null);
insert into t (select 1, 1, i from generate_series(1,10000) i);
insert into t (select i, 2, 2 from generate_series(1,10000) i);
create index t_a_b_idx on t (a, b);
create index t_b_c_idx on t (b, c);
create statistics t_a_b_stat (mcv) on a, b from t;
create statistics t_b_c_stat (mcv) on b, c from t;
vacuum analyze t;

Plan with enable_or_transformation = on:
# explain select * from t where a = 1 and (b = 1 or b = 2) and c = 2;
  QUERY PLAN
--
 Bitmap Heap Scan on t  (cost=156.55..440.56 rows=5001 width=12)
   Recheck Cond: (a = 1)
   Filter: ((b = ANY ('{1,2}'::integer[])) AND (c = 2))
   ->  Bitmap Index Scan on t_a_b_idx  (cost=0.00..155.29 rows=10001
width=0)
 Index Cond: (a = 1)
(5 rows)

Plan with enable_or_transformation = off:
# explain select * from t where a = 1 and (b = 1 or b = 2) and c = 2;
  QUERY PLAN
--
 Bitmap Heap Scan on t  (cost=11.10..18.32 rows=5001 width=12)
   Recheck Cond: (((b = 1) AND (c = 2)) OR ((a = 1) AND (b = 2)))
   Filter: ((a = 1) AND (c = 2))
   ->  BitmapOr  (cost=11.10..11.10 rows=2 width=0)
 ->  Bitmap Index Scan on t_b_c_idx  (cost=0.00..4.30 rows=1
width=0)
   Index Cond: ((b = 1) AND (c = 2))
 ->  Bitmap Index Scan on t_a_b_idx  (cost=0.00..4.30 rows=1
width=0)
   Index Cond: ((a = 1) AND (b = 2))
(8 rows)

As you can see, this case is not related to partial indexes.  There is
just no single index selective enough for the whole query.  However,
splitting the scan by the OR qual lets us use a combination of two
selective indexes.

--
Regards,
Alexander Korotkov


Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-14 Thread Jelte Fennema-Nio
On Wed, 13 Mar 2024 at 20:08, Jacob Champion
 wrote:
> I hit this on my machine. With the attached diff I can reproduce
> constantly (including with the most recent test patch); I think the
> cancel must be arriving between the bind/execute steps?

Nice find! Your explanation makes total sense. Attached a patchset
that fixes/works around this issue by using the simple query protocol
in the cancel test.


v38-0001-Use-simple-query-protocol-for-cancel-test.patch
Description: Binary data


v38-0003-Start-using-new-libpq-cancel-APIs.patch
Description: Binary data


v38-0002-Revert-Comment-out-noisy-libpq_pipeline-test.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-14 Thread John Naylor
On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 14, 2024 at 1:29 PM John Naylor  wrote:
> > Okay, here's an another idea: Change test_lookup_tids() to be more
> > general and put the validation down into C as well. First we save the
> > blocks from do_set_block_offsets() into a table, then with all those
> > blocks lookup a sufficiently-large range of possible offsets and save
> > found values in another array. So the static items structure would
> > have 3 arrays: inserts, successful lookups, and iteration (currently
> > the iteration output is private to check_set_block_offsets(). Then
> > sort as needed and check they are all the same.
>
> That's a promising idea. We can use the same mechanism for randomized
> tests too. If you're going to work on this, I'll do other tests on my
> environment in the meantime.

Some progress on this in v72 -- I tried first without using SQL to
save the blocks, just using the unique blocks from the verification
array. It seems to work fine. Some open questions on the test module:

- Since there are now three arrays we should reduce max bytes to
something smaller.
- Further on that, I'm not sure if the "is full" test is telling us
much. It seems we could make max bytes a static variable and set it to
the size of the empty store. I'm guessing it wouldn't take much to add
enough tids so that the contexts need to allocate some blocks, and
then it would appear full and we can test that. I've made it so all
arrays repalloc when needed, just in case.
- Why are we switching to TopMemoryContext? It's not explained -- the
comment only tells what the code is doing (which is obvious), but not
why.
- I'm not sure it's useful to keep test_lookup_tids() around. Since we
now have a separate lookup test, the only thing it can tell us is that
lookups fail on an empty store. I arranged it so that
check_set_block_offsets() works on an empty store. Although that's
even more trivial, it's just reusing what we already need.




Re: Support json_errdetail in FRONTEND builds

2024-03-14 Thread Daniel Gustafsson
> On 14 Mar 2024, at 09:06, Michael Paquier  wrote:

> I think that it is cleaner to create the new API first, and then
> rely on it, reversing the order of both patches

I agree with this ordering.

> (perhaps the extra destroyStringInfo in freeJsonLexContext() should
> have been moved in 0001).

I wouldn't worry about that, seems fine as is to me.

>  See the attached with few tweaks to 0001, previously 0002 @-@.
> I'd still need to do a more serious lookup of 0002, previously 0001.

A few small comments:

- *
+*
Whitespace


+   /* don't allow destroys of read-only StringInfos */
+   Assert(str->maxlen != 0);
Considering that stringinfo.c doesn't own the memory here, I think it's
warranted to turn this assert into an elog() to avoid the risk of
use-after-free bugs.


+ * The returned allocation is either static or owned by the JsonLexContext and
+ * should not be freed.
The most important part of that comment is at the very end; to help readers I
would reword this to just "The returned pointer should not be freed.", or at
least put that part first.


+#define token_error(lex, format) \
I'm not sure this buys much more than reduced LoC, the expansion isn't
unreadable to the point that the format constraint encoded in the macro is
worth it IMO.

--
Daniel Gustafsson





Re: POC, WIP: OR-clause support for indexes

2024-03-14 Thread Andrei Lepikhov

On 14/3/2024 16:31, Alexander Korotkov wrote:
On Wed, Mar 13, 2024 at 2:16 PM Andrei Lepikhov 
mailto:a.lepik...@postgrespro.ru>> wrote:

 > On 13/3/2024 18:05, Alexander Korotkov wrote:
 > > On Wed, Mar 13, 2024 at 7:52 AM Andrei Lepikhov
 > > Given all of the above, I think moving transformation to the
 > > canonicalize_qual() would be the right way to go.
 > Ok, I will try to move the code.
 > I have no idea about the timings so far. I recall the last time I got
 > bogged down in tons of duplicated code. I hope with an almost-ready
 > sketch, it will be easier.

Thank you!  I'll be looking forward to the updated patch.
Okay, I moved the 0001-* patch to the prepqual.c module. See it in the 
attachment. I treat it as a transient patch.

It has positive outcomes as well as negative ones.
The most damaging result can be seen in the partition_prune test:
partition pruning, in some cases, moved to the executor initialization
stage. I guess we should avoid that somehow in the next version.


--
regards,
Andrei Lepikhov
Postgres Professional
From 170f6871540025d0d1683750442e7af902b11a40 Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Fri, 2 Feb 2024 22:01:09 +0300
Subject: [PATCH 1/2] Transform OR clauses to ANY expression.

Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...])
at the preliminary stage of optimization, when we are still working with the
expression tree.
Here C is a constant expression, 'expr' is a non-constant expression, and 'op'
is an operator which returns a boolean result and has a commuter (for the case
of reversed order of the constant and non-constant parts of the expression,
like 'CX op expr').
Sometimes this can lead to a suboptimal plan, but we think it is better to
have an array of elements instead of a lot of OR clauses. There is room for
further optimization by decomposing that array into more optimal parts.
Authors: Alena Rybakina , Andrey Lepikhov 

Reviewed-by: Peter Geoghegan , Ranier Vilela 
Reviewed-by: Alexander Korotkov , Robert Haas 

Reviewed-by: jian he 
---
 .../postgres_fdw/expected/postgres_fdw.out|   8 +-
 doc/src/sgml/config.sgml  |  17 +
 src/backend/nodes/queryjumblefuncs.c  |  27 ++
 src/backend/optimizer/prep/prepqual.c | 371 +-
 src/backend/utils/misc/guc_tables.c   |  11 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/nodes/queryjumble.h   |   1 +
 src/include/optimizer/optimizer.h |   2 +
 src/test/regress/expected/create_index.out| 156 +++-
 src/test/regress/expected/join.out|  62 ++-
 src/test/regress/expected/partition_prune.out | 235 +--
 src/test/regress/expected/stats_ext.out   |  12 +-
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/expected/tidscan.out |  19 +-
 src/test/regress/sql/create_index.sql |  35 ++
 src/test/regress/sql/join.sql |  10 +
 src/test/regress/sql/partition_prune.sql  |  22 ++
 src/test/regress/sql/tidscan.sql  |   6 +
 src/tools/pgindent/typedefs.list  |   2 +
 19 files changed, 939 insertions(+), 61 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out 
b/contrib/postgres_fdw/expected/postgres_fdw.out
index 58a603ac56..a965b43cc6 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -8838,18 +8838,18 @@ insert into utrtest values (2, 'qux');
 -- Check case where the foreign partition is a subplan target rel
 explain (verbose, costs off)
 update utrtest set a = 1 where a = 1 or a = 2 returning *;
- QUERY PLAN
 
-
+  QUERY PLAN   
   
+--
  Update on public.utrtest
Output: utrtest_1.a, utrtest_1.b
Foreign Update on public.remp utrtest_1
Update on public.locp utrtest_2
->  Append
  ->  Foreign Update on public.remp utrtest_1
-   Remote SQL: UPDATE public.loct SET a = 1 WHERE (((a = 1) OR (a 
= 2))) RETURNING a, b
+   Remote SQL: UPDATE public.loct SET a = 1 WHERE ((a = ANY 
('{1,2}'::integer[]))) RETURNING a, b
  ->  Seq Scan on public.locp utrtest_2
Output: 1, utrtest_2.tableoid, utrtest_2.ctid, NULL::record
-   Filter: ((utrtest_2.a = 1) OR (utrtest_2.a = 2))
+   Filter: (utrtest_2.a = ANY ('{1,2}'::integer[]))
 (10 rows)
 
 -- The new values are concatenated with ' triggered !'
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c408..2de6ae301a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml

Re: remaining sql/json patches

2024-03-14 Thread jian he
one more question...
SELECT  JSON_value(NULL::int, '$' returning int);
ERROR:  cannot use non-string types with implicit FORMAT JSON clause
LINE 1: SELECT  JSON_value(NULL::int, '$' returning int);
   ^

SELECT  JSON_query(NULL::int, '$' returning int);
ERROR:  cannot use non-string types with implicit FORMAT JSON clause
LINE 1: SELECT  JSON_query(NULL::int, '$' returning int);
   ^

SELECT * FROM JSON_TABLE(NULL::int, '$' COLUMNS (foo text));
ERROR:  cannot use non-string types with implicit FORMAT JSON clause
LINE 1: SELECT * FROM JSON_TABLE(NULL::int, '$' COLUMNS (foo text));
 ^

SELECT  JSON_value(NULL::text, '$' returning int);
ERROR:  JSON_VALUE() is not yet implemented for the json type
LINE 1: SELECT  JSON_value(NULL::text, '$' returning int);
   ^
HINT:  Try casting the argument to jsonb


SELECT  JSON_query(NULL::text, '$' returning int);
ERROR:  JSON_QUERY() is not yet implemented for the json type
LINE 1: SELECT  JSON_query(NULL::text, '$' returning int);
   ^
HINT:  Try casting the argument to jsonb

in all these cases, the error message seems strange.

we already mentioned:
  
   
SQL/JSON query functions currently only accept values of the
jsonb type, because the SQL/JSON path language only
supports those, so it might be necessary to cast the
context_item argument of these functions to
jsonb.
   
  

we can simply say that only the jsonb data type is accepted for the first argument.




Re: Make attstattarget nullable

2024-03-14 Thread Peter Eisentraut

On 12.03.24 14:32, Tomas Vondra wrote:

On 3/12/24 13:47, Peter Eisentraut wrote:

On 06.03.24 22:34, Tomas Vondra wrote:

0001


1) I think this bit in ALTER STATISTICS docs is wrong:

-  new_target
+  SET STATISTICS { integer | DEFAULT }

because it means we now have list entries for name, ..., new_name,
new_schema, and then suddenly "SET STATISTICS { integer | DEFAULT }".
That's a bit weird.


Ok, how would you change it?  List out the full clauses of the other
variants under Parameters as well?


I'd go with a parameter, essentially exactly as it used to be, except
for adding the DEFAULT option. So the list would define new_target, and
mention DEFAULT as a special value.


Ok, done that way (I think).


2) The newtarget handling in AlterStatistics seems rather confusing. Why
does it get set to -1 just to ignore the value later? For a while I was
99% sure ALTER STATISTICS ... SET STATISTICS DEFAULT will set the field
to -1. Maybe ditching the first if block and directly checking
stmt->stxstattarget before setting repl_val/repl_null would be better?


But we also need to continue accepting -1 for default on input.  The
current code achieves that, the proposed variant would not.


OK, I did not realize that. But then maybe this should be explained in a
comment before the new "if" block, because people won't realize why it
needs to be this way.


In the new version, I tried to write this more explicitly, and updated 
tablecmds.c to match.
From d98742439bfe82a20cffcdda6d5a05fdd06b46ea Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 14 Mar 2024 11:06:49 +0100
Subject: [PATCH v5 1/3] Make stxstattarget nullable

To match attstattarget change (commit 4f622503d6d).

Reviewed-by: Tomas Vondra 
Discussion: 
https://www.postgresql.org/message-id/flat/4da8d211-d54d-44b9-9847-f2a9f1184...@eisentraut.org
---
 doc/src/sgml/catalogs.sgml  | 28 +++
 doc/src/sgml/ref/alter_statistics.sgml  | 11 +++---
 src/backend/commands/statscmds.c| 46 
 src/backend/commands/tablecmds.c| 47 +
 src/backend/parser/gram.y   |  4 +--
 src/backend/statistics/extended_stats.c |  4 ++-
 src/bin/pg_dump/pg_dump.c   | 10 +++---
 src/bin/psql/describe.c |  3 +-
 src/include/catalog/catversion.h|  2 +-
 src/include/catalog/pg_statistic_ext.h  |  6 ++--
 src/include/nodes/parsenodes.h  |  2 +-
 11 files changed, 93 insertions(+), 70 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 387a14b1869..2f091ad09d1 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7657,6 +7657,19 @@ pg_statistic_ext 
Columns
   
  
 
+ 
+  
+   stxkeys int2vector
+   (references pg_attribute.attnum)
+  
+  
+   An array of attribute numbers, indicating which table columns are
+   covered by this statistics object;
+   for example a value of 1 3 would
+   mean that the first and the third table columns are covered
+  
+ 
+
  
   
stxstattarget int2
@@ -7666,7 +7679,7 @@ pg_statistic_ext 
Columns
of statistics accumulated for this statistics object by
ANALYZE.
A zero value indicates that no statistics should be collected.
-   A negative value says to use the maximum of the statistics targets of
+   A null value says to use the maximum of the statistics targets of
the referenced columns, if set, or the system default statistics target.
Positive values of stxstattarget
determine the target number of most common values
@@ -7674,19 +7687,6 @@ pg_statistic_ext 
Columns
   
  
 
- 
-  
-   stxkeys int2vector
-   (references pg_attribute.attnum)
-  
-  
-   An array of attribute numbers, indicating which table columns are
-   covered by this statistics object;
-   for example a value of 1 3 would
-   mean that the first and the third table columns are covered
-  
- 
-
  
   
stxkind char[]
diff --git a/doc/src/sgml/ref/alter_statistics.sgml 
b/doc/src/sgml/ref/alter_statistics.sgml
index 73cc9e830de..c82a728a910 100644
--- a/doc/src/sgml/ref/alter_statistics.sgml
+++ b/doc/src/sgml/ref/alter_statistics.sgml
@@ -26,7 +26,7 @@
 ALTER STATISTICS name OWNER TO { 
new_owner | CURRENT_ROLE | 
CURRENT_USER | SESSION_USER }
 ALTER STATISTICS name RENAME TO 
new_name
 ALTER STATISTICS name SET SCHEMA 
new_schema
-ALTER STATISTICS name SET 
STATISTICS new_target
+ALTER STATISTICS name SET 
STATISTICS { new_target | DEFAULT }
 
  
 
@@ -101,10 +101,11 @@ Parameters

 The statistic-gathering target for this statistics object for 
subsequent
 ANALYZE 
operations.
-The target can be set in the range 0 to 10000; alternatively, set it
-to -1 to revert to using the maximum of the statistics target of the
-referenced columns

Re: pg16: XX000: could not find pathkey item to sort

2024-03-14 Thread David Rowley
On Thu, 14 Mar 2024 at 18:23, Ashutosh Bapat
 wrote:
> I don't understand why root->query_pathkeys has both a and b. "a" is there 
> because of GROUP BY and ORDER BY clause. But why "b"?

So that the ORDER BY aggregate function can be evaluated without
nodeAgg.c having to perform the sort. See
adjust_group_pathkeys_for_groupagg().

David




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-14 Thread Amit Kapila
On Thu, Mar 14, 2024 at 1:45 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 14, 2024 at 2:27 PM Amit Kapila  wrote:
> >
> > On Thu, Mar 14, 2024 at 5:57 AM Masahiko Sawada  
> > wrote:
> > >
> > > This fact makes me think that the slotsync worker might be able to
> > > accept the primary_conninfo value even if there is no dbname in the
> > > value. That is, if there is no dbname in the primary_conninfo, it uses
> > > the username in accordance with the specs of the connection string.
> > > Currently, the slotsync worker connects to the local database first
> > > and then establishes the connection to the primary server. But if we
> > > can reverse the two steps, it can get the dbname that has actually
> > > been used to establish the remote connection and use it for the local
> > > connection too. That way, the primary_conninfo generated by
> > > pg_basebackup could work even without the patch. For example, if the
> > > OS user executing pg_basebackup is 'postgres', the slotsync worker
> > > would connect to the postgres database. Given the 'postgres' database
> > > is created by default and 'postgres' OS user is used in common, I
> > > guess it could cover many cases in practice actually.
> > >
> >
> > I think this is worth investigating but I suspect that in most cases
> > users will end up using a replication connection without specifying
> > the user name and we may not be able to give a meaningful error
> > message when slotsync worker won't be able to connect. The same will
> > be true even when the dbname same as the username would be used.
>
> What do you mean by not being able to give a meaningful error message?
>
> If the slotsync worker uses the user name as the dbname, and such a
> database doesn't exist, the error message the user will get is
> "database "test_user" does not exist". ISTM the same is true when the
> user specifies the wrong database in the primary_conninfo.
>

Right, the exact error message as mentioned by Shveta will be:
ERROR:  could not connect to the primary server: connection to server
at "127.0.0.1", port 5433 failed: FATAL: database "bckp_user" does not
exist

Now, without this idea, the ERROR message will be:
 ERROR:  slot synchronization requires dbname to be specified in
primary_conninfo

I am not sure how much this matters but the second message sounds more useful.

> >
> > > Having said that, even with (or without) the above change, we might
> > > want to change the pg_basebackup so that it writes the dbname to the
> > > primary_conninfo if -R option is specified. Since the database where
> > > the slotsync worker connects cannot be dropped while the slotsync
> > > worker is running, the user might want to change the database to
> > > connect, and it would be useful if they can do that using
> > > pg_basebackup instead of modifying the configuration file manually.
> > >
> > > While the current approach makes sense to me, I'm a bit concerned that
> > > we might end up having the pg_basebackup search the actual database
> > > name (e.g. 'dbname=template1') from the .pgpass file instead of
> > > 'dbname=replication'. As far as I tested on my environment, suppose
> > > that I execute:
> > >
> > > pg_basebackup -D tmp -d "dbname=testdb" -R
> > >
> > > The pg_basebackup established a replication connection but looked for
> > > the password of the 'testdb' database. This could be another
> > > inconvenience for the existing users who want to use the slot
> > > synchronization.
> > >
> >
> > This is true because it is internally using a logical replication
> > connection (as we will set replication=database).
>
> Did you mean the pg_basebackup is using a logical replication
> connection in this case? As far as I tested, even if we specify dbname
> to the -d option of pg_basebackup, it uses a physical replication
> connection. For example, it can take a backup even if I specify a
> non-existing database name.
>

You are right. I misunderstood some part of the code in GetConnection.
However, I think my point is still valid: if the user has provided a
dbname in the connection string, it means that she wants that database
entry to be looked up, not the "replication" entry.

>
> >
> > > But it's still just an idea and I might be missing something. And
> > > given we're getting closer to the feature freeze, it would be a PG18
> > > item.
> > >
> >
> > +1. At this stage, it is important to discuss whether we should allow
> > pg_basebackup to write dbname (either a specified one or a default one)
> > along with other parameters in primary_conninfo?
> >
>
> True. While I basically agree that pg_basebackup writes dbname in
> primary_conninfo, I'm concerned that writing "dbname=replication"
> could be problematic. Quoting the case 3) Vignesh summarized before:
>
> 3) ./pg_basebackup -d "user=vignesh"  -D data -R
> -> primary_conninfo = "dbname=replication"  (In this case
> primary_conninfo will have dbname as replication which is the default
> value from GetConnection as connection string i

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-14 Thread Alvaro Herrera
On 2024-Mar-14, Jelte Fennema-Nio wrote:

> On Wed, 13 Mar 2024 at 20:08, Jacob Champion
>  wrote:
> > I hit this on my machine. With the attached diff I can reproduce
> > constantly (including with the most recent test patch); I think the
> > cancel must be arriving between the bind/execute steps?
> 
> Nice find! Your explanation makes total sense. Attached a patchset
> that fixes/works around this issue by using the simple query protocol
> in the cancel test.

Hmm, isn't this basically saying that we're giving up on reliably
canceling queries altogether?  I mean, maybe we'd like to instead fix
the bug about canceling queries in extended query protocol ...
Isn't that something you're worried about?

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"(Andrew Morton)




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-14 Thread Heikki Linnakangas

On 14/03/2024 06:54, Dilip Kumar wrote:

On Wed, Mar 13, 2024 at 9:25 PM Robert Haas  wrote:


On Wed, Mar 13, 2024 at 11:39 AM Dilip Kumar  wrote:

Andres already commented on the snapshot stuff on an earlier patch
version, and that's much nicer with this version. However, I don't
understand why a parallel bitmap heap scan needs to do anything at all
with the snapshot, even before these patches. The parallel worker
infrastructure already passes the active snapshot from the leader to the
parallel worker. Why does bitmap heap scan code need to do that too?


Yeah, thinking about this now, it seems you are right that the parallel
infrastructure is already passing the active snapshot, so why do we
need it again?  Then I checked other low-level scan nodes like indexscan
and seqscan, and it seems we are doing the same thing there as well.
Check for SerializeSnapshot() in table_parallelscan_initialize() and
index_parallelscan_initialize() which are being called from
ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM()
respectively.


I remember thinking about this when I was writing very early parallel
query code. It seemed to me that there must be some reason why the
EState has a snapshot, as opposed to just using the active snapshot,
and so I took care to propagate that snapshot, which is used for the
leader's scans, to the worker scans also. Now, if the EState doesn't
need to contain a snapshot, then all of that mechanism is unnecessary,
but I don't see how it can be right for the leader to do
table_beginscan() using estate->es_snapshot and the worker to use the
active snapshot.


Yeah, that's a very valid point. So I think now Heikki/Melanie might
have got an answer to their question, about the thought process behind
serializing the snapshot for each scan node.  And the same thing is
followed for BitmapHeapNode as well.


I see. Thanks, understanding the thought process helps.

So when a parallel table or index scan runs in the executor as part of a 
query, we could just use the active snapshot. But there are some other 
callers of parallel table scans that don't use the executor, namely 
parallel index builds. For those it makes sense to pass the snapshot for 
the scan independent of the active snapshot.


A parallel bitmap heap scan isn't really a parallel scan as far as the 
table AM is concerned, though. It's more like an independent bitmap heap 
scan in each worker process, nodeBitmapHeapscan.c does all the 
coordination of which blocks to scan. So I think that 
table_parallelscan_initialize() was the wrong role model, and we should 
still remove the snapshot serialization code from nodeBitmapHeapscan.c.



Digging deeper into the question of whether es_snapshot == 
GetActiveSnapshot() is a valid assumption:




es_snapshot is copied from the QueryDesc in standard_ExecutorStart(). 
Looking at the callers of ExecutorStart(), they all get the QueryDesc by 
calling CreateQueryDesc() with GetActiveSnapshot(). And I don't see any 
callers changing the active snapshot between the ExecutorStart() and 
ExecutorRun() calls either. In pquery.c, we explicitly 
PushActiveSnapshot(queryDesc->snapshot) before calling ExecutorRun(). So 
no live bug here AFAICS, es_snapshot == GetActiveSnapshot() holds.


_SPI_execute_plan() has code to deal with the possibility that the 
active snapshot is not set. That seems fishy; do we really support SPI 
without any snapshot? I'm inclined to turn that into an error. I ran the 
regression tests with an "Assert(ActiveSnapshotSet())" there, and 
everything worked.


If es_snapshot was different from the active snapshot, things would get 
weird, even without parallel query. The scans would use es_snapshot for 
the visibility checks, but any functions you execute in quals would use 
the active snapshot.


We could double down on that assumption, and remove es_snapshot 
altogether and use GetActiveSnapshot() instead. And perhaps add 
"PushActiveSnapshot(queryDesc->snapshot)" to ExecutorRun().




In summary, this es_snapshot stuff is a bit confusing and could use some 
cleanup. But for now, I'd like to just add some assertions and a 
comments about this, and remove the snapshot serialization from bitmap 
heap scan node, to make it consistent with other non-parallel scan nodes 
(it's not really a parallel scan as far as the table AM is concerned). 
See attached patch, which is the same as previous patch with some extra 
assertions.


--
Heikki Linnakangas
Neon (https://neon.tech)
From 44eb77162ccc6f18a0a5eaccf0c083d4fefd076f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 14 Mar 2024 11:56:11 +0200
Subject: [PATCH v2 1/1] Remove redundant snapshot copying from parallel leader
 to workers

The parallel query infrastructure copies the leader backend's active
snapshot to the worker processes. But BitmapHeapScan node also had
bespoke code to pass the snapshot from leader to the worker. That was
redundant, so remove it.

The removed code was analogous to the snapshot serialization 

Re: POC, WIP: OR-clause support for indexes

2024-03-14 Thread Alexander Korotkov
On Thu, Mar 14, 2024 at 12:11 PM Andrei Lepikhov
 wrote:
>
> On 14/3/2024 16:31, Alexander Korotkov wrote:
> > On Wed, Mar 13, 2024 at 2:16 PM Andrei Lepikhov
> > mailto:a.lepik...@postgrespro.ru>> wrote:
> >  > On 13/3/2024 18:05, Alexander Korotkov wrote:
> >  > > On Wed, Mar 13, 2024 at 7:52 AM Andrei Lepikhov
> >  > > Given all of the above, I think moving transformation to the
> >  > > canonicalize_qual() would be the right way to go.
> >  > Ok, I will try to move the code.
> >  > I have no idea about the timings so far. I recall the last time I got
> >  > bogged down in tons of duplicated code. I hope with an almost-ready
> >  > sketch, it will be easier.
> >
> > Thank you!  I'll be looking forward to the updated patch.
> Okay, I moved the 0001-* patch to the prepqual.c module. See it in the
> attachment. I treat it as a transient patch.
> It has positive outcomes as well as negative ones.
> The most damaging result you can see in the partition_prune test:
> partition pruning, in some cases, moved to the executor initialization
> stage. I guess, we should avoid it somehow in the next version.

Thank you, Andrei.  Looks like a very undesirable side effect.  Do you
have any idea why it happens?  Partition pruning should work correctly
for both transformed and non-transformed quals, why does
transformation hurt it?

--
Regards,
Alexander Korotkov




Re: Fix expecteddir argument in pg_regress

2024-03-14 Thread Daniel Gustafsson
> On 11 Mar 2024, at 09:23, Anthonin Bonnefoy  
> wrote:

> pg_regress accepts the expecteddir argument. However, it is never used
> and outputdir is used instead to get the expected files paths.

Nice catch, c855872074b5bf44ecea033674d22fac831cfc31 added --expecteddir
support to pg_regress but only implemented it for the ECPG tests.  Will have
another look at this before applying with a backpatch to v16 where
--expecteddir was added.

--
Daniel Gustafsson





Re: Add publisher and subscriber to glossary documentation.

2024-03-14 Thread Shlok Kyal
Hi Andrew,

> If there's a movement towards "node" to refer to the database which has the 
> Subscription object, then perhaps the documentation for
>
> 31.2. Subscription, Chapter 31. Logical Replication should be updated as 
> well, since it uses both the "database" and "node" terms on the same page, 
> and to me referring to the same thing (I could be missing a subtlety).
>
>
> See:
>
>
> "The subscriber database..."
>
>
> "A subscriber node may..."
>
>
> Also, the word "database" in this sentence: "A subscription defines the 
> connection to another database" to me works, but I think using "node" there 
> could be more consistent if it’s referring to the server instance running the 
> database that holds the PUBLICATION. The connection string information 
> example later on the page shows "host" and "dbname" configured in the 
> CONNECTION value for the SUBSCRIPTION. This sentence seems like the use of 
> "database" in casual style to mean the "server instance" (or "node").
>
>
> Also, the "The node where a subscription is defined". That one actually feels 
> to me like "The database where a subscription is defined", but then that 
> contradicts what I just said, and "node" is fine here but I think "node" 
> should be on the preceding sentence too.
>
>
> Anyway, hopefully these examples show “node” and “database” are mixed and 
> perhaps others agree using one consistently might help the goals of the docs.

The existing content looks good to me; I feel we should keep it as it is
unless others feel differently.

Thanks and regards,
Shlok Kyal




Re: Can Execute commands for different portals interleave?

2024-03-14 Thread Heikki Linnakangas

On 14/03/2024 07:05, Evgeny Smirnov wrote:
The question (a short version): is it possible for a client to send two 
selects in the same transaction using the extended query protocol 
(without declaring cursors) and pull rows simultaneously by means of 
interleaving portal names and restricting fetch size in Execute commands.


Yes, that's possible.

Named portals created with the extended query protocol are the same as 
cursors, really. You can even use the Execute protocol message to fetch 
from a cursor created with DECLARE CURSOR, or use the FETCH command to fetch 
from a portal created with the Bind message.
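
For illustration, the SQL-level equivalent with cursors looks like this (a 
minimal sketch; t1 and t2 are assumed example tables):

BEGIN;
DECLARE c1 CURSOR FOR SELECT * FROM t1;
DECLARE c2 CURSOR FOR SELECT * FROM t2;
FETCH 10 FROM c1;   -- corresponds to Execute("c1", max rows = 10)
FETCH 10 FROM c2;   -- corresponds to Execute("c2", max rows = 10)
FETCH 10 FROM c1;   -- resumes the first portal where it left off
COMMIT;

With the extended query protocol you get the same interleaving by binding two 
differently named portals and alternating Execute messages with a row limit, 
finishing with Sync.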


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-14 Thread Dilip Kumar
On Thu, Mar 14, 2024 at 4:07 PM Heikki Linnakangas  wrote:
>
> > Yeah, that's a very valid point. So I think now Heikki/Melanie might
> > have got an answer to their question, about the thought process behind
> > serializing the snapshot for each scan node.  And the same thing is
> > followed for BitmapHeapNode as well.
>
> I see. Thanks, understanding the thought process helps.
>
> So when a parallel table or index scan runs in the executor as part of a
> query, we could just use the active snapshot. But there are some other
> callers of parallel table scans that don't use the executor, namely
> parallel index builds. For those it makes sense to pass the snapshot for
> the scan independent of the active snapshot.

Right

> A parallel bitmap heap scan isn't really a parallel scan as far as the
> table AM is concerned, though. It's more like an independent bitmap heap
> scan in each worker process, nodeBitmapHeapscan.c does all the
> coordination of which blocks to scan. So I think that
> table_parallelscan_initialize() was the wrong role model, and we should
> still remove the snapshot serialization code from nodeBitmapHeapscan.c.

I think that seems right.

> Digging deeper into the question of whether es_snapshot ==
> GetActiveSnapshot() is a valid assumption:
>
> 
>
> es_snapshot is copied from the QueryDesc in standard_ExecutorStart().
> Looking at the callers of ExecutorStart(), they all get the QueryDesc by
> calling CreateQueryDesc() with GetActiveSnapshot(). And I don't see any
> callers changing the active snapshot between the ExecutorStart() and
> ExecutorRun() calls either. In pquery.c, we explicitly
> PushActiveSnapshot(queryDesc->snapshot) before calling ExecutorRun(). So
> no live bug here AFAICS, es_snapshot == GetActiveSnapshot() holds.
>
> _SPI_execute_plan() has code to deal with the possibility that the
> active snapshot is not set. That seems fishy; do we really support SPI
> without any snapshot? I'm inclined to turn that into an error. I ran the
> regression tests with an "Assert(ActiveSnapshotSet())" there, and
> everything worked.

IMHO, we can call SPI_Connect() and SPI_Execute() from any C
extension, so I don't think there we can guarantee that the snapshot
must be set, do we?

> If es_snapshot was different from the active snapshot, things would get
> weird, even without parallel query. The scans would use es_snapshot for
> the visibility checks, but any functions you execute in quals would use
> the active snapshot.
>
> We could double down on that assumption, and remove es_snapshot
> altogether and use GetActiveSnapshot() instead. And perhaps add
> "PushActiveSnapshot(queryDesc->snapshot)" to ExecutorRun().
>
> 
>
> In summary, this es_snapshot stuff is a bit confusing and could use some
> cleanup. But for now, I'd like to just add some assertions and a
> comments about this, and remove the snapshot serialization from bitmap
> heap scan node, to make it consistent with other non-parallel scan nodes
> (it's not really a parallel scan as far as the table AM is concerned).
> See attached patch, which is the same as previous patch with some extra
> assertions.

Maybe for now we can just handle this specific case to remove the
snapshot serializing for the BitmapHeapScan as you are doing in the
patch.  After looking into the code, your theory seems correct: we
are just copying the ActiveSnapshot while building the query
descriptor, and from there we copy it into the EState, so logically
there should not be any reason for these two to be different.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: UUID v7

2024-03-14 Thread Peter Eisentraut

On 10.03.24 13:59, Andrey M. Borodin wrote:

The functions uuid_extract_ver and uuid_extract_var could be named
uuid_extract_version and uuid_extract_variant.  Otherwise, it's hard
to tell them apart, with only one letter different.


Renamed.


Another related comment: Throughout your patch, swap the order of 
uuid_extract_variant and uuid_extract_version.  First, this makes more 
sense because version is subordinate to variant, and also it makes it 
alphabetical.



I think the behavior of uuid_extract_var(iant) is wrong.  The code
takes just two bits to return, but the draft document is quite clear
that the variant is 4 bits (see Table 1).


Well, it was correct only for implemented variant. I've made version that 
implements full table 1 from section 4.1.


I think we are still interpreting this differently.  I think 
uuid_extract_variant should just return whatever is in those four bits. 
Your function comment says "Can return only 0, 0b10, 0b110 and 0b111.", 
which I don't think is correct.  It should return 0 through 15.



I would have expected that, since gettimeofday() provides microsecond
precision, we'd put the extra precision into "rand_a" as per Section 6.2 method 
3.


I had chosen method 2 over method 3 as most portable. Can we be sure how many 
bits (after reading milliseconds) are there across different OSes?


I think this should have been researched.  If we don't know how many 
bits we have, how do we know we have enough for milliseconds?  I think 
we should at least have some kind of idea, if we are going to have this 
conversation.






Re: Flushing large data immediately in pqcomm

2024-03-14 Thread Melih Mutlu
Hi hackers,

I did some experiments with this patch, after previous discussions. This
probably does not answer all questions, but would be happy to do more if
needed.

First, I updated the patch according to what was suggested here [1]. PSA v2.
I tweaked the master branch a bit to not allow any buffering. I compared
HEAD, this patch and no buffering at all.
I also added a simple GUC to control PqSendBufferSize, this change only
allows to modify the buffer size and should not have any impact on
performance.

I again ran the COPY TO STDOUT command and timed it. AFAIU COPY sends data
row by row, and I tried running the command under different scenarios with
different # of rows and row sizes. You can find the test script attached
(see test.sh).
All timings are in ms.

1- row size = 100 bytes, # of rows = 100
┌───┬┬──┬──┬──┬──┬──┐
│   │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───┼┼──┼──┼──┼──┼──┤
│ HEAD  │ 1036   │ 998  │ 940  │ 910  │ 894  │ 874  │
├───┼┼──┼──┼──┼──┼──┤
│ patch │ 1107   │ 1032 │ 980  │ 957  │ 917  │ 909  │
├───┼┼──┼──┼──┼──┼──┤
│ no buffer │ 6230   │ 6125 │ 6282 │ 6279 │ 6255 │ 6221 │
└───┴┴──┴──┴──┴──┴──┘

2-  row size = half of the rows are 1KB and rest is 10KB , # of rows =
100
┌───┬┬───┬───┬───┬───┬───┐
│   │ 1400 bytes │ 2KB   │ 4KB   │ 8KB   │ 16KB  │ 32KB  │
├───┼┼───┼───┼───┼───┼───┤
│ HEAD  │ 25197  │ 23414 │ 20612 │ 19206 │ 18334 │ 18033 │
├───┼┼───┼───┼───┼───┼───┤
│ patch │ 19843  │ 19889 │ 19798 │ 19129 │ 18578 │ 18260 │
├───┼┼───┼───┼───┼───┼───┤
│ no buffer │ 23752  │ 23565 │ 23602 │ 23622 │ 23541 │ 23599 │
└───┴┴───┴───┴───┴───┴───┘

3-  row size = half of the rows are 1KB and rest is 1MB , # of rows = 1000
┌───┬┬──┬──┬──┬──┬──┐
│   │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───┼┼──┼──┼──┼──┼──┤
│ HEAD  │ 3137   │ 2937 │ 2687 │ 2551 │ 2456 │ 2465 │
├───┼┼──┼──┼──┼──┼──┤
│ patch │ 2399   │ 2390 │ 2402 │ 2415 │ 2417 │ 2422 │
├───┼┼──┼──┼──┼──┼──┤
│ no buffer │ 2417   │ 2414 │ 2429 │ 2418 │ 2435 │ 2404 │
└───┴┴──┴──┴──┴──┴──┘

4-  row size = all rows are 1MB, # of rows = 1000
┌───┬┬──┬──┬──┬──┬──┐
│   │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───┼┼──┼──┼──┼──┼──┤
│ HEAD  │ 6113   │ 5764 │ 5281 │ 5009 │ 4885 │ 4872 │
├───┼┼──┼──┼──┼──┼──┤
│ patch │ 4759   │ 4754 │ 4754 │ 4758 │ 4782 │ 4805 │
├───┼┼──┼──┼──┼──┼──┤
│ no buffer │ 4756   │ 4774 │ 4793 │ 4766 │ 4770 │ 4774 │
└───┴┴──┴──┴──┴──┴──┘

Some quick observations:
1- Even though I expect both the patch and HEAD behave similarly in case of
small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.
2- In cases where the data does not fit into the buffer, the patch starts
performing better than HEAD. For example, in case 2, patch seems faster
until the buffer size exceeds the data length. When the buffer size is set
to something larger than 10KB (16KB/32KB in this case), there is again a
slight performance loss with the patch as in case 1.
3- With large row sizes (i.e. sizes that do not fit into the buffer) not
buffering at all starts performing better than HEAD. Similarly the patch
performs better too as it stops buffering if data does not fit into the
buffer.



[1]
https://www.postgresql.org/message-id/CAGECzQTYUhnC1bO%3DzNiSpUgCs%3DhCYxVHvLD2doXNx3My6ZAC2w%40mail.gmail.com


Thanks,
-- 
Melih Mutlu
Microsoft
From 7f1b72a0948156f8e35ce3b07b5e763a5578d641 Mon Sep 17 00:00:00 2001
From: Melih Mutlu 
Date: Mon, 26 Feb 2024 14:41:35 +0300
Subject: [PATCH] Added pq_send_buffer_size GUC

---
 src/backend/libpq/pqcomm.c  |  2 +-
 src/backend/utils/misc/guc_tables.c | 11 +++
 src/include/libpq/libpq.h   |  1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index c606bf3447..92708e46e6 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -116,7 +116,7 @@ static List *sock_paths = NIL;
  * enlarged by pq_putmessage_noblock() if the message doesn't fit otherwise.
  */
 
-#define PQ_SEND_BUFFER_SIZE 8192
+int PQ_SEND_BUFFER_SIZE = 8192;
 #define PQ_RECV_BUFFER_SIZE 8192
 
 static ch

Re: Using the %m printf format more

2024-03-14 Thread Dagfinn Ilmari Mannsåker
Michael Paquier  writes:

> On Wed, Mar 13, 2024 at 02:33:52PM +0100, Peter Eisentraut wrote:
>> The 0002 patch looks sensible.  It would be good to fix that, otherwise it
>> could have some confusing outcomes in the future.
>
> You mean if we begin to use %m in future callers of
> emit_tap_output_v(), hypothetically?  That's a fair argument.

Yeah, developers would rightfully expect to be able to use %m with
anything that takes a printf format string.  Case in point: when I was
first doing the conversion I did change the bail() and diag() calls in
pg_regress to %m, and only while I was preparing the patch for
submission did I think to check the actual implementation to see if it
was safe to do so.

The alternative would be to document that you can't use %m with these
functions, which is silly IMO, given how simple the fix is.

One minor improvement I can think of is to add a comment by the
save_errno declaration noting that it's needed in order to support %m.

- ilmari




Re: UUID v7

2024-03-14 Thread Andrey M. Borodin



> On 14 Mar 2024, at 16:07, Peter Eisentraut  wrote:
> 
> On 10.03.24 13:59, Andrey M. Borodin wrote:
>>> The functions uuid_extract_ver and uuid_extract_var could be named
>>> uuid_extract_version and uuid_extract_variant.  Otherwise, it's hard
>>> to tell them apart, with only one letter different.
>> Renamed.
> 
> Another related comment: Throughout your patch, swap the order of 
> uuid_extract_variant and uuid_extract_version.  First, this makes more sense 
> because version is subordinate to variant, and also it makes it alphabetical.
I will do it soon.
> 
>>> I think the behavior of uuid_extract_var(iant) is wrong.  The code
>>> takes just two bits to return, but the draft document is quite clear
>>> that the variant is 4 bits (see Table 1).
>> Well, it was correct only for implemented variant. I've made version that 
>> implements full table 1 from section 4.1.
> 
> I think we are still interpreting this differently.  I think 
> uuid_extract_variant should just return whatever is in those four bits. Your 
> function comment says "Can return only 0, 0b10, 0b110 and 0b111.", which I 
> don't think is correct.  It should return 0 through 15.
We will return "do not care" bits. This bits can confuse someone. E.g. for 
varaint 0b10 we can return 8, 9, 10 and 11 randomly. Is it OK? BTW for some 
reason document lists number 1-15, but your are correct that range is 0-15.

> 
>>> I would have expected that, since gettimeofday() provides microsecond
>>> precision, we'd put the extra precision into "rand_a" as per Section 6.2 
>>> method 3.
>> I had chosen method 2 over method 3 as most portable. Can we be sure how 
>> many bits (after reading milliseconds) are there across different OSes?
> 
> I think this should have been researched.  If we don't know how many bits we 
> have, how do we know we have enough for milliseconds?  I think we should at 
> least have some kind of idea, if we are going to have this conversation.

The bits for milliseconds are strictly defined by the document: there are always 48 
bits, independent of clock resolution.
But I don't think that's the main problem with Method 3. Method 1 actually guarantees 
a strictly increasing order of UUIDs generated by a single backend, whereas Method 3 can 
generate a lot of unsorted data if the clock leaps backward.

BTW Kyzer (in an off-list discussion) and Sergey confirmed that implemented 
method from the patch actually is Method 1.


Best regards, Andrey Borodin.



Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-14 Thread Jelte Fennema-Nio
On Thu, 14 Mar 2024 at 11:33, Alvaro Herrera  wrote:
> Hmm, isn't this basically saying that we're giving up on reliably
> canceling queries altogether?  I mean, maybe we'd like to instead fix
> the bug about canceling queries in extended query protocol ...
> Isn't that something you're worried about?

In any case I think it's worth having (non-flaky) test coverage of our
libpq cancellation sending code. So I think it makes sense to commit
the patch I proposed, even if the backend code that handles it is
arguably buggy.

Regarding the question if the backend code is actually buggy or not:
the way cancel requests are defined to work is a bit awkward. They
cancel whatever operation is running on the session when they arrive.
So if the session is just in the middle of a Bind and Execute message
there is nothing to cancel. While surprising and probably not what
someone would want, I don't think this behaviour is too horrible in
practice in this case. Most of the time people cancel queries while
the Execute message is being processed. The new test really only runs
into this problem because it sends a cancel request immediately after
sending the query.

I definitely think it's worth rethinking the way we do query
cancellations though. I think what we would probably want is a way to
cancel a specific query/message on a session, instead of cancelling
whatever is running at the moment the cancel request is processed
by Postgres. This "cancel whatever is running" behaviour is
fraught with issues, the Bind/Execute issue being only one of them.
One really annoying race condition of a cancel request cancelling
a different query than the one intended can happen with this flow (which I spent
a lot of time addressing in PgBouncer):
1. You send query A on session 1
2. You send a cancel request for session 1 (intending to cancel query A)
3. Query A completes by itself
4. You now send query B
5. The cancel request is now processed
6. Query B is now cancelled

But solving that race condition would involve changing the Postgres
protocol, which I'm trying to make possible with the first few commits
in [1]. And while those first few commits might still land in PG17, I
don't think a large protocol change like adding query identifiers to
cancel requests is feasible for PG17 anymore.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-14 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 6:55 PM John Naylor  wrote:
>
> On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada  
> wrote:
> >
> > On Thu, Mar 14, 2024 at 1:29 PM John Naylor  wrote:
> > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > general and put the validation down into C as well. First we save the
> > > blocks from do_set_block_offsets() into a table, then with all those
> > > blocks lookup a sufficiently-large range of possible offsets and save
> > > found values in another array. So the static items structure would
> > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > the iteration output is private to check_set_block_offsets(). Then
> > > sort as needed and check they are all the same.
> >
> > That's a promising idea. We can use the same mechanism for randomized
> > tests too. If you're going to work on this, I'll do other tests on my
> > environment in the meantime.
>
> Some progress on this in v72 -- I tried first without using SQL to
> save the blocks, just using the unique blocks from the verification
> array. It seems to work fine.

Thanks!

>
> - Since there are now three arrays we should reduce max bytes to
> something smaller.

Agreed.

> - Further on that, I'm not sure if the "is full" test is telling us
> much. It seems we could make max bytes a static variable and set it to
> the size of the empty store. I'm guessing it wouldn't take much to add
> enough tids so that the contexts need to allocate some blocks, and
> then it would appear full and we can test that. I've made it so all
> arrays repalloc when needed, just in case.

How about using work_mem as max_bytes instead of having it as a static
variable? In test_tidstore.sql we set work_mem before creating the
tidstore. It would make the tidstore more controllable by SQL queries.
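
A rough sketch of that idea (test_create() is a hypothetical name for the test
module's setup function; the actual entry points may differ):

SET work_mem = '1MB';
SELECT test_create();   -- tidstore gets max_bytes from work_mem instead of a hard-coded value
-- ... do_set_block_offsets() / lookup / iteration checks as before ...
RESET work_mem;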

> - Why are we switching to TopMemoryContext? It's not explained -- the
> comment only tells what the code is doing (which is obvious), but not
> why.

This is because the tidstore needs to live across the transaction
boundary. We can use TopMemoryContext or CacheMemoryContext.

> - I'm not sure it's useful to keep test_lookup_tids() around. Since we
> now have a separate lookup test, the only thing it can tell us is that
> lookups fail on an empty store. I arranged it so that
> check_set_block_offsets() works on an empty store. Although that's
> even more trivial, it's just reusing what we already need.

Agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




REVOKE FROM warning on grantor

2024-03-14 Thread Étienne BERSAC
Hi,

Since ldap2pg 6, I'm working on running by default as non-super role
with CREATEDB. Robert Haas made this a viable solution as of Postgres
16.

I got a case where ldap2pg tries to remove a role from a group, but the
ldap2pg user is not the grantor of this membership. This triggers a
warning:

$ REVOKE owners FROM alice;
WARNING:  role "alice" has not been granted membership in role "owners"
by role "ldap2pg"

I'll add a condition on the grantor when listing manageable memberships to
simply avoid this.

However, I'd prefer it if Postgres failed properly, because the GRANT is
actually not revoked. The warning-only behaviour prevents ldap2pg from
reporting an issue in handling privileges on such roles.

What do you think of making this warning an error?




Re: Flushing large data immediately in pqcomm

2024-03-14 Thread Robert Haas
On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu  wrote:
> 1- Even though I expect both the patch and HEAD behave similarly in case of 
> small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.

I wonder why this happens. It seems like maybe something that could be fixed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: small_cleanups around login event triggers

2024-03-14 Thread Daniel Gustafsson
> On 14 Mar 2024, at 02:47, Robert Treat  wrote:

> I was taking a look at the login event triggers work (nice work btw)

Thanks for reviewing committed code, that's something which doesn't happen
often enough and is much appreciated.

> and saw a couple of minor items that I thought would be worth cleaning
> up. This is mostly just clarifying the exiting docs and code comments.

+ either in a connection string or configuration file. Alternativly, you can
This should be "Alternatively" I think.

- canceling connection in psql wouldn't cancel
+ canceling a connection in psql will not cancel
Nitpickery (perhaps motivated by english not being my first language), but
since psql only deals with one connection I would expect this to read "the
connection".

- * Returns true iff the lock was acquired.
+ * Returns true if the lock was acquired.
Using "iff" here is being consistent with the rest of the file (and technically
correct):

$ grep -c "if the lock was" src/backend/storage/lmgr/lmgr.c
1
$ grep -c "iff the lock was" src/backend/storage/lmgr/lmgr.c
5

--
Daniel Gustafsson





Re: Flushing large data immediately in pqcomm

2024-03-14 Thread Jelte Fennema-Nio
On Thu, 14 Mar 2024 at 12:22, Melih Mutlu  wrote:
> I did some experiments with this patch, after previous discussions

One thing I noticed is that the buffer sizes don't seem to matter much
in your experiments, even though Andres' expectation was that 1400
would be better. I think I know the reason for that:

afaict from your test.sh script you connect psql over localhost or
maybe even a unix socket to postgres. Neither of those would have an
MTU of 1500. You'd probably want to do those tests over an actual
network or at least change the MTU of the loopback interface. e.g. my
"lo" interface mtu is 65536 by default:

❯ ip a
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-14 Thread Robert Haas
On Thu, Mar 14, 2024 at 6:37 AM Heikki Linnakangas  wrote:
> If es_snapshot was different from the active snapshot, things would get
> weird, even without parallel query. The scans would use es_snapshot for
> the visibility checks, but any functions you execute in quals would use
> the active snapshot.

Hmm, that's an interesting point.

The case where the query is suspended and resumed - i.e. cursors are
used - probably needs more analysis. In that case, perhaps there's
more room for the snapshots to diverge.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Inconsistent printf placeholders

2024-03-14 Thread Daniel Gustafsson
> On 14 Mar 2024, at 05:20, Kyotaro Horiguchi  wrote:

> I'd be happy if the two messages kept consistency. I suggest aligning
> types instead of making the messages different, as attached.

I've only skimmed this so far but +1 on keeping the messages the same where
possible to reduce translation work.  Adding a comment on the message where the
casting is done to indicate that it is for translation might reduce the risk of
it "getting fixed" down the line.

--
Daniel Gustafsson





Re: Flushing large data immediately in pqcomm

2024-03-14 Thread Heikki Linnakangas

On 14/03/2024 13:22, Melih Mutlu wrote:

@@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
if (internal_flush())
return EOF;
}
-   amount = PqSendBufferSize - PqSendPointer;
-   if (amount > len)
-   amount = len;
-   memcpy(PqSendBuffer + PqSendPointer, s, amount);
-   PqSendPointer += amount;
-   s += amount;
-   len -= amount;
+
+   /*
+* If the buffer is empty and data length is larger than the 
buffer
+* size, send it without buffering. Otherwise, put as much data 
as
+* possible into the buffer.
+*/
+   if (!pq_is_send_pending() && len >= PqSendBufferSize)
+   {
+   int start = 0;
+
+   socket_set_nonblocking(false);
+   if (internal_flush_buffer(s, &start, (int *)&len))
+   return EOF;
+   }
+   else
+   {
+   amount = PqSendBufferSize - PqSendPointer;
+   if (amount > len)
+   amount = len;
+   memcpy(PqSendBuffer + PqSendPointer, s, amount);
+   PqSendPointer += amount;
+   s += amount;
+   len -= amount;
+   }
}
+
return 0;
 }


Two small bugs:

- the "(int *) &len)" cast is not ok, and will break visibly on 
big-endian systems where sizeof(int) != sizeof(size_t).


- If internal_flush_buffer() cannot write all the data in one call, it 
updates 'start' for how much it wrote, and leaves 'end' unchanged. You 
throw the updated 'start' value away, and will send the same data again 
on next iteration.


Not a correctness issue, but instead of pq_is_send_pending(), I think it 
would be better to check "PqSendStart == PqSendPointer" directly, or 
call socket_is_send_pending() directly here. pq_is_send_pending() does 
the same, but it's at a higher level of abstraction.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Catalog domain not-null constraints

2024-03-14 Thread Peter Eisentraut

On 12.02.24 11:24, Alvaro Herrera wrote:

On 2024-Feb-11, Peter Eisentraut wrote:

But I see that table constraints do not work that way.  A command like ALTER
TABLE t1 ADD NOT NULL c1 does nothing if the column already has a NOT NULL
constraint.  I'm not sure this is correct.  At least it's not documented.
We should probably make the domains feature work the same way, but I would
like to understand why it works that way first.



The main source of nastiness, when we allow multiple constraints, is
constraint inheritance.  If we allow just one constraint per column,
then it's always easy to know what to do on inheritance attach and
detach: just coninhcount+1 or coninhcount-1 of the one relevant
constraint (which can be matched by column name).  If we have multiple
ones, we have to know which one(s) to match and how (by constraint
name?); if the parent has two and the child has one, we need to create
another in the child, with its own coninhcount adjustments; if the
parent has one named parent_col_not_null and the child also has
child_col_not_null, then at ADD INHERIT do we match these ignoring the
differing name, or do we rename the one on child so that we now have
two?  Also, the clutter in psql/pg_dump becomes worse.

I would suggest that domain not-null constraints should also allow just
one per column.


Perhaps it would make sense if we change the ALTER TABLE command to be like

ALTER TABLE t1 ADD IF NOT EXISTS NOT NULL c1

Then the behavior is like one would expect.

For ALTER TABLE, we would reject this command if IF NOT EXISTS is not 
specified.  (Since this is mainly for pg_dump, it doesn't really matter 
for usability.)  For ALTER DOMAIN, we could accept both variants.






Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-14 Thread Heikki Linnakangas

On 14/03/2024 14:34, Robert Haas wrote:

On Thu, Mar 14, 2024 at 6:37 AM Heikki Linnakangas  wrote:

If es_snapshot was different from the active snapshot, things would get
weird, even without parallel query. The scans would use es_snapshot for
the visibility checks, but any functions you execute in quals would use
the active snapshot.


Hmm, that's an interesting point.

The case where the query is suspended and resumed - i.e. cursors are
used - probably needs more analysis. In that case, perhaps there's
more room for the snapshots to diverge.


The portal code is pretty explicit about it, the ExecutorRun() call in 
PortalRunSelect() looks like this:


PushActiveSnapshot(queryDesc->snapshot);
ExecutorRun(queryDesc, direction, (uint64) count,
portal->run_once);
nprocessed = queryDesc->estate->es_processed;
PopActiveSnapshot();

I looked at all the callers of ExecutorRun(), and they all have the 
active snapshot equal to queryDesc->snapshot, either because they called 
CreateQueryDesc() with the active snapshot before ExecutorRun(), or they 
set the active snapshot like above.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Inconsistent printf placeholders

2024-03-14 Thread Peter Eisentraut

On 14.03.24 05:20, Kyotaro Horiguchi wrote:

A recent commit 6612185883 introduced two error messages that are
identical in text but differ in their placeholders.

-   pg_fatal("could not read file \"%s\": read only %d of %d 
bytes",
-filename, (int) rb, (int) st.st_size);
+   pg_fatal("could not read file \"%s\": read only %zd of %lld 
bytes",
+filename, rb, (long long int) 
st.st_size);
...
-   pg_fatal("could not read file \"%s\": read only %d of %d 
bytes",
+   pg_fatal("could not read file \"%s\": read only %d of %u 
bytes",
 rf->filename, rb, length);

I'd be happy if the two messages kept consistency. I suggest aligning
types instead of making the messages different, as attached.


If you want to make them uniform, then I suggest the error messages 
should both be "%zd of %zu bytes", which are the actual types read() 
deals with.






Re: Flushing large data immediately in pqcomm

2024-03-14 Thread Jelte Fennema-Nio
On Thu, 14 Mar 2024 at 13:12, Robert Haas  wrote:
>
> On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu  wrote:
> > 1- Even though I expect both the patch and HEAD behave similarly in case of 
> > small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.
>
> I wonder why this happens. It seems like maybe something that could be fixed.

some wild guesses:
1. maybe it's the extra call overhead of the new internal_flush
implementation. What happens if you make that an inline function?
2. maybe swap these conditions around (the call seems heavier than a
simple comparison): !pq_is_send_pending() && len >= PqSendBufferSize

BTW, the improvements for the larger rows are awesome!




Re: [PATCH] LockAcquireExtended improvement

2024-03-14 Thread Robert Haas
On Tue, Mar 12, 2024 at 9:33 AM Robert Haas  wrote:
> On Mon, Mar 11, 2024 at 11:11 PM Jingxian Li  wrote:
> > Your version changes less code than mine by pushing the nowait flag down
> > into ProcSleep(). This looks fine in general, except for a little advice,
> > which I don't think there is necessary to add 'waiting' suffix to the
> > process name in function WaitOnLock with dontwait being true, as follows:
>
> That could be done, but in my opinion it's not necessary. The waiting
> suffix will appear only very briefly, and probably only in relatively
> rare cases. It doesn't seem worth adding code to avoid it.

Seeing no further discussion, I have committed my version of this
patch, with your test case.

Thanks for pursuing this improvement!

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-14 Thread Robert Haas
On Thu, Mar 14, 2024 at 9:00 AM Heikki Linnakangas  wrote:
> The portal code is pretty explicit about it, the ExecutorRun() call in
> PortalRunSelect() looks like this:
>
>  PushActiveSnapshot(queryDesc->snapshot);
>  ExecutorRun(queryDesc, direction, (uint64) count,
>  portal->run_once);
>  nprocessed = queryDesc->estate->es_processed;
>  PopActiveSnapshot();
>
> I looked at all the callers of ExecutorRun(), and they all have the
> active snapshot equal to queryDesc->snapshot, either because they called
> CreateQueryDesc() with the active snapshot before ExecutorRun(), or they
> set the active snapshot like above.

Well, maybe there's a bunch of code cleanup possible, then.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: small_cleanups around login event triggers

2024-03-14 Thread Robert Treat
On Thu, Mar 14, 2024 at 8:21 AM Daniel Gustafsson  wrote:
>
> > On 14 Mar 2024, at 02:47, Robert Treat  wrote:
>
> > I was taking a look at the login event triggers work (nice work btw)
>
> Thanks for reviewing committed code, that's something which doesn't happen
> often enough and is much appreciated.
>
> > and saw a couple of minor items that I thought would be worth cleaning
> > up. This is mostly just clarifying the exiting docs and code comments.
>
> + either in a connection string or configuration file. Alternativly, you 
> can
> This should be "Alternatively" I think.
>

Yes.

> - canceling connection in psql wouldn't cancel
> + canceling a connection in psql will not 
> cancel
> Nitpickery (perhaps motivated by english not being my first language), but
> since psql only deals with one connection I would expect this to read "the
> connection".
>

My interpretation of this is that "a connection" is more correct
because it could be your connection or someone else's connection (ie,
you are canceling one of many possible connections). Definitely
nitpickery either way.

> - * Returns true iff the lock was acquired.
> + * Returns true if the lock was acquired.
> Using "iff" here is being consistent with the rest of the file (and 
> technically
> correct):
>
> $ grep -c "if the lock was" src/backend/storage/lmgr/lmgr.c
> 1
> $ grep -c "iff the lock was" src/backend/storage/lmgr/lmgr.c
> 5
>

Ah, yeah, I was pretty focused on the event trigger stuff and didn't
notice it being used elsewhere; I thought it was a typo, but I guess
it's meant as shorthand for "if and only if". I wonder how many people
are familiar with that.


Robert Treat
https://xzilla.net




Re: type cache cleanup improvements

2024-03-14 Thread Teodor Sigaev

One thing that first pops out to me is that we can do the refactor of
hash_initial_lookup() as an independent piece, without the extra paths
introduced.  But rather than returning the bucket hash and have the
bucket number as an in/out argument of hash_initial_lookup(), there is
an argument for reversing them: hash_search_with_hash_value() does not
care about the bucket number.

Ok, no problem




02-hash_seq_init_with_hash_value.v5.patch - introduces a
hash_seq_init_with_hash_value() method. hash_initial_lookup() is marked as
inline, but I suppose, modern compilers are smart enough to inline it
automatically.


Likely so, though that does not hurt to show the intention to the
reader.

Agree



So I would like to suggest the attached patch for this first piece.
What do you think?

I have no objections



It may also be an idea to use `git format-patch` when generating a
series of patches.  That makes for easier reviews.

Thanks, will try

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-14 Thread Heikki Linnakangas

On 14/03/2024 12:55, Dilip Kumar wrote:

On Thu, Mar 14, 2024 at 4:07 PM Heikki Linnakangas  wrote:

_SPI_execute_plan() has code to deal with the possibility that the
active snapshot is not set. That seems fishy; do we really support SPI
without any snapshot? I'm inclined to turn that into an error. I ran the
regression tests with an "Assert(ActiveSnapshotSet())" there, and
everything worked.


IMHO, we can call SPI_Connect() and SPI_Execute() from any C
extension, so I don't think there we can guarantee that the snapshot
must be set, do we?


I suppose, although the things you could do without a snapshot would be 
pretty limited. The query couldn't access any tables. Could it even look 
up functions in the parser? Not sure.



Maybe for now we can just handle this specific case to remove the
snapshot serializing for the BitmapHeapScan as you are doing in the
patch.  After looking into the code your theory seems correct that we
are just copying the ActiveSnapshot while building the query
descriptor and from there we are copying into the Estate so logically
there should not be any reason for these two to be different.


Ok, committed that for now. Thanks for looking!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: REVOKE FROM warning on grantor

2024-03-14 Thread David G. Johnston
On Thursday, March 14, 2024, Étienne BERSAC 
wrote:

>
> However, I'd prefer if Postgres fails properly. Because the GRANT is
> actually not revoked. This prevent ldap2pg to report an issue in
> handling privileges on such roles.
>
> What do you think of make this warning an error ?
>
>
The choice of a warning is made because after the command ends the grant in
question does not exist.  The revoke was a no-op and the final state is as
the user intended.  Historically doing this didn’t give any message at all,
which was confusing, so we added a warning: the semantics of not failing
were preserved, but there was some indication that something was amiss.  I
don’t have a compelling argument to change the long-standing behavior.
Client code can and probably should look for and show errors reported by the
backend.  It is indeed possible to treat this warning more severely than
the server chooses to.

David J.


Re: Catalog domain not-null constraints

2024-03-14 Thread Alvaro Herrera
On 2024-Mar-14, Peter Eisentraut wrote:

> Perhaps it would make sense if we change the ALTER TABLE command to be like
> 
> ALTER TABLE t1 ADD IF NOT EXISTS NOT NULL c1
> 
> Then the behavior is like one would expect.
> 
> For ALTER TABLE, we would reject this command if IF NOT EXISTS is not
> specified.  (Since this is mainly for pg_dump, it doesn't really matter for
> usability.)  For ALTER DOMAIN, we could accept both variants.

I don't understand why you want to change this behavior, though.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La victoria es para quien se atreve a estar solo"




Re: MERGE ... RETURNING

2024-03-14 Thread Alvaro Herrera
On 2024-Mar-13, Dean Rasheed wrote:

> On Wed, 13 Mar 2024 at 06:44, jian he  wrote:
> >
> > 
> > [ WITH with_query [, ...] ]
> > MERGE INTO [ ONLY ]  >
> > here the "WITH" part should have "[ RECURSIVE ]"
> 
> Actually, no. MERGE doesn't support WITH RECURSIVE.
> 
> It's not entirely clear to me why though. I did a quick test, removing
> that restriction in the parse analysis code, and it seemed to work
> fine. Alvaro, do you remember why that restriction is there?

There's no real reason for it, other than I didn't want to have to think
it through; I did suspect that it might Just Work, but I felt I would
have had to come up with more nontrivial test cases than I wanted to
write at the time.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"People get annoyed when you try to debug them."  (Larry Wall)




Re: Reports on obsolete Postgres versions

2024-03-14 Thread Robert Haas
On Wed, Mar 13, 2024 at 3:05 PM Laurenz Albe  wrote:
> I think we are pretty conservative with backpatching changes to the
> optimizer that could destabilize existing plans.

We have gotten better about that, which is good.

> I feel quite strongly that we should not use overly conservative language
> there.  If people feel that they have to test their applications for new
> minor releases, the only effect will be that they won't install minor 
> releases.
> Installing a minor release should be as routine as the operating system
> patches that many companies apply automatically during weekend maintenance
> windows.  They can also introduce bugs, and everybody knows and accepts that.

I don't agree with this. If we tell people that they don't need to
test their applications after an upgrade, I do not think that the
result will be that they skip the testing and everything works
perfectly. I think that the result will be that we lose all
credibility. Nobody who has been working on computers for longer than
a week is going to believe that a software upgrade can't break
anything, and someone whose entire business depends on things not
breaking is going to want to validate that they haven't. And they're
not wrong to think that way, either.

I think that whatever we say here should focus on what we try to do or
guarantee, not on what actions we think users ought to take, never
mind must take. We can say that we try to avoid making any changes
upon which an application might be relying -- but there surely is some
weasel-wording there, because we have made such changes before in the
name of security, and sometimes to fix bugs, and we will likely to do
so again in the future. But it's not for us to decide how much testing
is warranted. It's the user's system, not ours.

In the end, while I certainly don't mind improving the web page, I
think that a lot of what we're seeing here probably has to do with the
growing popularity and success of PostgreSQL. If you have more people
using your software, you're also going to have more people using
out-of-date versions of your software.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-14 Thread Tomas Vondra



On 3/13/24 23:38, Thomas Munro wrote:
> On Sun, Mar 3, 2024 at 11:41 AM Tomas Vondra
>  wrote:
>> On 3/2/24 23:28, Melanie Plageman wrote:
>>> On Sat, Mar 2, 2024 at 10:05 AM Tomas Vondra
>>>  wrote:
 With the current "master" code, eic=1 means we'll issue a prefetch for B
 and then read+process A. And then issue prefetch for C and read+process
 B, and so on. It's always one page ahead.
>>>
>>> Yes, that is what I mean for eic = 1
> 
> I spent quite a few days thinking about the meaning of eic=0 and eic=1
> for streaming_read.c v7[1], to make it agree with the above and with
> master.  Here's why I was confused:
> 
> Both eic=0 and eic=1 are expected to generate at most 1 physical I/O
> at a time, or I/O queue depth 1 if you want to put it that way.  But
> this isn't just about concurrency of I/O, it's also about computation.
> Duh.
> 
> eic=0 means that the I/O is not concurrent with executor computation.
> So, to annotate an excerpt from [1]'s random.txt, we have:
> 
> effective_io_concurrency = 0, range size = 1
> unpatched  patched
> ==
> pread(43,...,8192,0x58000) = 8192  pread(82,...,8192,0x58000) = 8192
>  *** executor now has page at 0x58000 to work on ***
> pread(43,...,8192,0xb) = 8192  pread(82,...,8192,0xb) = 8192
>  *** executor now has page at 0xb to work on ***
> 
> eic=1 means that a single I/O is started and then control is returned
> to the executor code to do useful work concurrently with the
> background read that we assume is happening:
> 
> effective_io_concurrency = 1, range size = 1
> unpatched  patched
> ==
> pread(43,...,8192,0x58000) = 8192  pread(82,...,8192,0x58000) = 8192
> posix_fadvise(43,0xb,0x2000,...)   posix_fadvise(82,0xb,0x2000,...)
>  *** executor now has page at 0x58000 to work on ***
> pread(43,...,8192,0xb) = 8192  pread(82,...,8192,0xb) = 8192
> posix_fadvise(43,0x108000,0x2000,...)  posix_fadvise(82,0x108000,0x2000,...)
>  *** executor now has page at 0xb to work on ***
> pread(43,...,8192,0x108000) = 8192 pread(82,...,8192,0x108000) = 8192
> posix_fadvise(43,0x16,0x2000,...)  posix_fadvise(82,0x16,0x2000,...)
> 
> In other words, 'concurrency' doesn't mean 'number of I/Os running
> concurrently with each other', it means 'number of I/Os running
> concurrently with computation', and when you put it that way, 0 and 1
> are different.
> 

Interesting. For some reason I thought with eic=1 we'd issue the fadvise
for page #2 before pread of page #1, so that there'd be 2 IO requests in
flight at the same time for a bit of time ... it'd give the fadvise more
time to actually get the data into page cache.

> Note that the first read is a bit special: by the time the consumer is
> ready to pull a buffer out of the stream when we don't have a buffer
> ready yet, it is too late to issue useful advice, so we don't bother.
> FWIW I think even in the AIO future we would have a synchronous read
> in that specific place, at least when using io_method=worker, because
> it would be stupid to ask another process to read a block for us that
> we want right now and then wait for it wake us up when it's done.
> 
> Note that even when we aren't issuing any advice because eic=0 or
> because we detected sequential access and we believe the kernel can do
> a better job than us, we still 'look ahead' (= call the callback to
> see which block numbers are coming down the pipe), but only as far as
> we need to coalesce neighbouring blocks.  (I deliberately avoid using
> the word "prefetch" except in very general discussions because it
> means different things to different layers of the code, hence talk of
> "look ahead" and "advice".)  That's how we get this change:
> 
> effective_io_concurrency = 0, range size = 4
> unpatched  patched
> ==
> pread(43,...,8192,0x58000) = 8192  pread(82,...,8192,0x58000) = 8192
> pread(43,...,8192,0x5a000) = 8192  preadv(82,...,2,0x5a000) = 16384
> pread(43,...,8192,0x5c000) = 8192  pread(82,...,8192,0x5e000) = 8192
> pread(43,...,8192,0x5e000) = 8192  preadv(82,...,4,0xb) = 32768
> pread(43,...,8192,0xb) = 8192  preadv(82,...,4,0x108000) = 32768
> pread(43,...,8192,0xb2000) = 8192  preadv(82,...,4,0x16) = 32768
> 
> And then once we introduce eic > 0 to the picture with neighbouring
> blocks that can be coalesced, "patched" starts to diverge even more
> from "unpatched" because it tracks the number of wide I/Os in
> progress, not the number of single blocks.
> 

So, IIUC this means (1) the patched code is more aggressive wrt
prefetching (because we prefetch more data overall, because master would

Re: Add publisher and subscriber to glossary documentation.

2024-03-14 Thread Alvaro Herrera
On 2024-Mar-14, Shlok Kyal wrote:

> Andrew Atkinson wrote:
>
> > Anyway, hopefully these examples show “node” and “database” are
> > mixed and perhaps others agree using one consistently might help the
> > goals of the docs.
> 
> For me the existing content looks good, I felt let's keep it as it is
> unless others feel differently.

Actually it's these small terminology glitches that give me pause.  If
we're going to have terms that are interchangeable (in this case "node"
and "database"), then they should be always interchangeable, not just in
some unspecified cases.  Maybe the idea of using "node" (which sounds
like something that's instance-wide) is wrong for logical replication,
which is necessarily something that happens database-locally.

Then again, maybe defining "node" as something that exists at a
database-local level when used in the context of logical replication is
sufficient.  In that case, it would be better to avoid defining it as a
synonym of "instance".  Then the terms are not always interchangeable,
but it's clear when they are and when they aren't.

"Node: in replication, each of the endpoints to which or
from which data is replicated.  In the context of physical replication,
each node is an instance.  In the context of logical replication, each
node is a database".

Does that make sense?

I'd also look at altering "Primary" and "Standby" so that it's clearer
that they're about physical replication, and don't mention "database"
anymore, since that's the wrong level.  Maybe turn them into "Primary
(node)" and "Standby (node)" instead.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-14 Thread Bharath Rupireddy
On Thu, Mar 14, 2024 at 12:24 PM Amit Kapila  wrote:
>
> On Wed, Mar 13, 2024 at 9:24 PM Bharath Rupireddy
> >
> > Yes, there will be some sort of duplicity if we emit conflict_reason
> > as a text field. However, I still think the better way is to turn
> > conflict_reason text to conflict boolean and set it to true only on
> > rows_removed and wal_level_insufficient invalidations. When conflict
> > boolean is true, one (including all the tests that we've added
> > recently) can look for invalidation_reason text field for the reason.
> > This sounds reasonable to me as opposed to we just mentioning in the
> > docs that "if invalidation_reason is rows_removed or
> > wal_level_insufficient it's the reason for conflict with recovery".
> >
> Fair point. I think we can go either way. Bertrand, Nathan, and
> others, do you have an opinion on this matter?

While we wait to hear from others on this, I'm attaching the v9 patch
set implementing the above idea (check 0001 patch). Please have a
look. I'll come back to the other review comments soon.
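
With 0001 applied, a monitoring query would presumably look something like
this (column names as in the attached patch):

SELECT slot_name, slot_type, conflicting, invalidation_reason
FROM pg_replication_slots
WHERE invalidation_reason IS NOT NULL;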

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 18855c08cd8bcbaf41aba10048f0ea23a246e546 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 14 Mar 2024 12:48:52 +
Subject: [PATCH v9 1/4] Track invalidation_reason in pg_replication_slots

Up until now, the reason for replication slot invalidation was not
tracked in pg_replication_slots. A recent commit 007693f2a added
conflict_reason to show the reasons for slot invalidation, but
only for logical slots.

This commit adds a new column to show invalidation reasons for
both physical and logical slots. It also turns the conflict_reason
text column into a conflicting boolean column
(effectively reverting commit 007693f2a). One can now look at the
new invalidation_reason column to see why a logical slot conflicts
with recovery.
---
 doc/src/sgml/ref/pgupgrade.sgml   |  4 +-
 doc/src/sgml/system-views.sgml| 63 +++
 src/backend/catalog/system_views.sql  |  5 +-
 src/backend/replication/logical/slotsync.c|  2 +-
 src/backend/replication/slot.c|  8 +--
 src/backend/replication/slotfuncs.c   | 25 +---
 src/bin/pg_upgrade/info.c |  4 +-
 src/include/catalog/pg_proc.dat   |  6 +-
 src/include/replication/slot.h|  2 +-
 .../t/035_standby_logical_decoding.pl | 35 ++-
 .../t/040_standby_failover_slots_sync.pl  |  4 +-
 src/test/regress/expected/rules.out   |  7 ++-
 12 files changed, 95 insertions(+), 70 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 58c6c2df8b..8de52bf752 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -453,8 +453,8 @@ make prefix=/usr/local/pgsql.new install
   
All slots on the old cluster must be usable, i.e., there are no slots
whose
-   pg_replication_slots.conflict_reason
-   is not NULL.
+   pg_replication_slots.conflicting
+   is not true.
   
  
  
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index be90edd0e2..f3fb5ba1b0 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2525,34 +2525,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
 
  
   
-   conflict_reason text
+   conflicting bool
   
   
-   The reason for the logical slot's conflict with recovery. It is always
-   NULL for physical slots, as well as for logical slots which are not
-   invalidated. The non-NULL values indicate that the slot is marked
-   as invalidated. Possible values are:
-   
-
- 
-  wal_removed means that the required WAL has been
-  removed.
- 
-
-
- 
-  rows_removed means that the required rows have
-  been removed.
- 
-
-
- 
-  wal_level_insufficient means that the
-  primary doesn't have a  sufficient to
-  perform logical decoding.
- 
-
-   
+   True if this logical slot conflicted with recovery (and so is now
+   invalidated). When this column is true, check
+   invalidation_reason column for the conflict
+   reason.
   
  
 
@@ -2581,6 +2560,38 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
  
 
+ 
+  
+   invalidation_reason text
+  
+  
+   The reason for the slot's invalidation. NULL if the
+   slot is currently actively being used. The non-NULL values indicate that
+   the slot is marked as invalidated. Possible values are:
+   
+
+ 
+  wal_removed means that the required WAL has been
+  removed.
+ 
+
+
+ 
+  rows_remov

Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2024-03-14 Thread Robert Haas
On Wed, Oct 4, 2023 at 9:12 PM James Coleman  wrote:
> All right, attached is a v3 which attempts to fix the wrong
> information with an economy of words. I may at some point submit a
> separate patch that adds a broader pruning section, but this at least
> brings the docs inline with reality insofar as they address it.

I don't think this is as good as what I proposed back on October 2nd.
IMHO, that version does a good job making the text accurate and clear,
and is directly responsive to your original complaint, namely, that
the root of the HOT chain can't be removed. But this version seems to
contain a number of incidental changes that are unrelated to that
point, e.g. "old versions" -> "old, no longer visible versions", "can
be completely removed" -> "may be pruned", and the removal of the
sentence "In summary, heap-only tuple updates can only be created - if
columns used by indexes are not updated" which AFAICT is both
completely correct as-is and unrelated to the original complaint.

Maybe I shouldn't be, but I'm slightly frustrated here. I thought I
had proposed an alternative which you found acceptable, but then you
proposed several more versions that did other things instead, and I
never really understood why we couldn't just adopt the version that
you seemed to think was OK. If there's a problem with that, say what
it is. If there's not, let's do that and move on.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Catalog domain not-null constraints

2024-03-14 Thread Peter Eisentraut

On 14.03.24 15:03, Alvaro Herrera wrote:

On 2024-Mar-14, Peter Eisentraut wrote:


Perhaps it would make sense if we change the ALTER TABLE command to be like

 ALTER TABLE t1 ADD IF NOT EXISTS NOT NULL c1

Then the behavior is like one would expect.

For ALTER TABLE, we would reject this command if IF NOT EXISTS is not
specified.  (Since this is mainly for pg_dump, it doesn't really matter for
usability.)  For ALTER DOMAIN, we could accept both variants.


I don't understand why you want to change this behavior, though.


Because in the abstract, the behavior of

ALTER TABLE t1 ADD 

should be to add a constraint.

In the current implementation, the behavior is different for different 
constraint types.






Re: Should we remove -Wdeclaration-after-statement?

2024-03-14 Thread Robert Haas
On Wed, Feb 7, 2024 at 7:55 PM Noah Misch  wrote:
> > So my suggestion is for people to respond with -1, -0.5, +-0, +0.5, or
> > +1 to indicate support against/for the change.
>
> I'm +1 for the change, for these reasons:
>
> - Fewer back-patch merge conflicts.  The decls section of long functions is a
>   classic conflict point.
> - A mid-func decl demonstrates that its var is unused in the first half of the
>   func.
> - We write Perl in the mixed decls style, without problems.
>
> For me personally, the "inconsistency" concern is negligible.  We allowed "for
> (int i = 0", and that inconsistency has been invisible to me.

This thread was interesting as an opinion poll, but it seems clear
that the consensus is still against the proposed change, so I've
marked the CommitFest entry rejected.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: abi-compliance-checker

2024-03-14 Thread Robert Haas
On Mon, Mar 4, 2024 at 7:50 AM Peter Eisentraut  wrote:
> Looking at this again, if we don't want the .xml files in the tree, then
> we don't really need this patch series.

Based on this, I've updated the status of this patch in the CommitFest
application to Withdrawn. If that's not correct, please feel free to
adjust.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Built-in CTYPE provider

2024-03-14 Thread Peter Eisentraut

On 14.03.24 09:08, Jeff Davis wrote:

0001 (the C.UTF-8 locale) is also close. Considering that most of the
infrastructure is already in place, that's not a large patch. You may
have some comments about the way I'm canonicalizing and validating in
initdb -- that could be cleaner, but it feels like I should refactor
the surrounding code separately first.


I have tested this against the libc locale C.utf8 that was available on 
the OS, and the behavior is consistent.


I wonder if we should version the builtin locales too.  We might make a 
mistake and want to change something sometime?


Tiny comments:

* src/bin/scripts/t/020_createdb.pl

The two added tests should have different names that tell them apart
(like the new initdb tests).

* src/include/catalog/pg_collation.dat

Maybe use 'and' instead of '&' in the description.


0002 (inlining utf8 functions) is also ready.


Seems ok.


For 0003 and beyond, I'd like some validation that it's what you had in
mind.


I'll look into those later.





Re: Make attstattarget nullable

2024-03-14 Thread Tomas Vondra



On 3/14/24 11:13, Peter Eisentraut wrote:
> On 12.03.24 14:32, Tomas Vondra wrote:
>> On 3/12/24 13:47, Peter Eisentraut wrote:
>>> On 06.03.24 22:34, Tomas Vondra wrote:
 0001
 

 1) I think this bit in ALTER STATISTICS docs is wrong:

 -  >>> class="parameter">new_target
 +  SET STATISTICS { >>> class="parameter">integer | DEFAULT }

 because it means we now have list entries for name, ..., new_name,
 new_schema, and then suddenly "SET STATISTICS { integer | DEFAULT }".
 That's a bit weird.
>>>
>>> Ok, how would you change it?  List out the full clauses of the other
>>> variants under Parameters as well?
>>
>> I'd go with a parameter, essentially exactly as it used to be, except
>> for adding the DEFAULT option. So the list would define new_target, and
>> mention DEFAULT as a special value.
> 
> Ok, done that way (I think).
> 

Seems OK to me.

 2) The newtarget handling in AlterStatistics seems rather confusing.
 Why
 does it get set to -1 just to ignore the value later? For a while I was
 99% sure ALTER STATISTICS ... SET STATISTICS DEFAULT will set the field
 to -1. Maybe ditching the first if block and directly checking
 stmt->stxstattarget before setting repl_val/repl_null would be better?
>>>
>>> But we also need to continue accepting -1 for default on input.  The
>>> current code achieves that, the proposed variant would not.
>>
>> OK, I did not realize that. But then maybe this should be explained in a
>> comment before the new "if" block, because people won't realize why it
>> needs to be this way.
> 
> In the new version, I tried to write this more explicitly, and updated
> tablecmds.c to match.

WFM. It still seems a bit hard to read, but I don't know how to do it
better. I guess it's how it has to be to deal with multiple default
values in a backwards-compatible way. Good thing is it's localized in
two places.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-14 Thread vignesh C
On Thu, 14 Mar 2024 at 15:49, Amit Kapila  wrote:
>
> On Thu, Mar 14, 2024 at 1:45 PM Masahiko Sawada  wrote:
> >
> > On Thu, Mar 14, 2024 at 2:27 PM Amit Kapila  wrote:
> > >
> > > On Thu, Mar 14, 2024 at 5:57 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > This fact makes me think that the slotsync worker might be able to
> > > > accept the primary_conninfo value even if there is no dbname in the
> > > > value. That is, if there is no dbname in the primary_conninfo, it uses
> > > > the username in accordance with the specs of the connection string.
> > > > Currently, the slotsync worker connects to the local database first
> > > > and then establishes the connection to the primary server. But if we
> > > > can reverse the two steps, it can get the dbname that has actually
> > > > been used to establish the remote connection and use it for the local
> > > > connection too. That way, the primary_conninfo generated by
> > > > pg_basebackup could work even without the patch. For example, if the
> > > > OS user executing pg_basebackup is 'postgres', the slotsync worker
> > > > would connect to the postgres database. Given the 'postgres' database
> > > > is created by default and 'postgres' OS user is used in common, I
> > > > guess it could cover many cases in practice actually.
> > > >
> > >
> > > I think this is worth investigating but I suspect that in most cases
> > > users will end up using a replication connection without specifying
> > > the user name and we may not be able to give a meaningful error
> > > message when slotsync worker won't be able to connect. The same will
> > > be true even when the dbname same as the username would be used.
> >
> > What do you mean by not being able to give a meaningful error message?
> >
> > If the slotsync worker uses the user name as the dbname, and such a
> > database doesn't exist, the error message the user will get is
> > "database "test_user" does not exist". ISTM the same is true when the
> > user specifies the wrong database in the primary_conninfo.
> >
>
> Right, the exact error message as mentioned by Shveta will be:
> ERROR:  could not connect to the primary server: connection to server
> at "127.0.0.1", port 5433 failed: FATAL: database "bckp_user" does not
> exist
>
> Now, without this idea, the ERROR message will be:
>  ERROR:  slot synchronization requires dbname to be specified in
> primary_conninfo
>
> I am not sure how much this matters but the second message sounds more useful.
>
> > >
> > > > Having said that, even with (or without) the above change, we might
> > > > want to change the pg_basebackup so that it writes the dbname to the
> > > > primary_conninfo if -R option is specified. Since the database where
> > > > the slotsync worker connects cannot be dropped while the slotsync
> > > > worker is running, the user might want to change the database to
> > > > connect, and it would be useful if they can do that using
> > > > pg_basebackup instead of modifying the configuration file manually.
> > > >
> > > > While the current approach makes sense to me, I'm a bit concerned that
> > > > we might end up having the pg_basebackup search the actual database
> > > > name (e.g. 'dbname=template1') from the .pgpass file instead of
> > > > 'dbname=replication'. As far as I tested on my environment, suppose
> > > > that I execute:
> > > >
> > > > pg_basebackup -D tmp -d "dbname=testdb" -R
> > > >
> > > > The pg_basebackup established a replication connection but looked for
> > > > the password of the 'testdb' database. This could be another
> > > > inconvenience for the existing users who want to use the slot
> > > > synchronization.
> > > >
> > >
> > > This is true because it is internally using logical replication
> > > connection (as we will set set replication=database).
> >
> > Did you mean the pg_basebackup is using a logical replication
> > connection in this case? As far as I tested, even if we specify dbname
> > to the -d option of pg_basebackup, it uses a physical replication
> > connection. For example, it can take a backup even if I specify a
> > non-existing database name.
> >
>
> You are right. I misunderstood some part of the code in GetConnection.
> However, I think my point is still valid that if the user has provided
> dbname in the connection string it means that she wants that database
> entry to be looked up, not the "replication" entry.
>
> >
> > >
> > > > But it's still just an idea and I might be missing something. And
> > > > given we're getting closer to the feature freeze, it would be a PG18
> > > > item.
> > > >
> > >
> > > +1. At this stage, it is important to discuss whether we should allow
> > > pg_baseback to write dbname (either a specified one or a default one)
> > > along with other parameters in primary_conninfo?
> > >
> >
> > True. While I basically agree that pg_basebackup writes dbname in
> > primary_conninfo, I'm concerned that writing "dbname=replication"
> > could be problematic. Quoting the case 3) Vignesh summa

Re: psql: fix variable existence tab completion

2024-03-14 Thread Alexander Korotkov
On Sun, Mar 3, 2024 at 5:37 PM Erik Wienhold  wrote:
> On 2024-03-03 03:00 +0100, Steve Chavez wrote:
> > psql has the :{?name} syntax for testing a psql variable existence.
> >
> > But currently doing \echo :{?VERB doesn't trigger tab completion.
> >
> > This patch fixes it. I've also included a TAP test.
>
> Thanks.  The code looks good, all tests pass, and the tab completion
> works as expected when testing manually.

A nice improvement.  I've checked why we have the '{' in WORD_BREAKS at
all, and whether we're going to break anything by removing it.  It seems
that '{' has been here from the very beginning and comes from the default
value of rl_basic_word_break_characters [1].  It also seems that :{?name}
is the only usage of the '{' sign in psql.  So, removing it from
WORD_BREAKS shouldn't break anything.
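
For reference, a minimal psql sketch of the syntax being completed here (the
variable name is made up; \echo prints TRUE because the variable is defined):

\set myvar 42
\echo :{?myvar}
\if :{?myvar}
  \echo 'myvar is defined'
\endif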

I'm going to push this patch if no objections.

Links.
1. 
https://tiswww.case.edu/php/chet/readline/readline.html#index-rl_005fbasic_005fword_005fbreak_005fcharacters

--
Regards,
Alexander Korotkov




Re: Add system identifier to backup manifest

2024-03-14 Thread Amul Sul
On Thu, Mar 14, 2024 at 12:48 AM Robert Haas  wrote:

> On Fri, Mar 8, 2024 at 12:14 AM Amul Sul  wrote:
> > Thank you for the improvement.
> >
> > The caller of verify_control_file() has the full path of the control
> file that
> > can pass it and avoid recomputing. With this change, it doesn't really
> need
> > verifier_context argument -- only the manifest's system identifier is
> enough
> > along with the control file path.  Did the same in the attached delta
> patch
> > for v11-0002 patch, please have a look, thanks.
>
> Those seem like sensible changes. I incorporated them and committed. I
> also:
>
> * ran pgindent, which changed a bit of your formatting
> * changed some BAIL_OUT calls to die; I think we should hardly ever be
> using BAIL_OUT, as that terminates the entire TAP test run, not just
> the current file
>

Thank you, Robert.

Regards,
Amul


Re: UUID v7

2024-03-14 Thread Peter Eisentraut

On 14.03.24 12:25, Andrey M. Borodin wrote:

I think the behavior of uuid_extract_var(iant) is wrong.  The code
takes just two bits to return, but the draft document is quite clear
that the variant is 4 bits (see Table 1).

Well, it was correct only for the implemented variant. I've made a version that 
implements the full Table 1 from section 4.1.

I think we are still interpreting this differently.  I think uuid_extract_variant should 
just return whatever is in those four bits. Your function comment says "Can return 
only 0, 0b10, 0b110 and 0b111.", which I don't think is correct.  It should 
return 0 through 15.

We will return "do not care" bits. These bits can confuse someone. E.g. for 
variant 0b10 we can return 8, 9, 10 and 11 randomly. Is that OK? BTW, for some 
reason the document lists numbers 1-15, but you are correct that the range is 0-15.


I agree it's confusing.  Before I studied the RFC 4122bis project, I 
didn't even know about variant vs. version.  I think overall people will 
find this more confusing than useful.  If you just want to know, "is 
this UUID of the kind specified in RFC 4122", you can query it with 
uuid_extract_version(x) IS NOT NULL.  So maybe we don't need the 
_extract_variant function?






Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

2024-03-14 Thread Heikki Linnakangas

On 18/02/2024 00:31, Tomas Vondra wrote:

Do you plan to continue working on this patch? I did take a look,
and on the whole it looks reasonable - it modifies the right places etc.


+1


I think there are two things that may need an improvement:

1) Storing variable-length data in ParallelBitmapHeapState

I agree with Robert the snapshot_and_stats name is not great. I see
Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
the reasons are somewhat different (phs_snapshot_off exists because we
don't know which exact struct will be allocated), while here we simply
need to allocate two variable-length pieces of memory. But it seems like
it would work nicely for this. That is, we could try adding an offset
for each of those pieces of memory:

  - snapshot_off
  - stats_off

I don't like the GetSharedSnapshotData name very much, it seems very
close to GetSnapshotData - quite confusing, I think.

Dmitry also suggested we might add a separate piece of shared memory. I
don't quite see how that would work for ParallelBitmapHeapState, but I
doubt it'd be simpler than having two offsets. I don't think the extra
complexity (paid by everyone) would be worth it just to make EXPLAIN
ANALYZE work.


I just removed phs_snapshot_data in commit 84c18acaf6. I thought that 
would make this moot, but now that I rebased this, there are stills some 
aesthetic questions on how best to represent this.


In all the other node types that use shared instrumentation like this, 
the pattern is as follows: (using Memoize here as an example, but it's 
similar for Sort, IncrementalSort, Agg and Hash)


/* 
 *   Shared memory container for per-worker memoize information
 * 
 */
typedef struct SharedMemoizeInfo
{
int num_workers;
MemoizeInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
} SharedMemoizeInfo;

/* this struct is backend-private */
typedef struct MemoizeState
{
ScanState   ss; /* its first field is 
NodeTag */
...
MemoizeInstrumentation stats;   /* execution statistics */
SharedMemoizeInfo *shared_info; /* statistics for parallel workers */
} MemoizeState;

While the scan is running, the node updates its private data in 
MemoizeState->stats. At the end of a parallel scan, the worker process 
copies the MemoizeState->stats to MemoizeState->shared_info->stats, 
which lives in shared memory. The leader process copies 
MemoizeState->shared_info->stats to its own backend-private copy, which 
it then stores in its MemoizeState->shared_info, replacing the pointer 
to the shared memory with a pointer to the private copy. That happens in 
ExecMemoizeRetrieveInstrumentation().
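
Roughly, that retrieval step looks like this (a condensed sketch based on
nodeMemoize.c, not a verbatim copy):

void
ExecMemoizeRetrieveInstrumentation(MemoizeState *node)
{
    Size        size;
    SharedMemoizeInfo *si;

    /* nothing to do if no workers reported statistics */
    if (node->shared_info == NULL)
        return;

    /* copy the shared-memory stats into backend-private memory */
    size = offsetof(SharedMemoizeInfo, sinstrument)
        + node->shared_info->num_workers * sizeof(MemoizeInstrumentation);
    si = palloc(size);
    memcpy(si, node->shared_info, size);

    /* point shared_info at the private copy, which outlives the DSM */
    node->shared_info = si;
}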


This is a little different for parallel bitmap heap scans, because a 
bitmap heap scan keeps some other data in shared memory too, not just 
instrumentation data. Also, the naming is inconsistent: the equivalent 
of SharedMemoizeInfo is actually called ParallelBitmapHeapState. I think 
we should rename it to SharedBitmapHeapInfo, to make it clear that it 
lives in shared memory, but I digress.


We could now put the new stats at the end of ParallelBitmapHeapState as 
a varlen field. But I'm not sure that's a good idea. In 
ExecBitmapHeapRetrieveInstrumentation(), would we make a backend-private 
copy of the whole ParallelBitmapHeapState struct, even though the other 
fields don't make sense after the shared memory is released? Sounds 
confusing. Or we could introduce a separate struct for the stats, and 
copy just that:


typedef struct SharedBitmapHeapInstrumentation
{
int num_workers;
BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
} SharedBitmapHeapInstrumentation;

typedef struct BitmapHeapScanState
{
ScanState   ss; /* its first field is 
NodeTag */
...
SharedBitmapHeapInstrumentation sinstrument;
} BitmapHeapScanState;

that compiles, at least with my compiler, but I find it weird to have a 
variable-length inner struct embedded in an outer struct like that.


Long story short, I think it's still better to store 
ParallelBitmapHeapInstrumentationInfo separately in the DSM chunk, not 
as part of ParallelBitmapHeapState. Attached patch does that, rebased 
over current master.



I didn't address any of the other things that you, Tomas, pointed out, 
but I think they're valid concerns.


--
Heikki Linnakangas
Neon (https://neon.tech)
From 04f11a37ee04b282c51e9dae68ead7c9c3d5fb3d Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Tue, 8 Nov 2022 19:40:31 +0100
Subject: [PATCH v3 1/1] Parallel Bitmap Heap Scan reports per-worker stats

Similarly to other nodes (e.g. hash join, sort, memoize),
Bitmap Heap Scan now reports per-worker stats in the EXPLAIN
ANALYZE output. Previously only the heap blocks stats for the
leader were reported which was incomplete in parallel scans.

Discussion: https://

Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-03-14 Thread Robert Haas
On Fri, Mar 8, 2024 at 6:47 AM Jelte Fennema-Nio  wrote:
> 1. 0001-0005 are needed for any protocol change, and hopefully
> shouldn't require much discussion

I feel bad arguing about the patches that you think are a slam-dunk,
but I find myself disagreeing with the design choices.

Regarding 0001, I considered making this change as part of
ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed but decided against it,
because it seemed like it was making the assumption that we always
wanted to initiate new connections with the latest protocol version
that we know how to accept, and I wasn't sure that would always be
true. I don't think it would be catastrophic if this got committed or
anything -- it could always be changed later if the need arose -- but
I wanted to mention that I had a reason for not doing it, and I'm
still not particularly convinced that we should do it.

I'm really unhappy with 0002-0004. Before
ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed, any parameter included in
the startup message that wasn't in a short, hard-coded list was
treated as a request to set a GUC. That left no room for any other
type of protocol modification, so that commit carved out an exception,
where when we see something that starts with _pq_., it's assumed to be
setting some other kind of thing, not a GUC. But 0004 throws that out
the window and decides, nope, those are just GUCs, too. Even if we
don't have a specific reason today why we'd ever want such a thing, it
seems short-sighted to give up on the possibility that in the future
we will. Because if we implement what this patch wants to do in this
way, basically consuming the entire namespace that
ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed created in one shot, and then
later we want the sort of thing that I'm postulating, we'll have to
manufacture another new namespace for that need.

And it seems to me that there are other ways we could do this. For
example, suppose we introduced just one new protocol parameter; for
the sake of argument, I'll call it _pq_.protocol_set. If the client
sends this parameter and the server accepts it, then the client knows
that the server supports a new protocol message ProtocolSetParameter,
which is the only way to set GUCs in the new PROTOCOL_EXTENSION
category. New libpq functions with names like, I don't know,
PQsetProtocolParameter(), are added to send such messages, and they
return an error if there are network problems or whatever, or if the
server didn't accept the _pq_.protocol_set option at startup time.

With this kind of design, you're just consuming one element of the
_pq_ namespace, and the next person who wants to do something can
consume one more element, and we'll be able to go on for a very long
time without running out of room. This is really how I intended this
mechanism to be used, and the only real downside I see as compared to
what you've done is that you can't set the protocol GUCs in the
startup packet, but must set them afterward via separate messages. If
that's a problem, then the proposal I just outlined needs modification
... but no matter what we do exactly, I don't want the very first
protocol extension we ever add to eat up all of _pq_. I intended that
to support decades worth of extensibility, not just one patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Reports on obsolete Postgres versions

2024-03-14 Thread Peter Eisentraut

On 13.03.24 18:12, Bruce Momjian wrote:

On Wed, Mar 13, 2024 at 09:21:27AM -0700, Jeremy Schneider wrote:

It's not just roadmaps and release pages where we mix up these terms
either, it's even in user-facing SQL and libpq routines: both
PQserverVersion and current_setting('server_version_num') return the
patch release version in the numeric patch field, rather than the
numeric minor field (which is always 0).

In my view, the best thing would be to move toward consistently using
the word "patch" and moving away from the word "minor" for the
PostgreSQL quarterly maintenance updates.


I think "minor" is a better term since it contrasts with "major".  We
don't actually supply patches to upgrade minor versions.


There are potentially different adjectives that could apply to "version" 
and "release".


The version numbers can be called major and minor, because that just 
describes their ordering and significance.


But I do agree that "minor release" isn't quite as clear, because one 
could also interpret that as "a release, but a bit smaller this time". 
(Also might not translate well, since "minor" and "small" could 
translate to the same thing.)


One could instead, for example, describe those as "maintenance releases":

"Are you on the latest maintenance release?  Why not?  Are you not 
maintaining your database?"


That carries much more urgency than the same with "minor".

A maintenance release corresponds to an increase in the minor version 
number.  It's still tied together, but has different terminology.


The last news item reads:

"The PostgreSQL Global Development Group has released an update to all 
supported versions of PostgreSQL"


which has no urgency, but consider

"The PostgreSQL Global Development Group has published maintenance 
releases to all supported versions of PostgreSQL"






Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value

2024-03-14 Thread Nathan Bossart
On Wed, Mar 13, 2024 at 01:09:18PM +0700, Yugo NAGATA wrote:
> On Tue, 12 Mar 2024 22:07:17 -0500
> Nathan Bossart  wrote:
>> I did some light editing to prepare this for commit.
> 
> Thank you. I confirmed the test you improved and I am fine with that.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: DOCS: add helpful partitioning links

2024-03-14 Thread Ashutosh Bapat
Hi Robert,

On Thu, Mar 7, 2024 at 10:49 PM Robert Treat  wrote:

> This patch adds a link to the "attach partition" command section
> (similar to the detach partition link above it) as well as a link to
> "create table like" as both commands contain additional information
> that users should review beyond what is laid out in this section.
> There's also a couple of wordsmiths in nearby areas to improve
> readability.
>

Thanks.

The patch gives an error when building html:

ddl.sgml:4300: element link: validity error : No declaration for attribute
linked of element link (in the CREATE TABLE ... LIKE link)

On "outside" vs. "outside of", see
https://english.stackexchange.com/questions/9700/outside-or-outside-of. But
I think the meaning of the sentence will be more clear if we rephrase it as
in the attached patch.

- convenient, as not only will the existing partitions become indexed,
but
- also any partitions that are created in the future will.  One
limitation is
+ convenient as not only will the existing partitions become indexed,
but
+ any partitions created in the future will as well.  One limitation is

I am finding the current construct hard to read. The comma is misplaced as
you have pointed out. The pair of commas break the "not only" ... "but
also" construct. I have tried to simplify the sentence in the attached.
Please review.

- the partitioned table; such an index is marked invalid, and the
partitions
- do not get the index applied automatically.  The indexes on
partitions can
- be created individually using CONCURRENTLY, and
then
+ the partitioned table; such an index is marked invalid and the
partitions
+ do not get the index applied automatically.  The partition indexes can

"indexes on partition" is clearer than "partition index". Fixed in the
attached patch.

Please review.

-- 
Best Wishes,
Ashutosh Bapat
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 8616a8e9cc..043717136c 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4283,18 +4283,20 @@ CREATE TABLE measurement_y2008m02 PARTITION OF measurement
 TABLESPACE fasttablespace;
 
 
- As an alternative, it is sometimes more convenient to create the
- new table outside the partition structure, and attach it as a
- partition later. This allows new data to be loaded, checked, and
- transformed prior to it appearing in the partitioned table.
- Moreover, the ATTACH PARTITION operation requires
- only SHARE UPDATE EXCLUSIVE lock on the
- partitioned table, as opposed to the ACCESS
- EXCLUSIVE lock that is required by CREATE TABLE
- ... PARTITION OF, so it is more friendly to concurrent
- operations on the partitioned table.
- The CREATE TABLE ... LIKE option is helpful
- to avoid tediously repeating the parent table's definition:
+ As an alternative, it is sometimes more convenient to create the partition
+ as a standalone new table, and attach it to the partitioned table later.
+ This allows new data to be loaded, checked, and transformed prior to it
+ appearing in the partitioned table. Moreover, the ATTACH
+ PARTITION operation requires only SHARE UPDATE
+ EXCLUSIVE lock on the partitioned table, as opposed to the
+ ACCESS EXCLUSIVE lock that is required by
+ CREATE TABLE ... PARTITION OF, so it is more friendly
+ to concurrent operations on the partitioned table; see ALTER TABLE ... ATTACH
+ PARTITION for additional details. The CREATE TABLE ...
+ LIKE command can be helpful to avoid tediously repeating
+ the parent table's definition, for example:
 
 
 CREATE TABLE measurement_y2008m02
@@ -4345,16 +4347,14 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
 
  As explained above, it is possible to create indexes on partitioned tables
- so that they are applied automatically to the entire hierarchy.
- This is very
- convenient, as not only will the existing partitions become indexed, but
- also any partitions that are created in the future will.  One limitation is
+ so that they are applied automatically to the entire hierarchy. The operation of creating an index on a partitioned table also creates corresponding indexes on its partitions. Any new partition created in future also inherits the indexes on the partitioned table.
+ One limitation is
  that it's not possible to use the CONCURRENTLY
  qualifier when creating such a partitioned index.  To avoid long lock
  times, it is possible to use CREATE INDEX ON ONLY
- the partitioned table; such an index is marked invalid, and the partitions
+ the partitioned table; such an index is marked invalid and the partitions
  do not get the index applied automatically.  The indexes on partitions can
- be created individually using CONCURRENTLY, and then
+ then be created individually using CONCURRENTLY and
  attached to the index on the parent using
  ALTER INDEX .. ATTACH PA

Re: Possibility to disable `ALTER SYSTEM`

2024-03-14 Thread Robert Haas
On Tue, Feb 13, 2024 at 2:05 AM Joel Jacobson  wrote:
> > Wouldn't having system wide EVTs be a generic solution which could be the
> > infrastructure for this requested change as well as others in the same area?
>
> +1
>
> I like the wider vision of providing the necessary infrastructure to provide 
> a solution for the general case.

We don't seem to be making much progress here.

As far as I can see from reading the thread, most people agree that
it's reasonable to have some way to disable ALTER SYSTEM, but there
are at least six competing ideas about how to do that:

1. command-line option
2. GUC
3. event trigger
4. extension
5. sentinel file
6. remove permissions on postgresql.auto.conf

As I see it, (5) or (6) are most convenient for the system
administrator, since they let that person make changes without needing
to log into the database or, really, worry very much about the
database's usual configuration mechanisms at all, and (5) seems like
less work to implement than (6), because (6) probably breaks a bunch
of client tools in weird ways that might not be easy for us to
discover during patch review. (1) doesn't allow changing things at
runtime, and might require the system administrator to fiddle with the
startup scripts, which seems like it could be inconvenient. (2) and
(3) seem like they put the superuser in a position to easily reverse a
policy about what the superuser ought to do, but in the case of (2),
that can be mitigated if the GUC can only be set in postgresql.conf
and not elsewhere. (4) has no real advantages except for allowing core
to maintain the fiction that we don't support this while actually
supporting it; I think we should reject that approach outright.

So what I'd like to see is a patch that implements (5), or in the
alternative (2) but with the GUC being PGC_SIGHUP and
GUC_DISALLOW_IN_AUTO_FILE. I believe there would be adequate consensus
to proceed with either of those approaches. Anybody feel like coding
it up?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add system identifier to backup manifest

2024-03-14 Thread Robert Haas
On Thu, Mar 14, 2024 at 11:05 AM Amul Sul  wrote:
> Thank you, Robert.

Thanks for the patch!

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Psql meta-command conninfo+

2024-03-14 Thread Nathan Bossart
Hi Maiquel,

On Thu, Feb 29, 2024 at 10:02:21PM +, Maiquel Grassi wrote:
> Sorry for the delay. I will make the adjustments as requested soon.

We have only a few weeks left before feature-freeze for v17.  Do you think
you will be able to send an updated patch soon?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-03-14 Thread Jelte Fennema-Nio
On Thu, 14 Mar 2024 at 16:45, Robert Haas  wrote:
> I feel bad arguing about the patches that you think are a slam-dunk,
> but I find myself disagreeing with the design choices.

No worries, thanks a lot for responding. I'm happy to discuss this
design further. I didn't necessarily mean these patches were a
slam-dunk. I mainly meant that these first few patches were not
specific to any protocol change, but are changes we should agree on
before any change to the protocol is possible at all. Based on your
response, we currently disagree on a bunch of core things.

I'll first try to summarize my view on (future) protocol changes and
why I think the current core design in this patchset is the correct
path forward, and then go into some details inline in your response
below:

In my view there can be, **by definition,** only two general types of
protocol changes:
1. A "simple protocol change": This is one that requires agreement by
both the client and server, that they understand the new message types
involved in this change. e.g. the ParameterSet message proposal (this
message type is either supported or it's not)
2. A "parameterized protocol change": This requires the same as 1, but
should also allow some parameterization from the client, e.g. for the
compression proposal, the client should specify what compression
algorithm the server should use to compress data when sending it to
the client.

Client and Server can agree that a "simple protocol change" is
supported by both advertising a minimum minor protocol version. And
for a "parameterized protocol change" the client can send a _pq_
parameter in the startup packet.

So, new _pq_ parameters should only ever be introduced for
parameterized protocol changes. They are not meant to advertise
support, they are meant to configure protocol features. For a
non-configurable protocol feature, we'd simply bump the protocol
version. And since _pq_ parameters thus always control some setting at
the session level, we can simply store it as a GUC, because that's how
we store all our parameters at a session level.

> Regarding 0001, I considered making this change as part of
> ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed but decided against it,
> because it seemed like it was making the assumption that we always
> wanted to initiate new connections with the latest protocol version
> that we know how to accept, and I wasn't sure that would always be
> true.

I think given the automatic downgrade supported by the
NegotiateProtocolVersion, there's no real down-side to requesting the
newest version by default. The only downside I can see is when
connecting to other applications (i.e. non PostgreSQL servers) that
implement the postgres protocol, but don't implement
NegotiateProtocolVersion. But for that I added the
max_protocol_version connection option in 0006 (of my latest
patchset).

> I'm really unhappy with 0002-0004.

Just to be clear, (afaict) your argument below seems to only really be
about 0004, not about 0002 or 0003. Was there anything about 0002 &
0003 that you were unhappy with? 0002 & 0003 are not dependent on 0004
imho. Because even when not making _pq_ parameters map to GUCs, we'd
still need to change libpq to not instantly close the connection
whenever a _pq_ parameter (or new protocol version) is not supported
by the server (which is what 0002 & 0003 do).

> That left no room for any other
> type of protocol modification, so that commit carved out an exception,
> where when we something that starts with _pq_., it's assumed to be
> setting some other kind of thing, not a GUC.

Okay, our interpretation is very different here. From my perspective
introducing a non-GUC namespace is NOT AT ALL the benefit of the _pq_
prefix. The main benefit (imho) is that it allows sending an
"optional" parameter (i.e GUC) in the startup packet. So, one where if
the server doesn't recognize it the connection attempt still succeeds.
If you specify a normal GUC in the connection parameters and the
server doesn't know about it, the server will close the connection.
So, to be able to send a GUC that depends on the protocol and/or
server version in an optional way, you'd need to wait for an extra
network roundtrip until the server tells you what protocol and/or
server version they are.

> But 0004 throws that out
> the window and decides, nope, those are just GUCs, too. Even if we
> don't have a specific reason today why we'd ever want such a thing, it
> seems short-sighted to give up on the possibility that in the future
> we will.

Since I believe a _pq_ parameter should only be used to control
settings at the session level, I don't think it would be short-sighted
to give up on the possibility of storing them as anything other than GUCs.
Because in the many years that we've had GUCs, we've been able to
store all session settings using that infrastructure. BUT PLEASE NOTE:
I don't think we are giving up on the thing you're describing (see
explanation in final part of this email)

> With this 

Re: Recent 027_streaming_regress.pl hangs

2024-03-14 Thread Alexander Lakhin

Hello Thomas and Michael,

14.03.2024 06:16, Thomas Munro wrote:


Yeah, I was wondering if its checkpoint delaying logic might have
got the checkpointer jammed or something like that, but I don't
currently see how.  Yeah, the replay of bulk newpages could be
relevant, but it's not exactly new technology.  One thing I wondered
about is whether the Perl "wait for catchup" thing, which generates
large volumes of useless log, could be somehow changed to actually
show the progress after some time.  Something like "I'm still waiting
for this replica to reach LSN X, but it has so far only reported LSN
Y, and here's a dump of the WAL around there"?


I have perhaps reproduced the issue here (at least I'm seeing something
similar), and am going to investigate it in the coming days, but what
I'm confused by now is the duration of poll_query_until:
For the failure you referenced:
[15:55:54.740](418.725s) # poll_query_until timed out executing this query:

And a couple of others:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-08%2000%3A34%3A06
[00:45:57.747](376.159s) # poll_query_until timed out executing this query:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-04%2016%3A32%3A17
[16:45:24.870](407.970s) # poll_query_until timed out executing this query:

Could it be that the timeout (360 sec?) is just not enough for the test
under the current (changed due to switch to meson) conditions?

Best regards,
Alexander




Re: linux cachestat in file Readv and Prefetch

2024-03-14 Thread Robert Haas
On Sat, Feb 17, 2024 at 6:10 PM Tomas Vondra
 wrote:
> I may be missing some important bit behind this idea, but this does not
> seem like a great idea to me. The comment added to FilePrefetch says this:
>
>   /*
>* last time we visit this file (somewhere), nr_recently_evicted pages
>* of the range were just removed from vm cache, it's a sign a memory
>* pressure. so do not prefetch further.
>* it is hard to guess if it is always the right choice in absence of
>* more information like:
>*  - prefetching distance expected overall
>*  - access pattern/backend maybe
>*/
>
> Firstly, is this even a good way to detect memory pressure? It's clearly
> limited to a single 1GB segment, so what's the chance we'll even see the
> "pressure" on a big database with many files?
>
> If we close/reopen the file (which on large databases we tend to do very
> often) how does that affect the data reported for the file descriptor?
>
> I'm not sure I even agree with the idea that we should stop prefetching
> when there is memory pressure. IMHO it's perfectly fine to keep
> prefetching stuff even if it triggers eviction of unnecessary pages from
> page cache. That's kinda why the eviction exists.

I agree with all of these criticisms. I think it's the job of
pg_prewarm to do what the user requests, not to second-guess whether
the user requested the right thing. One of the things that frustrates
people about the ring-buffer system is that it's hard to get all of
your data cached in shared_buffers by just reading it, e.g. SELECT *
FROM my_table. If pg_prewarm also isn't guaranteed to actually read
your data, and may decide that your data didn't need to be read after
all, then what exactly is a user supposed to do if they're absolutely
sure that they know better than PostgreSQL itself and want to
guarantee that their data actually does get read?
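
For context, the explicit pg_prewarm call under discussion looks like this;
the table name is taken from the example above, and 'buffer' (the default
mode) loads the relation into shared_buffers rather than the OS cache:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('my_table', 'buffer');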

So I think a feature like this would at the very least need to be
optional, but it's unclear to me why we'd want it at all, and I feel
like Cedric's email doesn't really answer that question. I suppose
that if you could detect useless prefetching and skip it, you'd save a
bit of work, but under what circumstances does anyone use pg_prewarm
so aggressively as to make that a problem, and why wouldn't the
solution be for the user to just calm down a little bit? There
shouldn't be any particular reason why the user can't know both the
size of shared_buffers and the approximate size of the OS cache;
indeed, they can probably know the latter much more reliably than
PostgreSQL itself can. So it should be fairly easy to avoid just
prefetching more data than will fit, and then you don't have to worry
about any of this. And you'll probably get a better result, too,
because, along the same lines as Tomas's remarks above, I doubt that
this would be an accurate method anyway.

> Well ... I'd argue at least some basic evaluation of performance is a
> rather important / expected part of a proposal for a patch that aims to
> improve a performance-focused feature. It's impossible to have any sort
> of discussion about such patch without that.

Right.

I'm going to mark this patch as Rejected in the CommitFest application
for now. If in subsequent discussion that comes to seem like the wrong
result, then we can revise accordingly, but right now it looks
extremely unlikely to me that this is something that we'd want.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: UUID v7

2024-03-14 Thread Andrey M. Borodin



> On 14 Mar 2024, at 20:10, Peter Eisentraut  wrote:
> 
> I think the behavior of uuid_extract_var(iant) is wrong.  The code
> takes just two bits to return, but the draft document is quite clear
> that the variant is 4 bits (see Table 1).
 Well, it was correct only for the implemented variant. I've made a version that 
 implements the full Table 1 from section 4.1.
>>> I think we are still interpreting this differently.  I think 
>>> uuid_extract_variant should just return whatever is in those four bits. 
>>> Your function comment says "Can return only 0, 0b10, 0b110 and 0b111.", 
>>> which I don't think is correct.  It should return 0 through 15.
>> We will return "do not care" bits. These bits can confuse someone. E.g. for 
>> variant 0b10 we can return 8, 9, 10 and 11 randomly. Is that OK? BTW, for some 
>> reason the document lists numbers 1-15, but you are correct that the range is 0-15.
> 
> I agree it's confusing.  Before I studied the RFC 4122bis project, I didn't 
> even know about variant vs. version.  I think overall people will find this 
> more confusing than useful.  If you just want to know, "is this UUID of the 
> kind specified in RFC 4122", you can query it with uuid_extract_version(x) IS 
> NOT NULL.  So maybe we don't need the _extract_variant function?

I think it's the best possible solution. The variant has no value besides 
detecting if a version can be extracted.


Best regards, Andrey Borodin.



Re: Pre-proposal: unicode normalized text

2024-03-14 Thread Jeff Davis
On Thu, 2024-02-29 at 17:02 -0800, Jeff Davis wrote:
> Attached is an implementation of a per-database option STRICT_UNICODE
> which enforces the use of assigned code points only.

The CF app doesn't seem to point at the latest patch:

https://www.postgresql.org/message-id/a0e85aca6e03042881924c4b31a840a915a9d349.ca...@j-davis.com

which is perhaps why nobody has looked at it yet.

But in any case, I'm OK if this gets bumped to 18. I still think it's a
good feature, but some of the value will come later in v18 anyway, when
I plan to propose support for case folding. Case folding is a version
of lowercasing with compatibility guarantees when you only use assigned
code points.

Regards,
Jeff Davis





Re: broken JIT support on Fedora 40

2024-03-14 Thread Robert Haas
On Wed, Mar 6, 2024 at 1:54 AM Pavel Stehule  wrote:
> after today update, the build with --with-llvm produces broken code, and make 
> check fails with crash
>
> Upgradehwdata-0.380-1.fc40.noarch   
> @updates-testing
> Upgraded   hwdata-0.379-1.fc40.noarch   @@System
> Upgradellvm-18.1.0~rc4-1.fc40.x86_64
> @updates-testing
> Upgraded   llvm-17.0.6-6.fc40.x86_64@@System
> Upgradellvm-devel-18.1.0~rc4-1.fc40.x86_64  
> @updates-testing
> Upgraded   llvm-devel-17.0.6-6.fc40.x86_64  @@System
> Upgradellvm-googletest-18.1.0~rc4-1.fc40.x86_64 
> @updates-testing
> Upgraded   llvm-googletest-17.0.6-6.fc40.x86_64 @@System
> Upgradellvm-libs-18.1.0~rc4-1.fc40.i686 
> @updates-testing
> Upgraded   llvm-libs-17.0.6-6.fc40.i686 @@System
> Upgradellvm-libs-18.1.0~rc4-1.fc40.x86_64   
> @updates-testing
> Upgraded   llvm-libs-17.0.6-6.fc40.x86_64   @@System
> Upgradellvm-static-18.1.0~rc4-1.fc40.x86_64 
> @updates-testing
> Upgraded   llvm-static-17.0.6-6.fc40.x86_64 @@System
> Upgradellvm-test-18.1.0~rc4-1.fc40.x86_64   
> @updates-testing
> Install llvm17-libs-17.0.6-7.fc40.i686   
> @updates-testing
> Install llvm17-libs-17.0.6-7.fc40.x86_64 
> @updates-testing

I don't know anything about LLVM, but somehow I'm guessing that people
who do will need more detail than this in order to be able to fix the
problem. I see that support for LLVM 18 was added in commit
d282e88e50521a457fa1b36e55f43bac02a3167f on January 18th; perhaps you
were building from an older commit?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Possibility to disable `ALTER SYSTEM`

2024-03-14 Thread Jelte Fennema-Nio
On Thu, 14 Mar 2024 at 17:37, Robert Haas  wrote:
> or in the
> alternative (2) but with the GUC being PGC_SIGHUP and
> GUC_DISALLOW_IN_AUTO_FILE. I believe there would be adequate consensus
> to proceed with either of those approaches. Anybody feel like coding
> it up?

Here is a slightly modified version of Gabrielle's original patch,
which already implemented the GUC approach. The changes I made are adding
PGC_SIGHUP and GUC_DISALLOW_IN_AUTO_FILE, as well as adding some more
docs.
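
To illustrate, with this patch the knob would be used roughly like this, in
postgresql.conf only (the GUC name is taken from the attached patch; being
PGC_SIGHUP and GUC_DISALLOW_IN_AUTO_FILE it cannot be set via ALTER SYSTEM
itself):

# postgresql.conf
enable_alter_system = off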


v2-0001-Add-enable_alter_system-GUC.patch
Description: Binary data


Re: Possibility to disable `ALTER SYSTEM`

2024-03-14 Thread Tom Lane
Robert Haas  writes:
> As far as I can see from reading the thread, most people agree that
> it's reasonable to have some way to disable ALTER SYSTEM, but there
> are at least six competing ideas about how to do that:

> 1. command-line option
> 2. GUC
> 3. event trigger
> 4. extension
> 5. sentinel file
> 6. remove permissions on postgresql.auto.conf

With the possible exception of #1, every one of these is easily
defeatable by an uncooperative superuser.  I'm not excited about
adding a "security" feature with such obvious holes in it.
We reverted MAINTAIN last year for much less obvious holes;
how is it that we're going to look the other way on this one?

#2 with the GUC_DISALLOW_IN_AUTO_FILE flag can be made secure
(I think) by putting the main postgresql.conf file outside the
data directory and then making it not owned by or writable by the
postgres user.  But I doubt that's a common configuration, and
I'm sure we will get complaints from people who failed to set it
up that way.  The proposed patch certainly doesn't bother to
document the hazard.

Really we'd need to do something about removing superusers'
access to the filesystem in order to build something with
fewer holes.  I'm not against inventing such a feature,
but it'd take a fair amount of work and likely would end
in a noticeably less usable system (no plpython for example).

regards, tom lane




Re: JIT compilation per plan node

2024-03-14 Thread Robert Haas
On Tue, Feb 20, 2024 at 5:31 AM Tomas Vondra
 wrote:
> I certainly agree that the current JIT costing is quite crude, and we've
> all seen cases where the decision turns out to not be great. And I think
> the plan to make the decisions at the node level makes sense, so +1 to
> that in general.

Seems reasonable to me also.

> And I think you're right that looking just at the node total cost may
> not be sufficient - that we may need a better cost model, considering
> how many times an expression is executed and so on. But I think we
> should try to do this in smaller steps, meaningful on their own,
> otherwise we won't move at all. The two threads linked by Melih are ~4y
> old and *nothing* changed since then, AFAIK.
>
> I think it's reasonable to start by moving the decision to the node
> level - it's where the JIT happens, anyway. It may not be perfect, but
> it seems like a clear improvement. And if we then choose to improve the
> "JIT cost model" to address some of the issues you pointed out, surely
> that would need to happen at the node level too ...

I'm not sure I understand whether you (Tomas) think that this patch is
a good idea or a bad idea as it stands. I read the first of these two
paragraphs to suggest that the patch hasn't really evolved much in the
last few years, perhaps suggesting that if it wasn't good enough to
commit back then, it still isn't now. But the second of these two
paragraphs seems more supportive.

From my own point of view, I definitely agree with David's statement
that what we really want to know is how many times each expression
will be evaluated. If we had that information, or just an estimate, I
think we could make much better decisions in this area. But we don't
have that infrastructure now, and it doesn't seem easy to create, so
it seems to me that what we have to decide now is whether applying a
cost threshold on a per-plan-node basis will produce better or worse
results than what making one decision for the whole plan. David's
provided an example of where it does indeed work better back in
https://www.postgresql.org/message-id/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com
- but could there be enough cases where the opposite happens to make
us think that the patch is overall a bad idea?

I personally find that a bit unlikely, although not impossible. I see
a couple of ways that using the per-node cost can distort things -- it
seems like it will tend to heavily feature JIT for "interior" plan
nodes because the cost of a plan node includes it's children -- and as
was mentioned previously, it doesn't really care whether the node cost
is high because of expression evaluation or something else. But
neither of those things seem like they'd be bad enough to make this a
bad way forward over all. For the patch to lose, it seems like we'd
need a case where the overall plan cost would have been high enough to
trigger JIT pre-patch, but most of the benefit would have come from
relatively low-cost nodes that don't get JITted post-patch. The
easiest way for that to happen is if the planner's estimates are off,
but that's not really an argument against this patch as much as it is
an argument that query planning is hard in general.

A slightly subtler way the patch could lose is if the new threshold is
harder to adjust than the old one. For example, imagine that you have
a query that does a Cartesian join. That makes the cost of the input
nodes rather small compared to the cost of the join node, and it also
means that JITting the inner join child in particular is probably
rather important. But if you set join_above_cost low enough to JIT
that node post-patch, then maybe you'll also JIT a bunch of things
that aren't on the inner side of a nested loop and which might
therefore not really need JIT. Unless I'm missing something, this is a
fairly realistic example of where this patch's approach to costing
could turn out to be painful ... but it's not like the current system
is pain-free either.

I don't really know what's best here, but I'm mildly inclined to
believe that the patch might be a change for the better. I have not
reviewed the implementation and have no comment on whether it's good
or bad from that point of view.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: broken JIT support on Fedora 40

2024-03-14 Thread Pavel Stehule
Hi

čt 14. 3. 2024 v 19:20 odesílatel Robert Haas 
napsal:

> On Wed, Mar 6, 2024 at 1:54 AM Pavel Stehule 
> wrote:
> > after today update, the build with --with-llvm produces broken code, and
> make check fails with crash
> >
> > Upgradehwdata-0.380-1.fc40.noarch
>  @updates-testing
> > Upgraded   hwdata-0.379-1.fc40.noarch
>  @@System
> > Upgradellvm-18.1.0~rc4-1.fc40.x86_64
> @updates-testing
> > Upgraded   llvm-17.0.6-6.fc40.x86_64
> @@System
> > Upgradellvm-devel-18.1.0~rc4-1.fc40.x86_64
> @updates-testing
> > Upgraded   llvm-devel-17.0.6-6.fc40.x86_64
> @@System
> > Upgradellvm-googletest-18.1.0~rc4-1.fc40.x86_64
>  @updates-testing
> > Upgraded   llvm-googletest-17.0.6-6.fc40.x86_64
>  @@System
> > Upgradellvm-libs-18.1.0~rc4-1.fc40.i686
>  @updates-testing
> > Upgraded   llvm-libs-17.0.6-6.fc40.i686
>  @@System
> > Upgradellvm-libs-18.1.0~rc4-1.fc40.x86_64
>  @updates-testing
> > Upgraded   llvm-libs-17.0.6-6.fc40.x86_64
>  @@System
> > Upgradellvm-static-18.1.0~rc4-1.fc40.x86_64
>  @updates-testing
> > Upgraded   llvm-static-17.0.6-6.fc40.x86_64
>  @@System
> > Upgradellvm-test-18.1.0~rc4-1.fc40.x86_64
>  @updates-testing
> > Install llvm17-libs-17.0.6-7.fc40.i686
>  @updates-testing
> > Install llvm17-libs-17.0.6-7.fc40.x86_64
>  @updates-testing
>
> I don't know anything about LLVM, but somehow I'm guessing that people
> who do will need more detail than this in order to be able to fix the
> problem. I see that support for LLVM 18 was added in commit
> d282e88e50521a457fa1b36e55f43bac02a3167f on January 18th; perhaps you
> were building from an older commit?
>

I repeated the build and check today, on fresh master with today's Fedora
updates, with the same result:

the build is ok, but the regression tests fail with a crash (it works without
--with-llvm)

Regards

Pavel






>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


Re: Possibility to disable `ALTER SYSTEM`

2024-03-14 Thread Robert Haas
On Thu, Mar 14, 2024 at 3:13 PM Tom Lane  wrote:
> With the possible exception of #1, every one of these is easily
> defeatable by an uncooperative superuser.  I'm not excited about
> adding a "security" feature with such obvious holes in it.
> We reverted MAINTAIN last year for much less obvious holes;
> how is it that we're going to look the other way on this one?

We're going to document that it's not a security feature along the
lines of what Magnus suggested in
http://postgr.es/m/CABUevEx9m=CV8=wpxvw+rtvvs858kdj6yprkexv7n+f6mk0...@mail.gmail.com

And then maybe someday we'll do this:

> Really we'd need to do something about removing superusers'
> access to the filesystem in order to build something with
> fewer holes.  I'm not against inventing such a feature,
> but it'd take a fair amount of work and likely would end
> in a noticeably less usable system (no plpython for example).

Yep. It would be useful if you replied to the portion of
http://postgr.es/m/ca+tgmoasugkz27x0xzh4edmq_b6jbrt6csuxf+phdgj-esk...@mail.gmail.com
where I enumerate the methods that I know about for the superuser to
get filesystem access. I don't think it's going to be practical to
block all of those methods in a single commit, and I'm not entirely
convinced that we can ever close all the holes without compromising
the superuser's ability to do necessary system administration tasks,
but maybe it's possible, and documenting the list of such methods
would make it a lot easier for users to understand the risks and
hackers to pick problems to try to tackle.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WIP Incremental JSON Parser

2024-03-14 Thread Jacob Champion
I've been poking at the partial token logic. The json_errdetail() bug
mentioned upthread (e.g. for an invalid input `[12zz]` and small chunk
size) seems to be due to the disconnection between the "main" lex
instance and the dummy_lex that's created from it. The dummy_lex
contains all the information about the failed token, which is
discarded upon an error return:

> partial_result = json_lex(&dummy_lex);
> if (partial_result != JSON_SUCCESS)
> return partial_result;

In these situations, there's an additional logical error:
lex->token_start is pointing to a spot in the string after
lex->token_terminator, which breaks an invariant that will mess up
later pointer math. Nothing appears to be setting lex->token_start to
point into the partial token buffer until _after_ the partial token is
successfully lexed, which doesn't seem right -- in addition to the
pointer math problems, if a previous chunk was freed (or on a stale
stack frame), lex->token_start will still be pointing off into space.
Similarly, wherever we set token_terminator, we need to know that
token_start is pointing into the same buffer.

Determining the end of a token is now done in two separate places
between the partial- and full-lexer code paths, which is giving me a
little heartburn. I'm concerned that those could drift apart, and if
the two disagree on where to end a token, we could lose data into the
partial token buffer in a way that would be really hard to debug. Is
there a way to combine them?

Thanks,
--Jacob




RE: Popcount optimization using AVX512

2024-03-14 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Monday, March 11, 2024 6:35 PM
> To: Amonson, Paul D 

> Thanks.  There's no need to wait to post the AVX portion.  I recommend using
> "git format-patch" to construct the patch set for the lists.

After exploring the git format-patch command, I think I understand what you need. 
Attached.
 
> > What exactly do you suggest here? I am happy to always call either
> > pg_popcount32() or pg_popcount64() with the understanding that it may
> > not be optimal, but I do need to know which to use.
> 
> I'm recommending that we don't change any of the code in the pg_popcount()
> function (which is renamed to pg_popcount_slow() in your v6 patch).  If
> pointers are 8 or more bytes, we'll try to process the buffer in 64-bit 
> chunks.
> Else, we'll try to process it in 32-bit chunks.  Any remaining bytes will be
> processed one-by-one.

Ok, we are on the same page now. :)  It is already fixed that way in the 
refactor patch #1.
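
To make the chunking scheme concrete, here is a rough sketch (not the code
from the patch) using the existing pg_popcount64/pg_popcount32 helpers and
the pg_number_of_ones lookup table from pg_bitutils.h; the function name is
made up:

static uint64
popcount_sketch(const char *buf, int bytes)
{
    uint64      popcnt = 0;

#if SIZEOF_VOID_P >= 8
    /* process the buffer in 64-bit chunks while at least 8 bytes remain */
    for (; bytes >= 8; buf += 8, bytes -= 8)
    {
        uint64      word;

        memcpy(&word, buf, 8);
        popcnt += pg_popcount64(word);
    }
#else
    /* on narrower platforms, use 32-bit chunks instead */
    for (; bytes >= 4; buf += 4, bytes -= 4)
    {
        uint32      word;

        memcpy(&word, buf, 4);
        popcnt += pg_popcount32(word);
    }
#endif

    /* any remaining bytes are processed one-by-one */
    for (; bytes > 0; buf++, bytes--)
        popcnt += pg_number_of_ones[(unsigned char) *buf];

    return popcnt;
}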

As for new performance numbers: I just ran a full suite like I did earlier in 
the process. My latest results, on the equivalent of a pgbench scale factor 10 DB 
with the target column having varying column widths and appropriate random data, 
are a 1.2% improvement with a 2.2% margin of error at a 98% confidence level. 
Still seeing improvement and no regressions.

As stated in the previous separate chain I updated the code removing the extra 
"extern" keywords.

Thanks,
Paul



v8-0001-Refactor-POPCNT-code-refactored-for-future-accelerat.patch
Description: v8-0001-Refactor-POPCNT-code-refactored-for-future-accelerat.patch


v8-0002-Feat-Add-AVX-512-POPCNT-support-initial-checkin.patch
Description: v8-0002-Feat-Add-AVX-512-POPCNT-support-initial-checkin.patch


Re: JIT compilation per plan node

2024-03-14 Thread David Rowley
Thanks for chipping in here.

On Fri, 15 Mar 2024 at 08:14, Robert Haas  wrote:
> A slightly subtler way the patch could lose is if the new threshold is
> harder to adjust than the old one. For example, imagine that you have
> a query that does a Cartesian join. That makes the cost of the input
> nodes rather small compared to the cost of the join node, and it also
> means that JITting the inner join child in particular is probably
> rather important. But if you set join_above_cost low enough to JIT
> that node post-patch, then maybe you'll also JIT a bunch of things
> that aren't on the inner side of a nested loop and which might
> therefore not really need JIT. Unless I'm missing something, this is a
> fairly realistic example of where this patch's approach to costing
> could turn out to be painful ... but it's not like the current system
> is pain-free either.

I think this case would be covered as the cost of the inner side of
the join would be multiplied by the estimated outer-side rows.
Effectively, making this part work is the bulk of the patch as we
currently don't know the estimated number of loops of a node during
create plan.

David




Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

2024-03-14 Thread Melanie Plageman
On Thu, Mar 14, 2024 at 05:30:30PM +0200, Heikki Linnakangas wrote:
> On 18/02/2024 00:31, Tomas Vondra wrote:
> > Do you plan to continue working on this patch? I did take a look,
> > and on the whole it looks reasonable - it modifies the right places etc.
> 
> +1
> 
> > I think there are two things that may need an improvement:
> > 
> > 1) Storing variable-length data in ParallelBitmapHeapState
> > 
> > I agree with Robert the snapshot_and_stats name is not great. I see
> > Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
> > the reasons are somewhat different (phs_snapshot_off exists because we
> > don't know which exact struct will be allocated), while here we simply
> > need to allocate two variable-length pieces of memory. But it seems like
> > it would work nicely for this. That is, we could try adding an offset
> > for each of those pieces of memory:
> > 
> >   - snapshot_off
> >   - stats_off
> > 
> > I don't like the GetSharedSnapshotData name very much, it seems very
> > close to GetSnapshotData - quite confusing, I think.
> > 
> > Dmitry also suggested we might add a separate piece of shared memory. I
> > don't quite see how that would work for ParallelBitmapHeapState, but I
> > doubt it'd be simpler than having two offsets. I don't think the extra
> > complexity (paid by everyone) would be worth it just to make EXPLAIN
> > ANALYZE work.
> 
> I just removed phs_snapshot_data in commit 84c18acaf6. I thought that would
> make this moot, but now that I rebased this, there are stills some aesthetic
> questions on how best to represent this.
> 
> In all the other node types that use shared instrumentation like this, the
> pattern is as follows: (using Memoize here as an example, but it's similar
> for Sort, IncrementalSort, Agg and Hash)
> 
> /* 
>  * Shared memory container for per-worker memoize information
>  * 
>  */
> typedef struct SharedMemoizeInfo
> {
>   int num_workers;
>   MemoizeInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
> } SharedMemoizeInfo;
> 
> /* this struct is backend-private */
> typedef struct MemoizeState
> {
>   ScanState   ss; /* its first field is 
> NodeTag */
>   ...
>   MemoizeInstrumentation stats;   /* execution statistics */
>   SharedMemoizeInfo *shared_info; /* statistics for parallel workers */
> } MemoizeState;
> 
> While the scan is running, the node updates its private data in
> MemoizeState->stats. At the end of a parallel scan, the worker process
> copies the MemoizeState->stats to MemoizeState->shared_info->stats, which
> lives in shared memory. The leader process copies
> MemoizeState->shared_info->stats to its own backend-private copy, which it
> then stores in its MemoizeState->shared_info, replacing the pointer to the
> shared memory with a pointer to the private copy. That happens in
> ExecMemoizeRetrieveInstrumentation().
> 
> This is a little different for parallel bitmap heap scans, because a bitmap
> heap scan keeps some other data in shared memory too, not just
> instrumentation data. Also, the naming is inconsistent: the equivalent of
> SharedMemoizeInfo is actually called ParallelBitmapHeapState. I think we
> should rename it to SharedBitmapHeapInfo, to make it clear that it lives in
> shared memory, but I digress.

FWIW, if we merge a BHS streaming read user like the one I propose in
[1] (not as a pre-condition to this but just as something to make you
more comfortable with these names), the ParallelBitmapHeapState will
basically only contain the shared iterator and the coordination state
for accessing it and could be named as such.

Then if you really wanted to be consistent with Memoize, you could name
the instrumentation SharedBitmapHeapInfo. But, personally I prefer the
name you gave it: SharedBitmapHeapInstrumentation. I think that would
have been a better name for SharedMemoizeInfo since num_workers is
really just used as the length of the array of instrumentation info.
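
For reference, the leader-side copy described above boils down to roughly
this shape (a sketch of the Exec*RetrieveInstrumentation() pattern, not
verbatim nodeMemoize.c source):

void
ExecMemoizeRetrieveInstrumentation(MemoizeState *node)
{
    Size        size;
    SharedMemoizeInfo *si;

    if (node->shared_info == NULL)
        return;

    /*
     * Copy the shared stats into backend-local memory so they stay valid
     * after the parallel DSM segment is detached.
     */
    size = offsetof(SharedMemoizeInfo, sinstrument)
        + node->shared_info->num_workers * sizeof(MemoizeInstrumentation);
    si = palloc(size);
    memcpy(si, node->shared_info, size);
    node->shared_info = si;
}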

> We could now put the new stats at the end of ParallelBitmapHeapState as a
> varlen field. But I'm not sure that's a good idea. In
> ExecBitmapHeapRetrieveInstrumentation(), would we make a backend-private
> copy of the whole ParallelBitmapHeapState struct, even though the other
> fields don't make sense after the shared memory is released? Sounds
> confusing. Or we could introduce a separate struct for the stats, and copy
> just that:
> 
> typedef struct SharedBitmapHeapInstrumentation
> {
>   int num_workers;
>   BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
> } SharedBitmapHeapInstrumentation;
> 
> typedef struct BitmapHeapScanState
> {
>   ScanState   ss; /* its first field is 
> NodeTag */
>   ...
>   SharedBitmapHeapInstrumentation sinstrument;
> } BitmapHeapScanState;
> 
> that compiles, at least with my compiler, but I find it we

Re: Possibility to disable `ALTER SYSTEM`

2024-03-14 Thread Tom Lane
Robert Haas  writes:
> On Thu, Mar 14, 2024 at 3:13 PM Tom Lane  wrote:
>> With the possible exception of #1, every one of these is easily
>> defeatable by an uncooperative superuser.  I'm not excited about
>> adding a "security" feature with such obvious holes in it.

> We're going to document that it's not a security feature along the
> lines of what Magnus suggested in
> http://postgr.es/m/CABUevEx9m=CV8=wpxvw+rtvvs858kdj6yprkexv7n+f6mk0...@mail.gmail.com

The patch-of-record contains no such wording.  And if this isn't a
security feature, then what is it?  If you have to say to your
(super) users "please don't mess with the system configuration",
you might as well just trust them not to do it the easy way as not
to do it the hard way.  If they're untrustworthy, why have they
got superuser?

What I think this is is a loaded foot-gun painted in kid-friendly
colors.  People will use it and then file CVEs about how it did
not turn out to be as secure as they imagined (probably without
reading the documentation).

regards, tom lane




Re: Recent 027_streaming_regress.pl hangs

2024-03-14 Thread Thomas Munro
On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin  wrote:
> Could it be that the timeout (360 sec?) is just not enough for the test
> under the current (changed due to switch to meson) conditions?

Hmm, well it looks like he switched over to meson around 42 days ago,
on 2024-02-01, looking at "calliphoridae" (skink has the extra
complication of valgrind, let's look at a more 'normal' animal
instead).  The first failure that looks like that on calliphoridae is
19 days ago 2024-02-23, and after that it's happening every 3 days,
sometimes in clusters.

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=calliphoridae&br=HEAD

But you're right that under meson the test takes a lot longer, I guess
due to increased concurrency:

287/287 postgresql:recovery / recovery/027_stream_regress
   OK  684.50s   6 subtests passed

With make we don't have an individual time per script, but for all
of the recovery tests we had, for example:

t/027_stream_regress.pl ... ok
All tests successful.
Files=39, Tests=542, 65 wallclock secs ( 0.26 usr  0.06 sys + 20.16
cusr 31.65 csys = 52.13 CPU)




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-03-14 Thread Robert Haas
On Thu, Mar 14, 2024 at 1:54 PM Jelte Fennema-Nio  wrote:
> In my view there can be, **by definition,** only two general types of
> protocol changes:
> 1. A "simple protocol change": This is one that requires agreement by
> both the client and server, that they understand the new message types
> involved in this change. e.g. the ParameterSet message proposal (this
> message type is either supported or it's not)
> 2. A "parameterized protocol change": This requires the same as 1, but
> should also allow some parameterization from the client, e.g. for the
> compression proposal, the client should specify what compression
> algorithm the server should use to compress data when sending it to
> the client.

You seem to be supposing here that all protocol changes will consist
of adding new message types. While I think that will be a common
pattern, I do not think it will be a universal one. I do agree,
however, that every possible variation of the protocol is either
Boolean-valued (i.e. are we doing X or not?) or something more
complicated (i.e. how exactly are we doing X?).

> Client and Server can agree that a "simple protocol change" is
> supported by both advertising a minimum minor protocol version. And
> for a "parameterized protocol change" the client can send a _pq_
> parameter in the startup packet.
>
> So, new _pq_ parameters should only ever be introduced for
> parameterized protocol changes. They are not meant to advertise
> support, they are meant to configure protocol features. For a
> non-configurable protocol feature, we'd simply bump the protocol
> version. And since _pq_ parameters thus always control some setting at
> the session level, we can simply store it as a GUC, because that's how
> we store all our parameters at a session level.

This is definitely not how I was thinking about it. I was imagining
that we wanted to reserve bumping the protocol version for more
significant changes, and that we'd use _pq_ parameters for relatively
minor new functionality whether Boolean-valued or otherwise.

> I think given the automatic downgrade supported by the
> NegotiateProtocolVersion, there's no real down-side to requesting the
> newest version by default. The only downside I can see is when
> connecting to other applications (i.e. non PostgreSQL servers) that
> implement the postgres protocol, but don't implement
> NegotiateProtocolVersion. But for that I added the
> max_protocol_version connection option in 0006 (of my latest
> patchset).

You might be right. This is a minor point that's not worth arguing
about too much.

> > I'm really unhappy with 0002-0004.
>
> Just to be clear, (afaict) your argument below seems to only really be
> about 0004, not about 0002 or 0003. Was there anything about 0002 &
> 0003 that you were unhappy with? 0002 & 0003 are not dependent on 0004
> imho. Because even when not making _pq_ parameters map to GUCs, we'd
> still need to change libpq to not instantly close the connection
> whenever a _pq_ parameter (or new protocol version) is not supported
> by the server (which is what 0002 & 0003 do).

I completely agree that we need to avoid slamming the connection shut.
What I don't agree with is taking the list of protocol extensions that
the server knows about and putting it into an array of strings that
the user can see. I don't want the user to know or care so much about
what's happening at the wire protocol level. The user is entitled to
know whether PQsetProtocolParameter() will work or not, and the user
is entitled to know whether it has a chance of working next time if it
didn't work this time, and when it fails, the user is entitled to a
good error message explaining the reason for the failure. But the user
is not entitled to know what negotiation took place over the wire to
figure that out. They shouldn't need to know that the _pq_ namespace
exists, and they shouldn't need to know whether we negotiated the
availability or unavailability of PQsetProtocolParameter() using [a]
the protocol minor version number, [b] the protocol major version
number, [c] a protocol option called parameter_set, or [d] a protocol
option called bananas_foster. Those are all things that might change
in the future.

Just as a for instance, I had a thought that we might accumulate a few
new message types controlled by protocol options (ParameterSet,
AlwaysSendTypeInBinaryFormat, whatever else) while keeping the
protocol version as 3.0, and then eventually bump the protocol version
to 3.1 where all of that would be mandatory functionality. So the
protocol parameters wouldn't be specified any more when using 3.1, but
they would be specified when talking to older 3.0 servers. That
difference shouldn't be visible to the user. The user should only know
the result of the negotiation.

> Okay, our interpretation is very different here. From my perspective
> introducing a non-GUC namespace is NOT AT ALL the benefit of the _pq_
> prefix. The main benefit (imho) is that it allows sending an
> "optio

Re: Add the ability to limit the amount of memory that can be allocated to backends.

2024-03-14 Thread Anton A. Melnikov

On 13.03.2024 10:41, Anton A. Melnikov wrote:


Here is a version updated for the current master.



During patch updating I mistakenly added double counting of deallocated 
blocks.
That's why the tests in the patch tester failed.
Fixed it and squashed fix 0002 with 0001.
Here is fixed version.

With the best wishes!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From 7a0925a38acfb9a945087318a5d91fae4680db0e Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 26 Dec 2023 17:54:40 +0100
Subject: [PATCH 1/2] Add tracking of backend memory allocated

Add tracking of backend memory allocated in total and by allocation
type (aset, dsm, generation, slab) by process.

allocated_bytes tracks the current bytes of memory allocated to the
backend process. aset_allocated_bytes, dsm_allocated_bytes,
generation_allocated_bytes and slab_allocated_bytes track the
allocation by type for the backend process. They are updated for the
process as memory is malloc'd/freed.  Memory allocated to items on
the freelist is included.  Dynamic shared memory allocations are
included only in the value displayed for the backend that created
them, they are not included in the value for backends that are
attached to them to avoid double counting. DSM allocations that are
not destroyed by the creating process prior to its exit are
considered long lived and are tracked in a global counter
global_dsm_allocated_bytes. We limit the floor of allocation
counters to zero. Created views pg_stat_global_memory_allocation and
pg_stat_memory_allocation for access to these trackers.
---
 doc/src/sgml/monitoring.sgml| 246 
 src/backend/catalog/system_views.sql|  34 +++
 src/backend/storage/ipc/dsm.c   |  11 +-
 src/backend/storage/ipc/dsm_impl.c  |  78 +++
 src/backend/storage/lmgr/proc.c |   1 +
 src/backend/utils/activity/backend_status.c | 114 +
 src/backend/utils/adt/pgstatfuncs.c |  84 +++
 src/backend/utils/init/miscinit.c   |   3 +
 src/backend/utils/mmgr/aset.c   |  17 ++
 src/backend/utils/mmgr/generation.c |  13 ++
 src/backend/utils/mmgr/slab.c   |  22 ++
 src/include/catalog/pg_proc.dat |  17 ++
 src/include/storage/proc.h  |   2 +
 src/include/utils/backend_status.h  | 144 +++-
 src/test/regress/expected/rules.out |  27 +++
 src/test/regress/expected/stats.out |  36 +++
 src/test/regress/sql/stats.sql  |  20 ++
 17 files changed, 867 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8736eac284..41d788be45 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4599,6 +4599,252 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 
  
 
+ 
+  pg_stat_memory_allocation
+
+  
+   pg_stat_memory_allocation
+  
+
+  
+   The pg_stat_memory_allocation view will have one
+   row per server process, showing information related to the current memory
+   allocation of that process in total and by allocator type. Due to the
+   dynamic nature of memory allocations the allocated bytes values may not be
+   exact but should be sufficient for the intended purposes. Dynamic shared
+   memory allocations are included only in the value displayed for the backend
+   that created them, they are not included in the value for backends that are
+   attached to them to avoid double counting.  Use
+   pg_size_pretty described in
+to make these values more easily
+   readable.
+  
+
+  
+   pg_stat_memory_allocation View
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   datid oid
+  
+  
+   OID of the database this backend is connected to
+  
+ 
+
+ 
+  
+   pid integer
+  
+  
+   Process ID of this backend
+  
+ 
+
+ 
+  
+   allocated_bytes bigint
+  
+ 
+  Memory currently allocated to this backend in bytes. This is the balance
+  of bytes allocated and freed by this backend. Dynamic shared memory
+  allocations are included only in the value displayed for the backend that
+  created them, they are not included in the value for backends that are
+  attached to them to avoid double counting.
+ 
+ 
+
+ 
+  
+   aset_allocated_bytes bigint
+  
+  
+   Memory currently allocated to this backend in bytes via the allocation
+   set allocator.
+  
+ 
+
+ 
+  
+   dsm_allocated_bytes bigint
+  
+  
+   Memory currently allocated to this backend in bytes via the dynamic
+   shared memory allocator. Upon process exit, dsm allocations that have
+   not been freed are considered long lived and added to
+   global_dsm_allocated_bytes foun

Re: Possibility to disable `ALTER SYSTEM`

2024-03-14 Thread Robert Haas
On Thu, Mar 14, 2024 at 4:08 PM Tom Lane  wrote:
> The patch-of-record contains no such wording.

I plan to fix that, if nobody else beats me to it.

> And if this isn't a
> security feature, then what is it?  If you have to say to your
> (super) users "please don't mess with the system configuration",
> you might as well just trust them not to do it the easy way as not
> to do it the hard way.  If they're untrustworthy, why have they
> got superuser?

I mean, I feel like this question has been asked and answered before,
multiple times, on this thread. If you sincerely don't understand the
use case, I can try again to explain it. But somehow I feel like it's
more that you just don't like the idea, which is fair, but it seems
like a considerable number of people feel otherwise.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add basic tests for the low-level backup method.

2024-03-14 Thread David Steele

On 3/14/24 20:00, Michael Paquier wrote:

On Thu, Mar 14, 2024 at 09:12:52AM +1300, David Steele wrote:

I think you are right that the start message is better since it can only
appear once when the backup_label is found. The completed message could in
theory appear after a restart, though the backup_label must have been found
at some point.


So, I've given a try to this patch with 99b4a63bef94, to note that
sidewinder failed because of a timing issue on Windows: the recovery
of the node without backup_label, expected to fail, would try to
backup the last segment it has replayed, because, as it has no
backup_label, it behaves like the primary.  It would try to use the
same archive location as the primary, leading to a conflict failure on
Windows.  This one was easy to fix, by overwriting postgresql.conf on
the node to not do archiving.


Hmmm, I wonder why this did not show up in the Windows tests on CI?


Following that, I've noticed a second race condition: we don't wait
for the segment after pg_switch_wal() to be archived.  This one can be
easily avoided with a poll on pg_stat_archiver.


Ugh, yeah, good change.


After that, also because I've initially managed to, cough, forget an
update of meson.build to list the new test, I've noticed a third
failure on Windows for the case of the node that has a backup_label.
Here is one of the failures:
https://cirrus-ci.com/task/5245341683941376


Is the missing test in meson the reason we did not see test failures for 
Windows in CI?



regress_log_042_low_level_backup and
042_low_level_backup_replica_success.log have all the information
needed, that can be summarized like that:
The system cannot find the file specified.
2024-03-14 06:02:37.670 GMT [560][startup] FATAL:  invalid data in file 
"backup_label"

The first message is something new to me, that seems to point to a
corruption failure of the file system.  Why don't we see something
similar in other tests, then?  Leaving that aside..

The second LOG is something that can be acted on.  I've added some
debugging to the parsing of the backup_label file in the backend, and
noticed that the first fscanf() for START WAL LOCATION is failing
because the last %c is detected as \r rather than \n.  Tweaking the
contents stored from pg_backend_stop() with a sed won't help, because
the issue is that we write the CRLFs with append_to_file, and the
startup process cannot cope with that.  The simplest method I can
think of is to use binmode, as of the attached.


Yeah, that makes sense.
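
For the archives, the check that trips here is roughly of this shape
(paraphrased from read_backup_label(), not a verbatim copy):

    char        startxlogfilename[MAXFNAMELEN];
    uint32      hi, lo, tli;
    char        ch;

    if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %08X%16s)%c",
               &hi, &lo, &tli, startxlogfilename, &ch) != 5 || ch != '\n')
        ereport(FATAL,
                (errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));

With a CRLF-terminated line the trailing %c captures '\r' instead of '\n', so
the ch != '\n' test fails and we get that FATAL, which is why writing the file
in binmode avoids the problem.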


It is the first time that we'd take the contents received from a
BackgroundPsql and write them to a file parsed by the backend, so
perhaps we should try to do that in a more general way, but I'm not
sure how, tbh, and the case of this test is special while adding
handling for \r when reading the backup_label got discussed in the
past but we were OK with what we are doing now on HEAD.


I think it makes sense to leave the parsing code as is and make the 
change in the test. If we add more tests to this module we'll probably 
need a function to avoid repeating code.



On top of all that, note that I have removed remove_tree as I am not
sure if this would be OK in all the buildfarm animals, added a quit()
for BackgroundPsql, moved queries to use less BackgroundPsql, as well
as a few other things like avoiding the hardcoded segment names.
meson.build is.. Cough.. Updated now.


OK.


I am attaching an updated patch with all that fixed, which is stable
in the CI and any tests I've run.  Do you have any comments about


These changes look good to me. Sure wish we had an easier way to test 
commits in the build farm.


Regards,
-David




Re: Built-in CTYPE provider

2024-03-14 Thread Jeff Davis
On Thu, 2024-03-14 at 15:38 +0100, Peter Eisentraut wrote:
> On 14.03.24 09:08, Jeff Davis wrote:
> > 0001 (the C.UTF-8 locale) is also close...
> 
> I have tested this against the libc locale C.utf8 that was available on
> the OS, and the behavior is consistent.

That was the goal, in spirit.

But to clarify: it's not guaranteed that the built-in C.UTF-8 is always
the same as the libc UTF-8, because different implementations do
different things. For instance, I saw significant differences on MacOS.

> I wonder if we should version the builtin locales too.  We might make a
> mistake and want to change something sometime?

I'm fine with that, see v25-0004 in the reply to your other mail.

The version only tracks sort order, and all of the builtin locales sort
based on memcmp(). But it's possible there are bugs in the
optimizations around memcmp() (e.g. abbreviated keys, or some future
optimization).

> Tiny comments:
> 
> * src/bin/scripts/t/020_createdb.pl
> 
> The two added tests should have different names that tells them apart
> (like the new initdb tests).
> 
> * src/include/catalog/pg_collation.dat

Done in v25-0002 (in reply to your other mail).

Regards,
Jeff Davis






Re: Recent 027_streaming_regress.pl hangs

2024-03-14 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin  wrote:
>> Could it be that the timeout (360 sec?) is just not enough for the test
>> under the current (changed due to switch to meson) conditions?

> But you're right that under meson the test takes a lot longer, I guess
> due to increased concurrency:

What it seems to be is highly variable.  Looking at calliphoridae's
last half dozen successful runs, I see reported times for
027_stream_regress anywhere from 183 to 704 seconds.  I wonder what
else is happening on that machine.  Also, this is probably not
helping anything:

   'extra_config' => {
  ...
  'fsync = on'

I would suggest turning that off and raising wait_timeout a good
deal, and then we'll see if calliphoridae gets any more stable.

regards, tom lane




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-14 Thread Thomas Munro
On Fri, Mar 15, 2024 at 3:18 AM Tomas Vondra
 wrote:
> So, IIUC this means (1) the patched code is more aggressive wrt
> prefetching (because we prefetch more data overall, because master would
> prefetch N pages and patched prefetches N ranges, each of which may be
> multiple pages. And (2) it's not easy to quantify how much more
> aggressive it is, because it depends on how we happen to coalesce the
> pages into ranges.
>
> Do I understand this correctly?

Yes.

Parallelism must prevent coalescing here though.  Any parallel aware
executor node that allocates block numbers to workers without trying
to preserve ranges will.  That not only hides the opportunity to
coalesce reads, it also makes (globally) sequential scans look random
(ie locally they are more random), so that our logic to avoid issuing
advice for sequential scan won't work, and we'll inject extra useless
or harmful (?) fadvise calls.  I don't know what to do about that yet,
but it seems like a subject for future research.  Should we recognise
sequential scans with a window (like Linux does), instead of strictly
next-block detection (like some other OSes do)?  Maybe a shared
streaming read that all workers pull blocks from, so it can see what's
going on?  I think the latter would be strictly more like what the ad
hoc BHS prefetching code in master is doing, but I don't know if it'd
be over-engineering, or hard to do for some reason.

Another aspect of per-backend streaming reads in one parallel query
that don't know about each other is that they will all have their own
effective_io_concurrency limit.  That is a version of a problem that
comes up again and again in parallel query, to be solved by the grand
unified resource control system of the future.




Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

2024-03-14 Thread Heikki Linnakangas

On 14/03/2024 22:00, Melanie Plageman wrote:

On Thu, Mar 14, 2024 at 05:30:30PM +0200, Heikki Linnakangas wrote:

typedef struct SharedBitmapHeapInstrumentation
{
int num_workers;
BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
} SharedBitmapHeapInstrumentation;

typedef struct BitmapHeapScanState
{
ScanState   ss; /* its first field is 
NodeTag */
...
SharedBitmapHeapInstrumentation sinstrument;
} BitmapHeapScanState;

that compiles, at least with my compiler, but I find it weird to have a
variable-length inner struct embedded in an outer struct like that.


In the attached patch, BitmapHeapScanState->sinstrument is a pointer,
though. Or are you proposing the above as an alternative that you
decided not to go with?


Right, the above is what I contemplated at first but decided it was a 
bad idea.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: cleanup patches for incremental backup

2024-03-14 Thread Nathan Bossart
On Wed, Jan 24, 2024 at 12:05:15PM -0600, Nathan Bossart wrote:
> There might be an overflow risk in the cutoff time calculation, but I doubt
> that's the root cause of these failures:
> 
>   /*
>* Files should only be removed if the last modification time precedes 
> the
>* cutoff time we compute here.
>*/
>   cutoff_time = time(NULL) - 60 * wal_summary_keep_time;

I've attached a short patch for fixing this overflow risk.  Specifically,
it limits wal_summary_keep_time to INT_MAX / SECS_PER_MINUTE, just like
log_rotation_age.  (That cap works out to 35,791,394 minutes, or roughly 68
years, so it shouldn't constrain anyone in practice.)

I considered checking for overflow when we subtract the keep-time from the
result of time(2), but AFAICT there's only a problem if time_t is unsigned,
which Wikipedia leads me to believe is unusual [0], so I figured we might
be able to just wait this one out until 2038.

> Otherwise, I think we'll probably need to add some additional logging to
> figure out what is happening...

Separately, I suppose it's probably time to revert the temporary debugging
code added by commit 5ddf997.  I can craft a patch for that, too.

[0] https://en.wikipedia.org/wiki/Unix_time#Representing_the_number

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index ec2874c18c..c820d1f9ed 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
  * GUC parameters
  */
 bool		summarize_wal = false;
-int			wal_summary_keep_time = 10 * 24 * 60;
+int			wal_summary_keep_time = 10 * HOURS_PER_DAY * MINS_PER_HOUR;
 
 static void WalSummarizerShutdown(int code, Datum arg);
 static XLogRecPtr GetLatestLSN(TimeLineID *tli);
@@ -1474,7 +1474,7 @@ MaybeRemoveOldWalSummaries(void)
 	 * Files should only be removed if the last modification time precedes the
 	 * cutoff time we compute here.
 	 */
-	cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+	cutoff_time = time(NULL) - wal_summary_keep_time * SECS_PER_MINUTE;
 
 	/* Get all the summaries that currently exist. */
 	wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 57d9de4dd9..1e71e7db4a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3293,9 +3293,9 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_MIN,
 		},
 		&wal_summary_keep_time,
-		10 * 24 * 60,			/* 10 days */
+		10 * HOURS_PER_DAY * MINS_PER_HOUR, /* 10 days */
 		0,
-		INT_MAX,
+		INT_MAX / SECS_PER_MINUTE,
 		NULL, NULL, NULL
 	},
 


Re: JIT compilation per plan node

2024-03-14 Thread Tomas Vondra



On 3/14/24 20:14, Robert Haas wrote:
> On Tue, Feb 20, 2024 at 5:31 AM Tomas Vondra
>  wrote:
>> I certainly agree that the current JIT costing is quite crude, and we've
>> all seen cases where the decision turns out to not be great. And I think
>> the plan to make the decisions at the node level makes sense, so +1 to
>> that in general.
> 
> Seems reasonable to me also.
> 
>> And I think you're right that looking just at the node total cost may
>> not be sufficient - that we may need a better cost model, considering
>> how many times an expression is executed and so on. But I think we
>> should try to do this in smaller steps, meaningful on their own,
>> otherwise we won't move at all. The two threads linked by Melih are ~4y
>> old and *nothing* changed since then, AFAIK.
>>
>> I think it's reasonable to start by moving the decision to the node
>> level - it's where the JIT happens, anyway. It may not be perfect, but
>> it seems like a clear improvement. And if we then choose to improve the
>> "JIT cost model" to address some of the issues you pointed out, surely
>> that would need to happen at the node level too ...
> 
> I'm not sure I understand whether you (Tomas) think that this patch is
> a good idea or a bad idea as it stands. I read the first of these two
> paragraphs to suggest that the patch hasn't really evolved much in the
> last few years, perhaps suggesting that if it wasn't good enough to
> commit back then, it still isn't now. But the second of these two
> paragraphs seems more supportive.
> 

To clarify, I think the patch is a step in the right direction, and a
meaningful improvement. It may not be the perfect solution we imagine
(but who knows how far we are from that), but AFAIK moving these
decisions to the node level is something the ideal solution would need
to do too.

The reference to the 4y old patches was meant to support this patch as
an improvement - perhaps incomplete, but still an improvement. We keep
imagining "perfect solutions" and then end up doing nothing.

I recognize there's a risk we may never get to have the ideal solution
(e.g. because it requires information we don't possess). But I still
think moving the decision to the node level would allow us to do better
decisions compared to just doing it for the query as a whole.

> From my own point of view, I definitely agree with David's statement
> that what we really want to know is how many times each expression
> will be evaluated. If we had that information, or just an estimate, I
> think we could make much better decisions in this area. But we don't
> have that infrastructure now, and it doesn't seem easy to create, so
> it seems to me that what we have to decide now is whether applying a
> cost threshold on a per-plan-node basis will produce better or worse
> results than what making one decision for the whole plan. David's
> provided an example of where it does indeed work better back in
> https://www.postgresql.org/message-id/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com
> - but could there be enough cases where the opposite happens to make
> us think that the patch is overall a bad idea?
> 

Right, this risk of regression is always there, and I'm sure it'd be
possible to construct such cases. But considering how crude the current
costing is, I'd be surprised if this ends up being a net negative.

Also, is the number of executions really the thing we're missing? Surely
we know the number of rows the node is dealing with, so we could use
this (yes, I realize there are issues, but we deal with that when
costing quals too). Isn't it much bigger issue that we have pretty much
no cost model for the actual JIT (compilation/optimization) depending on
how many expressions it deals with?

> I personally find that a bit unlikely, although not impossible. I see
> a couple of ways that using the per-node cost can distort things -- it
> seems like it will tend to heavily feature JIT for "interior" plan
> nodes because the cost of a plan node includes it's children -- and as
> was mentioned previously, it doesn't really care whether the node cost
> is high because of expression evaluation or something else. But
> neither of those things seem like they'd be bad enough to make this a
> bad way forward over all. For the patch to lose, it seems like we'd
> need a case where the overall plan cost would have been high enough to
> trigger JIT pre-patch, but most of the benefit would have come from
> relatively low-cost nodes that don't get JITted post-patch. The
> easiest way for that to happen is if the planner's estimates are off,
> but that's not really an argument against this patch as much as it is
> an argument that query planning is hard in general.
> 
> A slightly subtler way the patch could lose is if the new threshold is
> harder to adjust than the old one. For example, imagine that you have
> a query that does a Cartesian join. That makes the cost of the input
> nodes rather small compared to

Re: Possibility to disable `ALTER SYSTEM`

2024-03-14 Thread Maciek Sakrejda
On Thu, Mar 14, 2024 at 1:38 PM Robert Haas  wrote:
> On Thu, Mar 14, 2024 at 4:08 PM Tom Lane  wrote:
> > The patch-of-record contains no such wording.
>
> I plan to fix that, if nobody else beats me to it.
>
> > And if this isn't a
> > security feature, then what is it?  If you have to say to your
> > (super) users "please don't mess with the system configuration",
> > you might as well just trust them not to do it the easy way as not
> > to do it the hard way.  If they're untrustworthy, why have they
> > got superuser?
>
> I mean, I feel like this question has been asked and answered before,
> multiple times, on this thread. If you sincerely don't understand the
> use case, I can try again to explain it. But somehow I feel like it's
> more that you just don't like the idea, which is fair, but it seems
> like a considerable number of people feel otherwise.

I know I'm jumping into a long thread here, but I've been following it
out of interest. I'm sympathetic to the use case, since I used to work
at a Postgres cloud provider, and while our system intentionally did
not give our end users superuser privileges, I can imagine other
managed environments where that's not an issue. I'd like to give
answering this question again a shot, because I think this has been a
persistent misunderstanding in this thread, and I don't think it's
been made all that clear.

It's not a security feature: it's a usability feature.

It's a usability feature because, when Postgres configuration is
managed by an outside mechanism (e.g., as in a Kubernetes
environment), ALTER SYSTEM currently allows a superuser to make
changes that appear to work, but may be discarded at some point in the
future when that outside mechanism updates the config. They may also
be represented incorrectly in a management dashboard if that dashboard
is based on the values in the outside configuration mechanism, rather
than values directly from Postgres.

In this case, the end user with access to Postgres superuser
privileges presumably also has access to the outside configuration
mechanism. The goal is not to prevent them from changing settings, but
to offer guard rails that prevent them from changing settings in a way
that will be unstable (revertible by a future update) or confusing
(not showing up in a management UI).

There are challenges here in making sure this is _not_ seen as a
security feature. But I do think the feature itself is sensible and
worthwhile.

Thanks,
Maciek




  1   2   >