Re: remove volatile qualifiers from pg_stat_statements

2024-07-31 Thread Bertrand Drouvot
Hi,

On Tue, Jul 30, 2024 at 01:24:54PM -0500, Nathan Bossart wrote:
> While looking into converting pgssEntry->mutex to an LWLock (per a
> suggestion elsewhere [0]), I noticed that pg_stat_statements uses
> "volatile" quite liberally.  IIUC we can remove these as of commit 0709b7e
> (like commits 8f6bb85, df4077c, and 6ba4ecb did in other areas).  All of
> the uses in pg_stat_statements except those added by commit 9fbc3f3 predate
> that commit (0709b7e), and I assume commit 9fbc3f3 was just following the
> examples in surrounding code.
> 
> Am I missing something?  Or can we remove these qualifiers now?

I share the same understanding and I think those can be removed.

The patch LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




make pg_createsubscriber option names more consistent

2024-07-31 Thread Peter Eisentraut
I propose to rename the pg_createsubscriber option --socket-directory to 
--socketdir.  This would make it match the equivalent option in 
pg_upgrade.  (It even has the same short option '-s'.) 
pg_createsubscriber and pg_upgrade have a lot of common terminology and 
a similar operating mode, so it would make sense to keep this consistent.

From 3dba607f14382998ebc6eef02e78d75f20836514 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 31 Jul 2024 08:53:42 +0200
Subject: [PATCH] pg_createsubscriber: Rename option --socket-directory to
 --socketdir

For consistency with the equivalent option in pg_upgrade.
---
 doc/src/sgml/ref/pg_createsubscriber.sgml  |  2 +-
 src/bin/pg_basebackup/pg_createsubscriber.c|  4 ++--
 .../pg_basebackup/t/040_pg_createsubscriber.pl | 18 +-
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/ref/pg_createsubscriber.sgml 
b/doc/src/sgml/ref/pg_createsubscriber.sgml
index 87a9d3db28e..026290e0114 100644
--- a/doc/src/sgml/ref/pg_createsubscriber.sgml
+++ b/doc/src/sgml/ref/pg_createsubscriber.sgml
@@ -139,7 +139,7 @@ Options
 
 
  -s dir
- --socket-directory=dir
+ --socketdir=dir
  
   
The directory to use for postmaster sockets on target server.  The
diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c 
b/src/bin/pg_basebackup/pg_createsubscriber.c
index f838a079b66..9e68e1d508a 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -226,7 +226,7 @@ usage(void)
printf(_("  -n, --dry-run   dry run, just show what 
would be done\n"));
printf(_("  -p, --subscriber-port=PORT  subscriber port number 
(default %s)\n"), DEFAULT_SUB_PORT);
printf(_("  -P, --publisher-server=CONNSTR  publisher connection 
string\n"));
-   printf(_("  -s, --socket-directory=DIR  socket directory to use 
(default current directory)\n"));
+   printf(_("  -s, --socketdir=DIR socket directory to use 
(default current directory)\n"));
printf(_("  -t, --recovery-timeout=SECS seconds to wait for 
recovery to end\n"));
printf(_("  -U, --subscriber-username=NAME  subscriber username\n"));
printf(_("  -v, --verbose   output verbose 
messages\n"));
@@ -1871,7 +1871,7 @@ main(int argc, char **argv)
{"dry-run", no_argument, NULL, 'n'},
{"subscriber-port", required_argument, NULL, 'p'},
{"publisher-server", required_argument, NULL, 'P'},
-   {"socket-directory", required_argument, NULL, 's'},
+   {"socketdir", required_argument, NULL, 's'},
{"recovery-timeout", required_argument, NULL, 't'},
{"subscriber-username", required_argument, NULL, 'U'},
{"verbose", no_argument, NULL, 'v'},
diff --git a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl 
b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
index 546f784a311..0a900edb656 100644
--- a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
+++ b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
@@ -171,7 +171,7 @@ sub generate_db
'pg_createsubscriber', '--verbose',
'--dry-run', '--pgdata',
$node_t->data_dir, '--publisher-server',
-   $node_p->connstr($db1), '--socket-directory',
+   $node_p->connstr($db1), '--socketdir',
$node_t->host, '--subscriber-port',
$node_t->port, '--database',
$db1, '--database',
@@ -185,7 +185,7 @@ sub generate_db
'pg_createsubscriber', '--verbose',
'--dry-run', '--pgdata',
$node_s->data_dir, '--publisher-server',
-   $node_p->connstr($db1), '--socket-directory',
+   $node_p->connstr($db1), '--socketdir',
$node_s->host, '--subscriber-port',
$node_s->port, '--database',
$db1, '--database',
@@ -199,7 +199,7 @@ sub generate_db
'pg_createsubscriber', '--verbose',
'--pgdata', $node_f->data_dir,
'--publisher-server', $node_p->connstr($db1),
-   '--socket-directory', $node_f->host,
+   '--socketdir', $node_f->host,
'--subscriber-port', $node_f->port,
'--database', $db1,
'--database', $db2
@@ -219,7 +219,7 @@ sub generate_db
'pg_createsubscriber', '--verbose',
'--dry-run', '--pgdata',
$node_c->data_dir, '--publisher-server',
-   $node_s->connstr($db1), '--socket-directory',
+   $node_s->connstr($db1), '--socketdir',
$node_c->host, '--subscriber-port',
$node_c->port, '--database',
$db1, '--database',
@@ -242,7 +242,7 @@ sub generate_db
'pg_createsubscriber', '--verbose',
'--

Lack of possibility to specify CTAS TAM

2024-07-31 Thread Kirill Reshke
I have noticed $subj while working with other unrelated patches.
The question is, why is there no CREATE TABLE AS  USING
(some_access_method)?
This feature looks straightforward, and the lack of it is a bit of an
inconsistency from my point of view.
Maybe there are some unobvious caveats with implementing it?
I have done a little research reading related threads [1][2], but
these do not address $subj, if I'm not missing anything.
Neither can I find an open CF entry/thread implementing this (I'm
looking here: http://cfbot.cputube.org/).

The same storage specification feature can actually be supported for
CTAE (like CTAS but execute) and CREATE MATERIALIZED VIEW.

I can try to propose a POC patch implementing $subj if there are no objections
to having this functionality in the core.


[1] 
https://www.postgresql.org/message-id/flat/20180703070645.wchpu5muyto5n647%40alap3.anarazel.de
[2] 
https://www.postgresql.org/message-id/flat/20160812231527.GA690404%40alvherre.pgsql




Re: Use pgBufferUsage for block reporting in analyze

2024-07-31 Thread Anthonin Bonnefoy
On Tue, Jul 30, 2024 at 9:21 AM Anthonin Bonnefoy
 wrote:
> A possible change would be to pass an inh flag when an acquirefunc is
> called from acquire_inherited_sample_rows. The acquirefunc could then
> use an alternative log format to append to logbuf. This way, we could
> have a more compact format for partitioned tables.

I've just tested this; the result isn't great, as it creates inconsistent
output:

INFO:  analyze of table "postgres.public.test_partition"
"test_partition_1": scanned 5 of 5 pages, containing 999 live tuples
and 0 dead tuples; 999 rows in sample, 999 estimated total rows
"test_partition_2": scanned 5 of 5 pages, containing 1000 live tuples
and 0 dead tuples; 1000 rows in sample, 1000 estimated total rows
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
...
INFO:  analyze of table "postgres.public.test_partition_1"
pages: 5 of 5 scanned
tuples: 999 live tuples, 0 are dead; 999 tuples in sample, 999
estimated total tuples
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s

Maybe the best approach is to always use the compact form?

INFO:  analyze of table "postgres.public.test_partition"
"test_partition_1": scanned 5 of 5 pages, containing 999 live tuples
and 0 dead tuples; 999 tuples in sample, 999 estimated total tuples
"test_partition_2": scanned 5 of 5 pages, containing 1000 live tuples
and 0 dead tuples; 1000 tuples in sample, 1000 estimated total tuples
avg read rate: 1.953 MB/s, avg write rate: 0.000 MB/s
...
INFO:  analyze of table "postgres.public.test_partition_1"
"test_partition_1": scanned 5 of 5 pages, containing 999 live tuples
and 0 dead tuples; 999 tuples in sample, 999 estimated total tuples
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s

I've updated the patchset with those changes. 0004 introduces the
StringInfo logbuf so we can emit the details as a single log entry, both
in the server log and during ANALYZE VERBOSE, while using the compact form.

Regards,
Anthonin


v6-0004-Pass-StringInfo-logbuf-to-AcquireSampleFunc.patch
Description: Binary data


v6-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data


v6-0002-Output-analyze-details-on-analyze-verbose.patch
Description: Binary data


v6-0003-Output-wal-usage-of-analyze-in-verbose-output-and.patch
Description: Binary data


Re: proposal: schema variables

2024-07-31 Thread Pavel Stehule
st 31. 7. 2024 v 8:57 odesílatel Laurenz Albe 
napsal:

> On Wed, 2024-07-31 at 08:41 +0200, Pavel Stehule wrote:
> > Probably you didn't attach new files - the second patch is not complete.
> Or you didn't make changes there?
>
> Hm.  What is missing?
>

let.sgml,
session_variable.c
svariable_receiver.c
session_variable.h
...

Regards

Pavel




>
> Yours,
> Laurenz Albe
>


Re: Lack of possibility to specify CTAS TAM

2024-07-31 Thread Andrey M. Borodin



> On 31 Jul 2024, at 12:03, Kirill Reshke  wrote:
> 
> CREATE TABLE AS  USING
> (some_access_method)

This looks in line with the usual CREATE TABLE.
+1 for the feature.
Currently we do not have so many TAMs, but I hope eventually we will have some.


Best regards, Andrey Borodin.



Re: Lack of possibility to specify CTAS TAM

2024-07-31 Thread David G. Johnston
On Wednesday, July 31, 2024, Kirill Reshke  wrote:

> I have noticed $subj while working with other unrelated patches.
> The question is, why there is no CREATE TABLE AS  USING
> (some_access_method)?


The syntax is documented…

CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [
IF NOT EXISTS ] *table_name*
[ (*column_name* [, ...] ) ]
[ USING *method* ]

… AS query

https://www.postgresql.org/docs/current/sql-createtableas.html
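For a concrete illustration of that documented syntax (assuming only the built-in `heap` access method is installed; any TAM name registered via CREATE ACCESS METHOD would work the same way):

```sql
-- USING may name any installed table access method; heap is the default.
CREATE TABLE t_heap
  USING heap
  AS SELECT g AS i FROM generate_series(1, 10) AS g;
```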

David J.


Re: Lack of possibility to specify CTAS TAM

2024-07-31 Thread Kirill Reshke
On Wed, 31 Jul 2024, 12:12 Andrey M. Borodin,  wrote:

>
> Currently we do not have so many TAMs
>
Currently we do not have many TAMs in core. Outside core there are
actually quite a number of projects doing TAMs. OrioleDB is one example.

>


Re: Lack of possibility to specify CTAS TAM

2024-07-31 Thread David G. Johnston
On Wednesday, July 31, 2024, Kirill Reshke  wrote:
>
>
> The same storage specification feature can actually be supported for
> CTAE (like CTAS but execute) and CREATE MATERIALIZED VIEW.
>
>
On a related note, the description here seems outdated.


https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-DEFAULT-TABLE-ACCESS-METHOD

CMV also has this syntax already; we don’t actually have CTAE presently,
correct?

David J.


Re: Logical Replication of sequences

2024-07-31 Thread Peter Smith
Hi Vignesh,

Here are my review comments for your latest 0730_2* patches.

Patch v20240730_2-0001 looks good to me.

Patch v20240730_2-0002 looks good to me.

My comments for the v20240730_2-0003 patch are below:

//

GENERAL

1. Inconsistent terms

I've noticed there are many variations of how the sequence sync worker is known:
- "sequencesync worker"
- "sequence sync worker"
- "sequence-sync worker"
- "sequence synchronization worker"
- more?

We must settle on some standardized name.

AFAICT we generally use "table synchronization worker" in the docs,
and "tablesync worker" in the code and comments.  IMO, we should do
same as that for sequences -- e.g. "sequence synchronization worker"
in the docs, and "sequencesync worker" in the code and comments.

==
doc/src/sgml/catalogs.sgml

nitpick - the links should jump directly to REFRESH PUBLICATION or
REFRESH PUBLICATION SEQUENCES. Currently they go to the top of the
ALTER SUBSCRIPTION page which is not as useful.

==
src/backend/commands/sequence.c

do_setval:
nitpick - minor wording in the function header
nitpick - change some param names to more closely resemble the fields
they get assigned to (/logcnt/log_cnt/, /iscalled/is_called/)

~

2.
  seq->is_called = iscalled;
- seq->log_cnt = 0;
+ seq->log_cnt = (logcnt == SEQ_LOG_CNT_INVALID) ? 0: logcnt;

The logic here for SEQ_LOG_CNT_INVALID seemed strange. Why not just
#define SEQ_LOG_CNT_INVALID as 0 in the first place if that is what
you will assign for invalid? Then you won't need to do anything here
except seq->log_cnt = log_cnt;

==
src/backend/catalog/pg_subscription.c

HasSubscriptionRelations:
nitpick - I think the comment "If even a single tuple exists..." is
not quite accurate. e.g. It also has to be the right kind of tuple.

~~

GetSubscriptionRelations:
nitpick - Give more description in the function header about the other
parameters.
nitpick - I felt that a better name for 'all_relations' is all_states.
Because in my mind *all relations* sounds more like when both
'all_tables' and 'all_sequences' are true.
nitpick - IMO add an Assert to be sure something is being fetched.
Assert(get_tables || get_sequences);
nitpick - Rephrase the "skip the tables" and "skip the sequences"
comments to be more aligned with the code condition.

~

3.
- if (not_ready)
+ /* Get the relations that are not in ready state */
+ if (get_tables && !all_relations)
  ScanKeyInit(&skey[nkeys++],
  Anum_pg_subscription_rel_srsubstate,
  BTEqualStrategyNumber, F_CHARNE,
  CharGetDatum(SUBREL_STATE_READY));
+ /* Get the sequences that are in init state */
+ else if (get_sequences && !all_relations)
+ ScanKeyInit(&skey[nkeys++],
+ Anum_pg_subscription_rel_srsubstate,
+ BTEqualStrategyNumber, F_CHAREQ,
+ CharGetDatum(SUBREL_STATE_INIT));

This is quite tricky, using multiple flags (get_tables and
get_sequences) in such a way. It might even be a bug -- e.g. Is the
'else' keyword correct? Otherwise, when both get_tables and
get_sequences are true, and all_relations is false, then the sequence
part wouldn't even get executed (???).

==
src/backend/commands/subscriptioncmds.c

CreateSubscription:
nitpick - let's move the 'tables' declaration to be beside the
'sequences' var for consistency. (in passing move other vars too)
nitpick - it's not strictly required for the patch, but let's change
the 'tables' loop to be consistent with the new sequences loop.

~~~

4. AlterSubscription_refresh

My first impression (from the function comment) is that these function
parameters are a bit awkward. For example,
- It says:  If 'copy_data' parameter is true, the function will set
the state to "init"; otherwise, it will set the state to "ready".
- It also says: "If 'all_relations' is true, mark all objects with
"init" state..."
Those statements seem to clash. e.g. if copy_data is false but
all_relations is true, then what (???)

~

nitpick - tweak function comment wording.
nitpick - introduce a 'relkind' variable to avoid multiple calls of
get_rel_relkind(relid)
nitpick - use an existing 'relkind' variable instead of calling
get_rel_relkind(relid);
nitpick - add another comment about skipping (for dropping tablesync slots)

~

5.
+ /*
+ * If all the relations should be re-synchronized, then set the
+ * state to init for re-synchronization. This is currently
+ * supported only for sequences.
+ */
+ else if (all_relations)
+ {
+ ereport(DEBUG1,
+ (errmsg_internal("sequence \"%s.%s\" of subscription \"%s\" set to
INIT state",
  get_namespace_name(get_rel_namespace(relid)),
  get_rel_name(relid),
  sub->name)));
+ UpdateSubscriptionRelState(sub->oid, relid, SUBREL_STATE_INIT,
+InvalidXLogRecPtr);

(This is a continuation of my doubts regarding 'all_relations' in the
previous review comment #4 above)

Here are some more questions about it:

~

5a. Why is this an 'else' of the !bsearch? It needs more explanation of
what this case means.

~

5b. Along with more description, it might be better to reverse the
!bsearch condition, s

Re: Lack of possibility to specify CTAS TAM

2024-07-31 Thread David G. Johnston
On Wednesday, July 31, 2024, David G. Johnston 
wrote:

> On Wednesday, July 31, 2024, Kirill Reshke  wrote:
>>
>>
>> The same storage specification feature can actually be supported for
>> CTAE (like CTAS but execute) and CREATE MATERIALIZED VIEW.
>>
>>
> On a related note, the description here seems outdated.
>
>  https://www.postgresql.org/docs/current/runtime-config-
> client.html#GUC-DEFAULT-TABLE-ACCESS-METHOD
>

Nevermind, re-reading it I see it is correct.  The others are all covered
by “create” while “select into” is called out because of its reliance on
the default.

David J.


Re: Lack of possibility to specify CTAS TAM

2024-07-31 Thread Kirill Reshke
On Wed, 31 Jul 2024 at 12:15, David G. Johnston
 wrote:
>
> On Wednesday, July 31, 2024, Kirill Reshke  wrote:
>>
>> I have noticed $subj while working with other unrelated patches.
>> The question is, why there is no CREATE TABLE AS  USING
>> (some_access_method)?
>
>
> The syntax is documented…
My bad.
Everything is supported in core, actually.




Re: Docs: Order of json aggregate functions

2024-07-31 Thread Laurenz Albe
On Tue, 2024-07-23 at 11:45 +0200, Marlene Reiterer wrote:
> Am Mo., 22. Juli 2024 um 15:19 Uhr schrieb Wolfgang Walther 
> :
> > 
> > The order of json related aggregate functions in the docs is currently
> > like this:
> > 
> > [...]
> > json_agg
> > json_objectagg
> > json_object_agg
> > json_object_agg_strict
> > json_object_agg_unique
> > json_arrayagg
> > json_object_agg_unique_strict
> > max
> > min
> > range_agg
> > range_intersect_agg
> > json_agg_strict
> > [...]
> > 
> > json_arrayagg and json_agg_strict are out of place.
> > 
> > Attached patch puts them in the right spot. This is the same down to v16.
> 
> I compiled and it worked and didn't throw an error.
> 
> The changes to the patch seem useful in my perspective, for making it
> easier to find the functions in the documentation, so people will find
> them easier.
> 
> There is another table which isn't sorted too, the "Hypothetical-Set
> Aggregate Functions". Which would be in need of an alphabetical
> sorting too, if all the tables on this side
> of the documentation should look alike.

There are only four hypothetical-set aggregate functions, so it is no problem
to find a function in that list.

I would say that it makes sense to apply the proposed patch, even if we
don't sort that short list.

Yours,
Laurenz Albe




psql client does not handle WSAEWOULDBLOCK on Windows

2024-07-31 Thread Ning
*Description:*
The connection fails with a non-blocking socket error when using psql on
Windows to connect to a PostgreSQL server with GSSAPI enabled. The error
occurs because on Windows the socket error code must be obtained via
WSAGetLastError() rather than errno, so the value of errno is incorrect
when handling a non-blocking socket error.


*Steps to Reproduce:*
1. Compile PostgreSQL client (psql) on Windows.
a. Make sure MIT Kerberos is installed. I used the latest version, MIT
Kerberos 4.1.
b. Make sure GSSAPI is enabled
2. Attempt to connect to a PostgreSQL server using psql.
a. Set up the Kerberos server and configure the PostgreSQL server by
referring
to https://github.com/50wu/kerberos-docker/blob/main/POSTGRES.README.md
b. change the entry to hostgssenc on PostgreSQL server pg_hba.conf and
restart
hostgssenc  all  all  0.0.0.0/0  gss  include_realm=0 krb_realm=GPDB.KRB
c. Use the following command to connect to the database server
psql -h  -U "postgres/krb5-service-example-com.example.com" -d
postgres
3. The connection fails with a non-blocking socket error. The error is
something like:
psql: error: connection to server at "xxx", port 5432 failed:

*Environment*:
PostgreSQL version: 16.3
Operating System: Windows 11


*Fix Steps:*
In the gss_read function of src/interfaces/libpq/fe-secure-gssapi.c, change
the check of the error code to use SOCK_ERRNO to make sure that EAGAIN,
EWOULDBLOCK and EINTR can be properly handled on Windows and other
platforms.

The patch file is attached to this email; please review and consider
merging it into the main code base.

Thanks,
Ning Wu


fix-windows-non-blocking-socket.patch
Description: Binary data


PG17beta2: SMGR: inconsistent type for nblocks

2024-07-31 Thread Matthias van de Meent
Hi,

While working on rebasing the patches of Neon's fork onto the
REL_17_STABLE branch, I noticed that the nblocks arguments of various
smgr functions have inconsistent types: smgrzeroextend accepts
`nblocks` as signed integer, as does the new signature for
smgrprefetch, but the new vectorized operations of *readv and *writev,
and the older *writeback all use an unsigned BlockNumber as indicator
for number of blocks.

Can we update the definition to be consistent across this (new, or
also older) API? As far as I can see, in none of these cases are
negative numbers allowed or expected, so updating this all to be
consistently BlockNumber across the API seems like a straightforward
patch.

cc-ed Thomas as committer of the PG17 smgr API changes.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Remove last traces of HPPA support

2024-07-31 Thread Heikki Linnakangas

On 31/07/2024 08:52, Thomas Munro wrote:

On Tue, Jul 30, 2024 at 9:50 AM Thomas Munro  wrote:

I guess we should also consider reimplementing the spinlock on the
atomic API, but I can see that Andres is poking at spinlock code right
now so I'll keep out of his way...


Here is a first attempt at that.


Looks good, thanks!


I haven't compared the generated asm yet, but it seems to work OK.

The old __i386__ implementation of TAS() said:


 * When this was last tested, we didn't have separate TAS() and 
TAS_SPIN()
 * macros.  Nowadays it probably would be better to do a non-locking 
test
 * in TAS_SPIN() but not in TAS(), like on x86_64, but no-one's done the
 * testing to verify that.  Without some empirical evidence, better to
 * leave it alone.


It seems that you did what the comment suggested. That seems fine. For the
sake of completeness, if someone has an i386 machine lying around, it
would be nice to verify that. Or an official CPU manufacturer's
implementation guide, or references to other implementations or something.



2.  The pg_atomic_unlocked_test_flag() function was surprising to me:
it returns true if it's not currently set (according to a relaxed
load).  Most of this patch was easy, but figuring out that I had
reverse polarity here was a multi-coffee operation :-)  I can't call
it wrong though, as it's not based on , and it's clearly
documented, so *shrug*.


Huh, yeah that's unexpected.


3.  As for why we have a function that <stdatomic.h> doesn't, I
speculate that it might have been intended for implementing this exact
patch, ie wanting to perform that relaxed load while spinning as
recommended by Intel.  (If we strictly had to use <stdatomic.h>
functions, we couldn't use atomic_flag due to the lack of a relaxed
load operation on that type, so we'd probably have to use atomic_char
instead.  Perhaps one day we will cross that bridge.)


As a side note, I remember when I've tried to use pg_atomic_flag in the 
past, I wanted to do an atomic compare-and-exchange on it, to clear the 
value and return the old value. Surprisingly, there's no function to do 
that. There's pg_atomic_test_set_flag(), but no 
pg_atomic_test_clear_flag(). C11 has both "atomic_flag" and 
"atomic_bool", and I guess what I actually wanted was atomic_bool.



- * On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
- * S_UNLOCK() macros must further include hardware-level memory fence
- * instructions to prevent similar re-ordering at the hardware level.
- * TAS() and TAS_SPIN() must guarantee that loads and stores issued after
- * the macro are not executed until the lock has been obtained.  
Conversely,
- * S_UNLOCK() must guarantee that loads and stores issued before the macro
- * have been executed before the lock is released.


That old comment means that both SpinLockAcquire() and SpinLockRelease() 
acted as full memory barriers, and looking at the implementations, that 
was indeed so. With the new implementation, SpinLockAcquire() will have 
"acquire semantics" and SpinLockRelease will have "release semantics". 
That's very sensible, and I don't believe it will break anything, but 
it's a change in semantics nevertheless.


--
Heikki Linnakangas
Neon (https://neon.tech)





RE: Remove duplicate table scan in logical apply worker and code refactoring

2024-07-31 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thanks for creating a patch!

> When reviewing the code in logical/worker.c, I noticed that when applying a
> cross-partition update action, it scans the old partition twice.
> I am attaching the patch 0001 to remove this duplicate table scan.

Just to clarify, you meant that FindReplTupleInLocalRel() is called in both
apply_handle_tuple_routing() and
apply_handle_tuple_routing()->apply_handle_delete_internal(),
each of which requires an index or sequential scan, right? LGTM.

> Apart from above, I found there are quite a few duplicate codes related to 
> partition
> handling(e.g. apply_handle_tuple_routing), so I tried to extract some
> common logic to simplify the codes. Please see 0002 for this refactoring.

IIUC, you wanted to remove the apply code from apply_handle_tuple_routing()
and keep only the partition detection part. Is that right? Anyway, here are
my comments.

01. apply_handle_insert()

```
+targetRelInfo = edata->targetRelInfo;
+
 /* For a partitioned table, insert the tuple into a partition. */
 if (rel->localrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-apply_handle_tuple_routing(edata,
-   remoteslot, NULL, CMD_INSERT);
-else
-apply_handle_insert_internal(edata, edata->targetRelInfo,
- remoteslot);
+remoteslot = apply_handle_tuple_routing(edata, CMD_INSERT, remoteslot,
+&targetRelInfo);
+
+/* For a partitioned table, insert the tuple into a partition. */
+apply_handle_insert_internal(edata, targetRelInfo, remoteslot);
```

This part contains the same comment twice, and there is no need to
substitute in the case of normal tables.
How about:

```
-/* For a partitioned table, insert the tuple into a partition. */
+/*
+ * Find the actual target table if the table is partitioned. Otherwise, use
+ * the same table as the remote one.
+ */
 if (rel->localrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-apply_handle_tuple_routing(edata,
-   remoteslot, NULL, CMD_INSERT);
+remoteslot = apply_handle_tuple_routing(edata, CMD_INSERT, remoteslot,
+&targetRelInfo);
 else
-apply_handle_insert_internal(edata, edata->targetRelInfo,
- remoteslot);
+targetRelInfo = edata->targetRelInfo;
+
+/* Insert a tuple to the target table */
+apply_handle_insert_internal(edata, targetRelInfo, remoteslot);
```

02. apply_handle_tuple_routing()

```
 /*
- * This handles insert, update, delete on a partitioned table.
+ * Determine the partition in which the tuple in slot is to be inserted, and
...
```

But this function is called from delete_internal(). How about "Determine the
partition to which the tuple in the slot belongs."?

03. apply_handle_tuple_routing()

Is there a reason why this returns `TupleTableSlot *` rather than
`ResultRelInfo *`? I'm not sure, but returning the relation info seems
more natural to me, since this is a routing function.

04. apply_handle_update()

```
+targetRelInfo = edata->targetRelInfo;
+targetrel = rel;
+remoteslot_root = remoteslot;
```

Here I can say the same thing as 1.

05. apply_handle_update_internal()

It looks like the ordering of the function implementations has changed. Is
it intentional?

before

apply_handle_update
apply_handle_update_internal
apply_handle_delete
apply_handle_delete_internal
FindReplTupleInLocalRel
apply_handle_tuple_routing

after

apply_handle_update
apply_handle_delete
apply_handle_delete_internal
FindReplTupleInLocalRel
apply_handle_tuple_routing
apply_handle_update_internal

06. apply_handle_delete_internal()

```
+targetRelInfo = edata->targetRelInfo;
+targetrel = rel;
+
```

Here I can say the same thing as 1.

Best regards,
Hayato Kuroda
FUJITSU LIMITED





Re: Logical Replication of sequences

2024-07-31 Thread shveta malik
On Mon, Jun 10, 2024 at 5:00 PM vignesh C  wrote:
>
> On Mon, 10 Jun 2024 at 12:24, Amul Sul  wrote:
> >
> >
> >
> > On Sat, Jun 8, 2024 at 6:43 PM vignesh C  wrote:
> >>
> >> On Wed, 5 Jun 2024 at 14:11, Amit Kapila  wrote:
> >> [...]
> >> A new catalog table, pg_subscription_seq, has been introduced for
> >> mapping subscriptions to sequences. Additionally, the sequence LSN
> >> (Log Sequence Number) is stored, facilitating determination of
> >> sequence changes occurring before or after the returned sequence
> >> state.
> >
> >
> > Can't it be done using pg_depend? It seems a bit excessive unless I'm 
> > missing
> > something.
>
> We'll require the lsn because the sequence LSN informs the user that
> it has been synchronized up to the LSN in pg_subscription_seq. Since
> we are not supporting incremental sync, the user will be able to
> identify if he should run refresh sequences or not by checking the lsn
> of the pg_subscription_seq and the lsn of the sequence(using
> pg_sequence_state added) in the publisher.

How will the user know from the sequence's LSN that a refresh is needed?
The LSN indicates the page LSN, so the sequence might advance on the
publisher without changing the LSN; thus the LSN may look the same on the
subscriber even though a sequence refresh is needed. Am I missing
something here?

thanks
Shveta




RE: make pg_createsubscriber option names more consistent

2024-07-31 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

> I propose to rename the pg_createsubscriber option --socket-directory to
> --socketdir.  This would make it match the equivalent option in
> pg_upgrade.  (It even has the same short option '-s'.)
> pg_createsubscriber and pg_upgrade have a lot of common terminology and
> a similar operating mode, so it would make sense to keep this consistent.

+1. If so, should we say "default current dir." instead of "default current
directory" in usage(), because pg_upgrade says it like that?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
 


Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-07-31 Thread Yugo NAGATA
Hi,

On Fri, 26 Jul 2024 16:47:23 -0700
Jeff Davis  wrote:

> Hello,
> 
> Thank you for looking.
> 
> On Fri, 2024-07-26 at 12:26 +0900, Yugo Nagata wrote:
> > Since this commit, matviews are no longer handled in
> > ExecCreateTableAs, so the
> > following error message has not to consider materialized view cases,
> > and can be made simple.
> > 
> >     /* SELECT should never rewrite to more or less than one
> > SELECT query */
> >     if (list_length(rewritten) != 1)
> >     elog(ERROR, "unexpected rewrite result for %s",
> >  is_matview ? "CREATE MATERIALIZED VIEW" :
> >  "CREATE TABLE AS SELECT");
> 
> There's a similar error in refresh_matview_datafill(), and I suppose
> that should account for the CREATE MATERIALIZED VIEW case. We could
> pass an additional flag to RefreshMatViewByOid to indicate whether it's
> a CREATE or REFRESH, but it's an internal error, so perhaps it's not
> important.

Thank you for looking into the patch.

I agree that it might not be important, but I think adding the flag would
also be helpful for improving code readability, because it clarifies that
the function is used in the two cases. I attached a patch for this fix
(patch 0003).

> > Another my question is why RefreshMatViewByOid has a ParamListInfo
> > parameter.
> 
> I just passed the params through, but you're right, they aren't
> referenced at all.
> 
> I looked at the history, and it appears to go all the way back to the
> function's introduction in commit 3bf3ab8c56.
> 
> > I don't understand why ExecRefreshMatView has one, either, because
> > currently
> > materialized views may not be defined using bound parameters, which
> > is checked
> > in transformCreateTableAsStmt, and the param argument is not used at
> > all. It might
> > be unsafe to change the interface of ExecRefreshMatView since this is
> > public for a
> > long time, but I don't think the new interface RefreshMatViewByOid
> > has to have this
> > unused argument.
> 
> Extensions should be prepared for reasonable changes in these kinds of
> functions between releases. Even if the signatures remain the same, the
> parse structures may change, which creates similar incompatibilities.
> So let's just get rid of the 'params' argument from both functions.

Sure. I fixed the patch to remove 'params' from both functions. (patch 0002)

I also add the small refactoring around ExecCreateTableAs(). (patch 0001)

- Remove matview-related codes from intorel_startup.
  Materialized views are no longer handled in this function.

- RefreshMatViewByOid is moved to just after create_ctas_nodata
  call to improve code readability.


Regards,
Yugo Nagata

-- 
Yugo NAGATA 
>From e6027d1e417b446d09411a98fca11a831c6b7b03 Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Wed, 31 Jul 2024 17:22:41 +0900
Subject: [PATCH v2 3/3] Add a flag to RefreshMatviewByOid to indicate whether
 it is CREATE or not

RefreshMatviewByOid is used for both REFRESH and CREATE MATERIALIZED VIEW.
This flag is currently just used for handling internal error messages,
but it is also aimed at improving code readability.
---
 src/backend/commands/matview.c | 38 +-
 src/include/commands/matview.h |  3 ++-
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 8644ad695c..29a80fe565 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -60,7 +60,7 @@ static bool transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void transientrel_shutdown(DestReceiver *self);
 static void transientrel_destroy(DestReceiver *self);
 static uint64 refresh_matview_datafill(DestReceiver *dest, Query *query,
-	   const char *queryString);
+	   const char *queryString, bool is_create);
 static char *make_temptable_name_n(char *tempname, int n);
 static void refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
    int save_sec_context);
@@ -136,7 +136,7 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
 		  NULL);
 
 	return RefreshMatViewByOid(matviewOid, stmt->skipData, stmt->concurrent,
-			   queryString, qc);
+			   queryString, qc, false);
 }
 
 /*
@@ -157,10 +157,14 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
  *
  * The matview's "populated" state is changed based on whether the contents
  * reflect the result set of the materialized view's query.
+ *
+ * This is also used to populate the materialized view created by CREATE
+ * MATERIALIZED VIEW command.
  */
 ObjectAddress
 RefreshMatViewByOid(Oid matviewOid, bool skipData, bool concurrent,
-	const char *queryString, QueryCompletion *qc)
+	const char *queryString, QueryCompletion *qc,
+	bool is_create)
 {
 	Relation	matviewRel;
 	RewriteRule *rule;
@@ -169,7 +173,6 @@ RefreshMatViewByOid(Oid matviewOid, bool skipData, bool concurrent,
 	Oid			tableSpace;
 	Oid			re

Re: Proposal: Document ABI Compatibility

2024-07-31 Thread Peter Eisentraut

On 19.07.24 16:10, David E. Wheeler wrote:

On Jun 27, 2024, at 18:07, David E. Wheeler  wrote:


Minor nit - misspelled “considerd"


Thank you, Jeremy. V3 attached.


Rebase on 5784a49 attached. I presume this topic needs quite a bit of review 
and consensus from the committers more generally.


Well, nobody has protested against what we wrote, so I have committed it.





Re: Logical Replication of sequences

2024-07-31 Thread vignesh C
On Sat, 20 Jul 2024 at 20:48, vignesh C  wrote:
>
> On Fri, 12 Jul 2024 at 08:22, Peter Smith  wrote:
> >
> > Hi Vignesh. Here are the rest of my comments for patch v20240705-0003.
> > ==
> >
> > 8. logicalrep_sequence_sync_worker_find
> >
> > +/*
> > + * Walks the workers array and searches for one that matches given
> > + * subscription id.
> > + *
> > + * We are only interested in the sequence sync worker.
> > + */
> > +LogicalRepWorker *
> > +logicalrep_sequence_sync_worker_find(Oid subid, bool only_running)
> >
> > There are other similar functions for walking the workers array to
> > search for a worker. Instead of having different functions for
> > different cases, wouldn't it be cleaner to combine these into a single
> > function, where you pass a parameter (e.g. a mask of worker types that
> > you are interested in finding)?

This is fixed in the v20240730_2 version attached at [1].

> > 17.
> > Also, where does the number 100 come from? Why not 1000? Why not 10?
> > Why have batching at all? Maybe there should be some comment to
> > describe the reason and the chosen value.

I had run some tests with 10/100 and 1000 sequences per batch for
1 sequences. The results for it:
10 per batch   - 4.94 seconds
100 per batch  - 4.87 seconds
1000 per batch - 4.53 seconds

There is not much time difference between each of them. Currently, it
is set to 100, which seems fine since it will not generate a lot of
transactions. Additionally, the locks on the sequences will be
periodically released during the commit transaction.

I had used the test from the attached patch by changing
max_sequences_sync_per_batch to 10/100/1000 in 035_sequences.pl to
verify this.

[1] - 
https://www.postgresql.org/message-id/CALDaNm3%2BXzHAbgyn8gmbBLK5goyv_uyGgHEsTQxRZ8bVk6nAEg%40mail.gmail.com

Regards,
Vignesh
  | 1000 seq/batch | 100 seq/batch |10 seq/batch
--||---|-
Exec time | 4.604323149| 4.719184017   |4.945001841
Exec time | 4.518861055| 4.738708973   |4.958534002
Exec time | 4.522281885| 4.873774052   |4.957324982
Exec time | 4.524350882| 4.878502131   |4.943160057
Exec time | 4.531511068| 4.874341965   |4.850522995
Median| 4.524350882| 4.873774052   |4.945001841
From d060bd7903d812904816f0ada004420e2d9f0c21 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Tue, 30 Jul 2024 11:34:55 +0530
Subject: [PATCH] Performance testing changes.

Performance testing changes.
---
 src/backend/replication/logical/launcher.c |   1 +
 src/backend/replication/logical/sequencesync.c |   6 +-
 src/backend/utils/misc/guc_tables.c|  12 +++
 src/backend/utils/misc/postgresql.conf.sample  |   1 +
 src/include/replication/logicallauncher.h  |   2 +
 src/test/subscription/t/035_sequences.pl   | 142 +
 6 files changed, 162 insertions(+), 2 deletions(-)
 create mode 100644 src/test/subscription/t/035_sequences.pl

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 04d76e7..2ece21f 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -50,6 +50,7 @@
 int			max_logical_replication_workers = 4;
 int			max_sync_workers_per_subscription = 2;
 int			max_parallel_apply_workers_per_subscription = 2;
+int			max_sequences_sync_per_batch = 10;
 
 LogicalRepWorker *MyLogicalRepWorker = NULL;
 
diff --git a/src/backend/replication/logical/sequencesync.c b/src/backend/replication/logical/sequencesync.c
index fc36bf9..6bcfbdf 100644
--- a/src/backend/replication/logical/sequencesync.c
+++ b/src/backend/replication/logical/sequencesync.c
@@ -20,6 +20,7 @@
 #include "catalog/pg_subscription_rel.h"
 #include "commands/sequence.h"
 #include "pgstat.h"
+#include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
 #include "replication/worker_internal.h"
 #include "utils/acl.h"
@@ -29,6 +30,7 @@
 #include "utils/rls.h"
 #include "utils/usercontext.h"
 
+
 /*
  * fetch_remote_sequence_data
  *
@@ -287,11 +289,11 @@ LogicalRepSyncSequences(void)
 		 * Have we reached the end of the current batch of sequences,
 		 * or last remaining sequences to synchronize?
 		 */
-		if (((curr_seq % MAX_SEQUENCES_SYNC_PER_BATCH) == 0) ||
+		if (((curr_seq % max_sequences_sync_per_batch) == 0) ||
 			curr_seq == seq_count)
 		{
 			/* Obtain the starting index of the current batch. */
-			int			i = (curr_seq - 1) - ((curr_seq - 1) % MAX_SEQUENCES_SYNC_PER_BATCH);
+			int			i = (curr_seq - 1) - ((curr_seq - 1) % max_sequences_sync_per_batch);
 
 			/* LOG all the sequences synchronized during current batch. */
 			for (; i < curr_seq; i++)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 6a623f5..017fb59 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tab

Re: can we mark upper/lower/textlike functions leakproof?

2024-07-31 Thread Andrew Dunstan


On 2024-07-30 Tu 6:51 PM, David Rowley wrote:

On Wed, 31 Jul 2024 at 09:35, Andrew Dunstan  wrote:

Fast forward to now. The customer has found no observable ill effects of
marking these functions leakproof. They would like to know if there is
any reason why we can't mark them leakproof, so that they don't have to
do this in every database of every cluster they use.

Thoughts?

According to [1], it's just not been done yet due to concerns about
risk to reward ratios.  Nobody mentioned any reason why it couldn't
be, but there were some fears that future code changes could yield new
failure paths.

David

[1]https://postgr.es/m/02BDFCCF-BDBB-4658-9717-4D95F9A91561%40thebuild.com



Hmm, somehow I missed that thread in searching, and clearly I'd 
forgotten it.


Still, I'm not terribly convinced by arguments along the lines you're 
suggesting. "Sufficient unto the day is the evil thereof." Maybe we need 
a test to make sure we don't make changes along those lines, although I 
have no idea what such a test would look like.



cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Remove last traces of HPPA support

2024-07-31 Thread Thomas Munro
On Wed, Jul 31, 2024 at 8:47 PM Heikki Linnakangas  wrote:
> On 31/07/2024 08:52, Thomas Munro wrote:
> The old __i386__ implementation of TAS() said:
>
> >* When this was last tested, we didn't have separate TAS() and 
> > TAS_SPIN()
> >* macros.  Nowadays it probably would be better to do a non-locking 
> > test
> >* in TAS_SPIN() but not in TAS(), like on x86_64, but no-one's done 
> > the
> >* testing to verify that.  Without some empirical evidence, better to
> >* leave it alone.
>
> It seems that you did what the comment suggested. That seems fine. For
> sake of completeness, if someone has an i386 machine lying around, it
> would be nice to verify that. Or an official CPU manufacturer's
> implementation guide, or references to other implementations or something.

Hmm, the last "real" 32 bit CPU is from ~20 years ago.  Now the only
32 bit x86 systems we should nominally care about are modern CPUs that
can also run 32 bit instructions; is there a reason to think they'd
behave differently at this level?  Looking at the current Intel
optimisation guide's discussion of spinlock implementation at page
2-34 of [1], it doesn't distinguish between 32 and 64, and it has that
double-check thing.

> > - * On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
> > - * S_UNLOCK() macros must further include hardware-level memory fence
> > - * instructions to prevent similar re-ordering at the hardware level.
> > - * TAS() and TAS_SPIN() must guarantee that loads and stores issued 
> > after
> > - * the macro are not executed until the lock has been obtained.  
> > Conversely,
> > - * S_UNLOCK() must guarantee that loads and stores issued before the 
> > macro
> > - * have been executed before the lock is released.
>
> That old comment means that both SpinLockAcquire() and SpinLockRelease()
> acted as full memory barriers, and looking at the implementations, that
> was indeed so. With the new implementation, SpinLockAcquire() will have
> "acquire semantics" and SpinLockRelease will have "release semantics".
> That's very sensible, and I don't believe it will break anything, but
> it's a change in semantics nevertheless.

Yeah.  It's interesting that our pg_atomic_clear_flag(f) is like
standard atomic_flag_clear_explicit(f, memory_order_release), not like
atomic_flag_clear(f) which is short for atomic_flag_clear_explicit(f,
memory_order_seq_cst).  Example spinlock code I've seen written in
modern C or C++ therefore uses the _explicit variants, so it can get
acquire/release, which is what people usually want from a lock-like
thing.  What's a good way to test the performance in PostgreSQL?  In a
naive loop that just test-and-sets and clears a flag a billion times
in a loop and does nothing else, I see 20-40% performance increase
depending on architecture when comparing _seq_cst with
_acquire/_release.  You're right that this semantic change deserves
explicit highlighting, in comments somewhere...  I wonder if we have
anywhere that is counting on the stronger barrier...

[1] 
https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html




Re: Conflict detection and logging in logical replication

2024-07-31 Thread Amit Kapila
On Wed, Jul 31, 2024 at 7:40 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Here is the V8 patch set. It includes the following changes:
>

A few more comments:
1. I think in FindConflictTuple() the patch locks the tuple so that,
if there is a concurrent delete after a conflict is found, it can
retry to find the tuple. If there is no concurrent delete then we can
successfully report the conflict. Is that correct? If so, it is better
to explain this somewhere in the comments.

2.
* Note that this doesn't lock the values in any way, so it's
 * possible that a conflicting tuple is inserted immediately
 * after this returns.  But this can be used for a pre-check
 * before insertion.
..
ExecCheckIndexConstraints()

These comments indicate that this function can be used before
inserting the tuple, however, this patch uses it after inserting the
tuple as well. So, I think the comments should be updated accordingly.

3.
 * For unique indexes, we usually don't want to add info to the IndexInfo for
 * checking uniqueness, since the B-Tree AM handles that directly.  However,
 * in the case of speculative insertion, additional support is required.
...
BuildSpeculativeIndexInfo(){...}

This additional support is now required even for logical replication
to detect conflicts. So the comments atop this function should reflect
the same.

-- 
With Regards,
Amit Kapila.




The stats.sql test is failing sporadically in v14- on POWER7/AIX 7.1 buildfarm animals

2024-07-31 Thread Alexander Lakhin

Hello hackers,

Yesterday, the buildfarm animal sungazer was benevolent enough to
demonstrate a rare anomaly, related to old stats collector:
test stats    ... FAILED   469155 ms


 1 of 212 tests failed.


--- 
/home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/expected/stats.out
 2022-03-30 01:18:17.0 +
+++ 
/home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/results/stats.out
 2024-07-30 09:49:39.0 +
@@ -165,11 +165,11 @@
   WHERE relname like 'trunc_stats_test%' order by relname;
    relname  | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | 
n_dead_tup
---+---+---+---++
-  trunc_stats_test  | 3 | 0 | 0 | 0 |  0
-  trunc_stats_test1 | 4 | 2 | 1 | 1 |  0
-  trunc_stats_test2 | 1 | 0 | 0 | 1 |  0
-  trunc_stats_test3 | 4 | 0 | 0 | 2 |  2
-  trunc_stats_test4 | 2 | 0 | 0 | 0 |  2
+  trunc_stats_test  | 0 | 0 | 0 | 0 |  0
+  trunc_stats_test1 | 0 | 0 | 0 | 0 |  0
+  trunc_stats_test2 | 0 | 0 | 0 | 0 |  0
+  trunc_stats_test3 | 0 | 0 | 0 | 0 |  0
+  trunc_stats_test4 | 0 | 0 | 0 | 0 |  0
...

inst/logfile contains:
2024-07-30 09:25:11.225 UTC [63307946:1] LOG:  using stale statistics instead of current ones because stats collector is 
not responding
2024-07-30 09:25:11.345 UTC [11206724:559] pg_regress/create_index LOG:  using stale statistics instead of current ones 
because stats collector is not responding

...

That's not the only failure of that kind occurred on sungazer, there were
also [2] (REL_13_STABLE), [3] (REL_13_STABLE), [4] (REL_12_STABLE).
Moreover, such failures were produced by all the other POWER7/AIX 7.1
animals: hornet ([5], [6]), tern ([7], [8]), mandrill ([9], [10], ...).
But I could not find such failures coming from POWER8 animals: hoverfly
(running AIX 7200-04-03-2038), ayu, boa, chub, and I did not encounter such
anomalies on x86 nor ARM platforms.

Thus, it looks like this stats collector issue only happens on this
concrete platform, and given [11], I think such failures should
perhaps just be ignored for the next two years (until v14 EOL) unless
AIX 7.1 is upgraded and we see them on a vendor-supported OS version.

So I'm parking this information here just for reference.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-07-30%2003%3A49%3A35
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-02-09%2009%3A29%3A10
[3] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2022-06-16%2009%3A52%3A47
[4] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-12-13%2003%3A40%3A42
[5] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-03-29%2005%3A27%3A09
[6] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-03-19%2002%3A09%3A07
[7] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-12-16%2009%3A17%3A38
[8] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-04-01%2003%3A09%3A38
[9] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-05%2004%3A22%3A17
[10] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-07-12%2004%3A31%3A37
[11] https://www.postgresql.org/message-id/3154146.1697661946%40sss.pgh.pa.us

Best regards,
Alexander




RE: Remove duplicate table scan in logical apply worker and code refactoring

2024-07-31 Thread Zhijie Hou (Fujitsu)
On Wednesday, July 31, 2024 5:07 PM Kuroda, Hayato/黒田 隼人 
 wrote:
> 
> Dear Hou,
> 
> > When reviewing the code in logical/worker.c, I noticed that when
> > applying a cross-partition update action, it scans the old partition twice.
> > I am attaching the patch 0001 to remove this duplicate table scan.
> 
> Just to clarify, you meant that FindReplTupleInLocalRel() is called in
> apply_handle_tuple_routing() and
> apply_handle_tuple_routing()->apply_handle_delete_internal(),
> both of which require an index or sequential scan, right? LGTM.

Thanks for reviewing the patch, and your understanding is correct.

Here is the updated patch 0001. I removed the comments as suggested by Amit.

Since the 0002 patch only refactors the code and I need some time to
review its comments, I will hold it until 0001 is committed.

Best Regards,
Hou zj


v2-0001-avoid-duplicate-table-scan-for-cross-partition-up.patch
Description:  v2-0001-avoid-duplicate-table-scan-for-cross-partition-up.patch


Re: refactor the CopyOneRowTo

2024-07-31 Thread Ilia Evdokimov

Hi, hackers

I'm sure this refactoring is useful because it makes the handling of
binary and non-binary datum values more readable.


However, I can see a little improvement. We can declare the variable
'bytea *outputbytes' inside the 'else' branch, because it is used
nowhere else.



Regards,

Ilia Evdokimov,

Tantor Labs LLC.





Re: can we mark upper/lower/textlike functions leakproof?

2024-07-31 Thread Joe Conway

On 7/31/24 05:47, Andrew Dunstan wrote:

On 2024-07-30 Tu 6:51 PM, David Rowley wrote:

On Wed, 31 Jul 2024 at 09:35, Andrew Dunstan wrote:

Fast forward to now. The customer has found no observable ill effects of
marking these functions leakproof. They would like to know if there is
any reason why we can't mark them leakproof, so that they don't have to
do this in every database of every cluster they use.

Thoughts?

According to [1], it's just not been done yet due to concerns about
risk to reward ratios.  Nobody mentioned any reason why it couldn't
be, but there were some fears that future code changes could yield new
failure paths.

David

[1]https://postgr.es/m/02BDFCCF-BDBB-4658-9717-4D95F9A91561%40thebuild.com


Hmm, somehow I missed that thread in searching, and clearly I'd 
forgotten it.


Still, I'm not terribly convinced by arguments along the lines you're 
suggesting. "Sufficient unto the day is the evil thereof." Maybe we need 
a test to make sure we don't make changes along those lines, although I 
have no idea what such a test would look like.



I think I have expressed this opinion before (which was shot down), and 
I will grant that it is hand-wavy, but I will give it another try.


In my opinion, for this use case and others, it should be possible to 
redact the values substituted into log messages based on some criteria. 
One of those criteria could be "I am in a leakproof call right now". In 
fact in a similar fashion, an extension ought to be able to mutate the 
log message based on the entire string, e.g. when "ALTER 
ROLE...PASSWORD..." is spotted I would like to be able to redact 
everything after "PASSWORD".


Yes it might render the error message unhelpful, but I know of users 
that would accept that tradeoff in order to get better performance and 
security on their production workloads. Or in some cases (e.g. PASSWORD) 
just better security.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: refactor the CopyOneRowTo

2024-07-31 Thread Junwang Zhao
On Fri, Jul 5, 2024 at 12:26 AM jian he  wrote:
>
> original CopyOneRowTo:
> https://git.postgresql.org/cgit/postgresql.git/tree/src/backend/commands/copyto.c#n922
> I change it to:
> ---
> if (!cstate->opts.binary)
> {
> foreach_int(attnum, cstate->attnumlist)
> {
> Datum value = slot->tts_values[attnum - 1];
> bool isnull = slot->tts_isnull[attnum - 1];
>
> if (need_delim)
> CopySendChar(cstate, cstate->opts.delim[0]);
> need_delim = true;
>
> if (isnull)
> CopySendString(cstate, cstate->opts.null_print_client);
> else
> {
> string = OutputFunctionCall(&out_functions[attnum - 1],
> value);
> if (cstate->opts.csv_mode)
> CopyAttributeOutCSV(cstate, string,
> cstate->opts.force_quote_flags[attnum - 1]);
> else
> CopyAttributeOutText(cstate, string);
> }
> }
> }
> else
> {
> foreach_int(attnum, cstate->attnumlist)
> {
> Datum value = slot->tts_values[attnum - 1];
> bool isnull = slot->tts_isnull[attnum - 1];
> bytea*outputbytes;
>
> if (isnull)
> CopySendInt32(cstate, -1);
> else
> {
> outputbytes = SendFunctionCall(&out_functions[attnum - 1],
> value);
> CopySendInt32(cstate, VARSIZE(outputbytes) - VARHDRSZ);
> CopySendData(cstate, VARDATA(outputbytes),
> VARSIZE(outputbytes) - VARHDRSZ);
> }
> }
> }
>
>
> overall less "if else" logic,
> also copy format don't change during copy, we don't need to check
> binary or nor for each datum value.

I believe what you proposed is included in the patch 0002
attached in [1], but you use foreach_int, which I think is
an improvement.

[1] 
https://www.postgresql.org/message-id/20240724.173059.909782980111496972.kou%40clear-code.com


-- 
Regards
Junwang Zhao




Re: can we mark upper/lower/textlike functions leakproof?

2024-07-31 Thread Andrew Dunstan



On 2024-07-31 We 9:14 AM, Joe Conway wrote:

On 7/31/24 05:47, Andrew Dunstan wrote:

On 2024-07-30 Tu 6:51 PM, David Rowley wrote:
On Wed, 31 Jul 2024 at 09:35, Andrew Dunstan 
wrote:
Fast forward to now. The customer has found no observable ill 
effects of

marking these functions leakproof. They would like to know if there is
any reason why we can't mark them leakproof, so that they don't 
have to

do this in every database of every cluster they use.

Thoughts?

According to [1], it's just not been done yet due to concerns about
risk to reward ratios.  Nobody mentioned any reason why it couldn't
be, but there were some fears that future code changes could yield new
failure paths.

David

[1]https://postgr.es/m/02BDFCCF-BDBB-4658-9717-4D95F9A91561%40thebuild.com 



Hmm, somehow I missed that thread in searching, and clearly I'd 
forgotten it.


Still, I'm not terribly convinced by arguments along the lines you're 
suggesting. "Sufficient unto the day is the evil thereof." Maybe we 
need a test to make sure we don't make changes along those lines, 
although I have no idea what such a test would look like.



I think I have expressed this opinion before (which was shot down), 
and I will grant that it is hand-wavy, but I will give it another try.


In my opinion, for this use case and others, it should be possible to 
redact the values substituted into log messages based on some 
criteria. One of those criteria could be "I am in a leakproof call 
right now". In fact in a similar fashion, an extension ought to be 
able to mutate the log message based on the entire string, e.g. when 
"ALTER ROLE...PASSWORD..." is spotted I would like to be able to 
redact everything after "PASSWORD".


Yes it might render the error message unhelpful, but I know of users 
that would accept that tradeoff in order to get better performance and 
security on their production workloads. Or in some cases (e.g. 
PASSWORD) just better security.



--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: pg_upgrade failing for 200+ million Large Objects

2024-07-31 Thread Alexander Korotkov
On Mon, Jul 29, 2024 at 12:24 AM Tom Lane  wrote:
> So I'm forced to the conclusion that we'd better make the transaction
> size adaptive as per Alexander's suggestion.
>
> In addition to the patches attached, I experimented with making
> dumpTableSchema fold all the ALTER TABLE commands for a single table
> into one command.  That's do-able without too much effort, but I'm now
> convinced that we shouldn't.  It would break the semicolon-counting
> hack for detecting that tables like these involve extra work.
> I'm also not very confident that the backend won't have trouble with
> ALTER TABLE commands containing hundreds of subcommands.  That's
> something we ought to work on probably, but it's not a project that
> I want to condition v17 pg_upgrade's stability on.
>
> Anyway, proposed patches attached.  0001 is some trivial cleanup
> that I noticed while working on the failed single-ALTER-TABLE idea.
> 0002 merges the catalog-UPDATE commands that dumpTableSchema issues,
> and 0003 is Alexander's suggestion.

Nice to see you picked up my idea.  I took a look over the patchset.
Looks good to me.

--
Regards,
Alexander Korotkov
Supabase




Re: Fixing backslash dot for COPY FROM...CSV

2024-07-31 Thread Daniel Verite
Sutou Kouhei wrote:

> BTW, here is a diff after pgindent:

PFA a v5 with the cosmetic changes applied.

> I also confirmed the compatibility problem between an updated server
> and a non-updated psql (the end-of-data "\." is inserted). It seems
> that it's difficult to solve without introducing incompatibility.

To clarify the compatibility issue, the attached bash script
compares pre-patch and post-patch client/server combinations with
different cases, submitted with different copy variants.

case A: quoted backslash-dot sequence in CSV
case B: unquoted backslash-dot sequence in CSV
case C: CSV without backslash-dot
case D: text with backslash-dot at the end
case E: text without backslash-dot at the end

The different ways to submit the data:
copy from file
\copy from file
\copy from pstdin
copy from stdin with embedded data after the command

Also attaching the tables of results with the patch as it stands.
"Failed" is when psql reports an error and "Data mismatch"
is when it succeeds but with copied data differing from what was
expected.

Does your test have an equivalent in these results, or is it a
different case?



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
===
From b820bca30632c688dfc6a016e4550e05b1630c7d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Mon, 29 Jul 2024 15:19:52 +0200
Subject: [PATCH v5] Support backslash-dot on a line by itself as valid data in
 COPY FROM...CSV

---
 doc/src/sgml/ref/copy.sgml   |  6 +--
 doc/src/sgml/ref/psql-ref.sgml   |  4 --
 src/backend/commands/copyfromparse.c | 57 +++-
 src/bin/psql/copy.c  | 31 ++-
 src/test/regress/expected/copy.out   | 18 +
 src/test/regress/sql/copy.sql| 12 ++
 6 files changed, 66 insertions(+), 62 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 1518af8a04..f7eb59dbac 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -814,11 +814,7 @@ COPY count
 format, \., the end-of-data marker, could also appear
 as a data value.  To avoid any misinterpretation, a \.
 data value appearing as a lone entry on a line is automatically
-quoted on output, and on input, if quoted, is not interpreted as the
-end-of-data marker.  If you are loading a file created by another
-application that has a single unquoted column and might have a
-value of \., you might need to quote that value in the
-input file.
+quoted on output.

 

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 07419a3b92..7f56e24359 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1120,10 +1120,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 
'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.
 
 
 
diff --git a/src/backend/commands/copyfromparse.c 
b/src/backend/commands/copyfromparse.c
index 7efcb89159..97e0b9777e 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -136,14 +136,6 @@ if (1) \
} \
 } else ((void) 0)
 
-/* Undo any read-ahead and jump out of the block. */
-#define NO_END_OF_COPY_GOTO \
-if (1) \
-{ \
-   input_buf_ptr = prev_raw_ptr + 1; \
-   goto not_end_of_copy; \
-} else ((void) 0)
-
 /* NOTE: there's a copy of this in copyto.c */
 static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";
 
@@ -1182,7 +1174,6 @@ CopyReadLineText(CopyFromState cstate)
boolresult = false;
 
/* CSV variables */
-   boolfirst_char_in_line = true;
boolin_quote = false,
last_was_esc = false;
charquotec = '\0';
@@ -1377,10 +1368,9 @@ CopyReadLineText(CopyFromState cstate)
}
 
/*
-* In CSV mode, we only recognize \. alone on a line.  This is 
because
-* \. is a valid CSV data value.
+* In CSV mode, backslash is a normal character.
 */
-   if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+   if (c == '\\' && !cstate->opts.csv_mode)
{
charc2;
 
@@ -1413,21 +1403,15 @@ CopyReadLineText(CopyFromState cstate)
 
if (c2 == '\n')
{
-   if (!cstate->opts.csv_mode)
-   ereport(ERROR,

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

2024-07-31 Thread Alexander Korotkov
On Mon, Jul 29, 2024 at 7:20 PM Robert Haas  wrote:
> On Fri, Jul 26, 2024 at 4:10 PM Alexander Korotkov  
> wrote:
> > 0002 could also use pg_indent and pgperltidy.  And I have couple other
> > notes regarding 0002.
>
> As Tom said elsewhere, we don't have a practice of requiring perltidy
> for every commit, and I normally don't bother. The tree is
> pgindent-clean currently so I believe I got that part right.

Got it, thanks.

> > >In the process of fixing these bugs, I realized that the logic to wait
> > >for WAL summarization to catch up was spread out in a way that made
> > >it difficult to reuse properly, so this code refactors things to make
> > >it easier.
> >
> > It would be nice to split refactoring out of material logic changed.
> > This way it would be easier to review and would look cleaner in the
> > git history.
>
> I support that idea in general but felt it was overkill in this case:
> it's new code, and there was only one existing caller of the function
> that got refactored, and I'm not a huge fan of cluttering the git
> history with a bunch of tiny little refactoring commits to fix a
> single bug. I might have changed it if I'd seen this note before
> committing, though.

I understand your point.  I'm also not a huge fan of a flood of small
commits.  Nevertheless, I find splitting refactoring from other
changes generally useful.  That could be a single commit of many small
refactorings, not many small commits.  The point for me is easier
review: you can expect a refactoring commit to contain "isomorphic"
changes, while the other commits implement material logic changes.  But
that might be a committer preference though.

> > >To make this fix work, also teach the WAL summarizer that after a
> > >promotion has occurred, no more WAL can appear on the previous
> > >timeline: previously, the WAL summarizer wouldn't switch to the new
> > >timeline until we actually started writing WAL there, but that meant
> > >that when the startup process was waiting for the WAL summarizer, it
> > >was waiting for an action that the summarizer wasn't yet prepared to
> > >take.
> >
> > I think this is a pretty long sentence, and I'm not sure I can
> > understand it correctly.  Does startup process wait for the WAL
> > summarizer without this patch?  If not, it's not clear for me that
> > word "previously" doesn't distribute to this part of sentence.
> > Breaking this into multiple sentences could improve the clarity for
> > me.
>
> Yes, I think that phrasing is muddled. It's too late to amend the
> commit message, but for posterity: I initially thought that I could
> fix this bug by just teaching the startup process to wait for WAL
> summarization before performing the .partial renaming, but that was
> not enough by itself. The problem is that the WAL summarizer would
> read all the records that were present in the final WAL file on the
> old timeline, but it wouldn't actually write out a summary file,
> because that only happens when it reaches XLOG_CHECKPOINT_REDO or a
> timeline switch point. Since no WAL had appeared on the new timeline
> yet, it didn't view the end of the old timeline as a switch point, so
> it just sat there waiting, without ever writing out a file; and the
> startup process sat there waiting for it. So the second part of the
> fix is to make the WAL summarizer realize that once the startup
> process has chosen a new timeline ID, no more WAL is going to appear
> on the old timeline, and thus it can (and should) write out the final
> summary file for the old timeline and prepare to read WAL from the new
> timeline.

Great, thank you for the explanation.  Now that's clear.

--
Regards,
Alexander Korotkov
Supabase




Re: Proposal: Document ABI Compatibility

2024-07-31 Thread David E. Wheeler
On Jul 31, 2024, at 05:27, Peter Eisentraut  wrote:

> Well, nobody has protested against what we wrote, so I have committed it.

Excellent, thank you!

D





Re: make pg_createsubscriber option names more consistent

2024-07-31 Thread Euler Taveira
On Wed, Jul 31, 2024, at 4:02 AM, Peter Eisentraut wrote:
> I propose to rename the pg_createsubscriber option --socket-directory to 
> --socketdir.  This would make it match the equivalent option in 
> pg_upgrade.  (It even has the same short option '-s'.) 
> pg_createsubscriber and pg_upgrade have a lot of common terminology and 
> a similar operating mode, so it would make sense to keep this consistent.

WFM. 


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: New compiler warnings in buildfarm

2024-07-31 Thread Tom Lane
Peter Eisentraut  writes:
> On 30.07.24 18:19, Tom Lane wrote:
>> Sometime in the last month or so, flaviventris's bleeding-edge
>> version of gcc has started whining[1] about truncation of a
>> string literal's implicit trailing '\0' in contexts like this:
>> ../pgsql/src/backend/commands/copyto.c:106:41: warning:
>> initializer-string for array of 'char' is too long
>> [-Wunterminated-string-initialization]
>> 106 | static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";
>> | ^~~~

> According to the gcc documentation, this warning is part of -Wextra. 
> And indeed flaviventris runs with -Wextra:

> 'CFLAGS' => '-O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra 
> -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers 
> -O0',

Ah --- and it was not doing that a month ago.  So maybe the compiler
has had this warning for longer.  I don't see it with gcc 13.3 though,
which is the newest I have handy.

> So I think the appropriate fix here for now is to add 
> -Wno-unterminated-string-initialization to this buildfarm configuration.

Agreed; our policy so far has been to treat -Wextra warnings with
suspicion, and there is not anything really wrong with these bits
of code.

It looks like serinus needs this fix too.

regards, tom lane




Suggestions to overcome 'multixact "members" limit exceeded' in temporary tables

2024-07-31 Thread Jim Vanns
I've reached the limit of my understanding and attempts at correcting my
code/use of temporary tables in the face of multixact members and have come
to ask for your help! Here's a brief description of my software;

Pool of N connection sessions, persistent for the duration of the program
lifetime.
Upon each session initialisation, a set of CREATE TEMPORARY TABLE ON COMMIT
DELETE ROWS statements are made for bulk ingest.
Each session is acquired by a thread for use when ingesting data and
therefore each temporary table remains until the session is terminated
The thread performs a COPY  FROM STDIN in binary format
Then an INSERT INTO  SELECT FROM  WHERE...

This has been working great for a while and with excellent throughput.
However, upon scaling up I eventually hit this error;

ERROR:  multixact "members" limit exceeded
DETAIL:  This command would create a multixact with 2 members, but the
remaining space is only enough for 0 members.
HINT:  Execute a database-wide VACUUM in database with OID 16467 with
reduced vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age settings.

And it took me quite a while to identify that it appears to be coming from
the temporary table (the other 'main' tables were being autovacuumed OK) -
which makes sense because they have a long lifetime, aren't autovacuumed,
and are shared by transactions (in turn).

I first attempted to overcome this by introducing an initial step of always
creating the temporary table before the copy (and using on commit drop), but
this led to a terrible performance degradation.
Next, I reverted the above and instead I introduced a VACUUM step every
100 (configurable) ingest operations
Finally, I introduced a TRUNCATE step in addition to the occasional VACUUM
since the TRUNCATE allowed the COPY option of FREEZE.

The new overhead appears minimal until after several hours and again I've
hit a performance degradation seemingly dominated by the TRUNCATE.

My questions are;

1) Is the VACUUM necessary if I use TRUNCATE + COPY FREEZE (on the
temporary table)?
2) Is there really any benefit to using FREEZE here or is it best to just
VACUUM the temporary tables occasionally?
3) Is there a better way of managing all this!? Perhaps re-CREATING the TT
every day or something?

I understand that I can create a Linux tmpfs partition for a tablespace for
the temporary tables and that may speed up the TRUNCATE but that seems like
a hack and I'd rather not do it at all if it's avoidable.

Thanks for your help,

Jim

PS. PG version in use is 15.4 if that matters here

-- 
Jim Vanns
Principal Production Engineer
Industrial Light & Magic, London


Re: improve performance of pg_dump with many sequences

2024-07-31 Thread Nathan Bossart
Committed.

-- 
nathan




Re: why there is not VACUUM FULL CONCURRENTLY?

2024-07-31 Thread Antonin Houska
Kirill Reshke  wrote:

> Also, I was thinking about pg_repack vs pg_squeeze being used for the
> VACUUM FULL CONCURRENTLY feature, and I'm a bit suspicious about the
> latter.

> If I understand correctly, we essentially parse the whole WAL to
> obtain info about one particular relation changes. That may be a big
> overhead,

pg_squeeze is an extension but the logical decoding is performed by the core,
so there is no way to ensure that data changes of the "other tables" are not
decoded. However, it might be possible if we integrate the functionality into
the core. I'll consider doing so in the next version of [1].

> whereas the trigger approach does not suffer from this. So, there is the
> chance that VACUUM FULL CONCURRENTLY will never keep up with vacuumed
> relation changes. Am I right?

Perhaps it can happen, but note that trigger processing is also not free and
that in this case the cost is paid by the applications. So while VACUUM FULL
CONCURRENTLY (based on logical decoding) might fail to catch up, the
trigger-based solution may slow down the applications that execute DML
commands while the table is being rewritten.

[1] https://commitfest.postgresql.org/49/5117/

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Suggestions to overcome 'multixact "members" limit exceeded' in temporary tables

2024-07-31 Thread Jim Vanns
I've been able to observe that the performance degradation with TRUNCATE
appears to happen when other ancillary processes are running that are also
heavy users of temporary tables. If I used an exclusive tablespace, would
that improve things?

Cheers

Jim


On Wed, 31 Jul 2024 at 15:16, Jim Vanns  wrote:

> I've reached the limit of my understanding and attempts at correcting my
> code/use of temporary tables in the face of multixact members and have come
> to ask for your help! Here's a brief description of my software;
>
> Pool of N connection sessions, persistent for the duration of the program
> lifetime.
> Upon each session initialisation, a set of CREATE TEMPORARY TABLE ON
> COMMIT DELETE ROWS statements are made for bulk ingest.
> Each session is acquired by a thread for use when ingesting data and
> therefore each temporary table remains until the session is terminated
> The thread performs a COPY  FROM STDIN in binary format
> Then an INSERT INTO  SELECT FROM  WHERE...
>
> This has been working great for a while and with excellent throughput.
> However, upon scaling up I eventually hit this error;
>
> ERROR:  multixact "members" limit exceeded
> DETAIL:  This command would create a multixact with 2 members, but the
> remaining space is only enough for 0 members.
> HINT:  Execute a database-wide VACUUM in database with OID 16467 with
> reduced vacuum_multixact_freeze_min_age and
> vacuum_multixact_freeze_table_age settings.
>
> And it took me quite a while to identify that it appears to be coming from
> the temporary table (the other 'main' tables were being autovacuumed OK) -
> which makes sense because they have a long lifetime, aren't auto vacuumed
> and shared by transactions (in turn).
>
> I first attempted to overcome this by introducing an initial step of
> always creating the temporary table before the copy (and using on commit
> drop) but this lead to a terrible performance degradation.
> Next, I reverted the above and instead I introduced a VACUUM step every
> 100 (configurable) ingest operations
> Finally, I introduced a TRUNCATE step in addition to the occasional VACUUM
> since the TRUNCATE allowed the COPY option of FREEZE.
>
> The new overhead appears minimal until after several hours and again I've
> hit a performance degradation seemingly dominated by the TRUNCATE.
>
> My questions are;
>
> 1) Is the VACUUM necessary if I use TRUNCATE + COPY FREEZE (on the
> temporary table)?
> 2) Is there really any benefit to using FREEZE here or is it best to just
> VACUUM the temporary tables occasionally?
> 3) Is there a better way of managing all this!? Perhaps re-CREATING the TT
> every day or something?
>
> I understand that I can create a Linux tmpfs partition for a tablespace
> for the temporary tables and that may speed up the TRUNCATE but that seems
> like a hack and I'd rather not do it at all if it's avoidable.
>
> Thanks for your help,
>
> Jim
>
> PS. PG version in use is 15.4 if that matters here
>
> --
> Jim Vanns
> Principal Production Engineer
> Industrial Light & Magic, London
>


-- 
Jim Vanns
Principal Production Engineer
Industrial Light & Magic, London


Re: Recent 027_streaming_regress.pl hangs

2024-07-31 Thread Andrew Dunstan


On 2024-07-25 Th 6:33 PM, Andrew Dunstan wrote:



On 2024-07-25 Th 5:14 PM, Tom Lane wrote:

I wrote:

I'm confused by crake's buildfarm logs.  AFAICS it is not running
recovery-check at all in most of the runs; at least there is no
mention of that step, for example here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02

Oh, I see it: the log file that is called recovery-check in a
failing run is called misc-check if successful.  That seems
mighty bizarre, and it's not how my own animals behave.
Something weird about the meson code path, perhaps?



Yes, it was discussed some time ago. As suggested by Andres, we run 
the meson test suite more or less all together (in effect like "make 
check-world" but without the main regression suite, which is run on its 
own first), rather than in the separate (and serialized) way we do 
with the configure/make tests. That results in a significant speedup. If 
the tests fail we report the failure as happening at the first failure 
we encounter, which is possibly less than ideal, but I haven't got a 
better idea.




Anyway, in this successful run:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2018%3A57%3A02&stg=misc-check

here are some salient test timings:

   1/297 postgresql:pg_upgrade / pg_upgrade/001_basic          OK    0.18s   9 subtests passed
   2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots  OK   15.95s   12 subtests passed
   3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription   OK   16.29s   14 subtests passed
  17/297 postgresql:isolation / isolation/isolation            OK   71.60s   119 subtests passed
  41/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade     OK  169.13s   18 subtests passed
 140/297 postgresql:initdb / initdb/001_initdb                 OK   41.34s   52 subtests passed
 170/297 postgresql:recovery / recovery/027_stream_regress     OK  469.49s   9 subtests passed

while in the next, failing run

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2020%3A18%3A05&stg=recovery-check

the same tests took:

   1/297 postgresql:pg_upgrade / pg_upgrade/001_basic          OK    0.22s   9 subtests passed
   2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots  OK   56.62s   12 subtests passed
   3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription   OK   71.92s   14 subtests passed
  21/297 postgresql:isolation / isolation/isolation            OK  299.12s   119 subtests passed
  31/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade     OK  344.42s   18 subtests passed
 159/297 postgresql:initdb / initdb/001_initdb                 OK  344.46s   52 subtests passed
 162/297 postgresql:recovery / recovery/027_stream_regress     ERROR 840.84s  exit status 29

Based on this, it seems fairly likely that crake is simply timing out
as a consequence of intermittent heavy background activity.




The latest failure is this:


Waiting for replication conn standby_1's replay_lsn to pass 2/88E4E260 on 
primary
[16:40:29.481](208.545s) # poll_query_until timed out executing this query:
# SELECT '2/88E4E260' <= replay_lsn AND state = 'streaming'
#  FROM pg_catalog.pg_stat_replication
#  WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
timed out waiting for catchup at 
/home/andrew/bf/root/HEAD/pgsql/src/test/recovery/t/027_stream_regress.pl line 
103.


Maybe it's a case where the system is overloaded, I dunno. I wouldn't bet my 
house on it. Pretty much nothing else runs on this machine.

I have added a mild throttle to the buildfarm config so it doesn't try 
to run every branch at once. Maybe I also need to bring down the 
number of meson jobs too? But I suspect there's something deeper. 
Prior to the failure of this test 10 days ago it hadn't failed in a 
very long time. The OS was upgraded a month ago. Eight or so days ago 
I changed PG_TEST_EXTRA. I can't think of anything else.







There seem to be a bunch of recent failures, and not just on crake, and 
not just on HEAD: 




cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Recent 027_streaming_regress.pl hangs

2024-07-31 Thread Tom Lane
Andrew Dunstan  writes:
> There seem to be a bunch of recent failures, and not just on crake, and 
> not just on HEAD: 
> 

There were a batch of recovery-stage failures ending about 9 days ago
caused by instability of the new 043_vacuum_horizon_floor.pl test.
Once you take those out it doesn't look quite so bad.

regards, tom lane




Re: On disable_cost

2024-07-31 Thread Robert Haas
OK, here's a new patch version. I earlier committed the refactoring to
avoid using disable_cost to force WHERE CURRENT OF to be implemented
by a TID scan. In this version, I've dropped everything related to
reworking enable_indexscan or any other enable_* GUC. Hence, this
version of the patch set just focuses on adding the count of disabled
nodes and removing the use of disable_cost. In addition to dropping
the controversial patches, I've also found and squashed a few bugs in
this version.

Behavior: With the patch, whenever an enable_* GUC would cause
disable_cost to be added, disabled_nodes is incremented instead. There
is one remaining use of disable_cost which is not triggered by an
enable_* GUC but by the desire to avoid plans that we think will
overflow work_mem. I welcome thoughts on what to do about that case;
for now, I do nothing. As before, 0001 adds the disabled_nodes field
to paths and 0002 adds it to plans. I think we could plausibly commit
only 0001, both patches separately, or both patches squashed.

Notes:

- I favor committing both patches. Tom stated that he didn't think
that we needed to show anything related to disabled nodes, and that
could be true. However, today, you can tell which nodes are disabled
as long as you print out the costs; if we don't propagate disabled
nodes into the plan and print them out, that will no longer be
possible. I found working on the patches that it was really hard to
debug the patch set without this, so my guess is that we'll find not
having it pretty annoying, but we can also just commit 0001 for
starters and see how long it takes for the lack of 0002 to become
annoying. If the answer is "infinite time," that's cool; if it isn't,
we can reconsider committing 0002.

- If we do commit 0002, I think it's a good idea to have the number of
disabled nodes displayed even with COSTS OFF, because it's stable, and
it's pretty useful to be able to see this in the regression output. I
have found while working on this that I often need to adjust the .sql
files to say EXPLAIN (COSTS ON) instead of EXPLAIN (COSTS OFF) in
order to understand what's happening. Right now, there's no real
alternative because costs aren't stable, but disabled-node counts
should be stable, so I feel this would be a step forward. Apart from
that, I also think it's good for features to have regression test
coverage, and since we use COSTS OFF everywhere or at least nearly
everywhere in the regression test, if we don't print out the disabled
node counts when COSTS OFF is used, then we don't cover that case in
our tests. Bummer.

Regression test changes in 0001:

- btree_index.sql executes a query "select proname from pg_proc where
proname ilike 'ri%foo' order by 1" with everything but bitmap scans
disabled. Currently, that produces an index-only scan; with the patch,
it produces a sort over a sequential scan. That's a little odd,
because the test seems to be aimed at demonstrating that we can use a
bitmap scan, and it doesn't, because we apparently can't. But, why
does the patch change the plan?
At least on my machine, the index-only scan is significantly more
costly than the sequential scan. I think what's happening here is that
when you add disable_cost to the cost of both paths, they compare
fuzzily the same; without that, the cheaper one wins.

- select_parallel.out executes a query with sequential scans disabled
but tenk2 must nevertheless be sequential-scanned. With the patch,
that changes to a parallel sequential scan. I think the explanation
here is the same as in the preceding case.

- horizons.spec currently sets enable_seqscan=false,
enable_indexscan=false, and enable_bitmapscan=false. I suspect that
Andres thought that this would force the use of an index-only scan,
since nothing sets enable_indexonlyscan=false. But as discussed
upthread, that is not true. Instead everything is disabled. For the
same reasons as in the previous two examples, this caused an
assortment of plan changes which in turn caused the test to fail to
test what it was intended to test. So I removed enable_indexscan=false
from the spec file, and now it gets index-only scans everywhere again,
as desired.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v4-0002-Show-number-of-disabled-nodes-in-EXPLAIN-ANALYZE-.patch
Description: Binary data


v4-0001-Treat-number-of-disabled-nodes-in-a-path-as-a-sep.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-07-31 Thread Alexander Korotkov
On Mon, Jul 15, 2024 at 2:02 PM Alexander Korotkov  wrote:
> On Mon, Jul 15, 2024 at 4:24 AM Alexander Korotkov  
> wrote:
> > Thanks to Kyotaro for the review.  And thanks to Ivan for the patch
> > revision.  I made another revision of the patch.
>
> I've noticed failures on cfbot.  The attached revision addressed docs
> build failure.  Also it adds some "quits" for background psql sessions
> for tests.  Probably this will address test hangs on windows.

I made the following changes to the patch.

1) I've changed the check for snapshot in pg_wal_replay_wait().  Now
it checks that GetOldestSnapshot() returns NULL.  It happens when both
ActiveSnapshot is NULL and RegisteredSnapshots pairing heap is empty.
This is the same condition when SnapshotResetXmin() sets out xmin to
invalid.  Thus, we are not preventing WAL from replay.  This should be
satisfied when pg_wal_replay_wait() isn't called within a transaction
with an isolation level higher than READ COMMITTED, another procedure,
or a function.  Documented it this way.

2) Explicitly document in the PortalRunUtility() comment that
pg_wal_replay_wait() is another case when active snapshot gets
released.

3) I've removed tests with cascading replication.  It's rather unclear
what problem these tests could potentially spot.

4) Did some improvements to docs, comments and commit message to make
them consistent with the patch contents.

The commit to pg17 was hasty.  But I feel this patch is now in much
better shape.  In particular, it's now clear when the
pg_wal_replay_wait() procedure can be used.  So, I dare to commit this to
pg18 if nobody objects.

--
Regards,
Alexander Korotkov
Supabase
From f6713100b6231f2d124a98021d9fbc353caad0db Mon Sep 17 00:00:00 2001
From: Alexander Korotkov 
Date: Wed, 31 Jul 2024 19:13:55 +0300
Subject: [PATCH v23] Implement pg_wal_replay_wait() stored procedure

pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This option is useful when
the user makes some data changes on primary and needs a guarantee to see
these changes are on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
the snapshot could prevent the replay of WAL records, implying a kind of
self-deadlock.  This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working without an active snapshot,
not a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
 doc/src/sgml/func.sgml                        | 117 ++
 src/backend/access/transam/xact.c |   6 +
 src/backend/access/transam/xlog.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/catalog/system_functions.sql  |   3 +
 src/backend/commands/Makefile |   3 +-
 src/backend/commands/meson.build  |   1 +
 src/backend/commands/waitlsn.c                | 363 ++
 src/backend/lib/pairingheap.c |  18 +-
 src/backend/storage/ipc/ipci.c|   3 +
 src/backend/storage/lmgr/proc.c   |   6 +
 src/backend/tcop/pquery.c |   9 +-
 .../utils/activity/wait_event_names.txt   |   2 +
 src/include/catalog/pg_proc.dat   |   6 +
 src/include/commands/waitlsn.h|  80 
 src/include/lib/pairingheap.h |   3 +
 src/include/storage/lwlocklist.h  |   1 +
 src/test/recovery/meson.build |   1 +
 src/test/recovery/t/043_wal_replay_wait.pl| 150 
 src/tools/pgindent/typedefs.list  |   2 +
 20 files changed, 785 insertions(+), 7 deletions(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wal_replay_wait.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b39f97dc8de..3cf896b22fa 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28911,6 +28911,123 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
 the pause, the rate of WAL generation and available disk space.

 
+   
+The procedure shown in 
+can be executed only during recovery.
+   
+
+   
+Recovery Synchronization Procedure
+
+ 
+  
+   
+Procedure
+   
+   
+Description
+   
+  
+ 
+
+ 
+

Re: Recent 027_streaming_regress.pl hangs

2024-07-31 Thread Andrew Dunstan



On 2024-07-31 We 12:05 PM, Tom Lane wrote:

Andrew Dunstan  writes:

There seem to be a bunch of recent failures, and not just on crake, and
not just on HEAD:


There were a batch of recovery-stage failures ending about 9 days ago
caused by instability of the new 043_vacuum_horizon_floor.pl test.
Once you take those out it doesn't look quite so bad.




We'll see. I have switched crake from --run-parallel mode to --run-all 
mode, i.e. the runs are serialized. Maybe that will be enough to stop the 
errors. I'm still annoyed that this test is susceptible to load, if that 
is indeed the issue.



cheers


andrew


--

Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: can we mark upper/lower/textlike functions leakproof?

2024-07-31 Thread Robert Haas
On Wed, Jul 31, 2024 at 9:14 AM Joe Conway  wrote:
> In my opinion, for this use case and others, it should be possible to
> redact the values substituted into log messages based on some criteria.
> One of those criteria could be "I am in a leakproof call right now". In
> fact in a similar fashion, an extension ought to be able to mutate the
> log message based on the entire string, e.g. when "ALTER
> ROLE...PASSWORD..." is spotted I would like to be able to redact
> everything after "PASSWORD".

This might be helpful, and unfortunately I'm all too familiar with the
ALTER ROLE...PASSWORD example, but I think it's not really related to
the question of whether we can mark upper() and lower() leakproof.

If there are some inputs that cause upper() and lower() to fail and
others that do not, the functions aren't leakproof, because an
attacker can extract information about values that they can't see by
feeding those values into these functions and seeing whether they get
a failure or not. It doesn't matter what error message is produced;
the fact that the function throws an error of any kind for some input
values but not for others is enough to make it unsafe, and it seems
to me that we've repeatedly found that there's often a way to turn
even what seems like a small leak into a very large one, so we need to
be quite careful here.

Annoyingly, we have a WHOLE BUNCH of different code paths that need to
be assessed individually here. I think we can ignore the fact that
upper() can throw errors when it's unclear which collation to use; I
think that's a property of the query string rather than the input
value. It's a bit less clear whether we can ignore out of memory
conditions, but my judgement is that a possible OOM from a small
allocation is not generally going to be useful as an attack vector. If
a long string produces an error and a short one doesn't, that might
qualify as a real leak. And other errors that depend on the input
value are also leaks. So let's go through the code paths:

- When lc_ctype_is_c(), we call asc_toupper(), which calls
pg_ascii_toupper(), which just replaces a-z with A-Z. No errors.

- When mylocale->provider == COLLPROVIDER_ICU, we call icu_to_uchar()
and icu_convert_case(). icu_to_uchar() calls uchar_length(), which has
this:

if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
    ereport(ERROR,
            (errmsg("%s failed: %s",
                    "ucnv_toUChars", u_errorName(status))));

I don't know what errors can be reported here, or if this ever fails
in practice, so I can't prove it never fails consistently for some
particular input string. uchar_convert() and icu_convert_case() have
similar error reporting.

- When mylocale->provider == COLLPROVIDER_BUILTIN, we call
unicode_strupper() which calls unicode_to_utf8() which clearly throws
no errors. Likewise unicode_utf8len() does not error. I don't see how
we can get an error out of this path.

- In cases not covered by the above, we take different paths depending
on whether a multi-byte encoding is in use. In single-byte encodings,
we rely on either pg_toupper() or toupper_l(). The latter is an
OS-provided function so can't ereport(), and the former either does
the work itself or calls toupper(), which again is an OS-provided
function and can't ereport().

- Finally, when mylocale == NULL or the provider is COLLPROVIDER_LIBC
and the encoding is multi-byte, we use char2wchar(), then towupper_l()
or towupper(), then wchar2char(). The whole thing can fall apart if
the string is too long, which might be enough to declare this
leakproof but it depends on whether the error guarded by /* Overflow
paranoia */ is really just paranoia or whether it's actually
reachable. Otherwise, towupper*() won't ereport because it's not part
of PG, so we need to assess char2wchar() and wchar2char(). Here I note
that char2wchar() claims that it does an ereport() if the input is
invalidly encoded. This is kind of surprising when you think about it,
because our usual policy is not to allow invalidly encoded data into
the database in the first place, but whoever wrote this thought it
would be possible to hit this case "if LC_CTYPE locale is different
from the database encoding." But it seems that the logic here dates to
commit 2ab0796d7a3a7116a79b65531fd33f1548514b52 back in 2011, so it
seems at least possible to me that we've tightened things up enough
since then that you can't actually hit this any more. But then again,
maybe not.

So in summary, I think upper() is ... pretty close to leakproof. But
if ICU sometimes fails on certain strings, then it isn't. And if the
multi-byte libc path can be made to fail reliably either with really
long strings or with certain choices of the LC_CTYPE locale, then it
isn't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: can we mark upper/lower/textlike functions leakproof?

2024-07-31 Thread Tom Lane
Robert Haas  writes:
> If there are some inputs that cause upper() and lower() to fail and
> others that do not, the functions aren't leakproof, because an
> attacker can extract information about values that they can't see by
> feeding those values into these functions and seeing whether they get
> a failure or not.

> [ rather exhaustive analysis redacted ]

> So in summary, I think upper() is ... pretty close to leakproof. But
> if ICU sometimes fails on certain strings, then it isn't. And if the
> multi-byte libc path can be made to fail reliably either with really
> long strings or with certain choices of the LC_CTYPE locale, then it
> isn't.

The problem here is that marking these functions leakproof is a
promise about a *whole bunch* of code, much of it not under our
control; worse, there's no reason to think all that code is stable.
A large fraction of it didn't even exist a few versions ago.

Even if we could convince ourselves that the possible issues Robert
mentions aren't real at the moment, I think marking these leakproof
is mighty risky.  It's unlikely we'd remember to revisit the marking
the next time someone drops a bunch of new code in here.

regards, tom lane




Re: New compiler warnings in buildfarm

2024-07-31 Thread Andres Freund
Hi,

On 2024-07-31 10:11:07 -0400, Tom Lane wrote:
> Peter Eisentraut  writes:
> > On 30.07.24 18:19, Tom Lane wrote:
> >> Sometime in the last month or so, flaviventris's bleeding-edge
> >> version of gcc has started whining[1] about truncation of a
> >> string literal's implicit trailing '\0' in contexts like this:
> >> ../pgsql/src/backend/commands/copyto.c:106:41: warning:
> >> initializer-string for array of 'char' is too long
> >> [-Wunterminated-string-initialization]
> >> 106 | static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";
> >> | ^~~~
>
> > According to the gcc documentation, this warning is part of -Wextra.
> > And indeed flaviventris runs with -Wextra:
>
> > 'CFLAGS' => '-O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra
> > -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers
> > -O0',
>
> Ah --- and it was not doing that a month ago.

Hm? I've not touched flaviventris config since at least the 26th of March. And
a buildfarm run from before then also shows -Wextra:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-03-17%2011%3A17%3A01


> So maybe the compiler has had this warning for longer.

It's very new:

commit 44c9403ed1833ae71a59e84f9e37af3182be0df5
Author: Alejandro Colomar 
AuthorDate: 2024-06-29 15:10:43 +0200
Commit: Martin Uecker 
CommitDate: 2024-07-14 11:41:00 +0200

c, objc: Add -Wunterminated-string-initialization


It might be worth piping up in the gcc bugtracker and suggesting that the
warning isn't issued when there's an explicit \0?


> > So I think the appropriate fix here for now is to add
> > -Wno-unterminated-string-initialization to this buildfarm configuration.
>
> Agreed; our policy so far has been to treat -Wextra warnings with
> suspicion, and there is not anything really wrong with these bits
> of code.
>
> It looks like serinus needs this fix too.

Added to both.  I've forced runs for both animals, so the bf should show
results of that soon.

Greetings,

Andres Freund




Re: New compiler warnings in buildfarm

2024-07-31 Thread Andres Freund
Hi,

On 2024-07-31 11:32:56 -0700, Andres Freund wrote:
> On 2024-07-31 10:11:07 -0400, Tom Lane wrote:
> > It looks like serinus needs this fix too.
>
> Added to both.  I've forced runs for both animals, so the bf should show
> results of that soon.

I wonder if I should also add -Wno-clobbered to serinus' config. Afaict
-Wclobbered is pretty useless once optimizations are used. I've long added
that to my local dev environment flags because it's so noisy (which is too
bad, in theory a good warning for this would be quite helpful).

Or whether we should just do that on a project level?

Greetings,

Andres Freund




Re: can we mark upper/lower/textlike functions leakproof?

2024-07-31 Thread Joe Conway

On 7/31/24 14:03, Tom Lane wrote:

> Robert Haas  writes:
>> If there are some inputs that cause upper() and lower() to fail and
>> others that do not, the functions aren't leakproof, because an
>> attacker can extract information about values that they can't see by
>> feeding those values into these functions and seeing whether they get
>> a failure or not.
>
>> [ rather exhaustive analysis redacted ]
>
>> So in summary, I think upper() is ... pretty close to leakproof. But
>> if ICU sometimes fails on certain strings, then it isn't. And if the
>> multi-byte libc path can be made to fail reliably either with really
>> long strings or with certain choices of the LC_CTYPE locale, then it
>> isn't.
>
> The problem here is that marking these functions leakproof is a
> promise about a *whole bunch* of code, much of it not under our
> control; worse, there's no reason to think all that code is stable.
> A large fraction of it didn't even exist a few versions ago.
>
> Even if we could convince ourselves that the possible issues Robert
> mentions aren't real at the moment, I think marking these leakproof
> is mighty risky.  It's unlikely we'd remember to revisit the marking
> the next time someone drops a bunch of new code in here.

I still maintain that there is a whole host of users that would accept 
the risk of side channel attacks via existence of an error or not, if 
they could only be sure nothing sensitive leaks directly into the logs 
or to the clients. We should give them that choice.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Remove last traces of HPPA support

2024-07-31 Thread Andres Freund
Hi,

On 2024-07-31 17:52:34 +1200, Thomas Munro wrote:
> 2.  The pg_atomic_unlocked_test_flag() function was surprising to me:
> it returns true if it's not currently set (according to a relaxed
> load).  Most of this patch was easy, but figuring out that I had
> reverse polarity here was a multi-coffee operation :-)  I can't call
> it wrong though, as it's not based on <stdatomic.h>, and it's clearly
> documented, so *shrug*.

I have no idea why I did it that way round. This was a long time ago...


> 4.  Another reason would be that you need it to implement
> SpinLockFree() and S_LOCK_FREE().  They don't seem to have had any
> real callers since the beginning of open source PostgreSQL!, except
> for a test of limited value in a new world without ports developing
> their own spinlock code.  Let's remove them!  I see this was already
> threatened by Andres in 3b37a6de.

Note that I would like to add a user for S_LOCK_FREE(), to detect repeated
SpinLockRelease():
https://postgr.es/m/20240729182952.hua325647e2ggbsy%40awork3.anarazel.de

Greetings,

Andres Freund




Re: Remove last traces of HPPA support

2024-07-31 Thread Andres Freund
Hi,

On 2024-07-30 23:08:36 +1200, Thomas Munro wrote:
> On Tue, Jul 30, 2024 at 12:39 PM Thomas Munro  wrote:
> > On Tue, Jul 30, 2024 at 11:16 AM Heikki Linnakangas  wrote:
> > > Looks good to me.
> >
> > Thanks.  I'll wait just a bit longer to see if anyone else has comments.
> 
> And pushed.

Yay!


> I am aware of a couple of build farm animals that will now fail
> because they deliberately test --disable-spinlocks: francolin and
> rorqual, which will need adjustment or retirement on master.  I'll
> watch out for other surprises on the farm...

I've now adjusted rorqual, francolin, piculet to not run on master anymore -
they're just there to test combinations of --disable-atomics and
--disable-spinlocks, so there seems not much point in just disabling those
options for HEAD.

Greetings,

Andres Freund




Re: Use pgBufferUsage for block reporting in analyze

2024-07-31 Thread Masahiko Sawada
On Wed, Jul 31, 2024 at 12:03 AM Anthonin Bonnefoy
 wrote:
>
> On Tue, Jul 30, 2024 at 9:21 AM Anthonin Bonnefoy
>  wrote:
> > A possible change would be to pass an inh flag when an acquirefunc is
> > called from acquire_inherited_sample_rows. The acquirefunc could then
> > use an alternative log format to append to logbuf. This way, we could
> > have a more compact format for partitioned tables.
>
> I've just tested this, the result isn't great as it creates an
> inconsistent output
>
> INFO:  analyze of table "postgres.public.test_partition"
> "test_partition_1": scanned 5 of 5 pages, containing 999 live tuples
> and 0 dead tuples; 999 rows in sample, 999 estimated total rows
> "test_partition_2": scanned 5 of 5 pages, containing 1000 live tuples
> and 0 dead tuples; 1000 rows in sample, 1000 estimated total rows
> avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
> ...
> INFO:  analyze of table "postgres.public.test_partition_1"
> pages: 5 of 5 scanned
> tuples: 999 live tuples, 0 are dead; 999 tuples in sample, 999
> estimated total tuples
> avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
>
> Maybe the best approach is to always use the compact form?
>
> INFO:  analyze of table "postgres.public.test_partition"
> "test_partition_1": scanned 5 of 5 pages, containing 999 live tuples
> and 0 dead tuples; 999 tuples in sample, 999 estimated total tuples
> "test_partition_2": scanned 5 of 5 pages, containing 1000 live tuples
> and 0 dead tuples; 1000 tuples in sample, 1000 estimated total tuples
> avg read rate: 1.953 MB/s, avg write rate: 0.000 MB/s
> ...
> INFO:  analyze of table "postgres.public.test_partition_1"
> "test_partition_1": scanned 5 of 5 pages, containing 999 live tuples
> and 0 dead tuples; 999 tuples in sample, 999 estimated total tuples
> avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
>
> I've updated the patchset with those changes. 0004 introduces the
> StringInfo logbuf so we can output logs as a single log and during
> ANALYZE VERBOSE while using the compact form.
>

Fair point. I'll consider a better output format.

Meanwhile, I think we can push 0001 and 0002 patches since they are in
good shape. I've updated their commit messages and slightly changed the
0002 patch to write "finished analyzing of table \"%s.%s.%s\"" instead
of "analyze of table \"%s.%s.%s\"".

Also, regarding 0003 patch, what is the main reason why we want to add
WAL usage to analyze reports? I think that analyze normally does not
write WAL records much so I'm not sure it's going to provide a good
insight for users.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v7-0001-Use-pgBufferUsage-for-buffer-usage-tracking-in-an.patch
Description: Binary data


v7-0002-Add-resource-statistics-reporting-to-ANALYZE-VERB.patch
Description: Binary data


Re: [17+] check after second call to pg_strnxfrm is too strict, relax it

2024-07-31 Thread Jeff Davis
On Tue, 2024-07-30 at 23:01 +0300, Heikki Linnakangas wrote:
> +1. A comment in the pg_strnxfrm() function itself would be good too.

Committed, thank you. Backported to 16 (original email said 17, but
that was a mistake).

Regards,
Jeff Davis






Re: Remove last traces of HPPA support

2024-07-31 Thread Andres Freund
Hi,

On 2024-07-31 22:32:19 +1200, Thomas Munro wrote:
> > That old comment means that both SpinLockAcquire() and SpinLockRelease()
> > acted as full memory barriers, and looking at the implementations, that
> > was indeed so. With the new implementation, SpinLockAcquire() will have
> > "acquire semantics" and SpinLockRelease will have "release semantics".
> > That's very sensible, and I don't believe it will break anything, but
> > it's a change in semantics nevertheless.
> 
> Yeah.  It's interesting that our pg_atomic_clear_flag(f) is like
> standard atomic_flag_clear_explicit(f, memory_order_release), not like
> atomic_flag_clear(f) which is short for atomic_flag_clear_explicit(f,
> memory_order_seq_cst).  Example spinlock code I've seen written in
> modern C or C++ therefore uses the _explicit variants, so it can get
> acquire/release, which is what people usually want from a lock-like
> thing.  What's a good way to test the performance in PostgreSQL?

I've used
  c=8;pgbench -n -Mprepared -c$c -j$c -P1 -T10 -f <(echo "SELECT pg_logical_emit_message(false, \:client_id::text, '1'), generate_series(1, 1000) OFFSET 1000;")
in the past. Because of NUM_XLOGINSERT_LOCKS = 8 this ends up with 8 backends
doing tiny xlog insertions and heavily contending on insertpos_lck.

The generate_series() is necessary as otherwise the context switch and
executor startup overhead dominates.


> In a naive loop that just test-and-sets and clears a flag a billion times in
> a loop and does nothing else, I see 20-40% performance increase depending on
> architecture when comparing _seq_cst with _acquire/_release.

I'd expect the difference to be even bigger on concurrent workloads on x86-64
- the added memory barrier during lock release really hurts. I have a test
program to play around with this and the difference in isolation is like 0.4x
the throughput with a full barrier release on my older 2 socket workstation
[1]. Of course it's not trivial to hit "pure enough" cases in the real world.


On said workstation [1], with the above pgbench, I get ~1.95M inserts/sec
(1959 TPS * 1000) on HEAD and 1.80M insert/sec after adding
#define S_UNLOCK(lock) __atomic_store_n(lock, 0, __ATOMIC_SEQ_CST)


If I change NUM_XLOGINSERT_LOCKS = 40 and use 40 clients, I get
1.03M inserts/sec with the current code and 0.86M inserts/sec with
__ATOMIC_SEQ_CST.

Greetings,

Andres Freund

[1] 2x Xeon Gold 5215




Re: can we mark upper/lower/textlike functions leakproof?

2024-07-31 Thread Andrew Dunstan



On 2024-07-31 We 2:43 PM, Joe Conway wrote:

> On 7/31/24 14:03, Tom Lane wrote:
>> Robert Haas  writes:
>>> If there are some inputs that cause upper() and lower() to fail and
>>> others that do not, the functions aren't leakproof, because an
>>> attacker can extract information about values that they can't see by
>>> feeding those values into these functions and seeing whether they get
>>> a failure or not.
>>
>>> [ rather exhaustive analysis redacted ]

First, thank you very much, Robert, for the analysis.

>>> So in summary, I think upper() is ... pretty close to leakproof. But
>>> if ICU sometimes fails on certain strings, then it isn't. And if the
>>> multi-byte libc path can be made to fail reliably either with really
>>> long strings or with certain choices of the LC_CTYPE locale, then it
>>> isn't.
>>
>> The problem here is that marking these functions leakproof is a
>> promise about a *whole bunch* of code, much of it not under our
>> control; worse, there's no reason to think all that code is stable.
>> A large fraction of it didn't even exist a few versions ago.
>>
>> Even if we could convince ourselves that the possible issues Robert
>> mentions aren't real at the moment, I think marking these leakproof
>> is mighty risky.  It's unlikely we'd remember to revisit the marking
>> the next time someone drops a bunch of new code in here.
>
> I still maintain that there is a whole host of users that would accept
> the risk of side channel attacks via existence of an error or not, if
> they could only be sure nothing sensitive leaks directly into the logs
> or to the clients. We should give them that choice.




As I meant to say in my previous empty reply, I think your suggestions 
make lots of sense.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: pg_verifybackup: TAR format backup verification

2024-07-31 Thread Robert Haas
On Wed, Jul 31, 2024 at 9:28 AM Amul Sul  wrote:
> Fixed -- I did that because it was part of a separate group in pg_basebackup.

Well, that's because pg_basebackup builds multiple executables, and
these files needed to be linked with some but not others. It looks
like when Andres added meson support, instead of linking each object
file into the binaries that need it, he had it just build a static
library and link every executable to that. That puts the linker in
charge of sorting out which binaries need which files, instead of
having the makefile do it. In any case, this consideration doesn't
apply when we're putting the object files into a library, so there was
no need to preserve the separate makefile variable. I think this looks
good now.

> Fixed -- frontend_common_code now includes lz4 as well.

Cool. 0003 overall looks good to me now, unless Andres wants to object.

> Noted. I might give it a try another day, unless someone else beats
> me, perhaps in a separate thread.

Probably not too important, since nobody has complained.

> Done -- added a new patch as 0004, and the subsequent patch numbers
> have been incremented accordingly.

I think I would have made this pass context->show_progress to
progress_report() instead of the whole verifier_context, but that's an
arguable stylistic choice, so I'll defer to you if you prefer it the
way you have it. Other than that, this LGTM.

However, what is now 0005 does something rather evil. The commit
message claims that it's just rearranging code, and that's almost
entirely true, except that it also changes manifest_file's pathname
member to be char * instead of const char *. I do not think that is a
good idea, and I definitely do not think you should do it in a patch
that purports to just be doing code movement, and I even more
definitely think that you should not do it without even mentioning
that you did it, and why you did it.

> Fixed -- I did the NULL check in the earlier 0007 patch, but it should
> have been done in this patch.

This is now 0006. struct stat's st_size is of type off_t -- or maybe
ssize_t on some platforms? -- not type size_t. I suggest making the
filesize argument use int64 as we do in some other places. size_t is,
I believe, defined to be the right width to hold the size of an object
in memory, not the size of a file on disk, so it isn't really relevant
here.

Other than that, my only comment on this patch is that I think I would
find it more natural to write the check in verify_backup_file() in a
different order: I'd put context->manifest->version != 1 && m != NULL
&& m->matched && !m->bad && strcmp() because (a) that way the most
expensive test is last and (b) it feels weird to think about whether
we have the right pathname if we don't even have a valid manifest
entry. But this is minor and just a stylistic preference, so it's also
OK as you have it if you prefer.

> I agree, changing the order of errors could create confusion.
> Previously, a file size mismatch was a clear and appropriate error
> that was reported before the checksum failure error.

In my opinion, this patch (currently 0007) creates a rather confusing
situation that I can't easily reason about. Post-patch,
verify_content_checksum() is a mix of two different things: it ends up
containing all of the logic that needs to be performed on every chunk
of bytes read from the file plus some but not all of the end-of-file
error-checks from verify_file_checksum(). That's really weird. I'm not
very convinced that the test for whether we've reached the end of the
file is 100% correct, but even if it is, the stuff before that point
is stuff that is supposed to happen many times and the stuff after
that is only supposed to happen once, and I don't see any good reason
to smush those two categories of things into a single function. Plus,
changing the order in which those end-of-file checks happen doesn't
seem like the right idea either: the current ordering is good the way
it is. Maybe you want to think of refactoring to create TWO new
functions, one to do the per-hunk work and a second to do the
end-of-file "is the checksum OK?" stuff, or maybe you can just open
code it, but I'm not willing to commit this the way it is.

Regarding 0008, I don't really see a reason why the m != NULL
shouldn't also move inside should_verify_control_data(). Yeah, the
caller added in 0010 might not need the check, but it won't really
cost anything. Also, it seems to me that the logic in 0010 is actually
wrong. If m == NULL, we'll keep the values of verifyChecksums and
verifyControlData from the previous iteration, whereas what we should
do is make them both false. How about removing the if m == NULL guard
here and making both should_verify_checksum() and
should_verify_control_data() test m != NULL internally? Then it all
works out nicely, I think. Or alternatively you need an else clause
that resets both values to false when m == NULL.

> Okay, I added the verify_checksums() and verify_controldata()
> 

Re: can we mark upper/lower/textlike functions leakproof?

2024-07-31 Thread Robert Haas
On Wed, Jul 31, 2024 at 2:43 PM Joe Conway  wrote:
> I still maintain that there is a whole host of users that would accept
> the risk of side channel attacks via existence of an error or not, if
> they could only be sure nothing sensitive leaks directly into the logs
> or to the clients. We should give them that choice.

I'm not sure what design you have in mind. A lot of possible designs
seem to end up like this:

1. You can't directly select the invisible value.

2. But you can write a plpgsql procedure that tries a bunch of things
in a loop and catches errors and uses which things error and which
things don't to figure out and return the invisible value.

And I would argue that's not really that useful. Especially if that
plpgsql procedure can extract the hidden values in like 1ms/row.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: can we mark upper/lower/textlike functions leakproof?

2024-07-31 Thread Robert Haas
On Wed, Jul 31, 2024 at 2:03 PM Tom Lane  wrote:
> The problem here is that marking these functions leakproof is a
> promise about a *whole bunch* of code, much of it not under our
> control; worse, there's no reason to think all that code is stable.
> A large fraction of it didn't even exist a few versions ago.

I think it's a fair critique. It could be reasonable to say, well, a
particular user could reasonably judge that the risk of marking
upper() as leak-proof is acceptable, but it's a little too risky for
us to want to do it as a project. After all, they can know things like
"we don't even use ICU," which we as a project cannot know.

However, the risk is that an end-user is going to be much less able to
evaluate what is and isn't safe than we are. I think some people are
going to be like -- well the core project doesn't mark enough stuff
leakproof, so I'll just go add markings to a bunch of stuff myself.
And they probably won't stop at stuff like UPPER which is almost
leakproof. They might add it to stuff such as LIKE which results in
immediately giving away the farm. By not giving people any guidance,
we invite them to make up their own rules.

Perhaps it's still right to be conservative; after all, no matter what
an end-user does, it's not a CVE for us, and CVEs suck. On the other
hand, shipping a system that is not useful in practice until you
modify a bunch of stuff also sucks, especially when it's not at all
clear which things are safe to modify.

I'm not sure what the right thing to do here is, but I think that it's
wrong to imagine that being unwilling to endorse probably-leakproof
things as leakproof -- or unwilling to put in the work to MAKE them
leakproof if they currently aren't -- has no security costs.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-31 Thread Peter Geoghegan
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
 wrote:
> In back branches starting with 14, failing to remove tuples older than
> OldestXmin during pruning caused vacuum to infinitely loop in
> lazy_scan_prune(), as investigated on this [1] thread.

Shouldn't somebody remove the entry that we have for this issue under
https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items#Older_bugs_affecting_stable_branches?

-- 
Peter Geoghegan




Re: can we mark upper/lower/textlike functions leakproof?

2024-07-31 Thread Joe Conway

On 7/31/24 16:10, Robert Haas wrote:

> On Wed, Jul 31, 2024 at 2:43 PM Joe Conway  wrote:
>> I still maintain that there is a whole host of users that would accept
>> the risk of side channel attacks via existence of an error or not, if
>> they could only be sure nothing sensitive leaks directly into the logs
>> or to the clients. We should give them that choice.
>
> I'm not sure what design you have in mind. A lot of possible designs
> seem to end up like this:
>
> 1. You can't directly select the invisible value.
>
> 2. But you can write a plpgsql procedure that tries a bunch of things
> in a loop and catches errors and uses which things error and which
> things don't to figure out and return the invisible value.
>
> And I would argue that's not really that useful. Especially if that
> plpgsql procedure can extract the hidden values in like 1ms/row.



You are assuming that everyone allows direct logins with the ability to 
create procedures. Plenty don't.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-07-31 Thread Andres Freund
Hi,

On 2024-07-30 22:12:18 -0500, Nathan Bossart wrote:
> On Tue, Jul 30, 2024 at 10:01:50PM -0500, Nathan Bossart wrote:
> > On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote:
> >> My point is that _xgetbv() is made available by -mavx512vpopcntdq 
> >> -mavx512bw
> >> alone, without needing -mxsave:
> >
> > Oh, I see.  I'll work on a patch to remove that compiler check, then...
>
> As I started on this, I remembered why I needed it.  The file
> pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in order
> to avoid inadvertently issuing any AVX-512 instructions before determining
> we have support.  If that's not a concern, we could still probably remove
> the XSAVE check.

I think it's a valid concern - but isn't that theoretically also an issue with
xsave itself? I guess practically the compiler won't do that, because there's
no practical reason to emit any instructions enabled by -mxsave (in contrast
to e.g. -mavx, which does trigger gcc to emit different instructions even for
basic math).

I think this is one of the few instances where msvc has the right approach -
if I use intrinsics to emit a specific instruction, the intrinsic should do
so, regardless of whether the compiler is allowed to do so on its own.


I think enabling options like these on a per-translation-unit basis isn't
really a scalable approach. To actually be safe there could only be a single
function in each TU and that function could only be called after a cpuid check
performed in a separate TU. That a) ends up pretty unreadable b) requires
functions to be implemented in .c files, which we really don't want for some
of this.

I think we'd be better off enabling architectural features on a per-function
basis, roughly like this:
https://godbolt.org/z/a4q9Gc6Ez


For posterity, in the unlikely case anybody reads this after godbolt shuts
down:

I'm thinking we'd have an attribute like this:

/*
 * GCC-like compilers don't support intrinsics without those intrinsics
 * explicitly having been enabled. We can't just add these options more
 * widely, as that allows the compiler to emit such instructions more widely,
 * even if we gate reaching the code using intrinsics. So we just enable the
 * relevant support for individual functions.
 *
 * In contrast to this, msvc allows use of intrinsics independent of what the
 * compiler otherwise is allowed to emit.
 */
#ifdef __GNUC__
#define pg_enable_target(foo) __attribute__ ((__target__ (foo)))
#else
#define pg_enable_target(foo)
#endif

and then use that selectively for some functions:

/* FIXME: Should be gated by configure check of -mavx512vpopcntdq -mavx512bw support */
pg_enable_target("avx512vpopcntdq,avx512bw")
uint64_t
pg_popcount_avx512(const char *buf, int bytes)
...

Greetings,

Andres Freund




Re: pg_verifybackup: TAR format backup verification

2024-07-31 Thread Andres Freund
Hi,

On 2024-07-31 16:07:03 -0400, Robert Haas wrote:
> On Wed, Jul 31, 2024 at 9:28 AM Amul Sul  wrote:
> > Fixed -- I did that because it was part of a separate group in 
> > pg_basebackup.
> 
> Well, that's because pg_basebackup builds multiple executables, and
> these files needed to be linked with some but not others. It looks
> like when Andres added meson support, instead of linking each object
> file into the binaries that need it, he had it just build a static
> library and link every executable to that. That puts the linker in
> charge of sorting out which binaries need which files, instead of
> having the makefile do it.

Right. Meson supports using the same file with different compilation flags,
depending on the context it's used in (i.e. as part of an executable or a shared
library). But that also ends up compiling files multiple times when using the
same file in multiple binaries. Which wasn't desirable here -> hence moving it
to a static lib.


> > Fixed -- frontend_common_code now includes lz4 as well.
> 
> Cool. 0003 overall looks good to me now, unless Andres wants to object.

Nope.

Greetings,

Andres Freund




Re: [BUG?] check_exclusion_or_unique_constraint false negative

2024-07-31 Thread Michail Nikolaev
It seems like I've identified the cause of the issue.

Currently, any DirtySnapshot (or SnapshotSelf) scan over a B-tree index may
skip (not find the TID for) some records in the case of parallel updates.

The following scenario is possible:

* Session 1 reads a B-tree page using SnapshotDirty and copies item X to
the buffer.
* Session 2 updates item X, inserting a new TID Y into the same page.
* Session 2 commits its transaction.
* Session 1 starts to fetch from the heap and tries to fetch X, but it was
already deleted by session 2. So, it goes to the B-tree for the next TID.
* The B-tree goes to the next page, skipping Y.
* Therefore, the search finds nothing, but tuple Y is still alive.

This situation is somewhat controversial. One might expect DirtySnapshot to
show more (or more recent, even uncommitted) data than MVCC, but never less.
Yet a DirtySnapshot scan over a B-tree does not actually provide that
guarantee, as far as I understand.
Why does it work for MVCC? Because tuple X will be visible due to the
snapshot, making Y unnecessary.
This might be "as designed," but I think it needs to be clearly documented
(I couldn't find any documentation on this particular case, only
_bt_drop_lock_and_maybe_pin - related).

Here are the potential consequences of the issue:

* check_exclusion_or_unique_constraint

It may not find a record in a UNIQUE index during INSERT ON CONFLICT
UPDATE. However, this is just a minor performance issue.

* Exclusion constraints with B-tree, like ADD CONSTRAINT exclusion_data
EXCLUDE USING btree (data WITH =)

It should work correctly because the first inserter may "skip" the TID from
a concurrent inserter, but the second one should still find the TID from
the first.

* RelationFindReplTupleByIndex

Amit, this is why I've included you in this previously solo thread :)
RelationFindReplTupleByIndex uses DirtySnapshot and may not find some
records if they are updated by a parallel transaction. This could lead to
lost deletes/updates, especially in the case of streaming=parallel mode.
I'm not familiar with how parallel workers apply transactions, so maybe
this isn't possible.

Best regards,
Mikhail



Re: Changing default -march landscape

2024-07-31 Thread Andres Freund
Hi,

On 2024-07-30 21:59:44 -0500, Nathan Bossart wrote:
> On Tue, Jul 30, 2024 at 07:39:18PM -0700, Andres Freund wrote:
> > We can hide most of the dispatch cost in a static inline function that only
> > does the runtime test if size is large enough - the size is a compile time
> > constant most of the time, which optimizes away the dispatch cost most of 
> > the
> > time.  And even if not, an inlined runtime branch is a *lot* cheaper than an
> > indirect function call.
> 
> I ended up doing precisely this for pg_popcount()/pg_popcount_masked(),
> although not quite as sophisticated as what you propose below.  I'll look
> into expanding on this strategy in v18.

I think you subsume that under "not quite as sophisticated", but just to make
clear:  The most important bits are to

a) do the dispatch in a header, without an indirect function call

b) implement the intrinsic-using stuff in a header if it takes a size argument
   or such, because that allows the compiler to optimize away size checks in
   the very common case of such an argument being a compile time constant.

Greetings,

Andres Freund




Re: can we mark upper/lower/textlike functions leakproof?

2024-07-31 Thread Tom Lane
Robert Haas  writes:
> I'm not sure what the right thing to do here is, but I think that it's
> wrong to imagine that being unwilling to endorse probably-leakproof
> things as leakproof -- or unwilling to put in the work to MAKE them
> leakproof if they currently aren't -- has no security costs.

Well, we *have* been a little bit spongy about that --- notably,
that texteq and friends are marked leakproof.  But IMV, marking
upper/lower as leakproof is substantially riskier and offers
substantially less benefit than those did.

In general, I'm worried about a slippery slope here.  If we
start marking things as leakproof because we cannot prove
they leak, rather than because we can prove they don't,
we are eventually going to find ourselves in a very bad place.

regards, tom lane




Re: Popcount optimization using AVX512

2024-07-31 Thread Nathan Bossart
On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote:
> On 2024-07-30 22:12:18 -0500, Nathan Bossart wrote:
>> As I started on this, I remembered why I needed it.  The file
>> pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in order
>> to avoid inadvertently issuing any AVX-512 instructions before determining
>> we have support.  If that's not a concern, we could still probably remove
>> the XSAVE check.
> 
> I think it's a valid concern - but isn't that theoretically also an issue with
> xsave itself? I guess practically the compiler won't do that, because there's
> no practical reason to emit any instructions enabled by -mxsave (in contrast
> to e.g. -mavx, which does trigger gcc to emit different instructions even for
> basic math).

Yeah, this crossed my mind.  It's certainly not the sturdiest of
assumptions...

> I think enabling options like these on a per-translation-unit basis isn't
> really a scalable approach. To actually be safe there could only be a single
> function in each TU and that function could only be called after a cpuid check
> performed in a separate TU. That a) ends up pretty unreadable b) requires
> functions to be implemented in .c files, which we really don't want for some
> of this.

Agreed.

> I think we'd be better off enabling architectural features on a per-function
> basis, roughly like this:
>
> [...]
> 
> /* FIXME: Should be gated by configure check of -mavx512vpopcntdq -mavx512bw 
> support */
> pg_enable_target("avx512vpopcntdq,avx512bw")
> uint64_t
> pg_popcount_avx512(const char *buf, int bytes)
> ...

I remember wondering why the CRC-32C code wasn't already doing something
like this (old compiler versions? non-gcc-like compilers?), and I'm not
sure I ever discovered the reason, so out of an abundance of caution I used
the same approach for AVX-512.  If we can convince ourselves that
__attribute__((target("..."))) is standard enough at this point, +1 for
moving to that.

-- 
nathan




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-31 Thread Melanie Plageman
On Wed, Jul 31, 2024 at 4:38 PM Peter Geoghegan  wrote:
>
> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
>  wrote:
> > In back branches starting with 14, failing to remove tuples older than
> > OldestXmin during pruning caused vacuum to infinitely loop in
> > lazy_scan_prune(), as investigated on this [1] thread.
>
> Shouldn't somebody remove the entry that we have for this issue under
> https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items#Older_bugs_affecting_stable_branches?

Thanks for the reminder. Done!

- Melanie




Re: Fixing backslash dot for COPY FROM...CSV

2024-07-31 Thread Sutou Kouhei
Hi,

In <4ffb7317-7e92-4cbd-a542-1e478af6a...@manitou-mail.org>
  "Re: Fixing backslash dot for COPY FROM...CSV" on Wed, 31 Jul 2024 15:42:41 
+0200,
  "Daniel Verite"  wrote:

>> BTW, here is a diff after pgindent:
> 
> PFA a v5 with the cosmetic changes applied.

Thanks for updating the patch.

>> I also confirmed that the updated server and non-updated
>> psql compatibility problem (the end-of-data "\." is
>> inserted). It seems that it's difficult to solve without
>> introducing incompatibility.
> 
> To clarify the compatibility issue, the attached bash script
> compares pre-patch and post-patch client/server combinations with
> different cases, submitted with different copy variants.
> 
> case A: quoted backslash-dot sequence in CSV
> case B: unquoted backslash-dot sequence in CSV
> case C: CSV without backslash-dot
> case D: text with backslash-dot at the end
> case E: text without backslash-dot at the end
> 
> The different ways to submit the data:
> copy from file
> \copy from file
> \copy from pstdin
> copy from stdin with embedded data after the command
> 
> Also attaching the tables of results with the patch as it stands.
> "Failed" is when psql reports an error and "Data mismatch"
> is when it succeeds but with copied data differing from what was
> expected.
> 
> Does your test have an equivalent in these results or is it a different
> case?

Sorry for not clarifying my test. My test was:

Case C
+----------+----------------+
|  method  | patched-server+|
|          | unpatched-psql |
+----------+----------------+
| embedded | Data mismatch  |
+----------+----------------+

I confirmed that this case is "Data mismatch".



Thanks,
-- 
kou




Re: optimizing pg_upgrade's once-in-each-database steps

2024-07-31 Thread Ilya Gladyshev



On 22.07.2024 21:07, Nathan Bossart wrote:

On Fri, Jul 19, 2024 at 04:21:37PM -0500, Nathan Bossart wrote:

However, while looking into this, I noticed that only one get_query
callback (get_db_subscription_count()) actually customizes the generated
query using information in the provided DbInfo.  AFAICT we can do this
particular step without running a query in each database, as I mentioned
elsewhere [0].  That should speed things up a bit and allow us to simplify
the AsyncTask code.

With that, if we are willing to assume that a given get_query callback will
generate the same string for all databases (and I think we should), we can
run the callback once and save the string in the step for dispatch_query()
to use.  This would look more like what you suggested in the quoted text.

Here is a new patch set.  I've included the latest revision of the patch to
fix get_db_subscription_count() from the other thread [0] as 0001 since I
expect that to be committed soon.  I've also moved the patch that moves the
"live_check" variable to "user_opts" to 0002 since I plan on committing
that sooner than later, too.  Otherwise, I've tried to address all feedback
provided thus far.

[0] https://commitfest.postgresql.org/49/5135/


Hi,

I like your idea of parallelizing these checks with async libpq API, 
thanks for working on it. The patch doesn't apply cleanly on master 
anymore, but I've rebased locally and taken it for a quick spin with a 
pg16 instance of 1000 empty databases. Didn't see any regressions with 
-j 1, there's some speedup with -j 8 (33 sec vs 8 sec for these checks).


One thing I noticed that could be improved: we could start a new 
connection right away after having run all query callbacks for the 
current connection in process_slot, instead of returning and 
establishing the new connection only on the next iteration of the loop 
in async_task_run, after potentially sleeping on select.


+1 to Jeff's suggestion that perhaps we could reuse connections, but 
perhaps that's a separate story.


Regards,

Ilya





Re: Add mention of execution time memory for enable_partitionwise_* GUCs

2024-07-31 Thread David Rowley
On Wed, 31 Jul 2024 at 16:15, Ashutosh Bapat
 wrote:
> We can commit your
> version and see if users find it confusing.

Ok. I've now pushed the patch. Thanks for reviewing it.

David




Re: Remove last traces of HPPA support

2024-07-31 Thread Thomas Munro
On Thu, Aug 1, 2024 at 7:07 AM Andres Freund  wrote:
> Note that I would like to add a user for S_LOCK_FREE(), to detect repeated
> SpinLockRelease():
> https://postgr.es/m/20240729182952.hua325647e2ggbsy%40awork3.anarazel.de

What about adding a "magic" member in assertion builds?  Here is my
attempt at that, in 0002.

I also realised that we might as well skip the trivial S_XXX macros
and delete s_lock.h.  In this version of 0001 we retain just spin.h,
but s_lock.c still exists to hold the slow path.
From 47d5c4537dd741efc0fd6ac54393d2e7aca7ec8b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 31 Jul 2024 13:11:35 +1200
Subject: [PATCH v2 1/2] Use atomics API to implement spinlocks.

Since our spinlock API pre-dates our C11-style atomics API by decades,
it had its own hand-crafted operations written in assembler.  Use the
atomics API instead, to simplify and de-duplicate.  We couldn't have
done this earlier, because that'd be circular: atomics were simulated
with spinlocks in --disable-atomics builds.  Commit 81385261 removed
that option, so now we can delete most of the system-specific spinlock code
and just redirect everything to pg_atomic_flag.

The main special knowledge embodied in the hand-crafted code was the
relaxed load of the lock value before attempting to test-and-set, while
spinning.  That is retained in simplified form in the new coding.
---
 configure  |  22 -
 configure.ac   |  19 -
 src/Makefile.global.in |   3 -
 src/backend/port/Makefile  |  12 -
 src/backend/port/meson.build   |   2 +-
 src/backend/port/tas/dummy.s   |   0
 src/backend/port/tas/sunstudio_sparc.s |  53 --
 src/backend/port/tas/sunstudio_x86.s   |  43 --
 src/backend/storage/lmgr/s_lock.c  | 128 +
 src/include/storage/s_lock.h   | 749 -
 src/include/storage/spin.h |  60 +-
 src/template/linux |  15 -
 src/template/solaris   |  15 -
 src/test/regress/regress.c |  25 +-
 14 files changed, 75 insertions(+), 1071 deletions(-)
 delete mode 100644 src/backend/port/tas/dummy.s
 delete mode 100644 src/backend/port/tas/sunstudio_sparc.s
 delete mode 100644 src/backend/port/tas/sunstudio_x86.s
 delete mode 100644 src/include/storage/s_lock.h

diff --git a/configure b/configure
index 8f684f7945e..e2267837b7d 100755
--- a/configure
+++ b/configure
@@ -731,7 +731,6 @@ PKG_CONFIG_LIBDIR
 PKG_CONFIG_PATH
 PKG_CONFIG
 DLSUFFIX
-TAS
 GCC
 CPP
 CFLAGS_SL
@@ -3021,12 +3020,6 @@ $as_echo "$template" >&6; }
 PORTNAME=$template
 
 
-# Initialize default assumption that we do not need separate assembly code
-# for TAS (test-and-set).  This can be overridden by the template file
-# when it's executed.
-need_tas=no
-tas_file=dummy.s
-
 # Default, works for most platforms, override in template file if needed
 DLSUFFIX=".so"
 
@@ -7770,20 +7763,6 @@ ac_compiler_gnu=$ac_cv_c_compiler_gnu
 
 
 
-#
-# Set up TAS assembly code if needed; the template file has now had its
-# chance to request this.
-#
-ac_config_links="$ac_config_links src/backend/port/tas.s:src/backend/port/tas/${tas_file}"
-
-
-if test "$need_tas" = yes ; then
-  TAS=tas.o
-else
-  TAS=""
-fi
-
-
 
 cat >>confdefs.h <<_ACEOF
 #define DLSUFFIX "$DLSUFFIX"
@@ -19924,7 +19903,6 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
 for ac_config_target in $ac_config_targets
 do
   case $ac_config_target in
-"src/backend/port/tas.s") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/tas.s:src/backend/port/tas/${tas_file}" ;;
 "GNUmakefile") CONFIG_FILES="$CONFIG_FILES GNUmakefile" ;;
 "src/Makefile.global") CONFIG_FILES="$CONFIG_FILES src/Makefile.global" ;;
 "src/backend/port/pg_sema.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}" ;;
diff --git a/configure.ac b/configure.ac
index 75b73532fe0..59c3b7e3d35 100644
--- a/configure.ac
+++ b/configure.ac
@@ -95,12 +95,6 @@ AC_MSG_RESULT([$template])
 PORTNAME=$template
 AC_SUBST(PORTNAME)
 
-# Initialize default assumption that we do not need separate assembly code
-# for TAS (test-and-set).  This can be overridden by the template file
-# when it's executed.
-need_tas=no
-tas_file=dummy.s
-
 # Default, works for most platforms, override in template file if needed
 DLSUFFIX=".so"
 
@@ -740,19 +734,6 @@ AC_PROG_CPP
 AC_SUBST(GCC)
 
 
-#
-# Set up TAS assembly code if needed; the template file has now had its
-# chance to request this.
-#
-AC_CONFIG_LINKS([src/backend/port/tas.s:src/backend/port/tas/${tas_file}])
-
-if test "$need_tas" = yes ; then
-  TAS=tas.o
-else
-  TAS=""
-fi
-AC_SUBST(TAS)
-
 AC_SUBST(DLSUFFIX)dnl
 AC_DEFINE_UNQUOTED([DLSUFFIX], ["$DLSUFFIX"],
[Define to the file name extension of dynamically-loadable modules.])
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 83b91fe9167..0301f463027 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.globa

Re: Support LIKE with nondeterministic collations

2024-07-31 Thread Jeff Davis
On Fri, 2024-05-03 at 16:58 +0200, Daniel Verite wrote:
>    * Generating bounds for a sort key (prefix matching)
> 
>    Having sort keys for strings allows for easy creation of bounds -
>    sort keys that are guaranteed to be smaller or larger than any
> sort
>    key from a give range. For example, if bounds are produced for a
>    sortkey of string “smith”, strings between upper and lower bounds
>    with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds
>    of upper bounds can be generated - the first one will match only
>    strings of equal length, while the second one will match all the
>    strings with the same initial prefix.
> 
>    CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
>    the maximum primary weight, so that for example the string
>    “smith\uFFFF” can be used as the upper bound rather than modifying
>    the sort key for “smith”.
> 
> In other words it says that
> 
>   col LIKE 'smith%' collate "nd"
> 
> is equivalent to:
> 
>   col >= 'smith' collate "nd" AND col < U&'smith\FFFF' collate "nd"

That logic seems to assume something about the collation. If you have a
collation that orders strings by their sha256 hash, that would entirely
break the connection between prefixes and ranges, and it wouldn't work.

Is there something about the way collations are defined that inherently
maintains a connection between a prefix and a range? Does it remain
true even when strange rules are added to a collation?

Regards,
Jeff Davis





Re: Logical Replication of sequences

2024-07-31 Thread Peter Smith
Hi Vignesh,

I have a question about the subscriber-side behaviour of currval().

==

AFAIK it is normal for currval() to give an error if nextval() has not
yet been called [1]

For example.
test_pub=# create sequence s1;
CREATE SEQUENCE
test_pub=# select * from currval('s1');
2024-08-01 07:42:48.619 AEST [24131] ERROR:  currval of sequence "s1"
is not yet defined in this session
2024-08-01 07:42:48.619 AEST [24131] STATEMENT:  select * from currval('s1');
ERROR:  currval of sequence "s1" is not yet defined in this session
test_pub=# select * from nextval('s1');
 nextval
-
   1
(1 row)

test_pub=# select * from currval('s1');
 currval
-
   1
(1 row)

test_pub=#

~~~

OTOH, I was hoping to be able to use currval() at the subscriber-side
to see the current sequence value after issuing ALTER .. REFRESH
PUBLICATION SEQUENCES.

Unfortunately, it has the same behaviour where currval() cannot be
used without nextval(). But, on the subscriber, you probably never
want to do an explicit nextval() independently of the publisher.

Is this currently a bug, or maybe a quirk that should be documented?

For example:

Publisher
==

test_pub=# create sequence s1;
CREATE SEQUENCE
test_pub=# CREATE PUBLICATION pub1 FOR ALL SEQUENCES;
CREATE PUBLICATION
test_pub=# select * from nextval('s1');
 nextval
-
   1
(1 row)

test_pub=# select * from nextval('s1');
 nextval
-
   2
(1 row)

test_pub=# select * from nextval('s1');
 nextval
-
   3
(1 row)

test_pub=#

Subscriber
==

(Notice currval() always gives an error unless nextval() is used prior).

test_sub=# create sequence s1;
CREATE SEQUENCE
test_sub=# CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=test_pub'
PUBLICATION pub1;
2024-08-01 07:51:06.955 AEST [24325] WARNING:  subscriptions created
by regression test cases should have names starting with "regress_"
WARNING:  subscriptions created by regression test cases should have
names starting with "regress_"
NOTICE:  created replication slot "sub1" on publisher
CREATE SUBSCRIPTION
test_sub=# 2024-08-01 07:51:07.023 AEST [4211] LOG:  logical
replication apply worker for subscription "sub1" has started
2024-08-01 07:51:07.037 AEST [4213] LOG:  logical replication sequence
synchronization worker for subscription "sub1" has started
2024-08-01 07:51:07.063 AEST [4213] LOG:  logical replication
synchronization for subscription "sub1", sequence "s1" has finished
2024-08-01 07:51:07.063 AEST [4213] LOG:  logical replication sequence
synchronization worker for subscription "sub1" has finished

test_sub=# SELECT * FROM currval('s1');
2024-08-01 07:51:19.688 AEST [24325] ERROR:  currval of sequence "s1"
is not yet defined in this session
2024-08-01 07:51:19.688 AEST [24325] STATEMENT:  SELECT * FROM currval('s1');
ERROR:  currval of sequence "s1" is not yet defined in this session
test_sub=# ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION SEQUENCES;
ALTER SUBSCRIPTION
test_sub=# 2024-08-01 07:51:35.298 AEST [4993] LOG:  logical
replication sequence synchronization worker for subscription "sub1"
has started

test_sub=# 2024-08-01 07:51:35.321 AEST [4993] LOG:  logical
replication synchronization for subscription "sub1", sequence "s1" has
finished
2024-08-01 07:51:35.321 AEST [4993] LOG:  logical replication sequence
synchronization worker for subscription "sub1" has finished

test_sub=#
test_sub=# SELECT * FROM currval('s1');
2024-08-01 07:51:41.438 AEST [24325] ERROR:  currval of sequence "s1"
is not yet defined in this session
2024-08-01 07:51:41.438 AEST [24325] STATEMENT:  SELECT * FROM currval('s1');
ERROR:  currval of sequence "s1" is not yet defined in this session
test_sub=#
test_sub=# SELECT * FROM nextval('s1');
 nextval
-
   4
(1 row)

test_sub=# SELECT * FROM currval('s1');
 currval
-
   4
(1 row)

test_sub=#

==
[1] https://www.postgresql.org/docs/current/functions-sequence.html

Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: Remove last traces of HPPA support

2024-07-31 Thread Andres Freund
Hi,

On 2024-08-01 10:09:07 +1200, Thomas Munro wrote:
> On Thu, Aug 1, 2024 at 7:07 AM Andres Freund  wrote:
> > Note that I would like to add a user for S_LOCK_FREE(), to detect repeated
> > SpinLockRelease():
> > https://postgr.es/m/20240729182952.hua325647e2ggbsy%40awork3.anarazel.de
> 
> What about adding a "magic" member in assertion builds?  Here is my
> attempt at that, in 0002.

That changes the ABI, which we don't want, because it breaks using
extensions against a differently built postgres.

I don't really see a reason to avoid having S_LOCK_FREE(), am I missing
something? Previously the semaphore fallback was a reason, but that's gone
now...


Greetings,

Andres Freund




Re: Logical Replication of sequences

2024-07-31 Thread Peter Smith
Hi Vignesh,

I noticed that when replicating sequences (using the latest patches
0730_2*)  the subscriber-side checks the *existence* of the sequence,
but apparently it is not checking other sequence attributes.

For example, consider:

Publisher: "CREATE SEQUENCE s1 START 1 INCREMENT 2;" should be a
sequence of only odd numbers.
Subscriber: "CREATE SEQUENCE s1 START 2 INCREMENT 2;" should be a
sequence of only even numbers.

Because the names match, currently the patch allows replication of the
s1 sequence. I think that might lead to unexpected results on the
subscriber. IMO it might be safer to report ERROR unless the sequences
match properly (i.e. not just a name check).

Below is a demonstration the problem:

==
Publisher:
==

(publisher sequence is odd numbers)

test_pub=# create sequence s1 start 1 increment 2;
CREATE SEQUENCE
test_pub=# select * from nextval('s1');
 nextval
-
   1
(1 row)

test_pub=# select * from nextval('s1');
 nextval
-
   3
(1 row)

test_pub=# select * from nextval('s1');
 nextval
-
   5
(1 row)

test_pub=# CREATE PUBLICATION pub1 FOR ALL SEQUENCES;
CREATE PUBLICATION
test_pub=#

==
Subscriber:
==

(subscriber sequence is even numbers)

test_sub=# create sequence s1 start 2 increment 2;
CREATE SEQUENCE
test_sub=# SELECT * FROM nextval('s1');
 nextval
-
   2
(1 row)

test_sub=# SELECT * FROM nextval('s1');
 nextval
-
   4
(1 row)

test_sub=# SELECT * FROM nextval('s1');
 nextval
-
   6
(1 row)

test_sub=# CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=test_pub'
PUBLICATION pub1;
2024-08-01 08:43:04.198 AEST [24325] WARNING:  subscriptions created
by regression test cases should have names starting with "regress_"
WARNING:  subscriptions created by regression test cases should have
names starting with "regress_"
NOTICE:  created replication slot "sub1" on publisher
CREATE SUBSCRIPTION
test_sub=# 2024-08-01 08:43:04.294 AEST [26240] LOG:  logical
replication apply worker for subscription "sub1" has started
2024-08-01 08:43:04.309 AEST [26244] LOG:  logical replication
sequence synchronization worker for subscription "sub1" has started
2024-08-01 08:43:04.323 AEST [26244] LOG:  logical replication
synchronization for subscription "sub1", sequence "s1" has finished
2024-08-01 08:43:04.323 AEST [26244] LOG:  logical replication
sequence synchronization worker for subscription "sub1" has finished

(after the CREATE SUBSCRIPTION we are getting replicated odd values
from the publisher, even though the subscriber side sequence was
supposed to be even numbers)

test_sub=# SELECT * FROM nextval('s1');
 nextval
-
   7
(1 row)

test_sub=# SELECT * FROM nextval('s1');
 nextval
-
   9
(1 row)

test_sub=# SELECT * FROM nextval('s1');
 nextval
-
  11
(1 row)

(Looking at the description you would expect odd values for this
sequence to be impossible)

test_sub=# \dS+ s1
 Sequence "public.s1"
  Type  | Start | Minimum |       Maximum       | Increment | Cycles? | Cache
--------+-------+---------+---------------------+-----------+---------+-------
 bigint |     2 |       1 | 9223372036854775807 |         2 | no      |     1

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: The stats.sql test is failing sporadically in v14- on POWER7/AIX 7.1 buildfarm animals

2024-07-31 Thread Noah Misch
On Wed, Jul 31, 2024 at 02:00:00PM +0300, Alexander Lakhin wrote:
> --- 
> /home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/expected/stats.out
>  2022-03-30 01:18:17.0 +
> +++ 
> /home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/results/stats.out
>  2024-07-30 09:49:39.0 +
> @@ -165,11 +165,11 @@
>    WHERE relname like 'trunc_stats_test%' order by relname;
>     relname  | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | 
> n_dead_tup
> ---+---+---+---++
> -  trunc_stats_test  | 3 | 0 | 0 | 0 |  0
> -  trunc_stats_test1 | 4 | 2 | 1 | 1 |  0
> -  trunc_stats_test2 | 1 | 0 | 0 | 1 |  0
> -  trunc_stats_test3 | 4 | 0 | 0 | 2 |  2
> -  trunc_stats_test4 | 2 | 0 | 0 | 0 |  2
> +  trunc_stats_test  | 0 | 0 | 0 | 0 |  0
> +  trunc_stats_test1 | 0 | 0 | 0 | 0 |  0
> +  trunc_stats_test2 | 0 | 0 | 0 | 0 |  0
> +  trunc_stats_test3 | 0 | 0 | 0 | 0 |  0
> +  trunc_stats_test4 | 0 | 0 | 0 | 0 |  0
> ...
> 
> inst/logfile contains:
> 2024-07-30 09:25:11.225 UTC [63307946:1] LOG:  using stale statistics
> instead of current ones because stats collector is not responding
> 2024-07-30 09:25:11.345 UTC [11206724:559] pg_regress/create_index LOG: 
> using stale statistics instead of current ones because stats collector is
> not responding
> ...

> I could not find such failures coming from POWER8 animals: hoverfly
> (running AIX 7200-04-03-2038), ayu, boa, chub, and I did not encounter such
> anomalies on x86 nor ARM platforms.

The animals you list as affected share a filesystem.  The failure arises from
the slow filesystem metadata operations of that filesystem.

> Thus, it looks like this stats collector issue is only happening on this
> concrete platform, and given [11], I think such failures perhaps should
> be just ignored for the next two years (until v14 EOL) unless AIX 7.1
> will be upgraded and we see them on a vendor-supported OS version.

This has happened on non-POWER, I/O-constrained machines.  Still, I have been
ignoring these failures.  The stats subsystem was designed to drop stats
updates at times, which was always at odds with the need for stable tests.  So
the failures witness a defect of the test, not a defect of the backend.
Stabilizing this test was a known benefit of the new stats implementation.




Re: Remove last traces of HPPA support

2024-07-31 Thread Thomas Munro
On Thu, Aug 1, 2024 at 10:38 AM Andres Freund  wrote:
> On 2024-08-01 10:09:07 +1200, Thomas Munro wrote:
> > On Thu, Aug 1, 2024 at 7:07 AM Andres Freund  wrote:
> > > Note that I would like to add a user for S_LOCK_FREE(), to detect repeated
> > > SpinLockRelease():
> > > https://postgr.es/m/20240729182952.hua325647e2ggbsy%40awork3.anarazel.de
> >
> > What about adding a "magic" member in assertion builds?  Here is my
> > attempt at that, in 0002.
>
> That changes the ABI, which we don't want, because it breaks using
> extensions against a differently built postgres.

Yeah, right, bad idea.  Let me think about how to do something like
what you showed, but with the atomics patch...

Hmm.  One of the interesting things about the atomic_flag interface is
that it completely hides the contents of memory.  (Guess: its weird
minimal interface was designed to help weird architectures like
PA-RISC, staying on topic for $SUBJECT; I doubt we'll see such a
system again but it's useful for this trick).  So I guess we could
push the check down to that layer, and choose arbitrary non-zero
values for the arch-x86.h implementation of pg_atomic_flag.  See
attached.  Is this on the right track?

(Looking ahead, if we eventually move to using <stdatomic.h>, we won't
be able to use atomic_flag due to lack of relaxed load anyway, so we
could generalise this to atomic_char (rather than atomic_bool), and
keep using non-zero values.  Presumably at that point we could also
decree that zero-initialised memory is valid for initialising our
spinlocks, but it seems useful as a defence against uninitialised
objects anyway.)

> I don't really see a reason to avoid having S_LOCK_FREE(), am I missing
> something? Previously the semaphore fallback was a reason, but that's gone
> now...

Sure, but if it's just for assertions, we don't need it.  Or any of
the S_XXX stuff.
From 8981ac9dcf0a76d4f75a1d2c71822579e1b18a93 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 31 Jul 2024 13:11:35 +1200
Subject: [PATCH v3 1/3] Use atomics API to implement spinlocks.

Since our spinlock API pre-dates our C11-style atomics API by decades,
it had its own hand-crafted operations written in assembler.  Use the
atomics API instead, to simplify and de-duplicate.  We couldn't have
done this earlier, because that'd be circular: atomics were simulated
with spinlocks in --disable-atomics builds.  Commit 81385261 removed
that option, so now we can delete most of the system-specific spinlock code
and just redirect everything to pg_atomic_flag.

The main special knowledge embodied in the hand-crafted code was the
relaxed load of the lock value before attempting to test-and-set, while
spinning.  That is retained in simplified form in the new coding.
---
 configure  |  22 -
 configure.ac   |  19 -
 src/Makefile.global.in |   3 -
 src/backend/port/Makefile  |  12 -
 src/backend/port/meson.build   |   2 +-
 src/backend/port/tas/dummy.s   |   0
 src/backend/port/tas/sunstudio_sparc.s |  53 --
 src/backend/port/tas/sunstudio_x86.s   |  43 --
 src/backend/storage/lmgr/s_lock.c  | 128 +
 src/include/storage/s_lock.h   | 749 -
 src/include/storage/spin.h |  70 ++-
 src/template/linux |  15 -
 src/template/solaris   |  15 -
 src/test/regress/regress.c |  25 +-
 14 files changed, 85 insertions(+), 1071 deletions(-)
 delete mode 100644 src/backend/port/tas/dummy.s
 delete mode 100644 src/backend/port/tas/sunstudio_sparc.s
 delete mode 100644 src/backend/port/tas/sunstudio_x86.s
 delete mode 100644 src/include/storage/s_lock.h

diff --git a/configure b/configure
index 8f684f7945e..e2267837b7d 100755
--- a/configure
+++ b/configure
@@ -731,7 +731,6 @@ PKG_CONFIG_LIBDIR
 PKG_CONFIG_PATH
 PKG_CONFIG
 DLSUFFIX
-TAS
 GCC
 CPP
 CFLAGS_SL
@@ -3021,12 +3020,6 @@ $as_echo "$template" >&6; }
 PORTNAME=$template
 
 
-# Initialize default assumption that we do not need separate assembly code
-# for TAS (test-and-set).  This can be overridden by the template file
-# when it's executed.
-need_tas=no
-tas_file=dummy.s
-
 # Default, works for most platforms, override in template file if needed
 DLSUFFIX=".so"
 
@@ -7770,20 +7763,6 @@ ac_compiler_gnu=$ac_cv_c_compiler_gnu
 
 
 
-#
-# Set up TAS assembly code if needed; the template file has now had its
-# chance to request this.
-#
-ac_config_links="$ac_config_links src/backend/port/tas.s:src/backend/port/tas/${tas_file}"
-
-
-if test "$need_tas" = yes ; then
-  TAS=tas.o
-else
-  TAS=""
-fi
-
-
 
 cat >>confdefs.h <<_ACEOF
 #define DLSUFFIX "$DLSUFFIX"
@@ -19924,7 +19903,6 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
 for ac_config_target in $ac_config_targets
 do
   case $ac_config_target in
-"src/backend/port/tas.s") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/tas.s:src/backend/port/tas/${tas_file}" ;;
 "GNUmakefile") CONFIG_FILES="$CONFIG_F

Re: Add LSN <-> time conversion functionality

2024-07-31 Thread Melanie Plageman
Thanks for the review! v6 attached.

On Sat, Jul 6, 2024 at 1:36 PM Andrey M. Borodin  wrote:
>
> PgStartLSN = GetXLogInsertRecPtr();
> Should this be kind of RecoveryEndPtr? How is it expected to behave on 
> Standby in HA cluster, which was doing a crash recovery of 1y WALs in a day, 
> then is in startup for a year as a Hot Standby, and then is promoted?

So, I don't think we will allow use of the LSNTimeStream on a standby,
since it is unclear what it would mean on a standby. For example, do
you want to know the time the LSN was generated or the time it was
replayed? Note that bgwriter won't insert values to the time stream on
a standby (it explicitly checks).

But, you bring up an issue that I don't quite know what to do about.
If the standby doesn't have an LSNTimeStream, then when it is
promoted, LSN <-> time conversions of LSNs and times before the
promotion seem impossible. Maybe if the stats file is getting written
out at checkpoints, we could restore from that previous primary's file
after promotion?

This brings up a second issue, which is that, currently, bgwriter
won't insert into the time stream when wal_level is minimal. I'm not
sure exactly how to resolve it because I am relying on the "last
important rec pointer" and the LOG_SNAPSHOT_INTERVAL_MS to throttle
when the bgwriter actually inserts new records into the LSNTimeStream.
I could come up with a different timeout interval for updating the
time stream. Perhaps I should do that?

> lsn_ts_calculate_error_area() is prone to overflow. Even int64 does not seem 
> capable to accommodate LSN*time. And the function may return negative result, 
> despite claiming area as a result. It’s intended, but a little misleading.

Ah, great point. I've fixed this.

> i-- > 0
> Is there a point to do a backward count in the loop?
> Consider dropping not one by one, but half of a stream, LSNTimeStream is ~2Kb 
> of cache and it’s loaded as a whole to the cache..

Yes, the backwards looping was super confusing. It was a relic of my
old design. Even without your point about cache locality, the code is
much harder to understand with the backwards looping. I've changed the
array to fill forwards and be accessed with forward loops.

> > On 27 Jun 2024, at 07:18, Melanie Plageman  
> > wrote:
> >
> >> 2. Some benchmarks to proof the patch does not have CPU footprint.
> >
> > This is still a todo. Typically when designing a benchmark like this,
> > I would want to pick a worst-case workload to see how bad it could be.
> > I wonder if just a write heavy workload like pgbench builtin tpcb-like
> > would be sufficient?
>
> Increasing background writer activity to maximum and not seeing LSNTimeStream 
> function in `perf top` seems enough to me.

I've got this on my TODO.

> >> Tests fail on Windows.
> >
> > I think this was because of the compiler warnings, but I need to
> > double-check now.
> Nope, it really looks more serious.
> [12:31:25.701] ../src/backend/utils/activity/pgstat_wal.c(433): error C2375: 
> 'pg_estimate_lsn_at_time': redefinition; different linkage

Ah, yes. I mistakenly added the functions to pg_proc.dat and also
called PG_FUNCTION_INFO_V1 for the functions. I've fixed it.

> >> The patch lacks tests.
> >
> > I thought about this a bit. I wonder what kind of tests make sense.
> >
> > I could
> > 1) Add tests with the existing stats tests
> > (src/test/regress/sql/stats.sql) and just test that bgwriter is in
> > fact adding to the time stream.
> >
> > 2) Or should I add some infrastructure to be able to create an
> > LSNTimeStream and then insert values to it and do some validations of
> > what is added? I did a version of this but it is just much more
> > annoying with C & SQL than with python (where I tried out my
> > algorithms) [2].
>
> I think just a test which calls functions and discards the result would 
> greatly increase coverage.

I've added tests of the two main conversion functions. I didn't add a
test of the function which gets the whole stream (pg_lsntime_stream())
because I don't think I can guarantee it won't be empty -- so I'm not
sure what I could do with it in a test.

> > On 29 Jun 2024, at 03:09, Melanie Plageman  
> > wrote:
> > change the user-facing functions for estimating an
> > LSN/time conversion to instead return a floor and a ceiling -- instead
> > of linearly interpolating a guess. This would be a way to keep users
> > from misunderstanding the accuracy of the functions to translate LSN
> > <-> time.
>
> I think this is a good idea. And it covers well “server restart problem”. If 
> API just returns -inf as a boundary, caller can correctly interpret this 
> situation.

Providing "ceiling" and "floor" user functions is still a TODO for me;
however, I think that the patch mostly does handle server restarts.

In the event of a restart, the cumulative stats system will have
persisted our time stream, so the LSNTimeStream will just be read back
in with the rest of the stats. I've added logic to ensure that if the
PgSta

RE: Remove duplicate table scan in logical apply worker and code refactoring

2024-07-31 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

> Thanks for reviewing the patch, and your understanding is correct.
> 
> Here is the updated patch 0001. I removed the comments as suggested by Amit.
> 
> Since 0002 patch is only refactoring the code and I need some time to review
> the comments for it, I will hold it until the 0001 is committed.

Thanks for updating the patch. I did performance testing with v2-0001.

Before: 15.553 [s]
After:  7.593 [s]

I used the attached script for setup. I used almost the same settings, and
synchronous replication was used.

[machine]
CPU(s):120
Model name:Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
Core(s) per socket:15
Socket(s): 4

Best regards,
Hayato Kuroda
FUJITSU LIMITED



test.sh
Description: test.sh


Re: On disable_cost

2024-07-31 Thread David Rowley
On Thu, 1 Aug 2024 at 04:23, Robert Haas  wrote:
> OK, here's a new patch version.

I think we're going down the right path here.

I've reviewed both patches, here's what I noted down during my review:

0. I've not seen any mention so far about postgres_fdw's
use_remote_estimate.  Maybe changing the costs is fixing an issue that
existed before. I'm just not 100% sure on that.

Consider:

CREATE EXTENSION postgres_fdw;

DO $d$
BEGIN
EXECUTE $$CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (use_remote_estimate 'true',
 dbname '$$||current_database()||$$',
 port '$$||current_setting('port')||$$'
)$$;
END;
$d$;
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;

create table t (a int);
create foreign table ft (a int) server loopback OPTIONS (table_name 't');

alter system set enable_seqscan=0;
select pg_Reload_conf();
set enable_seqscan=1;
explain select * from ft;

patched:
 Foreign Scan on ft  (cost=100.00..671.00 rows=2550 width=4)

master:
 Foreign Scan on ft  (cost=1000100.00..1000671.00 rows=2550 width=4)

I kinda think that might be fixing an issue that I don't recall being
reported before. I don't think we should really care much about which
nodes are disabled on the remote server, and not applying disabled_cost
to the foreign scan gives us that.

1. The final sentence of the function header comment needs to be
updated in estimate_path_cost_size().

2. Does cost_tidscan() need to update the header comment to say
tidquals must not be empty?

3. final_cost_nestloop() seems to initially use the disabled_nodes
from initial_cost_nestloop() but then it goes off and calculates it
again itself. One of these seems redundant. The "We could include
disable_cost in the preliminary estimate" comment explains why it was
originally left to final_cost_nestloop(), so maybe worth sticking to
that? I don't quite know the full implications, but it does not seem
worth risking a behaviour change here.

4. I wonder if it's worth doing a quick refactor of the code in
initial_cost_mergejoin() to get rid of the duplicate code in the "if
(outersortkeys)" and "if (innersortkeys)" branches. It seems ok to do
outer_path = &sort_path. Likewise for inner_path.

5. final_cost_hashjoin() does the same thing as #3

6. createplan.c adds #include "nodes/print.h" but doesn't seem to add
any code that might use anything in there.

7. create_lockrows_path() needs to propagate disabled_nodes.

create table a (a int);
set enable_seqscan=0;

explain select * from a for update limit 1;

 Limit  (cost=0.00..0.02 rows=1 width=10)
   ->  LockRows  (cost=0.00..61.00 rows=2550 width=10)
 ->  Seq Scan on a  (cost=0.00..35.50 rows=2550 width=10)
   Disabled Nodes: 1
(4 rows)


explain select * from a limit 1;

 Limit  (cost=0.00..0.01 rows=1 width=4)
   Disabled Nodes: 1
   ->  Seq Scan on a  (cost=0.00..35.50 rows=2550 width=4)
 Disabled Nodes: 1
(4 rows)

8. There's something weird with CTEs too.

create table b(a int);
set enable_sort=0;

Patched:

explain with cte as materialized (select * from b order by a) select *
from cte order by a desc;

 Sort  (cost=381.44..387.82 rows=2550 width=4)
   Disabled Nodes: 1
   Sort Key: cte.a DESC
   CTE cte
 ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
   Disabled Nodes: 1
   Sort Key: b.a
   ->  Seq Scan on b  (cost=0.00..35.50 rows=2550 width=4)
   ->  CTE Scan on cte  (cost=0.00..51.00 rows=2550 width=4)
(9 rows)

master:

explain with cte as materialized (select * from a order by a) select *
from cte order by a desc;

 Sort  (cost=2000381.44..2000387.82 rows=2550 width=4)
   Sort Key: cte.a DESC
   CTE cte
 ->  Sort  (cost=1000179.78..1000186.16 rows=2550 width=4)
   Sort Key: a.a
   ->  Seq Scan on a  (cost=0.00..35.50 rows=2550 width=4)
   ->  CTE Scan on cte  (cost=0.00..51.00 rows=2550 width=4)
(7 rows)

I'd expect the final sort to have disabled_nodes == 2 since
disabled_cost has been added twice in master.

9. create_set_projection_path() needs to propagate disabled_nodes too:

explain select b from (select a,generate_series(1,2) as b from b) a  limit 1;

 Limit  (cost=0.00..0.03 rows=1 width=4)
   ->  Subquery Scan on a  (cost=0.00..131.12 rows=5100 width=4)
 ->  ProjectSet  (cost=0.00..80.12 rows=5100 width=8)
   ->  Seq Scan on b  (cost=0.00..35.50 rows=2550 width=0)
 Disabled Nodes: 1

10. create_setop_path() needs to propagate disabled_nodes.

explain select * from b except select * from b limit 1;

 Limit  (cost=0.00..0.80 rows=1 width=8)
   ->  HashSetOp Except  (cost=0.00..160.25 rows=200 width=8)
 ->  Append  (cost=0.00..147.50 rows=5100 width=8)
   Disabled Nodes: 2
   ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..61.00
rows=2550 width=8)
 Disabled Nodes: 1
 ->  Seq Scan on b  (cost=0

Re: Evaluate arguments of correlated SubPlans in the referencing ExprState

2024-07-31 Thread Andres Freund
Hi,

On 2023-02-25 13:44:01 -0800, Andres Freund wrote:
> Ended up simpler than I'd thought. I see small, consistent, speedups and
> reductions in memory usage.

For the sake of anyone following the link from the commit message to this
thread in a few years, I thought it'd be useful to have an example for the
differences due to the patch.


Consider e.g. the query used for psql's \d pg_class, just because that's the
first thing using a subplan that I got my hands on:

Memory usage in ExecutorState changes from
  Grand total: 131072 bytes in 12 blocks; 88696 free (2 chunks); 42376 used
to
  Grand total: 131072 bytes in 12 blocks; 93656 free (4 chunks); 37416 used


What's more interesting is that if I - just to show the effect - force JITing,
EXPLAIN ANALYZE's jit section changes from:

JIT:
  Functions: 31
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 2.656 ms (Deform 1.496 ms), Inlining 25.147 ms, 
Optimization 112.853 ms, Emission 81.585 ms, Total 222.241 ms

to

 JIT:
   Functions: 21
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.883 ms (Deform 0.990 ms), Inlining 23.821 ms, 
Optimization 85.150 ms, Emission 64.303 ms, Total 175.157 ms

I.e. noticeably reduced overhead, mostly due to the reduction in emitted
functions.

The difference obviously gets bigger the more parameters the subplan has, in
artificial cases it can be very large.


I also see some small performance gains during execution, but for realistic
queries that's in the ~1-3% range.

Greetings,

Andres Freund




v17 vs v16 performance comparison

2024-07-31 Thread Alexander Lakhin

Hello hackers,

I've repeated the performance measurement for REL_17_STABLE (1e020258e)
and REL_16_STABLE (6f6b0f193) and found several benchmarks where v16 is
significantly better than v17. Please find attached an html table with
all the benchmarking results.

I had paid attention to:
Best pg-src-17--.* worse than pg-src-16--.* by 57.9 percents (225.11 > 142.52): 
pg_tpcds.query15
Average pg-src-17--.* worse than pg-src-16--.* by 55.5 percents (230.57 > 
148.29): pg_tpcds.query15
In May, I performed `git bisect` for this degradation, which led me to
commit b7b0f3f27 [1].

This time I bisected the following anomaly:
Best pg-src-17--.* worse than pg-src-16--.* by 23.6 percents (192.25 > 155.58): 
pg_tpcds.query21
Average pg-src-17--.* worse than pg-src-16--.* by 25.1 percents (196.19 > 
156.85): pg_tpcds.query21
and to my surprise I got "b7b0f3f27 is the first bad commit".

Moreover, bisecting of another anomaly:
Best pg-src-17--.* worse than pg-src-16--.* by 24.2 percents (24269.21 > 
19539.89): pg_tpcds.query72
Average pg-src-17--.* worse than pg-src-16--.* by 24.2 percents (24517.66 > 
19740.12): pg_tpcds.query72
pointed at the same commit again.

So it looks like q15 from TPC-DS is not the only query suffering from that
change.

But besides that, I've found a separate regression. Bisecting for this 
degradation:
Best pg-src-17--.* worse than pg-src-16--.* by 105.0 percents (356.63 > 
173.96): s64da_tpcds.query95
Average pg-src-17--.* worse than pg-src-16--.* by 105.2 percents (357.79 > 
174.38): s64da_tpcds.query95
pointed at f7816aec2.

Does this deserve more analysis and maybe fixing?

[1] 
https://www.postgresql.org/message-id/63a63690-dd92-c809-0b47-af05459e95d1%40gmail.com

Best regards,
Alexander

./analyze-benchmarks.py -i 'pg-src-17--.*' 'pg-src-16--.*'
Best pg-src-17--.* better than pg-src-16--.* by 5.3 percents (9.04 < 9.55): 
pg_tpch.query1
Average pg-src-17--.* better than pg-src-16--.* by 5.8 percents (9.05 < 9.61): 
pg_tpch.query1
Best pg-src-17--.* better than pg-src-16--.* by 5.3 percents (2.84 < 3.00): 
pg_tpch.query3
Average pg-src-17--.* better than pg-src-16--.* by 5.7 percents (2.85 < 3.02): 
pg_tpch.query3
Best pg-src-17--.* better than pg-src-16--.* by 8.2 percents (1.35 < 1.47): 
pg_tpch.query6
Average pg-src-17--.* better than pg-src-16--.* by 7.5 percents (1.36 < 1.47): 
pg_tpch.query6
Best pg-src-17--.* worse than pg-src-16--.* by 5.9 percents (0.90 > 0.85): 
pg_tpch.query9
Best pg-src-17--.* better than pg-src-16--.* by 8.2 percents (2.58 < 2.81): 
pg_tpch.query13
Average pg-src-17--.* better than pg-src-16--.* by 5.9 percents (2.66 < 2.83): 
pg_tpch.query13
Best pg-src-17--.* better than pg-src-16--.* by 6.2 percents (3.02 < 3.22): 
pg_tpch.query15
Average pg-src-17--.* better than pg-src-16--.* by 6.1 percents (3.03 < 3.23): 
pg_tpch.query15
Best pg-src-17--.* better than pg-src-16--.* by 5.6 percents (8.14 < 8.62): 
pg_tpch.query17
Average pg-src-17--.* better than pg-src-16--.* by 5.2 percents (8.18 < 8.63): 
pg_tpch.query17
Best pg-src-17--.* better than pg-src-16--.* by 5.3 percents (0.18 < 0.19): 
pg_tpch.query19
Best pg-src-17--.* better than pg-src-16--.* by 18.6 percents (4.81 < 5.91): 
pg_tpch.query20
Average pg-src-17--.* better than pg-src-16--.* by 10.6 percents (5.30 < 5.93): 
pg_tpch.query20
Best pg-src-17--.* better than pg-src-16--.* by 5.9 percents (2.07 < 2.20): 
pg_tpcds.vacuum_freeze_time
Average pg-src-17--.* better than pg-src-16--.* by 5.9 percents (2.08 < 2.21): 
pg_tpcds.vacuum_freeze_time
Best pg-src-17--.* better than pg-src-16--.* by 6.9 percents (73.31 < 78.77): 
pg_tpcds.query92
Average pg-src-17--.* better than pg-src-16--.* by 7.2 percents (74.12 < 
79.90): pg_tpcds.query92
Best pg-src-17--.* better than pg-src-16--.* by 6.3 percents (600.38 < 640.59): 
pg_tpcds.query13
Average pg-src-17--.* better than pg-src-16--.* by 6.0 percents (604.42 < 
643.14): pg_tpcds.query13
Best pg-src-17--.* better than pg-src-16--.* by 7.3 percents (1324.73 < 
1429.52): pg_tpcds.query9
Average pg-src-17--.* better than pg-src-16--.* by 7.3 percents (1327.16 < 
1431.45): pg_tpcds.query9
Best pg-src-17--.* worse than pg-src-16--.* by 23.6 percents (192.25 > 155.58): 
pg_tpcds.query21
Average pg-src-17--.* worse than pg-src-16--.* by 25.1 percents (196.19 > 
156.85): pg_tpcds.query21
Best pg-src-17--.* better than pg-src-16--.* by 7.1 percents (4.35 < 4.68): 
pg_tpcds.query93
Best pg-src-17--.* worse than pg-src-16--.* by 54.3 percents (145.53 > 94.31): 
pg_tpcds.query16
Average pg-src-17--.* worse than pg-src-16--.* by 50.0 percents (146.84 > 
97.87): pg_tpcds.query16
Best pg-src-17--.* better than pg-src-16--.* by 5.8 percents (250.23 < 265.52): 
pg_tpcds.query68
Best pg-src-17--.* better than pg-src-16--.* by 5.8 percents (7.38 < 7.83): 
pg_tpcds.query37
Average pg-src-17--.* worse than pg-src-16--.* by 19.4 percents (9.50 > 7.95): 
pg_tpcds.query37
Best pg-src-17--.* worse than pg-src-16--.* by 24.2 percents (24269.21 > 
19539.89): pg_tpcds.query72
Average pg-src-1

Re: Evaluate arguments of correlated SubPlans in the referencing ExprState

2024-07-31 Thread Andres Freund
Hi,

On 2024-07-19 21:17:12 -0700, Andres Freund wrote:
> On 2024-07-18 16:01:19 -0400, Tom Lane wrote:
> > Alena Rybakina  writes:
> > > I fixed it. The code remains the same.
> > 
> > I see the cfbot is again complaining that this patch doesn't apply.
> > 
> > In hopes of pushing this over the finish line, I fixed up the (minor)
> > patch conflict and also addressed the cosmetic complaints I had
> > upthread [1].  I think the attached v4 is committable.  If Andres is
> > too busy, I can push it, but really it's his patch ...
> 
> Thanks for the rebase - I'll try to get it pushed in the next few days!

And finally done. No code changes. I did spend some more time verifying
that the resource usage benefits actually exist (see mail upthread).

Thanks for the reviews, rebasing and the reminders!

Greetings,

Andres




RE: Conflict detection and logging in logical replication

2024-07-31 Thread Zhijie Hou (Fujitsu)
On Wednesday, July 31, 2024 1:36 PM shveta malik  wrote:
> 
> On Wed, Jul 31, 2024 at 7:40 AM Zhijie Hou (Fujitsu) 
> 
> wrote:
> >
> > >
> > > 2)
> > > apply_handle_delete_internal()
> > >
> > > --Do we need to check "(!edata->mtstate || 
> > > edata->mtstate->operation != CMD_UPDATE)" in the else part as 
> > > well? Can there be a scenario where during update flow, it is 
> > > trying to delete from a partition and comes here, but till then 
> > > that row is deleted already and we end up raising 'delete_missing' 
> > > additionally instead of 'update_missing'
> > > alone?
> >
> > I think this shouldn't happen because the row to be deleted should 
> > have been locked before entering the apply_handle_delete_internal().
> > Actually, calling
> > apply_handle_delete_internal() for cross-partition update is a bit 
> > buggy because the row to be deleted has already been found in 
> > apply_handle_tuple_routing(), so we could have avoid scanning the 
> > tuple again. I have posted another patch to fix this issue in thread[1].
> 
> Thanks for the details.
> 
> >
> > Here is the V8 patch set. It includes the following changes:
> >
> 
> Thanks for the patch. I verified that all the bugs reported so far are 
> addressed.
> Few trivial comments:

Thanks for the comments!

> 
> 1)
> 029_on_error.pl:
> --I did not understand the intent of this change. The existing insert 
> would also have resulted in conflict (insert_exists) and we would have 
> identified and skipped that. Why change to UPDATE?
> 
>  $node_publisher->safe_psql(
>   'postgres',
>   qq[
>  BEGIN;
> -INSERT INTO tbl VALUES (1, NULL);
> +UPDATE tbl SET i = 2;
>  PREPARE TRANSACTION 'gtx';
>  COMMIT PREPARED 'gtx';
>  ]);
> 

The intention of this change is to cover the code path of update_exists.
The original test only tested the code of insert_exists.

> 
> 2)
> logical-replication.sgml
> --In doc, shall we have 'delete_differ' first and then 
> 'delete_missing', similar to what we have for update (first 
> 'update_differ' and then 'update_missing')
> 
> 3)
> logical-replication.sgml: "For instance, the origin in the above log 
> indicates that the existing row was modified by a local change."
> 
> --This clarification about origin was required when we had 'origin 0'
> in 'DETAILS'. Now we have "locally":
> "Key (c)=(1) already exists in unique index "t_pkey", which was 
> modified locally in transaction 740".
> 
> And thus shall we rephrase the concerned line ?

Fixed in the V9 patch set.

Best Regards,
Hou zj


Re: v17 vs v16 performance comparison

2024-07-31 Thread Tom Lane
Alexander Lakhin  writes:
> I've repeated the performance measurement for REL_17_STABLE (1e020258e)
> and REL_16_STABLE (6f6b0f193) and found several benchmarks where v16 is
> significantly better than v17. Please find attached an html table with
> all the benchmarking results.

Thanks for doing that!

I have no opinion about b7b0f3f27, but as far as this goes:

> But besides that, I've found a separate regression. Bisecting for this 
> degradation:
> Best pg-src-17--.* worse than pg-src-16--.* by 105.0 percents (356.63 > 
> 173.96): s64da_tpcds.query95
> Average pg-src-17--.* worse than pg-src-16--.* by 105.2 percents (357.79 > 
> 174.38): s64da_tpcds.query95
> pointed at f7816aec2.

I'm not terribly concerned about that.  The nature of planner changes
like that is that some queries will get worse and some better, because
the statistics and cost estimates we're dealing with are not perfect.
It is probably worth drilling down into that test case to understand
where the planner is going wrong, with an eye to future improvements;
but I doubt it's something we need to address for v17.

regards, tom lane




Re: Speed up JSON escape processing with SIMD plus other optimisations

2024-07-31 Thread David Rowley
On Sun, 28 Jul 2024 at 00:51, David Rowley  wrote:
> I did another round of testing on the SIMD patch (attached as v5-0001)
> as I wondered if the SIMD loop maybe shouldn't wait too long before
> copying the bytes to the destination string.  I had wondered if the
> JSON string was very large that if we looked ahead too far that by the
> time we flush those bytes out to the destination buffer, we'd have
> started eviction of L1 cachelines for parts of the buffer that are
> still to be flushed.  I put this to the test (test 3) and found that
> with a 1MB JSON string it is faster to flush every 512 bytes than it
> is to only flush after checking the entire 1MB.  With a 10kB JSON
> string (test 2), the extra code to flush every 512 bytes seems to slow
> things down.

I'd been wondering why test 2 (10KB) with v5-0001
ESCAPE_JSON_MAX_LOOKHEAD 512 was not better than v5-0001.  It occurred
to me that when flushing the buffer every 512 bytes,
enlargeStringInfo() is called more often in proportion to the string's
length for 10KB than for 1MB. That causes more repalloc/memcpy work
in stringinfo.c.

We can reduce the repalloc/memcpy work by calling enlargeStringInfo()
once at the beginning of escape_json_with_len().  We already know the
minimum length we're going to append so we might as well do that.

After making that change, doing the 512-byte flushing no longer slows
down test 2.

Here are the results of testing v6-0001. I've added test 4, which
tests a very short string to ensure there are no performance
regressions when we can't do SIMD. Test 2 patched came out 3.74x
faster than master.

## Test 1:
echo "select row_to_json(j1)::jsonb from j1;" > test1.sql
for i in {1..3}; do pgbench -n -f test1.sql -T 10 -M prepared postgres
| grep tps; done

master @ e6a963748:
tps = 339.560611
tps = 344.649009
tps = 343.246659

v6-0001:
tps = 610.734018
tps = 628.297298
tps = 630.028225

v6-0001 ESCAPE_JSON_MAX_LOOKHEAD 512:
tps = 557.562866
tps = 626.476618
tps = 618.665045

## Test 2:
echo "select row_to_json(j2)::jsonb from j2;" > test2.sql
for i in {1..3}; do pgbench -n -f test2.sql -T 10 -M prepared postgres
| grep tps; done

master @ e6a963748:
tps = 25.633934
tps = 18.580632
tps = 25.395866

v6-0001:
tps = 89.325752
tps = 91.277016
tps = 86.289533

v6-0001 ESCAPE_JSON_MAX_LOOKHEAD 512:
tps = 85.194479
tps = 90.054279
tps = 85.483279

## Test 3:
echo "select row_to_json(j3)::jsonb from j3;" > test3.sql
for i in {1..3}; do pgbench -n -f test3.sql -T 10 -M prepared postgres
| grep tps; done

master @ e6a963748:
tps = 18.863420
tps = 18.866374
tps = 18.791395

v6-0001:
tps = 38.990681
tps = 37.893820
tps = 38.057235

v6-0001 ESCAPE_JSON_MAX_LOOKHEAD 512:
tps = 46.076842
tps = 46.400413
tps = 46.165491

## Test 4:
echo "select row_to_json(j4)::jsonb from j4;" > test4.sql
for i in {1..3}; do pgbench -n -f test4.sql -T 10 -M prepared postgres
| grep tps; done

master @ e6a963748:
tps = 1700.888458
tps = 1684.753818
tps = 1690.262772

v6-0001:
tps = 1721.821561
tps = 1699.189207
tps = 1663.618117

v6-0001 ESCAPE_JSON_MAX_LOOKHEAD 512:
tps = 1701.565562
tps = 1706.310398
tps = 1687.585128

I'm pretty happy with this now, so I'd like to commit it and move on
to other work.  Doing "#define ESCAPE_JSON_MAX_LOOKHEAD 512" seems
like the right thing. If anyone else wants to verify my results or
take a look at the patch, please do so.

David


setup.sql
Description: Binary data


v6-0001-Optimize-escaping-of-JSON-strings-using-SIMD.patch
Description: Binary data


Re: Remove last traces of HPPA support

2024-07-31 Thread Thomas Munro
On Tue, Jul 30, 2024 at 12:39 PM Thomas Munro  wrote:
> On Tue, Jul 30, 2024 at 11:16 AM Heikki Linnakangas  wrote:
> > I think we should do:
> >
> > #ifdef _M_AMD64
> > #define __x86_64__
> > #endif
> >
> > somewhere, perhaps in src/include/port/win32.h.

I suppose we could define our own
PG_ARCH_{ARM,MIPS,POWER,RISCV,S390,SPARC,X86}_{32,64} in one central
place, instead.  Draft patch for illustration.
From 4df24c6fe7370cdeacd4e794f4ccc6202a909e62 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 1 Aug 2024 16:38:05 +1200
Subject: [PATCH] Standardize macros for detecting architectures.

Instead of repeating multiple compilers' architecture macros throughout
the tree, detect them in one central place, and define our own macros of
the form:

  PG_ARCH_{ARM,MIPS,POWER,RISCV,S390,SPARC,X86}_{32,64}

This fixes the problem that MSVC builds were unintentionally using
suboptimal fallback code defined by "port/atomics.h", due to
inconsistent testing for macros.  A couple of other places were also
affected.

XXX This patch doesn't adjust s_lock.h, because it's complicated, full
of old dead sub-architectures, and a nearby patch proposes to delete
it...

Discussion: https://postgr.es/m/CA%2BhUKGKAf_i6w7hB_3pqZXQeqn%2BixvY%2BCMps_n%3DmJ5HAatMjMw%40mail.gmail.com
---
 contrib/pgcrypto/crypt-blowfish.c   |  4 ++--
 src/include/c.h | 33 +
 src/include/port/atomics.h  |  6 +++---
 src/include/port/atomics/arch-x86.h | 16 +++---
 src/include/port/pg_bitutils.h  |  6 +++---
 src/include/port/simd.h |  2 +-
 src/include/storage/s_lock.h|  2 +-
 src/port/pg_crc32c_sse42.c  |  4 ++--
 8 files changed, 53 insertions(+), 20 deletions(-)

diff --git a/contrib/pgcrypto/crypt-blowfish.c b/contrib/pgcrypto/crypt-blowfish.c
index 5a1b1e10091..9c4e02e428b 100644
--- a/contrib/pgcrypto/crypt-blowfish.c
+++ b/contrib/pgcrypto/crypt-blowfish.c
@@ -38,10 +38,10 @@
 #include "px-crypt.h"
 #include "px.h"
 
-#ifdef __i386__
+#if defined(PG_ARCH_X86_32)
 #define BF_ASM0	/* 1 */
 #define BF_SCALE			1
-#elif defined(__x86_64__)
+#elif defined(PG_ARCH_X86_64)
 #define BF_ASM0
 #define BF_SCALE			1
 #else
diff --git a/src/include/c.h b/src/include/c.h
index dc1841346cd..542cbd33fad 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -425,6 +425,39 @@ typedef void (*pg_funcptr_t) (void);
 #define HAVE_PRAGMA_GCC_SYSTEM_HEADER	1
 #endif
 
+/*
+ * Project-standardized name for CPU architectures, to avoid having to repeat
+ * the names that different compilers use.
+ */
+#if defined(__arm__) || defined(__arm)
+#define PG_ARCH_ARM_32
+#elif defined(__aarch64__) || defined(_M_ARM64)
+#define PG_ARCH_ARM_64
+#elif defined(__mips__)
+#define PG_ARCH_MIPS_32
+#elif defined(__mips64__)
+#define PG_ARCH_MIPS_64
+#elif defined(__ppc__) || defined(__powerpc__)
+#define PG_ARCH_POWER_32
+#elif defined(__ppc64__) || defined(__powerpc64__)
+#define PG_ARCH_POWER_64
+#elif defined(__riscv__)
+#define PG_ARCH_RISCV_32
+#elif defined(__riscv64__)
+#define PG_ARCH_RISCV_64
+#elif defined(__s390__)
+#define PG_ARCH_S390_32
+#elif defined(__s390x__)
+#define PG_ARCH_S390_64
+#elif defined(__sparc)
+#define PG_ARCH_SPARC_32
+#elif defined(__sparcv9)
+#define PG_ARCH_SPARC_64
+#elif defined(__i386__) || defined (__i386) || defined(_M_IX86)
+#define PG_ARCH_X86_32
+#elif defined(__x86_64__) || defined(__x86_64) || defined (__amd64)
+#define PG_ARCH_X86_64
+#endif
 
 /* 
  *Section 2:	bool, true, false
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index c0c8688f736..3300ea54c17 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -63,11 +63,11 @@
  * compiler barrier.
  *
  */
-#if defined(__arm__) || defined(__arm) || defined(__aarch64__)
+#if defined(PG_ARCH_ARM_32) || defined(PG_ARCH_ARM_64)
 #include "port/atomics/arch-arm.h"
-#elif defined(__i386__) || defined(__i386) || defined(__x86_64__)
+#elif defined(PG_ARCH_X86_32) || defined(PG_ARCH_X86_64)
 #include "port/atomics/arch-x86.h"
-#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
+#elif defined(PG_ARCH_POWER_32) || defined (PG_ARCH_POWER_64)
 #include "port/atomics/arch-ppc.h"
 #endif
 
diff --git a/src/include/port/atomics/arch-x86.h b/src/include/port/atomics/arch-x86.h
index c12f8a60697..9f20edf221f 100644
--- a/src/include/port/atomics/arch-x86.h
+++ b/src/include/port/atomics/arch-x86.h
@@ -32,10 +32,10 @@
  */
 
 #if defined(__GNUC__) || defined(__INTEL_COMPILER)
-#if defined(__i386__) || defined(__i386)
+#if defined(PG_ARCH_X86_32)
 #define pg_memory_barrier_impl()		\
 	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory", "cc")
-#elif defined(__x86_64__)
+#elif defined(PG_ARCH_X86_64)
 #define pg_memory_barrier_impl()		\
 	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory", "cc")
 #endif
@@ -67,14 +67,14 @@ typedef str

RE: [BUG?] check_exclusion_or_unique_constraint false negative

2024-07-31 Thread Hayato Kuroda (Fujitsu)
Dear Michail,

Thanks for pointing out the issue!

>* RelationFindReplTupleByIndex
>
>Amit, this is why I've included you in this previously solo thread :)
>RelationFindReplTupleByIndex uses DirtySnapshot and may not find some records
>if they are updated by a parallel transaction. This could lead to lost
>deletes/updates, especially in the case of streaming=parallel mode. 
>I'm not familiar with how parallel workers apply transactions, so maybe this
>isn't possible.

IIUC, the issue can happen when two concurrent transactions using DirtySnapshot 
access
the same tuples, which is not specific to the parallel apply. Consider that two
subscriptions exist and publishers modify the same tuple of the same table.
In this case, two workers access the tuple, so one of the changes may be missed
in the scenario you described. I feel we do not need special treatment for
parallel apply.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: v17 vs v16 performance comparison

2024-07-31 Thread Thomas Munro
On Thu, Aug 1, 2024 at 3:00 PM Alexander Lakhin  wrote:
> So it looks like q15 from TPC-DS is not the only query suffering from that
> change.

I'm going to try to set up a local repro to study these new cases.  If
you have a write-up somewhere of how exactly you run that, that'd be
useful.




Re: Use pgBufferUsage for block reporting in analyze

2024-07-31 Thread Anthonin Bonnefoy
On Wed, Jul 31, 2024 at 9:36 PM Masahiko Sawada  wrote:
> Meanwhile, I think we can push 0001 and 0002 patches since they are in
> good shape. I've updated commit messages to them and slightly changed
> 0002 patch to write "finished analyzing of table \"%s.%s.%s\" instead
> of  "analyze of table \"%s.%s.%s\".

Wouldn't it make sense to do the same for autoanalyze and write
"finished automatic analyze of table \"%s.%s.%s\"\n" instead of
"automatic analyze of table \"%s.%s.%s\"\n"?

> Also, regarding 0003 patch, what is the main reason why we want to add
> WAL usage to analyze reports? I think that analyze normally does not
> write WAL records much so I'm not sure it's going to provide a good
> insight for users.

There was no strong reason except for consistency with VACUUM VERBOSE
output. But as you said, it's not really providing valuable
information so it's probably better to keep the noise down and drop
it.

Regards,
Anthonin Bonnefoy




Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

2024-07-31 Thread Michael Paquier
On Wed, Jul 31, 2024 at 04:49:54PM +0300, Alexander Korotkov wrote:
> On Mon, Jul 29, 2024 at 7:20 PM Robert Haas  wrote:
>> I support that idea in general but felt it was overkill in this case:
>> it's new code, and there was only one existing caller of the function
>> that got refactored, and I'm not a huge fan of cluttering the git
>> history with a bunch of tiny little refactoring commits to fix a
>> single bug. I might have changed it if I'd seen this note before
>> committing, though.
> 
> I understand your point.  I'm also not huge fan of a flood of small
> commits.  Nevertheless, I find splitting refactoring from other
> changes generally useful.  That could be a single commit of many small
> refactorings, not many small commits.  The point for me is easier
> review: you can expect a refactoring commit to contain "isomorphic"
> changes, while other commits implement material logic changes.

For review, it also tends to matter a lot to me, especially if the
same areas of code are changed across multiple commits.  That's more
work for authors, as the splits are annoying to maintain.  For a
single caller introduced, what Robert has done is fine IMO.

> But that might be a committer preference though.

I tend to prefer refactorings if it comes to a cleaner git history,
still that's always case-by-case, and all of us have our own habits.
--
Michael

