date:20240729

On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, July 26, 2024 7:34 PM Amit Kapila  wrote:
> >
> > On Thu, Jul 25, 2024 at 4:12 PM Amit Kapila 
> > wrote:
> > > >
> > A few more comments:
>
> Thanks for the comments.
>
> > 1.
> > For duplicate key, the patch reports conflict as following:
> > ERROR:  conflict insert_exists detected on relation "public.t1"
> > 2024-07-26 11:06:34.570 IST [27800] DETAIL:  Key (c1)=(1) already exists in
> > unique index "t1_pkey", which was modified by origin 1 in transaction 770 at
> > 2024-07-26 09:16:47.79805+05:30.
> > 2024-07-26 11:06:34.570 IST [27800] CONTEXT:  processing remote data for
> > replication origin "pg_16387" during message type "INSERT" for replication
> > target relation "public.t1" in transaction 742, finished at 0/151A108
> >
> > In detail, it is better to display the origin name instead of the origin 
> > id. This will
> > be similar to what we do in CONTEXT information.
>
>
> Agreed. Before modifying this, I'd like to confirm the message style in the
> cases where origin id may not have a corresponding origin name (e.g., if the
> data was modified locally (id = 0), or if the origin that modified the data 
> has
> been dropped). I thought of two styles:
>
> 1)
> - for local change: "xxx was modified by a different origin \"(local)\" in 
> transaction 123 at 2024.."
> - for dropped origin: "xxx was modified by a different origin \"(unknown)\" 
> in transaction 123 at 2024.."
>
> One issue for this style is that user may create an origin with the same name
> here (e.g. "(local)" and "(unknown)").
>
> 2)
> - for local change: "xxx was modified locally in transaction 123 at 2024.."
>

This sounds good.

> - for dropped origin: "xxx was modified by an unknown different origin 1234 
> in transaction 123 at 2024.."
>

For this one, how about: "xxx was modified by a non-existent origin in
transaction 123 at 2024.."?

Also, in code please do write comments when each of these two can occur.

-- 
With Regards,
Amit Kapila.

Re: tls 1.3: sending multiple tickets

2024-07-29 Thread Daniel Gustafsson

> On 26 Jul 2024, at 20:29, Robert Haas  wrote:

> One of my chronic complaints about comments is
> that they should say why we're doing things, not what we're doing.

Agreed.

> I feel like any
> place where we are doing X because of some property of a non-PG code
> base with which a particular reader might not be familiar, we should
> have a comment explaining why we're doing it. And especially if it's
> security-relevant.

I'm sure there are more interactions with OpenSSL, and TLS in general, which
warrants better comments but the attached takes a stab at the two examples in
question here to get started (to avoid perfect get in the way of progress). 

--
Daniel Gustafsson

openssl_comments.diff
Description: Binary data

Detach shared memory in Postmaster child if not needed

2024-07-29 Thread Rui Zhao

Prior to PG16, postmaster children would manually detach from shared memory
if it was not needed. However, this behavior was removed in fork mode in
commit aafc05d.
Detaching shared memory when it is no longer needed is beneficial, as 
postmaster children (like syslogger) don't wish to take any risk of
accidentally corrupting shared memory. Additionally, any panic in these
processes will not reset shared memory.
The attached patch addresses this issue by detaching shared memory after
fork_process().
Best regard,
Rui Zhao


0001-Detach-shared-memory-in-Postmaster-child-if-not-needed.patch
Description: Binary data

Re: Add ALL_CANDIDATES option to EXPLAIN

2024-07-29 Thread Ashutosh Bapat

On Fri, Jul 26, 2024 at 10:47 PM Robert Haas  wrote:
>
> On Fri, Jul 26, 2024 at 12:59 PM Anthonin Bonnefoy
>  wrote:
> > I have a prototype for an ALL_CANDIDATES option for EXPLAIN. The goal
> > of this option is to print all plan candidates instead of only the
> > cheapest plan. It will output the plans from the most expensive at the
> > top to the cheapest. Here's an example:
>
> It's difficult for me to understand how this can work. Either it's not
> really going to print out all candidates, or it's going to print out
> gigabytes of output for moderately complex queries.
>
> I've thought about trying to figure out some way of identifying and
> printing out plans that are "interestingly different" from the chosen
> plan, with the costs they would have had, but I haven't been able to
> come up with a good algorithm. Printing out absolutely everything
> doesn't seem viable, because planning would be slow and use amazing
> amounts of memory and the output would be so large as to be useless.

If we print the path forest as a forest as against individual path
trees, we will be able to cut down on the size but it will still be
huge. Irrespective of that even with slightly non-trivial queries it's
going to be difficult to analyze these paths. The way I think of it is
dumping this information in the form of tables. Roughly something like
a table containing RelOptInfo id and RelOptInfo itself and another
containing all the paths identified by id and RelOptInfo id. The path
linkages are stored as path ids. That's a minimum. We will need more
tables to store query, and other metadata. If we do so we can use SQL
to carry out investigations.

-- 
Best Wishes,
Ashutosh Bapat

Re: Remove duplicate table scan in logical apply worker and code refactoring

On Thu, Jul 25, 2024 at 4:00 PM Zhijie Hou (Fujitsu)
 wrote:
>
> When reviewing the code in logical/worker.c, I noticed that when applying a
> cross-partition update action, it scans the old partition twice.
> I am attaching the patch 0001 to remove this duplicate table scan.
>
> The test shows that it brings noticeable improvement:
>
> Steps
> -
> Pub:
> create table tab (a int not null, b int);
> alter table tab replica identity full;
> insert into tab select 1,generate_series(1, 100, 1);
>
> Sub:
> create table tab (a int not null, b int) partition by range (b);
> create table tab_1 partition of tab for values from (minvalue) to (500);
> create table tab_2 partition of tab for values from (500) to (maxvalue);
> alter table tab replica identity full;
>
>
> Test query:
> update tab set b = 600 where b > 00; -- UPDATE 100
>
> Results (The time spent by apply worker to apply the all the UPDATEs):
> Before  14s
> After   7s
> -
>

The idea sounds good to me. BTW, we don't need the following comment
in the 0001 patch:
We
+ * don't call apply_handle_delete_internal() here to avoid
+ * repeating some work already done above to find the
+ * local tuple in the partition.

It is implied by the change and we already follow the same for the update.

-- 
With Regards,
Amit Kapila.

Re: Logical Replication of sequences

2024-07-29 Thread vignesh C

On Thu, 25 Jul 2024 at 15:41, shveta malik  wrote:
>
> On Thu, Jul 25, 2024 at 12:08 PM shveta malik  wrote:
> >
> > On Thu, Jul 25, 2024 at 9:06 AM vignesh C  wrote:
> > >
> > > The attached v20240725 version patch has the changes for the same.
> >
> > Thank You for addressing the comments. Please review below issues:
> >
> > 1) Sub ahead of pub due to wrong initial sync of last_value for
> > non-incremented sequences. Steps at [1]
> > 2) Sequence's min value is not honored on sub during replication. Steps at 
> > [2]
>
> One more issue:
> 3)  Sequence datatype's range is not honored on sub during
> replication, while it is honored for tables.
>
>
> Behaviour for tables:
> -
> Pub: create table tab1( i integer);
> Sub: create table tab1( i smallint);
>
> Pub: insert into tab1 values(generate_series(1, 32768));
>
> Error on sub:
> 2024-07-25 10:38:06.446 IST [178680] ERROR:  value "32768" is out of
> range for type smallint
>
> -
> Behaviour for sequences:
> -
>
> Pub:
> CREATE SEQUENCE myseq_i as integer INCREMENT 1 START 1;
>
> Sub:
> CREATE SEQUENCE myseq_i as smallint INCREMENT 1 START 1;
>
> Pub:
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i'); -->brings value to 40001
>
> Sub:
> ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION SEQUENCES;
> SELECT * from pg_sequences;  -->last_val reached till 40001, while the
> range is till  32767.

This issue is addressed in the v20240730_2 version patch attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm3%2BXzHAbgyn8gmbBLK5goyv_uyGgHEsTQxRZ8bVk6nAEg%40mail.gmail.com

Regards,
Vignesh

Fix smgrtruncate code comment.

2024-07-29 Thread Kirill Reshke

Today I was busy rebasing my Greenplum-related Extendable SMGR
patches on Cloudberry, and I faced some conflicts.
While resolving them I noticed code & comments inconsistency in smgr.c
in smgrtruncate function, which tracks down to
c5315f4f44843c20ada876fdb0d0828795dfbdf5. In this commit,
smgr_fsm_nblocks & smgr_vm_nblocks fields were removed, however
comments were not fixed accordingly.

So i suggest to fix this, PFA


v1-0001-Fix-smgrtruncate-code-comment.patch
Description: Binary data

Re: why is pg_upgrade's regression run so slow?

2024-07-29 Thread Andrew Dunstan




On 2024-07-29 Mo 4:00 AM, Alexander Lakhin wrote:

Hello Andrew,

28.07.2024 22:43, Andrew Dunstan wrote:


Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on 
HEAD for fairywren and drongo -  fairwren has just gone green, and I 
expect drongo will when it reports in a few hours.


I'm at a loss for an explanation.



I'm observing the same here (using a Windows 10 VM).

With no TEMP_CONFIG set, `meson test` gives me these numbers:
test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 72.34s

test: postgresql:regress / regress/regress
duration: 41.98s

But with debug_parallel_query=regress in TEMP_CONFIG, I see:
test: postgresql:regress / regress/regress
duration: 50.08s

test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 398.45s

With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many
parallel workers were started during the test:
...\postgresql\build>grep "starting background worker process" -r 
testrun/pg_upgrade | wc -l

17532

And another interesting fact is that TEMP_CONFIG is apparently ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.




Well, that last fact explains the discrepancy I originally complained 
about. How the heck did that happen? It looks like we just ignored its 
use in Makefile.global.in :-(



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: [PATCH] Add crc32(text) & crc32(bytea)

2024-07-29 Thread Aleksander Alekseev

Hi Nathan,

> I don't think adding crc32c() would sufficiently increase the scope.  We'd
> use the existing implementations for both crc32() and crc32c().  And
> besides, this could be useful for adding tests for that code.
>
> +crc32 ( text )
>
> Do we need a version of the function that takes a text input?  It's easy
> enough to cast to a bytea.
>
> +text
>
> My first reaction is that we should just have this return bytea like the
> SHA ones do, if for no other reason than commit 10cfce3 seems intended to
> move us away from returning text for these kinds of functions.  Upthread,
> you mentioned the possibility of returning a bigint, too.  I think I'd
> still prefer bytea in case we want to add, say, crc64() or crc16() in the
> future.  That would allow us to keep all of these functions consistent
> instead of returning different types for each.  However, I understand that
> returning the numeric types might be more convenient.  I'm curious what
> others think about this.
>
> +Computes the CRC32 hash of
> +the binary string, with the result written in hexadecimal.
>
> I'm not sure we should call the check values "hashes."  Wikipedia does
> include them in the "List of hash functions" page [0], but it seems to
> deliberately avoid calling them hashes in the CRC page [1].  I'd suggest
> calling them "CRC32 values" instead.

Thanks for the code review. Here is the updated patch.

-- 
Best regards,
Aleksander Alekseev


v2-0001-Add-crc32-bytea-crc32c-bytea.patch
Description: Binary data

Re: Conflict detection and logging in logical replication

2024-07-29 Thread Dilip Kumar

On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu)
 wrote:
>

I was going through v7-0001, and I have some initial comments.

@@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo
*resultRelInfo, TupleTableSlot *slot,
  ExprContext *econtext;
  Datum values[INDEX_MAX_KEYS];
  bool isnull[INDEX_MAX_KEYS];
- ItemPointerData invalidItemPtr;
  bool checkedIndex = false;

  ItemPointerSetInvalid(conflictTid);
- ItemPointerSetInvalid(&invalidItemPtr);

  /*
  * Get information from the result relation info structure.
@@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo
*resultRelInfo, TupleTableSlot *slot,

  satisfiesConstraint =
  check_exclusion_or_unique_constraint(heapRelation, indexRelation,
- indexInfo, &invalidItemPtr,
+ indexInfo, &slot->tts_tid,
  values, isnull, estate, false,
  CEOUC_WAIT, true,
  conflictTid);

What is the purpose of this change?  I mean
'check_exclusion_or_unique_constraint' says that 'tupleid'
should be invalidItemPtr if the tuple is not yet inserted and
ExecCheckIndexConstraints is called by ExecInsert
before inserting the tuple.  So what is this change? Would this change
ExecInsert's behavior as well?




+ReCheckConflictIndexes(ResultRelInfo *resultRelInfo, EState *estate,
+ConflictType type, List *recheckIndexes,
+TupleTableSlot *slot)
+{
+ /* Re-check all the unique indexes for potential conflicts */
+ foreach_oid(uniqueidx, resultRelInfo->ri_onConflictArbiterIndexes)
+ {
+ TupleTableSlot *conflictslot;
+
+ if (list_member_oid(recheckIndexes, uniqueidx) &&
+ FindConflictTuple(resultRelInfo, estate, uniqueidx, slot, &conflictslot))
+ {
+ RepOriginId origin;
+ TimestampTz committs;
+ TransactionId xmin;
+
+ GetTupleCommitTs(conflictslot, &xmin, &origin, &committs);
+ ReportApplyConflict(ERROR, type, resultRelInfo->ri_RelationDesc, uniqueidx,
+ xmin, origin, committs, conflictslot);
+ }
+ }
+}
 and

+ conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes;
+
  if (resultRelInfo->ri_NumIndices > 0)
  recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-slot, estate, false, false,
-NULL, NIL, false);
+slot, estate, false,
+conflictindexes ? true : false,
+&conflict,
+conflictindexes, false);
+
+ /*
+ * Rechecks the conflict indexes to fetch the conflicting local tuple
+ * and reports the conflict. We perform this check here, instead of
+ * perform an additional index scan before the actual insertion and
+ * reporting the conflict if any conflicting tuples are found. This is
+ * to avoid the overhead of executing the extra scan for each INSERT
+ * operation, even when no conflict arises, which could introduce
+ * significant overhead to replication, particularly in cases where
+ * conflicts are rare.
+ */
+ if (conflict)
+ ReCheckConflictIndexes(resultRelInfo, estate, CT_INSERT_EXISTS,
+recheckIndexes, slot);


 This logic is confusing, first, you are calling
ExecInsertIndexTuples() with no duplicate error for the indexes
present in 'ri_onConflictArbiterIndexes' which means
 the indexes returned by the function must be a subset of
'ri_onConflictArbiterIndexes' and later in ReCheckConflictIndexes()
you are again processing all the
 indexes of 'ri_onConflictArbiterIndexes' and checking if any of these
is a subset of the indexes that is returned by
ExecInsertIndexTuples().

 Why are we doing that, I think we can directly use the
'recheckIndexes' which is returned by ExecInsertIndexTuples(), and
those indexes are guaranteed to be a subset of
 ri_onConflictArbiterIndexes. No?

 ---
 ---


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Remove dead code generation tools in src/backend/utils/mb/

These programs in src/backend/utils/mb/ are unused, and have been unused 
and unusable since 2003:


iso.c
win1251.c
win866.c

Attached patch removes them. See commit message for a little more 
detailed analysis.


--
Heikki Linnakangas
Neon (https://neon.tech)From 2c2c82757749534bd47348e5385cbb2b8363bf5b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 29 Jul 2024 14:12:28 +0300
Subject: [PATCH 1/1] Remove dead generators for cyrillic encoding converson
 tables

These tools were used to read the koi-iso.tab, koi-win.tab, and
koi-alt.tab files, which contained the mappings between the
single-byte cyrillic encodings. However, those data files were removed
in commit 4c3c8c048d, back in 2003. These code generators have been
unused and unusable ever since.

The generated tables live in cyrillic_and_mic.c. There has been one
change to the tables since they were generated in 1999, in commit
f4b7624eb07a. So if we resurrected the original data tables, that
change would need to be taken into account.

So this code is very dead. The tables in cyrillic_and_mic.c, which
were originally generated by these tools, are now the authoritative
source for these mappings.

Discussion: xxx
---
 src/backend/utils/mb/README|  3 --
 src/backend/utils/mb/iso.c | 74 --
 src/backend/utils/mb/win1251.c | 74 --
 src/backend/utils/mb/win866.c  | 74 --
 4 files changed, 225 deletions(-)
 delete mode 100644 src/backend/utils/mb/iso.c
 delete mode 100644 src/backend/utils/mb/win1251.c
 delete mode 100644 src/backend/utils/mb/win866.c

diff --git a/src/backend/utils/mb/README b/src/backend/utils/mb/README
index ef366268913..4447881ead0 100644
--- a/src/backend/utils/mb/README
+++ b/src/backend/utils/mb/README
@@ -8,9 +8,6 @@ mbutils.c:	public functions for the backend only.
 stringinfo_mb.c: public backend-only multibyte-aware stringinfo functions
 wstrcmp.c:	strcmp for mb
 wstrncmp.c:	strncmp for mb
-win866.c:	a tool to generate KOI8 <--> CP866 conversion table
-iso.c:		a tool to generate KOI8 <--> ISO8859-5 conversion table
-win1251.c:	a tool to generate KOI8 <--> CP1251 conversion table
 
 See also in src/common/:
 
diff --git a/src/backend/utils/mb/iso.c b/src/backend/utils/mb/iso.c
deleted file mode 100644
index d5dae56339f..000
--- a/src/backend/utils/mb/iso.c
+++ /dev/null
@@ -1,74 +0,0 @@
-/*
- * make KOI8->ISO8859-5 and ISO8859-5->KOI8 translation table
- * from koi-iso.tab.
- *
- * Tatsuo Ishii
- *
- * src/backend/utils/mb/iso.c
- */
-
-#include 
-
-
-main()
-{
-	int			i;
-	char		koitab[128],
-isotab[128];
-	char		buf[4096];
-	int			koi,
-iso;
-
-	for (i = 0; i < 128; i++)
-		koitab[i] = isotab[i] = 0;
-
-	while (fgets(buf, sizeof(buf), stdin) != NULL)
-	{
-		if (*buf == '#')
-			continue;
-		sscanf(buf, "%d %x", &koi, &iso);
-		if (koi < 128 || koi > 255 || iso < 128 || iso > 255)
-		{
-			fprintf(stderr, "invalid value %d\n", koi);
-			exit(1);
-		}
-		koitab[koi - 128] = iso;
-		isotab[iso - 128] = koi;
-	}
-
-	i = 0;
-	printf("static char koi2iso[] = {\n");
-	while (i < 128)
-	{
-		int			j = 0;
-
-		while (j < 8)
-		{
-			printf("0x%02x", koitab[i++]);
-			j++;
-			if (i >= 128)
-break;
-			printf(", ");
-		}
-		printf("\n");
-	}
-	printf("};\n");
-
-	i = 0;
-	printf("static char iso2koi[] = {\n");
-	while (i < 128)
-	{
-		int			j = 0;
-
-		while (j < 8)
-		{
-			printf("0x%02x", isotab[i++]);
-			j++;
-			if (i >= 128)
-break;
-			printf(", ");
-		}
-		printf("\n");
-	}
-	printf("};\n");
-}
diff --git a/src/backend/utils/mb/win1251.c b/src/backend/utils/mb/win1251.c
deleted file mode 100644
index 75129e6eff8..000
--- a/src/backend/utils/mb/win1251.c
+++ /dev/null
@@ -1,74 +0,0 @@
-/*
- * make KOI8->CP1251(win-1251) and CP1251(win-1251)->KOI8 translation table
- * from koi-win.tab.
- *
- * Tatsuo Ishii
- *
- * src/backend/utils/mb/win1251.c
- */
-
-#include 
-
-
-main()
-{
-	int			i;
-	char		koitab[128],
-wintab[128];
-	char		buf[4096];
-	int			koi,
-win;
-
-	for (i = 0; i < 128; i++)
-		koitab[i] = wintab[i] = 0;
-
-	while (fgets(buf, sizeof(buf), stdin) != NULL)
-	{
-		if (*buf == '#')
-			continue;
-		sscanf(buf, "%d %d", &koi, &win);
-		if (koi < 128 || koi > 255 || win < 128 || win > 255)
-		{
-			fprintf(stderr, "invalid value %d\n", koi);
-			exit(1);
-		}
-		koitab[koi - 128] = win;
-		wintab[win - 128] = koi;
-	}
-
-	i = 0;
-	printf("static char koi2win[] = {\n");
-	while (i < 128)
-	{
-		int			j = 0;
-
-		while (j < 8)
-		{
-			printf("0x%02x", koitab[i++]);
-			j++;
-			if (i >= 128)
-break;
-			printf(", ");
-		}
-		printf("\n");
-	}
-	printf("};\n");
-
-	i = 0;
-	printf("static char win2koi[] = {\n");
-	while (i < 128)
-	{
-		int			j = 0;
-
-		while (j < 8)
-		{
-			printf("0x%02x", wintab[i++]);
-			j++;
-			if (i >= 128)
-break;
-			printf(", ");
-		}
-		printf("\n");
-	}
-	printf("};\n");
-}
diff --git a/src/backend/utils/mb/win866.c b/src/backend/

Re: Fix smgrtruncate code comment.


On 29/07/2024 13:50, Kirill Reshke wrote:

Today I was busy rebasing my Greenplum-related Extendable SMGR
patches on Cloudberry, and I faced some conflicts.
While resolving them I noticed code & comments inconsistency in smgr.c
in smgrtruncate function, which tracks down to
c5315f4f44843c20ada876fdb0d0828795dfbdf5. In this commit,
smgr_fsm_nblocks & smgr_vm_nblocks fields were removed, however
comments were not fixed accordingly.

So i suggest to fix this, PFA


Applied, thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Remove duplicate table scan in logical apply worker and code refactoring

On Thu, Jul 25, 2024 at 5:56 PM Kirill Reshke  wrote:
>
> > +/*
> > + * If the tuple to be modified could not be found, a log message is 
> > emitted.
> > + */
> > +static void
> > +report_tuple_not_found(CmdType cmd, Relation targetrel, bool is_partition)
> > +{
> > + Assert(cmd == CMD_UPDATE || cmd == CMD_DELETE);
> > +
> > + /* XXX should this be promoted to ereport(LOG) perhaps? */
> > + elog(DEBUG1,
> > + "logical replication did not find row to be %s in replication target 
> > relation%s \"%s\"",
> > + cmd == CMD_UPDATE ? "updated" : "deleted",
> > + is_partition ? "'s partition" : "",
> > + RelationGetRelationName(targetrel));
> > +}
>
> Encapsulating report logic inside function is ok,
>

This could even be a separate patch as it is not directly to other
parts of the 0002 patch. BTW, the problem statement for 0002 is not
explicitly stated like which part of the code we want to optimize by
removing duplication. Also, as proposed the name
apply_handle_tuple_routing() for the function doesn't seem suitable as
it no longer remains similar to other apply_handle_* functions where
we perform the required operation like insert or update the tuple. How
about naming it as apply_tuple_routing()?

-- 
With Regards,
Amit Kapila.

Re: Detach shared memory in Postmaster child if not needed

2024-07-29 Thread Aleksander Alekseev

Hi Rui,

> Prior to PG16, postmaster children would manually detach from shared memory
> if it was not needed. However, this behavior was removed in fork mode in
> commit aafc05d.
>
> Detaching shared memory when it is no longer needed is beneficial, as
> postmaster children (like syslogger) don't wish to take any risk of
> accidentally corrupting shared memory. Additionally, any panic in these
> processes will not reset shared memory.
>
> The attached patch addresses this issue by detaching shared memory after
> fork_process().

Thanks for the patch. How do you estimate its performance impact?

Note the comments for postmaster_child_launch(). This function is
exposed to the third-party code and guarantees to attach shared
memory. I doubt that there is much third-party code in existence to
break but you should change to comment.

-- 
Best regards,
Aleksander Alekseev

Re: Conflict detection and logging in logical replication

2024-07-29 Thread shveta malik

On Mon, Jul 29, 2024 at 9:31 AM Amit Kapila  wrote:
>
> On Fri, Jul 26, 2024 at 4:28 PM shveta malik  wrote:
> >
> > On Fri, Jul 26, 2024 at 3:56 PM Amit Kapila  wrote:
> > >
> >
> > > One more thing we need to consider is whether we should LOG or ERROR
> > > for update/delete_differ conflicts. If we LOG as the patch is doing
> > > then we are intentionally overwriting the row when the user may not
> > > expect it. OTOH, without a patch anyway we are overwriting, so there
> > > is an argument that logging by default is what the user will expect.
> > > What do you think?
> >
> > I was under the impression that in this patch we do not intend to
> > change behaviour of HEAD and thus only LOG the conflict wherever
> > possible.
> >
>
> Earlier, I thought it was good to keep LOGGING the conflict where
> there is no chance of wrong data update but for cases where we can do
> something wrong, it is better to ERROR out. For example, for an
> update_differ case where the apply worker can overwrite the data from
> a different origin, it is better to ERROR out. I thought this case was
> comparable to an existing ERROR case like a unique constraint
> violation. But I see your point as well that one might expect the
> existing behavior where we are silently overwriting the different
> origin data. The one idea to address this concern is to suggest users
> set the detect_conflict subscription option as off but I guess that
> would make this feature unusable for users who don't want to ERROR out
> for different origin update cases.
>
> > And in the next patch of resolver, based on the user's input
> > of error/skip/or resolve, we take the action. I still think it is
> > better to stick to the said behaviour. Only if we commit the resolver
> > patch in the same version where we commit the detection patch, then we
> > can take the risk of changing this default behaviour to 'always
> > error'. Otherwise users will be left with conflicts arising but no
> > automatic way to resolve those. But for users who really want their
> > application to error out, we can provide an additional GUC in this
> > patch itself which changes the behaviour to 'always ERROR on
> > conflict'.
> >
>
> I don't see a need of GUC here, even if we want we can have a
> subscription option such conflict_log_level. But users may want to
> either LOG or ERROR based on conflict type. For example, there won't
> be any data inconsistency in two node replication for delete_missing
> case as one is trying to delete already deleted data, so LOGGING such
> a case should be sufficient whereas update_differ could lead to
> different data on two nodes, so the user may want to ERROR out in such
> a case.
>
> We can keep the current behavior as default for the purpose of
> conflict detection but can have a separate patch to decide whether to
> LOG/ERROR based on conflict_type.

+1 on the idea of giving an option to the user to choose either ERROR
or LOG for each conflict type separately.

thanks
Shveta

Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Peter Eisentraut

I have some comments about the first three patches, that deal with
memory management.

v24-0001-Revert-ECPG-s-use-of-pnstrdup.patch

This looks right.

I suppose another approach would be to put a full replacement for
strndup() into src/port/. But since there is currently only one user,
and most other users should be using pnstrdup(), the presented approach
seems ok.

We should take the check for exit() calls from libpq and expand it to
cover the other libraries as well. Maybe there are other problems like
this?

v24-0002-Remove-fe_memutils-from-libpgcommon_shlib.patch

I don't quite understand how this problem can arise. The description says

"""
libpq appears to have no need for this, and the exit() references cause
our libpq-refs-stamp test to fail if the linker doesn't strip out the
unused code.
"""

But under what circumstances does "the linker doesn't strip out" happen?
If this happens accidentally, then we should have seen some buildfarm
failures or something?

Also, one could look further and notice that restricted_token.c and
sprompt.c both a) are not needed by libpq and b) can trigger exit()
calls. Then it's not clear why those are not affected.

v24-0003-common-jsonapi-support-libpq-as-a-client.patch

I'm reminded of thread [0]. I think there is quite a bit of confusion
about the pqexpbuffer vs. stringinfo APIs, and they are probably used
incorrectly quite a bit. There are now also programs that use both of
them! This patch now introduces another layer on top of them. I fear,
at the end, nobody is going to understand any of this anymore. Also,
changing all the programs to link in libpq for pqexpbuffer seems like
the opposite direction from what was suggested in [0].

I think we need to do some deeper thinking here about how we want the
memory management on the client side to work. Maybe we could just use
one API but have some flags or callbacks to control the out-of-memory
behavior.

[0]:
https://www.postgresql.org/message-id/flat/16d0beac-a141-e5d3-60e9-323da75f49bf%40eisentraut.org

RE: Conflict detection and logging in logical replication

2024-07-29 Thread Zhijie Hou (Fujitsu)

On Monday, July 29, 2024 6:59 PM Dilip Kumar  wrote:
> 
> On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> 
> I was going through v7-0001, and I have some initial comments.

Thanks for the comments !

> 
> @@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo
> *resultRelInfo, TupleTableSlot *slot,
>   ExprContext *econtext;
>   Datum values[INDEX_MAX_KEYS];
>   bool isnull[INDEX_MAX_KEYS];
> - ItemPointerData invalidItemPtr;
>   bool checkedIndex = false;
> 
>   ItemPointerSetInvalid(conflictTid);
> - ItemPointerSetInvalid(&invalidItemPtr);
> 
>   /*
>   * Get information from the result relation info structure.
> @@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo
> *resultRelInfo, TupleTableSlot *slot,
> 
>   satisfiesConstraint =
>   check_exclusion_or_unique_constraint(heapRelation, indexRelation,
> - indexInfo, &invalidItemPtr,
> + indexInfo, &slot->tts_tid,
>   values, isnull, estate, false,
>   CEOUC_WAIT, true,
>   conflictTid);
> 
> What is the purpose of this change?  I mean
> 'check_exclusion_or_unique_constraint' says that 'tupleid'
> should be invalidItemPtr if the tuple is not yet inserted and
> ExecCheckIndexConstraints is called by ExecInsert before inserting the tuple.
> So what is this change?

Because the function ExecCheckIndexConstraints() is now invoked after inserting
a tuple (in the patch). So, we need to ignore the newly inserted tuple when
checking conflict in check_exclusion_or_unique_constraint().

> Would this change ExecInsert's behavior as well?

Thanks for pointing it out, I will check and reply.

> 
> 
> 
> 
> +ReCheckConflictIndexes(ResultRelInfo *resultRelInfo, EState *estate,
> +ConflictType type, List *recheckIndexes,
> +TupleTableSlot *slot)
> +{
> + /* Re-check all the unique indexes for potential conflicts */
> +foreach_oid(uniqueidx, resultRelInfo->ri_onConflictArbiterIndexes)
> + {
> + TupleTableSlot *conflictslot;
> +
> + if (list_member_oid(recheckIndexes, uniqueidx) &&
> + FindConflictTuple(resultRelInfo, estate, uniqueidx, slot,
> + &conflictslot)) { RepOriginId origin; TimestampTz committs;
> + TransactionId xmin;
> +
> + GetTupleCommitTs(conflictslot, &xmin, &origin, &committs);
> +ReportApplyConflict(ERROR, type, resultRelInfo->ri_RelationDesc,
> +uniqueidx,  xmin, origin, committs, conflictslot);  }  } }
>  and
> 
> + conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes;
> +
>   if (resultRelInfo->ri_NumIndices > 0)
>   recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
> -slot, estate, false, false,
> -NULL, NIL, false);
> +slot, estate, false,
> +conflictindexes ? true : false,
> +&conflict,
> +conflictindexes, false);
> +
> + /*
> + * Rechecks the conflict indexes to fetch the conflicting local tuple
> + * and reports the conflict. We perform this check here, instead of
> + * perform an additional index scan before the actual insertion and
> + * reporting the conflict if any conflicting tuples are found. This is
> + * to avoid the overhead of executing the extra scan for each INSERT
> + * operation, even when no conflict arises, which could introduce
> + * significant overhead to replication, particularly in cases where
> + * conflicts are rare.
> + */
> + if (conflict)
> + ReCheckConflictIndexes(resultRelInfo, estate, CT_INSERT_EXISTS,
> +recheckIndexes, slot);
> 
> 
>  This logic is confusing, first, you are calling
> ExecInsertIndexTuples() with no duplicate error for the indexes
> present in 'ri_onConflictArbiterIndexes' which means
>  the indexes returned by the function must be a subset of
> 'ri_onConflictArbiterIndexes' and later in ReCheckConflictIndexes()
> you are again processing all the
>  indexes of 'ri_onConflictArbiterIndexes' and checking if any of these
> is a subset of the indexes that is returned by
> ExecInsertIndexTuples().

I think that's not always true. The indexes returned by the function *may not*
be a subset of 'ri_onConflictArbiterIndexes'. Based on the comments atop of the
ExecInsertIndexTuples, it returns a list of index OIDs for any unique or
exclusion constraints that are deferred, and in addition to that, it will
include the indexes in 'arbiterIndexes' if noDupErr == true.

> 
>  Why are we doing that, I think we can directly use the
> 'recheckIndexes' which is returned by ExecInsertIndexTuples(), and
> those indexes are guaranteed to be a subset of
>  ri_onConflictArbiterIndexes. No?

Based on above, we need to filter the deferred indexes or exclusion constraints
in the 'ri_onConflictArbiterIndexes'.

Best Regards,
Hou zj

Re: Make query cancellation keys longer


On 24/07/2024 19:12, Heikki Linnakangas wrote:

On 04/07/2024 15:20, Jelte Fennema-Nio wrote:

On Thu, 4 Jul 2024 at 12:32, Heikki Linnakangas  wrote:

We currently don't do any locking on the ProcSignal array. For query
cancellations, that's good because a query cancel packet is processed
without having a PGPROC entry, so we cannot take LWLocks. We could use
spinlocks though. In this patch, I used memory barriers to ensure that
we load/store the pid and the cancellation key in a sensible order, so
that you cannot e.g. send a cancellation signal to a backend that's just
starting up and hasn't advertised its cancellation key in ProcSignal
yet. But I think this might be simpler and less error-prone by just
adding a spinlock to each ProcSignal slot. That would also fix the
existing race condition where we might set the pss_signalFlags flag for
a slot, when the process concurrently terminates and the slot is reused
for a different process. Because of that, we currently have this caveat:
"... all the signals are such that no harm is done if they're mistakenly
fired". With a spinlock, we could eliminate that race.


I think a spinlock would make this thing a whole concurrency stuff a
lot easier to reason about.

+   slot->pss_cancel_key_valid = false;
+   slot->pss_cancel_key = 0;

If no spinlock is added, I think these accesses should still be made
atomic writes. Otherwise data-races on those fields are still
possible, resulting in undefined behaviour. The memory barriers you
added don't prevent that afaict. With atomic operations there are
still race conditions, but no data-races.

Actually it seems like that same argument applies to the already
existing reading/writing of pss_pid: it's written/read using
non-atomic operations so data-races can occur and thus undefined
behaviour too.


Ok, here's a version with spinlocks.

I went back and forth on what exactly is protected by the spinlock. I
kept the "volatile sig_atomic_t" type for pss_signalFlags, so that it
can still be safely read without holding the spinlock in
CheckProcSignal, but all the functions that set the flags now hold the
spinlock. That removes the race condition that you might set the flag
for wrong slot, if the target backend exits and the slot is recycled.
The race condition was harmless and there were comments to note it, but
it doesn't occur anymore with the spinlock.

(Note: Thomas's "Interrupts vs signals" patch will remove ps_signalFlags
altogether. I'm looking forward to that.)


-volatile pid_t pss_pid;
+pid_tpss_pid;

Why remove the volatile modifier?


Because I introduced a memory barrier to ensure the reads/writes of
pss_pid become visible to other processes in right order. That makes the
'volatile' unnecessary IIUC. With the spinlock, the point is moot however.


I:
- fixed a few comments,
- fixed a straightforward failure with EXEC_BACKEND (ProcSignal needs to 
be passed down in BackendParameters now),
- put back the snippet to signal the whole process group if supported, 
which you pointed out earlier


and committed this refactoring patch.

Thanks for the review!

--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Remove dead code generation tools in src/backend/utils/mb/

Heikki Linnakangas  writes:
> These programs in src/backend/utils/mb/ are unused, and have been unused 
> and unusable since 2003:
> iso.c
> win1251.c
> win866.c
> Attached patch removes them. See commit message for a little more 
> detailed analysis.

+1.  Seems to have been my oversight in 4c3c8c048d.

regards, tom lane

Re: Optimize mul_var() for var1ndigits >= 8

On Mon, Jul 29, 2024, at 02:23, Joel Jacobson wrote:
> Then, I've used sched_setaffinity() from  to ensure the 
> benchmark run on CPU core id 31.

I fixed a bug in my measure function, I had forgot to reset affinity after each
benchmark, so the PostgreSQL backend continued to use the core even after
numeric_mul had finished.

New results with less noise below.

Pardon the exceeding of 80 chars line width,
but felt important to include commit hash and relative delta.


ndigits|rate|  change   |   accum   | commit  | 
 summary
---++---+---+-+
 (1,1) |  1.639e+07 |   |   | 42de72f | SQL/JSON: 
Various improvements to SQL/JSON query f
 (1,1) |  2.248e+07 | +37.16 %  | +37.16 %  | ca481d3 | Optimise 
numeric multiplication for short inputs.
 (1,1) |  2.333e+07 | +3.77 %   | +42.32 %  | 628c1d1 | Use diff's 
--strip-trailing-cr flag where appropri
 (1,1) |  2.291e+07 | -1.81 %   | +39.75 %  | 0dcf753 | Improve the 
numeric width_bucket() computation. Fo
 (1,1) |  2.276e+07 | -0.64 %   | +38.86 %  | da87dc0 | Add missing 
pointer dereference in pg_backend_memo
 (1,1) |  2.256e+07 | -0.86 %   | +37.66 %  | v3-0001 | Extend 
mul_var_short() to 5 and 6-digit inputs. Co
 (1,1) |  2.182e+07 | -3.32 %   | +33.09 %  | v3-0002 | Optimise 
numeric multiplication using base-NBASE^2
 (1,2) |  1.640e+07 |   |   | 42de72f | SQL/JSON: 
Various improvements to SQL/JSON query f
 (1,2) |  2.202e+07 | +34.28 %  | +34.28 %  | ca481d3 | Optimise 
numeric multiplication for short inputs.
 (1,2) |  2.214e+07 | +0.58 %   | +35.06 %  | 628c1d1 | Use diff's 
--strip-trailing-cr flag where appropri
 (1,2) |  2.173e+07 | -1.85 %   | +32.55 %  | 0dcf753 | Improve the 
numeric width_bucket() computation. Fo
 (1,2) |  2.260e+07 | +3.98 %   | +37.83 %  | da87dc0 | Add missing 
pointer dereference in pg_backend_memo
 (1,2) |  2.233e+07 | -1.19 %   | +36.19 %  | v3-0001 | Extend 
mul_var_short() to 5 and 6-digit inputs. Co
 (1,2) |  2.144e+07 | -3.97 %   | +30.79 %  | v3-0002 | Optimise 
numeric multiplication using base-NBASE^2
 (1,3) |  1.511e+07 |   |   | 42de72f | SQL/JSON: 
Various improvements to SQL/JSON query f
 (1,3) |  2.179e+07 | +44.20 %  | +44.20 %  | ca481d3 | Optimise 
numeric multiplication for short inputs.
 (1,3) |  2.134e+07 | -2.05 %   | +41.24 %  | 628c1d1 | Use diff's 
--strip-trailing-cr flag where appropri
 (1,3) |  2.198e+07 | +2.99 %   | +45.47 %  | 0dcf753 | Improve the 
numeric width_bucket() computation. Fo
 (1,3) |  2.190e+07 | -0.39 %   | +44.91 %  | da87dc0 | Add missing 
pointer dereference in pg_backend_memo
 (1,3) |  2.164e+07 | -1.16 %   | +43.23 %  | v3-0001 | Extend 
mul_var_short() to 5 and 6-digit inputs. Co
 (1,3) |  2.104e+07 | -2.79 %   | +39.24 %  | v3-0002 | Optimise 
numeric multiplication using base-NBASE^2
 (1,4) |  1.494e+07 |   |   | 42de72f | SQL/JSON: 
Various improvements to SQL/JSON query f
 (1,4) |  2.132e+07 | +42.71 %  | +42.71 %  | ca481d3 | Optimise 
numeric multiplication for short inputs.
 (1,4) |  2.151e+07 | +0.91 %   | +44.00 %  | 628c1d1 | Use diff's 
--strip-trailing-cr flag where appropri
 (1,4) |  2.190e+07 | +1.82 %   | +46.62 %  | 0dcf753 | Improve the 
numeric width_bucket() computation. Fo
 (1,4) |  2.172e+07 | -0.82 %   | +45.41 %  | da87dc0 | Add missing 
pointer dereference in pg_backend_memo
 (1,4) |  2.112e+07 | -2.75 %   | +41.41 %  | v3-0001 | Extend 
mul_var_short() to 5 and 6-digit inputs. Co
 (1,4) |  2.077e+07 | -1.67 %   | +39.05 %  | v3-0002 | Optimise 
numeric multiplication using base-NBASE^2
 (1,8) |  1.444e+07 |   |   | 42de72f | SQL/JSON: 
Various improvements to SQL/JSON query f
 (1,8) |  2.063e+07 | +42.85 %  | +42.85 %  | ca481d3 | Optimise 
numeric multiplication for short inputs.
 (1,8) |  1.996e+07 | -3.25 %   | +38.21 %  | 628c1d1 | Use diff's 
--strip-trailing-cr flag where appropri
 (1,8) |  2.039e+07 | +2.12 %   | +41.14 %  | 0dcf753 | Improve the 
numeric width_bucket() computation. Fo
 (1,8) |  2.020e+07 | -0.89 %   | +39.87 %  | da87dc0 | Add missing 
pointer dereference in pg_backend_memo
 (1,8) |  1.934e+07 | -4.28 %   | +33.89 %  | v3-0001 | Extend 
mul_var_short() to 5 and 6-digit inputs. Co
 (1,8) |  1.948e+07 | +0.73 %   | +34.87 %  | v3-0002 | Optimise 
numeric multiplication using base-NBASE^2
 (1,16)|  9.614e+06 |   |   | 42de72f | SQL/JSON: 
Various improvements to SQL/JSON query f
 (1,16)|  1.215e+07 | +26.37 %  | +26.37 %  | ca481d3 | Optimise 
numeric multiplication for short inputs.
 (1,16)|  1.223e+07 |

Re: Virtual generated columns

2024-07-29 Thread Peter Eisentraut


On 22.07.24 12:53, jian he wrote:

another bug?
drop table gtest12v;
CREATE TABLE gtest12v (a int PRIMARY KEY, b bigint, c int GENERATED
ALWAYS AS (b * 2) VIRTUAL);
insert into gtest12v (a,b) values (11,  22147483647);
table gtest12v;

insert ok, but select error:
ERROR:  integer out of range

should insert fail?


I think this is the correct behavior.

There has been a previous discussion: 
https://www.postgresql.org/message-id/2e3d5147-16f8-af0f-00ab-4c72cafc896f%402ndquadrant.com



CREATE TABLE gtest12v (a int PRIMARY KEY, b bigint, c int GENERATED
ALWAYS AS (b * 2) VIRTUAL);
CREATE SEQUENCE sequence_testx OWNED BY gtest12v.c;

seems to work. But I am not sure if there are any corner cases that
make it not work.
just want to raise this issue.


I don't think this matters.  You can make a sequence owned by any 
column, even if that column doesn't have a default that invokes the 
sequence.  So nonsensical setups are possible, but they are harmless.

Re: Remove dead code generation tools in src/backend/utils/mb/

2024-07-29 Thread Alexander Lakhin


Hello Tom and Heikki,

29.07.2024 17:15, Tom Lane wrote:

Heikki Linnakangas  writes:

These programs in src/backend/utils/mb/ are unused, and have been unused
and unusable since 2003:
iso.c
win1251.c
win866.c
Attached patch removes them. See commit message for a little more
detailed analysis.

+1.  Seems to have been my oversight in 4c3c8c048d.


I also wonder whether src/test/locale/ still makes sense; does anyone
run those tests (I could not run a single one on a quick attempt)?

(As far as I can tell, KOI8-R fallen out of mainstream usage in Russia
twenty years ago...)

Best regards,
Alexander

Re: Remove dead code generation tools in src/backend/utils/mb/

2024-07-29 Thread Andreas Karlsson


On 7/29/24 5:00 PM, Alexander Lakhin wrote:

I also wonder whether src/test/locale/ still makes sense; does anyone
run those tests (I could not run a single one on a quick attempt)?


I was actually wondering about those yesterday and they should probably 
be removed (or fixed if anyone can see a use for them). As they are 
right now they do not seem very useful, especially with the current 
selection of locales: de_DE.ISO8859-1, gr_GR.ISO8859-7 and koi8-r.


Andreas

Re: tls 1.3: sending multiple tickets

On Mon, Jul 29, 2024 at 5:57 AM Daniel Gustafsson  wrote:
> I'm sure there are more interactions with OpenSSL, and TLS in general, which
> warrants better comments but the attached takes a stab at the two examples in
> question here to get started (to avoid perfect get in the way of progress).

+1.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Fri, Jul 26, 2024 at 4:10 PM Alexander Korotkov  wrote:
> 0002 could also use pg_indent and pgperltidy.  And I have couple other
> notes regarding 0002.

As Tom said elsewhere, we don't have a practice of requiring perltidy
for every commit, and I normally don't bother. The tree is
pgindent-clean currently so I believe I got that part right.

> >In the process of fixing these bugs, I realized that the logic to wait
> >for WAL summarization to catch up was spread out in a way that made
> >it difficult to reuse properly, so this code refactors things to make
> >it easier.
>
> It would be nice to split refactoring out of material logic changed.
> This way it would be easier to review and would look cleaner in the
> git history.

I support that idea in general but felt it was overkill in this case:
it's new code, and there was only one existing caller of the function
that got refactored, and I'm not a huge fan of cluttering the git
history with a bunch of tiny little refactoring commits to fix a
single bug. I might have changed it if I'd seen this note before
committing, though.

> >To make this fix work, also teach the WAL summarizer that after a
> >promotion has occurred, no more WAL can appear on the previous
> >timeline: previously, the WAL summarizer wouldn't switch to the new
> >timeline until we actually started writing WAL there, but that meant
> >that when the startup process was waiting for the WAL summarizer, it
> >was waiting for an action that the summarizer wasn't yet prepared to
> >take.
>
> I think this is a pretty long sentence, and I'm not sure I can
> understand it correctly.  Does startup process wait for the WAL
> summarizer without this patch?  If not, it's not clear for me that
> word "previously" doesn't distribute to this part of sentence.
> Breaking this into multiple sentences could improve the clarity for
> me.

Yes, I think that phrasing is muddled. It's too late to amend the
commit message, but for posterity: I initially thought that I could
fix this bug by just teaching the startup process to wait for WAL
summarization before performing the .partial renaming, but that was
not enough by itself. The problem is that the WAL summarizer would
read all the records that were present in the final WAL file on the
old timeline, but it wouldn't actually write out a summary file,
because that only happens when it reaches XLOG_CHECKPOINT_REDO or a
timeline switch point. Since no WAL had appeared on the new timeline
yet, it didn't view the end of the old timeline as a switch point, so
it just sat there waiting, without ever writing out a file; and the
startup process sat there waiting for it. So the second part of the
fix is to make the WAL summarizer realize that once the startup
process has chosen a new timeline ID, no more WAL is going to appear
on the old timeline, and thus it can (and should) write out the final
summary file for the old timeline and prepare to read WAL from the new
timeline.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Incremental backup from a streaming replication standby fails

On Fri, Jul 26, 2024 at 4:13 PM Alexander Korotkov  wrote:
> Great!  Should we mark the corresponding v17 open item as closed?

Done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: CREATE MATERIALIZED VIEW

2024-07-29 Thread Nathan Bossart

Committed.

-- 
nathan

Detect double-release of spinlock

Hi,

On 2024-07-29 09:40:26 -0700, Andres Freund wrote:
> On 2024-07-29 12:33:13 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2024-07-29 11:31:56 -0400, Tom Lane wrote:
> > >> There was some recent discussion about getting rid of
> > >> --disable-spinlocks on the grounds that nobody would use
> > >> hardware that lacked native spinlocks.  But now I wonder
> > >> if there is a testing/debugging reason to keep it.
> > 
> > > Seems it'd be a lot more straightforward to just add an assertion to the
> > > x86-64 spinlock implementation verifying that the spinlock isn't already 
> > > free?
> 
> FWIW, I quickly hacked that up, and it indeed quickly fails with 0393f542d72^
> and passes with 0393f542d72.

Thought it'd be valuable to post a patch to go along with this, to
-hackers. The thread started at [1]

Other context from this discussion:
> > I dunno, is that the only extra check that the --disable-spinlocks
> > implementation is providing?
> 
> I think it also provides the (valuable!) check that spinlocks were actually
> initialized. But that again seems like something we'd be better off adding
> more general infrastructure for - nobody runs --disable-spinlocks locally, we
> shouldn't need to run this on the buildfarm to find problems like this.
> 
> 
> > I'm kind of allergic to putting Asserts into spinlocked code segments,
> > mostly on the grounds that it violates the straight-line-code precept.
> > I suppose it's not really that bad for tests that you don't expect
> > to fail, but still ...
> 
> I don't think the spinlock implementation itself is really affected by that
> rule - after all, the --disable-spinlocks implementation actually consists out
> of several layers of external function calls (including syscalls in some
> cases!).

Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/E1sYSF2-001lEB-D1%40gemulon.postgresql.org
>From 06bd992f4be2f46586bfc7bab3135b3a2ca84e72 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 29 Jul 2024 09:48:26 -0700
Subject: [PATCH v1] Detect unlocking an unlocked spinlock

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20240729164026.yurg37tej34o7...@awork3.anarazel.de
Backpatch:
---
 src/include/storage/spin.h | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/src/include/storage/spin.h b/src/include/storage/spin.h
index c0679c59992..7ec32cd816a 100644
--- a/src/include/storage/spin.h
+++ b/src/include/storage/spin.h
@@ -61,7 +61,22 @@
 
 #define SpinLockAcquire(lock) S_LOCK(lock)
 
-#define SpinLockRelease(lock) S_UNLOCK(lock)
+static inline void
+SpinLockRelease(volatile slock_t *lock)
+{
+	/*
+	 * Check that this backend actually held the spinlock. Implemented as a
+	 * static inline to avoid multiple-evaluation hazards.
+	 *
+	 * Can't assert when using the semaphore fallback implementation - it
+	 * doesn't implement S_LOCK_FREE() - but the fallback triggers failures in
+	 * the case of double-release on its own.
+	 */
+#ifdef HAVE_SPINLOCKS
+	Assert(!S_LOCK_FREE(lock));
+#endif
+	S_UNLOCK(lock);
+}
 
 #define SpinLockFree(lock)	S_LOCK_FREE(lock)
 
-- 
2.45.2.746.g06e570c0df.dirty

Re: CREATE MATERIALIZED VIEW

2024-07-29 Thread Dagfinn Ilmari Mannsåker

Nathan Bossart  writes:

> Committed.

Thanks!

- ilmari

Re: Interrupts vs signals

On Wed, Jul 24, 2024 at 8:58 AM Heikki Linnakangas  wrote:
> a) assign every child process a PGPROC entry, and make postmaster
> responsible for assigning them like you suggest. We'll need more PGPROC
> entries, because currently a process doesn't reserve one until
> authentication has happened. Or we change that behavior.

I wonder how this works right now. Is there something that limits the
number of authentication requests that can be in flight concurrently,
or is it completely uncapped (except by machine resources)?

My first thought when I read this was that it would be bad to have to
put a limit on something that's currently unlimited. But then I
started thinking that, even if it is currently unlimited, that might
be a bug rather than a feature. If you have hundreds of pending
authentication requests, that just means you're using a lot of machine
resources on something that doesn't really help anybody. A machine
with hundreds of authentication-pending connections is possibly
getting DDOS'd and probably getting buried. You'd be better off
focusing the machine's limited resources on the already-established
connections and a more limited number of new connection attempts. If
you accept so many connection attempts that you don't actually have
enough memory/CPU/kernel scheduling firepower to complete the
authentication process with any of them, it does nobody any good.

I'm not sure what's best to do here; just thinking out loud.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Detect double-release of spinlock


On 29/07/2024 19:51, Andres Freund wrote:

On 2024-07-29 09:40:26 -0700, Andres Freund wrote:

On 2024-07-29 12:33:13 -0400, Tom Lane wrote:

Andres Freund  writes:

On 2024-07-29 11:31:56 -0400, Tom Lane wrote:

There was some recent discussion about getting rid of
--disable-spinlocks on the grounds that nobody would use
hardware that lacked native spinlocks.  But now I wonder
if there is a testing/debugging reason to keep it.



Seems it'd be a lot more straightforward to just add an assertion to the
x86-64 spinlock implementation verifying that the spinlock isn't already free?


FWIW, I quickly hacked that up, and it indeed quickly fails with 0393f542d72^
and passes with 0393f542d72.


+1. Thanks!


Other context from this discussion:
I dunno, is that the only extra check that the --disable-spinlocks
implementation is providing?


I think it also provides the (valuable!) check that spinlocks were actually
initialized. But that again seems like something we'd be better off adding
more general infrastructure for - nobody runs --disable-spinlocks locally, we
shouldn't need to run this on the buildfarm to find problems like this.


Note that the "check" for double-release with the fallback 
implementation wasn't an explicit check either. It just incremented the 
underlying semaphore, which caused very weird failures later in 
completely unrelated code. An explicit assert would be much nicer.


+1 for removing --disable-spinlocks, but let's add this assertion first.


I'm kind of allergic to putting Asserts into spinlocked code segments,
mostly on the grounds that it violates the straight-line-code precept.
I suppose it's not really that bad for tests that you don't expect
to fail, but still ...


I don't think the spinlock implementation itself is really affected by that
rule - after all, the --disable-spinlocks implementation actually consists out
of several layers of external function calls (including syscalls in some
cases!).


Yeah I'm not worried about that at all. Also, the assert is made when 
you have already released the spinlock; you are already out of the 
critical section.


--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Interrupts vs signals


On 29/07/2024 19:56, Robert Haas wrote:

On Wed, Jul 24, 2024 at 8:58 AM Heikki Linnakangas  wrote:

a) assign every child process a PGPROC entry, and make postmaster
responsible for assigning them like you suggest. We'll need more PGPROC
entries, because currently a process doesn't reserve one until
authentication has happened. Or we change that behavior.


I wonder how this works right now. Is there something that limits the
number of authentication requests that can be in flight concurrently,
or is it completely uncapped (except by machine resources)?



My first thought when I read this was that it would be bad to have to
put a limit on something that's currently unlimited. But then I
started thinking that, even if it is currently unlimited, that might
be a bug rather than a feature. If you have hundreds of pending
authentication requests, that just means you're using a lot of machine
resources on something that doesn't really help anybody. A machine
with hundreds of authentication-pending connections is possibly
getting DDOS'd and probably getting buried. You'd be better off
focusing the machine's limited resources on the already-established
connections and a more limited number of new connection attempts. If
you accept so many connection attempts that you don't actually have
enough memory/CPU/kernel scheduling firepower to complete the
authentication process with any of them, it does nobody any good.

I'm not sure what's best to do here; just thinking out loud.


Yes, there's a limit, roughly 2x max_connections. see 
MaxLivePostmasterChildren().


There's another issue with that that I was about to post in another 
thread, but here goes: the MaxLivePostmasterChildren() limit is shared 
by all regular backends, bgworkers and autovacuum workers. If you open a 
lot of TCP connections to postmaster and don't send anything to the 
server, you exhaust those slots, and the server won't be able to start 
any autovacuum workers or background workers either. That's not great. I 
started to work on approach b), with separate pools of slots for 
different kinds of child processes, which fixes that. Stay tuned for a 
patch.


In addition to that, you can have an unlimited number of "dead-end" 
backends, which are doomed to just respond with "sorry, too many 
clients" error. The only limit on those is the amount of resources 
needed for all the processes and a little memory to track them.


--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Interrupts vs signals

Robert Haas  writes:
> I wonder how this works right now. Is there something that limits the
> number of authentication requests that can be in flight concurrently,
> or is it completely uncapped (except by machine resources)?

The former.  IIRC, the postmaster won't spawn more than 2X max_connections
subprocesses (don't recall the exact limit, but it's around there).

regards, tom lane

Re：Detach shared memory in Postmaster child if not needed

2024-07-29 Thread Rui Zhao

Thanks for your reply.
> Thanks for the patch. How do you estimate its performance impact?
In my patch, ony child processes that set
(child_process_kinds[child_type].shmem_attach == false)
will detach from shared memory.
Child processes with B_STANDALONE_BACKEND and B_INVALID don't call
postmaster_child_launch().
Therefore, currently, only the syslogger will be affected,
which should be harmless.
> Note the comments for postmaster_child_launch(). This function is
> exposed to the third-party code and guarantees to attach shared
> memory. I doubt that there is much third-party code in existence to
> break but you should change to comment.
Thank you for your reminder. My v2 patch will include the comments for
postmaster_child_launch().
--
Best regards,
Rui Zhao


0001-Detach-shared-memory-in-Postmaster-child-if-not-needed-v2.patch
Description: Binary data

Re: Detect double-release of spinlock

Heikki Linnakangas  writes:
> Yeah I'm not worried about that at all. Also, the assert is made when 
> you have already released the spinlock; you are already out of the 
> critical section.

Not in the patch Andres posted.

regards, tom lane

Re: Remove dead code generation tools in src/backend/utils/mb/


On 29/07/2024 17:15, Tom Lane wrote:

Heikki Linnakangas  writes:

These programs in src/backend/utils/mb/ are unused, and have been unused
and unusable since 2003:
iso.c
win1251.c
win866.c
Attached patch removes them. See commit message for a little more
detailed analysis.


+1.  Seems to have been my oversight in 4c3c8c048d.


Removed.

(Aleksander, you forgot to CC the mailing list, but thanks for your 
review too.)


--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Detect double-release of spinlock

Hi,

On 2024-07-29 13:25:22 -0400, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > Yeah I'm not worried about that at all. Also, the assert is made when 
> > you have already released the spinlock; you are already out of the 
> > critical section.
> 
> Not in the patch Andres posted.

Which seems fairly fundamental - once outside of the critical section, we
can't actually assert that the lock isn't acquired, somebody else *validly*
might have acquired it by then.

However, I still don't think it's a problem to assert that the lock is held in
in the unlock "routine". As mentioned before, the spinlock implementation
itself has never followed the "just straight line code" rule that users of
spinlocks are supposed to follow.

Greetings,

Andres Freund

Re: Detect double-release of spinlock

Andres Freund  writes:
> However, I still don't think it's a problem to assert that the lock is held in
> in the unlock "routine". As mentioned before, the spinlock implementation
> itself has never followed the "just straight line code" rule that users of
> spinlocks are supposed to follow.

Yeah, that's fair.

regards, tom lane

Re: Optimize mul_var() for var1ndigits >= 8

On Mon, Jul 29, 2024, at 16:42, Joel Jacobson wrote:
> New results with less noise below.
>
> Pardon the exceeding of 80 chars line width,
> but felt important to include commit hash and relative delta.
>
>
> ndigits|rate|  change   |   accum   | commit  | 
>  summary
> ---++---+---+-+

I've reviewed the benchmark results, and it looks like v3-0001 made some cases 
a bit slower:

 (32,32)   |  1.786e+06 | -13.27 %  | -11.26 %  | v3-0001 | Extend 
mul_var_short() to 5 and 6-digit inputs. Co
 (32,64)   |  1.119e+06 | -16.72 %  | -20.45 %  | v3-0001 | Extend 
mul_var_short() to 5 and 6-digit inputs. Co
 (32,128)  |  7.242e+05 | -13.55 %  | -9.24 %   | v3-0001 | Extend 
mul_var_short() to 5 and 6-digit inputs. Co
 (64,64)   |  5.515e+05 | -22.34 %  | -24.47 %  | v3-0001 | Extend 
mul_var_short() to 5 and 6-digit inputs. Co
 (64,128)  |  3.204e+05 | -14.83 %  | -12.44 %  | v3-0001 | Extend 
mul_var_short() to 5 and 6-digit inputs. Co
 (128,128) |  1.750e+05 | -16.01 %  | -15.24 %  | v3-0001 | Extend 
mul_var_short() to 5 and 6-digit inputs. Co

Thanks to v3-0002, they are all still significantly faster when both patches 
have been applied,
but I wonder if it is expected or not, that v3-0001 temporarily made them a bit 
slower?

Same cases with v3-0002 applied:

 (32,32)   |  3.408e+06 | +90.80 %  | +69.32 %  | v3-0002 | Optimise 
numeric multiplication using base-NBASE^2
 (32,64)   |  2.356e+06 | +110.63 % | +67.56 %  | v3-0002 | Optimise 
numeric multiplication using base-NBASE^2
 (32,128)  |  1.393e+06 | +92.39 %  | +74.61 %  | v3-0002 | Optimise 
numeric multiplication using base-NBASE^2
 (64,64)   |  1.432e+06 | +159.69 % | +96.14 %  | v3-0002 | Optimise 
numeric multiplication using base-NBASE^2
 (128,128) |  5.567e+05 | +218.07 % | +169.60 % | v3-0002 | Optimise 
numeric multiplication using base-NBASE^2

/Joel

Re: Detect double-release of spinlock


On 29/07/2024 20:48, Andres Freund wrote:

On 2024-07-29 13:25:22 -0400, Tom Lane wrote:

Heikki Linnakangas  writes:

Yeah I'm not worried about that at all. Also, the assert is made when
you have already released the spinlock; you are already out of the
critical section.


Not in the patch Andres posted.


Which seems fairly fundamental - once outside of the critical section, we
can't actually assert that the lock isn't acquired, somebody else *validly*
might have acquired it by then.


You could do:

bool was_free = S_LOCK_FREE(lock);

S_UNLOCK(lock);
Assert(!was_free);

Depending on the underlying implementation, you could also use 
compare-and-exchange. That makes the assertion-enabled instructions a 
little different than without assertions though.



However, I still don't think it's a problem to assert that the lock is held in
in the unlock "routine". As mentioned before, the spinlock implementation
itself has never followed the "just straight line code" rule that users of
spinlocks are supposed to follow.


Agreed.

--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Interrupts vs signals

On Mon, Jul 29, 2024 at 1:24 PM Tom Lane  wrote:
> Robert Haas  writes:
> > I wonder how this works right now. Is there something that limits the
> > number of authentication requests that can be in flight concurrently,
> > or is it completely uncapped (except by machine resources)?
>
> The former.  IIRC, the postmaster won't spawn more than 2X max_connections
> subprocesses (don't recall the exact limit, but it's around there).

Hmm. Not to sidetrack this thread too much, but multiplying by two
doesn't really sound like the right idea to me. The basic idea
articulated in the comment for canAcceptConnections() makes sense:
some backends might fail authentication, or might be about to exit, so
it makes sense to allow for some slop. But 2X is a lot of slop even on
a machine with the default max_connections=100, and people with
connection management problems are likely to be running with
max_connections=500 or max_connections=900 or even (insanely)
max_connections=2000. Continuing with a connection attempt because we
think that hundreds or thousands of connections that are ahead of us
in the queue might clear out of the way before we need a PGPROC is not
a good bet.

I wonder if we ought to restrict this to a small, flat value, like say
50, or by a new GUC that defaults to such a value if a constant seems
problematic. Maybe it doesn't really matter. I'm not sure how much
work we'd save by booting out the doomed connection attempt earlier.

The unlimited number of dead-end backends doesn't sound too great
either. I don't have another idea, but presumably resisting DDOS
attacks and/or preserving resources for things that still have a
chance of working ought to take priority over printing a nicer error
message from a connection that's doomed to fail anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Detach shared memory in Postmaster child if not needed

On Mon, Jul 29, 2024 at 5:57 AM Rui Zhao  wrote:
> Prior to PG16, postmaster children would manually detach from shared memory
> if it was not needed. However, this behavior was removed in fork mode in
> commit aafc05d.

Oh. The commit message makes no mention of that. I wonder whether it
was inadvertent.

> Detaching shared memory when it is no longer needed is beneficial, as
> postmaster children (like syslogger) don't wish to take any risk of
> accidentally corrupting shared memory. Additionally, any panic in these
> processes will not reset shared memory.

+1.

> The attached patch addresses this issue by detaching shared memory after
> fork_process().

I don't know whether this is the correct approach or not, but
hopefully Heikki can comment.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Detect double-release of spinlock

Hi,

On 2024-07-29 21:00:35 +0300, Heikki Linnakangas wrote:
> On 29/07/2024 20:48, Andres Freund wrote:
> > On 2024-07-29 13:25:22 -0400, Tom Lane wrote:
> > > Heikki Linnakangas  writes:
> > > > Yeah I'm not worried about that at all. Also, the assert is made when
> > > > you have already released the spinlock; you are already out of the
> > > > critical section.
> > > 
> > > Not in the patch Andres posted.
> > 
> > Which seems fairly fundamental - once outside of the critical section, we
> > can't actually assert that the lock isn't acquired, somebody else *validly*
> > might have acquired it by then.
> 
> You could do:
> 
> bool was_free = S_LOCK_FREE(lock);
> 
> S_UNLOCK(lock);
> Assert(!was_free);

I don't really see the point - we're about to crash with an assertion failure,
why would we want to do that outside of the critical section? If anything that
will make it harder to debug the issue in a core dump, because other backends
might "destroy evidence" due to being able to acquire the spinlock.

> Depending on the underlying implementation, you could also use
> compare-and-exchange.

That'd scale a lot worse, at least on x86-64, as it requires the unlock to be
an atomic op, whereas today it's a simple store (+ compiler barrier).

I've experimented with replacing all spinlocks with lwlocks, and the fact that
you need an atomic op for an rwlock release is one of the two major reasons
they have a higher overhead (the remainder is boring stuff like the overhead
of external function calls and ownership management).

Greetings,

Andres Freund

Re: Detect double-release of spinlock

Hi,

Partially replying here to an email on -committers [1].

On 2024-07-29 13:57:02 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > However, I still don't think it's a problem to assert that the lock is held 
> > in
> > in the unlock "routine". As mentioned before, the spinlock implementation
> > itself has never followed the "just straight line code" rule that users of
> > spinlocks are supposed to follow.
>
> Yeah, that's fair.

Cool.


On 2024-07-29 13:56:05 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > b) Instead define the spinlock to have 1 as the unlocked state and 0 as the
> >locked state. That makes it a bit harder to understand that 
> > initialization
> >is missing, compared to a dedicated state, as the first use of the 
> > spinlock
> >just blocks.
>
> This option works for me.

> > To make 1) b) easier to understand it might be worth changing the slock_t
> > typedef to be something like
>
> > typedef struct slock_t
> > {
> > char is_free;
> > } slock_t;
>
> +1

Cool. I've attached a prototype.


I just realized there's a nice little advantage to the "inverted"
representation - it detects missing initialization even in optimized builds.


> How much of this would we change across platforms, and how much
> would be x86-only?  I think there are enough people developing on
> ARM (e.g. Mac) now to make it worth covering that, but maybe we
> don't care so much about anything else.

Not sure. Right now I've only hacked up x86-64 (not even touching i386), but
it shouldn't be hard to change at least some additional platforms.

My current prototype requires S_UNLOCK, S_LOCK_FREE, S_INIT_LOCK to be
implemented for x86-64 instead of using the "generic" implementation. That'd
be mildly annoying duplication if we did so for a few more platforms.


It'd be more palatable to just change all platforms if we made more of them
use __sync_lock_test_and_set (or some other intrinsic(s))...

Greetings,

Andres Freund

[1] https://postgr.es/m/2812376.1722275765%40sss.pgh.pa.us
>From 06bd992f4be2f46586bfc7bab3135b3a2ca84e72 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 29 Jul 2024 09:48:26 -0700
Subject: [PATCH v2 1/2] Detect unlocking an unlocked spinlock

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20240729164026.yurg37tej34o7...@awork3.anarazel.de
Backpatch:
---
 src/include/storage/spin.h | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/src/include/storage/spin.h b/src/include/storage/spin.h
index c0679c59992..7ec32cd816a 100644
--- a/src/include/storage/spin.h
+++ b/src/include/storage/spin.h
@@ -61,7 +61,22 @@
 
 #define SpinLockAcquire(lock) S_LOCK(lock)
 
-#define SpinLockRelease(lock) S_UNLOCK(lock)
+static inline void
+SpinLockRelease(volatile slock_t *lock)
+{
+	/*
+	 * Check that this backend actually held the spinlock. Implemented as a
+	 * static inline to avoid multiple-evaluation hazards.
+	 *
+	 * Can't assert when using the semaphore fallback implementation - it
+	 * doesn't implement S_LOCK_FREE() - but the fallback triggers failures in
+	 * the case of double-release on its own.
+	 */
+#ifdef HAVE_SPINLOCKS
+	Assert(!S_LOCK_FREE(lock));
+#endif
+	S_UNLOCK(lock);
+}
 
 #define SpinLockFree(lock)	S_LOCK_FREE(lock)
 
-- 
2.45.2.746.g06e570c0df.dirty

>From 574d02d3d4d7f38b764ca6a2e4d2617a417ab8a8 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 29 Jul 2024 10:55:11 -0700
Subject: [PATCH v2 2/2] Detect many uses of uninitialized spinlocks on x86-64

On most platforms we use 0 for a unlocked spinlock and 1 for a locked one. As
we often place spinlocks in zero-initialized memory, missing calls to
SpinLockInit() aren't noticed.

To address that, swap the meaning of the spinlock state and use 1 for unlocked
and 0 for locked. That way using an uninitialized spinlock blocks the first
time a lock is acquired. A dedicated state for initialized locks would allow
for a nicer assertion, but ends up with slightly less dense code (a three byte
cmp with an immediate, instead of a 2 byte test).

To make the swapped meaning of the slock_t state easier to understand when
debugging, change the slock_t to be a struct on x86-64, with an "is_free"
member.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20240729164026.yurg37tej34o7...@awork3.anarazel.de
Backpatch:
---
 src/include/storage/s_lock.h | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 02c68513a53..18291b284b0 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -205,7 +205,10 @@ spin_delay(void)
 #ifdef __x86_64__		/* AMD Opteron, Intel EM64T */
 #define HAS_TEST_AND_SET
 
-typedef unsigned char slock_t;
+typedef struct slock_t
+{
+	char	is_free;
+} slock_t;
 
 #define TAS(lock) tas(lock)
 
@@ -217,21 +220,27 @@ typedef unsigned char slock_t;
  * and IA32, by Michael Chynoweth and Mary R. Lee. As of this writing, it is
  * av

040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
> In hopes of moving things along as we approach the v18 branch,
> I went ahead and pushed Kuroda-san's patches (with a bit of
> further editorialization).  AFAICS that allows closing out
> the concerns raised by Noah, so I've marked that open item
> done.  However, I added a new open item about how the
> 040_pg_createsubscriber.pl test is slow and still unstable.

This open item is now the only open item. It seems to have been open
for a month with no response from Peter which is, from my point of
view, far from ideal. However, another thing that is not ideal is that
we've been using the same thread to discuss every issue related to
this patch for 2.5 years. The thread spans hundreds of messages and it
is by no means obvious to what extent the messages posted after this
one addressed the underlying concern. Perhaps it would have been an
idea to start new threads when we started discussing post-commit
issues, instead of just tagging onto the same one.

But that said, I see no commits in the commit history which purport to
improve performance, so I guess the performance is probably still not
what you want, though I am not clear on the details. And as far as
stability is concerned, I peered through the dense haze of
027_stream_regress-related buildfarm failures for long enough to
discover that the stability issues with 040_pg_createsubscriber aren't
fixed either, because we have these recent buildfarm reports:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-07-26%2016%3A02%3A40
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-07-26%2009%3A20%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2024-07-25%2002%3A39%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-07-22%2002%3A31%3A32

So, Peter, as the committer responsible for pg_createsubscriber, what
are we going to do about this?

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Detach shared memory in Postmaster child if not needed


On 29/07/2024 21:10, Robert Haas wrote:

On Mon, Jul 29, 2024 at 5:57 AM Rui Zhao  wrote:

Prior to PG16, postmaster children would manually detach from shared memory
if it was not needed. However, this behavior was removed in fork mode in
commit aafc05d.


Oh. The commit message makes no mention of that. I wonder whether it
was inadvertent.


Detaching shared memory when it is no longer needed is beneficial, as
postmaster children (like syslogger) don't wish to take any risk of
accidentally corrupting shared memory. Additionally, any panic in these
processes will not reset shared memory.


+1.


The attached patch addresses this issue by detaching shared memory after
fork_process().


I don't know whether this is the correct approach or not, but
hopefully Heikki can comment.


Good catch, it was not intentional. The patch looks good to me, so 
committed. Thanks Rui!


--
Heikki Linnakangas
Neon (https://neon.tech)

Re: tiny step toward threading: reduce dependence on setlocale()

2024-07-29 Thread Peter Eisentraut


On 27.07.24 21:03, Andreas Karlsson wrote:

On 7/26/24 10:35 PM, Jeff Davis wrote:

database_ctype_is_c refers to the LC_CTYPE environment of the database
-- pg_database.datctype. default_locale.ctype_is_c is the ctype of the
database's default collation.

Confusing, I know, but it matters for a few things that still depend on
the LC_CTYPE, such as tsearch and maybe a few extensions. See
f413941f41.


Ah, right! That was thinko on my behalf.

The set of patches looks good to me now. There is further refactoring 
that can be done in this area (and should be done given all calls e.g to 
isalpha()) but I think this set of patches improves code readability 
while moving us away from setlocale().


And even if we take a tiny performance hit here, which your tests did 
not measure, I would say it is worth it both due to code clarity and due 
to not relying on thread unsafe state.


I do not see these patches in the commitfest app but if they were I 
would have marked them as ready for committer.


Here: https://commitfest.postgresql.org/48/5023/

I have also re-reviewed the patches and I agree they are good to go.

Suboptimal spinlock code due to volatile

Hi,

As part of [1] I was staring at the assembly code generated for
SpinLockAcquire(), fairly randomly using GetRecoveryState() as the example.

On master, in an optimized build this generates the following code (gcc 12 in
this case, but it doesn't really matter):

4220 :
4220:   55  push   %rbp
4221:   48 8b 05 00 00 00 00mov0x0(%rip),%rax# 4228 

4228:   ba 01 00 00 00  mov$0x1,%edx
422d:   48 89 e5mov%rsp,%rbp
4230:   48 05 c0 01 00 00   add$0x1c0,%rax
4236:   f0 86 10lock xchg %dl,(%rax)
4239:   84 d2   test   %dl,%dl
423b:   75 23   jne4260 
423d:   48 8b 05 00 00 00 00mov0x0(%rip),%rax# 4244 

4244:   8b 80 44 01 00 00   mov0x144(%rax),%eax
424a:   48 8b 15 00 00 00 00mov0x0(%rip),%rdx# 4251 

4251:   c6 82 c0 01 00 00 00movb   $0x0,0x1c0(%rdx)
4258:   5d  pop%rbp
4259:   c3  ret
425a:   66 0f 1f 44 00 00   nopw   0x0(%rax,%rax,1)
4260:   48 8b 05 00 00 00 00mov0x0(%rip),%rax# 4267 

4267:   48 8d 0d 00 00 00 00lea0x0(%rip),%rcx# 426e 

426e:   48 8d b8 c0 01 00 00lea0x1c0(%rax),%rdi
4275:   ba c8 18 00 00  mov$0x18c8,%edx
427a:   48 8d 35 00 00 00 00lea0x0(%rip),%rsi# 4281 

4281:   ff 15 00 00 00 00   call   *0x0(%rip)# 4287 

4287:   eb b4   jmp423d 

The main thing I want to raise attention about is the following bit:
add$0x1c0,%rax
lock xchg %dl,(%rax)

0x1c0 is the offset of info_lck in XLogCtlData. So the code first computes the
address of the lock in %rax and then does the xchg on that.

That's pretty odd, because on x86 this could just be encoded as an offset to
the address - as shown in the code for the unlock a bit later:
4251:   c6 82 c0 01 00 00 00movb   $0x0,0x1c0(%rdx)


After being confused for a while, the explanation is fairly simple: We use
volatile and dereference the address:

static __inline__ int
tas(volatile slock_t *lock)
{
slock_t _res = 1;

__asm__ __volatile__(
"   lock\n"
"   xchgb   %0,%1   \n"
:   "+q"(_res), "+m"(*lock)
:   /* no inputs */
:   "memory", "cc");
return (int) _res;
}

(note the (*lock) and the volatile in the signature).

I think it'd be just as defensible to not emit a separate load here, despite
the volatile, and indeed clang doesn't emit a separate load. But it also does
seem defensible to take translate the code very literally, as gcc does.


If I remove the volatile from the signature or cast it away, gcc indeed
generates the offset version:
4230:   f0 86 82 c0 01 00 00lock xchg %al,0x1c0(%rdx)


A second, even smaller, issue with the code is that we use "lock xchgb"
despite xchg having implied lock approximately forever ([2]). That makes the 
code
slightly wider than necessary (the lock prefix is one byte).


I doubt there's a lot of situations where these end up having a meaningful
performance impact, but it still seems suboptimal.   I may be seeing a *small*
gain in a workload inserting lots of tiny records, but it's hard to be sure if
it's above the noise floor.


I'm wondering in how many places our fairly broad use of volatiles causes
more substantially worse code being generated.

Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/20240729165154.56zqyg34x2ywkpsh%40awork3.anarazel.de
[2] https://www.felixcloutier.com/x86/xchg#description

Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Dean Rasheed

On Mon, 29 Jul 2024 at 18:57, Joel Jacobson  wrote:
>
> Thanks to v3-0002, they are all still significantly faster when both patches 
> have been applied,
> but I wonder if it is expected or not, that v3-0001 temporarily made them a 
> bit slower?
>

There's no obvious reason why 0001 would make those cases slower, but
the fact that, together with 0002, it's a significant net win, and the
gains for 5 and 6-digit inputs make it worthwhile, in my opinion.

Something I did notice in my tests was that if ndigits was a small
multiple of 8, the old code was disproportionately faster, which can
be explained by the fact that the computation fits exactly into a
whole number of XMM register operations, with no remaining digits to
process. For example, these results from above:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
--+--+---+---+--
   15 |   15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
   16 |   16 | 4.3353435e+06 |  4.970363e+06 | +14.65%
   17 |   17 | 3.9258755e+06 |  4.935394e+06 | +25.71%

   23 |   23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
   24 |   24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
   25 |   25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%

   31 |   31 |  2.169437e+06 | 3.7209152e+06 | +71.52%
   32 |   32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
   33 |   33 |   2.27133e+06 |  3.435459e+06 | +51.25%

(Note how 16x16 was much faster than 15x15, for example.)

The patched code seems to do a better job at levelling out and coping
with arbitrary-sized inputs, not just those that fit exactly into a
whole number of loops using SSE2 operations.

Something else I noticed was that the relative gains for large numbers
of digits were much higher with clang than with gcc:

gcc 13.3.0:

16383 |16383 | 21.629467 |  73.58552 | +240.21%

clang 15.0.7:

16383 |16383 | 11.562384 |  73.00517 | +531.40%

That seems to be because clang doesn't do a good job of generating
efficient SSE2 code in the old case of 16-bit x 16-bit
multiplications. Looking on godbolt.org, it generates
overly-complicated code using PMULUDQ, which actually does 32-bit x
32-bit multiplications. Gcc, on the other hand, generates a much more
compact loop, using PMULHW and PMULLW, which is much faster. With the
patch, they both generate the same SSE2 code, so the results are
pretty consistent.

Regards,
Dean

Re: tls 1.3: sending multiple tickets

On 2024-07-26 13:55:29 +0200, Daniel Gustafsson wrote:
> Thanks for review, I've applied this backpatched all the way.

Thanks for working on this!

Re: Refactoring postmaster's code to cleanup after child exit


On 06/07/2024 22:01, Heikki Linnakangas wrote:
Reading through postmaster code, I spotted some refactoring 
opportunities to make it slightly more readable.


Currently, when a child process exits, the postmaster first scans 
through BackgroundWorkerList to see if it was a bgworker process. If not 
found, it scans through the BackendList to see if it was a backend 
process (which it really should be then). That feels a bit silly, 
because every running background worker process also has an entry in 
BackendList. There's a lot of duplication between 
CleanupBackgroundWorker and CleanupBackend.


Before commit 8a02b3d732, we used to created Backend entries only for 
background worker processes that connected to a database, not for other 
background worker processes. I think that's why we have the code 
structure we have. But now that we have a Backend entry for all bgworker 
processes, it's more natural to have single function to deal with both 
regular backends and bgworkers.


So I came up with the attached patches. This doesn't make any meaningful 
user-visible changes, except for some incidental changes in log messages 
(see commit message for details).


New patch version attached. Fixed conflicts with recent commits, no 
other changes.


--
Heikki Linnakangas
Neon (https://neon.tech)
From 84ca49efab16cc2699f8446684a5ebe63dad1c38 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 29 Jul 2024 23:13:16 +0300
Subject: [PATCH v2 1/4] Fix outdated comment; all running bgworkers are in
 BackendList

Before commit 8a02b3d732, only bgworkers that connected to a database
had an entry in the Backendlist. Commit 8a02b3d732 changed that, but
forgot to update this comment.

Discussion: https://www.postgresql.org/message-id/835232c0-a5f7-4f20-b95b-5b56ba57d...@iki.fi
---
 src/include/postmaster/bgworker_internals.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/include/postmaster/bgworker_internals.h b/src/include/postmaster/bgworker_internals.h
index 9106a0ef3f..61ba54117a 100644
--- a/src/include/postmaster/bgworker_internals.h
+++ b/src/include/postmaster/bgworker_internals.h
@@ -26,14 +26,14 @@
 /*
  * List of background workers, private to postmaster.
  *
- * A worker that requests a database connection during registration will have
- * rw_backend set, and will be present in BackendList.  Note: do not rely on
- * rw_backend being non-NULL for shmem-connected workers!
+ * All workers that are currently running will have rw_backend set, and will
+ * be present in BackendList.
  */
 typedef struct RegisteredBgWorker
 {
 	BackgroundWorker rw_worker; /* its registry entry */
-	struct bkend *rw_backend;	/* its BackendList entry, or NULL */
+	struct bkend *rw_backend;	/* its BackendList entry, or NULL if not
+ * running */
 	pid_t		rw_pid;			/* 0 if not running */
 	int			rw_child_slot;
 	TimestampTz rw_crashed_at;	/* if not 0, time it last crashed */
-- 
2.39.2

From e541a6d2c8481cd9f9c75fb2328e8e6031cddac6 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 29 Jul 2024 23:13:56 +0300
Subject: [PATCH v2 2/4] Minor refactoring of assign_backendlist_entry()

Make assign_backendlist_entry() responsible just for allocating the
Backend struct. Linking it to the RegisteredBgWorker is the caller's
responsibility now. Seems more clear that way.

Discussion: https://www.postgresql.org/message-id/835232c0-a5f7-4f20-b95b-5b56ba57d...@iki.fi
---
 src/backend/postmaster/postmaster.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 02442a4b85..a3e9e8fdc0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -416,7 +416,7 @@ static void TerminateChildren(int signal);
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
 static int	CountChildren(int target);
-static bool assign_backendlist_entry(RegisteredBgWorker *rw);
+static Backend *assign_backendlist_entry(void);
 static void maybe_start_bgworkers(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static pid_t StartChildProcess(BackendType type);
@@ -4028,6 +4028,7 @@ MaxLivePostmasterChildren(void)
 static bool
 do_start_bgworker(RegisteredBgWorker *rw)
 {
+	Backend*bn;
 	pid_t		worker_pid;
 
 	Assert(rw->rw_pid == 0);
@@ -4042,11 +4043,14 @@ do_start_bgworker(RegisteredBgWorker *rw)
 	 * tried again right away, most likely we'd find ourselves hitting the
 	 * same resource-exhaustion condition.
 	 */
-	if (!assign_backendlist_entry(rw))
+	bn = assign_backendlist_entry();
+	if (bn == NULL)
 	{
 		rw->rw_crashed_at = GetCurrentTimestamp();
 		return false;
 	}
+	rw->rw_backend = bn;
+	rw->rw_child_slot = bn->child_slot;
 
 	ereport(DEBUG1,
 			(errmsg_internal("starting background worker process \"%s\"",
@@ -4119,12 +4123,10 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
  * Allocate

Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

Robert Haas  writes:
> On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
>> ... However, I added a new open item about how the
>> 040_pg_createsubscriber.pl test is slow and still unstable.

> But that said, I see no commits in the commit history which purport to
> improve performance, so I guess the performance is probably still not
> what you want, though I am not clear on the details.

My concern is described at [1]:

>> I have a different but possibly-related complaint: why is
>> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
>> runs for a bit over 19 seconds, which seems completely out of line
>> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
>> other test scripts in this directory take much less).  It looks
>> like most of the blame falls on this step:
>> 
>> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
>> 
>> AFAICS the amount of data being replicated is completely trivial,
>> so that it doesn't make any sense for this to take so long --- and
>> if it does, that suggests that this tool will be impossibly slow
>> for production use.  But I suspect there is a logic flaw causing
>> this.  Speculating wildly, perhaps that is related to the failure
>> Alexander spotted?

The followup discussion in that thread made it sound like there's
some fairly fundamental deficiency in how wait_for_end_recovery()
detects end-of-recovery.  I'm not too conversant with the details
though, and it's possible that pg_createsubscriber is just falling
foul of a pre-existing infelicity.

If the problem can be correctly described as "pg_createsubscriber
takes 10 seconds or so to detect end-of-stream", then it's probably
only an annoyance for testing and not something that would be fatal
in the real world.  I'm not quite sure if that's accurate, though.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/2377319.1719766794%40sss.pgh.pa.us#bba9f5ee0efc73151cc521a6bd5182ed

Re: Optimize mul_var() for var1ndigits >= 8

On Mon, Jul 29, 2024, at 22:01, Dean Rasheed wrote:
> On Mon, 29 Jul 2024 at 18:57, Joel Jacobson  wrote:
>>
>> Thanks to v3-0002, they are all still significantly faster when both patches 
>> have been applied,
>> but I wonder if it is expected or not, that v3-0001 temporarily made them a 
>> bit slower?
>>
>
> There's no obvious reason why 0001 would make those cases slower, but
> the fact that, together with 0002, it's a significant net win, and the
> gains for 5 and 6-digit inputs make it worthwhile, in my opinion.

Yes, I agree, I just thought it was noteworthy, but not a problem per se.

> Something I did notice in my tests was that if ndigits was a small
> multiple of 8, the old code was disproportionately faster, which can
> be explained by the fact that the computation fits exactly into a
> whole number of XMM register operations, with no remaining digits to
> process. For example, these results from above:
>
>  ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
> --+--+---+---+--
>15 |   15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
>16 |   16 | 4.3353435e+06 |  4.970363e+06 | +14.65%
>17 |   17 | 3.9258755e+06 |  4.935394e+06 | +25.71%
>
>23 |   23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
>24 |   24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
>25 |   25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%
>
>31 |   31 |  2.169437e+06 | 3.7209152e+06 | +71.52%
>32 |   32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
>33 |   33 |   2.27133e+06 |  3.435459e+06 | +51.25%
>
> (Note how 16x16 was much faster than 15x15, for example.)
>
> The patched code seems to do a better job at levelling out and coping
> with arbitrary-sized inputs, not just those that fit exactly into a
> whole number of loops using SSE2 operations.

That's nice.

> Something else I noticed was that the relative gains for large numbers
> of digits were much higher with clang than with gcc:
>
> gcc 13.3.0:
>
> 16383 |16383 | 21.629467 |  73.58552 | +240.21%
>
> clang 15.0.7:
>
> 16383 |16383 | 11.562384 |  73.00517 | +531.40%
>
> That seems to be because clang doesn't do a good job of generating
> efficient SSE2 code in the old case of 16-bit x 16-bit
> multiplications. Looking on godbolt.org, it generates
> overly-complicated code using PMULUDQ, which actually does 32-bit x
> 32-bit multiplications. Gcc, on the other hand, generates a much more
> compact loop, using PMULHW and PMULLW, which is much faster. With the
> patch, they both generate the same SSE2 code, so the results are
> pretty consistent.

Very nice.

I've now also had an initial look at the actual code of the patches:

* v3-0001

Looks pretty straight forward, nice with the PRODSUM macros,
that really improved readability a lot.

I like these simplifications, how `var2ndigits` is used instead of 
`res_ndigits`:
-   for (int i = res_ndigits - 3; i >= 1; i--)
+   for (int i = var2ndigits - 1; i >= 1; i--)

But I wonder why does `case 1:`  not follow the same pattern?
for (int i = res_ndigits - 2; i >= 0; i--)

* v3-0002

I think it's non-obvious if the separate code paths for 32-bit and 64-bit,
using `#if SIZEOF_DATUM < 8`, to get *fast* 32-bit support, outweighs
the benefits of simpler code.

You brought up the question if 32-bit systems should be regarded
as legacy previously in this thread.

Unfortunately, we didn't get any feedback, so I'm starting a separate
thread, with subject "Is fast 32-bit code still important?", hoping to get
more input to help us make judgement calls.

/Joel

Is fast 32-bit support still important?

Hello hackers,

I would like your help to collectively try to answer these three questions:

1. Who are the current users of 32-bit PostgreSQL?

2. Among these users, how many are upgrading to new major versions?

3. For how many of these users is performance critical?

This came up during ongoing work on optimizing numeric_mul [1].

To me, it's non-obvious whether introducing `#if SIZEOF_DATUM < 8` with
separate 32-bit and 64-bit code paths is worthwhile to maintain performance
for both.

Knowing more about $subject can hopefully help us reason about how much
additional code complication is justifiable for *fast* 32-bit support.

I checked the archives but only found a discussion on *dropping* 32-bit support
[2], which is a different topic.

Thanks for input!

/Joel

[1] https://postgr.es/m/9d8a4a42-c354-41f3-bbf3-199e1957d...@app.fastmail.com
[2] https://postgr.es/m/0a71b43129fb447988f152941e1db...@nidsa.net

Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Daniel Gustafsson

Thanks for working on this patchset, I'm looking over 0004 and 0005 but came
across a thing I wanted to bring up one thing sooner than waiting for the
review. In parse_device_authz we have this:

  {"user_code", JSON_TOKEN_STRING, {&authz->user_code}, REQUIRED},
  {"verification_uri", JSON_TOKEN_STRING, {&authz->verification_uri}, REQUIRED},

  /*
   * The following fields are technically REQUIRED, but we don't use
   * them anywhere yet:
   *
   * - expires_in
   */

  {"interval", JSON_TOKEN_NUMBER, {&authz->interval_str}, OPTIONAL},

Together with a colleage we found the Azure provider use "verification_url"
rather than xxx_uri.  Another discrepancy is that it uses a string for the
interval (ie: "interval":"5").  One can of course argue that Azure is wrong and
should feel bad, but I fear that virtually all (major) providers will have
differences like this, so we will have to deal with it in an extensible fashion
(compile time, not runtime configurable).

I was toying with making the name json_field name member an array, to allow
variations.  That won't help with the fieldtype differences though, so another
train of thought was to have some form of REQUIRED_XOR where fields can tied
together.  What do you think about something along these lines?

Another thing, shouldn't we really parse and interpret *all* REQUIRED fields
even if we don't use them to ensure that the JSON is wellformed?  If the JSON
we get is malformed in any way it seems like the safe/conservative option to
error out.

--
Daniel Gustafsson

Re: speed up a logical replica setup

2024-07-29 Thread Euler Taveira

On Wed, Jul 17, 2024, at 11:37 PM, Amit Kapila wrote:
> I am thinking of transactions between restart_lsn and "consistent
> point lsn" (aka the point after which all commits will be replicated).
> You conclusion seems correct to me that such transactions won't be
> replicated by streaming replication and would be skipped by logical
> replication. Now, if we can avoid that anyway, we can consider that.

Under reflection what I proposed [1] seems more complex and possibly
error prone than other available solutions. The recovery step was slow
if the server is idle (that's the case for the test). An idle server
usually doesn't have another WAL record after creating the replication
slots. Since pg_createsubscriber is using the LSN returned by the last
replication slot as recovery_target_lsn, this LSN is ahead of the
current WAL position and the recovery waits until something writes a
WAL record to reach the target and ends the recovery.

Hayato already mentioned one of the solution in a previous email [2].
AFAICS we can use any solution that creates a WAL record. I tested the
following options: 

\timing
select * from pg_create_logical_replication_slot('pg_createsubscriber', 
'pgoutput', true);
select pg_logical_emit_message(false, 'pg_createsubscriber', 'dummy');
select pg_log_standby_snapshot();
select pg_create_restore_point('pg_createsubscriber');

that results in the following output:

  slot_name  |lsn
-+---
pg_createsubscriber | 0/942DD28
(1 row)

Time: 200.571 ms
pg_logical_emit_message 
-
0/942DD78
(1 row)

Time: 0.938 ms
pg_log_standby_snapshot 
-
0/942DDB0
(1 row)

Time: 0.741 ms
pg_create_restore_point 
-
0/942DE18
(1 row)

Time: 0.870 ms

and generates the following WAL records:

rmgr: Standby len (rec/tot): 50/50, tx:  0, lsn: 
0/0942DCF0, prev 0/0942DCB8, desc: RUNNING_XACTS nextXid 3939 
latestCompletedXid 3938 oldestRunningXid 3939
rmgr: LogicalMessage len (rec/tot): 75/75, tx:  0, lsn: 
0/0942DD28, prev 0/0942DCF0, desc: MESSAGE non-transactional, prefix 
"pg_createsubscriber"; payload (5 bytes): 64 75 6D 6D 79
rmgr: Standby len (rec/tot): 50/50, tx:  0, lsn: 
0/0942DD78, prev 0/0942DD28, desc: RUNNING_XACTS nextXid 3939 
latestCompletedXid 3938 oldestRunningXid 3939
rmgr: XLOGlen (rec/tot): 98/98, tx:  0, lsn: 
0/0942DDB0, prev 0/0942DD78, desc: RESTORE_POINT pg_createsubscriber

The options are:

(a) temporary replication slot: requires an additional replication slot.
small payload. it is extremely slow in comparison with the other
options.
(b) logical message: can be consumed by logical replication when/if it
is supported some day. big payload. fast.
(c) snapshot of running txn:  small payload. fast.
(d) named restore point: biggest payload. fast.

I don't have a strong preference but if I need to pick one I would
choose option (c) or option (d). The option (a) is out of the question.

Opinions?

[1] 
https://www.postgresql.org/message-id/b1f0f8c7-8f01-4950-af77-339df3dc4684%40app.fastmail.com
[2] 
https://www.postgresql.org/message-id/OSBPR01MB25521B15BF950D2523BBE143F5D32%40OSBPR01MB2552.jpnprd01.prod.outlook.com

--
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: xid_wraparound tests intermittent failure.

2024-07-29 Thread Masahiko Sawada

On Sat, Jul 27, 2024 at 1:06 PM Andrew Dunstan  wrote:
>
>
> On 2024-07-26 Fr 1:46 PM, Masahiko Sawada wrote:
> > On Thu, Jul 25, 2024 at 6:52 PM Andrew Dunstan  wrote:
> >>
> >> On 2024-07-25 Th 3:40 PM, Masahiko Sawada wrote:
> >>
> >> On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada  
> >> wrote:
> >>
> >> On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan  
> >> wrote:
> >>
> >> On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
> >>
> >> See 
> >> 
> >>
> >>
> >> The failure logs are from a run where both tests 1 and 2 failed.
> >>
> >> Thank you for sharing the logs.
> >>
> >> I think that the problem seems to match what Alexander Lakhin
> >> mentioned[1]. Probably we can fix such a race condition somehow but
> >> I'm not sure it's worth it as setting autovacuum = off and
> >> autovacuum_max_workers = 1 (or a low number) is an extremely rare
> >> case. I think it would be better to stabilize these tests. One idea is
> >> to turn the autovacuum GUC parameter on while setting
> >> autovacuum_enabled = off for each table. That way, we can ensure that
> >> autovacuum workers are launched. And I think it seems to align real
> >> use cases.
> >>
> >> Regards,
> >>
> >> [1] 
> >> https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com
> >>
> >>
> >> OK, do you want to propose a patch?
> >>
> >> Yes, I'll prepare and share it soon.
> >>
> >> I've attached the patch. Could you please test if the patch fixes the
> >> instability you observed?
> >>
> >> Since we turn off autovacuum on all three tests and we wait for
> >> autovacuum to complete processing databases, these tests potentially
> >> have a similar (but lower) risk. So I modified these tests to turn it
> >> on so we can ensure the autovacuum runs periodically.
> >>
> >>
> >> I assume you actually meant to remove the "autovacuum = off" in 
> >> 003_wraparound.pl. With that change in your patch I retried my test, but 
> >> on iteration 100 out of 100 it failed on test 002_limits.pl.
> >>
> > I think we need to remove the "autovacuum = off' also in 002_limits.pl
> > as it waits for autovacuum to process both template0 and template1
> > databases. Just to be clear, the failure happened even without
> > "autovacuum = off"?
> >
>
> The attached patch, a slight modification of yours, removes "autovacuum
> = off" for all three tests, and given that a set of 200 runs was clean
> for me.

Oh I missed that I left "autovacuum = off' for some reason in 002
test. Thank you for attaching the patch, it looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Remove last traces of HPPA support

2024-07-29 Thread Thomas Munro

On Wed, Jul 3, 2024 at 8:09 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Here are some experimental patches to try out some ideas mentioned
> > upthread, that are approximately unlocked by that cleanup.
>
> FWIW, I'm good with getting rid of --disable-spinlocks and
> --disable-atomics.  That's a fair amount of code and needing to
> support it causes problems, as you say.  I am very much less
> excited about ripping out our spinlock and/or atomics code in favor
> of ; I just don't see the gain there, and I do see risk
> in ceding control of the semantics and performance of those
> primitives.

OK,  part on ice for now.  Here's an update of the rest,
this time also removing the barrier fallbacks as discussed in the LTO
thread[1].

I guess we should also consider reimplementing the spinlock on the
atomic API, but I can see that Andres is poking at spinlock code right
now so I'll keep out of his way...

Side issue: I noticed via CI failure when I tried to require
read/write barriers to be provided (a choice I backed out of), that on
MSVC we seem to be using the full memory barrier fallback for those.
Huh?  For x86, I think they should be using pg_compiler_barrier() (no
code gen, just prevent reordering), not pg_pg_memory_barrier(), no?
Perhaps I'm missing something but I suspect we might be failing to
include arch-x86.h on that compiler when we should... maybe it needs
to detect _M_AMD64 too?  For ARM, from a quick look, the only way to
reach real acquire/release barriers seems to be to use the C11
interface (which would also be fine on x86 where it should degrade to
a no-op compiler barrier or signal fence as the standard calls it),
but IIRC the Windows/ARM basics haven't gone in yet anyway.

[1] 
https://www.postgresql.org/message-id/flat/721bf39a-ed8a-44b0-8b8e-be3bd81db748%40technowledgy.de#66ba381b05e8ee08b11503b846acc4a1
From 33533817949052f7af423aaee0ef6e737031effb Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 2 Jul 2024 14:47:59 +1200
Subject: [PATCH v2 1/4] Remove --disable-spinlocks.

A later change will require atomic support, so it wouldn't make sense
for a new system not to be able to implement true spinlocks.

Discussion: https://postgr.es/m/3351991.1697728588%40sss.pgh.pa.us
---
 configure   |  40 --
 configure.ac|  13 --
 doc/src/sgml/installation.sgml  |  37 +
 meson.build |   6 -
 src/backend/port/atomics.c  |  26 
 src/backend/port/posix_sema.c   |   3 +-
 src/backend/port/sysv_sema.c|   3 +-
 src/backend/postmaster/launch_backend.c |   8 --
 src/backend/storage/ipc/ipci.c  |  10 --
 src/backend/storage/lmgr/Makefile   |   1 -
 src/backend/storage/lmgr/meson.build|   1 -
 src/backend/storage/lmgr/s_lock.c   |   2 +-
 src/backend/storage/lmgr/spin.c | 180 
 src/include/pg_config.h.in  |   3 -
 src/include/pg_config_manual.h  |  15 --
 src/include/port/atomics.h  |   4 +-
 src/include/port/atomics/fallback.h |   4 +-
 src/include/storage/s_lock.h|  39 +
 src/include/storage/spin.h  |  18 +--
 src/test/regress/regress.c  |  86 ---
 20 files changed, 13 insertions(+), 486 deletions(-)
 delete mode 100644 src/backend/storage/lmgr/spin.c

diff --git a/configure b/configure
index ea5514fab1..f8deaa8d78 100755
--- a/configure
+++ b/configure
@@ -836,7 +836,6 @@ enable_integer_datetimes
 enable_nls
 with_pgport
 enable_rpath
-enable_spinlocks
 enable_atomics
 enable_debug
 enable_profiling
@@ -1529,7 +1528,6 @@ Optional Features:
   enable Native Language Support
   --disable-rpath do not embed shared library search path in
   executables
-  --disable-spinlocks do not use spinlocks
   --disable-atomics   do not use atomic operations
   --enable-debug  build with debugging symbols (-g)
   --enable-profiling  build with profiling enabled
@@ -3266,33 +3264,6 @@ fi
 
 
 
-#
-# Spinlocks
-#
-
-
-# Check whether --enable-spinlocks was given.
-if test "${enable_spinlocks+set}" = set; then :
-  enableval=$enable_spinlocks;
-  case $enableval in
-yes)
-  :
-  ;;
-no)
-  :
-  ;;
-*)
-  as_fn_error $? "no argument expected for --enable-spinlocks option" "$LINENO" 5
-  ;;
-  esac
-
-else
-  enable_spinlocks=yes
-
-fi
-
-
-
 #
 # Atomic operations
 #
@@ -12185,17 +12156,6 @@ fi
 
 fi
 
-if test "$enable_spinlocks" = yes; then
-
-$as_echo "#define HAVE_SPINLOCKS 1" >>confdefs.h
-
-else
-  { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING:
-*** Not using spinlocks will cause poor performance." >&5
-$as_echo "$as_me: WARNING:
-*** Not using spinlocks will cause poor performance." >&2;}
-fi
-
 if test "$enable_atomics" = yes; then
 
 $as_echo "#define HAVE_ATOMICS 1" >>confdefs.h
diff --git a/configure.ac b/configure.ac
index 00

Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Dean Rasheed

On Mon, 29 Jul 2024 at 21:39, Joel Jacobson  wrote:
>
> I like these simplifications, how `var2ndigits` is used instead of 
> `res_ndigits`:
> -   for (int i = res_ndigits - 3; i >= 1; i--)
> +   for (int i = var2ndigits - 1; i >= 1; i--)
>
> But I wonder why does `case 1:`  not follow the same pattern?
> for (int i = res_ndigits - 2; i >= 0; i--)
>

Ah yes, that should be made the same. (I think I did do that at one
point, but then accidentally reverted it during a code refactoring.)

> * v3-0002
>
> I think it's non-obvious if the separate code paths for 32-bit and 64-bit,
> using `#if SIZEOF_DATUM < 8`, to get *fast* 32-bit support, outweighs
> the benefits of simpler code.
>
> You brought up the question if 32-bit systems should be regarded
> as legacy previously in this thread.
>
> Unfortunately, we didn't get any feedback, so I'm starting a separate
> thread, with subject "Is fast 32-bit code still important?", hoping to get
> more input to help us make judgement calls.
>

Looking at that other thread that you found [1], I think it's entirely
possible that there are people who care about 32-bit systems, which
means that we might well get complaints, if we make it slower for
them. Unfortunately, I don't have any way to test that (I doubt that
running a 32-bit executable on my x86-64 system is a realistic test).

Regards,
Dean

[1] https://postgr.es/m/0a71b43129fb447988f152941e1db...@nidsa.net

Speeding up ruleutils' name de-duplication code, redux

When deparsing queries or expressions, ruleutils.c has to generate
unique names for RTEs and columns of RTEs.  (Often, they're unique
already, but this isn't reliably true.)  The original logic for that
involved just strcmp'ing a proposed name against all the ones already
assigned, which obviously is O(N^2) in the number of names being
considered.  Back in commit 8004953b5, we fixed that problem for
generation of unique RTE names, by using a hash table to remember the
already-assigned names.  However, that commit did not touch the logic
for de-duplicating the column names within each RTE, explaining

In principle the same problem applies to the column-name-de-duplication
code; but in practice that seems to be less of a problem, first because
N is limited since we don't support extremely wide tables, and second
because duplicate column names within an RTE are fairly rare, so that in
practice the cost is more like O(N^2) not O(N^3).  It would be very much
messier to fix the column-name code, so for now I've left that alone.

But I think the time has come to do something about it.  In [1]
I presented this Perl script to generate a database that gives
pg_upgrade indigestion:

-
for (my $i = 0; $i < 100; $i++)
{
print "CREATE TABLE test_inh_check$i (\n";
for (my $j = 0; $j < 1000; $j++)
{
print "a$j float check (a$j > 10.2),\n";
}
print "b float);\n";
print "CREATE TABLE test_inh_check_child$i() 
INHERITS(test_inh_check$i);\n";
}
-

On my development machine, it takes over 14 minutes to pg_upgrade
this, and it turns out that that time is largely spent in column
name de-duplication while deparsing the CHECK constraints.  The
attached patch reduces that to about 3m45s.

(I think that we ought to reconsider MergeConstraintsIntoExisting's
use of deparsing to compare check constraints: it'd be faster and
probably more reliable to apply attnum translation to one parsetree
and then use equal().  But that's a matter for a different patch, and
this patch would still be useful for the pg_dump side of the problem.)

I was able to avoid a lot of the complexity I'd feared before by not
attempting to use hashing during set_using_names(), which only has to
consider columns merged by USING clauses, so it shouldn't have enough
of a performance problem to be worth touching.  The hashing code needs
to be optional anyway because it's unlikely to be a win for narrow
tables, so we can simply ignore it until we reach the potentially
expensive steps.  Also, things are already factored in such a way that
we only need to have one hashtable at a time, so this shouldn't cause
any large memory bloat.

I'll park this in the next CF.

regards, tom lane

[1] https://www.postgresql.org/message-id/2422717.1722201869%40sss.pgh.pa.us

diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 653685bffc..9eb4b858ff 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -223,6 +223,10 @@ typedef struct
  * of aliases to columns of the right input.  Thus, positions in the printable
  * column alias list are not necessarily one-for-one with varattnos of the
  * JOIN, so we need a separate new_colnames[] array for printing purposes.
+ *
+ * Finally, when dealing with wide tables we risk O(N^2) costs in assigning
+ * non-duplicate column names.  We ameliorate that by using a hash table that
+ * holds all the strings appearing in colnames, new_colnames, and parentUsing.
  */
 typedef struct
 {
@@ -290,6 +294,15 @@ typedef struct
 	int		   *leftattnos;		/* left-child varattnos of join cols, or 0 */
 	int		   *rightattnos;	/* right-child varattnos of join cols, or 0 */
 	List	   *usingNames;		/* names assigned to merged columns */
+
+	/*
+	 * Hash table holding copies of all the strings appearing in this struct's
+	 * colnames, new_colnames, and parentUsing.  We use a hash table only for
+	 * sufficiently wide relations, and only during the colname-assignment
+	 * functions set_relation_column_names and set_join_column_names;
+	 * otherwise, names_hash is NULL.
+	 */
+	HTAB	   *names_hash;		/* entries are just strings */
 } deparse_columns;
 
 /* This macro is analogous to rt_fetch(), but for deparse_columns structs */
@@ -375,6 +388,9 @@ static bool colname_is_unique(const char *colname, deparse_namespace *dpns,
 static char *make_colname_unique(char *colname, deparse_namespace *dpns,
  deparse_columns *colinfo);
 static void expand_colnames_array_to(deparse_columns *colinfo, int n);
+static void build_colinfo_names_hash(deparse_columns *colinfo);
+static void add_to_names_hash(deparse_columns *colinfo, const char *name);
+static void destroy_colinfo_names_hash(deparse_columns *colinfo);
 static void identify_join_columns(JoinExpr *j, RangeTblEntry *jrte,
   deparse_columns *colinfo);
 static char *get_rtable_name(int rtindex, deparse_context *context)

Re: ecdh support causes unnecessary roundtrips

2024-07-29 Thread Daniel Gustafsson

> On 17 Jun 2024, at 19:56, Andres Freund  wrote:
> On 2024-06-17 19:51:45 +0200, Daniel Gustafsson wrote:

>> Changing the default of the ecdh GUC would perhaps be doable?
> 
> I was wondering whether we could change the default so that it accepts both
> x25519 and secp256r1. Unfortunately that seems to requires changing what we
> use to set the parameter...

Right.  The patch in https://commitfest.postgresql.org/48/5025/ does allow for
accepting both but that's a different discussion.

Changing, and backpatching, the default to at least keep new installations from
extra roundtrips doesn't seem that far off in terms of scope from what
860fe27ee1e2 backpatched.  Maybe it can be an option.

>> Amending the documentation is the one thing we certainly can do but 99.9% of
>> affected users won't know they are affected so won't look for that section.
> 
> Yea. It's also possible that some other bindings changed their default to
> match ours...

There is that possibility, though I think we would've heard something about
that by now if that had happened.

--
Daniel Gustafsson

Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Jacob Champion

On Mon, Jul 29, 2024 at 5:02 AM Peter Eisentraut  wrote:
> We should take the check for exit() calls from libpq and expand it to
> cover the other libraries as well.  Maybe there are other problems like
> this?

Seems reasonable, yeah.

> But under what circumstances does "the linker doesn't strip out" happen?
>   If this happens accidentally, then we should have seen some buildfarm
> failures or something?

On my machine, for example, I see differences with optimization
levels. Say you inadvertently call pfree() in a _shlib build, as I did
multiple times upthread. By itself, that shouldn't actually be a
problem (it eventually redirects to free()), so it should be legal to
call pfree(), and with -O2 the build succeeds. But with -Og, the
exit() check trips, and when I disassemble I see that pg_malloc() et
all have infected the shared object. After all, we did tell the linker
to put that object file in, and we don't ask it to garbage-collect
sections.

> Also, one could look further and notice that restricted_token.c and
> sprompt.c both a) are not needed by libpq and b) can trigger exit()
> calls.  Then it's not clear why those are not affected.

I think it's easier for the linker to omit whole object files rather
than partial ones. If libpq doesn't use any of those APIs there's not
really a reason to trip over it.

(Maybe the _shlib variants should just contain the minimum objects
required to compile.)

> I'm reminded of thread [0].  I think there is quite a bit of confusion
> about the pqexpbuffer vs. stringinfo APIs, and they are probably used
> incorrectly quite a bit.  There are now also programs that use both of
> them!  This patch now introduces another layer on top of them.  I fear,
> at the end, nobody is going to understand any of this anymore.

"anymore"? :)

In all seriousness -- I agree that this isn't sustainable. At the
moment the worst pain (the new layer) is isolated to jsonapi.c, which
seems like an okay place to try something new, since there aren't that
many clients. But to be honest I'm not excited about deciding the Best
Way Forward based on a sample size of JSON.

> Also,
> changing all the programs to link in libpq for pqexpbuffer seems like
> the opposite direction from what was suggested in [0].

(I don't really want to keep that new libpq dependency. We'd just have
to decide where PQExpBuffer is going to go if we're not okay with it.)

> I think we need to do some deeper thinking here about how we want the
> memory management on the client side to work.  Maybe we could just use
> one API but have some flags or callbacks to control the out-of-memory
> behavior.

Any src/common code that needs to handle both in-band and out-of-band
failure modes will still have to decide whether it's going to 1)
duplicate code paths or 2) just act as if in-band failures can always
happen. I think that's probably essential complexity; an ideal API
might make it nicer to deal with but it can't abstract it away.

Thanks!
--Jacob

Re: Optimize mul_var() for var1ndigits >= 8

Dean Rasheed  writes:
> On Mon, 29 Jul 2024 at 21:39, Joel Jacobson  wrote:
>> I think it's non-obvious if the separate code paths for 32-bit and 64-bit,
>> using `#if SIZEOF_DATUM < 8`, to get *fast* 32-bit support, outweighs
>> the benefits of simpler code.

> Looking at that other thread that you found [1], I think it's entirely
> possible that there are people who care about 32-bit systems, which
> means that we might well get complaints, if we make it slower for
> them. Unfortunately, I don't have any way to test that (I doubt that
> running a 32-bit executable on my x86-64 system is a realistic test).

I think we've already done things that might impact 32-bit systems
negatively (5e1f3b9eb for instance), and not heard a lot of pushback.
I would argue that anyone still running PG on 32-bit must have pretty
minimal performance requirements, so that they're unlikely to care if
numeric_mul gets slightly faster or slower.  Obviously a *big*
performance drop might get pushback.

regards, tom lane

Re: Incremental View Maintenance, take 2

2024-07-29 Thread Kirill Reshke

On Sat, 27 Jul 2024 at 13:26, Kirill Reshke  wrote:
>
> Hi!
> Cloudberry DB (Greenplum fork) uses IMMV feature for AQUMV (auto query
> use matview) feature, so i got interested in how it is implemented.
>
> On Thu, 11 Jul 2024 at 09:24, Yugo NAGATA  wrote:
> >
> > I updated the patch to bump up the version numbers in psql and pg_dump codes
> > from 17 to 18.
>
> Few suggestions:
>
> 1) `Add-relisivm-column-to-pg_class-system-catalog` commit message
> should be fixed, there is "isimmv" in the last line.
> 2) I dont get why `Add-Incremental-View-Maintenance-support.patch`
> goes after 0005 & 0004. Shoulndt we first implement feature server
> side, only when client (psql & pg_dump) side?
> 3) Can we provide regression tests for each function separately? Test
> for main feature in main patch, test for DISTINCT support in
> v34-0007-Add-DISTINCT-support-for-IVM.patch etc? This way the patchset
> will be easier to review, and can be committed separelety.
> 4) v34-0006-Add-Incremental-View-Maintenance-support.patch no longer
> applies due to 4b74ebf726d444ba820830cad986a1f92f724649. After
> resolving issues manually, it does not compile, because
> 4b74ebf726d444ba820830cad986a1f92f724649 also removes
> save_userid/save_sec_context fields from ExecCreateTableAs.
>
> >   if (RelationIsIVM(matviewRel) && stmt->skipData)
> Now this function accepts skipData param.
>
> 5) For DISTINCT support patch uses hidden __ivm* columns. Is this
> design discussed anywhere? I wonder if this is a necessity (only
> solution) or if there are alternatives.
> 6)
> What are the caveats of supporting some simple cases for aggregation
> funcs like in example?
> ```
> regress=# CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_2 AS SELECT
> sum(j) + sum(i) from mv_base_a;
> ERROR:  expression containing an aggregate in it is not supported on
> incrementally maintainable materialized view
> ```
> I can see some difficulties with division CREATE IMMV  AS SELECT
> 1/sum(i) from mv_base_a;  (sum(i) == 0 case), but adding &
> multiplication should be ok, aren't they?
>
>
> Overall, patchset looks mature, however it is far from being
> committable due to lack of testing/feedback/discussion. There is only
> one way to fix this... Test and discuss it!
>
>
> [1] https://github.com/cloudberrydb/cloudberrydb

Hi! Small update: I tried to run a regression test and all
IMMV-related tests failed on my vm. Maybe I'm doing something wrong, I
will try to investigate.

Another suggestion: support for \d and \d+ commands in psql. With v34
patchset applied, psql does not show anything IMMV-related in \d mode.

```
reshke=# \d m1
   Materialized view "public.m1"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 i  | integer |   |  |
Distributed by: (i)


reshke=# \d+ m1
 Materialized view "public.m1"
 Column |  Type   | Collation | Nullable | Default | Storage |
Compression | Stats target | Description
+-+---+--+-+-+-+--+-
 i  | integer |   |  | | plain   |
|  |
View definition:
 SELECT t1.i
   FROM t1;
Distributed by: (i)
Access method: heap

```

Output should be 'Incrementally materialized view "public.m1"' IMO.

Re: Use pgBufferUsage for block reporting in analyze

2024-07-29 Thread Masahiko Sawada

Hi,

On Mon, Jul 29, 2024 at 12:12 AM Anthonin Bonnefoy
 wrote:
>
> On Sat, Jul 27, 2024 at 12:35 AM Masahiko Sawada  
> wrote:
> > An alternative idea would
> > be to pass StringInfo to AcquireSampleRowsFunc() so that callback can
> > write its messages there. This is somewhat similar to what we do in
> > the EXPLAIN command (cf, ExplainPropertyText() etc). It could be too
> > much but I think it could be better than writing logs in the single
> > format.
>

I have one comment on 0001 patch:

 /*
  * Calculate the difference in the Page
Hit/Miss/Dirty that
  * happened as part of the analyze by
subtracting out the
  * pre-analyze values which we saved above.
  */
-AnalyzePageHit = VacuumPageHit - AnalyzePageHit;
-AnalyzePageMiss = VacuumPageMiss - AnalyzePageMiss;
-AnalyzePageDirty = VacuumPageDirty - AnalyzePageDirty;
+memset(&bufferusage, 0, sizeof(BufferUsage));
+BufferUsageAccumDiff(&bufferusage,
&pgBufferUsage, &startbufferusage);
+
+total_blks_hit = bufferusage.shared_blks_hit +
+bufferusage.local_blks_hit;
+total_blks_read = bufferusage.shared_blks_read +
+bufferusage.local_blks_read;
+total_blks_dirtied = bufferusage.shared_blks_dirtied +
+bufferusage.local_blks_dirtied;

The comment should also be updated or removed.

And here are some comments on 0002 patch:

-   TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
+   delay_in_ms = TimestampDifferenceMilliseconds(starttime, endtime);

I think that this change is to make vacuum code consistent with
analyze code, particularly the following part:

/*
 * We do not expect an analyze to take > 25 days and it simplifies
 * things a bit to use TimestampDifferenceMilliseconds.
 */
delay_in_ms = TimestampDifferenceMilliseconds(starttime, endtime);

However, as the above comment says, delay_in_ms can have a duration up
to 25 days. I guess it would not be a problem for analyze cases but
could be in vacuum cases as vacuum could sometimes be running for a
very long time. I've seen vacuums running even for almost 1 month. So
I think we should keep this part.

---
 /* measure elapsed time iff autovacuum logging requires it */
-if (AmAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
+if (instrument)

The comment should also be updated.

---
Could you split the 0002 patch into two patches? One is to have
ANALYZE command (with VERBOSE option) write the buffer usage, and
second one is to add WAL usage to both ANALYZE command and
autoanalyze. I think adding WAL usage to ANALYZE could be
controversial as it should not be WAL-intensive operation, so I'd like
to keep them separate.


> I've tested this approach, it definitely looks better. I've added a
> logbuf StringInfo to AcquireSampleRowsFunc which will receive the
> logs. elevel was removed as it is not used anymore. Since everything
> is in the same log line, I've removed the relation name in the acquire
> sample functions.
>
> For partitioned tables, I've also added the processed partition table
> being sampled. The output will look like:
>
> INFO:  analyze of table "postgres.public.test_partition"
> Sampling rows from child "public.test_partition_1"
> pages: 5 of 5 scanned
> tuples: 999 live tuples, 0 are dead; 999 tuples in sample, 999
> estimated total tuples
> Sampling rows from child "public.test_partition_2"
> pages: 5 of 5 scanned
> tuples: 1000 live tuples, 0 are dead; 1000 tuples in sample, 1000
> estimated total tuples
> avg read rate: 2.604 MB/s, avg write rate: 0.000 MB/s
> ...
>
> For a file_fdw, the output will be:
>
> INFO:  analyze of table "postgres.public.pglog"
> tuples: 60043 tuples; 3 tuples in sample
> avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
> ...

Thank you for updating the patch. With your patch, I got the following
logs for when executing ANALYZE VERBOSE on a partitioned table:

postgres(1:3971560)=# analyze (verbose) p;
INFO:  analyzing "public.p" inheritance tree
INFO:  analyze of table "postgres.public.p"
Sampling rows from child "public.c1"
pages: 1 of 14750 scanned
tuples: 2259833 live tuples, 0 are dead; 1 tuples in sample,
254 estimated total tuples
Sampling rows from child "public.c2"
pages: 1 of 14750 scanned
tuples: 226 live tuples, 0 are dead; 1 tuples in sample,
500 estimated total tuples
Sampling rows from child "public.c3"
pages: 1 of 14750 scanned
tuples: 2259833 live tuples, 0 are dead; 1 tuples in sample,
254 estimated total tuples
avg read rate: 335.184 MB/s, avg write rate: 0.031 MB/s
buffer

Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Jacob Champion

On Mon, Jul 29, 2024 at 1:51 PM Daniel Gustafsson  wrote:
> Together with a colleage we found the Azure provider use "verification_url"
> rather than xxx_uri.

Yeah, I think that's originally a Google-ism. (As far as I can tell
they helped author the spec for this and then didn't follow it. :/ ) I
didn't recall Azure having used it back when I was testing against it,
though, so that's good to know.

> Another discrepancy is that it uses a string for the
> interval (ie: "interval":"5").

Oh, that's a new one. I don't remember needing to hack around that
either; maybe iddawc handled it silently?

> One can of course argue that Azure is wrong and
> should feel bad, but I fear that virtually all (major) providers will have
> differences like this, so we will have to deal with it in an extensible 
> fashion
> (compile time, not runtime configurable).

Such is life... verification_url we will just have to deal with by
default, I think, since Google does/did it too. Not sure about
interval -- but do we want to make our distribution maintainers deal
with a compile-time setting for libpq, just to support various OAuth
flavors? To me it seems like we should just hold our noses and support
known (large) departures in the core.

> I was toying with making the name json_field name member an array, to allow
> variations.  That won't help with the fieldtype differences though, so another
> train of thought was to have some form of REQUIRED_XOR where fields can tied
> together.  What do you think about something along these lines?

If I designed it right, just adding alternative spellings directly to
the fields list should work. (The "required" check is by struct
member, not name, so both spellings can point to the same
destination.) The alternative typing on the other hand might require
something like a new sentinel "type" that will accept both... I hadn't
expected that.

> Another thing, shouldn't we really parse and interpret *all* REQUIRED fields
> even if we don't use them to ensure that the JSON is wellformed?  If the JSON
> we get is malformed in any way it seems like the safe/conservative option to
> error out.

Good, I was hoping to have a conversation about that. I am fine with
either option in principle. In practice I expect to add code to use
`expires_in` (so that we can pass it to custom OAuth hook
implementations) and `scope` (to check if the server has changed it on
us).

That leaves the provider... Forcing the provider itself to implement
unused stuff in order to interoperate seems like it could backfire on
us, especially since IETF standardized an alternate .well-known URI
[1] that changes some of these REQUIRED things into OPTIONAL. (One way
for us to interpret this: those fields may be required for OpenID, but
your OAuth provider might not be an OpenID provider, and our code
doesn't require OpenID.) I think we should probably tread lightly in
that particular case. Thoughts on that?

Thanks!
--Jacob

[1] https://www.rfc-editor.org/rfc/rfc8414.html

Re: Restart pg_usleep when interrupted

2024-07-29 Thread Sami Imseih




> On Jul 26, 2024, at 3:27 AM, Bertrand Drouvot  
> wrote:
> 
> Hi,
> 
> On Thu, Jul 25, 2024 at 05:27:15PM -0500, Sami Imseih wrote:
>> I am attaching v3 of the patch which addresses the comments made
>> earlier by Bertrand about the comment in the patch [1].
> 
> Thanks!
> 
> Looking at it:
> 
> 1 ===
> 
> +   struct instr_time start_time;
> 
> I think we can get rid of the "struct" keyword here.
> 
> 2 ===
> 
> +   struct instr_time current_time;
> +   struct instr_time elapsed_time;
> 
> Same as above.

Will fix those 2.

> 
> 3 ===
> 
> I gave more thoughts and I think it can be simplified a bit to reduce the
> number of operations in the while loop.
> 
> What about relying on a "absolute" time that way:
> 
>   instr_time absolute;
>absolute.ticks = start_time.ticks + msec * 100;
> 
> and then in the while loop:
> 
>while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
>{
>instr_time current_time;
>INSTR_TIME_SET_CURRENT(current_time);
> 
>if (current_time.ticks > absolute.ticks)
>{
>break;

While I agree this code is cleaner, myy hesitation there is we don’t 
have any other place in which we access .ticks directly and the 
common practice is to use the intsr_time.h APIs.


What do you think?


Regards,

Sami

Re: Remove last traces of HPPA support


On 30/07/2024 00:50, Thomas Munro wrote:

On Wed, Jul 3, 2024 at 8:09 PM Tom Lane  wrote:

Thomas Munro  writes:

Here are some experimental patches to try out some ideas mentioned
upthread, that are approximately unlocked by that cleanup.


FWIW, I'm good with getting rid of --disable-spinlocks and
--disable-atomics.  That's a fair amount of code and needing to
support it causes problems, as you say.  I am very much less
excited about ripping out our spinlock and/or atomics code in favor
of ; I just don't see the gain there, and I do see risk
in ceding control of the semantics and performance of those
primitives.


OK,  part on ice for now.  Here's an update of the rest,
this time also removing the barrier fallbacks as discussed in the LTO
thread[1].


Looks good to me.


I guess we should also consider reimplementing the spinlock on the
atomic API, but I can see that Andres is poking at spinlock code right
now so I'll keep out of his way...

Side issue: I noticed via CI failure when I tried to require
read/write barriers to be provided (a choice I backed out of), that on
MSVC we seem to be using the full memory barrier fallback for those.
Huh?  For x86, I think they should be using pg_compiler_barrier() (no
code gen, just prevent reordering), not pg_pg_memory_barrier(), no?


Agreed, arch-x86.h is quite clear on that.


Perhaps I'm missing something but I suspect we might be failing to
include arch-x86.h on that compiler when we should... maybe it needs
to detect _M_AMD64 too? 


Aha, yes I think that's it. Apparently, __x86_64__ is not defined on 
MSVC. To prove that, I added garbage to the "#ifdef __x86_64__" guarded 
block in atomics.h. The compilation passes on MSVC, but not on other 
platforms: https://cirrus-ci.com/build/6310061188841472.


That means that we're not getting the x86-64 instructions in 
src/port/pg_crc32c_sse42.c on MSVC either.


I think we should do:

#ifdef _M_AMD64
#define __x86_64__
#endif

somewhere, perhaps in src/include/port/win32.h.

--
Heikki Linnakangas
Neon (https://neon.tech)

Re: psql: Add leakproof field to \dAo+ meta-command results

2024-07-29 Thread Erik Wienhold

On 2024-07-01 15:08 +0200, Yugo NAGATA wrote:
> I would like to propose to add a new field to psql's \dAo+ meta-command
> to show whether the underlying function of an operator is leak-proof.

+1 for making that info easily accessible.

> This idea is inspired from [1] that claims some indexes uses non-LEAKPROOF
> functions under the associated operators, as a result, it can not be selected
> for queries with security_barrier views or row-level security policies.
> The original proposal was to add a query over system catalogs for looking up
> non-leakproof operators to the documentation, but I thought it is useful
> to improve \dAo results rather than putting such query to the doc.
> 
> The attached patch adds the field to \dAo+ and also a description that
> explains the relation between indexes and security quals with referencing
> \dAo+ meta-command.
> 
> [1] 
> https://www.postgresql.org/message-id/raw/5af3bf0c-5e0c-4128-81dc-084c5258b1af%40code406.com

\dAo+ output looks good.

But this patch fails regression tests in src/test/regress/sql/psql.sql
(\dAo+ btree float_ops) because of the new leak-proof column.  I think
this could even be changed to "\dAo+ btree array_ops|float_ops" to also
cover operators that are not leak-proof.

+
+For example, an index scan can not be selected for queries with

I check the docs and "cannot" is more commonly used than "can not".

+security_barrier views or row-level security policies 
if an
+operator used in the WHERE clause is associated with the
+operator family of the index, but its underlying function is not marked
+LEAKPROOF. The  program's
+\dAo+ meta-command is useful for listing the operators
+with associated operator families and whether it is leak-proof.
+

I think the last sentence can be improved.  How about: "Use psql's \dAo+
command to list operator families and tell which of their operators are
marked as leak-proof."?  Should something similar be added to [1] which
also talks about leak-proof operators?

The rest is just formatting nitpicks:

+ ", ofs.opfname AS \"%s\"\n,"

The trailing comma should come before the newline.

+ "  CASE\n"
+ "   WHEN p.proleakproof THEN '%s'\n"
+ "   ELSE '%s'\n"
+ " END AS \"%s\"\n",

WHEN/ELSE/END should be intended with one additional space to be
consistent with the other CASE expressions in this query.

[1] https://www.postgresql.org/docs/devel/planner-stats-security.html

-- 
Erik

Re: speed up a logical replica setup

2024-07-29 Thread Euler Taveira

On Mon, Jul 29, 2024, at 6:11 PM, Euler Taveira wrote:
> The options are:
> 
> (a) temporary replication slot: requires an additional replication slot.
> small payload. it is extremely slow in comparison with the other
> options.
> (b) logical message: can be consumed by logical replication when/if it
> is supported some day. big payload. fast.
> (c) snapshot of running txn:  small payload. fast.
> (d) named restore point: biggest payload. fast.
> 
> I don't have a strong preference but if I need to pick one I would
> choose option (c) or option (d). The option (a) is out of the question.

I'm attaching a patch that implements option (c). While reading the code
I noticed that I left a comment that should be removed by commit
b9639138262. 0002 removes it.


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From f4afe05fc7e73c5c23bcdeba4fc65a538c83b8ba Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 29 Jul 2024 19:44:16 -0300
Subject: [PATCH 1/2] pg_createsubscriber: fix slow recovery

If the primary server is idle when you are running pg_createsubscriber,
it used to take some time during recovery. The reason is that it was
using the LSN returned by pg_create_logical_replication_slot as
recovery_target_lsn. This LSN points to the next WAL record that might
not be available at WAL, hence, the recovery routine waits until some
activity writes a WAL record to end the recovery. Inject a new WAL
record after the last replication slot to avoid slowness.

Discussion: https://www.postgresql.org/message-id/2377319.1719766794%40sss.pgh.pa.us
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 23 +
 1 file changed, 23 insertions(+)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index b02318782a6..00976c643a1 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -778,6 +778,29 @@ setup_publisher(struct LogicalRepInfo *dbinfo)
 		else
 			exit(1);
 
+		/*
+		 * An idle server might not write a new WAL record until the recovery
+		 * is about to end. Since pg_createsubscriber is using the LSN
+		 * returned by the last replication slot as recovery_target_lsn, this
+		 * LSN is ahead of the current WAL position and the recovery waits
+		 * until something writes a WAL record to reach the target and ends
+		 * the recovery. To avoid the recovery slowness in this case, injects
+		 * a new WAL record here.
+		 */
+		if (i == num_dbs - 1 && !dry_run)
+		{
+			PGresult   *res;
+
+			res = PQexec(conn, "SELECT pg_log_standby_snapshot()");
+			if (PQresultStatus(res) != PGRES_TUPLES_OK)
+			{
+pg_log_error("could not write an additional WAL record: %s",
+			 PQresultErrorMessage(res));
+disconnect_database(conn, true);
+			}
+			PQclear(res);
+		}
+
 		disconnect_database(conn, false);
 	}
 
-- 
2.30.2

From 75558e8379abae3a642583f31b21e0ca5db80d2b Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 29 Jul 2024 20:59:32 -0300
Subject: [PATCH 2/2] pg_createsubscriber: remove obsolete comment

This comment should have been removed by commit b9639138262. There is no
replication slot check on primary anymore.
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index 00976c643a1..87668640f78 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -2209,10 +2209,7 @@ main(int argc, char **argv)
 	stop_standby_server(subscriber_dir);
 
 	/*
-	 * Create the required objects for each database on publisher. This step
-	 * is here mainly because if we stop the standby we cannot verify if the
-	 * primary slot is in use. We could use an extra connection for it but it
-	 * doesn't seem worth.
+	 * Create the required objects for each database on publisher.
 	 */
 	consistent_lsn = setup_publisher(dbinfo);
 
-- 
2.30.2

Re: Remove last traces of HPPA support

2024-07-29 Thread Thomas Munro

On Tue, Jul 30, 2024 at 11:16 AM Heikki Linnakangas  wrote:
> On 30/07/2024 00:50, Thomas Munro wrote:
> > On Wed, Jul 3, 2024 at 8:09 PM Tom Lane  wrote:
> >> Thomas Munro  writes:
> > OK,  part on ice for now.  Here's an update of the rest,
> > this time also removing the barrier fallbacks as discussed in the LTO
> > thread[1].
>
> Looks good to me.

Thanks.  I'll wait just a bit longer to see if anyone else has comments.

> > Perhaps I'm missing something but I suspect we might be failing to
> > include arch-x86.h on that compiler when we should... maybe it needs
> > to detect _M_AMD64 too?
>
> Aha, yes I think that's it. Apparently, __x86_64__ is not defined on
> MSVC. To prove that, I added garbage to the "#ifdef __x86_64__" guarded
> block in atomics.h. The compilation passes on MSVC, but not on other
> platforms: https://cirrus-ci.com/build/6310061188841472.
>
> That means that we're not getting the x86-64 instructions in
> src/port/pg_crc32c_sse42.c on MSVC either.
>
> I think we should do:
>
> #ifdef _M_AMD64
> #define __x86_64__
> #endif
>
> somewhere, perhaps in src/include/port/win32.h.

Hmm.  I had come up with the opposite solution, because we already
tested for _M_AMD64 explicitly elsewhere, and also I was thinking we
would back-patch, and I don't want to cause problems for external code
that thinks that __x86_64__ implies it can bust out some GCC inline
assembler or something.  But I don't have a strong opinion, your idea
is certainly simpler to implement and I also wouldn't mind much if we
just fixed it in master only, for fear of subtle breakage...

Same problem probably exists for i386.  I don't think CI, build farm
or the EDB packaging team do 32 bit Windows, so that makes it a little
hard to know if your blind code changes have broken or fixed
anything... on the other hand it's pretty simple...

I wondered if the pre-Meson system might have somehow defined
__x86_64__, but I'm not seeing it.  Commit b64d92f1a56 explicitly
mentions that it was tested on MSVC, so I guess maybe it was just
always "working" but not quite taking the intended code paths?  Funny
though, that code that calls _mm_pause() on AMD64 or the __asm thing
that only works on i386 doesn't look like blind code to me.  Curious.
From 47a8445c946e3792247fcf818c6e60ae72693f5c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 30 Jul 2024 11:01:26 +1200
Subject: [PATCH] Fix x86 architecture detection on MSVC.

We were looking for __x86_64__, but MSVC calls it _M_AMD64.  Therefore
we were mapping pg_{read,write}_barrier() to expensive
pg_memory_barrier() instead of pg_compiler_barrier(), and not using the
intended spinlock delay primitive.  A couple of other places missed it
as well.

The problem probably exists for _M_IX86 (32 bit) too; this is untested
due to lack of 32 bit Windows CI, but that macro was already used in our
tree so it seems safe to use it in new places.

Back-patch to all supported releases.

Reviewed-by: Heikki Linnakangas 
Discussion: https://postgr.es/m/CA%2BhUKGKAf_i6w7hB_3pqZXQeqn%2BixvY%2BCMps_n%3DmJ5HAatMjMw%40mail.gmail.com
---
 contrib/pgcrypto/crypt-blowfish.c   | 4 ++--
 src/include/port/atomics.h  | 3 ++-
 src/include/port/atomics/arch-x86.h | 2 +-
 src/port/pg_crc32c_sse42.c  | 2 +-
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/contrib/pgcrypto/crypt-blowfish.c b/contrib/pgcrypto/crypt-blowfish.c
index 5a1b1e1009..c34e66b2f7 100644
--- a/contrib/pgcrypto/crypt-blowfish.c
+++ b/contrib/pgcrypto/crypt-blowfish.c
@@ -38,10 +38,10 @@
 #include "px-crypt.h"
 #include "px.h"

-#ifdef __i386__
+#if defined(__i386__) || defined(_M_IX86)
 #define BF_ASM0	/* 1 */
 #define BF_SCALE			1
-#elif defined(__x86_64__)
+#elif defined(__x86_64__) || defined(_M_AMD64)
 #define BF_ASM0
 #define BF_SCALE			1
 #else
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index f6fa432d2d..ec59745168 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -65,7 +65,8 @@
  */
 #if defined(__arm__) || defined(__arm) || defined(__aarch64__)
 #include "port/atomics/arch-arm.h"
-#elif defined(__i386__) || defined(__i386) || defined(__x86_64__)
+#elif defined(__i386__) || defined(__i386) || defined(_M_IX86) || \
+	  defined(__x86_64__) || defined(_M_AMD64)
 #include "port/atomics/arch-x86.h"
 #elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
 #include "port/atomics/arch-ppc.h"
diff --git a/src/include/port/atomics/arch-x86.h b/src/include/port/atomics/arch-x86.h
index 2a8eca30fc..4ecf540d12 100644
--- a/src/include/port/atomics/arch-x86.h
+++ b/src/include/port/atomics/arch-x86.h
@@ -113,7 +113,7 @@ pg_spin_delay_impl(void)
 {
 	__asm__ __volatile__(" rep; nop			\n");
 }
-#elif defined(_MSC_VER) && defined(__x86_64__)
+#elif defined(_MSC_VER) && defined(_M_AMD64)
 #define PG_HAVE_SPIN_DELAY
 static __forceinline void
 pg_spin_delay_impl(void)
diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-07-29 Thread Sutou Kouhei

Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Sun, 28 Jul 2024 22:49:47 +0800,
  Junwang Zhao  wrote:

> Thanks for updating the patches, I applied them and test
> in my local machine, I did not use tmpfs in my test, I guess
> if I run the tests enough rounds, the OS will cache the
> pages, below is my numbers(I run each test 30 times, I
> count for the last 10 ones):
> 
> HEADPATCHED
> 
> COPY to_tab_30 TO '/dev/null' WITH (FORMAT text);
> 
> 5628.280 ms   5679.860 ms
> 5583.144 ms   5588.078 ms
> 5604.444 ms   5628.029 ms
> 5617.133 ms   5613.926 ms
> 5575.570 ms   5601.045 ms
> 5634.828 ms   5616.409 ms
> 5693.489 ms   5637.434 ms
> 5585.857 ms   5609.531 ms
> 5613.948 ms   5643.629 ms
> 5610.394 ms   5580.206 ms
> 
> COPY from_tab_30 FROM '/tmp/to_tab_30.txt' WITH (FORMAT text);
> 
> 3929.955 ms   4050.895 ms
> 3909.061 ms   3890.156 ms
> 3940.272 ms   3927.614 ms
> 3907.535 ms   3925.560 ms
> 3952.719 ms   3942.141 ms
> 3933.751 ms   3904.250 ms
> 3958.274 ms   4025.581 ms
> 3937.065 ms   3894.149 ms
> 3949.896 ms   3933.878 ms
> 3925.399 ms   3936.170 ms
> 
> I did not see obvious performance degradation, maybe it's
> because I did not use tmpfs, but I think this OTH means
> that the *function call* and *if branch* added for each row
> is not the bottleneck of the whole execution path.

Thanks for sharing your numbers. I agree with there are no
obvious performance degradation.


> In 0001,
> 
> +typedef struct CopyFromRoutine
> +{
> + /*
> + * Called when COPY FROM is started to set up the input functions
> + * associated to the relation's attributes writing to.  `finfo` can be
> + * optionally filled to provide the catalog information of the input
> + * function.  `typioparam` can be optionally filled to define the OID of
> + * the type to pass to the input function.  `atttypid` is the OID of data
> + * type used by the relation's attribute.
> 
> +typedef struct CopyToRoutine
> +{
> + /*
> + * Called when COPY TO is started to set up the output functions
> + * associated to the relation's attributes reading from.  `finfo` can be
> + * optionally filled.  `atttypid` is the OID of data type used by the
> + * relation's attribute.
> 
> The second comment has a simplified description for `finfo`, I think it
> should match the first by:
> 
> `finfo` can be optionally filled to provide the catalog information of the
> output function.

Good catch. I'll update it as suggested in the next patch set.

> After I post the patch diffs, the gmail grammer shows some hints that
> it should be *associated with* rather than *associated to*, but I'm
> not sure about this one.

Thanks. I'll use "associated with".

> I think the patches are in good shape, I can help to do some
> further tests if needed, thanks for working on this.

Thanks!

-- 
kou

COPY FROM crash

2024-07-29 Thread Zhang Mingli

Hi, all

I got a crash when copy partition tables with mass data in Cloudberry 
DB[0](based on Postgres14.4, Greenplum 7).

I have a test on Postgres and it has the similar issue(different places but 
same function).

However it’s a little hard to reproduce because it happened when inserting next 
tuple after a previous copy multi insert buffer is flushed.

To reproduce easily, change the Macros to:

#define MAX_BUFFERED_TUPLES 1
#define MAX_PARTITION_BUFFERS   0

Config and make install, when initdb, a core dump will be as:

#0 0x55de617211b9 in CopyMultiInsertInfoNextFreeSlot 
(miinfo=0x7ffce496d360, rri=0x55de6368ba88)
 at copyfrom.c:592
#1 0x55de61721ff1 in CopyFrom (cstate=0x55de63592ce8) at copyfrom.c:985
#2 0x55de6171dd86 in DoCopy (pstate=0x55de63589e00, stmt=0x55de635347d8, 
stmt_location=0, stmt_len=195,
 processed=0x7ffce496d590) at copy.c:306
#3 0x55de61ad7ce8 in standard_ProcessUtility (pstmt=0x55de635348a8,
 queryString=0x55de63533960 "COPY information_schema.sql_features (feature_id, 
feature_name, sub_feature_id, sub
_feature_name, is_supported, comments) FROM 
E'/home/gpadmin/install/pg17/share/postgresql/sql_features.txt';\n",
 readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, 
queryEnv=0x0, dest=0x55de620b0ce0 ,
 qc=0x7ffce496d910) at utility.c:735
#4 0x55de61ad7614 in ProcessUtility (pstmt=0x55de635348a8,
 queryString=0x55de63533960 "COPY information_schema.sql_features (feature_id, 
feature_name, sub_feature_id, sub
_feature_name, is_supported, comments) FROM 
E'/home/gpadmin/install/pg17/share/postgresql/sql_features.txt';\n",
 readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, 
queryEnv=0x0, dest=0x55de620b0ce0 ,
 qc=0x7ffce496d910) at utility.c:523
#5 0x55de61ad5e8f in PortalRunUtility (portal=0x55de633dd7a0, 
pstmt=0x55de635348a8, isTopLevel=true,
 setHoldSnapshot=false, dest=0x55de620b0ce0 , qc=0x7ffce496d910) at 
pquery.c:1158
#6 0x55de61ad6106 in PortalRunMulti (portal=0x55de633dd7a0, 
isTopLevel=true, setHoldSnapshot=false,
 dest=0x55de620b0ce0 , altdest=0x55de620b0ce0 , 
qc=0x7ffce496d910) at pquery.c:1315
#7 0x55de61ad5550 in PortalRun (portal=0x55de633dd7a0, 
count=9223372036854775807, isTopLevel=true,
 run_once=true, dest=0x55de620b0ce0 , altdest=0x55de620b0ce0 
, qc=0x7ffce496d910)
 at pquery.c:791```


The root cause is:  we may call CopyMultiInsertInfoFlush() to flush buffer 
during COPY tuples, ex: insert from next tuple,
CopyMultiInsertInfoNextFreeSlot() will get a crash due to null pointer of 
buffer.

To fix it: instead of call CopyMultiInsertInfoSetupBuffer() outside, I put it 
into CopyMultiInsertInfoNextFreeSlot() to avoid such issues.

[0] https://github.com/cloudberrydb/cloudberrydb


Zhang Mingli
www.hashdata.xyz


v0-0001-Fix-COPY-FROM-crash-due-to-buffer-flush.patch
Description: Binary data

Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

On Tue, Jul 30, 2024 at 1:48 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
> >> ... However, I added a new open item about how the
> >> 040_pg_createsubscriber.pl test is slow and still unstable.
>
> > But that said, I see no commits in the commit history which purport to
> > improve performance, so I guess the performance is probably still not
> > what you want, though I am not clear on the details.
>
> My concern is described at [1]:
>
> >> I have a different but possibly-related complaint: why is
> >> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
> >> runs for a bit over 19 seconds, which seems completely out of line
> >> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
> >> other test scripts in this directory take much less).  It looks
> >> like most of the blame falls on this step:
> >>
> >> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
> >>
> >> AFAICS the amount of data being replicated is completely trivial,
> >> so that it doesn't make any sense for this to take so long --- and
> >> if it does, that suggests that this tool will be impossibly slow
> >> for production use.  But I suspect there is a logic flaw causing
> >> this.  Speculating wildly, perhaps that is related to the failure
> >> Alexander spotted?
>
> The followup discussion in that thread made it sound like there's
> some fairly fundamental deficiency in how wait_for_end_recovery()
> detects end-of-recovery.  I'm not too conversant with the details
> though, and it's possible that pg_createsubscriber is just falling
> foul of a pre-existing infelicity.
>
> If the problem can be correctly described as "pg_createsubscriber
> takes 10 seconds or so to detect end-of-stream",
>

The problem can be defined as: "pg_createsubscriber waits for an
additional (new) WAL record to be generated on primary before it
considers the standby is ready for becoming a subscriber". Now, on
busy systems, this shouldn't be a problem but for idle systems, the
time to detect end-of-stream can't be easily defined.

One of the proposed solutions is that pg_createsubscriber generate a
dummy WAL record on the publisher/primary by using something like
pg_logical_emit_message(), pg_log_standby_snapshot(), etc. This will
fix the problem (BF failures and slow detection for end-of-stream) but
sounds more like a hack. The other ideas that we can consider as
mentioned in [1] require API/design change which is not preferable at
this point. So, the only way seems to be to accept the generation of
dummy WAL records to bring predictability in the tests or otherwise in
the usage of the tool.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2Bp%2B7Ag6nqdFRdqowK1EmJ6bG-MtZQ_54dnFBi%3D_OO5RQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.

RE: speed up a logical replica setup

2024-07-29 Thread Hayato Kuroda (Fujitsu)

Dear Euler,

> Hayato already mentioned one of the solution in a previous email [2].
> AFAICS we can use any solution that creates a WAL record. I tested the
> following options:

Yes, my point was that we should add an arbitrary record after the 
recovery_target_lsn.

> (a) temporary replication slot: requires an additional replication slot.
> small payload. it is extremely slow in comparison with the other
> options.
> (b) logical message: can be consumed by logical replication when/if it
> is supported some day. big payload. fast.
> (c) snapshot of running txn:  small payload. fast.
> (d) named restore point: biggest payload. fast.

Another PoV is whether they trigger the flush of the generated WAL record. 
You've
tested pg_logical_emit_message() with flush=false, but this meant that the WAL 
does
not flush so that the wait_for_end_recovery() waits a time. IIUC, it may wait 15
seconds, which is the time duration bgwriter works. If flush is set to true, the
execution time will be longer.
pg_create_restore_point() also does not flush the generated record so that it 
may
be problematic. Moreover, the name of the restore point might be a conflict that
users will use.

Overall, I could agree to use pg_log_standby_snapshot(). This flushes the WAL
asynchronously but ensures the timing is not so delayed.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Conflict Detection and Resolution

2024-07-29 Thread shveta malik

On Fri, Jul 26, 2024 at 9:50 AM Ajin Cherian  wrote:
>>
> Please find v7 patch-set, the changes are:
>

Thanks Ajin for working on this. Please find few comments:

1)
parse_subscription_conflict_resolvers():
Here we loop in this function to find the given conflict type in the
supported list and error out if conflict-type is not valid. Also we
call validate_conflict_type_and_resolver() which again validates
conflict-type. I would recommend to loop 'stmtresolvers' in parse
function and then read each type and resolver and pass that to
validate_conflict_type_and_resolver(). Avoid double validation.

2)
SetSubConflictResolver():
It works well, but it does not look apt that the 'resolvers' passed to
this function by the caller is an array and this function knows the
array range and traverse from CT_MIN to CT_MAX assuming this array
maps directly to ConflictType. I think it would be better to have it
passed as a list and then SetSubConflictResolver() traverse the list
without knowing the range of it. Similar to what we do in
alter-sub-flow in and around UpdateSubConflictResolvers().

3)
When I execute 'alter subscription ..(detect_conflict=on)' for a
subscription which *already* has detect_conflict as ON, it tries to
reset resolvers to default and ends up in error. It should actually be
no-op in this particular situation and should not reset resolvers to
default.

postgres=# alter subscription sub1 set (detect_conflict=on);
WARNING:  Using default conflict resolvers
ERROR:  duplicate key value violates unique constraint
"pg_subscription_conflict_sub_index"

4)
Do we need SUBSCRIPTIONCONFLICTOID cache? We are not using it
anywhere. Shall we remove this and the corresponding index?

5)
RemoveSubscriptionConflictBySubid().
--We can remove extra blank line before table_open.
--We can get rid of curly braces around CatalogTupleDelete() as it is
a single line in loop.

thanks
Shveta

Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

Amit Kapila  writes:
> On Tue, Jul 30, 2024 at 1:48 AM Tom Lane  wrote:
>> If the problem can be correctly described as "pg_createsubscriber
>> takes 10 seconds or so to detect end-of-stream",

> The problem can be defined as: "pg_createsubscriber waits for an
> additional (new) WAL record to be generated on primary before it
> considers the standby is ready for becoming a subscriber". Now, on
> busy systems, this shouldn't be a problem but for idle systems, the
> time to detect end-of-stream can't be easily defined.

Got it.  IMO, that absolutely will be a problem for real users,
not only test cases.

> One of the proposed solutions is that pg_createsubscriber generate a
> dummy WAL record on the publisher/primary by using something like
> pg_logical_emit_message(), pg_log_standby_snapshot(), etc. This will
> fix the problem (BF failures and slow detection for end-of-stream) but
> sounds more like a hack.

It's undoubtedly a hack, but I like it anyway because it's small,
self-contained, and easily removable once we have a better solution.
As you say, it seems a bit late in the v17 cycle to be designing
anything more invasive.

regards, tom lane

Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

On Tue, Jul 30, 2024 at 9:56 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Tue, Jul 30, 2024 at 1:48 AM Tom Lane  wrote:
> >> If the problem can be correctly described as "pg_createsubscriber
> >> takes 10 seconds or so to detect end-of-stream",
>
> > The problem can be defined as: "pg_createsubscriber waits for an
> > additional (new) WAL record to be generated on primary before it
> > considers the standby is ready for becoming a subscriber". Now, on
> > busy systems, this shouldn't be a problem but for idle systems, the
> > time to detect end-of-stream can't be easily defined.
>
> Got it.  IMO, that absolutely will be a problem for real users,
> not only test cases.
>
> > One of the proposed solutions is that pg_createsubscriber generate a
> > dummy WAL record on the publisher/primary by using something like
> > pg_logical_emit_message(), pg_log_standby_snapshot(), etc. This will
> > fix the problem (BF failures and slow detection for end-of-stream) but
> > sounds more like a hack.
>
> It's undoubtedly a hack, but I like it anyway because it's small,
> self-contained, and easily removable once we have a better solution.
> As you say, it seems a bit late in the v17 cycle to be designing
> anything more invasive.
>

Thanks for your feedback. We will proceed in that direction and try to
close this open item.

-- 
With Regards,
Amit Kapila.

Re: pgsql: Add more SQL/JSON constructor functions

2024-07-29 Thread Amit Langote

Hi,

On Fri, Jul 26, 2024 at 11:19 PM jian he  wrote:
> On Fri, Jul 26, 2024 at 4:53 PM Amit Langote  wrote:
> >
> >
> > Pushed 0003-0005 ahead of 0001-0002.  Will try to push them over the
> > weekend.  Rebased for now.

Pushed them now.

> {
> ...
> /*
>  * For expression nodes that support soft errors.  Should be set to NULL
>  * before calling ExecInitExprRec() if the caller wants errors thrown.
>  */
> ErrorSaveContext *escontext;
> } ExprState;
>
> i believe by default makeNode will set escontext to NULL.
> So the comment should be, if you want to catch the soft errors, make
> sure the escontext pointing to an allocated ErrorSaveContext.
> or maybe just say, default is NULL.
>
> Otherwise, the original comment's meaning feels like: we need to
> explicitly set it to NULL
> for certain operations, which I believe is false?

OK, I'll look into updating this.

> struct
> {
> Oidtargettype;
> int32targettypmod;
> boolomit_quotes;
> boolexists_coerce;
> boolexists_cast_to_int;
> boolcheck_domain;
> void   *json_coercion_cache;
> ErrorSaveContext *escontext;
> }jsonexpr_coercion;
> do we need to comment that "check_domain" is only used for JSON_EXISTS_OP?

I've renamed it to exists_check_domain and added a comment that
exists_* fields are relevant only for JSON_EXISTS_OP.

> json_behavior_type:
> ERROR_P{ $$ = JSON_BEHAVIOR_ERROR; }
> | NULL_P{ $$ = JSON_BEHAVIOR_NULL; }
> | TRUE_P{ $$ = JSON_BEHAVIOR_TRUE; }
> | FALSE_P{ $$ = JSON_BEHAVIOR_FALSE; }
> | UNKNOWN{ $$ = JSON_BEHAVIOR_UNKNOWN; }
> | EMPTY_P ARRAY{ $$ = JSON_BEHAVIOR_EMPTY_ARRAY; }
> | EMPTY_P OBJECT_P{ $$ = JSON_BEHAVIOR_EMPTY_OBJECT; }
> /* non-standard, for Oracle compatibility only */
> | EMPTY_P{ $$ = JSON_BEHAVIOR_EMPTY_ARRAY; }
> ;
>
> EMPTY_P behaves the same as EMPTY_P ARRAY
> so for function GetJsonBehaviorConst, the following "case
> JSON_BEHAVIOR_EMPTY:" is wrong?
>
> case JSON_BEHAVIOR_NULL:
> case JSON_BEHAVIOR_UNKNOWN:
> case JSON_BEHAVIOR_EMPTY:
> val = (Datum) 0;
> isnull = true;
> typid = INT4OID;
> len = sizeof(int32);
> isbyval = true;
> break;
>
> also src/backend/utils/adt/ruleutils.c
> if (jexpr->on_error->btype != JSON_BEHAVIOR_EMPTY)
> get_json_behavior(jexpr->on_error, context, "ERROR");

Something like the attached makes sense?  While this meaningfully
changes the deparsing output, there is no behavior change for
JsonTable top-level path execution.  That's because the behavior when
there's an error in the execution of the top-level path is to throw it
or return an empty set, which is handled in jsonpath_exec.c, not
execExprInterp.c.

> for json_value, json_query, i believe we can save some circles in
> ExecInitJsonExpr
> if you don't specify on error, on empty
>
> can you please check the attached, based on your latest attachment.

Perhaps makes sense, though I haven't checked closely.  I'll take a
look next week.

--
Thanks, Amit Langote

Re: Incremental View Maintenance, take 2

2024-07-29 Thread Yugo NAGATA

Hi,

On Tue, 30 Jul 2024 03:32:19 +0500
Kirill Reshke  wrote:

> On Sat, 27 Jul 2024 at 13:26, Kirill Reshke  wrote:
> >
> > Hi!
> > Cloudberry DB (Greenplum fork) uses IMMV feature for AQUMV (auto query
> > use matview) feature, so i got interested in how it is implemented.

Thank you so much for a lot of comments!
I will respond to the comments soon.

> >
> > On Thu, 11 Jul 2024 at 09:24, Yugo NAGATA  wrote:
> > >
> > > I updated the patch to bump up the version numbers in psql and pg_dump 
> > > codes
> > > from 17 to 18.
> >
> > Few suggestions:
> >
> > 1) `Add-relisivm-column-to-pg_class-system-catalog` commit message
> > should be fixed, there is "isimmv" in the last line.
> > 2) I dont get why `Add-Incremental-View-Maintenance-support.patch`
> > goes after 0005 & 0004. Shoulndt we first implement feature server
> > side, only when client (psql & pg_dump) side?
> > 3) Can we provide regression tests for each function separately? Test
> > for main feature in main patch, test for DISTINCT support in
> > v34-0007-Add-DISTINCT-support-for-IVM.patch etc? This way the patchset
> > will be easier to review, and can be committed separelety.
> > 4) v34-0006-Add-Incremental-View-Maintenance-support.patch no longer
> > applies due to 4b74ebf726d444ba820830cad986a1f92f724649. After
> > resolving issues manually, it does not compile, because
> > 4b74ebf726d444ba820830cad986a1f92f724649 also removes
> > save_userid/save_sec_context fields from ExecCreateTableAs.
> >
> > >   if (RelationIsIVM(matviewRel) && stmt->skipData)
> > Now this function accepts skipData param.
> >
> > 5) For DISTINCT support patch uses hidden __ivm* columns. Is this
> > design discussed anywhere? I wonder if this is a necessity (only
> > solution) or if there are alternatives.
> > 6)
> > What are the caveats of supporting some simple cases for aggregation
> > funcs like in example?
> > ```
> > regress=# CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_2 AS SELECT
> > sum(j) + sum(i) from mv_base_a;
> > ERROR:  expression containing an aggregate in it is not supported on
> > incrementally maintainable materialized view
> > ```
> > I can see some difficulties with division CREATE IMMV  AS SELECT
> > 1/sum(i) from mv_base_a;  (sum(i) == 0 case), but adding &
> > multiplication should be ok, aren't they?
> >
> >
> > Overall, patchset looks mature, however it is far from being
> > committable due to lack of testing/feedback/discussion. There is only
> > one way to fix this... Test and discuss it!
> >
> >
> > [1] https://github.com/cloudberrydb/cloudberrydb
> 
> Hi! Small update: I tried to run a regression test and all
> IMMV-related tests failed on my vm. Maybe I'm doing something wrong, I
> will try to investigate.
> 
> Another suggestion: support for \d and \d+ commands in psql. With v34
> patchset applied, psql does not show anything IMMV-related in \d mode.
> 
> ```
> reshke=# \d m1
>Materialized view "public.m1"
>  Column |  Type   | Collation | Nullable | Default
> +-+---+--+-
>  i  | integer |   |  |
> Distributed by: (i)
> 
> 
> reshke=# \d+ m1
>  Materialized view "public.m1"
>  Column |  Type   | Collation | Nullable | Default | Storage |
> Compression | Stats target | Description
> +-+---+--+-+-+-+--+-
>  i  | integer |   |  | | plain   |
> |  |
> View definition:
>  SELECT t1.i
>FROM t1;
> Distributed by: (i)
> Access method: heap
> 
> ```
> 
> Output should be 'Incrementally materialized view "public.m1"' IMO.


-- 
Yugo NAGATA

Re: speed up a logical replica setup

On Tue, Jul 30, 2024 at 9:26 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > Hayato already mentioned one of the solution in a previous email [2].
> > AFAICS we can use any solution that creates a WAL record. I tested the
> > following options:
>
> Yes, my point was that we should add an arbitrary record after the 
> recovery_target_lsn.
>
> > (a) temporary replication slot: requires an additional replication slot.
> > small payload. it is extremely slow in comparison with the other
> > options.
> > (b) logical message: can be consumed by logical replication when/if it
> > is supported some day. big payload. fast.
> > (c) snapshot of running txn:  small payload. fast.
> > (d) named restore point: biggest payload. fast.
>
> Another PoV is whether they trigger the flush of the generated WAL record. 
> You've
> tested pg_logical_emit_message() with flush=false, but this meant that the 
> WAL does
> not flush so that the wait_for_end_recovery() waits a time. IIUC, it may wait 
> 15
> seconds, which is the time duration bgwriter works. If flush is set to true, 
> the
> execution time will be longer.
> pg_create_restore_point() also does not flush the generated record so that it 
> may
> be problematic. Moreover, the name of the restore point might be a conflict 
> that
> users will use.
>
> Overall, I could agree to use pg_log_standby_snapshot(). This flushes the WAL
> asynchronously but ensures the timing is not so delayed.
>

The other minor benefit of using pg_log_standby_snapshot() is that
after the standby is converted to subscriber, the publisher will
process this record to see if the slot machinery can be advanced. So,
overall there won't be any harm in using it. I'll check the latest
patch Euler shared.

-- 
With Regards,
Amit Kapila.

Re: COPY FROM crash

2024-07-29 Thread David Rowley

On Tue, 30 Jul 2024 at 15:52, Zhang Mingli  wrote:
> I have a test on Postgres and it has the similar issue(different places but 
> same function).
>
> However it’s a little hard to reproduce because it happened when inserting 
> next tuple after a previous copy multi insert buffer is flushed.
>
> To reproduce easily, change the Macros to:
>
> #define MAX_BUFFERED_TUPLES 1
> #define MAX_PARTITION_BUFFERS 0

I think you're going to need to demonstrate to us there's an actual
PostgreSQL bug here with a test case that causes a crash without
changing the above definitions.

It seems to me that it's not valid to set MAX_PARTITION_BUFFERS to
anything less than 2 due to the code inside
CopyMultiInsertInfoFlush(). If we find the CopyMultiInsertBuffer for
'curr_rri' then that code would misbehave if the list only contained a
single CopyMultiInsertBuffer due to the expectation there's another
item in the list after the list_delete_first().  If you're only able
to get it to misbehave by setting MAX_PARTITION_BUFFERS to less than
2, then my suggested fix would be to add a comment to say that values
less than to are not supported.

David

Re: COPY FROM crash

2024-07-29 Thread Kirill Reshke

Hi!

On Tue, 30 Jul 2024 at 08:52, Zhang Mingli  wrote:
>
> Hi, all
>
> I got a crash when copy partition tables with mass data in Cloudberry 
> DB[0](based on Postgres14.4, Greenplum 7).
>
> I have a test on Postgres and it has the similar issue(different places but 
> same function).

Just to be clear, you are facing this on HEAD, on on REL_14_STABLE?


> However it’s a little hard to reproduce because it happened when inserting 
> next tuple after a previous copy multi insert buffer is flushed.
>
> To reproduce easily, change the Macros to:
>
> #define MAX_BUFFERED_TUPLES 1
> #define MAX_PARTITION_BUFFERS 0

This way it's harder to believe that the problem persists with the
original settings. Are these values valid?

Re: Separate HEAP WAL replay logic into its own file

2024-07-29 Thread Sutou Kouhei

Hi,

In <599e67d2-2929-4858-b8bc-f9c4ae889...@ebay.com>
  "Re: Separate HEAP WAL replay logic into its own file" on Fri, 26 Jul 2024 
07:56:12 +,
  "Li, Yong"  wrote:

>> 1. Could you create your patch by "git format-patch -vN master"
>>   or something? If you create your patch by "git format-patch",
>>   we can apply your patch by "git am XXX.patch".
>>
> 
> Thanks for your review. I’ve updated the patch to follow your
> suggested format.

Thanks. I could apply your patch by "git am
v2-0001-heapam_refactor.patch".

Could you use the following format for the commit message
next time?


${TITLE}

${DESCRIPTION}


For example:


Separate HEAP WAL replay logic into its own file

Most access methods (i.e. nbtree and hash) use a separate
file with "xlog" in its name for its WAL replay logic. Heap
is one exception of this convention. To make it easier for
newcomers to find the WAL replay logic for the heap access
method, this patch isolates heap's replay logic in a new
heapam_xlog.c file. This patch is a pure refactoring with no
change to the logic.


This is a commonly used Git's commit message format. See
also other commit messages by "git log".

>> 5. There are still WAL related codes in heapam.c:
>>
>>   4.1. log_heap_update()
>>   4.2. log_heap_new_cid()
>>   4.3. if (RelationNeedsWAL()) {...} in heap_insert()
>>   4.4. if (needwal) {...} in heap_multi_insert()
>>   4.5. if (RelationNeedsWAL()) {...} in heap_delete()
>>   4.6. if (RelationNeedsWAL()) {...}s in heap_update()
>>   4.7. if (RelationNeedsWAL()) {...} in heap_lock_tuple()
>>   4.8. if (RelationNeedsWAL()) {...} in heap_lock_updated_tuple_rec()
>>   4.9. if (RelationNeedsWAL()) {...} in heap_finish_speculative()
>>   4.10. if (RelationNeedsWAL()) {...} in heap_abort_speculative()
>>   4.11. if (RelationNeedsWAL()) {...} in heap_inplace_update()
>>   4.12. log_heap_visible()
>>
>>   Should we move them to head_xlog.c too?
>>
>>   If we should do it, separated commits will be easy to
>>   review. For example, the 0001 patch moves existing codes
>>   to head_xlog.c as-is. The 0002 patch extracts WAL related
>>   codes in heap_insert() to heap_xlog.c and so on.
> 
> I followed the convention of most access methods. The “xlog”
> file includes the WAL replay logic only. The logic that generates
> the log records themselves stays with the code that performs
> the changes. Take nbtree as an example, you can also find
> WAL generating code in several _bt_insertxxx() functions inside
> the nbtinsert.c file.

You're right. Sorry.


I think that this proposal is reasonable but we need to get
attention from a committer to move forward this proposal.


Thanks,
-- 
kou

Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-29 Thread Ashutosh Bapat

On Tue, Jul 30, 2024 at 9:25 AM Amit Kapila  wrote:
>
> On Tue, Jul 30, 2024 at 1:48 AM Tom Lane  wrote:
> >
> > Robert Haas  writes:
> > > On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
> > >> ... However, I added a new open item about how the
> > >> 040_pg_createsubscriber.pl test is slow and still unstable.
> >
> > > But that said, I see no commits in the commit history which purport to
> > > improve performance, so I guess the performance is probably still not
> > > what you want, though I am not clear on the details.
> >
> > My concern is described at [1]:
> >
> > >> I have a different but possibly-related complaint: why is
> > >> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
> > >> runs for a bit over 19 seconds, which seems completely out of line
> > >> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
> > >> other test scripts in this directory take much less).  It looks
> > >> like most of the blame falls on this step:
> > >>
> > >> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
> > >>
> > >> AFAICS the amount of data being replicated is completely trivial,
> > >> so that it doesn't make any sense for this to take so long --- and
> > >> if it does, that suggests that this tool will be impossibly slow
> > >> for production use.  But I suspect there is a logic flaw causing
> > >> this.  Speculating wildly, perhaps that is related to the failure
> > >> Alexander spotted?
> >
> > The followup discussion in that thread made it sound like there's
> > some fairly fundamental deficiency in how wait_for_end_recovery()
> > detects end-of-recovery.  I'm not too conversant with the details
> > though, and it's possible that pg_createsubscriber is just falling
> > foul of a pre-existing infelicity.
> >
> > If the problem can be correctly described as "pg_createsubscriber
> > takes 10 seconds or so to detect end-of-stream",
> >
>
> The problem can be defined as: "pg_createsubscriber waits for an
> additional (new) WAL record to be generated on primary before it
> considers the standby is ready for becoming a subscriber". Now, on
> busy systems, this shouldn't be a problem but for idle systems, the
> time to detect end-of-stream can't be easily defined.

AFAIU, the server will emit running transactions WAL record at least
15 seconds. So the subscriber should not have to wait longer than 15
seconds. I understand that it would be a problem for tests, but will
it be a problem for end users? Sorry for repetition, if this has been
discussed.

-- 
Best Wishes,
Ashutosh Bapat

Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)