Re: Eliminating SPI from RI triggers - take 2

2022-09-29 Thread Amit Langote
On Thu, Sep 29, 2022 at 1:46 PM Amit Langote  wrote:
> Sorry about the delay.
>
> So I came up with such a patch that is attached as 0003.
>
> The main problem I want to fix with it is the need for RI_FKey_check()
> to "force"-push the latest snapshot that the PartitionDesc code wants
> to use to correctly include or omit a detach-pending partition from
> the view of that function's RI query.  Scribbling on ActiveSnapshot
> that way means that *all* scans involved in the execution of that
> query now see a snapshot that they shouldn't likely be seeing; a bug
> resulting from this has been demonstrated in a test case added by the
> commit 00cb86e75d.
>
> The fix is to make RI_FKey_check(), or really its RI_Plan's execution
> function ri_LookupKeyInPkRel() added by patch 0002, pass the latest
> snapshot explicitly as a parameter of PartitionDirectoryLookup(),
> which passes it down to the PartitionDesc code.  No need to manipulate
> ActiveSnapshot.  The actual fix is in patch 0004, which I extracted
> out of 0002 to keep the latter a mere refactoring patch without any
> semantic changes (though a bit more on that below).  BTW, I don't know
> of a way to back-patch a fix like this for the bug, because there is
> no way other than ActiveSnapshot to pass the desired snapshot to the
> PartitionDesc code if the only way we get to that code is by executing
> an SQL query plan.
>
> 0003 moves the relevant logic out of
> find_inheritance_children_extended() into its callers.  That logic
> decides which snapshot to use to determine whether a detach-pending
> partition should be omitted from a caller's consideration, based on
> the visibility of the corresponding pg_inherits row under that
> snapshot; it currently just uses ActiveSnapshot.
> Given the problems with using ActiveSnapshot mentioned above, I think
> it is better to make the callers decide the snapshot and pass it using
> a parameter named omit_detached_snapshot.  Only PartitionDesc code
> actually cares about sending anything but the parent query's
> ActiveSnapshot, so the PartitionDesc and PartitionDirectory interface
> has been changed to add the same omit_detached_snapshot parameter.
> find_inheritance_children(), the other caller used in many sites that
> look at a table's partitions, defaults to using ActiveSnapshot, which
> does not seem problematic.  Furthermore, only RI_FKey_check() needs to
> pass anything other than ActiveSnapshot, so other users of
> PartitionDesc, like user queries, still default to using the
> ActiveSnapshot, which doesn't have any known problems either.
>
> 0001 and 0002 are mostly unchanged in this version, except I took out
> the visibility bug-fix from 0002 into 0004 described above, which
> looks better using the interface added by 0003 anyway.  I need to
> address the main concern that it's still hard to be sure that the
> patch in its current form doesn't break any user-level semantics of
> these RI check triggers and other concerns about the implementation
> that Robert expressed in [1].

Oops, I apparently posted the wrong 0004, containing a bug that
crashes `make check`.

Fixed version attached.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v5-0002-Avoid-using-an-SQL-query-for-some-RI-checks.patch
Description: Binary data


v5-0003-Make-omit_detached-logic-independent-of-ActiveSna.patch
Description: Binary data


v5-0004-Teach-ri_LookupKeyInPkRel-to-pass-omit_detached_s.patch
Description: Binary data


v5-0001-Avoid-using-SPI-in-RI-trigger-functions.patch
Description: Binary data


Re: [patch] Adding an assertion to report too long hash table name

2022-09-29 Thread Junwang Zhao
LGTM
+1

On Thu, Sep 29, 2022 at 9:38 AM Xiaoran Wang  wrote:
>
> Hi,
>
> The max size for a shmem hash table name is SHMEM_INDEX_KEYSIZE - 1,
> but when the caller uses a longer hash table name, it doesn't report
> any error; instead it just uses the first SHMEM_INDEX_KEYSIZE - 1
> chars as the hash table name.
>
> I created some shmem hash tables with the same prefix, which was longer
> than SHMEM_INDEX_KEYSIZE - 1, and since the sizes of those hash tables
> were the same, only one hash table was created. But I thought those
> hash tables had been created successfully.
>
> I know this is a corner case, but it's difficult to figure out when one
> runs into it, so I added an assertion to prevent it.
>
>
> Thanks,
> Xiaoran



-- 
Regards
Junwang Zhao




Re: Obsolete comment in ExecInsert()

2022-09-29 Thread Etsuro Fujita
Hi,

On Wed, Sep 28, 2022 at 11:42 PM Tom Lane  wrote:
> Etsuro Fujita  writes:
> > I think the "or a tuple has come for a different relation than that
> > for the accumulated tuples" part in the comment is a leftover from an
> > earlier version of the patch [1].  As the code shows, we do not handle
> > that case anymore, so I think we should remove that part from the
> > comment.  Attached is a patch for that.
>
> +1, but what remains still seems awkwardly worded.  How about something
> like "When we've reached the desired batch size, perform the insertion"?

+1 for that change.  Pushed that way.

Thanks for reviewing!

Best regards,
Etsuro Fujita




Re: pg_rewind WAL segments deletion pitfall

2022-09-29 Thread Polina Bungina
I agree with your suggestions, so here is the updated version of the
patch. I hope I haven't missed anything.

Regards,
Polina Bungina


v3-0001-pg_rewind-wal-deletion.patch
Description: Binary data


Re: Refactor UnpinBuffer()

2022-09-29 Thread Aleksander Alekseev
Nathan, Zhang,

Thanks for the review!

> Is it safe to move the call to ResourceOwnerForgetBuffer() to before the
> call to GetPrivateRefCountEntry()? From my quick skim of the code, it
> seems like it should be safe, but I thought I'd ask the question.
>
> Same question, have a look, it doesn’t seem to matter.

Yep, I had some doubts here as well but it seems to be safe.

-- 
Best regards,
Aleksander Alekseev




Re: Avoid memory leaks during base backups

2022-09-29 Thread Bharath Rupireddy
On Wed, Sep 28, 2022 at 8:46 PM Robert Haas  wrote:
>
> I feel like we ought to be trying to tie
> the cleanup into WalSndErrorCleanup() or WalSndResourceCleanup() based
> on having the memory context that we ought to be blowing away stored
> in a global variable, rather than using a try/catch block.

Okay, I got rid of the try-catch block. I added two cleanup callbacks
(one for SQL backup functions, i.e. on-line backup, another for base
backup) that basically delete the respective memory contexts and reset
the file-level variables; they get called from PostgresMain()'s error
handling code.

> Like, maybe there's a function EstablishWalSenderMemoryContext() that
> commands can call before allocating memory that shouldn't survive an
> error. And it's deleted after each command if it exists, or if an
> error occurs then WalSndErrorCleanup() deletes it.

I don't think we need any of the above. I've used file-level variables
to hold the memory contexts, allocating them whenever needed and
cleaning them up either at the end of the backup operation or upon
error.

Please review the attached v3 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v3-0001-Avoid-memory-leaks-during-backups.patch
Description: Binary data


Re: Refactor UnpinBuffer()

2022-09-29 Thread Bharath Rupireddy
On Thu, Sep 29, 2022 at 1:52 PM Aleksander Alekseev
 wrote:
>
> > Is it safe to move the call to ResourceOwnerForgetBuffer() to before the
> > call to GetPrivateRefCountEntry()? From my quick skim of the code, it
> > seems like it should be safe, but I thought I'd ask the question.
> >
> > Same question, have a look, it doesn’t seem to matter.
>
> Yep, I had some doubts here as well but it seems to be safe.

Commit 2d115e47c861878669ba0814b3d97a4e4c347e8b removed the last
UnpinBuffer() call that passed fixOwner as false, in ReleaseBuffer().
That commit is pretty old, so +1 for removing the now-unused function
parameter.

Also, it looks like changing the order of GetPrivateRefCountEntry()
and ResourceOwnerForgetBuffer() doesn't have any effect, as they are
independent; but do we actually want to do that if there's no specific
reason?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




do away with ALTER TABLE "Recurse" subcmd types

2022-09-29 Thread Alvaro Herrera
I already mentioned this in [1]: we can remove a few subcmd types that
were added to support exec-time recursion, by keeping a separate flag
for it.  We're already doing that for alter trigger operations, so this
patch just extends that to the other subcommand types that need it.

There's no visible change, just some code simplification.

[1] https://postgr.es/m/20220729184452.2i4xcru3lzey76m6@alvherre.pgsql

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"All rings of power are equal,
But some rings of power are more equal than others."
 (George Orwell's The Lord of the Rings)
From 8c36176feeb47f4a4889b0af0800fcf316a58408 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 3 Aug 2022 19:53:12 +0200
Subject: [PATCH] do away with AT_AddColumnRecurse

---
 src/backend/commands/tablecmds.c  | 67 ---
 src/backend/parser/parse_utilcmd.c|  2 -
 src/include/nodes/parsenodes.h|  5 --
 .../test_ddl_deparse/test_ddl_deparse.c   | 20 ++
 4 files changed, 17 insertions(+), 77 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7d8a75d23c..60a4663fd1 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4069,8 +4069,7 @@ AlterTableLookupRelation(AlterTableStmt *stmt, LOCKMODE lockmode)
  * For most subcommand types, phases 2 and 3 do no explicit recursion,
  * since phase 1 already does it.  However, for certain subcommand types
  * it is only possible to determine how to recurse at phase 2 time; for
- * those cases, phase 1 sets the cmd->recurse flag (or, in some older coding,
- * changes the command subtype of a "Recurse" variant XXX to be cleaned up.)
+ * those cases, phase 1 sets the cmd->recurse flag.
  *
  * Thanks to the magic of MVCC, an error anywhere along the way rolls back
  * the whole operation; we don't have to do anything special to clean up.
@@ -4275,7 +4274,6 @@ AlterTableGetLockLevel(List *cmds)
 break;
 
 			case AT_AddConstraint:
-			case AT_AddConstraintRecurse:	/* becomes AT_AddConstraint */
 			case AT_ReAddConstraint:	/* becomes AT_AddConstraint */
 			case AT_ReAddDomainConstraint:	/* becomes AT_AddConstraint */
 if (IsA(cmd->def, Constraint))
@@ -4627,7 +4625,7 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 			/* Recursion occurs during execution phase */
 			/* No command-specific prep needed except saving recurse flag */
 			if (recurse)
-cmd->subtype = AT_AddConstraintRecurse;
+cmd->recurse = true;
 			pass = AT_PASS_ADD_CONSTR;
 			break;
 		case AT_AddIndexConstraint: /* ADD CONSTRAINT USING INDEX */
@@ -4642,7 +4640,7 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 			/* Other recursion occurs during execution phase */
 			/* No command-specific prep needed except saving recurse flag */
 			if (recurse)
-cmd->subtype = AT_DropConstraintRecurse;
+cmd->recurse = true;
 			pass = AT_PASS_DROP;
 			break;
 		case AT_AlterColumnType:	/* ALTER COLUMN TYPE */
@@ -4764,7 +4762,7 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 			/* Recursion occurs during execution phase */
 			/* No command-specific prep needed except saving recurse flag */
 			if (recurse)
-cmd->subtype = AT_ValidateConstraintRecurse;
+cmd->recurse = true;
 			pass = AT_PASS_MISC;
 			break;
 		case AT_ReplicaIdentity:	/* REPLICA IDENTITY ... */
@@ -4929,12 +4927,7 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 		case AT_AddColumn:		/* ADD COLUMN */
 		case AT_AddColumnToView:	/* add column via CREATE OR REPLACE VIEW */
 			address = ATExecAddColumn(wqueue, tab, rel, &cmd,
-	  false, false,
-	  lockmode, cur_pass, context);
-			break;
-		case AT_AddColumnRecurse:
-			address = ATExecAddColumn(wqueue, tab, rel, &cmd,
-	  true, false,
+	  cmd->recurse, false,
 	  lockmode, cur_pass, context);
 			break;
 		case AT_ColumnDefault:	/* ALTER COLUMN DEFAULT */
@@ -4988,13 +4981,7 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 			break;
 		case AT_DropColumn:		/* DROP COLUMN */
 			address = ATExecDropColumn(wqueue, rel, cmd->name,
-	   cmd->behavior, false, false,
-	   cmd->missing_ok, lockmode,
-	   NULL);
-			break;
-		case AT_DropColumnRecurse:	/* DROP COLUMN with recursion */
-			address = ATExecDropColumn(wqueue, rel, cmd->name,
-	   cmd->behavior, true, false,
+	   cmd->behavior, cmd->recurse, false,
 	   cmd->missing_ok, lockmode,
 	   NULL);
 			break;
@@ -5014,27 +5001,14 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 			/* Transform the command only during initial examination */
 			if (cur_pass == AT_PASS_ADD_CONSTR)
 cmd = ATParseTransformCmd(wqueue, tab, rel, cmd,
-		  false, lockmode,
+		  cmd->recurse, lockmode,
 		  cur_pass, context);
 			/* Depending on constraint type, might be no mo

Re: Eliminating SPI from RI triggers - take 2

2022-09-29 Thread Amit Langote
On Thu, Sep 29, 2022 at 4:43 PM Amit Langote  wrote:
> Oops, I apparently posted the wrong 0004, containing a bug that
> crashes `make check`.
>
> Fixed version attached.

Here's another version that hopefully fixes the crash reported by
Cirrus CI [1] that is not reliably reproducible.

I suspect it may have to do with error_context_stack not being reset
when ri_LookupKeyInPkRel() does an early return; the `return false` in
that case was wrong too:

@@ -2693,7 +2693,7 @@ ri_LookupKeyInPkRel(struct RI_Plan *plan,
 * looking for.
 */
if (leaf_pk_rel == NULL)
-   return false;
+   goto done;

...
+done:
/*
 * Pop the error context stack
 */

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

[1] https://cirrus-ci.com/task/4901906421121024 (permalink?)


v6-0001-Avoid-using-SPI-in-RI-trigger-functions.patch
Description: Binary data


v6-0002-Avoid-using-an-SQL-query-for-some-RI-checks.patch
Description: Binary data


v6-0004-Teach-ri_LookupKeyInPkRel-to-pass-omit_detached_s.patch
Description: Binary data


v6-0003-Make-omit_detached-logic-independent-of-ActiveSna.patch
Description: Binary data


Re: Eliminating SPI from RI triggers - take 2

2022-09-29 Thread Amit Langote
On Thu, Sep 29, 2022 at 6:09 PM Amit Langote  wrote:
> Here's another version that hopefully fixes the crash reported by
> Cirrus CI [1] that is not reliably reproducible.

And cfbot #1, which failed a bit after the above one, is not happy
with my failing to include utils/snapshot.h in partdesc.h, to which I
added:

@@ -65,9 +66,11 @@ typedef struct PartitionDescData


 extern PartitionDesc RelationGetPartitionDesc(Relation rel, bool omit_detached);
+extern PartitionDesc RelationGetPartitionDescExt(Relation rel, bool omit_detached,
+                                                 Snapshot omit_detached_snapshot);

 extern PartitionDirectory CreatePartitionDirectory(MemoryContext mcxt, bool omit_detached);
-extern PartitionDesc PartitionDirectoryLookup(PartitionDirectory, Relation);
+extern PartitionDesc PartitionDirectoryLookup(PartitionDirectory, Relation, Snapshot);

So, here's a final revision for today.  Sorry for the noise.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v7-0004-Teach-ri_LookupKeyInPkRel-to-pass-omit_detached_s.patch
Description: Binary data


v7-0002-Avoid-using-an-SQL-query-for-some-RI-checks.patch
Description: Binary data


v7-0001-Avoid-using-SPI-in-RI-trigger-functions.patch
Description: Binary data


v7-0003-Make-omit_detached-logic-independent-of-ActiveSna.patch
Description: Binary data


Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-29 Thread Martin Kalcher

Am 28.09.22 um 16:18 schrieb Tom Lane:

It is seeded at process start, yes.  If you don't feel a need for
user control over the sequence used by these functions, then using
pg_global_prng_state is fine.  (Basically the point to be made
here is that we need to keep a tight rein on what can be affected
by setseed().)

regards, tom lane


New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

MartinFrom b9433564f925521f5f6bcebd7cd74a3e12f4f354 Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH v5] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array at random.
---
 doc/src/sgml/func.sgml  |  34 +
 src/backend/utils/adt/array_userfuncs.c | 176 
 src/include/catalog/pg_proc.dat |   6 +
 src/test/regress/expected/arrays.out|  54 
 src/test/regress/sql/arrays.sql |  14 ++
 5 files changed, 284 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e1fe4fec57..6b2a3d16f6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18468,6 +18468,40 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, n integer )
+anyarray
+   
+   
+Returns n randomly chosen elements from array in selection order.
+   
+   
+array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)
+{{5,6},{1,2}}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the first dimension of the array.
+   
+   
+array_shuffle(ARRAY[[1,2],[3,4],[5,6]])
+{{5,6},{1,2},{3,4}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..4bf28da5e4 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -14,6 +14,7 @@
 
 #include "catalog/pg_type.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -902,3 +903,178 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with n randomly chosen items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: number of items (must not be greater than the size of the array's first dimension)
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbyval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+rdims[MAXDIM],
+nelm,
+nitem,
+i,
+j,
+k;
+	Datum		elm,
+			   *elms,
+			   *ielms,
+			   *jelms;
+	bool		nul,
+			   *nuls,
+			   *inuls,
+			   *jnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+	  &elms, &nuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	nelm /= nitem;/* number of elements per item */
+
+	ielms = elms;
+	inuls = nuls;
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Iterate array and swap head
+	 * (ielms) with a randomly chosen item (jelms) at each iteration.
+	 */
+	for (i = 0; i < n; i++)
+	{
+		j = (int) pg_prng_uint64_range(&pg_global_prng_state, 0, nitem - i - 1) * nelm;
+		jelms = ielms + j;
+		jnuls = inuls + j;
+
+		/* Swap all elements in item (i) with elements in item (j). */
+		for (k = 0; k < nelm; k++)
+		{
+			elm = *ielms;
+			nul = *inuls;
+			*ielms++ = *jelms;
+			*inuls++ = *jnuls;
+			*jelms++ = elm;
+			*jnuls++ = nul;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = n;
+
+	result = construct_md_array(elms, nuls, ndim, rdims, lbs,
+elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	/* Return empty array immediately. */
+	if (ARR_NDIM(array) < 1)
+		PG_R

RE: Perform streaming logical transactions by background workers and parallel apply

2022-09-29 Thread kuroda.hay...@fujitsu.com
Dear Hou,

Thanks for updating the patch. I will review it soon, but let me first
reply to your comment.

> > 04. applyparallelworker.c - LogicalParallelApplyLoop()
> >
> > ```
> > +   shmq_res = shm_mq_receive(mqh, &len, &data, false);
> > ...
> > +   if (ConfigReloadPending)
> > +   {
> > +   ConfigReloadPending = false;
> > +   ProcessConfigFile(PGC_SIGHUP);
> > +   }
> > ```
> >
> >
> > Here the parallel apply worker waits to receive messages, and after
> > dispatching them, ProcessConfigFile() is called.
> > It means that the .conf will not be read until the parallel apply
> > worker receives new messages and applies them.
> >
> > It may be problematic when users change log_min_messages to debugXXX
> > for debugging but streamed transactions rarely come.
> > They would expect detailed descriptions to appear in the log from the
> > next streaming chunk, but they do not.
> >
> > This does not occur in the leader worker when it waits for messages
> > from the publisher, because it uses libpqrcv_receive(), which works
> > asynchronously.
> >
> > I'm not sure whether it should be documented that the evaluation of
> > GUCs may be delayed; what do you think?
> 
> I changed the shm_mq_receive to asynchronous mode which is also consistent
> with
> what we did for Gather node when reading data from parallel query workers.

I checked your implementation, but it seems that the parallel apply
worker will not sleep even if there are no messages or signals. That
might be very inefficient.

In the gather node (gather_readnext()), the same approach is used, but
I think there is a premise there that the wait time is short, because
it involves only one gather node. For a parallel apply worker, however,
we cannot predict the wait time, because it depends on the streamed
transactions. If such transactions rarely come, parallel apply workers
may burn a lot of CPU time.

I think we should wait for a short time, or until the leader notifies
us, if shmq_res == SHM_MQ_WOULD_BLOCK. What do you think?


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



[patch] \g with multiple result sets and \watch with copy queries

2022-09-29 Thread Daniel Verite
Hi,

The psql improvement in v15 to output multiple result sets does not
behave as one might expect with \g: the output file or program to pipe
into is opened and closed for each result set, so in the case of
\g file, each result set overwrites the previous ones.
Example:

psql -At 

Re: Refactor UnpinBuffer()

2022-09-29 Thread Aleksander Alekseev
Hi Bharath,

> Also, it looks like changing the order of GetPrivateRefCountEntry()
> and ResourceOwnerForgetBuffer() doesn't have any effect as they are
> independent, but do we want to actually do that if there's no specific
> reason?

If we keep the order as it is now the code will become:

```
ref = GetPrivateRefCountEntry(b, false);
Assert(ref != NULL);

ResourceOwnerForgetBuffer(CurrentResourceOwner, b);

Assert(ref->refcount > 0);
ref->refcount--;
if (ref->refcount == 0)
```

I figured it would not hurt to gather all the calls and Asserts related
to `ref` together. That is the only reason I chose to rearrange the
order of the calls in the patch.

So, no strong opinion in this respect from my side. I'm fine with
keeping the existing order.

-- 
Best regards,
Aleksander Alekseev




Re: Generalize ereport_startup_progress infrastructure

2022-09-29 Thread Bharath Rupireddy
On Wed, Aug 17, 2022 at 8:44 PM Robert Haas  wrote:
>
> Well, I don't agree that either of the proposed new uses of this
> infrastructure are the right way to solve the problems in question, so
> worrying about how to name the GUCs when we have a bunch of uses of
> this infrastructure seems to me to be premature.

Agreed.

> The proposed use in
> the postmaster doesn't look very safe, so you either need to give up
> on that or figure out a way to make it safe.

Is registering a SIGALRM handler in the postmaster not a good idea? Is
setting MyLatch conditionally [1] a concern?

I agree that the handle_sig_alarm() code for the postmaster may not look
good, as it holds interrupts and does a bunch of other things. But is
that a big issue?

> The proposed use in the
> checkpointer looks like it needs more design work, because it's not
> clear whether or how it should interact with log_checkpoints. While I
> agree that changing log_checkpoints into an integer value doesn't
> necessarily make sense, having some kind of new checkpoint logging
> that is completely unrelated to existing checkpoint logging doesn't
> necessarily make sense to me either.

Hm. Yes, we cannot forget about log_checkpoints while considering
adding more logs and controls with other GUCs. We could say that one
needs to enable both log_checkpoints and the progress-report GUC, but
that's not great from a usability perspective.

> I do have some sympathy with the idea that if people care about
> operations that unexpectedly run for a long time, they probably care
> about all of them, and probably don't care about changing the timeout
> or even the enable switch for each one individually.

I've seen such cases myself, and have been asked by many about the
server being unresponsive while it processes files, for instance temp
files in the postmaster after a restart, or snapshot, mapping, or
BufferSync() work during a checkpoint, where this sort of progress
reporting would've helped.

Thinking of another approach for reporting file processing alone: a
GUC log_file_processing_traffic = {none, medium, high} or {0, 1, 2,
..., limit} that users can set to emit a file processing log after a
certain number of files. It doesn't require a timeout mechanism, so it
can be used by any process. But it is specific to just files.

Similar to the above but a bit more generic, not specific to just file
processing: a GUC log_processing_traffic = {none, medium, high} or {0,
1, 2, ..., limit}.

Thoughts?

[1]
 /*
  * SIGALRM is always cause for waking anything waiting on the process
  * latch.
+ *
+ * Postmaster has no latch associated with it.
  */
-SetLatch(MyLatch);
+if (MyLatch)
+SetLatch(MyLatch);

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2022-09-29 Thread Damir Belyalov
>
> Do you mean you stop dealing with errors concerned with constraints and
> triggers and we should review 0006-COPY_IGNORE_ERRORS?
>
Yes, this patch is simpler and I think it's worth adding it first.


> Hmm, I applied v6 patch and when canceled COPY by sending SIGINT(ctrl +
> C), I faced the same situation as below.
> I tested it on CentOS 8.4.
>
Thank you for pointing out this error; it really needs to be taken into
account. In the previous 0006 patch, there were two stages in one
subtransaction: filling the buffer and 'replay mode' (reading from the
buffer). Since only NextCopyFrom() is included in PG_TRY(), and the
ERRCODE_QUERY_CANCELED error can occur anywhere, it is impossible to catch
it and roll back the subtransaction.

I changed the 0006 patch to fix this error; now only the replay-buffer
filling is done in the subtransaction.

Patch 0005 (which processes constraints) needs to be finalized, because it
requires subtransactions to roll back constraints and triggers. Therefore, it
is not possible to fix it yet. One option is to put the for(;;) loop inside
PG_TRY(); that would solve the problem, but the code would be too complicated.
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index c25b52d0cb..22c992e6f6 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -34,6 +34,7 @@ COPY { table_name [ ( format_name
 FREEZE [ boolean ]
+IGNORE_ERRORS [ boolean ]
 DELIMITER 'delimiter_character'
 NULL 'null_string'
 HEADER [ boolean | MATCH ]
@@ -233,6 +234,19 @@ COPY { table_name [ ( 

 
+   
+IGNORE_ERRORS
+
+ 
+  Drops rows that contain malformed data while copying. These are rows
+  containing syntax errors in data, rows with too many or too few columns,
+  rows containing columns where the data type's input function raises an error.
+  Outputs warnings about rows with incorrect data (the number of warnings
+  is not more than 100) and the total number of errors.
+ 
+
+   
+

 DELIMITER
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 49924e476a..f41b25f26a 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -406,6 +406,7 @@ ProcessCopyOptions(ParseState *pstate,
    bool is_from,
    List *options)
 {
+	bool		ignore_errors_specified = false;
 	bool		format_specified = false;
 	bool		freeze_specified = false;
 	bool		header_specified = false;
@@ -448,6 +449,13 @@ ProcessCopyOptions(ParseState *pstate,
 			freeze_specified = true;
 			opts_out->freeze = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "ignore_errors") == 0)
+		{
+			if (ignore_errors_specified)
+errorConflictingDefElem(defel, pstate);
+			ignore_errors_specified = true;
+			opts_out->ignore_errors = defGetBoolean(defel);
+		}
 		else if (strcmp(defel->defname, "delimiter") == 0)
 		{
 			if (opts_out->delim)
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index e8bb168aea..caa3375758 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -106,6 +106,9 @@ static char *limit_printout_length(const char *str);
 
 static void ClosePipeFromProgram(CopyFromState cstate);
 
+static bool FillReplayBuffer(CopyFromState cstate, ExprContext *econtext,
+			 TupleTableSlot *myslot);
+
 /*
  * error context callback for COPY FROM
  *
@@ -521,6 +524,177 @@ CopyMultiInsertInfoStore(CopyMultiInsertInfo *miinfo, ResultRelInfo *rri,
 	miinfo->bufferedBytes += tuplen;
 }
 
+/*
+ * Fills the replay buffer for safe copying; enabled by the IGNORE_ERRORS option.
+ */
+bool
+FillReplayBuffer(CopyFromState cstate, ExprContext *econtext, TupleTableSlot *myslot)
+{
+	SafeCopyFromState  *sfcstate = cstate->sfcstate;
+	bool valid_row = true;
+
+	if (!sfcstate->replay_is_active)
+	{
+		BeginInternalSubTransaction(NULL);
+		CurrentResourceOwner = sfcstate->oldowner;
+
+		while (sfcstate->saved_tuples < REPLAY_BUFFER_SIZE)
+		{
+			bool tuple_is_valid = false;
+
+			PG_TRY();
+			{
+MemoryContext cxt = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
+
+valid_row = NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull);
+
+if (valid_row)
+	tuple_is_valid = true;
+
+CurrentMemoryContext = cxt;
+			}
+			PG_CATCH();
+			{
+MemoryContext ecxt = MemoryContextSwitchTo(sfcstate->oldcontext);
+ErrorData *errdata = CopyErrorData();
+
+Assert(IsSubTransaction());
+RollbackAndReleaseCurrentSubTransaction();
+CurrentResourceOwner = sfcstate->oldowner;
+
+switch (errdata->sqlerrcode)
+{
+	/* Ignore data exceptions */
+	case ERRCODE_CHARACTER_NOT_IN_REPERTOIRE:
+	case ERRCODE_DATA_EXCEPTION:
+	case ERRCODE_ARRAY_ELEMENT_ERROR:
+	case ERRCODE_DATETIME_VALUE_OUT_OF_RANGE:
+	case ERRCODE_INTERVAL_FIELD_OVERFLOW:
+	case ERRCODE_INVALID_CHARACTER_VALUE_FOR_CAST:
+	case ERRCODE_INVALID_DATETIME_FORMAT:
+	case ERRCODE_INVALID_ESCAP

Re: Avoid memory leaks during base backups

2022-09-29 Thread Robert Haas
On Thu, Sep 29, 2022 at 4:29 AM Bharath Rupireddy
 wrote:
> Please review the attached v3 patch.

template1=# select * from pg_backup_start('sdgkljsdgkjdsg', true);
 pg_backup_start
-----------------
 0/228
(1 row)

template1=# select 1/0;
ERROR:  division by zero
template1=# select * from pg_backup_stop();
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
!?>

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-29 Thread Ronan Dunklau
> I've just pushed the disable byref Datums patch I posted earlier. I
> only made a small adjustment to make use of the TupleDescAttr() macro.
> Önder, thank you for the report.

Thank you David for taking care of this.

> Yeah, I think the same rules around scope apply as
> tuplesort_gettupleslot() with copy==false.  We could do it by adding a
> copy flag to the existing function, but I'd rather not add the
> branching to that function. It's probably just better to duplicate it
> and adjust.
> 

For the record, I tried to see if gcc would optimize the function by 
generating two different versions when copy is true or false, thus getting rid 
of the branching while still having only one function to deal with. Using the 
-fipa-cp-clone (or even the whole set of additional flags coming with -O3), it 
does generate a special-case version of the function, but it seems to then 
only be used by heapam_index_validate_scan and 
percentile_cont_multi_final_common. This is from my investigation looking for 
references to the specialized version in the DWARF debug information.

Regards,

-- 
Ronan Dunklau






Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-29 Thread Tom Lane
Ronan Dunklau  writes:
>> Yeah, I think the same rules around scope apply as
>> tuplesort_gettupleslot() with copy==false.  We could do it by adding a
>> copy flag to the existing function, but I'd rather not add the
>> branching to that function. It's probably just better to duplicate it
>> and adjust.

> For the record, I tried to see if gcc would optimize the function by 
> generating two different versions when copy is true or false, thus getting 
> rid 
> of the branching while still having only one function to deal with.

TBH, I think this is completely ridiculous over-optimization.
There's exactly zero evidence that a second copy of the function
would improve performance, or do anything but contribute to code
bloat (which does have a distributed performance cost).

regards, tom lane




Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-09-29 Thread Bharath Rupireddy
On Tue, Sep 27, 2022 at 8:31 AM Kyotaro Horiguchi
 wrote:
>
> If all error-emitting site knows the LSN, we don't need the context
> message. But *I* would like that the additional message looks like
> "while reading record at LSN %X/%X" or slightly shorter version of
> it. Because the targetRecPtr is the beginning of the current reading
> record, not the LSN for the segment and offset. It may point to past
> segments.

I think we could just say "LSN %X/%X, offset %u", because the overall
context whether it's being read or written is implicit with the other
part of the message.

Please see the attached v1 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b19aa25a0d1f2ce85abe0c2081c0e7ede256e329 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 29 Sep 2022 13:29:44 +
Subject: [PATCH v1] Add LSN along with offset to error messages reported for
 WAL file read/write/validate header failures

---
 src/backend/access/transam/xlog.c |  8 ++--
 src/backend/access/transam/xlogreader.c   | 16 +++-
 src/backend/access/transam/xlogrecovery.c | 13 -
 src/backend/access/transam/xlogutils.c| 10 ++
 src/include/access/xlogreader.h   |  1 +
 5 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e15256db8..a495bbac85 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2218,6 +2218,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 {
 	char		xlogfname[MAXFNAMELEN];
 	int			save_errno;
+	XLogRecPtr	lsn;
 
 	if (errno == EINTR)
 		continue;
@@ -2226,11 +2227,14 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 	XLogFileName(xlogfname, tli, openLogSegNo,
  wal_segment_size);
 	errno = save_errno;
+	XLogSegNoOffsetToRecPtr(openLogSegNo, startoffset,
+			wal_segment_size, lsn);
 	ereport(PANIC,
 			(errcode_for_file_access(),
 			 errmsg("could not write to log file %s "
-	"at offset %u, length %zu: %m",
-	xlogfname, startoffset, nleft)));
+	"at offset %u, LSN %X/%X, length %zu: %m",
+	xlogfname, startoffset,
+	LSN_FORMAT_ARGS(lsn), nleft)));
 }
 nleft -= written;
 from += written;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5a8fe81f82..c3befc44ba 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1229,9 +1229,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
-			  "invalid magic number %04X in WAL segment %s, offset %u",
+			  "invalid magic number %04X in WAL segment %s, LSN %X/%X, offset %u",
 			  hdr->xlp_magic,
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1243,9 +1244,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
-			  "invalid info bits %04X in WAL segment %s, offset %u",
+			  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 			  hdr->xlp_info,
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1284,9 +1286,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
-			  "invalid info bits %04X in WAL segment %s, offset %u",
+			  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 			  hdr->xlp_info,
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1303,9 +1306,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
-			  "unexpected pageaddr %X/%X in WAL segment %s, offset %u",
+			  "unexpected pageaddr %X/%X in WAL segment %s, LSN %X/%X, offset %u",
 			  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1328,10 +1332,11 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 			report_invalid_record(state,
-  "out-of-sequence timeline ID %u (after %u) in WAL segment %s, offset %u",
+  "out-of-sequence timeline ID %u (after %u) in WAL segment %s, LSN %X/%X, offset %u",
   hdr->xlp_tli,
   state->latestPageTLI,
   fname,
+  LSN

Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-29 Thread Ronan Dunklau
Le jeudi 29 septembre 2022, 16:10:03 CEST Tom Lane a écrit :
> Ronan Dunklau  writes:
> >> Yeah, I think the same rules around scope apply as
> >> tuplesort_gettupleslot() with copy==false.  We could do it by adding a
> >> copy flag to the existing function, but I'd rather not add the
> >> branching to that function. It's probably just better to duplicate it
> >> and adjust.
> > 
> > For the record, I tried to see if gcc would optimize the function by
> > generating two different versions when copy is true or false, thus getting
> > rid
> > of the branching while still having only one function to deal with.
> 
> TBH, I think this is completely ridiculous over-optimization.
> There's exactly zero evidence that a second copy of the function
> would improve performance, or do anything but contribute to code
> bloat (which does have a distributed performance cost).

I wasn't commenting on the merit of the optimization, but just that I tried to 
get gcc to apply it itself, which it doesn't.

Regards,

-- 
Ronan Dunklau






Re: disfavoring unparameterized nested loops

2022-09-29 Thread Benjamin Coutu
I'd like to revamp this important discussion.

As is well described in this fairly recent paper,
https://www.vldb.org/pvldb/vol9/p204-leis.pdf (which also looks at Postgres),
"estimation errors quickly grow as the number of joins increases, and that
these errors are usually the reason for bad plans" - I think we can all get
behind that statement.

While nested loop joins work great when cardinality estimates are correct,
they are notoriously bad when the optimizer underestimates, and they degrade
very fast in such cases - the good vs. bad here is very asymmetric. On the
other hand, hash joins degrade much more gracefully - they are considered very
robust against underestimation. The above-mentioned paper illustrates that all
major DBMSs (including Postgres) tend to underestimate, and usually that
underestimation increases drastically with the number of joins (see Figures
3+4 of the paper).

Now, a simple approach to guarding against bad plans that arise from
underestimation could be to use what I would call a nested-loop conviction
multiplier based on the current depth of the join tree: for a base table that
multiplier would obviously be 1, but it would then grow (e.g.) quadratically.
That conviction multiplier would *NOT* be used to skew the cardinality
estimates themselves, but rather be applied to the overall nested loop join
cost at each particular stage of the plan when comparing it to other, more
robust join strategies like hash or sort-merge joins. That way, when we can be
sure to have a good estimate at the bottom of the join tree, we treat all
strategies equally, but favor nested loops less and less as we move up the
join tree, for the sake of robustness.
Also, we can expand the multiplier whenever we fall back to using the default
cardinality constant, as surely all bets are off at that point - we should
definitely treat nested loop joins as out of favor in this instance, and that
could easily be incorporated by simply increasing the conviction multiplier.

What are your thoughts on this simple idea - is it perhaps too simple?

Cheers, Ben




Re: kerberos/001_auth test fails on arm CPU darwin

2022-09-29 Thread Stephen Frost
Greetings,

* Peter Eisentraut (peter.eisentr...@enterprisedb.com) wrote:
> On 27.09.22 03:37, Andres Freund wrote:
> > Maybe we should rely on PATH, rather than hardcoding OS dependent locations?
> > Or at least fall back to seach binaries in PATH? Seems pretty odd to 
> > hardcode
> > all these locations without a way to influence it from outside the test.
> 
> Homebrew intentionally does not install the krb5 and openldap packages into
> the path, because they conflict with macOS-provided software. However, those
> macOS-provided variants don't provide all the pieces we need for the tests.

The macOS-provided versions are also old and broken, or at least that
was the case when I looked into them last.

> Also, on Linux you need /usr/sbin, which is often not in the path.
> 
> So I think there is no good way around hardcoding a lot of these paths.

Yeah, not sure what else to do.

Thanks,

Stephen




Re: identifying the backend that owns a temporary schema

2022-09-29 Thread Tom Lane
Nathan Bossart  writes:
> On Wed, Sep 28, 2022 at 06:56:20PM -0400, Tom Lane wrote:
>> A point that still bothers me a bit about pg_stat_get_backend_idset is
>> that it could miss or duplicate some backend IDs if the user calls
>> pg_stat_clear_snapshot() partway through the SRF's run, and we reload
>> a different set of backend entries than we had before.  I added a comment
>> about that, with an argument why it's not worth working harder, but
>> is the argument convincing?  If not, what should we do?

> Isn't this an existing problem?  Granted, it'd manifest differently with
> this patch, but ISTM we could already end up with too many or too few
> backend IDs if there's a refresh partway through.

Right.  I'd been thinking the current code wouldn't generate duplicate IDs
--- but it might produce duplicate query output anyway, in case a given
backend's entry is later in the array than it was before.  So really
there's not much guarantees here in any case.

>> -if (beid < 1 || beid > localNumBackends)
>> -return NULL;

> The reason I'd kept this in was because I was worried about overflow in the
> comparator function.  Upon further inspection, I don't think there's
> actually any way that will produce incorrect results.  And I'm not sure we
> should worry about such cases too much, anyway.

Ah, I see: if the user passes in a "backend ID" that is close to INT_MIN,
then the comparator's subtraction could overflow.  We could fix that by
writing out the comparator code longhand ("if (a < b) return -1;" etc),
but I don't really think it's necessary.  bsearch is guaranteed to
correctly report that such a key is not present, even if it takes a
strange search path through the array due to inconsistent comparator
results.  So the test quoted above just serves to fail a bit more quickly,
but we certainly shouldn't be optimizing for the case of a bad ID.

> Overall, LGTM.

OK.  Will push shortly.

regards, tom lane




Re: kerberos/001_auth test fails on arm CPU darwin

2022-09-29 Thread Tom Lane
Stephen Frost  writes:
> * Peter Eisentraut (peter.eisentr...@enterprisedb.com) wrote:
>> Homebrew intentionally does not install the krb5 and openldap packages into
>> the path, because they conflict with macOS-provided software. However, those
>> macOS-provided variants don't provide all the pieces we need for the tests.

> The macOS-provided versions are also old and broken, or at least that
> was the case when I looked into them last.

Yeah.  They also generate tons of deprecation warnings at compile time,
so it's not like Apple is encouraging you to use them.  I wonder why
they're still there at all.

regards, tom lane




Re: making relfilenodes 56 bits

2022-09-29 Thread Maxim Orlov
Hi!

I'm not in the context of this thread, but I've noticed something strange
while attempting to rebase my patch set from the 64XID thread.
As far as I'm aware, this patch set adds "relfilenumber". So, in
pg_control_checkpoint, we have the following changes:

diff --git a/src/backend/utils/misc/pg_controldata.c
b/src/backend/utils/misc/pg_controldata.c
index 781f8b8758..d441cd97e2 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -79,8 +79,8 @@ pg_control_system(PG_FUNCTION_ARGS)
 Datum
 pg_control_checkpoint(PG_FUNCTION_ARGS)
 {
-   Datum   values[18];
-   boolnulls[18];
+   Datum   values[19];
+   boolnulls[19];
TupleDesc   tupdesc;
HeapTuple   htup;
ControlFileData *ControlFile;
@@ -129,6 +129,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
   XIDOID, -1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 18, "checkpoint_time",
   TIMESTAMPTZOID, -1, 0);
+   TupleDescInitEntry(tupdesc, (AttrNumber) 19, "next_relfilenumber",
+  INT8OID, -1, 0);
tupdesc = BlessTupleDesc(tupdesc);

/* Read the control file. */

In other words, we have 19 attributes, but tupdesc here is constructed for
18 elements:
tupdesc = CreateTemplateTupleDesc(18);

Is that normal or not? Again, I'm not in this thread, so if that is
completely OK, I'm sorry about the noise.

-- 
Best regards,
Maxim Orlov.


Re: making relfilenodes 56 bits

2022-09-29 Thread Robert Haas
On Thu, Sep 29, 2022 at 10:50 AM Maxim Orlov  wrote:
> In other words, we have 19 attributes. But tupdesc here is constructed for 18 
> elements:
> tupdesc = CreateTemplateTupleDesc(18);
>
> Is that normal or not? Again, I'm not in this thread and if that is 
> completely ok, I'm sorry about the noise.

I think that's a mistake. Thanks for the report.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 7:10 AM Tom Lane  wrote:
> TBH, I think this is completely ridiculous over-optimization.
> There's exactly zero evidence that a second copy of the function
> would improve performance, or do anything but contribute to code
> bloat (which does have a distributed performance cost).

I thought that that was unjustified myself.

-- 
Peter Geoghegan




Re: Adding CommandID to heap xlog records

2022-09-29 Thread Matthias van de Meent
On Wed, 28 Sept 2022 at 19:40, Bruce Momjian  wrote:
>
> On Thu, Sep 22, 2022 at 11:12:32PM +0200, Matthias van de Meent wrote:
> > On Thu, 8 Sept 2022 at 23:24, Tom Lane  wrote:
> > >
> > > Matthias van de Meent  writes:
> > > > Please find attached a patch that adds the CommandId of the inserting
> > > > transaction to heap (batch)insert, update and delete records. It is
> > > > based on the changes we made in the fork we maintain for Neon.
> > >
> > > This seems like a very significant cost increment with returns
> > > to only a minuscule number of users.  We certainly cannot consider
> > > it unless you provide some evidence that that impression is wrong.
> >
> > Attached a proposed set of patches to reduce overhead of the inital patch.
>
> This might be obvious to some, but the patch got a lot larger.  :-(

Sorry for that, but updating the field from which the redo manager
should pull its information does indeed touch a lot of files because
most users of xl_info are only interested in the 4 bits reserved for
the redo-manager. Most of 0001 is therefore updates to point code to
the new field in XLogRecord, and renaming the variables and arguments
from info to rminfo.

[tangent] With that refactoring, I also cleaned up a lot of code that
was using a wrong macro/constant for rmgr flags; `info &
~XLR_INFO_MASK` may have the same value as `info &
XLR_RMGR_INFO_MASK`, but that's only guaranteed by the documentation,
and it would require the same significant rework if new bits were
assigned to non-XLR_INFO_MASK and non-XLR_RMGR_INFO_MASK bits. [/tangent]

0002 grew a bit as well, but not to a degree that I think is worrying
or otherwise impossible to review.

Kind regards,

Matthias van de Meent.




Should setting TupleTableSlotOps get_heap_tuple=NULL break INSERT..ON CONFLICT DO UPDATE..RETURNING?

2022-09-29 Thread Mark Dilger
Hackers,

Per the documentation in TupleTableSlotOps, an AM can choose not to supply a 
get_heap_tuple function, and instead set this field to NULL.  Doing so appears 
to almost work, but breaks the xmin and xmax returned by a INSERT..ON CONFLICT 
DO UPDATE..RETURNING.  In particular, the call chain ExecOnConflictUpdate -> 
ExecUpdate -> table_tuple_update seems to expect that upon return from 
table_tuple_update, the slot will hold onto a copy of the updated tuple, 
including its header fields.  This assumption is inherent in how the slot is 
later used by the destination receiver.  But for TAMs which do not keep a 
heaptuple copy of their own, the slot will only have copies of (tts_tupleDescriptor, 
tts_values, tts_isnull) to use to form up a tuple when the receiver asks for 
one, and that formed-up MinimalTuple won't be preceded by any meaningful header.

I would expect similar problems for an UPDATE..RETURNING, but have not tested 
that yet.

I'd like to know if others agree with my analysis, and if this is a bug in the 
RETURNING, or just an unsupported way to design a custom TAM.  If the latter, 
is this documented somewhere?  For reference, I am working against 
REL_14_STABLE.



Details

To illustrate this issue, I expanded update.sql a little to give a bit more 
information, demonstrating that the xmin and xmax returned are not the 
same as what gets written to the table for the same row, using a custom TAM 
named "pile" which neglects to provide a get_heap_tuple implementation:


SELECT tableoid::regclass, xmin, xmax, pg_current_xact_id()::xid, a, b FROM 
upsert_test;
  tableoid   | xmin | xmax | pg_current_xact_id | a | b
-+--+--++---+---
 upsert_test |  756 |  756 |757 | 1 | Foo, Correlated, Excluded
 upsert_test |  756 |  756 |757 | 3 | Zoo, Correlated, Excluded
(2 rows)

INSERT INTO upsert_test VALUES (2, 'Beeble') ON CONFLICT(a)
  DO UPDATE SET (b, a) = (SELECT b || ', Excluded', a from upsert_test i WHERE 
i.a = excluded.a)
  RETURNING tableoid::regclass, xmin, xmax, pg_current_xact_id()::xid, xmin = 
pg_current_xact_id()::xid AS xmin_correct, xmax = 0 AS xmax_correct;
  tableoid   | xmin |xmax| pg_current_xact_id | xmin_correct | 
xmax_correct
-+--+++--+--
 upsert_test |  140 | 4294967295 |758 | f| f
(1 row)

SELECT tableoid::regclass, xmin, xmax, pg_current_xact_id()::xid, a, b FROM 
upsert_test;
  tableoid   | xmin | xmax | pg_current_xact_id | a | b
-+--+--++---+---
 upsert_test |  756 |  756 |759 | 1 | Foo, Correlated, Excluded
 upsert_test |  756 |  756 |759 | 3 | Zoo, Correlated, Excluded
 upsert_test |  758 |0 |759 | 2 | Beeble
(3 rows)


Adding a bogus Assert, I can get the following stack trace, showing in frame 4 
that tts_buffer_pile_copy_heap_tuple is called (rather than the 
tts_buffer_pile_get_heap_tuple which was called in this location prior to 
changing get_heap_tuple to NULL).  In frame 6, pileam_tuple_update is going 
to see that shouldFree is true and will free the slot's tuple, so the slot's 
copy won't be valid by the time the dest receiver wants it.  That will force a 
tts_pile_materialize call, but since the slot's tuple will not be valid, the 
materialize will operate by forming a tuple from the (descriptor, values, isnull) 
triple, rather than by copying a tuple, and the pile_form_tuple call won't do 
anything to set the tuple header fields.


* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x7fff70ea632a libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x7fff70f62e60 libsystem_pthread.dylib`pthread_kill + 430
frame #2: 0x7fff70e2d808 libsystem_c.dylib`abort + 120
frame #3: 0x00010c992251 
postgres`ExceptionalCondition(conditionName="false", 
errorType="FailedAssertion", fileName="access/pile_slotops.c", lineNumber=419) 
at assert.c:69:2
frame #4: 0x00010d33f8b8 
pile.so`tts_buffer_pile_copy_heap_tuple(slot=0x7f9edb02c550) at 
pile_slotops.c:419:3
frame #5: 0x00010d33e2f7 
pile.so`ExecFetchSlotPileTuple(slot=0x7f9edb02c550, materialize=true, 
shouldFree=0x7ffee3aa3ace) at pile_slotops.c:639:32
frame #6: 0x00010d35bccc 
pile.so`pileam_tuple_update(relation=0x7f9ed0083c58, 
otid=0x7ffee3aa3f30, slot=0x7f9edb02c550, cid=0, 
snapshot=0x7f9edd04b7f0, crosscheck=0x, wait=true, 
tmfd=0x7ffee3aa3cf8, lockmode=0x7ffee3aa3ce8, 
update_indexes=0x7ffee3aa3ce6) at pileam_handler.c:327:20
frame #7: 0x00010c51bcad 
postgres`table_tuple_update(rel=0x7f9ed0083c58, otid=0x7ffee3aa3f30, 
slot=0x7f9edb02c550, cid=0, snapshot=0x7f9edd04b7f0, 
crosscheck=0x, wait=true, tmfd=0x7

Re: Should setting TupleTableSlotOps get_heap_tuple=NULL break INSERT..ON CONFLICT DO UPDATE..RETURNING?

2022-09-29 Thread Andres Freund
Hi,

On 2022-09-29 09:04:42 -0700, Mark Dilger wrote:
> Per the documentation in TupleTableSlotOps, an AM can choose not to supply a
> get_heap_tuple function, and instead set this field to NULL.  Doing so
> appears to almost work, but breaks the xmin and xmax returned by a
> INSERT..ON CONFLICT DO UPDATE..RETURNING.  In particular, the call chain
> ExecOnConflictUpdate -> ExecUpdate -> table_tuple_update seems to expect
> that upon return from table_tuple_update, the slot will hold onto a copy of
> the updated tuple, including its header fields.  This assumption is inherent
> in how the slot is later used by the destination receiver.  But for TAMs
> which do not keep a copy heaptuple of their own, the slot will only have
> copies of (tts_tupleDescriptor, tts_values, tts_isnull) to use to form up a
> tuple when the receiver asks for one, and that formed up MinimalTuple won't
> be preceded by any meaningful header.

I would assume that this can be avoided by the tuple slot implementation, but
without seeing what precisely you did in your pile slot...

Greetings,

Andres Freund




Re: identifying the backend that owns a temporary schema

2022-09-29 Thread Nathan Bossart
On Thu, Sep 29, 2022 at 10:47:06AM -0400, Tom Lane wrote:
> OK.  Will push shortly.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: problems with making relfilenodes 56-bits

2022-09-29 Thread Matthias van de Meent
On Thu, 29 Sep 2022, 00:06 Robert Haas,  wrote:
>
> 2. WAL Size. Block references in the WAL are by RelFileLocator, so if
> you make RelFileLocators bigger, WAL gets bigger. We'd have to test
> the exact impact of this, but it seems a bit scary: if you have a WAL
> stream with few FPIs doing DML on a narrow table, probably most
> records will contain 1 block reference (and occasionally more, but I
> guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
> block reference feels like it might add up to something significant. I
> don't really see any way around this, either: if you make relfilenode
> values wider, they take up more space. Perhaps there's a way to claw
> that back elsewhere, or we could do something really crazy like switch
> to variable-width representations of integer quantities in WAL
> records, but there doesn't seem to be any simple way forward other
> than, you know, deciding that we're willing to pay the cost of the
> additional WAL volume.

Re: WAL volume and record size optimization

I've been working off and on with WAL for some time now due to [0] and
the interest of Neon in the area, and I think we can reduce the size
of the base record by a significant margin:

Currently, our minimal WAL record is exactly 24 bytes: length (4B),
TransactionId (4B), previous record pointer (8B), flags (1B), redo
manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these
fields, TransactionID could reasonably be omitted for certain WAL
records (as example: index insertions don't really need the XID).
Additionally, the length field could be made to be variable length,
and any padding is just plain bad (adding 4 bytes to all
insert/update/delete/lock records was frowned upon).

I'm working on a prototype patch for a more bare-bones WAL record
header of which the only required fields would be prevptr (8B), CRC
(4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't
yet know the performance of this, but considering that there will
be a lot more conditionals in header decoding, it might be slower for
any one backend, but faster overall (fewer IOps in total).

The flags field would be indications for additional information: [flag
name (bits): explanation (additional xlog header data in bytes)]
- len_size(0..1): xlog record size is at most xlrec_header_only (0B),
uint8_max(1B), uint16_max(2B), uint32_max(4B)
- has_xid (2): contains transaction ID of logging transaction (4B, or
probably 8B when we introduce 64-bit xids)
- has_cid (3): contains the command ID of the logging statement (4B)
(rationale for logging CID in [0], now in record header because XID is
included there as well, and both are required for consistent
snapshots)
- has_rminfo (4): has non-zero redo-manager flags field (1B)
(rationale for separate field [1], non-zero allows 1B space
optimization for one of each RMGR's operations)
- special_rel (5): pre-existing definition
- check_consistency (6): pre-existing definition
- unset (7): no meaning defined yet. Could be used for full record
compression, or other purposes.

A normal record header (XLOG record with at least some registered
data) would be only 15 to 17 bytes (0-1B rminfo + 1-2B in xl_len), and
one with XID only up to 21 bytes. So, when compared to the current
XLogRecord format, we would in general recover 2 or 3 bytes from the
xl_tot_len field, 1 or 2 bytes from the alignment hole, and
potentially the 4 bytes of the xid when that data is considered
useless during recovery, or physical or logical replication.

Kind regards,

Matthias van de Meent

[0] 
https://postgr.es/m/CAEze2WhmU8WciEgaVPZm71vxFBOpp8ncDc%3DSdEHHsW6HS%2Bk9zw%40mail.gmail.com
[1] https://postgr.es/m/20220715173731.6t3km5cww3f5ztfq%40awork3.anarazel.de




Re: Should setting TupleTableSlotOps get_heap_tuple=NULL break INSERT..ON CONFLICT DO UPDATE..RETURNING?

2022-09-29 Thread Mark Dilger



> On Sep 29, 2022, at 9:22 AM, Andres Freund  wrote:
> 
> I would assume that this can be avoided by the tuple slot implementation, but
> without seeing what precisely you did in your pile slot...

"pile" is just a copy of "heap" placed into an extension with a slightly 
smarter version of s/heap/pile/g performed across the sources.  It is intended 
to behave exactly as heap does.  Without disabling the get_heap_tuple function, 
it passes a wide variety of the regression/isolation/tap tests.  To test the 
claim made in the TupleTableSlotOps code comments, I disabled that one function:

/*
 * Return a heap tuple "owned" by the slot. It is slot's responsibility to
 * free the memory consumed by the heap tuple. If the slot can not "own" a
 * heap tuple, it should not implement this callback and should set it as
 * NULL.
 */
HeapTuple   (*get_heap_tuple) (TupleTableSlot *slot);

That comment suggests that I do not need to keep a copy of the heap tuple, and 
per the next comment:

/*
 * Return a copy of heap tuple representing the contents of the slot. The
 * copy needs to be palloc'd in the current memory context. The slot
 * itself is expected to remain unaffected. It is *not* expected to have
 * meaningful "system columns" in the copy. The copy is not be "owned" by
 * the slot i.e. the caller has to take responsibility to free memory
 * consumed by the slot.
 */
HeapTuple   (*copy_heap_tuple) (TupleTableSlot *slot);

I do not need to keep a copy of the "system columns".  But clearly this doesn't 
work.  When get_heap_tuple=NULL, the AM's tuple_update is at liberty to free 
the update tuple (per the above documentation) and later return a copy of the 
slot's tuple sans any "system columns" (also per the above documentation) and 
that's when the core code breaks.  It's not the TAM that is broken here, not 
according to the interface's documentation as I read it.  Am I reading it wrong?



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Avoid memory leaks during base backups

2022-09-29 Thread Bharath Rupireddy
On Thu, Sep 29, 2022 at 7:05 PM Robert Haas  wrote:
>
> On Thu, Sep 29, 2022 at 4:29 AM Bharath Rupireddy
>  wrote:
> > Please review the attached v3 patch.
>
> template1=# select * from pg_backup_start('sdgkljsdgkjdsg', true);
>  pg_backup_start
> -
>  0/228
> (1 row)
>
> template1=# select 1/0;
> ERROR:  division by zero
> template1=# select * from pg_backup_stop();
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> The connection to the server was lost. Attempting reset: Failed.
> !?>

Thanks! I used a session-level variable to define the scope of the backup
memory context cleanup for the SQL functions/on-line backup case. We don't
have this problem in the base backup case because we don't return control
to the user between start and stop backup in perform_base_backup().

Please review the v4 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From f22c5e3afcdf2bf0d7bbcab9356f9ad70a7ec201 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 29 Sep 2022 16:31:59 +
Subject: [PATCH v4] Avoid memory leaks during backups

Postgres currently can leak memory if a failure occurs during base
backup in do_pg_backup_start(), do_pg_backup_stop() or
perform_base_backup(). The palloc'd memory, such as backup_state or
tablespace_map in xlogfuncs.c or basebackup.c, or tablespaceinfo in
perform_base_backup(), is leaked, which may eventually cause memory
bloat on the server. To reproduce the issue, run pg_basebackup with
a --label name longer than 1024 characters and observe memory with
the watch command; the memory usage goes up. It looks like the
memory leak issue has been there for quite some time.

To solve the above problem, we create a separate memory context for
backups and add a cleanup callback in PostgresMain()'s error handling
code using sigsetjmp().

The design of the patch is as follows:

For backup functions or on-line backups, a new memory context is
created as a child of TopMemoryContext in pg_backup_start() which
gets deleted upon any error or at the end of pg_backup_stop().

For perform_base_backup() or base backups, a new memory context is
created as a child of CurrentMemoryContext which gets deleted upon
any error or at the end of perform_base_backup().
---
 src/backend/access/transam/xlogfuncs.c | 103 +++--
 src/backend/backup/basebackup.c|  40 ++
 src/backend/replication/walsender.c|   2 +
 src/backend/tcop/postgres.c|   4 +
 src/include/access/xlogbackup.h|   1 +
 src/include/backup/basebackup.h|   1 +
 6 files changed, 127 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index a801a94fe8..985be59ca8 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -45,6 +45,30 @@
 static BackupState *backup_state = NULL;
 static StringInfo tablespace_map = NULL;
 
+/* A workspace for on-line backup. */
+static MemoryContext backupcontext = NULL;
+
+/*
+ * A session-level variable to define the scope of backup memory context clean
+ * up.
+ */
+static bool allowedToCleanBackupMemCxt = false;
+
+/*
+ * Clean up backup memory context we created.
+ */
+void
+BackupMemoryContextCleanup(void)
+{
+	if (backupcontext == NULL || allowedToCleanBackupMemCxt == false)
+		return;		/* Nothing to do. */
+
+	MemoryContextDelete(backupcontext);
+	backupcontext = NULL;
+	backup_state = NULL;
+	tablespace_map = NULL;
+}
+
 /*
  * pg_backup_start: set up for taking an on-line backup dump
  *
@@ -71,34 +95,45 @@ pg_backup_start(PG_FUNCTION_ARGS)
  errmsg("a backup is already in progress in this session")));
 
 	/*
-	 * backup_state and tablespace_map need to be long-lived as they are used
-	 * in pg_backup_stop().
+	 * We're okay to clean backup memory context while an error occurs within
+	 * this function.
 	 */
-	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+	allowedToCleanBackupMemCxt = true;
 
-	/* Allocate backup state or reset it, if it comes from a previous run */
-	if (backup_state == NULL)
-		backup_state = (BackupState *) palloc0(sizeof(BackupState));
-	else
-		MemSet(backup_state, 0, sizeof(BackupState));
+	/*
+	 * backup_state and tablespace_map need to be long-lived as they are used
+	 * in pg_backup_stop(). Create a special memory context as a direct child
+	 * of TopMemoryContext so that the memory allocated is carried till the
+	 * pg_backup_stop() and cleaned in case of errors (see error handling code
+	 * using sigsetjmp() in PostgresMain()).
+	 */
+	if (backupcontext == NULL)
+		backupcontext = AllocSetContextCreate(TopMemoryContext,
+			  "on-line backup context",
+			  ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * tablespace_map may have been created in a previous bac

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-09-29 Thread Önder Kalacı
Hi Hayato Kuroda,

Thanks for the review!


> 
> 01 relation.c - general
>
> Many files are newly included.
> I was not sure but some codes related with planner may be able to move to
> src/backend/optimizer/plan.
> How do you and any other one think?
>
>
My thinking is that those functions should probably stay
in src/backend/replication/logical/relation.c. My main motivation is that
they are so tailored to the purposes of this file that I
cannot see any use case for them in any other context.

Still, at some point, I considered maybe doing something similar
to src/backend/executor/execReplication.c, where I create a new file,
say, src/backend/optimizer/plan/planReplication.c or such as you noted. I'm
a bit torn on this.

Does anyone have any strong opinions for moving to
src/backend/optimizer/plan/planReplication.c? (or another name)


> 02 relation.c - FindLogicalRepUsableIndex
>
> ```
> +/*
> + * Returns an index oid if we can use an index for the apply side. If not,
> + * returns InvalidOid.
> + */
> +static Oid
> +FindLogicalRepUsableIndex(Relation localrel, LogicalRepRelation
> *remoterel)
> ```
>
> I grepped files, but I cannot find the word "apply side". How about
> "subscriber" instead?
>

Yes, that makes sense. I guess I made up "apply side" as there is the
concept of an "apply worker". But, yes, "subscriber" sounds better; updated.


>
> 03 relation.c - FindLogicalRepUsableIndex
>
> ```
> +   /* Simple case, we already have an identity or pkey */
> +   idxoid = GetRelationIdentityOrPK(localrel);
> +   if (OidIsValid(idxoid))
> +   return idxoid;
> +
> +   /*
> +* If index scans are disabled, use a sequential scan.
> +*
> +* Note that we still allowed index scans above when there is a
> primary
> +* key or unique index replica identity, but that is the legacy
> behaviour
> +* (even when enable_indexscan is false), so we hesitate to move
> this
> +* enable_indexscan check to be done earlier in this function.
> +*/
> +   if (!enable_indexscan)
> +   return InvalidOid;
> ```
>
> a.
> I think "identity or pkey" should be "replica identity key or primary key"
> or "RI or PK"
>

Looking into other places, it seems "replica identity index" is favored
over "replica identity key". So, I used that term.

You can see this pattern in RelationGetReplicaIndex()


>
> b.
> Later part should be at around GetRelationIdentityOrPK.
>

Hmm, I cannot follow this comment. Can you please clarify?


>
>
> 04 relation.c - FindUsableIndexForReplicaIdentityFull
>
> ```
> +   MemoryContext usableIndexContext;
> ...
> +   usableIndexContext = AllocSetContextCreate(CurrentMemoryContext,
> +
> "usableIndexContext",
> +
> ALLOCSET_DEFAULT_SIZES);
> ```
>
> I grepped other sources, and I found that the name like "tmpcxt" is used
> for the temporary MemoryContext.
>

I think there are also several contexts that are named more specifically,
such as new_pdcxt, perTupCxt, anl_context, cluster_context and many others.

So I think it is better to have specific names, no?


>
> 05 relation.c - SuitablePathsForRepIdentFull
>
> ```
> +   indexRelation = index_open(idxoid,
> AccessShareLock);
> +   indexInfo = BuildIndexInfo(indexRelation);
> +   is_btree = (indexInfo->ii_Am == BTREE_AM_OID);
> +   is_partial = (indexInfo->ii_Predicate != NIL);
> +   is_only_on_expression =
> IndexOnlyOnExpression(indexInfo);
> +   index_close(indexRelation, NoLock);
> ```
>
> Why the index is closed with NoLock? AccessShareLock is acquired, so
> shouldn't same lock be released?
>

Hmm, yes, you are right. Keeping the lock seems unnecessary and wrong;
it could actually have prevented dropping an index. However, given
that RelationFindReplTupleByIndex() also closes this index at the end, the
apply worker does eventually release the lock, so no problem was observed.

Anyway, I'm changing it to release the lock.

Also note that as soon as any index is dropped on the relation, the cache
is invalidated and suitable indexes are re-calculated. That's why it seems
fine to release the lock.


>
>
> 06 relation.c - GetCheapestReplicaIdentityFullPath
>
> IIUC a query like "SELECT tbl WHERE attr1 = $1 AND attr2 = $2 ... AND
> attrN = $N" is emulated, right?
> you can write explicitly it as comment
>
>
The function already has an inline comment along those lines. Is that clear
enough?

/*
 * Generate restrictions for all columns in the form of col_1 = $1 AND
 * col_2 = $2 ...
 */


> 07 relation.c - GetCheapestReplicaIdentityFullPath
>
> ```
> +   Path   *path = (Path *) lfirst(lc);
> +   Oid idxoid = GetIndexOidFromPath(path);
> +
> +   if (!OidIsValid(idxoid))
> +   {
> +   /* Unrelated Path, skip

Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-29 Thread Önder Kalacı
Hi David, Tom, all,


> I've just pushed the disable byref Datums patch I posted earlier. I
> only made a small adjustment to make use of the TupleDescAttr() macro.
> Önder, thank you for the report.
>
>
With this commit, I re-ran the query patterns where we observed the
problem; all looks good now. Sharing this information as an FYI.

Thanks for the quick turnaround!

Onder KALACI


Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-29 Thread Nathan Bossart
On Thu, Sep 29, 2022 at 11:32:32AM +0530, Bharath Rupireddy wrote:
> On Sat, Sep 24, 2022 at 1:54 AM Nathan Bossart  
> wrote:
>>
>> +PGAlignedXLogBlock zbuffer;
>> +
>> +memset(zbuffer.data, 0, XLOG_BLCKSZ);
>>
>> This seems excessive for only writing a single byte.
> 
> Yes, I removed it now, instead doing pg_pwrite(fd, "\0", 1,
> wal_segment_size - 1).

I don't think removing the use of PGAlignedXLogBlock here introduces any
sort of alignment risk, so this should be alright.

+#ifdef WIN32
+/*
+ * WriteFile() on Windows changes the current file position, hence we
+ * need an explicit lseek() here. See pg_pwrite() implementation in
+ * win32pwrite.c for more details.
+ */

Should we really surround this with a WIN32 check, or should we just
unconditionally lseek() here?  I understand that this helps avoid an extra
system call on many platforms, but in theory another platform introduced in
the future could have the same problem, and this seems like something that
could easily be missed.  Presumably we could do something fancier to
indicate pg_pwrite()'s behavior in this regard, but I don't know if that
sort of complexity is really worth it in order to save an lseek().

+iov[0].iov_base = zbuffer.data;

This seems superfluous, but I don't think it's hurting anything.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: more descriptive message for process termination due to max_slot_wal_keep_size

2022-09-29 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Wed, 28 Sep 2022 16:30:37 -0400, Tom Lane  wrote in 
>> I think you should leave the primary message alone and add a DETAIL,
>> as the first version of the patch did.

> So I'm going to change the message as:

> LOG:  terminating process %d to release replication slot \"%s\"
> DETAIL:  The slot's restart_lsn %X/%X exceeds the limit by %lld bytes.
> HINT:  You might need to increase max_slot_wal_keep_size.

> LOG:  invalidating *replication* slot \"%s\"
> DETAILS:  (ditto)
> HINTS:  (ditto)

I thought the latter was a little *too* short; the primary message
should at least give you some clue why that happened, even if it
doesn't offer all the detail.  After some thought I changed it to

LOG:  invalidating obsolete replication slot \"%s\"

and pushed it that way.

regards, tom lane




Re: Refactor UnpinBuffer()

2022-09-29 Thread Nathan Bossart
I've marked this one as ready-for-committer.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Question: test "aggregates" failed in 32-bit machine

2022-09-29 Thread Tom Lane
"kuroda.hay...@fujitsu.com"  writes:
> While running `make check LANC=C` with 32-bit virtual machine,
> I found that it was failed at "aggregates".

Hmm, we're not seeing any such failures in the buildfarm's 32-bit
animals, so there must be some additional condition needed to make
it happen.  Can you be more specific?

regards, tom lane




Re: making relfilenodes 56 bits

2022-09-29 Thread Tom Lane
Robert Haas  writes:
> On Thu, Sep 29, 2022 at 10:50 AM Maxim Orlov  wrote:
>> In other words, we have 19 attributes. But tupdesc here is constructed for 
>> 18 elements:
>> tupdesc = CreateTemplateTupleDesc(18);

> I think that's a mistake. Thanks for the report.

The assertions in TupleDescInitEntry would have caught that,
if only utils/misc/pg_controldata.c had more than zero test coverage.
Seems like somebody ought to do something about that.

regards, tom lane




Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-29 Thread Tom Lane
Martin Kalcher  writes:
> New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

I took a closer look at the patch today.  I find this behavior a bit
surprising:

+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+ array_dims  
+-
+ [-1:1][2:3]
+(1 row)

I can buy preserving the lower bound in array_shuffle(), but
array_sample() is not preserving the first-dimension indexes of
the array, so ISTM it ought to reset the first lower bound to 1.

Some other comments:

+Returns n randomly chosen elements from array in selection order.

What's "selection order"?  And this probably shouldn't just rely
on the example to describe what happens with multi-D arrays.
Writing "elements" seems somewhere between confusing and wrong.

* Personally I think I'd pass the TypeCacheEntry pointer to array_shuffle_n,
and let it pull out what it needs.  Less duplicate code that way.

* I find array_shuffle_n drastically undercommented, and what comments
it has are pretty misleading, eg

+   /* Swap all elements in item (i) with elements in item (j). */

j is *not* the index of the second item to be swapped.  You could make
it so, and that might be more readable:

        j = (int) pg_prng_uint64_range(&pg_global_prng_state, i, nitem - 1);
        jelms = elms + j * nelm;
        jnuls = nuls + j * nelm;

But if you want the code to stay as it is, this comment needs work.

* I think good style for SQL-callable C functions is to make the arguments
clear at the top:

+array_sample(PG_FUNCTION_ARGS)
+{
+ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
+int n = PG_GETARG_INT32(1);
+ArrayType  *result;
+... other declarations as needed ...

We can't quite make normal C declaration style work, but that's a poor
excuse for not making the argument list visible as directly as possible.

* I wouldn't bother with the PG_FREE_IF_COPY calls.  Those are generally
only used in btree comparison functions, in which there's a need to not
leak memory.

* I wonder if we really need these to be proparallel = 'r'.  If we let
a worker process execute them, they will take numbers from the worker's
pg_global_prng_state seeding not the parent process's seeding, but why
is that a problem?  In the prior version where you were tying them
to the parent's drandom() sequence, proparallel = 'r' made sense;
but now I think it's unnecessary.

regards, tom lane




Re: problems with making relfilenodes 56-bits

2022-09-29 Thread Robert Haas
On Thu, Sep 29, 2022 at 12:24 PM Matthias van de Meent
 wrote:
> Currently, our minimal WAL record is exactly 24 bytes: length (4B),
> TransactionId (4B), previous record pointer (8B), flags (1B), redo
> manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these
> fields, TransactionID could reasonably be omitted for certain WAL
> records (as example: index insertions don't really need the XID).
> Additionally, the length field could be made to be variable length,
> and any padding is just plain bad (adding 4 bytes to all
> insert/update/delete/lock records was frowned upon).

Right. I was shocked when I realized that we had two bytes of padding
in there, considering that numerous rmgrs are stealing bits from the
1-byte field that identifies the record type. My question was: why
aren't we exposing those 2 bytes for rmgr-type-specific use? Or for
something like xl_xact_commit, we could get rid of xl_xact_info if we
had those 2 bytes to work with.

Right now, I see that a bare commit record is 34 bytes which rounds
out to 40. With the trick above, we could shave off 4 bytes bringing
the size to 30 which would round to 32. That's a pretty significant
savings, although it'd be a lot better if we could get some kind of
savings for DML records which could be much higher frequency.

> I'm working on a prototype patch for a more bare-bones WAL record
> header of which the only required fields would be prevptr (8B), CRC
> (4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't
> yet know the performance of this, but the considering that there will
> be a lot more conditionals in header decoding it might be slower for
> any one backend, but faster overall (less overall IOps)
>
> The flags field would be indications for additional information: [flag
> name (bits): explanation (additional xlog header data in bytes)]
> - len_size(0..1): xlog record size is at most xlrec_header_only (0B),
> uint8_max(1B), uint16_max(2B), uint32_max(4B)
> - has_xid (2): contains transaction ID of logging transaction (4B, or
> probably 8B when we introduce 64-bit xids)
> - has_cid (3): contains the command ID of the logging statement (4B)
> (rationale for logging CID in [0], now in record header because XID is
> included there as well, and both are required for consistent
> snapshots.
> - has_rminfo (4): has non-zero redo-manager flags field (1B)
> (rationale for separate field [1], non-zero allows 1B space
> optimization for one of each RMGR's operations)
> - special_rel (5): pre-existing definition
> - check_consistency (6): pre-existing definition
> - unset (7): no meaning defined yet. Could be used for full record
> compression, or other purposes.

Interesting. One fly in the ointment here is that WAL records start on
8-byte boundaries (probably MAXALIGN boundaries, but I didn't check
the details). And after the 24-byte header, there's a 2-byte header
(or 5-byte header) introducing the payload data (see
XLR_BLOCK_ID_DATA_SHORT/LONG). So if the size of the actual payload
data is a multiple of 8, and is short enough that we use the short
data header, we waste 6 bytes. If the data length is a multiple of 4,
we waste 2 bytes. And those are probably really common cases. So the
big improvements probably come from saving 2 bytes or 6 bytes or 10
bytes, and saving say 3 or 5 is probably not much better than 2. Or at
least that's what I'm guessing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-09-29 Thread Bruce Momjian
On Tue, Sep 27, 2022 at 06:52:21PM +0530, Bharath Rupireddy wrote:
> On Tue, Aug 9, 2022 at 2:16 PM Bharath Rupireddy
>  wrote:
> >
> > I've explained the problem with the current HA setup with synchronous
> > replication upthread at [1]. Let me reiterate it here once again.
> >
> > When a query is cancelled (a simple stroke of CTRL+C or
> > pg_cancel_backend() call) while the txn is waiting for ack in
> > SyncRepWaitForLSN(); for the client, the txn is actually committed
> > (locally-committed-but-not-yet-replicated to all of sync standbys)
> > like any other txn, a warning is emitted into server logs but it is of
> > no use for the client (think of client as applications). There can be
> > many such txns waiting for ack in SyncRepWaitForLSN() and query cancel
> > can be issued on all of those sessions. The problem is that the
> > subsequent reads will then be able to read all of those
> > locally-committed-but-not-yet-replicated to all of sync standbys txns
> > data - this is what I call an inconsistency (can we call this a
> > read-after-write inconsistency?) because of lack of proper query
> > cancel handling. And if the sync standbys are down or unable to come
> > up for some reason, until then, the primary will be serving clients
> > with the inconsistent data. BTW, I found a report of this problem here
> > [2].
> >
> > The solution proposed for the above problem is to just 'not honor
> > query cancels at all while waiting for ack in SyncRepWaitForLSN()'.
> >
> > When a proc die is pending, then also, there can be
> > locally-committed-but-not-yet-replicated to all of sync standbys txns.
> > Typically, there are two choices for the clients 1) reuse the primary
> > instance after restart 2) failover to one of sync standbys. For case
> > (1), there might be read-after-write inconsistency as explained above.
> > For case (2), those txns might get lost completely if the failover
> > target sync standby or the new primary didn't receive them and the
> > other sync standbys that have received them are now ahead and need a
> > special treatment (run pg_rewind) for them to be able to connect to
> > new primary.
> >
> > The solution proposed for case (1) of the above problem is to 'process
> > the ProcDiePending immediately and upon restart the first backend can
> > wait until the sync standbys are caught up to ensure no inconsistent
> > reads'.
> > The solution proposed for case (2) of the above problem is to 'either
> > run pg_rewind for the sync standbys that are ahead or use the idea
> > proposed at [3]'.
> >
> > I hope the above explanation helps.
> >
> > [1] 
> > https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
> > [2] 
> > https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
> > [3] 
> > https://www.postgresql.org/message-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw%40mail.gmail.com
> 
> I'm attaching the v2 patch rebased on the latest HEAD. Please note
> that there are still some open points, I'm yet to find time to think
> more about them. Meanwhile, I'm posting the v2 patch for making cfbot
> happy. Any further thoughts on the overall design of the patch are
> most welcome. Thanks.
... 
> In PostgreSQL high availability setup with synchronous replication,
> typically all the transactions first locally get committed, then
> streamed to the synchronous standbys and the backends that generated
> the transaction will wait for acknowledgement from synchronous
> standbys. While waiting for acknowledgement, it may happen that the
> query or the transaction gets canceled or the backend that's waiting
> for acknowledgement is asked to exit. In either of these cases, the
> wait for acknowledgement gets canceled and leaves the transaction in an
> inconsistent state, as it got committed locally but not on the
> standbys. When the GUC synchronous_replication_naptime_before_cancel
> introduced in this patch is set, it makes the backends wait for the
> acknowledgement before canceling the wait, so that the transaction
> can be available on the synchronous standbys as well.

I don't think this patch is going in the right direction, and I think we
need to step back to see why.

First, let's see how synchronous replication works.  Each backend waits
for one or more synchronous replicas to acknowledge the WAL related to
its commit and then it marks the commit as done in PGPROC and then to
the client;  I wrote a blog about it:

https://momjian.us/main/blogs/pgblog/2020.html#June_3_2020

So, what happens when an insufficient number of synchronous replicas
reply?  Sessions hang because the synchronous behavior cannot be
guaranteed.  We then _allow_ query cancel so the user or administrator
can get out of the hung sessions and perhaps modify
synchronous_standby_names.

What the proposed patch effectively does is to prevent the ability to
recover the hung sessions or 

Re: disfavoring unparameterized nested loops

2022-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 7:32 AM Benjamin Coutu  wrote:
> Also, we can expand the multiplier whenever we fall back to using the default 
> cardinality constant as surely all bets are off at that point - we should 
> definitely treat nested loop joins as out of favor in this instance and that 
> could easily be incorporated by simply increasing the conviction-mutliplier.
>
> What are your thoughts on this simple idea - is it perhaps too simple?

Offhand I'd say it's more likely to be too complicated. Without
meaning to sound glib, the first question that occurs to me is "will
we also need a conviction multiplier conviction multiplier?". Anything
like that is going to have unintended consequences that might very
well be much worse than the problem that you set out to solve.

Personally I still like the idea of just avoiding unparameterized
nested loop joins altogether when an "equivalent" hash join plan is
available. I think of it as preferring the hash join plan because it
will have virtually the same performance characteristics when you have
a good cardinality estimate (probably very often), but far better
performance characteristics when you don't. We can perhaps be
approximately 100% sure that something like that will be true in all
cases, no matter the details. That seems like a very different concept
to what you've proposed.

--
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Bruce Momjian
On Thu, Sep 29, 2022 at 04:12:06PM -0700, Peter Geoghegan wrote:
> Offhand I'd say it's more likely to be too complicated. Without
> meaning to sound glib, the first question that occurs to me is "will
> we also need a conviction multiplier conviction multiplier?". Anything
> like that is going to have unintended consequences that might very
> well be much worse than the problem that you set out to solve.
> 
> Personally I still like the idea of just avoiding unparameterized
> nested loop joins altogether when an "equivalent" hash join plan is
> available. I think of it as preferring the hash join plan because it
> will have virtually the same performance characteristics when you have
> a good cardinality estimate (probably very often), but far better
> performance characteristics when you don't. We can perhaps be
> approximately 100% sure that something like that will be true in all
> cases, no matter the details. That seems like a very different concept
> to what you've proposed.

I think the point the original poster was making, and I have made in the
past, is that even if two optimizer costs are the same, one might be
more penalized by misestimation than the other, and we don't have a good
way of figuring that into our plan choices.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: disfavoring unparameterized nested loops

2022-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 4:27 PM Bruce Momjian  wrote:
> I think the point the original poster was making, and I have made in the
> past, is that even if two optimizer costs are the same, one might be
> more penalized by misestimation than the other, and we don't have a good
> way of figuring that into our plan choices.

Right. But that seems fraught with difficulty. I suspect that the
costs that the planner attributes to each plan often aren't very
reliable in any absolute sense, even when everything is working very
well by every available metric. Even a very noisy cost model with
somewhat inaccurate selectivity estimates will often pick the cheapest
plan, or close enough.

Having a cost-based optimizer that determines the cheapest plan quite
reliably is one thing. Taking the same underlying information and
adding the dimension of risk to it and expecting a useful result is
quite another -- that seems far harder.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Tom Lane
Bruce Momjian  writes:
> I think the point the original poster was making, and I have made in the
> past, is that even if two optimizer costs are the same, one might be
> more penalized by misestimation than the other, and we don't have a good
> way of figuring that into our plan choices.

Agreed, but dealing with uncertainty in those numbers is an enormous
task if you want to do it right.  "Doing it right", IMV, would start
out by extending all the selectivity estimation functions to include
error bars; then we could have error bars on rowcount estimates and
then costs; then we could start adding policies about avoiding plans
with too large a possible upper-bound cost.  Trying to add such
policy with no data to go on is not going to work well.
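
As a rough illustration of the shape such error-bar propagation might
take (the interval arithmetic, numbers, and policy threshold here are
all invented, not anything the planner does today):

```python
# Illustrative sketch only: selectivity estimates carry (lo, hi) error
# bars, which propagate to rowcount and then cost bounds, on which a
# policy could be applied.  All numbers are made up.
def interval_mul(a, b):
    """Multiply two (lo, hi) intervals of nonnegative estimates."""
    return (a[0] * b[0], a[1] * b[1])

table_rows = (1_000_000, 1_000_000)     # known exactly
selectivity = (0.0001, 0.01)            # estimate with error bars

rowcount = interval_mul(table_rows, selectivity)   # roughly (100, 10_000)

cost_per_row = (5.0, 5.0)
cost = interval_mul(rowcount, cost_per_row)        # roughly (500, 50_000)

# A policy could then reject plans whose upper-bound cost exceeds some
# budget, even when the midpoint or lower bound looks cheap.
def too_risky(cost_interval, budget):
    return cost_interval[1] > budget
```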

I think Peter's point is that a quick-n-dirty patch is likely to make
as many cases worse as it makes better.  That's certainly my opinion
about the topic.

regards, tom lane




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Bruce Momjian
On Thu, Sep 29, 2022 at 07:46:18PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > I think the point the original poster was making, and I have made in the
> > past, is that even if two optimizer costs are the same, one might be
> > more penalized by misestimation than the other, and we don't have a good
> > way of figuring that into our plan choices.
> 
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.
> 
> I think Peter's point is that a quick-n-dirty patch is likely to make
> as many cases worse as it makes better.  That's certainly my opinion
> about the topic.

Agreed on all points --- I was thinking error bars too.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson






Re: [PATCH] Add peer authentication TAP test

2022-09-29 Thread Michael Paquier
On Wed, Sep 28, 2022 at 04:24:44PM +0900, Michael Paquier wrote:
> Hmm, indeed.  It would be more reliable to rely on the contents
> returned by getpeereid()/getpwuid() after one successful peer
> connection, then use it in the map.  I was wondering whether using
> stuff like getpwuid() in the perl script itself would be better, but
> it sounds less of a headache in terms of portability to just rely on
> authn_id via SYSTEM_USER to generate the contents of the correct map.

By the way, on an extra read I have found a few things that can be
simplified:
- I think that test_role() should be reworked so that the expected log
patterns are passed down to connect_ok() and connect_fails() rather
than involving find_in_log().  You still need find_in_log() to skip
properly the case where peer is not supported by the platform, of
course.
- get_log_size() is not necessary.  You should be able to get the same
information with "-s $self->logfile".
- Nit: a newline should be added at the end of 003_peer.pl.
--
Michael




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 4:46 PM Tom Lane  wrote:
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.

In general I suspect that we'd be better off focussing on mitigating
the impact at execution time. There are at least a few things that we
could do there, at least in theory. Mostly very ambitious, long term
things.

I like the idea of just avoiding unparameterized nested loop joins
altogether when an "equivalent" hash join plan is available because
it's akin to an execution-time mitigation, despite the fact that it
happens during planning. While it doesn't actually change anything in
the executor, it is built on the observation that we have virtually
everything to gain and nothing to lose during execution, no matter
what happens.

It seems like a very small oasis of certainty in a desert of
uncertainty -- which seems nice, as far as it goes.

> I think Peter's point is that a quick-n-dirty patch is likely to make
> as many cases worse as it makes better.  That's certainly my opinion
> about the topic.

Right. Though I am actually sympathetic to the idea that users might
gladly pay a cost for performance stability -- even a fairly large
cost. That part doesn't seem like the problem.

-- 
Peter Geoghegan




Small miscellaneous fixes

2022-09-29 Thread Ranier Vilela
Hi.

There are assorted fixes to the head branch.

1. Avoid uselessly reassigning var _logsegno
(src/backend/access/transam/xlog.c)
Commit 7d70809 left a little oversight: XLByteToPrevSeg and XLByteToSeg
are macros, and both assign _logsegno, so the first assignment is lost
and is useless.

2. Avoid retesting log_min_duration (src/backend/commands/analyze.c)
log_min_duration has already been tested just before, so the second test
can be safely removed.

3. Avoid useless var declaration record (src/backend/utils/misc/guc.c)
The var record is never really used.

4. Fix declaration of volatile signal var (src/bin/pgbench/pgbench.c)
This is a case similar to the one addressed by commit 5ac9e86.

regards,
Ranier Vilela


avoid_useless_reassign_lgosegno.patch
Description: Binary data


avoid_useless_retesting_log_min_duration.patch
Description: Binary data


avoid_useless_var_record.patch
Description: Binary data


fix_declaration_volatile_signal_var.patch
Description: Binary data


Re: making relfilenodes 56 bits

2022-09-29 Thread Michael Paquier
On Thu, Sep 29, 2022 at 02:39:44PM -0400, Tom Lane wrote:
> The assertions in TupleDescInitEntry would have caught that,
> if only utils/misc/pg_controldata.c had more than zero test coverage.
> Seems like somebody ought to do something about that.

While passing by, I have noticed this thread.  We don't really care
about the contents returned by these functions, and one simple trick
to check their execution is SELECT FROM.  Like in the attached, for
example.
--
Michael
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index 9f106c2a10..93cba8e76d 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -594,3 +594,22 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
  Index Cond: (unique1 = g.g)
 (4 rows)
 
+--
+-- Test functions for control data
+--
+SELECT FROM pg_control_checkpoint();
+--
+(1 row)
+
+SELECT FROM pg_control_init();
+--
+(1 row)
+
+SELECT FROM pg_control_recovery();
+--
+(1 row)
+
+SELECT FROM pg_control_system();
+--
+(1 row)
+
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 639e9b352c..207d5a5292 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -223,3 +223,11 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,1000) g ON a.unique1 = g;
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
+
+--
+-- Test functions for control data
+--
+SELECT FROM pg_control_checkpoint();
+SELECT FROM pg_control_init();
+SELECT FROM pg_control_recovery();
+SELECT FROM pg_control_system();




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-29 Thread Andres Freund
Hi,

On 2022-09-28 16:07:13 -0400, Tom Lane wrote:
> I agree that it'd be good if CI did some 32-bit testing so it could have
> caught (5) and (6), but that's being worked on.

I wasn't aware of anybody doing so, thus here's a patch for that.

I already added the necessary packages to the image. I didn't install llvm for
32bit because a) that'd have bloated the image unduly, and b) the 32- and
64-bit versions can't currently be installed in parallel afaics.

Attached is the patch adding it to CI. To avoid paying the task startup
overhead twice, it seemed a tad better to build and test 32bit as part of an
existing task. We could instead give each job fewer CPUs and run them
concurrently.

It might be worth changing one of the builds to use -Dwal_blocksize=4 and a
few other flags, to increase our coverage.

Greetings,

Andres Freund
>From 384322a92c845378728057fb38b143ca58185831 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 29 Sep 2022 16:09:09 -0700
Subject: [PATCH v1 1/2] ci: Add 32bit build and test

Discussion: https://postgr.es/m/4033181.1664395...@sss.pgh.pa.us
---
 .cirrus.yml | 35 +++
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index 7b5cb021027..7a1887ab444 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -40,9 +40,9 @@ on_failure_ac: &on_failure_ac
 on_failure_meson: &on_failure_meson
   testrun_artifacts:
 paths:
-  - "build/testrun/**/*.log"
-  - "build/testrun/**/*.diffs"
-  - "build/testrun/**/regress_log_*"
+  - "build*/testrun/**/*.log"
+  - "build*/testrun/**/*.diffs"
+  - "build*/testrun/**/regress_log_*"
 type: text/plain
 
   # In theory it'd be nice to upload the junit files meson generates, so that
@@ -51,7 +51,7 @@ on_failure_meson: &on_failure_meson
   # don't end up useful. We could probably improve on that with a some custom
   # conversion script, but ...
   meson_log_artifacts:
-path: "build/meson-logs/*.txt"
+path: "build*/meson-logs/*.txt"
 type: text/plain
 
 
@@ -229,6 +229,9 @@ task:
 
 - name: Linux - Debian Bullseye - Meson
 
+  env:
+CCACHE_MAXSIZE: "400M" # tests two different builds
+
   configure_script: |
 su postgres <<-EOF
   meson setup \
@@ -239,7 +242,23 @@ task:
 build
 EOF
 
+  configure_32_script: |
+su postgres <<-EOF
+  export CC='ccache gcc -m32'
+  meson setup \
+--buildtype=debug \
+-Dcassert=true \
+${LINUX_MESON_FEATURES} \
+-Dllvm=disabled \
+--pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
+-DPERL=perl5.32-i386-linux-gnu \
+-DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
+build-32
+EOF
+
   build_script: su postgres -c 'ninja -C build -j${BUILD_JOBS}'
+  build_32_script: su postgres -c 'ninja -C build-32 -j${BUILD_JOBS}'
+
   upload_caches: ccache
 
   test_world_script: |
@@ -247,6 +266,14 @@ task:
   ulimit -c unlimited
   meson test $MTEST_ARGS --num-processes ${TEST_JOBS}
 EOF
+# so that we don't upload 64bit logs if 32bit fails
+rm -rf build/
+
+  test_world_32_script: |
+su postgres <<-EOF
+  ulimit -c unlimited
+  meson test $MTEST_ARGS -C build-32 --num-processes ${TEST_JOBS}
+EOF
 
   on_failure:
 <<: *on_failure_meson
-- 
2.37.3.542.gdd3f6c4cae



Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-29 Thread Andres Freund
Hi,

On 2022-09-29 17:31:35 -0700, Andres Freund wrote:
> I already added the necessary packages to the image. I didn't install llvm for
> 32bit because that'd have a) bloated the image unduly b) they can't currently
> be installed in parallel afaics.

> Attached is the patch adding it to CI. To avoid paying the task startup
> overhead twice, it seemed a tad better to build and test 32bit as part of an
> existing task. We could instead give each job fewer CPUs and run them
> concurrently.

Ah, one thing I forgot to mention: The 32bit perl currently can't have a
packaged IO::Pty installed. We could install it via cpan, but it doesn't seem
worth the bother. Hence one skipped test in the 32bit build.

Example run:
https://cirrus-ci.com/task/4632556472631296?logs=test_world_32#L249 (scroll to
the bottom)


Wonder if we should vary some build options like ICU or the system collation?
Tom, wasn't there something recently that made you complain about not having
coverage around collations due to system settings?

Greetings,

Andres Freund




Re: pgbench: using prepared BEGIN statement in a pipeline could cause an error

2022-09-29 Thread Yugo NAGATA
Hi,

On Mon, 12 Sep 2022 17:03:43 +0200
Alvaro Herrera  wrote:

> On 2022-Mar-28, Yugo NAGATA wrote:
> 
> > On Fri, 25 Mar 2022 16:19:54 -0400
> > Tom Lane  wrote:
> 
> > > I am not convinced this is a great idea.  The current behavior is that
> > > a statement is not prepared until it's about to be executed, and I think
> > > we chose that deliberately to avoid semantic differences between prepared
> > > and not-prepared mode.  For example, if a script looks like
> > > 
> > > CREATE FUNCTION foo(...) ...;
> > > SELECT foo(...);
> > > DROP FUNCTION foo;
> > > 
> > > trying to prepare the SELECT in advance would lead to failure.
> > >
> > > We could perhaps get away with preparing the commands within a pipeline
> > > just before we start to execute the pipeline, but it looks to me like
> > > this patch tries to prepare the entire script in advance.
> 
> Maybe it would work to have one extra boolean in struct Command, indicating
> that the i-th command in the script is inside a pipeline; in -M
> prepared, issue PREPARE for each command marked with that flag ahead of
> time, and for all other commands, do as today.  That way, we don't
> change behavior for anything except those commands that need the change.

Well, I still don't understand why we need to prepare only "the
commands within a pipeline" before starting the pipeline.  In the
current behavior, the entire script is prepared in advance just before
executing the first SQL command in the script, instead of preparing each
command one by one. This patch also prepares the entire script in
advance, so there is no behavioural change in this sense.

However, there are a few behavioural changes. One is that the preparation
is not counted in the command performance statistics as Fabien mentioned.
Another is that all meta-commands including \shell and \sleep etc. are
executed before the preparation.

To reduce impact of these changes, I updated the patch to prepare the
commands just before executing the first SQL command or \startpipeline
meta-command instead of at the beginning of the script. 

Regards,
Yugo Nagata

> 
> -- 
> Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
> "Digital and video cameras have this adjustment and film cameras don't for the
> same reason dogs and cats lick themselves: because they can."   (Ken Rockwell)


-- 
Yugo NAGATA 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index aa1a3541fe..12a268fe84 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3070,6 +3070,30 @@ chooseScript(TState *thread)
 	return i - 1;
 }
 
+/* Prepare SQL commands in the chosen script */
+static void
+prepareCommands(CState *st)
+{
+	int			j;
+	Command   **commands = sql_script[st->use_file].commands;
+
+	for (j = 0; commands[j] != NULL; j++)
+	{
+		PGresult   *res;
+		char		name[MAX_PREPARE_NAME];
+
+		if (commands[j]->type != SQL_COMMAND)
+			continue;
+		preparedStatementName(name, st->use_file, j);
+		res = PQprepare(st->con, name,
+		commands[j]->argv[0], commands[j]->argc - 1, NULL);
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
+			pg_log_error("%s", PQerrorMessage(st->con));
+		PQclear(res);
+	}
+	st->prepared[st->use_file] = true;
+}
+
 /* Send a SQL command, using the chosen querymode */
 static bool
 sendCommand(CState *st, Command *command)
@@ -3104,39 +3128,7 @@ sendCommand(CState *st, Command *command)
 		const char *params[MAX_ARGS];
 
 		if (!st->prepared[st->use_file])
-		{
-			int			j;
-			Command   **commands = sql_script[st->use_file].commands;
-
-			for (j = 0; commands[j] != NULL; j++)
-			{
-PGresult   *res;
-
-if (commands[j]->type != SQL_COMMAND)
-	continue;
-preparedStatementName(name, st->use_file, j);
-if (PQpipelineStatus(st->con) == PQ_PIPELINE_OFF)
-{
-	res = PQprepare(st->con, name,
-	commands[j]->argv[0], commands[j]->argc - 1, NULL);
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pg_log_error("%s", PQerrorMessage(st->con));
-	PQclear(res);
-}
-else
-{
-	/*
-	 * In pipeline mode, we use asynchronous functions. If a
-	 * server-side error occurs, it will be processed later
-	 * among the other results.
-	 */
-	if (!PQsendPrepare(st->con, name,
-	   commands[j]->argv[0], commands[j]->argc - 1, NULL))
-		pg_log_error("%s", PQerrorMessage(st->con));
-}
-			}
-			st->prepared[st->use_file] = true;
-		}
+			prepareCommands(st);
 
 		getQueryParams(&st->variables, command, params);
 		preparedStatementName(name, st->use_file, st->command);
@@ -4376,6 +4368,19 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
 			commandFailed(st, "startpipeline", "cannot use pipeline mode with the simple query protocol");
 			return CSTATE_ABORTED;
 		}
+		/*
+		 * Prepare SQL commands if needed.
+		 *
+		 * We should send Parse messages before executing /startpipeline.
+		 * If a Parse message is sent in pipeline mode, a transaction starts
+		 * before BEGIN is sent

Re: SI-read predicate locks on materialized views

2022-09-29 Thread Yugo NAGATA
On Fri, 9 Sep 2022 16:27:45 +0530
Dilip Kumar  wrote:

> On Tue, Jul 26, 2022 at 3:31 PM Richard Guo  wrote:
> >
> >
> > On Tue, Jul 26, 2022 at 3:44 PM Yugo NAGATA  wrote:
> >>
> >> If such two transactions run concurrently, a write skew anomaly occurs,
> >> and the result of order_summary refreshed in T1 will not contain the
> >> record inserted in T2.
> 
> Yes, we do have a write skew anomaly.  I think the patch looks fine to me.

Thank you for the comment. Do you think it can be marked as Ready for Committer?

Regards,
Yugo Nagata

> 
> -- 
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com


-- 
Yugo NAGATA 




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-29 Thread Thomas Munro
On Fri, Sep 30, 2022 at 6:27 AM Nathan Bossart  wrote:
> On Thu, Sep 29, 2022 at 11:32:32AM +0530, Bharath Rupireddy wrote:
> > On Sat, Sep 24, 2022 at 1:54 AM Nathan Bossart  
> > wrote:
> >>
> >> +PGAlignedXLogBlock zbuffer;
> >> +
> >> +memset(zbuffer.data, 0, XLOG_BLCKSZ);
> >>
> >> This seems excessive for only writing a single byte.
> >
> > Yes, I removed it now, instead doing pg_pwrite(fd, "\0", 1,
> > wal_segment_size - 1).
>
> I don't think removing the use of PGAlignedXLogBlock here introduces any
> sort of alignment risk, so this should be alright.
>
> +#ifdef WIN32
> +/*
> + * WriteFile() on Windows changes the current file position, hence we
> + * need an explicit lseek() here. See pg_pwrite() implementation in
> + * win32pwrite.c for more details.
> + */
>
> Should we really surround this with a WIN32 check, or should we just
> unconditionally lseek() here?  I understand that this helps avoid an extra
> system call on many platforms, but in theory another platform introduced in
> the future could have the same problem, and this seems like something that
> could easily be missed.  Presumably we could do something fancier to
> indicate pg_pwrite()'s behavior in this regard, but I don't know if that
> sort of complexity is really worth it in order to save an lseek().

+1 for just doing it always, with a one-liner comment like
"pg_pwritev*() might move the file position".  No reason to spam the
source tree with more explanations of the exact reason.  If someone
ever comes up with another case where p- and non-p- I/O functions are
intermixed and it's really worth saving a system call (don't get me
wrong, we call lseek() an obscene amount elsewhere and I'd like to fix
that, but this case isn't hot?) then I like your idea of a macro to
tell you whether you need to.

Earlier I wondered why we'd want to include "pg_pwritev" in the name
of this zero-filling function (pwritev being an internal
implementation detail), but now it seems like maybe a good idea
because it highlights the file position portability problem by being a
member of that family of similarly-named functions.
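
As an aside, the POSIX side of this contrast is easy to demonstrate
(this sketch is illustrative only and POSIX-specific; the Windows
WriteFile() path discussed above is the one where the position does
move):

```python
# POSIX demonstration that pwrite() leaves the file position alone
# while write() advances it -- unlike Windows WriteFile(), which is
# why mixing the two I/O families needs an explicit lseek() there.
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"abcd")                # advances the position to 4
    pos_after_write = os.lseek(fd, 0, os.SEEK_CUR)

    os.pwrite(fd, b"ZZ", 0)              # writes at offset 0...
    pos_after_pwrite = os.lseek(fd, 0, os.SEEK_CUR)  # ...position unchanged

    data = os.pread(fd, 4, 0)            # b"ZZcd"; position still unchanged
finally:
    os.close(fd)
    os.unlink(path)
```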




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2022 at 5:40 PM Andres Freund  wrote:
> Onder if we should vary some build options like ICU or the system collation?
> Tom, wasn't there something recently that made you complain about not having
> coverage around collations due to system settings?

That was related to TRUST_STRXFRM:

https://postgr.es/m/cah2-wzmqrjqv9pgyzebgnqmcac1ct+uxg3vqu7ksvundf_y...@mail.gmail.com

--
Peter Geoghegan




Re: ubsan

2022-09-29 Thread Andres Freund
Hi,

On 2022-03-23 13:54:50 -0400, Tom Lane wrote:
> 0002: ugh, but my only real complaint is that __ubsan_default_options
> needs more than zero comment.  Also, it's not "our" getenv is it?
> 
> 0004: no opinion

Attached is a rebased version of this patch. Hopefully with a reasonable
amount of comments?  I kind of wanted to add a comment to reached_main, but it
just seems to end up restating the variable name...

Greetings,

Andres Freund
>From 641f77e53c86ad7eaa15138984ce39471689b1d9 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 23 Mar 2022 09:57:19 -0700
Subject: [PATCH v2 1/3] Add a hacky workaround to make ubsan and ps_status.c
 compatible

At least on linux set_ps_display() breaks /proc/$pid/environ. The sanitizer
library uses /proc/$pid/environ to implement getenv() because it wants to work
independent of libc. When just using undefined and alignment sanitizers, the
sanitizer library is only initialized when the first error occurs, by which
time we've often already called set_ps_display(), preventing the sanitizer
libraries from seeing the options.

We can work around that by defining __ubsan_default_options, a weak symbol
libsanitizer uses to get defaults from the application, and return
getenv("UBSAN_OPTIONS"). But only if main already was reached, so that we
don't end up relying on a not-yet-working getenv().

As it's just a function that won't get called when not running a sanitizer, it
doesn't seem necessary to make compilation of the function conditional.

Reviewed-by: Tom Lane 
Discussion: https://postgr.es/m/20220323173537.ll7klrglnp4gn...@alap3.anarazel.de
---
 src/backend/main/main.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index f5da4260a13..16bcf045ae6 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -43,6 +43,7 @@
 
 
 const char *progname;
+static bool reached_main = false;
 
 
 static void startup_hacks(const char *progname);
@@ -59,6 +60,8 @@ main(int argc, char *argv[])
 {
 	bool		do_check_root = true;
 
+	reached_main = true;
+
 	/*
 	 * If supported on the current platform, set up a handler to be called if
 	 * the backend/postmaster crashes with a fatal signal or exception.
@@ -421,3 +424,27 @@ check_root(const char *progname)
 	}
 #endif			/* WIN32 */
 }
+
+/*
+ * At least on linux set_ps_display() breaks /proc/$pid/environ. The sanitizer
+ * library uses /proc/$pid/environ to implement getenv() because it wants to
+ * work independent of libc. When just using undefined and alignment
+ * sanitizers, the sanitizer library is only initialized when the first error
+ * occurs, by which time we've often already called set_ps_display(),
+ * preventing the sanitizer libraries from seeing the options.
+ *
+ * We can work around that by defining __ubsan_default_options, a weak symbol
+ * libsanitizer uses to get defaults from the application, and return
+ * getenv("UBSAN_OPTIONS"). But only if main already was reached, so that we
+ * don't end up relying on a not-yet-working getenv().
+ */
+const char *__ubsan_default_options(void);
+const char *
+__ubsan_default_options(void)
+{
+	/* don't call libc before it's initialized */
+	if (!reached_main)
+		return "";
+
+	return getenv("UBSAN_OPTIONS");
+}
-- 
2.37.3.542.gdd3f6c4cae

>From e081cd01fba0a860a9cb3c0e4c9525e1c4146e57 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 29 Sep 2022 17:44:45 -0700
Subject: [PATCH v2 2/3] ci: use -fsanitize=undefined,alignment in linux tasks

---
 .cirrus.yml | 17 +
 1 file changed, 17 insertions(+)

diff --git a/.cirrus.yml b/.cirrus.yml
index 7b5cb021027..c329a2befe4 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -169,6 +169,19 @@ task:
 LINUX_CONFIGURE_FEATURES: *LINUX_CONFIGURE_FEATURES
 LINUX_MESON_FEATURES: *LINUX_MESON_FEATURES
 
+# Enable a reasonable set of sanitizers. Use the linux task for that, as
+# it currently is the fastest task. Also several of the sanitizers work
+# best on linux.
+# The overhead of alignment sanitizer is low, undefined behaviour has
# moderate overhead. address sanitizer however is pretty expensive and
+# thus not enabled.
+#
+# disable_coredump=0, abort_on_error=1: for useful backtraces in case of crashes
+# print_stacktraces=1,verbosity=2, duh
+# detect_leaks=0: too many uninteresting leak errors in short-lived binaries
+UBSAN_OPTIONS: print_stacktrace=1:disable_coredump=0:abort_on_error=1:verbosity=2
+ASAN_OPTIONS: print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0
+
   only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*linux.*'
 
   compute_engine_instance:
@@ -230,6 +243,10 @@ task:
 - name: Linux - Debian Bullseye - Meson
 
   configure_script: |
+SANITIZER_FLAGS="-fsanitize=alignment,undefined -fno-sanitize-recover=all -fno-common"
+CFLAGS="$SANITIZER_FLAGS $CF

Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-09-29 Thread Yugo NAGATA
Hi,

On Tue, 9 Aug 2022 00:21:02 +0900
Yugo NAGATA  wrote:

> On Wed, 27 Jul 2022 22:50:55 -0400
> Tom Lane  wrote:
> 
> > Yugo NAGATA  writes:
> > > I've looked at the commited fix. What I wonder is whether a change in
> > > IsInTransactionBlock() is necessary or not.
> > 
> > I've not examined ANALYZE's dependencies on this closely, but it doesn't
> > matter really, because I'm not willing to assume that ANALYZE is the
> > only caller.  There could be external modules with stronger assumptions
> > that IsInTransactionBlock() yielding false provides guarantees equivalent
> > to PreventInTransactionBlock().  It did before this patch, so I think
> > it needs to still do so after.
> 
> Thank you for your explanation. I understood that IsInTransactionBlock()
> and PreventInTransactionBlock() share the equivalent assumption.
> 
> As to ANALYZE, after investigating the code more, I found that setting 
> XACT_FLAGS_NEEDIMMEDIATECOMMIT in IsInTransactionBlock() is needed indeed.
> That is, some flags in pg_class such as relhasindex can be safely updated
> only if ANALYZE is not in a transaction block and never rolled back.  So,
> in a pipeline, ANALYZE must be immediately committed.
> 
> However, I think we need more comments on these functions to clarify what
> users can expect or not for them.  It is ensured that the statement that
> calls PreventInTransactionBlock()  or receives false from
> IsInTransactionBlock() never be rolled back if it finishes successfully.
> This can eliminate the harmful influence of non-rollback-able side effects.
> 
> On the other hand, it cannot ensure that the statement calling these
> functions is the first or only one in the transaction in pipelining. If
> there are preceding statements in a pipeline, they are committed in the
> same transaction of the current statement.
> 
> The attached patch tries to add comments explaining it on the functions.

I forward it to the hackers list because the patch is to fix comments.
Also, I'll register it to commitfest.

The past discussion is here.
https://www.postgresql.org/message-id/flat/17434-d9f7a064ce2a88a3%40postgresql.org

> 
> > > In fact, the result of IsInTransactionBlock does not make senses at
> > > all in pipe-line mode regardless to the fix. ANALYZE could commit all
> > > previous commands in pipelining, and this may not be user expected
> > > behaviour.
> > 
> > This seems pretty much isomorphic to the fact that CREATE DATABASE
> > will commit preceding steps in the pipeline.  
> 
> I am not sure if we can treat the CREATE DATABASE case and the ANALYZE
> case similarly. First, CREATE DATABASE is one of the commands that cannot be
> executed inside a transaction block, but ANALYZE can be. So, users would
> not be able to know ANALYZE in a pipeline causes a commit from the
> documentation. Second, ANALYZE issues a commit internally in an early
> stage not only after it finished successfully. For example, even if
> ANALYZE is failing because a not-existing column name is specified, it
> issues a commit before checking the column name. This makes more hard
> to know which statements will be committed and which statements not
> committed in a pipeline. Also, as you know, there are other commands
> that issue internal commits.
> 
> > That's not great,
> > I admit; we'd not have designed it like that if we'd had complete
> > understanding of the behavior at the beginning.  But it's acted
> > like that for a couple of decades now, so changing it seems far
> > more likely to make people unhappy than happy.  The same for
> > ANALYZE in a pipeline.
> 
> > > If the first command in a pipeline is a DDL command such as CREATE
> > > DATABASE, this is allowed and immediately committed after success, the
> > > same as the current behavior. Executing such commands in the middle of a
> > > pipeline is not allowed because the pipeline is regarded as "an implicit
> > > transaction block" at that time. Similarly, ANALYZE in the middle of a
> > > pipeline cannot close and open a transaction.
> > 
> > I'm not going there.  If you can persuade some other committer that
> > this is worth breaking backward compatibility for, fine; the user
> > complaints will be their problem.
> 
> I have no idea how to reduce the complexity explained above and
> clarify the transactional behavior of pipelining to users, other than the
> fix I proposed in the previous post. However, I also agree that such a
> change may make some people unhappy. If there is no good way and we
> would not like to change the behavior, I think it is better to at least
> mention in the documentation the effects of commands that issue internal
> commits.
> 
> Regards,
> Yugo Nagata
> 
> -- 
> Yugo NAGATA 


-- 
Yugo NAGATA 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 50f092d7eb..ee9f0fafdd 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -3455,6 +3455,10 @@ AbortCurrentTransaction(void)
  *
  *	We must also set XACT

Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-29 Thread Andres Freund
Hi,

On 2022-09-29 18:16:51 -0700, Peter Geoghegan wrote:
> On Thu, Sep 29, 2022 at 5:40 PM Andres Freund  wrote:
> > Wonder if we should vary some build options like ICU or the system collation?
> > Tom, wasn't there something recently that made you complain about not having
> > coverage around collations due to system settings?
> 
> That was related to TRUST_STRXFRM:
> 
> https://postgr.es/m/cah2-wzmqrjqv9pgyzebgnqmcac1ct+uxg3vqu7ksvundf_y...@mail.gmail.com

It wasn't even that one, although I do recall it now that I reread the thread.
But it did successfully jog my memory:
https://www.postgresql.org/message-id/69170.1663425842%40sss.pgh.pa.us

So possibly it could be worth running one of them with LANG=C?

Greetings,

Andres Freund




Re: making relfilenodes 56 bits

2022-09-29 Thread Tom Lane
Michael Paquier  writes:
> While passing by, I have noticed this thread.  We don't really care
> about the contents returned by these functions, and one simple trick
> to check their execution is SELECT FROM.  Like in the attached, for
> example.

Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
any actual checks on the sanity of the output?  I realize that the
output's far from static, but still ...

regards, tom lane




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-29 Thread Tom Lane
Andres Freund  writes:
> Tom, wasn't there something recently that made you complain about not having
> coverage around collations due to system settings?

We found there was a gap for ICU plus LANG=C, IIRC.

regards, tom lane




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-29 Thread Andres Freund
Hi,

On 2022-09-29 21:24:44 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Tom, wasn't there something recently that made you complain about not having
> > coverage around collations due to system settings?
> 
> We found there was a gap for ICU plus LANG=C, IIRC.

Using that then.


Any opinions about whether to do this only in head or backpatch to 15?

Greetings,

Andres Freund




Re: Introduce wait_for_subscription_sync for TAP tests

2022-09-29 Thread Thomas Munro
On Mon, Sep 12, 2022 at 10:42 PM Thomas Munro  wrote:
> On Sat, Sep 10, 2022 at 10:00 AM Thomas Munro  wrote:
> > On Sat, Sep 10, 2022 at 9:45 AM Tom Lane  wrote:
> > > Masahiko Sawada  writes:
> > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef
> > > > is relevant.
> > >
> > > Noting that the errors have only appeared in the past couple of
> > > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f
> > > (Fix recovery_prefetch with low maintenance_io_concurrency).
> >
> > Yeah, I also just spotted the coincidence of those failures while
> > monitoring the build farm.  I'll look into this later today.  My
> > initial suspicion is that there was pre-existing code here that was
> > (incorrectly?) relying on the lack of error reporting in that case.
> > But maybe I misunderstood and it was incorrect to report the error for
> > some reason that was not robustly covered with tests.
>
> After I wrote that I saw Sawada-san's message and waited for more
> information, and I see there was now a commit.  I noticed that
> peripatus was already logging the 'missing contrecord' error even when
> it didn't fail the test, and still does.  I'm still looking into that
> (ie whether I need to take that new report_invalid_record() call out
> and replace it with errormsg_deferred = true so that XLogReadRecord()
> returns NULL with no error message in this case).

I will go ahead and remove this new error message added by adb46615.
The message was correlated with the problem on peripatus fixed by
88f48831, but not the cause of it -- but it's also not terribly
helpful and might be confusing.  It might be reported: (1) in
pg_waldump when you hit the end of the segment with a missing
contrecord after the end, arguably rightfully, but then perhaps
someone might complain that they expect an error from pg_waldump only
on the final live segment at the end of the WAL, and (2) in a
walsender that is asked to shut down while between reads of pages with
a spanning contrecord, reported by logical_read_xlog_page() with a
messageless error (presumably peripatus's case?), (3) in crash
recovery with wal_recycle off (whereas normally you'd expect to see
the page_addr message from a recycled file), maybe more legitimately
than the above.  The problem I needed to solve can be solved without
the message just by setting that flag as mentioned above, so I'll do
that to remove the new noise.




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-29 Thread Tom Lane
Andres Freund  writes:
> Any opinions about whether to do this only in head or backpatch to 15?

HEAD should be sufficient, IMO.

regards, tom lane




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Bruce Momjian
On Thu, Sep 29, 2022 at 07:51:47PM -0400, Bruce Momjian wrote:
> On Thu, Sep 29, 2022 at 07:46:18PM -0400, Tom Lane wrote:
> > Agreed, but dealing with uncertainty in those numbers is an enormous
> > task if you want to do it right.  "Doing it right", IMV, would start
> > out by extending all the selectivity estimation functions to include
> > error bars; then we could have error bars on rowcount estimates and
> > then costs; then we could start adding policies about avoiding plans
> > with too large a possible upper-bound cost.  Trying to add such
> > policy with no data to go on is not going to work well.
> > 
> > I think Peter's point is that a quick-n-dirty patch is likely to make
> > as many cases worse as it makes better.  That's certainly my opinion
> > about the topic.
> 
> Agreed on all points --- I was thinking error bars too.

Actually, if we wanted to improve things in this area, we should have a
set of queries that don't choose optimal plans that we can test with.  We
used to see them a lot before we had extended statistics, but I don't
remember seeing many recently, let alone a collection of them.  I guess
that is good.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: more descriptive message for process termination due to max_slot_wal_keep_size

2022-09-29 Thread Kyotaro Horiguchi
At Thu, 29 Sep 2022 13:31:00 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > At Wed, 28 Sep 2022 16:30:37 -0400, Tom Lane  wrote in 
> > LOG:  invalidating *replication* slot \"%s\"
> > DETAILS:  (ditto)
> > HINTS:  (ditto)
> 
> I thought the latter was a little *too* short; the primary message
> should at least give you some clue why that happened, even if it
> doesn't offer all the detail.  After some thought I changed it to

Yeah, agreed. It looks better. (I was about to spell it as
"invalidating slot \"%s\"", then changed my mind and added "replication". I
felt that it was a bit too short, but didn't think about further
stretching it by adding "obsolete"..)

> LOG:  invalidating obsolete replication slot \"%s\"
> 
> and pushed it that way.

Thanks. And thanks for fixing the test script, too.

By the way, I didn't notice at that time (and forgot about the
policy), but the HINT message has variations differing only by the
variable name.

What do you think about the attached?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 3bbd522724..2106a8d41d 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -363,7 +363,8 @@ retry:
 		ereport(WARNING,
 (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
  errmsg("out of logical replication worker slots"),
- errhint("You might need to increase max_logical_replication_workers.")));
+ errhint("You might need to increase %s.",
+		 "max_logical_replication_workers")));
 		return;
 	}
 
@@ -420,7 +421,8 @@ retry:
 		ereport(WARNING,
 (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
  errmsg("out of background worker slots"),
- errhint("You might need to increase max_worker_processes.")));
+ errhint("You might need to increase %s.",
+		 "max_worker_processes")));
 		return;
 	}
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e9328961dd..5ea9b73980 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1298,7 +1298,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 		errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
   LSN_FORMAT_ARGS(restart_lsn),
   (unsigned long long) (oldestLSN - restart_lsn)),
-		errhint("You might need to increase max_slot_wal_keep_size."));
+		errhint("You might need to increase %s.",
+"max_slot_wal_keep_size"));
 
 (void) kill(active_pid, SIGTERM);
 last_signaled_pid = active_pid;
@@ -1340,7 +1341,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 	errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
 			  LSN_FORMAT_ARGS(restart_lsn),
 			  (unsigned long long) (oldestLSN - restart_lsn)),
-	errhint("You might need to increase max_slot_wal_keep_size."));
+	errhint("You might need to increase %s.",
+			"max_slot_wal_keep_size"));
 
 			/* done with this slot for now */
 			break;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5f5803f681..5a29f4d346 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -980,7 +980,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
 ereport(ERROR,
 		(errcode(ERRCODE_OUT_OF_MEMORY),
 		 errmsg("out of shared memory"),
-		 errhint("You might need to increase max_locks_per_transaction.")));
+		 errhint("You might need to increase %s.",
+ "max_locks_per_transaction")));
 			else
 return LOCKACQUIRE_NOT_AVAIL;
 		}
@@ -1018,7 +1019,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
 			ereport(ERROR,
 	(errcode(ERRCODE_OUT_OF_MEMORY),
 	 errmsg("out of shared memory"),
-	 errhint("You might need to increase max_locks_per_transaction.")));
+	 errhint("You might need to increase %s.",
+			 "max_locks_per_transaction")));
 		else
 			return LOCKACQUIRE_NOT_AVAIL;
 	}
@@ -2843,7 +2845,8 @@ FastPathGetRelationLockEntry(LOCALLOCK *locallock)
 			ereport(ERROR,
 	(errcode(ERRCODE_OUT_OF_MEMORY),
 	 errmsg("out of shared memory"),
-	 errhint("You might need to increase max_locks_per_transaction.")));
+	 errhint("You might need to increase %s.",
+			 "max_locks_per_transaction")));
 		}
 		GrantLock(proclock->tag.myLock, proclock, lockmode);
 		FAST_PATH_CLEAR_LOCKMODE(MyProc, f, lockmode);
@@ -4257,7 +4260,8 @@ lock_twophase_recover(TransactionId xid, uint16 info,
 		ereport(ERROR,
 (errcode(ERRCODE_OUT_OF_MEMORY),
  errmsg("out of shared memory"),
- errhint("You might need to increase max_locks_per_transaction.")));
+ errhint("You might need to increase %s.",
+		 "max_locks_per_transaction")));
 	}
 
 	/*
@@ -4322,7 +4326,8 @@ lock_twophase_recover(TransactionId xid, uint16 info,
 		ereport(ERROR,
 (errcode(ERRCODE_OUT_

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-29 Thread Bharath Rupireddy
On Fri, Sep 30, 2022 at 6:46 AM Thomas Munro  wrote:
>
> +1 for just doing it always, with a one-liner comment like
> "pg_pwritev*() might move the file position".  No reason to spam the
> source tree with more explanations of the exact reason.

+1 for resetting the file position in a platform-independent manner.
But a description there won't hurt, IMO, and it saves time for hackers
who read that code and wonder why it's that way.

> If someone
> ever comes up with another case where p- and non-p- I/O functions are
> intermixed and it's really worth saving a system call (don't get me
> wrong, we call lseek() an obscene amount elsewhere and I'd like to fix
> that, but this case isn't hot?) then I like your idea of a macro to
> tell you whether you need to.

I don't think we need to go that route, as the code isn't a hot path and
an extra system call wouldn't hurt performance much; a comment there
should work.

> Earlier I wondered why we'd want to include "pg_pwritev" in the name
> of this zero-filling function (pwritev being an internal
> implementation detail), but now it seems like maybe a good idea
> because it highlights the file position portability problem by being a
> member of that family of similarly-named functions.

Hm.

On Thu, Sep 29, 2022 at 10:57 PM Nathan Bossart
 wrote:
>
> +iov[0].iov_base = zbuffer.data;
>
> This seems superfluous, but I don't think it's hurting anything.

Yes, I removed it. Adding a comment, something like [1], would make it
more verbose, hence I've not added one.

I'm attaching the v6 patch set, please review it further.

[1]
/*
 * Use the first vector buffer to write the remaining size. Note that
 * the zero buffer was already assigned to it above, so just specifying
 * the size is enough here.
 */

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 7ae3f77c0444e6c5383e13a502457813b850fd08 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 30 Sep 2022 01:26:03 +
Subject: [PATCH v6] Move pg_pwritev_with_retry() to file_utils.c

Move pg_pwritev_with_retry() from file/fd.c to common/file_utils.c
so that non-backend code or FRONTEND cases can also use it.

Author: Bharath Rupireddy
Reviewed-by: Thomas Munro
Reviewed-by: Nathan Bossart
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJKwUrpP0igOFAv5khj3dbHvfWqLzFeo7vtNzDYXv_YZw%40mail.gmail.com
---
 src/backend/storage/file/fd.c   | 65 -
 src/common/file_utils.c | 65 +
 src/include/common/file_utils.h |  7 
 src/include/storage/fd.h|  6 ---
 4 files changed, 72 insertions(+), 71 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e4d954578c..4151cafec5 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -93,7 +93,6 @@
 #include "common/pg_prng.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-#include "port/pg_iovec.h"
 #include "portability/mem.h"
 #include "postmaster/startup.h"
 #include "storage/fd.h"
@@ -3738,67 +3737,3 @@ data_sync_elevel(int elevel)
 {
 	return data_sync_retry ? elevel : PANIC;
 }
-
-/*
- * A convenience wrapper for pg_pwritev() that retries on partial write.  If an
- * error is returned, it is unspecified how much has been written.
- */
-ssize_t
-pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
-{
-	struct iovec iov_copy[PG_IOV_MAX];
-	ssize_t		sum = 0;
-	ssize_t		part;
-
-	/* We'd better have space to make a copy, in case we need to retry. */
-	if (iovcnt > PG_IOV_MAX)
-	{
-		errno = EINVAL;
-		return -1;
-	}
-
-	for (;;)
-	{
-		/* Write as much as we can. */
-		part = pg_pwritev(fd, iov, iovcnt, offset);
-		if (part < 0)
-			return -1;
-
-#ifdef SIMULATE_SHORT_WRITE
-		part = Min(part, 4096);
-#endif
-
-		/* Count our progress. */
-		sum += part;
-		offset += part;
-
-		/* Step over iovecs that are done. */
-		while (iovcnt > 0 && iov->iov_len <= part)
-		{
-			part -= iov->iov_len;
-			++iov;
-			--iovcnt;
-		}
-
-		/* Are they all done? */
-		if (iovcnt == 0)
-		{
-			/* We don't expect the kernel to write more than requested. */
-			Assert(part == 0);
-			break;
-		}
-
-		/*
-		 * Move whatever's left to the front of our mutable copy and adjust
-		 * the leading iovec.
-		 */
-		Assert(iovcnt > 0);
-		memmove(iov_copy, iov, sizeof(*iov) * iovcnt);
-		Assert(iov->iov_len > part);
-		iov_copy[0].iov_base = (char *) iov_copy[0].iov_base + part;
-		iov_copy[0].iov_len -= part;
-		iov = iov_copy;
-	}
-
-	return sum;
-}
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index df4d6d240c..4a4bdc31c4 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -28,6 +28,7 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "port/pg_iovec.h"
 
 #ifdef FRONTEND
 
@@ -460,3 +461,67 @@ get_dirent_type(const char *path,
 
 	return result;
 }
+
+/*
+ * A 

RE: Question: test "aggregates" failed in 32-bit machine

2022-09-29 Thread kuroda.hay...@fujitsu.com
Dear Tom,

> Hmm, we're not seeing any such failures in the buildfarm's 32-bit
> animals, so there must be some additional condition needed to make
> it happen.  Can you be more specific?

Hmm, I was not sure about additional conditions, sorry.
I could reproduce it with the following steps:

$ git clone https://github.com/postgres/postgres.git
$ cd postgres
$ ./configure --enable-cassert --enable-debug
$ make -j2
$ make check LANG=C

->  aggregates   ... FAILED 3562 ms




The hypervisor of the virtual machine is "VMware vSphere 7.0".

I have also picked up some other information about the machine.
Could you find something?

```

$ ./pg_config 
...
CONFIGURE =  '--enable-cassert' '--enable-debug'
CC = gcc -std=gnu99
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Werror=vla -Wendif-labels 
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -O2
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/usr/local/pgsql/lib',--enable-new-dtags
LDFLAGS_EX = 
LDFLAGS_SL = 
LIBS = -lpgcommon -lpgport -lz -lreadline -lrt -ldl -lm 
VERSION = PostgreSQL 16devel

$ locale
LANG=C
...

$ arch 
i686


$cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 85
model name  : Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
stepping: 7
microcode   : 83898371
cpu MHz : 2394.374
cache size  : 36608 KB
physical id : 0
siblings: 1
core id : 0
cpu cores   : 1
apicid  : 0
initial apicid  : 0
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 22
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ss nx pdpe1gb rdtscp lm constant_tsc 
arch_perfmon xtopology tsc_reliable nonstop_tsc unfair_spinlock eagerfpu pni 
pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 
3dnowprefetch arat xsaveopt ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 smep bmi2 
invpcid avx512f rdseed adx avx512cd md_clear flush_l1d arch_capabilities
bogomips: 4788.74
clflush size: 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model   : 85
model name  : Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
stepping: 7
microcode   : 83898371
cpu MHz : 2394.374
cache size  : 36608 KB
physical id : 2
siblings: 1
core id : 0
cpu cores   : 1
apicid  : 2
initial apicid  : 2
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 22
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ss nx pdpe1gb rdtscp lm constant_tsc 
arch_perfmon xtopology tsc_reliable nonstop_tsc unfair_spinlock eagerfpu pni 
pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 
3dnowprefetch arat xsaveopt ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 smep bmi2 
invpcid avx512f rdseed adx avx512cd md_clear flush_l1d arch_capabilities
bogomips: 4788.74
clflush size: 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: more descriptive message for process termination due to max_slot_wal_keep_size

2022-09-29 Thread Tom Lane
Kyotaro Horiguchi  writes:
> By the way, I didn't notice at that time (and forgot about the
> policy), but the HINT message has variations differing only by the
> variable name.

> What do you think about the attached?

Hmm, maybe, but a quick grep for 'You might need to increase'
finds about a dozen other cases, and none of them are using %s.
If we do this we should change all of them, and they probably
need "translator:" hints.  I'm not sure whether abstracting
away the variable names will make translation harder.

regards, tom lane




Re: SQL/JSON features for v15

2022-09-29 Thread Andrew Dunstan


On 2022-09-01 Th 09:54, Nikita Glukhov wrote:
>
> On 31.08.2022 23:39, Nikita Glukhov wrote:
>
>> And here is a quick POC patch with an example for COPY and float4
> I decided to go further and use the new API in SQL/JSON functions 
> (even if it does not make real sense now).
>
> I have added a function for checking expression trees, and special
> executor steps for handling errors in FuncExpr, CoerceViaIO, and
> CoerceToDomain, which are passed through ExprState.edata.
>
> Of course, there is still a lot of work:
>   1. JIT for new expression steps
>   2. Removal of subsidiary ExprStates (needs another solution for 
> ErrorData passing)
>   3. Checking of domain constraint expressions
>   4. Error handling in coercion to bytea
>   5. Error handling in json_populate_type()
>   6. Error handling in jsonb::type casts
>   7. ...
>
>
> Also I have added lazy creation of JSON_VALUE coercions, which was 
> not present in previous patches.  It really greatly speeds up JIT 
> and reduces memory consumption.  But it requires the use of subsidiary 
> ExprStates.
>
>
> jsonb_sqljson test now fails because of points 4, 5, 6.




It looks like this needs to be rebased anyway.

I suggest just submitting the Input function stuff on its own; I think
that means not patches 3, 4, and 15 at this stage. Maybe we would also need
a small test module to call the functions, or at least some of them.

The earlier we can get this in, the earlier SQL/JSON patches based on it
can be considered.

A few comments:


. proissafe isn't really a very informative name. Safe for what? maybe
proerrorsafe or something would be better?

. I don't think we need the if test or else clause here:

+   if (edata)
+   return InputFunctionCallInternal(flinfo, str, typioparam,
typmod, edata);
+   else
+   return InputFunctionCall(flinfo, str, typioparam, typmod);

. I think we should probably cover float8 as well as float4, and there
might be some other odd gaps.


As mentioned previously, this should really go in a new thread, so
please don't reply to this but start a completely new thread.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-29 Thread Nathan Bossart
On Fri, Sep 30, 2022 at 08:09:04AM +0530, Bharath Rupireddy wrote:
> I'm attaching the v6 patch set, please review it further.

Looks reasonable to me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: SQL/JSON features for v15

2022-09-29 Thread Tom Lane
Andrew Dunstan  writes:
> I suggest just submitting the Input function stuff on its own, I think
> that means not patches 3,4,15 at this stage. Maybe we would also need a
> small test module to call the functions, or at least some of them.
> The earlier we can get this in, the earlier SQL/JSON patches based on it
> can be considered.

+1

> . proissafe isn't really a very informative name. Safe for what? maybe
> proerrorsafe or something would be better?

I strongly recommend against having a new pg_proc column at all.
I doubt that you really need it, and having one will create
enormous mechanical burdens to making the conversion.  (For example,
needing a catversion bump every time we convert one more function,
or an extension version bump to convert extensions.)

regards, tom lane




Re: Assign TupleTableSlot->tts_tableOid duplicated in tale AM.

2022-09-29 Thread Wenchao Zhang
Firstly, thank you for your reply.
Yeah, I think maybe assigning tts_tableOid of the TTS only once
while scanning the same table may be better. That really needs
to be thought over.

Regards,
Wenchao

Peter Geoghegan  wrote on Wed, Sep 28, 2022 at 10:47:

> On Tue, Sep 20, 2022 at 11:51 PM Wenchao Zhang 
> wrote:
> > We can see that the two assigned values are the same by reading the
> > code. Maybe we should remove one?
> >
> > What's more, we find that maybe assigning slot->tts_tableOid in an outer
> > interface like table_scan_getnextslot may be better and more friendly when
> > we introduce other pluggable storage formats.
>
> I suppose that you're right; it really should happen in exactly one
> place, based on some overarching theory about how tts_tableOid works
> with the table AM interface. I just don't know what that theory is.
>
> --
> Peter Geoghegan
>


Re: Suppressing useless wakeups in walreceiver

2022-09-29 Thread Nathan Bossart
I was surprised to see that this patch was never committed!

On Mon, Jan 31, 2022 at 04:28:11PM +0900, Kyotaro Horiguchi wrote:
> +static void
> +WalRcvComputeNextWakeup(WalRcvInfo *state,
> + WalRcvWakeupReason reason,
> + TimestampTz now)
> +{
> + switch (reason)
> + {
> ...
> + default:
> + break;
> + }
> +}
> 
> Mmm.. there's NUM_WALRCV_WAKEUPS.. But I think we don't need to allow
> the case.  Isn't it better to replace the break with Assert()?

+1

> It might be suitable for another patch, but we can do that
> together. If we passed "now" to the function XLogWalRcvProcessMsg, we
> would be able to remove further several calls to
> GetCurrentTimestamp(), in the function itself and
> ProcessWalSndrMessage (the name is improperly abbreviated..).

While reading the patch, I did find myself wondering if there were some
additional opportunities for reducing the calls to GetCurrentTimestamp().
That was really the only thing that stood out to me.

Thomas, are you planning to continue with this patch?  If not, I don't mind
picking it up.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: disfavoring unparameterized nested loops

2022-09-29 Thread David Rowley
On Fri, 30 Sept 2022 at 13:06, Peter Geoghegan  wrote:
> I like the idea of just avoiding unparameterized nested loop joins
> altogether when an "equivalent" hash join plan is available because
> it's akin to an execution-time mitigation, despite the fact that it
> happens during planning. While it doesn't actually change anything in
> the executor, it is built on the observation that we have virtually
> everything to gain and nothing to lose during execution, no matter
> what happens.

I'm not sure if it's a good idea to assume that performing
non-parameterised Nested Loops when we shouldn't is the only shape of
plan that causes us problems.

We also have the case where we assume early start-up plans are
favourable.  For example:

SELECT * FROM t WHERE a = 1 ORDER BY b LIMIT 10;

where we have two indexes, one on t(a) and another on t(b).

Should we use the t(b) index and filter out the rows that don't match
a = 1 and hope we get 10 a=1 rows soon in the t(b) index? or do we use
t(a) and then perform a sort? Best case for using the t(b) index is
that we find 10 a=1 rows in the first 10 rows of the index scan, the
worst case is that there are no rows with a=1.

Having something coded into the cost model is a more generic way of
addressing this issue.  Providing we design the cost model correctly,
we'd be able to address future issues we discover using which ever
cost model infrastructure that we design for this.

I understand that what you propose would be a fast way to fix this
issue. However, if we went and changed the join path creation code to
not add non-parameterised nested loop paths when other paths exist,
then how could we ever dare to put that code back again when we come
up with a better solution?

David




Re: Suppressing useless wakeups in walreceiver

2022-09-29 Thread Thomas Munro
On Fri, Sep 30, 2022 at 4:51 PM Nathan Bossart  wrote:
> Thomas, are you planning to continue with this patch?  If not, I don't mind
> picking it up.

Hi Nathan,

Please go ahead!  One thing I had in my notes to check with this patch
is whether there might be a bug due to computing all times in
microseconds, but sleeping for a number of milliseconds without
rounding up (what I mean is that if it's trying to sleep for 900
microseconds, it might finish up sleeping for 0 milliseconds in a
tight loop a few times).




interrupted tap tests leave postgres instances around

2022-09-29 Thread Andres Freund
Hi,

When tap tests are interrupted (e.g. with ctrl-c), we don't cancel running
postgres instances etc. That doesn't strike me as a good thing.

In contrast, the postgres instances started by pg_regress do terminate. I
assume this is because pg_regress starts postgres directly, whereas tap tests
largely start postgres via pg_ctl. pg_ctl will, as it should, start postgres
without a controlling terminal. Thus a ctrl-c won't be delivered to it.

ISTM we should at least install a SIGINT/TERM handler in Cluster.pm that does
the stuff we already do in END.

That still leaves us with some other potential processes around that won't
immediately exit, but it'd be much better already.

Greetings,

Andres Freund




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-29 Thread Andres Freund
Hi,

On 2022-09-29 22:16:10 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Any opinions about whether to do this only in head or backpatch to 15?
> 
> HEAD should be sufficient, IMO.

Pushed. I think we should add some more divergent options to increase the
coverage, e.g. a different xlog page size, or a smaller segment size so we
can test the "segment boundary" code (although we don't currently allow
< 1GB via normal means right now), etc. But that's for later.

Greetings,

Andres Freund




Re: meson vs mingw, plpython, readline and other fun

2022-09-29 Thread Andres Freund
Hi,

On 2022-09-28 12:33:17 -0700, Andres Freund wrote:
> 0001 - meson: windows: Normalize slashes in prefix
> 0002 - meson: pg_regress: Define a HOST_TUPLE sufficient to make resultmap 
> work
> 0003 - meson: mingw: Allow multiple definitions
> 0004 - meson: Implement getopt logic from autoconf
> 0005 - mingw: Define PGDLLEXPORT as __declspec (dllexport) as done for msvc
> 0006 - meson: mingw: Add -Wl,--disable-auto-import, enable when linking with 
> readline
> 
> 0005 is the one that I'd most like review for. The rest just affect meson, so
> I'm planning to push them fairly soon - review would nevertheless be nice.

I have pushed 1-4, was holding out for opinions on 5.

I'm planning to push 0005 (and 0006) soon, to allow the mingw CI patch to
progress. So if somebody has concerns defining PGDLLEXPORT to __declspec
(dllexport) for mingw (already the case for msvc)...

Greetings,

Andres Freund




Re: more descriptive message for process termination due to max_slot_wal_keep_size

2022-09-29 Thread Kyotaro Horiguchi
At Thu, 29 Sep 2022 22:49:00 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > By the way, I didn't notice at that time (and forgot about the
> > policy), but the HINT message has variations differing only by the
> > variable name.
> 
> > What do you think about the attached?
> 
> Hmm, maybe, but a quick grep for 'You might need to increase'
> finds about a dozen other cases, and none of them are using %s.
(Mmm. I had looked only in the po files, so I didn't find the others.)
> If we do this we should change all of them, and they probably
> need "translator:" hints.  I'm not sure whether abstracting
> away the variable names will make translation harder.

I expect that dedicated po-editing tools can lookup corresponding code
lines, which gives the answer if no hint is attached at least in this
specific case.

Anyway, thinking calmly, since we are not about to edit these
messages, such divergence is unlikely to happen on purpose, I think.
So I don't mean to push this too hard.

Thanks for the comment!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [RFC] building postgres with meson - v13

2022-09-29 Thread John Naylor
On Tue, Sep 27, 2022 at 2:41 PM John Naylor  wrote:
>
> On Tue, Sep 27, 2022 at 2:06 AM Andres Freund  wrote:
> >
> > On 2022-09-26 15:18:29 +0700, John Naylor wrote:

> > Yea, it's /usr/local on x86-64, based on what was required to make
> > macos CI work. I updated the wiki page, half-blindly - it'd be nice if
> > you could confirm that that works?
>
> Not sure if you intended for me to try the full script in your last
> response or just what's in the wiki page, but for the latter (on commit
> bed0927aeb0c6), it fails at
>
> [1656/2199] Linking target src/bin/psql/psql
> FAILED: src/bin/psql/psql

Per off-list discussion with Andres, the linking failure was caused by some
env variables set in my bash profile for the sake of Homebrew. After
removing those, the recipe in the wiki worked fine.

--
John Naylor
EDB: http://www.enterprisedb.com


Getting rid of SQLValueFunction

2022-09-29 Thread Michael Paquier
Hi all,

I bumped a few days ago into the fact that COERCE_SQL_SYNTAX
(introduced by 40c24bf) and SQLValueFunction are around to do the
exact same thing, namely enforcing single-function calls with
dedicated SQL keywords.  For example, keywords like SESSION_USER,
CURRENT_DATE, etc. go through SQLValueFunction and rely on the parser
to set a state that then gets used in execExprInterp.c.  And it is
rather easy to implement incorrect SQLValueFunctions, as these rely on
more hardcoded assumptions in the parser and executor than the
equivalent FuncCalls (like the collation to assign when using a
text-like SQLValueFunction).

There are two categories of single-value functions:
- The ones returning names, where we enforce a C collation in two
places of the code (current_role, user, current_catalog,
current_schema, current_database, current_user), even if
get_typcollation() should do that for name types.
- The ones working on time, date and timestamps (localtime[stamp],
current_date, current_time[stamp]), for 9 patterns as these accept an
optional typmod.

I have dug into the possibility of unifying all that under a single
interface, and finished with the attached patch set, which is a net
reduction of code, where all the SQLValueFunctions are replaced by a
set of FuncCalls:
 25 files changed, 338 insertions(+), 477 deletions(-)

0001 is the move done for the name-related functions, cleaning up two
places in the executor where a C collation is assigned to those
function expressions.  0002 is the remaining cleanup for the
time-related ones, moving a set of parser-side checks to the execution
path within each function, so that all this knowledge is now local to
each file holding the date and timestamp types.  Most of the gain is
in 0002, obviously.

The pg_proc entries introduced for the sake of the move use the same
name as the SQL keywords.  These should perhaps be prefixed with a
"pg_" at least.  There would be an exception with pg_localtime[stamp],
though, where we could use pg_localtime[stamp]_sql as the function
name for prosrc.  I am open to suggestions for these names.

Thoughts?
--
Michael
From 5eaabb5c306b06df2c3b0d18f075adb96074001f Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 30 Sep 2022 10:48:22 +0900
Subject: [PATCH 1/2] Remove from SQLValueFunctionOp all name-based functions

This includes six functions, moved to be COERCE_SQL_SYNTAX:
- current_role
- user
- current_catalog
- current_schema
- current_database
- current_user

"user" and "current_role" need dedicated functions so that ruleutils.c
can map them back to the SQL grammar each requires.  SQLValueFunctionOp
is reduced to half its contents.
---
 src/include/catalog/pg_proc.dat   |  6 
 src/include/nodes/primnodes.h |  8 +
 src/backend/executor/execExprInterp.c | 27 -
 src/backend/nodes/nodeFuncs.c |  4 +--
 src/backend/parser/gram.y | 30 +++
 src/backend/parser/parse_expr.c   |  8 -
 src/backend/parser/parse_target.c | 18 
 src/backend/utils/adt/ruleutils.c | 36 +++
 src/test/regress/expected/create_view.out | 12 
 src/test/regress/sql/create_view.sql  |  6 
 10 files changed, 68 insertions(+), 87 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 68bb032d3e..414d752f9d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1505,6 +1505,12 @@
 { oid => '745', descr => 'current user name',
   proname => 'current_user', provolatile => 's', prorettype => 'name',
   proargtypes => '', prosrc => 'current_user' },
+{ oid => '9695', descr => 'current role name',
+  proname => 'current_role', provolatile => 's', prorettype => 'name',
+  proargtypes => '', prosrc => 'current_user' },
+{ oid => '9696', descr => 'user name',
+  proname => 'user', provolatile => 's', prorettype => 'name',
+  proargtypes => '', prosrc => 'current_user' },
 { oid => '746', descr => 'session user name',
   proname => 'session_user', provolatile => 's', prorettype => 'name',
   proargtypes => '', prosrc => 'session_user' },
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 40661334bb..e818231e15 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1313,13 +1313,7 @@ typedef enum SQLValueFunctionOp
 	SVFOP_LOCALTIME,
 	SVFOP_LOCALTIME_N,
 	SVFOP_LOCALTIMESTAMP,
-	SVFOP_LOCALTIMESTAMP_N,
-	SVFOP_CURRENT_ROLE,
-	SVFOP_CURRENT_USER,
-	SVFOP_USER,
-	SVFOP_SESSION_USER,
-	SVFOP_CURRENT_CATALOG,
-	SVFOP_CURRENT_SCHEMA
+	SVFOP_LOCALTIMESTAMP_N
 } SQLValueFunctionOp;
 
 typedef struct SQLValueFunction
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 9b9bbf00a9..6ebf5c287e 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -2495,15 +2495,10 @@ ExecEvalParamExtern(ExprState *state, Ex

Re: disfavoring unparameterized nested loops

2022-09-29 Thread Benjamin Coutu
> Right. But that seems fraught with difficulty. I suspect that the
> costs that the planner attributes to each plan often aren't very
> reliable in any absolute sense, even when everything is working very
> well by every available metric. Even a very noisy cost model with
> somewhat inaccurate selectivity estimates will often pick the cheapest
> plan, or close enough.

Sure, the absolute cost of a complex plan will always be inaccurate at best.
My point is that we can be very confident in the cardinalities of base tables. 
As the paper states in "3.1. Estimates for Base Tables":

"The median q-error is close to the optimal value of 1 for all systems,
indicating that the majority of all selections are estimated correctly."

Thanks to the statistics, we will practically never be off by an order of 
magnitude when estimating base table cardinalities.

The paper also clearly shows (and that certainly coincides with my experience) 
that those cardinality underestimations grow exponentially as they propagate up 
the join tree.
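To make this concrete, here is a small sketch (mine, not from the paper) of the q-error metric and of how a fixed per-join underestimation factor compounds multiplicatively with join depth; the factor of 2 per join is an illustrative assumption:

```python
def q_error(estimate, actual):
    # q-error: the multiplicative error of a cardinality estimate.
    # Always >= 1; 1.0 means a perfect estimate, and it penalizes
    # under- and overestimation symmetrically.
    assert estimate > 0 and actual > 0
    return max(estimate / actual, actual / estimate)

def propagated_underestimate(per_join_factor, depth):
    # If each join underestimates by a constant factor, the error at a
    # given depth of the join tree grows exponentially with that depth.
    return per_join_factor ** depth

# Base tables: close to perfect, as the paper observes.
base = q_error(estimate=950, actual=1000)       # ~1.05

# Five joins up, a modest 2x-per-join underestimate is already 32x off.
deep = propagated_underestimate(per_join_factor=2.0, depth=5)  # 32.0
```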

Given the research I'd stipulate that at any given level of the join tree, the 
current depth is a reasonable indicator of underestimation. Taking that into 
account (even if only to mitigate nested loops on higher levels) is IMV a 
principled approach, and not necessarily a hack.

Obviously having something like error bars as proposed by Tom would be even 
better and perhaps more general, but that is on a whole different level in 
terms of complexity and I certainly have no idea how we would easily get there.




Re: interrupted tap tests leave postgres instances around

2022-09-29 Thread Michael Paquier
On Thu, Sep 29, 2022 at 09:07:34PM -0700, Andres Freund wrote:
> ISTM we should at least install a SIGINT/TERM handler in Cluster.pm that does
> the stuff we already do in END.

Hmm, indeed.  And here I thought that END was actually taking care of
that on an interrupt..
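Cluster.pm is Perl, but the pattern being suggested - run the END-style
cleanup when SIGINT/SIGTERM arrives, then re-raise the signal with the
default disposition so the exit status is preserved - can be sketched
generically. Here it is in Python for illustration, with a hypothetical
node list standing in for the cluster state:

```python
import os
import signal

# Hypothetical stand-in for the state the END block would clean up:
# postgres instances the test started.
nodes = [{"name": "primary", "running": True}]

def teardown():
    # Stand-in for stopping each node (e.g. what "pg_ctl stop" would do).
    for node in nodes:
        node["running"] = False

def on_signal(signum, frame):
    # Run the same cleanup END would have done, then restore the default
    # handler and re-raise so the process still dies with that signal.
    teardown()
    signal.signal(signum, signal.SIG_DFL)
    os.kill(os.getpid(), signum)

for sig in (signal.SIGINT, signal.SIGTERM):
    signal.signal(sig, on_signal)
```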
--
Michael




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Benjamin Coutu


> In general I suspect that we'd be better off focussing on mitigating
> the impact at execution time. There are at least a few things that we
> could do there, at least in theory. Mostly very ambitious, long term
> things.

I think these things are orthogonal. No matter how good the cost model ever 
gets, we will always have degenerate cases.
Having some smarts about that in the executor is surely a good thing, but it 
shouldn't distract us from improving on the planner front.

> 
> I like the idea of just avoiding unparameterized nested loop joins
> altogether when an "equivalent" hash join plan is available because
> it's akin to an execution-time mitigation, despite the fact that it
> happens during planning. While it doesn't actually change anything in
> the executor, it is built on the observation that we have virtually
> everything to gain and nothing to lose during execution, no matter
> what happens.

I agree with you, that those plans are too risky. But let's maybe find a more 
general way of dealing with this.
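The asymmetry under discussion can be illustrated with textbook cost
shapes (a toy sketch, not PostgreSQL's actual cost model): an
unparameterized nested loop costs roughly the product of its input
sizes, a hash join roughly their sum, so a misestimate hurts the former
multiplicatively and the latter barely at all:

```python
def nestloop_cost(outer_rows, inner_rows):
    # Unparameterized nested loop: the inner side is rescanned once per
    # outer row, so cost grows with the *product* of the inputs.
    return outer_rows * inner_rows

def hashjoin_cost(outer_rows, inner_rows):
    # Hash join: build a table on one side, probe with the other, so
    # cost grows roughly with the *sum* of the inputs.
    return outer_rows + inner_rows

# Planner believes the outer side has 10 rows; at runtime it has 10,000.
est_outer, actual_outer, inner = 10, 10_000, 100_000

# The 1000x outer misestimate inflates the nested loop's cost 1000x,
# while the hash join's cost moves by only a few percent.
nl_blowup = nestloop_cost(actual_outer, inner) / nestloop_cost(est_outer, inner)
hj_blowup = hashjoin_cost(actual_outer, inner) / hashjoin_cost(est_outer, inner)
```

This is why picking the hash join when one is available loses little
even when the estimates happen to be right, but avoids the worst case
when they are not.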

> Right. Though I am actually sympathetic to the idea that users might
> gladly pay a cost for performance stability -- even a fairly large
> cost. That part doesn't seem like the problem.




Re: Refactor UnpinBuffer()

2022-09-29 Thread Michael Paquier
On Thu, Sep 29, 2022 at 10:35:20AM -0700, Nathan Bossart wrote:
> I've marked this one as ready-for-committer.

UnpinBuffer() is local to bufmgr.c, so it would not be an issue for
external code, and that's 10 callers that don't need to worry about
that anymore.  Additionally, 2d115e4 is from 2015, and nobody has used
this option since.

Anyway, per the rule of consistency with the surroundings (see
ReleaseBuffer() and ReleaseAndReadBuffer()), it seems to me that there
is a good case for keeping the adjustment of CurrentResourceOwner
before any refcount checks.  I have also kept a mention to
CurrentResourceOwner in the top comment of the function, and applied
that.
--
Michael

