Re: Synchronizing slots from primary to standby

2023-08-04 Thread Drouvot, Bertrand

Hi,

On 7/24/23 4:32 AM, Bharath Rupireddy wrote:

On Fri, Jul 21, 2023 at 5:16 PM shveta malik  wrote:



Here are my thoughts about this feature:


Thanks for looking at it!




Important considerations:

1. Does this design guarantee the row versions required by subscribers
aren't removed on candidate standbys as raised here -
https://www.postgresql.org/message-id/20220218222319.yozkbhren7vkjbi5%40alap3.anarazel.de?

It seems safe with the logical decoding on standby feature. Also, a
test case from upthread is already in the patch sets (in v9 too)
https://www.postgresql.org/message-id/CAAaqYe9FdKODa1a9n%3Dqj%2Bw3NiB9gkwvhRHhcJNginuYYRCnLrg%40mail.gmail.com.
However, we need to verify the use cases extensively.


Agree. We also discussed up-thread that we'd have to drop any "sync" slots if they
are invalidated. And they should be re-created based on the synchronize_slot_names.


Please feel free to add the list if I'm missing anything.



We'd also have to ensure that "sync" slots can't be consumed on the standby
(this has been discussed up-thread).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Adding a LogicalRepWorker type field

2023-08-04 Thread Peter Smith
Thank you for the feedback received so far. Sorry, I have become busy
lately with other work.

IIUC there is a general +1 for the idea, but I still have some
juggling necessary to make the functions/macros of patch 0001
acceptable to everybody.

The difficulties arise from:
- no function overloading in C
- ideally, we prefer the same names for everything (e.g. instead of
dual-set macros)
- but launcher code calls need to pass a param (other code always uses
MyLogicalRepWorker)
- it would be nice (although not mandatory) to not have to always pass
MyLogicalRepWorker as a param.

Anyway, I will work towards finding some compromise on this next week.
Currently, I feel the best choice is to go with what Bharath suggested
in the first place -- just always pass the parameter (including
passing MyLogicalRepWorker). I will think more about it...

--

Meanwhile, here are some replies to other feedback received:

Ashutosh --  "May be we should create a union of type specific members" [1].

Yes, I was wondering this myself, but I won't pursue this yet until
getting over the hurdle of finding acceptable functions/macros for
patch 0001. Hopefully, we can come back to this idea.

~~

Mellih -- "Isn't the apply worker type still decided by eliminating
the other choices even with the patch?".

AFAIK the only case of that now is the one-time check in the
logicalrep_worker_launch() function. IMO, that is quite a different
proposition to the HEAD macros that have to make that deduction
10,000s of times in executing code. Actually, the caller already knows
exactly the kind of worker it wants to launch so we can pass the
LogicalRepWorkerType directly to logicalrep_worker_launch() and
eliminate even this one-time-check. I can code it that way in the next
patch version.
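
To illustrate the shape of that change, here is a sketch only (not the actual
patch), using the "WORKER_TYPE_xxx" spelling discussed further down; the idea
being that a caller such as the tablesync code would pass e.g.
WORKER_TYPE_TABLESYNC as an extra leading argument to
logicalrep_worker_launch(), whose other arguments are unchanged:

typedef enum LogicalRepWorkerType
{
	WORKER_TYPE_APPLY,
	WORKER_TYPE_PARALLEL_APPLY,
	WORKER_TYPE_TABLESYNC
} LogicalRepWorkerType;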

~~

Barath -- "-1 for these names starting with prefix TYPE_, in fact LR_
looked better."

Hmmm. I'm not sure what is best. Of the options below I prefer
"WORKER_TYPE_xxx", but I don't mind so long as there is a consensus.

LR_APPLY_WORKER
LR_PARALLEL_APPLY_WORKER
LR_TABLESYNC_WORKER

TYPE_APPLY_WORKER
TYPE_PARALLEL_APPLY_WORKER
TYPE_TABLESYNC_WORKER

WORKER_TYPE_APPLY
WORKER_TYPE_PARALLEL_APPLY
WORKER_TYPE_TABLESYNC

APPLY_WORKER
PARALLEL_APPLY_WORKER
TABLESYNC_WORKER

APPLY
PARALLEL_APPLY
TABLESYNC

--
[1] Ashutosh - 
https://www.postgresql.org/message-id/CAExHW5uALiimrdpdO0vwuDivD99TW%2B_9vvfFsErVNzq1ehYV9Q%40mail.gmail.com
[2] Melih - 
https://www.postgresql.org/message-id/CAGPVpCRZ-zEOa2Lkvw%2BiTU3NhJzfbGwH1dU7Htreup--8e5nxg%40mail.gmail.com
[3] Bharath - 
https://www.postgresql.org/message-id/CALj2ACVro6oCsTg_gpYu%2BV_LPMSgk4wjmSPDk8d5thArWNRoRQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Adding a pg_servername() function

2023-08-04 Thread Laetitia Avrot
On Thu, Aug 3, 2023 at 3:17 PM, Christoph Moench-Tegeder wrote:

> ## Laetitia Avrot (laetitia.av...@gmail.com):
>

> > For my customer,  their use case is to be able from an SQL client to
> double
> > check they're on the right host before doing things that could become a
> > production disaster.
>
> Why not use cluster_name for that? Even if it may not be relevant
> for their setup, there might be multiple clusters on the same host
> (multiple clusters with the same hostname) or your database could be
> hiding behind some failover/loadbalancer mechanism (multiple hostnames
> in one cluster), thus using the hostname to identify the cluster is
> totally not universal (the same goes for the monitoring/metrics
> stuff). And there's of course inet_server_addr(), if you really
> need to double-check if you're connected to the right machine.
>

Hello Christoph,

I understand your point and sure enough, my customer could set and use the
cluster_name for that purpose. I totally disagree with using
inet_server_addr() for that purpose as there are so many different network
settings with VIPs and so on that it won't help. Also, a socket connection
will give a NULL value, not that it's *that* bad because if it's a socket,
you're running locally and could still use `\! ifconfig`, but it does not
work on some configurations (docker for example). Also, most humans will
find it easier to memorize a name than a number and that's one of the
reasons why we remember websites' URLs instead of their IP.
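
For what it's worth, here's roughly what that looks like from psql over a
local Unix-socket connection (just an illustration of the point, output will
vary):

SELECT inet_server_addr(), inet_server_port();
-- both come back NULL over a socket, so there is nothing to double-check

SHOW cluster_name;
-- helps only if it has been set; it is empty by default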

I still disagree with the monitoring part. A good monitoring tool will have
to reconcile data from the OS with data from the Postgres cluster. So we
need a way for the monitoring tool to know on which host this particular
cluster is running, and I think it's smarter to get this information from
the Postgres cluster.

Of course, I do love the idea that Postgres is agnostic about where it's
running, but I don't think this new function removes that property.

Have a nice day,

Lætitia


Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2023-08-04 Thread Ashutosh Bapat
On Fri, Jul 28, 2023 at 7:28 PM Ashutosh Bapat 
wrote:

> >
> > +   bms_free(child_sjinfo->commute_above_l);
> > +   bms_free(child_sjinfo->commute_above_r);
> > +   bms_free(child_sjinfo->commute_below_l);
> > +   bms_free(child_sjinfo->commute_below_r);
> >
> > These four members in SpecialJoinInfo only contain outer join relids.
> > They do not need to be translated.  So they would reference the same
> > memory areas in child_sjinfo as in parent_sjinfo.  We should not free
> > them, otherwise they would become dangling pointers in parent sjinfo.
> >
>
>
Attached is a patchset fixing those.
0001 - patch to report planning memory, with a fix to the explain.out
regression output. We may consider committing this as well.
0002 - with your comment above addressed.


>
> I think we should 1. add an assert to make sure that commute_above_*
> > do not require any translations i.e. they do not contain any parent
> relids of the child rels.


Looking at the definitions of commute_above and commute_below relids and OJ
relids, I don't think these assertions are necessary, so I didn't add them.


> 2. Do not free those.


Done


> 3. Add a comment about
> keeping the build and free functions in sync.
>

Done.

-- 
Best Wishes,
Ashutosh Bapat
From 553a147b4566e4f6da28e2f7dd2c9759371afd7b Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Mon, 24 Jul 2023 20:20:53 +0530
Subject: [PATCH 2/2] Reduce memory used by child SpecialJoinInfo

The SpecialJoinInfo applicable to a child join is computed by
translating the one applicable to the parent join in
try_partitionwise_join(). The child SpecialJoinInfo is not needed once
the child join RelOptInfo is created and paths are added to it. Use a
local variable to hold child SpecialJoinInfo so that it doesn't need
memory allocated separately.

Also free the memory allocated to various Bitmapsets in SpecialJoinInfo.
Those are not referenced anywhere. But we do not free the memory
allocated to expression trees of operator OID list since those may be
referenced in paths or other objects.
---
 src/backend/optimizer/path/joinrels.c | 70 +++
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 015a0b3cbe..5cf32f8b07 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -42,9 +42,11 @@ static void try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1,
    RelOptInfo *rel2, RelOptInfo *joinrel,
    SpecialJoinInfo *parent_sjinfo,
    List *parent_restrictlist);
-static SpecialJoinInfo *build_child_join_sjinfo(PlannerInfo *root,
-SpecialJoinInfo *parent_sjinfo,
-Relids left_relids, Relids right_relids);
+static void build_child_join_sjinfo(PlannerInfo *root,
+	SpecialJoinInfo *parent_sjinfo,
+	Relids left_relids, Relids right_relids,
+	SpecialJoinInfo *child_sjinfo);
+static void free_child_sjinfo_members(SpecialJoinInfo *child_sjinfo);
 static void compute_partition_bounds(PlannerInfo *root, RelOptInfo *rel1,
 	 RelOptInfo *rel2, RelOptInfo *joinrel,
 	 SpecialJoinInfo *parent_sjinfo,
@@ -1540,7 +1542,7 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		RelOptInfo *child_rel2;
 		bool		rel1_empty;
 		bool		rel2_empty;
-		SpecialJoinInfo *child_sjinfo;
+		SpecialJoinInfo child_sjinfo;
 		List	   *child_restrictlist;
 		RelOptInfo *child_joinrel;
 		AppendRelInfo **appinfos;
@@ -1635,9 +1637,8 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		 * Construct SpecialJoinInfo from parent join relations's
 		 * SpecialJoinInfo.
 		 */
-		child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo,
-			   child_rel1->relids,
-			   child_rel2->relids);
+		build_child_join_sjinfo(root, parent_sjinfo, child_rel1->relids,
+child_rel2->relids, &child_sjinfo);
 
 		/* Find the AppendRelInfo structures */
 		appinfos = find_appinfos_by_relids(root,
@@ -1660,7 +1661,7 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		{
 			child_joinrel = build_child_join_rel(root, child_rel1, child_rel2,
  joinrel, child_restrictlist,
- child_sjinfo);
+ &child_sjinfo);
 			joinrel->part_rels[cnt_parts] = child_joinrel;
 			joinrel->live_parts = bms_add_member(joinrel->live_parts, cnt_parts);
 			joinrel->all_partrels = bms_add_members(joinrel->all_partrels,
@@ -1674,10 +1675,11 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 		/* And make paths for the child join */
 		populate_joinrel_with_paths(root, child_rel1, child_rel2,
-	child_joinrel, child_sjinfo,
+	child_joinrel, &child_sjinfo,
 	child_restrictlist);
 
 		pfree(appinfos);
+		free_child_sjinfo_members(&child_sjinfo);
 	}
 }
 
@@ -1685,43 +1687,71 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *r

Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2023-08-04 Thread tender wang
Looking over DetachPartitionFinalize(), I found another bug, shown below:

postgres=# CREATE TABLE p ( id bigint PRIMARY KEY ) PARTITION BY list (id);
CREATE TABLE
postgres=# CREATE TABLE p_1 PARTITION OF p FOR VALUES IN (1);
CREATE TABLE
postgres=# CREATE TABLE r_1 (
postgres(# id   bigint PRIMARY KEY,
postgres(# p_id bigint NOT NULL
postgres(#   );
CREATE TABLE
postgres=#  CREATE TABLE r (
postgres(# id   bigint PRIMARY KEY,
postgres(# p_id bigint NOT NULL,
postgres(# FOREIGN KEY (p_id) REFERENCES p (id)
postgres(#   ) PARTITION BY list (id);
CREATE TABLE
postgres=# ALTER TABLE r ATTACH PARTITION r_1 FOR VALUES IN (1);
ALTER TABLE
postgres=# ALTER TABLE r DETACH PARTITION r_1;
ALTER TABLE
postgres=# insert into r_1 values(1,1);
ERROR:  insert or update on table "r_1" violates foreign key constraint
"r_p_id_fkey"
DETAIL:  Key (p_id)=(1) is not present in table "p".

After the detach operation, r_1 is a normal relation and the inherited foreign
key 'r_p_id_fkey' should have been removed.


tender wang  于2023年8月3日周四 17:34写道:

> I think the code to determine that fk of a partition is inherited or not
> is not enough.
> For example, in this case, foreign key r_1_p_id_fkey1  is not inherited
> from parent.
>
> If conform->conparentid(in DetachPartitionFinalize func) is valid, we
> should recheck confrelid(pg_constraint) field.
>
> I try to fix this problem in the attached patch.
> Any thoughts.
>
> Alvaro Herrera  于2023年8月3日周四 17:02写道:
>
>> On 2023-Aug-03, tender wang wrote:
>>
>> > I think  old "sub-FK" should not be dropped, that will be violates
>> foreign
>> > key constraint.
>>
>> Yeah, I've been playing more with the patch and it is definitely not
>> doing the right things.  Just eyeballing the contents of pg_trigger and
>> pg_constraint for partitions added by ALTER...ATTACH shows that the
>> catalog contents are inconsistent with those added by CREATE TABLE
>> PARTITION OF.
>>
>> --
>> Álvaro Herrera PostgreSQL Developer  —
>> https://www.EnterpriseDB.com/
>>
>


Re: Synchronizing slots from primary to standby

2023-08-04 Thread Drouvot, Bertrand

Hi,

On 7/28/23 4:39 PM, Bharath Rupireddy wrote:

On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila  wrote:



2. All candidate standbys will start one slot sync worker per logical
slot which might not be scalable.


Yeah, that doesn't sound like a good idea but IIRC, the proposed patch
is using one worker per database (for all slots corresponding to a
database).


Right. It's based on one worker for each database.


Is having one (or a few more - not
necessarily one for each logical slot) worker for all logical slots
enough?


I guess for a large number of slots there is a possibility of a large
gap in syncing the slots which probably means we need to retain
corresponding WAL for a much longer time on the primary. If we can
prove that the gap won't be large enough to matter then this would be
probably worth considering otherwise, I think we should find a way to
scale the number of workers to avoid the large gap.


I think the gap is largely determined by the time taken to advance
each slot and the amount of WAL that each logical slot moves ahead on
primary. 


Sorry to be late, but I gave it a second thought and I wonder if we really
need this design (i.e. start a logical replication background worker on the
standby to sync the slots).

Wouldn't it be simpler to "just" update the sync slots' "metadata"
as the https://github.com/EnterpriseDB/pg_failover_slots module (mentioned by
Peter up-thread) is doing?
(making use of LogicalConfirmReceivedLocation(), LogicalIncreaseXminForSlot()
and LogicalIncreaseRestartDecodingForSlot(), if I read synchronize_one_slot()
correctly).
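
To make that concrete, a very rough sketch of what the per-slot "metadata"
update could boil down to (loosely modelled on pg_failover_slots; fetching the
remote slot's positions into the remote_* variables, acquiring/releasing the
slot and all error handling are elided here):

	LogicalConfirmReceivedLocation(remote_confirmed_flush_lsn);
	LogicalIncreaseXminForSlot(remote_confirmed_flush_lsn,
							   remote_catalog_xmin);
	LogicalIncreaseRestartDecodingForSlot(remote_confirmed_flush_lsn,
										  remote_restart_lsn);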


I've measured the time it takes for
pg_logical_replication_slot_advance with different amounts WAL on my
system. It took 2595ms/5091ms/31238ms to advance the slot by
3.7GB/7.3GB/13GB respectively. To put things into perspective here,
imagine there are 3 logical slots to sync for a single slot sync
worker and each of them are in need of advancing the slot by
3.7GB/7.3GB/13GB of WAL. The slot sync worker gets to slot 1 again
after 2595ms+5091ms+31238ms (~40sec), gets to slot 2 again after
advance time of slot 1 with amount of WAL that the slot has moved
ahead on primary during 40sec, gets to slot 3 again after advance time
of slot 1 and slot 2 with amount of WAL that the slot has moved ahead
on primary and so on. If WAL generation on the primary is pretty fast,
and if the logical slot moves pretty fast on the primary, the time it
takes for a single sync worker to sync a slot can increase.


That would be way "faster" and we would probably not need to
worry that much about the number of "sync" workers (if it/they "just" has/have
to sync slot's "metadata") as proposed above.

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Split index and table statistics into different types of stats

2023-08-04 Thread Drouvot, Bertrand

Hi,

On 7/10/23 11:14 AM, Daniel Gustafsson wrote:

On 4 Apr 2023, at 12:04, Drouvot, Bertrand  wrote:
On 4/3/23 11:47 PM, Gregory Stark (as CFM) wrote:

On Thu, 16 Mar 2023 at 05:25, Drouvot, Bertrand
 wrote:


My plan was to get [1] done before resuming working on the "Split index and table 
statistics into different types of stats" one.
[1]: https://commitfest.postgresql.org/42/4106/

Generate pg_stat_get_xact*() functions with Macros ([1]) was committed March 27.
There's only a few days left in this CF. Would you like to leave this
here? Should it be marked Needs Review or Ready for Commit? Or should
we move it to the next CF now?


I moved it to the next commitfest and marked the target version as v17.


This patch no longer applies (with tests failing when it did), and the thread
has stalled.  I'm marking this returned with feedback for now, please feel free
to resubmit to a future CF with a new version of the patch.


Thanks for the update.
I'll resume working on it and re-submit once done.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Using defines for protocol characters

2023-08-04 Thread Alvaro Herrera
On 2023-Aug-03, Dave Cramer wrote:

> New patch attached which uses  PREPARED_SUB_COMMAND   and
> PORTAL_SUB_COMMAND instead

Hmm, I would keep the prefix in this case and make the message type a
second prefix, with the subtype last -- PQMSG_CLOSE_PREPARED,
PQMSG_DESCRIBE_PORTAL and so on.
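
(Concretely, that naming scheme would come out something like the below; the
define names are only illustrative, the values are the existing sub-command
bytes used in the Close/Describe message bodies.)

#define PQMSG_CLOSE_PREPARED	'S'
#define PQMSG_CLOSE_PORTAL		'P'
#define PQMSG_DESCRIBE_PREPARED	'S'
#define PQMSG_DESCRIBE_PORTAL	'P'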

You define PASSWORD and GSS both with 'p', which I think is bogus; in
some place that isn't doing GSS, you've replaced 'p' with the GSS one
(CheckSASLAuth).  I think it'd be better to give 'p' a single name, and
not try to distinguish which is PASSWORD and which is GSS, because
ultimately it's not important.

There are some unpatched places, such as basebackup_copy.c -- there are
several matches for /'.'/ that correspond to protocol chars in that file.
Also CopySendEndOfRow has one 'd', and desc.c has two 'C'.

I think fe-trace.c will require further adjustment of the comments for
each function.  We could change them to be the symbol for each char, like so:

/* PQMSG_RESP_READY_FOR_QUERY */
static void
pqTraceOutputZ(FILE *f, const char *message, int *cursor)


Thanks

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."  (Brian Kernighan)




Re: remaining sql/json patches

2023-08-04 Thread Erik Rijkers

On 7/21/23 at 12:33, Amit Langote wrote:


Thanks for taking a look.



Hi Amit,

Is there any chance to rebase the outstanding SQL/JSON patches, (esp. 
json_query)?


Thanks!

Erik Rijkers





Re: remaining sql/json patches

2023-08-04 Thread Amit Langote
Hi,

On Fri, Aug 4, 2023 at 19:01 Erik Rijkers  wrote:

> On 7/21/23 at 12:33, Amit Langote wrote:
> >
> > Thanks for taking a look.
> >
>
> Hi Amit,
>
> Is there any chance to rebase the outstanding SQL/JSON patches, (esp.
> json_query)?


Yes, working on it.  Will post a WIP shortly.

> --
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


Re: Simplify some logical replication worker type checking

2023-08-04 Thread Amit Kapila
On Wed, Aug 2, 2023 at 8:20 AM Amit Kapila  wrote:
>
> On Tue, Aug 1, 2023 at 12:11 PM Alvaro Herrera  
> wrote:
> >
> > On 2023-Aug-01, Peter Smith wrote:
> >
> > > PSA a small patch making those above-suggested changes. The 'make
> > > check' and TAP subscription tests are all passing OK.
> >
> > I think the code ends up more readable with this style of changes, so
> > +1.  I do wonder if these calls should appear in a proc_exit callback or
> > some such instead, though.
> >
>
> But the call to
> ApplyLauncherForgetWorkerStartTime()->logicalrep_launcher_attach_dshmem()
> has some dynamic shared memory allocation/attach calls which I am not
> sure is a good idea to do in proc_exit() callbacks. We may want to
> evaluate whether moving the suggested checks to proc_exit or any other
> callback is a better idea. What do you think?
>

I have pushed the existing patch but feel free to pursue further improvements.

-- 
With Regards,
Amit Kapila.




Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2023-08-04 Thread tender wang
Looking over DetachPartitionFinalize() again, I found the root cause of why
'r_p_id_fkey' was not removed.

DetachPartitionFinalize() calls GetParentedForeignKeyRefs() to get the
pg_constraint tuple that should be deleted, but that lookup fails.
According to its comments, GetParentedForeignKeyRefs() fetches the tuples of
constraints that reference me, not the constraints through which I reference
others.

I tried to fix this bug as follows:
i. ConstraintSetParentConstraint() should not be called in
DetachPartitionFinalize(), because after conparentid is set to 0
we can no longer find the inherited foreign keys.
ii. Create another function like GetParentedForeignKeyRefs(), but with the
ScanKey on the conrelid field instead of confrelid.

I quickly tested this solution in my environment and it resolves the above
issue.
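
For reference, a quick way to see what is left behind on r_1 after the DETACH
in the earlier reproduction (the leftover FK has conrelid = 'r_1', which is
why a lookup keyed on confrelid cannot find it):

SELECT conname, contype, conparentid,
       conrelid::regclass  AS referencing_table,
       confrelid::regclass AS referenced_table
FROM pg_constraint
WHERE conrelid = 'r_1'::regclass AND contype = 'f';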

tender wang  于2023年8月4日周五 17:04写道:

> Oversight the DetachPartitionFinalize(), I found another bug below:
>
> postgres=# CREATE TABLE p ( id bigint PRIMARY KEY ) PARTITION BY list (id);
> CREATE TABLE
> postgres=# CREATE TABLE p_1 PARTITION OF p FOR VALUES IN (1);
> CREATE TABLE
> postgres=# CREATE TABLE r_1 (
> postgres(# id   bigint PRIMARY KEY,
> postgres(# p_id bigint NOT NULL
> postgres(#   );
> CREATE TABLE
> postgres=#  CREATE TABLE r (
> postgres(# id   bigint PRIMARY KEY,
> postgres(# p_id bigint NOT NULL,
> postgres(# FOREIGN KEY (p_id) REFERENCES p (id)
> postgres(#   ) PARTITION BY list (id);
> CREATE TABLE
> postgres=# ALTER TABLE r ATTACH PARTITION r_1 FOR VALUES IN (1);
> ALTER TABLE
> postgres=# ALTER TABLE r DETACH PARTITION r_1;
> ALTER TABLE
> postgres=# insert into r_1 values(1,1);
> ERROR:  insert or update on table "r_1" violates foreign key constraint
> "r_p_id_fkey"
> DETAIL:  Key (p_id)=(1) is not present in table "p".
>
> After detach operation, r_1 is normal relation and the inherited foreign
> key 'r_p_id_fkey' should be removed.
>
>
> tender wang  于2023年8月3日周四 17:34写道:
>
>> I think the code to determine that fk of a partition is inherited or not
>> is not enough.
>> For example, in this case, foreign key r_1_p_id_fkey1  is not inherited
>> from parent.
>>
>> If conform->conparentid(in DetachPartitionFinalize func) is valid, we
>> should recheck confrelid(pg_constraint) field.
>>
>> I try to fix this problem in the attached patch.
>> Any thoughts.
>>
>> Alvaro Herrera  于2023年8月3日周四 17:02写道:
>>
>>> On 2023-Aug-03, tender wang wrote:
>>>
>>> > I think  old "sub-FK" should not be dropped, that will be violates
>>> foreign
>>> > key constraint.
>>>
>>> Yeah, I've been playing more with the patch and it is definitely not
>>> doing the right things.  Just eyeballing the contents of pg_trigger and
>>> pg_constraint for partitions added by ALTER...ATTACH shows that the
>>> catalog contents are inconsistent with those added by CREATE TABLE
>>> PARTITION OF.
>>>
>>> --
>>> Álvaro Herrera PostgreSQL Developer  —
>>> https://www.EnterpriseDB.com/
>>>
>>


Check volatile functions in ppi_clauses for memoize node

2023-08-04 Thread Richard Guo
In get_memoize_path() we have a theory about avoiding memoize node if
there are volatile functions in the inner rel's target/restrict list.

 /*
  * We can't use a memoize node if there are volatile functions in the
  * inner rel's target list or restrict list.  A cache hit could reduce the
  * number of calls to these functions.
  */
 if (contain_volatile_functions((Node *) innerrel->reltarget))
 return NULL;

 foreach(lc, innerrel->baserestrictinfo)
 {
 RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

 if (contain_volatile_functions((Node *) rinfo))
 return NULL;
 }

It seems that the check for restrict list is not thorough, because for a
parameterized scan we are supposed to also consider all the join clauses
available from the outer rels, ie, ppi_clauses.  For instance,

create table t (a float);
insert into t values (1.0), (1.0), (1.0), (1.0);
analyze t;

explain (costs off)
select * from t t1 left join lateral
(select t1.a as t1a, t2.a as t2a from t t2) s
on t1.a = s.t2a + random();
  QUERY PLAN
---
 Nested Loop Left Join
   ->  Seq Scan on t t1
   ->  Memoize
 Cache Key: t1.a
 Cache Mode: binary
 ->  Seq Scan on t t2
   Filter: (t1.a = (a + random()))
(7 rows)

According to that theory we should not use a memoize node for this query,
because of the volatile function on the inner side.  So I propose a patch
to fix that.
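
Roughly, the extra check would look something like the below (a sketch only;
the actual patch may structure it differently), mirroring the
baserestrictinfo loop above but over the parameterized path's ppi_clauses:

	if (inner_path->param_info != NULL)
	{
		foreach(lc, inner_path->param_info->ppi_clauses)
		{
			RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

			if (contain_volatile_functions((Node *) rinfo))
				return NULL;
		}
	}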

Thanks
Richard


v1-0001-Check-volatile-functions-in-ppi_clauses-for-memoize-node.patch
Description: Binary data


Re: [RFC] Clang plugin for catching suspicious typedef casting

2023-08-04 Thread Dmitry Dolgov
> On Thu, Aug 03, 2023 at 12:23:52PM -0500, Tristan Partin wrote:
>
> This is the first I am learning about clang plugins. Interesting concept.
> Did you give any thought to using libclang instead of a compiler plugin? I
> am kind of doing similar work, but I went with libclang instead of a plugin.

Nope, never thought about trying libclang. From the clang documentation
it seems a plugin is a suitable interface if one wants to:

special lint-style warnings or errors for your project

Which is what I was trying to achieve. Are there any advantages of
libclang that you see?




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-04 Thread Amit Kapila
On Wed, Aug 2, 2023 at 1:43 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > 3.
> > + /*
> > + * Get replication slots.
> > + *
> > + * XXX: Which information must be extracted from old node? Currently three
> > + * attributes are extracted because they are used by
> > + * pg_create_logical_replication_slot().
> > + */
> > + appendPQExpBufferStr(query,
> > + "SELECT slot_name, plugin, two_phase "
> > + "FROM pg_catalog.pg_replication_slots "
> > + "WHERE database = current_database() AND temporary = false "
> > + "AND wal_status IN ('reserved', 'extended');");
> >
> > Why are we ignoring the slots that have wal status as WALAVAIL_REMOVED
> > or WALAVAIL_UNRESERVED? I think the slots where wal status is
> > WALAVAIL_REMOVED, the corresponding slots are invalidated at some
> > point. I think such slots can't be used for decoding but these will be
> > dropped along with the subscription or when a user does it manually.
> > So, if we don't copy such slots after the upgrade then there could be
> > a problem in dropping the corresponding subscription. If we don't want
> > to copy over such slots then we need to provide instructions on what
> > users should do in such cases. OTOH, if we want to copy over such
> > slots then we need to find a way to invalidate such slots after copy.
> > Either way, this needs more analysis.
>
> I considered again here. At least WALAVAIL_UNRESERVED should be supported 
> because
> the slot is still usable. It can return reserved or extended.
>
> As for WALAVAIL_REMOVED, I don't think it should be so that I added a 
> description
> to the document.
>
> This feature re-create slots which have same name/plugins as old ones, not 
> replicate
> its state. So if we copy them as-is slots become usable again. If subscribers 
> refer
> the slot and then connect again at that time, changes between 
> 'WALAVAIL_REMOVED'
> may be lost.
>
> Based on above slots must be copied as WALAVAIL_REMOVED, but as you said, we 
> do
> not have a way to control that. the status is calculated by using restart_lsn,
> but there are no function to modify directly.
>
> One approach is adding an SQL function which sets restart_lsn to an arbitrary value
> (or 0/0, invalid), but it seems dangerous.
>

So, we have three options here: (a) as you have done in the patch,
document this limitation and request the user to perform some manual steps
to drop the subscription; (b) don't allow the upgrade to proceed if there
are invalid slots in the old cluster; (c) provide a new function like
pg_copy_logical_replication_slot_contents() where we copy the required
contents like the invalid status (ReplicationSlotInvalidationCause), etc.

Personally, I would prefer (b) because it will minimize the steps the user
needs to perform after the upgrade and looks like the cleaner solution.
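
For illustration, the check behind (b) could be as simple as refusing to run
if the following returns any rows on the old cluster (assuming "invalid" here
means the slot's required WAL has already been removed, i.e.
wal_status = 'lost'):

SELECT slot_name, plugin, wal_status
FROM pg_catalog.pg_replication_slots
WHERE temporary = false AND wal_status = 'lost';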

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2023-08-04 Thread shveta malik
On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 7/28/23 4:39 PM, Bharath Rupireddy wrote:
> > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila  wrote:
> >>
> >>> 2. All candidate standbys will start one slot sync worker per logical
> >>> slot which might not be scalable.
> >>
> >> Yeah, that doesn't sound like a good idea but IIRC, the proposed patch
> >> is using one worker per database (for all slots corresponding to a
> >> database).
> >
> > Right. It's based on one worker for each database.
> >
> >>> Is having one (or a few more - not
> >>> necessarily one for each logical slot) worker for all logical slots
> >>> enough?
> >>
> >> I guess for a large number of slots there is a possibility of a large
> >> gap in syncing the slots which probably means we need to retain
> >> corresponding WAL for a much longer time on the primary. If we can
> >> prove that the gap won't be large enough to matter then this would be
> >> probably worth considering otherwise, I think we should find a way to
> >> scale the number of workers to avoid the large gap.
> >
> > I think the gap is largely determined by the time taken to advance
> > each slot and the amount of WAL that each logical slot moves ahead on
> > primary.
>
> Sorry to be late, but I gave a second thought and I wonder if we really need 
> this design.
> (i.e start a logical replication background worker on the standby to sync the 
> slots).
>
> Wouldn't that be simpler to "just" update the sync slots "metadata"
> as the https://github.com/EnterpriseDB/pg_failover_slots module (mentioned by 
> Peter
> up-thread) is doing?
> (making use of LogicalConfirmReceivedLocation(), LogicalIncreaseXminForSlot()
> and LogicalIncreaseRestartDecodingForSlot(), If I read synchronize_one_slot() 
> correctly).
>

Agreed. It would be simpler to just update the metadata. I think you
have not had a chance to review the latest posted patch ('v10-0003')
yet; it does the same.

But I do not quite get how we can do it without starting a
background worker? Even the failover-slots extension starts one
background worker. The question here is how many background workers we
need to have. Will one be sufficient, or do we need one per db (as done
earlier by the original patches in this thread), or are we good with
dividing the work among some limited number of workers?

I feel syncing all slots in one worker may increase the lag between
subsequent syncs for a particular slot, and if the number of slots is
huge, the chance of losing the slot data in case of failure is higher.
Starting one worker per db also might not be that efficient, as it will
increase load on the system (both in terms of background workers and
network traffic), especially for a case where the number of dbs is
large. Thus starting at most 'n' workers, where 'n' is decided by a
GUC, and dividing the work/DBs among them looks like a better option to me.
Please see the discussion in and around the email at [1]

[1]: 
https://www.postgresql.org/message-id/CAJpy0uCT%2BnpL4eUvCWiV_MBEri9ixcUgJVDdsBCJSqLd0oD1fQ%40mail.gmail.com

> > I've measured the time it takes for
> > pg_logical_replication_slot_advance with different amounts WAL on my
> > system. It took 2595ms/5091ms/31238ms to advance the slot by
> > 3.7GB/7.3GB/13GB respectively. To put things into perspective here,
> > imagine there are 3 logical slots to sync for a single slot sync
> > worker and each of them are in need of advancing the slot by
> > 3.7GB/7.3GB/13GB of WAL. The slot sync worker gets to slot 1 again
> > after 2595ms+5091ms+31238ms (~40sec), gets to slot 2 again after
> > advance time of slot 1 with amount of WAL that the slot has moved
> > ahead on primary during 40sec, gets to slot 3 again after advance time
> > of slot 1 and slot 2 with amount of WAL that the slot has moved ahead
> > on primary and so on. If WAL generation on the primary is pretty fast,
> > and if the logical slot moves pretty fast on the primary, the time it
> > takes for a single sync worker to sync a slot can increase.
>
> That would be way "faster" and we would probably not need to
> worry that much about the number of "sync" workers (if it/they "just" has/have
> to sync slot's "metadata") as proposed above.
>

Agreed, we need not worry about the delay due to
pg_logical_replication_slot_advance if we are only going to update a
few simple things using the function calls mentioned above.

thanks
Shveta




Re: Use of additional index columns in rows filtering

2023-08-04 Thread Tomas Vondra



On 8/4/23 02:08, Peter Geoghegan wrote:
> On Thu, Aug 3, 2023 at 3:04 PM Tomas Vondra
>  wrote:
>> When you say "index page accesses" do you mean accesses to index pages,
>> or accesses to heap pages from the index scan?
> 
> Yes, that's exactly what I mean. Note that that's the dominant cost
> for the original BitmapOr plan.
> 

Well, I presented multiple options, so "yes" doesn't really clarify
which of them applies. But my understanding is you meant the index page
accesses.

> As I said upthread, the original BitmapOr plan has 7 buffer hits. The
> breakdown is 1 single heap page access, 3 root page accesses, and 3
> leaf page accesses. There is only 1 heap page access because only 1
> out of the 3 index scans that feed into the BitmapOr actually end up
> finding any matching rows in the index.
> 
> In short, the dominant cost here is index page accesses. It's a
> particularly good case for my SAOP patch!
> 

Understood. It makes sense considering the SAOP patch is all about
optimizing the index walk / processing fewer pages.

>> Because my patch is all about reducing the heap pages, which are usually
>> the expensive part of the index scan. But you're right the "index scan"
>> with index filter may access more index pages, because it has fewer
>> "access predicates".
> 
> The fact is that your patch correctly picks the cheapest plan, which
> is kinda like a risky version of the plan that my SAOP patch would
> pick -- it is cheaper for the very same reason. I understand that
> that's not your intention at all, but this is clearly what happened.
> That's what I meant by "weird second order effects".
> 
> To me, it really does kinda look like your patch accidentally
> discovered a plan that's fairly similar to the plan that my SAOP patch
> would have found by design! Perhaps I should have been clearer on this
> point earlier. (If you're only now seeing this for yourself for the
> first time, then...oops. No wonder you were confused about which
> patch it was I was going on about!)
> 

Thanks. I think I now see the relationship between the plan with my
patch and your SAOP patch. It's effectively very similar, except that
the responsibilities are split a bit differently. With my patch the OR
clause happens outside AM, while the SAOP patch would do that in the AM
and also use that to walk the index more efficiently.

>> I don't quite see that with the tenk1 query we've been discussing (the
>> extra buffers were due to non-allvisible heap pages), but I guess that's
>> possible.
> 
> The extra buffer hits occur because I made them occur by inserting new
> tuples where thousand = 42. Obviously, I did it that way because I had
> a point that I wanted to make. Obviously, there wouldn't have been any
> notable regression from your patch at all if I had (say) inserted
> tuples where thousand = 43 instead. (Not for the original "42" query,
> at least.)
> 

Right. I know where the heap accesses come from, but clearly we're able
to skip those when allvisible=true, and we don't need to scan more index
pages either. But I guess we could make the data more complex to make
this part worse (for my patch, while the SAOP would not be affected).

> That's part of the problem, as I see it. Local changes like that can
> have outsized impact on individual queries, even though there is no
> inherent reason to expect it. How can statistics reliably guide the
> planner here? Statistics are supposed to be a summary of the whole
> attribute, that allow us to make various generalizations during
> planning. But this plan leaves us sensitive to relatively small
> changes in one particular "thousand" grouping, with potentially
> outsized impact. And, this can happen very suddenly, because it's so
> "local".
> 
> Making this plan perform robustly just doesn't seem to be one of the
> things that statistics can be expected to help us with very much.
> 

I agree there certainly are cases where the estimates will be off. This
is not that different from correlated columns, in fact it's exactly the
same issue, I think. But it's also not a novel/unique issue - we should
probably do the "usual" thing, i.e. plan based on estimates (maybe with
some safety margin), and have some fallback strategy at runtime.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-04 Thread Christoph Berg
Re: Noah Misch
> On Tue, Jul 25, 2023 at 01:56:41PM +0530, Bharath Rupireddy wrote:
> > I've observed the following failure once in one of my Cirrus CI runs
> > on Windows Server on HEAD:
> > 
> > timed out waiting for match: (?^:User was holding shared buffer pin
> > for too long) at
> > C:/cirrus/src/test/recovery/t/031_recovery_conflict.pl line 318.
> > # Postmaster PID for node "primary" is 696
> > 
> > https://api.cirrus-ci.com/v1/artifact/task/5272062399348736/testrun/build/testrun/recovery/031_recovery_conflict/log/regress_log_031_recovery_conflict
> > 
> > https://github.com/BRupireddy/postgres/runs/15296698158
> 
> Known:
> https://postgr.es/m/flat/20220409045515.35ypjzddp25v72ou%40alap3.anarazel.de
> https://postgr.es/m/flat/CA+hUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO=ney...@mail.gmail.com
> 
> A good next step would be patch "Fix recovery conflict SIGUSR1 handling" from
> https://postgr.es/m/ca+hukg+hi5p1j_8cveqllwnsvyjh4rntf04fywkenxftrh2...@mail.gmail.com
> (near the end of that second thread).

I am also seeing problems with t/031_recovery_conflict.pl, on the
newly added s390x architecture on apt.postgresql.org. The failure is
flaky, but has been popping up in build logs often enough.

I managed to reproduce it on the shell by running the test in a loop a
few times. The failure looks like this:

echo "# +++ tap check in src/test/recovery +++" && rm -rf 
'/home/myon/postgresql/pg/postgresql/build/src/test/recovery'/tmp_check && 
/bin/mkdir -p 
'/home/myon/postgresql/pg/postgresql/build/src/test/recovery'/tmp_check && cd 
/home/myon/postgresql/pg/postgresql/build/../src/test/recovery && 
TESTLOGDIR='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/tmp_check/log'
 
TESTDATADIR='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/tmp_check'
 
PATH="/home/myon/postgresql/pg/postgresql/build/tmp_install/usr/lib/postgresql/17/bin:/home/myon/postgresql/pg/postgresql/build/src/test/recovery:$PATH"
 
LD_LIBRARY_PATH="/home/myon/postgresql/pg/postgresql/build/tmp_install/usr/lib/s390x-linux-gnu"
  PGPORT='65432' 
top_builddir='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/../../..'
 
PG_REGRESS='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/../../../src/test/regress/pg_regress'
 /usr/bin/prove -I /home/myon/postgresql/pg/postgresql/build/../src/test/perl/ 
-I /home/myon/postgresql/pg/postgresql/build/../src/test/recovery --verbose 
t/031_recovery_conflict.pl
# +++ tap check in src/test/recovery +++
t/031_recovery_conflict.pl ..
# issuing query via background psql:
# BEGIN;
# DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM 
test_recovery_conflict_table1;
# FETCH FORWARD FROM test_recovery_conflict_cursor;
ok 1 - buffer pin conflict: cursor with conflicting pin established
ok 2 - buffer pin conflict: logfile contains terminated connection due to 
recovery conflict
ok 3 - buffer pin conflict: stats show conflict on standby
# issuing query via background psql:
# BEGIN;
# DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM 
test_recovery_conflict_table1;
# FETCH FORWARD FROM test_recovery_conflict_cursor;
#
ok 4 - snapshot conflict: cursor with conflicting snapshot established
ok 5 - snapshot conflict: logfile contains terminated connection due to 
recovery conflict
ok 6 - snapshot conflict: stats show conflict on standby
# issuing query via background psql:
# BEGIN;
# LOCK TABLE test_recovery_conflict_table1 IN ACCESS SHARE MODE;
# SELECT 1;
#
ok 7 - lock conflict: conflicting lock acquired
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 7.
Dubious, test returned 255 (wstat 65280, 0xff00)
All 7 subtests passed

Test Summary Report
---
t/031_recovery_conflict.pl (Wstat: 65280 Tests: 7 Failed: 0)
  Non-zero exit status: 255
  Parse errors: No plan found in TAP output
Files=1, Tests=7, 186 wallclock secs ( 0.01 usr  0.00 sys +  0.74 cusr  0.71 
csys =  1.46 CPU)
Result: FAIL
make: *** [Makefile:23: check] Error 1

I.e. the test file just exits after test 7, without running the rest.

Is it dying because it's running into a deadlock instead of some other
expected error message?

timed out waiting for match: (?^:User was holding a relation lock for too long) 
at t/031_recovery_conflict.pl line 318.
2023-08-04 12:20:38.061 UTC [1920073] 031_recovery_conflict.pl FATAL:  
terminating connection due to conflict with recovery

Christoph
2023-08-04 12:20:36.343 UTC [1920003] LOG:  starting PostgreSQL 17devel (Ubuntu 
17~~devel-1) on s390x-ibm-linux-gnu, compiled by gcc (Ubuntu 
11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2023-08-04 12:20:36.343 UTC [1920003] LOG:  listening on Unix socket 
"/tmp/GoFBiPMX9Y/.s.PGSQL.55090"
2023-08-04 12:20:36.345 UTC [1920006] LOG:  database system was shut down at 
2023-08-04 12:20:36 UTC
2023-08-04 12:20:36.351 UTC [1920003] LOG:  database system is ready to accept 

RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-04 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> So, we have three options here (a) As you have done in the patch,
> document this limitation and request user to perform some manual steps
> to drop the subscription; (b) don't allow upgrade to proceed if there
> are invalid slots in the old cluster; (c) provide a new function like
> pg_copy_logical_replication_slot_contents() where we copy the required
> contents like invalid status(ReplicationSlotInvalidationCause), etc.
> 
> Personally, I would prefer (b) because it will minimize the steps
> required to perform by the user after the upgrade and looks cleaner
> solution.
> 
> Thoughts?

Thanks for the suggestion. I agree (b) is better because it does not put users
at risk of data loss. I implemented it locally and it worked well, so I'm
planning to adopt the idea in the next version, if there are no objections.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Using defines for protocol characters

2023-08-04 Thread Dave Cramer
On Fri, 4 Aug 2023 at 03:44, Alvaro Herrera  wrote:

> On 2023-Aug-03, Dave Cramer wrote:
>
> > New patch attached which uses  PREPARED_SUB_COMMAND   and
> > PORTAL_SUB_COMMAND instead
>
> Hmm, I would keep the prefix in this case and make the message type a
> second prefix, with the subtype last -- PQMSG_CLOSE_PREPARED,
> PQMSG_DESCRIBE_PORTAL and so on.
>
I used

#define PQMSG_REQ_PREPARED 'S'
#define PQMSG_REQ_PORTAL 'P'

Duplicating them for CLOSE PREPARED|PORTAL seems awkward

>
> You define PASSWORD and GSS both with 'p', which I think is bogus; in
> some place that isn't doing GSS, you've replaced 'p' with the GSS one
> (CheckSASLAuth).  I think it'd be better to give 'p' a single name, and
> not try to distinguish which is PASSWORD and which is GSS, because
> ultimately it's not important.
>
Done

>
> There are some unpatched places, such as basebackup_copy.c -- there are
> several matches for /'.'/ that correspond to protocol chars in that file.
> Also CopySendEndOfRow has one 'd', and desc.c has two 'C'.
>
> I think fe-trace.c will require further adjustment of the comments for
> each function.  We could change them to be the symbol for each char, like
> so:
>
> /* PQMSG_RESP_READY_FOR_QUERY */
> static void
> pqTraceOutputZ(FILE *f, const char *message, int *cursor)
>
>
Thanks for reviewing see attached for fixes

Dave


0001-Created-protocol.h.patch
Description: Binary data


WIP: new system catalog pg_wait_event

2023-08-04 Thread Drouvot, Bertrand

Hi hackers,

Now that fa88928470 automatically generates the code and documentation
related to wait events, why not expose the wait event descriptions
through a system catalog relation? (The idea was proposed on Twitter
by Yves Colin [1].)

I think that could be useful to:

- join this new relation with pg_stat_activity and have a quick
understanding of what the sessions are waiting for (if any).
- quickly compare the wait events across versions (what are the new
ones if any,..)

Please find attached a POC patch creating this new system catalog
pg_wait_event.

The patch:

- updates the documentation
- adds a new option to generate-wait_event_types.pl to generate the 
pg_wait_event.dat
- creates the pg_wait_event.h
- works with autoconf

It currently does not:

- work with meson (has to be done)
- add tests (not sure it's needed)
- create an index on the new system catalog (not sure it's needed as the data 
fits
in 4 pages (8kB default size)).

Outcome example:

postgres=# select a.pid, a.application_name, a.wait_event,d.description from 
pg_stat_activity a, pg_wait_event d where a.wait_event = d.wait_event_name and 
state='active';
   pid   | application_name | wait_event  |
description
-+--+-+---
 2937546 | pgbench  | WALInitSync | Waiting for a newly initialized WAL 
file to reach durable storage
(1 row)

There is still some work to be done to generate the pg_wait_event.dat file,
especially when the same wait event name can be found in multiple places
(like for example "WALWrite" in IO and LWLock), leading to:

postgres=# select * from pg_wait_event where wait_event_name = 'WALWrite';
 wait_event_name |   description
-+--
 WALWrite| Waiting for a write to a WAL file. Waiting for WAL buffers 
to be written to disk
 WALWrite| Waiting for WAL buffers to be written to disk
(2 rows)

which is obviously not right (we'll probably have to add the wait class name to 
the game).

I'm sharing it now (even if it's still WIP) so that you can start sharing your 
thoughts
about it.

[1]: https://twitter.com/Ycolin/status/1676598065048743948

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From c56ce13af7193bf0d679ec0a7533ab686464f34e Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 1 Aug 2023 03:55:51 +
Subject: [PATCH v1] pg_wait_event

Adding a new shared catalog relation, namely pg_wait_event, that describes the
wait events.

The content is auto-generated with generate-wait_event_types.pl.
---
 doc/src/sgml/catalogs.sgml| 62 ++
 src/backend/catalog/.gitignore|  1 +
 src/backend/catalog/Makefile  | 13 ++--
 src/backend/catalog/catalog.c |  4 +-
 .../activity/generate-wait_event_types.pl | 65 ++-
 src/include/catalog/catversion.h  |  2 +-
 src/include/catalog/pg_wait_event.h   | 45 +
 7 files changed, 184 insertions(+), 8 deletions(-)
  26.7% doc/src/sgml/
  16.5% src/backend/catalog/
  33.6% src/backend/utils/activity/
  23.0% src/include/catalog/

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 307ad88b50..f181a5cedb 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -369,6 +369,11 @@
   pg_user_mapping
   mappings of users to foreign servers
  
+
+ 
+  pg_wait_event
+  wait events description
+ 
 

   
@@ -9668,4 +9673,61 @@ SCRAM-SHA-256$:&l
   
  
 
+
+ 
+  pg_wait_event
+
+  
+   pg_wait_event
+  
+
+  
+   The catalog pg_wait_event stores description about
+   the wait events.
+  
+
+  
+   Unlike most system catalogs, pg_wait_event
+   is shared across all databases of a cluster: there is only one
+   copy of pg_wait_event per cluster, not
+   one per database.
+  
+
+  
+   pg_wait_event Columns
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   wait_event_name text
+  
+  
+   Wait event name
+  
+ 
+
+ 
+  
+   description text
+  
+  
+   Wait event description
+  
+ 
+
+   
+  
+ 
+
 
diff --git a/src/backend/catalog/.gitignore b/src/backend/catalog/.gitignore
index 237ff54165..7f2f3045dc 100644
--- a/src/backend/catalog/.gitignore
+++ b/src/backend/catalog/.gitignore
@@ -4,3 +4,4 @@
 /system_constraints.sql
 /pg_*_d.h
 /bki-stamp
+/pg_wait_event.dat
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index a60107bf94..68a97fbdcd 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -72,7 +72,8 @@ CATALOG_HEADER

"duplicated" wait events

2023-08-04 Thread Drouvot, Bertrand

Hi hackers,

while working on the new system catalog pg_wait_event (see [1]) I noticed that 
some wait
events are currently "duplicated":

postgres=# select wait_event_name,count(*) from pg_wait_event group by 
wait_event_name having count(*) > 1;
 wait_event_name | count
-+---
 SyncRep | 2
 WALWrite| 2
(2 rows)

Indeed:

SyncRep currently appears in "IPC" and "LWLock" (see [2])
WALWrite currently appears in "IO" and "LWLock" (see [2])

I think that can lead to confusion, and it would be better to avoid duplicate
wait event names across wait classes (and so fix the two above). What do you
think?

[1]: 
https://www.postgresql.org/message-id/0e2ae164-dc89-03c3-cf7f-de86378053ac%40gmail.com
[2]: https://www.postgresql.org/docs/current/monitoring-stats.html

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: WIP: new system catalog pg_wait_event

2023-08-04 Thread Tom Lane
"Drouvot, Bertrand"  writes:
> Now that fa88928470 generates automatically code and documentation
> related to wait events, why not exposing the wait events description
> through a system catalog relation?

I think you'd be better off making this a view over a set-returning
function.  The nearby work to allow run-time extensibility of the
set of wait events is not going to be happy with a static catalog.
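
(Roughly this shape, that is; the set-returning function name below is made
up for illustration only:)

CREATE VIEW pg_wait_event AS
    SELECT * FROM pg_get_wait_events();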

regards, tom lane




Release 17 of the PostgreSQL Buildfarm Client

2023-08-04 Thread Andrew Dunstan

I have pushed Release 17 of the PostgreSQL Buildfarm client.

Release 17 has two main features:

 * Modernize the way we do cross-version upgrade tests. Most of the
   logic for modifying instances to make them suitable for cross
   version upgrade testing has now been migrated to the Postgres core
   code in src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm. The new
   code simply imports this module and leverages its knowledge.
 * Support for building with meson. This is only supported on version
   16 or later of Postgres; older branches will continue to use the
   older toolsets. To enable building with meson, there are several
   new settings, illustrated in the sample configuration file:
     o using_meson: this must be set to a true value
     o meson_jobs: this controls the degree of parallelism that
       meson will use
     o meson_test_timeout: this is used to multiply the meson test
       timeout. The default is 3; 0 turns off the timeout
     o meson_config: this is an array of settings for passing to
       meson setup. Note that all options need to be explicitly given
       here - the client disables all "auto" options. This includes use
       of zlib and readline, which do not default to on, unlike
       autoconf setups.

There are also a number of relatively small bug fixes and tweaks (e.g. 
some improvements in processing typedefs).


The release is available at 
 or 



Enjoy!


cheers


andrew



--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Use of additional index columns in rows filtering

2023-08-04 Thread Peter Geoghegan
On Fri, Aug 4, 2023 at 4:47 AM Tomas Vondra
 wrote:
> Well, I presented multiple options, so "yes" doesn't really clarify
> which of them applies. But my understanding is you meant the index pages
> accesses.

Sorry. Your understanding of what I must have meant before was correct
-- your patch picked that plan because it reduced the number of index
page accesses significantly. Just like my SAOP patch would have.

> > In short, the dominant cost here is index page accesses. It's a
> > particularly good case for my SAOP patch!
> >
>
> Understood. It makes sense considering the SAOP patch is all about
> optimizing the index walk / processing fewer pages.

Actually, some of the most compelling cases for my SAOP patch are
those involving heap page access savings, which come from the planner
changes. Basically, the nbtree/executor changes make certain index
access patterns much more efficient. Which is useful in itself, but
often much more useful as an indirect enabler of avoiding heap page
accesses by altering other related aspects of a plan in the planner.
Sometimes by replacing table filter quals with index quals (the nbtree
changes make nbtree a strictly better place for the quals). You know,
somewhat like your patch.

That's really why I'm so interested in your patch, and its
relationship with my own patch, and the BitmapOr issue. If your patch
enables some tricks that are really quite similar to the tricks that
my own patch enables, then delineating which patch does which exact
trick when is surely important for both patches.

I actually started out just thinking about index page accesses, before
eventually coming to understand that heap page accesses were also very
relevant. Whereas you started out just thinking about heap page
accesses, and now see some impact from saving on index page accesses.

> Thanks. I think I now see the relationship between the plan with my
> patch and your SAOP patch. It's effectively very similar, except that
> the responsibilities are split a bit differently. With my patch the OR
> clause happens outside AM, while the SAOP patch would do that in the AM
> and also use that to walk the index more efficiently.

Right. That's where my idea of structuring things so that there is
only one best choice really comes from.

> I agree there certainly are cases where the estimates will be off. This
> is not that different from correlated columns, in fact it's exactly the
> same issue, I think.

In one way I think you're right that it's the same issue -- if you
just focus on that one executor node, then it's the same issue. But I
don't think it's the same issue in a deeper sense, since this is one
case where you simply don't have to accept any risk. We really should
be able to just not ever do this, for a limited though important
subset of cases involving ORs + indexable operators.

-- 
Peter Geoghegan




Re: cataloguing NOT NULL constraints

2023-08-04 Thread Alvaro Herrera
On 2023-Jul-28, Alvaro Herrera wrote:

> To avoid that, one option would be to make this NN constraint
> undroppable ...  but I don't see how.  One option might be to add a
> pg_depend row that links the NOT NULL constraint to its PK constraint.
> But this will be a strange case that occurs nowhere else, since other
> NOT NULL constraint don't have such pg_depend rows.  Also, I won't know
> how pg_dump likes this until I implement it.

I've been completing the implementation for this.  It seems to work
reasonably okay; pg_dump requires somewhat strange contortions, but they
are similar to what we do in flagInhTables already, so I don't feel too
bad about that.

What *is* odd and bothersome is that it also causes a problem dropping
the child table.  For example,

CREATE TABLE parent (a int primary key);
CREATE TABLE child () INHERITS (parent);
\d+ child

                                    Table "public.child"
 Column │  Type   │ Collation │ Nullable │ Default │ Storage │ Compression │ Stats target │ Description
────────┼─────────┼───────────┼──────────┼─────────┼─────────┼─────────────┼──────────────┼─────────────
 a      │ integer │           │ not null │         │ plain   │             │              │
Not null constraints:
    "child_a_not_null" NOT NULL "a"
Inherits: parent
Access method: heap

This is the behavior that I think we wanted, i.e. preventing the drop of the
child constraint, and it seems okay to me:

=# alter table child drop constraint child_a_not_null;
ERROR:  cannot drop constraint child_a_not_null on table child because constraint parent_pkey on table parent requires it
HINT:  You can drop constraint parent_pkey on table parent instead.

But the problem is this:

=# drop table child;
ERROR:  cannot drop table child because other objects depend on it
DETAIL:  constraint parent_pkey on table parent depends on table child
HINT:  Use DROP ... CASCADE to drop the dependent objects too.


To be clear, what my patch is doing is add one new dependency:

                     dep                      │                   ref                   │ deptype
──────────────────────────────────────────────┼─────────────────────────────────────────┼─────────
 type foo                                     │ table foo                               │ i
 table foo                                    │ schema public                           │ n
 constraint foo_pkey on table foo             │ column a of table foo                   │ a
 type bar                                     │ table bar                               │ i
 table bar                                    │ schema public                           │ n
 table bar                                    │ table foo                               │ n
 constraint bar_a_not_null on table bar       │ column a of table bar                   │ a
 constraint child_a_not_null on table child   │ column a of table child                 │ a
 constraint child_a_not_null on table child   │ constraint parent_pkey on table parent  │ i

The last row here is what is new.  I'm not sure what the right fix is;
maybe I need to invert the direction of that dependency.
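
For reference, a rough way to eyeball such rows (this query is just an
assumption of how a listing like the above could be produced; it is not part
of the patch):

SELECT pg_describe_object(classid, objid, objsubid)          AS dep,
       pg_describe_object(refclassid, refobjid, refobjsubid) AS ref,
       deptype
  FROM pg_depend
 WHERE objid >= 16384          -- only user objects (FirstNormalObjectId)
 ORDER BY dep, ref;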


Even with that fixed, I'd still need to write more code so that ALTER
TABLE INHERIT adds the link (I already patched the DROP INHERIT part).
Not sure what else I might be missing.

Separately, I also noticed that some code that's currently in
dropconstraint_internal needs to be moved to DropConstraintById, because
if the PK is dropped for some reason other than ALTER TABLE DROP
CONSTRAINT, some ancillary actions are not taken.

Sigh.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
 Are you not unsure you want to delete Firefox?
   [Not unsure] [Not not unsure][Cancel]
   http://smylers.hates-software.com/2008/01/03/566e45b2.html




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2023-08-04 Thread Melih Mutlu
Hi hackers,


Melih Mutlu wrote on Fri, 16 Jun 2023 at 17:03:

> With this change, here's a query to find how much space used by each
> context including its children:
>
> > WITH RECURSIVE cte AS (
> > SELECT id, total_bytes, id as root, name as root_name
> > FROM memory_contexts
> > UNION ALL
> > SELECT r.id, r.total_bytes, cte.root, cte.root_name
> > FROM memory_contexts r
> > INNER JOIN cte ON r.parent_id = cte.id
> > ),
> > memory_contexts AS (
> > SELECT * FROM pg_backend_memory_contexts
> > )
> > SELECT root as id, root_name as name, sum(total_bytes)
> > FROM cte
> > GROUP BY root, root_name
> > ORDER BY sum DESC;
>

Given that the above query to get total bytes including all children is
still a complex one, I decided to add additional info to
pg_backend_memory_contexts.
The new "path" field is an integer array consisting of the ids of all
parents of the current context. This way it's easier to tell whether a
context is a child of another context, and we don't need recursive
queries to get this info.

Here is how pg_backend_memory_contexts would look with this patch:

postgres=# SELECT name, id, parent, parent_id, path
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC LIMIT 10;
           name           | id  |      parent      | parent_id |     path
--------------------------+-----+------------------+-----------+--------------
 CacheMemoryContext       |  27 | TopMemoryContext |         0 | {0}
 Timezones                | 124 | TopMemoryContext |         0 | {0}
 TopMemoryContext         |   0 |                  |           |
 MessageContext           |   8 | TopMemoryContext |         0 | {0}
 WAL record construction  | 118 | TopMemoryContext |         0 | {0}
 ExecutorState            |  18 | PortalContext    |        17 | {0,16,17}
 TupleSort main           |  19 | ExecutorState    |        18 | {0,16,17,18}
 TransactionAbortContext  |  14 | TopMemoryContext |         0 | {0}
 smgr relation table      |  10 | TopMemoryContext |         0 | {0}
 GUC hash table           | 123 | GUCMemoryContext |       122 | {0,122}
(10 rows)


An example query to calculate the total_bytes including its children for a
context (say CacheMemoryContext) would look like this:

WITH contexts AS (
SELECT * FROM pg_backend_memory_contexts
)
SELECT sum(total_bytes)
FROM contexts
WHERE ARRAY[(SELECT id FROM contexts WHERE name = 'CacheMemoryContext')] <@ path;

We still need to use a CTE since the ids are not persisted and might change
in each run of pg_backend_memory_contexts. Materializing the result can
prevent any inconsistencies due to id changes, and it may also help
performance.
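
For example, something along these lines (assuming the id and path fields
added by the patch) would evaluate the view only once and keep the ids
consistent within the query:

WITH contexts AS MATERIALIZED (
    SELECT * FROM pg_backend_memory_contexts
)
SELECT sum(total_bytes)
  FROM contexts
 WHERE (SELECT id FROM contexts WHERE name = 'CacheMemoryContext') = ANY (path);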

Any thoughts?

Thanks,
-- 
Melih Mutlu
Microsoft


v2-0001-Adding-id-parent_id-into-pg_backend_memory_contex.patch
Description: Binary data


Re: Extract numeric filed in JSONB more effectively

2023-08-04 Thread Chapman Flack
On 2023-08-03 23:55, Andy Fan wrote:
> The double quotes look weird to me, but it looks like a common
> situation.
> 
> select numeric('1'::int); -- failed.
> select "numeric"('1'::int); -- ok.

It arises when you have an object (type, function, cast, whatever) whose
name in the catalog is the same as some SQL-standard keyword that the
parser knows. The same thing happens with the PG type named char, which has
to be spelled "char" in a query because otherwise you get the SQL standard
char type, which is different.
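
A quick way to see the distinction (stock behavior, just for illustration):

SELECT pg_typeof('a'::char);     -- character, the SQL-standard type
SELECT pg_typeof('a'::"char");   -- "char", PostgreSQL's internal one-byte type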

On 2023-08-03 09:50, Andy Fan wrote:
> I don't think this is a key conflict so far, but I'd explain this in more
> detail. If timestamp -> timestamptz or timestamptz -> timestamp were
> binary compatible, ... however, the cast between timestamp and timestamptz
> is not binary compatible; its castmethod is 'f'.

This is one of those cases where the incompatibility is a semantic one.
Both types are the same number of bits and they both represent microseconds
since midnight of the "Postgres epoch", but only timestamptz is anchored
to a time zone (UTC), so unless you happen to live in that time zone, they
mean different things. To just copy the binary bits from one to the other
would be lying (unless you know that the person you are copying them for
lives in UTC).
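
A small illustration of that semantic difference (stock behavior, nothing to
do with the patch):

SET timezone = 'America/New_York';
SELECT '2000-01-01 00:00:00'::timestamp              AS naive,     -- 2000-01-01 00:00:00
       '2000-01-01 00:00:00'::timestamp::timestamptz AS anchored;  -- 2000-01-01 00:00:00-05
-- The cast goes through the session time zone rather than copying bits, so
-- the underlying microsecond counts differ by the zone offset.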

> On Fri, Aug 4, 2023 at 3:13 AM Tom Lane  wrote:
>> Andy Fan  writes:
>> > Can you point to me where the code is for the XML stuff?
>>
>> I think Pavel means XMLTABLE, which IMO is an ugly monstrosity of
>> syntax --- but count on the SQL committee to do it that way :-(.

Another interesting thing about XMLTABLE is that ISO defines its behavior
entirely as a big rewriting into another SQL query built out of XMLQUERY
and XMLCAST functions, and a notional XMLITERATE function that isn't
visible to a user but is a factor of the rewriting. And the definition of
XMLCAST itself is something that is sometimes rewritten to plain CAST, and
sometimes rewritten away. Almost as if they had visions of planner support
functions.

Regards,
-Chap




Re: Use of additional index columns in rows filtering

2023-08-04 Thread Tomas Vondra
On 8/4/23 04:07, Peter Geoghegan wrote:
> On Thu, Aug 3, 2023 at 3:04 PM Tomas Vondra
>  wrote:
>> Because my patch is all about reducing the heap pages, which are usually
>> the expensive part of the index scan. But you're right the "index scan"
>> with index filter may access more index pages, because it has fewer
>> "access predicates".
> 
> It's not so much the unnecessary index page accesses that bother me.
> At least I didn't push that aspect very far when I constructed my
> adversarial test case -- index pages were only a small part of the
> overall problem. (I mean the problem at runtime, in the executor. The
> planner expected to save a small number of leaf page accesses, which
> was kinda, sorta the problem there -- though the planner might have
> technically still been correct about that, and can't have been too far
> wrong in any case.)
> 

Thanks for the clarification, I think I understand better now.

Let me briefly sum my understanding for the two patches:

- The SAOP patch eliminates those heap accesses because it manages to
evaluate all clauses in the AM, including clauses that would previously
be treated as "table filters" and evaluated on the heap row.

- My patch achieves a similar result by evaluating the clauses as index
filters (i.e. on the index tuple). That's probably not as good as proper
access predicates, so it can't help with the index page accesses, but
better than what we had before.

There are a couple more related thoughts later in my reply.

> The real problem that my adversarial case seemed to highlight was a
> problem of extra heap page accesses. The tenk1 table's VM is less than
> one page in size, so how could it have been VM buffer hits? Sure,
> there were no "table filters" involved -- only "index filters". But
> even "index filters" require heap access when the page isn't marked
> all-visible in the VM.
> 

No, the extra accesses were not because of VM buffer hits - it was
because of having to actually fetch the heap tuple for pages that are
not fully visible, which is what happens right after the insert.

The patch does what index-only scans do - before evaluating the
filters on an index tuple, we check if the page is fully visible. If
not, we fetch the heap tuple and evaluate the filters on it.

This means even an index-only scan would behave like this too. And it
goes away as the table gets vacuumed, at which point we can eliminate
the rows using only the index tuple again.

> That problem just cannot happen with a similar plan that eliminates
> the same index tuples within the index AM proper (the index quals
> don't even have to be "access predicates" for this to apply, either).
> Of course, we never need to check the visibility of index tuples just
> to be able to consider eliminating them via nbtree search scan
> keys/index quals -- and so there is never any question of heap/VM
> access for tuples that don't pass index quals. Not so for "index
> filters", where there is at least some chance of accessing the heap
> proper just to be able to eliminate non-matches.
> 

Right. This however begs a question - why would we actually need to
check the visibility map before evaluating the index filter, when the
index tuple alone is clearly good enough for the bitmapOr plan?

Because if we didn't need to do that VM check, this issue with extra
heap accesses would disappear.

I copied this from IOS somewhat blindly, but now I'm starting to
think it was misguided. I thought it was a protection against processing
"invalid" tuples - not tuples broken after a crash (those would be
fixed by crash recovery), but e.g. tuples with schema changes from an
aborted transaction.

But can this actually happen for indexes? For heap it's certainly
possible (BEGIN - ALTER - INSERT - ROLLBACK will leave behind tuples
like that). But we don't support changing indexes like this, right?

And if we had this issue, how come the bitmapOr plan (which ultimately
does the same thing, although entirely in the AM) does not need to do
these VM checks / heap accesses too? It's evaluating essentially the
same conditions on the index tuple ...

So I'm starting to think this is just my misunderstanding of why IOS does
this VM check - it's purely to determine visibility of the result. When
it sees a pointer to a page that is not all-visible, it decides it'll need
to check visibility on the heap tuple anyway, and just fetches the heap
tuple right away. That however ignores that the filters may eliminate
many of those tuples, so IOS can also make such unnecessary heap
accesses. It might be better to check the filters first, and only then
maybe fetch the heap tuple ...

> While I think that it makes sense to assume that "index filters" are
> strictly better than "table filters" (assuming they're directly
> equivalent in that they contain the same clause), they're not
> *reliably* any better. So "index filters" are far from being anywhere
> near as good as an equivalent index qual (AFAICT we should always

Re: Use of additional index columns in rows filtering

2023-08-04 Thread Peter Geoghegan
On Fri, Aug 4, 2023 at 4:34 PM Tomas Vondra
 wrote:
> Thanks for the clarification, I think I understand better now.

Honestly, it's gratifying to be understood at all in a discussion like
this one. Figuring out how to articulate some of my more subtle ideas
(without leaving out important nuances) can be a real struggle for me.

> Let me briefly sum my understanding for the two patches:
>
> - The SAOP patch eliminates those heap accesses because it manages to
> evaluate all clauses in the AM, including clauses that would previously
> be treated as "table filters" and evaluated on the heap row.

Yes, exactly.

> - My patch achieves a similar result by evaluating the clauses as index
> filters (i.e. on the index tuple). That's probably not as good as proper
> access predicates, so it can't help with the index page accesses, but
> better than what we had before.

Yes, exactly.

> No, the extra accesses were not because of VM buffer hits - it was
> because of having to actually fetch the heap tuple for pages that are
> not fully visible, which is what happens right after the insert.

Yes, exactly.

> The patch does what we index-only scans do - before evaluating the
> filters on an index tuple, we check if the page is fully visible. If
> not, we fetch the heap tuple and evaluate the filters on it.
>
> This means even an index-only scan would behave like this too. And it
> goes away as the table gets vacuumed, at which point we can eliminate
> the rows using only the index tuple again.

Yes, exactly.

> Right. This however begs a question - why would we actually need to
> check the visibility map before evaluating the index filter, when the
> index tuple alone is clearly good enough for the bitmapOr plan?
>
> Because if we didn't need to do that VM check, this issue with extra
> heap accesses would disappear.

The index AM is entitled to make certain assumptions of opclass
members -- assumptions that cannot be made during expression
evaluation. The classic example is division-by-zero during evaluation
of a qual, for a tuple that wasn't visible anyway. Our assumption is
that stuff like that just cannot happen with index quals -- users
shouldn't ever encounter sanity check errors caused by
invisible-to-their-MVCC-snapshot tuples.
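
A sketch of that classic hazard (hypothetical table; not code from either
patch):

CREATE TABLE t (a int, b int);
CREATE INDEX ON t (a, b);
-- If "100 / b > 0" were evaluated as an index/table *filter* on tuples that
-- the caller's snapshot cannot even see (say, a dead index entry with b = 0),
-- expression evaluation could raise division-by-zero; true index quals are
-- assumed never to surface such errors for invisible tuples.
SELECT * FROM t WHERE a = 1 AND 100 / b > 0;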

I think that that's the main difficulty, as far as avoiding heap
access for index filters is concerned. Of course, you could just limit
yourself to those cases where the index AM assumptions were safe. But
at that point, why not simply make sure to generate true index quals,
and be done with it?

> I copied this from the IOS somewhat blindly, but now I'm starting to
> think it was misguided. I thought it's a protection against processing
> "invalid" tuples - not tuples broken after a crash (as that would be
> fixed by crash recovery), but e.g. tuples with schema changes from an
> aborted transaction.

I agree that schema changes for indexes shouldn't be an issue, though.

> I'm not quite sure what the differences are between "index qual" vs.
> "access predicate index qual" vs. "index filter predicate index quals",
> or what "displacing" would mean exactly.

For the purposes of this point about "a hierarchy of quals", it
doesn't really matter -- that was the point I was trying to make.

In other words, "index quals" are strictly better than equivalent
"index filters", which are themselves strictly better than equivalent
"table filters". While it is also true that you can meaningfully
classify "index quals" into their own hierarchy (namely access
predicates vs index filter predicates), that doesn't necessarily need
to be discussed when discussing the hierarchy from a planner point of
view, since it is (at least for now) internal to the nbtree index AM.
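
To make that hierarchy concrete (a hypothetical table; which bucket each
clause lands in is of course the planner's call, this just shows the idea):

CREATE TABLE t (a int, b int, c text);
CREATE INDEX ON t (a, b);
SELECT * FROM t
 WHERE a = 1          -- index qual: evaluated inside the index AM
   AND b % 10 = 0     -- at best an "index filter": evaluated on the index tuple
   AND c = 'x';       -- "table filter": needs the heap tuple (c is not indexed)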

On second thought, I tend to doubt that your patch needs to worry
about each type of index qual directly. It probably needs to worry
about index quals in general.

It is always better to make what could be an "index filter" into an
index qual. Of course, the current practical problem for you is
figuring out how to deal with that in cases like the BitmapOr case.
Since it's not as if the planner is wrong, exactly...it really is the
cheapest plan, so the planner is at least right on its own terms. I am
interested in finding a solution to that problem.

> FWIW this also reminds me that this whole discussion mostly focused on
> SAOP clauses (and how they may be translated into access predicates
> etc.). My patch is however way more general - it applies to all clauses,
> not just SAOP ones, including clauses with no chance of evaluating at
> the AM level.

I agree that your patch is way more general, in one way. But the SAOP
patch is also much more general than you might think.

For one thing the whole BitmapOr plan issue actually required that you
compare your patch to a combination of my SAOP patch and Alena
Rybakina's OR-to-SAOP transformation patch -- you needed both patches.
Her patch effectively made my own patch much more general. But the

Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-08-04 Thread Thomas Munro
Here is a rebase over 26669757, which introduced
PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT.

I got a bit confused about why this new conflict reason didn't follow
the usual ERROR->FATAL promotion rules and pinged Andres who provided:
 "Logical decoding slots are only acquired while performing logical
decoding. During logical decoding no user controlled code is run.
During [sub]transaction abort, the slot is released. Therefore user
controlled code cannot intercept an error before the replication slot
is released."  That's included in a comment in the attached to explain
the special treatment.
From df200ac813ce75f07cb98c4b9144a5d82b535efc Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 10 May 2022 16:00:23 +1200
Subject: [PATCH v7] Fix recovery conflict SIGUSR1 handling.

We shouldn't be doing real work in a signal handler, to avoid reaching
code that is not safe in that context.  The previous coding also
confused the 'reason' shown in error messages by clobbering global
variables.  Move all recovery conflict checking logic into the next CFI,
and have the signal handler just set flags and the latch, following the
standard pattern.  Since there are several different reasons, use a
separate flag for each.

With this refactoring, the recovery conflict system no longer
piggy-backs on top of the regular query cancelation mechanisms, but
instead ereports directly if it decides that is necessary.  It still
needs to respect QueryCancelHoldoffCount, because otherwise the FEBE
protocol might be corrupted (see commit 2b3a8b20c2d).

Back-patch to 16.  For now we have agreed not to back-patch this change
any further than that, due to its complexity and the regex changes in
commit bea3d7e that it depends on.

Reviewed-by: Andres Freund 
Reviewed-by: Michael Paquier 
Reviewed-by: Robert Haas 
Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c  |   4 +-
 src/backend/storage/ipc/procsignal.c |  14 +-
 src/backend/tcop/postgres.c  | 333 ++-
 src/include/storage/procsignal.h |   4 +-
 src/include/tcop/tcopprot.h  |   3 +-
 5 files changed, 187 insertions(+), 171 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index df22aaa1c5..55c484a43e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4923,8 +4923,8 @@ LockBufferForCleanup(Buffer buffer)
 }
 
 /*
- * Check called from RecoveryConflictInterrupt handler when Startup
- * process requests cancellation of all pin holders that are blocking it.
+ * Check called from ProcessRecoveryConflictInterrupts() when Startup process
+ * requests cancellation of all pin holders that are blocking it.
  */
 bool
 HoldingBufferPinThatDelaysRecovery(void)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c85cb5cc18..b7427906de 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -662,25 +662,25 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 		HandleParallelApplyMessageInterrupt();
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOCK))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOCK);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOCK);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
 	SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 36cc99ec9c..24f326e99d 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -161,9 +161,8 @@ static bool EchoQuery = false;	/* -E switch */
 static bool UseSemiNewlineNewline = false;	/* -j switch */
 
 /* whether or not, and why, we were canceled by conflict with recovery */
-sta

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-04 Thread Thomas Munro
On Sat, Aug 5, 2023 at 12:43 AM Christoph Berg  wrote:
> I managed to reproduce it on the shell by running the test in a loop a
> few times. The failure looks like this:

It's great that you can reproduce this semi-reliably!  I've rebased
the patch, hoping you can try it out.

https://www.postgresql.org/message-id/CA%2BhUKGJDcyW8Hbq3UyG-5-5Y7WqqOTjrXbFTMxxmhiofFraE-Q%40mail.gmail.com




Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-08-04 Thread Thomas Munro
On Sat, Aug 5, 2023 at 1:39 PM Thomas Munro  wrote:
> Here is a rebase over 26669757, which introduced
> PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT.

Oops, please disregard v7 (somehow lost a precious line of code).  V8 is better.
From 6544931e533aa015f39215f9c9d2df3e06700a96 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 10 May 2022 16:00:23 +1200
Subject: [PATCH v8] Fix recovery conflict SIGUSR1 handling.

We shouldn't be doing real work in a signal handler, to avoid reaching
code that is not safe in that context.  The previous coding also
confused the 'reason' shown in error messages by clobbering global
variables.  Move all recovery conflict checking logic into the next CFI,
and have the signal handler just set flags and the latch, following the
standard pattern.  Since there are several different reasons, use a
separate flag for each.

With this refactoring, the recovery conflict system no longer
piggy-backs on top of the regular query cancelation mechanisms, but
instead ereports directly if it decides that is necessary.  It still
needs to respect QueryCancelHoldoffCount, because otherwise the FEBE
protocol might be corrupted (see commit 2b3a8b20c2d).

Back-patch to 16.  For now we have agreed not to back-patch this change
any further than that, due to its complexity and the regex changes in
commit bea3d7e that it depends on.

Reviewed-by: Andres Freund 
Reviewed-by: Michael Paquier 
Reviewed-by: Robert Haas 
Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c  |   4 +-
 src/backend/storage/ipc/procsignal.c |  14 +-
 src/backend/tcop/postgres.c  | 333 ++-
 src/include/storage/procsignal.h |   4 +-
 src/include/tcop/tcopprot.h  |   3 +-
 5 files changed, 188 insertions(+), 170 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index df22aaa1c5..55c484a43e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4923,8 +4923,8 @@ LockBufferForCleanup(Buffer buffer)
 }
 
 /*
- * Check called from RecoveryConflictInterrupt handler when Startup
- * process requests cancellation of all pin holders that are blocking it.
+ * Check called from ProcessRecoveryConflictInterrupts() when Startup process
+ * requests cancellation of all pin holders that are blocking it.
  */
 bool
 HoldingBufferPinThatDelaysRecovery(void)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c85cb5cc18..b7427906de 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -662,25 +662,25 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 		HandleParallelApplyMessageInterrupt();
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOCK))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOCK);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOCK);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
 	SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 36cc99ec9c..fab976227f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -161,9 +161,8 @@ static bool EchoQuery = false;	/* -E switch */
 static bool UseSemiNewlineNewline = false;	/* -j switch */
 
 /* whether or not, and why, we were canceled by conflict with recovery */
-static bool RecoveryConflictPending = false;
-static bool RecoveryConflictRetryable = true;
-static ProcSignalReason RecoveryConflictReason;
+static volatile sig_atomic_t RecoveryConflictPending = false;
+static volatile sig_atomic_t RecoveryConflictPendingReasons[NUM_PROCSIGNALS];
 
 /* reused buffer to pass to SendRowDescriptionMessage() */
 static MemoryContext row_de

Re: pgbench: allow to exit immediately when any client is aborted

2023-08-04 Thread Tatsuo Ishii
> Hi,
> 
> I would like to propose adding an option to pgbench so that a benchmark
> run can quit immediately when any client is aborted. Currently, when a
> client is aborted due to some error, for example network trouble,
> other clients continue their run until the number of transactions
> specified by -t is reached or the time specified by -T expires. At the
> end, the results are printed, but they are not useful, as the message
> "Run was aborted; the above results are incomplete" shows.

Sounds like a good idea. It's a waste of resources to keep waiting for
unusable benchmark results until t/T expires. If we are watching the
screen, it's easy to cancel the pgbench run, but I frequently let
pgbench run without sitting in front of the screen, especially when t/T
is large (it's recommended to run pgbench with a large enough t/T
to get usable results).

> For precise benchmarking purposes, we would not want to wait for such
> incomplete results; rather, we would like to know that some trouble
> happened so we can retry quickly. Therefore, it would be nice to add an
> option that makes pgbench exit instead of continuing to run the other
> clients when any client is aborted. I think adding an option is better
> than changing the behaviour outright, because some users who use pgbench
> just to put stress on backends for testing purposes rather than for
> benchmarking might not want pgbench to stop even if a client is aborted.
> 
> Attached is the patch to add the option --exit-on-abort.
> If this option is specified, when any client is aborted, pgbench
> immediately quits by calling exit(2).
> 
> What do you think about it?

I think aborting pgbench by calling exit(2) is enough. It's not worth
the trouble to add more code for this purpose.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp