Re: SQL JSON compliance

2022-04-29 Thread Peter Eisentraut

On 13.04.22 22:43, Andrew Dunstan wrote:

Simon has just pointed out to me that as a result of recent commits, a
number of things now should move from the unsupported table to the
supported table in features.sgml. In particular, it looks to me like all
of these should move:


This all looks correct to me.  Please go ahead.


T811           Basic SQL/JSON constructor functions
T812           SQL/JSON: JSON_OBJECTAGG
T813           SQL/JSON: JSON_ARRAYAGG with ORDER BY
T814           Colon in JSON_OBJECT or JSON_OBJECTAGG
T821           Basic SQL/JSON query operators
T822           SQL/JSON: IS JSON WITH UNIQUE KEYS predicate
T823           SQL/JSON: PASSING clause
T824           JSON_TABLE: specific PLAN clause
T825           SQL/JSON: ON EMPTY and ON ERROR clauses
T826           General value expression in ON ERROR or ON EMPTY clauses
T827           JSON_TABLE: sibling NESTED COLUMNS clauses
T828           JSON_QUERY
T829           JSON_QUERY: array wrapper options
T830           Enforcing unique keys in SQL/JSON constructor functions
T838           JSON_TABLE: PLAN DEFAULT clause
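
For reference, a few of the listed items in action, using the SQL/JSON syntax
as committed at the time (an illustrative sketch, not part of the original list):

-- T811/T814: basic constructor with colon syntax
SELECT JSON_OBJECT('a': 1, 'b': 2);
-- T813: JSON_ARRAYAGG with ORDER BY
SELECT JSON_ARRAYAGG(g ORDER BY g DESC) FROM generate_series(1, 3) AS g;
-- T822: IS JSON WITH UNIQUE KEYS predicate
SELECT '{"a": 1, "a": 2}' IS JSON WITH UNIQUE KEYS;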





Re: Re: fix cost subqueryscan wrong parallel cost

2022-04-29 Thread Richard Guo
On Fri, Apr 29, 2022 at 12:53 AM Robert Haas  wrote:

> Gather doesn't require a parallel aware subpath, just a parallel-safe
> subpath. In a case like this, the parallel seq scan will divide the
> rows from the underlying relation across the three processes executing
> it. Each process will pass the rows it receives through its own copy
> of the subquery scan. Then, the Gather node will collect all the rows
> from all the workers to produce the final result.
>
> It's an extremely important feature of parallel query that the
> parallel-aware node doesn't have to be immediately beneath the Gather.
> You need to have a parallel-aware node in there someplace, but it
> could be separated from the gather by any number of levels e.g.
>
> Gather
> -> Nested Loop
>    -> Nested Loop
>       -> Nested Loop
>          -> Parallel Seq Scan
>          -> Index Scan
>       -> Index Scan
>    -> Index Scan
>

Thanks for the explanation. That's really helpful for understanding the
parallel query mechanism.

So for the nodes between Gather and parallel-aware node, how should we
calculate their estimated rows?

Currently subquery scan is using rel->rows (if no parameterization),
which I believe is not correct. That's not the size the subquery scan
node in each worker needs to handle, as the rows have been divided
across workers by the parallel-aware node.

Using subpath->rows is not correct either, as subquery scan node may
have quals.

It seems to me the right way is to divide the rel->rows among all the
workers.

Thanks
Richard


Re: Multi-Master Logical Replication

2022-04-29 Thread Peter Smith
On Fri, Apr 29, 2022 at 2:16 PM Yura Sokolov  wrote:
>
> On Thu, 28/04/2022 at 17:37 +0530, vignesh C wrote:
> > On Thu, Apr 28, 2022 at 4:24 PM Yura Sokolov  
> > wrote:
> > > On Thu, 28/04/2022 at 09:49 +1000, Peter Smith wrote:
> > >
> > > > 1.1 ADVANTAGES OF MMLR
> > > >
> > > > - Increases write scalability (e.g., all nodes can write arbitrary 
> > > > data).
> > >
> > > I've never heard how transaction-aware multimaster increases
> > > write scalability. Moreover, usually even non-transactional
> > > multimaster doesn't increase write scalability. At best it
> > > doesn't decrease it.
> > >
> > > That is because all hosts have to write all changes anyway. But
> > > side costs increase due to increased network interchange and
> > > interlocking (for transaction-aware MM) and increased latency.
> >
> > I agree it won't increase in all cases, but it will be better in a few
> > cases when the user works on different geographical regions operating
> > on independent schemas in asynchronous mode. Since the write node is
> > closer to the geographical zone, the performance will be better in a
> > few cases.
>
> From the EnterpriseDB BDR page [1]:
>
> > Adding more master nodes to a BDR Group does not result in
> > significant write throughput increase when most tables are
> > replicated because BDR has to replay all the writes on all nodes.
> > Because BDR writes are in general more effective than writes coming
> > from Postgres clients via SQL, some performance increase can be
> > achieved. Read throughput generally scales linearly with the number
> > of nodes.
>
> And I'm sure EnterpriseDB does the best.
>
> > > On Thu, 28/04/2022 at 08:34 +0000, kuroda.hay...@fujitsu.com wrote:
> > > > Dear Laurenz,
> > > >
> > > > Thank you for your interest in our works!
> > > >
> > > > > I am missing a discussion how replication conflicts are handled to
> > > > > prevent replication from breaking
> > > >
> > > > Actually we don't have plans for developing the feature that avoids 
> > > > conflict.
> > > > We think that it should be done as core PUB/SUB feature, and
> > > > this module will just use that.
> > >
> > > If you really want to have some proper isolation levels (
> > > Read Committed? Repeatable Read?) and/or want to have
> > > same data on each "master", there is no easy way. If you
> > > think it will be "easy", you are already wrong.
> >
> > The synchronous_commit and synchronous_standby_names configuration
> > parameters will help in getting the same data across the nodes. Can
> > you give an example for the scenario where it will be difficult?
>
> So, synchronous or asynchronous?
> Synchronous commit on every master, every alive master or on quorum
> of masters?
>
> And it is not about synchronicity. It is about determinism at
> conflicts.
>
> If you have fully deterministic conflict resolution that works
> exactly the same way on each host, then it is possible to have the same
> data on each host. (But it will not be transactional.) And it seems EDB BDR
> has achieved this.
>
> Or if you have fully and correctly implemented one of distributed
> transactions protocols.
>
> [1]  
> https://www.enterprisedb.com/docs/bdr/latest/overview/#characterising-bdr-performance
>
> regards
>
> --
>
> Yura Sokolov

Thanks for your feedback.

This MMLR proposal was mostly just to create an interface making it
easier to use PostgreSQL core logical replication CREATE
PUBLICATION/SUBSCRIPTION for table sharing among a set of nodes.
Otherwise, this is difficult for a user to do manually. (e.g.
difficulties as mentioned in section 2.2 of the original post [1] -
dealing with initial table data, coordinating the timing/locking to
avoid concurrent updates, getting the SUBSCRIPTION options for
copy_data exactly right etc)
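
To make that concrete, here is a minimal two-node sketch of the kind of setup
the proposed interface would wrap. It assumes the local_only subscription
option from the related "infinite recursion" patch set (not yet in core);
connection strings and names are illustrative only:

-- On node1:
CREATE PUBLICATION pub_node1 FOR ALL TABLES;
-- On node2:
CREATE PUBLICATION pub_node2 FOR ALL TABLES;
-- On node1, subscribe to node2:
CREATE SUBSCRIPTION sub_node1_node2 CONNECTION 'dbname=foo host=node2 user=repuser'
  PUBLICATION pub_node2 WITH (copy_data = off, local_only = on);
-- On node2, subscribe to node1:
CREATE SUBSCRIPTION sub_node2_node1 CONNECTION 'dbname=foo host=node1 user=repuser'
  PUBLICATION pub_node1 WITH (copy_data = off, local_only = on);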

At this time we have no provision for HA, nor for transaction
consistency awareness, conflict resolutions, node failure detections,
DDL replication etc. Some of the features like DDL replication are
currently being implemented [2], so when committed it will become
available in the core, and can then be integrated into this module.

Once the base feature of the current MMLR proposal is done, perhaps it
can be extended in subsequent versions.

Probably our calling this “Multi-Master” has been
misleading/confusing, because that term implies much more to other
readers. We really only intended it to mean the ability to set up
logical replication across a set of nodes. Of course, we can rename
the proposal (and API) to something different if there are better
suggestions.

--
[1] 
https://www.postgresql.org/message-id/CAHut%2BPuwRAoWY9pz%3DEubps3ooQCOBFiYPU9Yi%3DVB-U%2ByORU7OA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/45d0d97c-3322-4054-b94f-3c08774bbd90%40www.fastmail.com#db6e810fc93f17b0a5585bac25fb3d4b

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-04-29 Thread Bharath Rupireddy
On Sun, Feb 27, 2022 at 12:50 PM Bharath Rupireddy
 wrote:
>
> On Wed, Dec 29, 2021 at 6:50 AM Bharath Rupireddy
>  wrote:
> >
> > On Thu, Dec 9, 2021 at 9:28 PM Alvaro Herrera  
> > wrote:
> > >
> > > Maybe some tunable like
> > > log_wal_traffic = { none, medium, high }
> > > where "none" is current behavior of no noise, "medium" gets (say) once
> > > every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets
> > > you one message per segment.
> >
> > On Fri, Dec 24, 2021 at 7:19 PM Justin Pryzby  wrote:
> > >
> > > If you're talking about a new feature that uses the infrastructre from 
> > > 9ce3,
> > > but is controlled by a separate GUC like log_wal_traffic, that could be 
> > > okay.
> >
> > Thanks for the suggestion. I've added a new GUC log_wal_traffic as
> > suggested above. PSA  v7 patch.
>
> Here's the rebased v8 patch, please review it.
>
> https://commitfest.postgresql.org/37/3443/

Here's the rebased v9 patch.
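
For anyone trying the patch, a sketch of how the proposed setting might be
used (log_wal_traffic and its values come from this patch series, not from
core):

ALTER SYSTEM SET log_wal_traffic = 'medium';
SELECT pg_reload_conf();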

Regards,
Bharath Rupireddy.


v9-0001-Add-WAL-recovery-messages-with-log_wal_traffic-GU.patch
Description: Binary data


add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)

2022-04-29 Thread Bharath Rupireddy
On Sun, Feb 27, 2022 at 3:16 PM Bharath Rupireddy
 wrote:
>
> On Mon, Jan 10, 2022 at 6:50 AM Bharath Rupireddy
>  wrote:
> >
> > On Wed, Jan 5, 2022 at 12:13 PM Kyotaro Horiguchi  
> > wrote:
> > > > Here's the v2 patch. Please provide your thoughts.
> > >
> > > Thanks!  I have three comments on this version.
> >
> > Thanks for your review.
> >
> > > - I still think "acquire/release" logs on creation/dropping should be
> > >   silenced.  Adding the third parameter to ReplicationSlotAcquire()
> > >   that can mute the acquiring (and as well as the corresponding
> > >   releasing) log will work.
> >
> > Done.
> >
> > > can piggy-back on log_replication_commands for the purpose, changing
> > > its meaning slightly to "log replication commands and related
> > > activity".
> > > - Need to mute the logs by log_replication_commands.  (We could add
> > >   another choice for the variable for this purpose but I think we
> > >   don't need it.)
> >
> > Done.
> >
> > > - The messages are not translatable as the variable parts are
> > >   adjectives. They need to consist of static sentences.  The
> > >   combinations of the two properties are 6 (note that persistence is
> > >   tristate) but I don't think we accept that complexity for the
> > >   information.  So I recommend to just remove the variable parts from
> > >   the messages.
> >
> > Done.
> >
> > Here's v3, please review it further.
>
> Here's the rebased v4 patch.

Here's the rebased v5 patch.

Regards,
Bharath Rupireddy.


v5-0001-Add-LOG-messages-when-replication-slots-become-ac.patch
Description: Binary data


Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)

2022-04-29 Thread Alvaro Herrera
If I read the code right, this patch emits logs when
pg_logical_slot_get_changes and pg_replication_slot_advance SQL
functions are called.  Is this desirable/useful, for the use case you
stated at start of thread?  I think it is most likely pointless.  If you
get rid of those, then the only acquisitions that would log messages are
those in StartReplication and StartLogicalReplication.  So I wonder if
it would be better to leave the API of ReplicationSlotAcquire() alone,
and instead make StartReplication and StartLogicalReplication
responsible for those messages.
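
For concreteness, the SQL-level calls in question are these; each acquires
and then releases the slot within a single function call ('myslot' is just a
placeholder name):

SELECT * FROM pg_logical_slot_get_changes('myslot', NULL, NULL);
SELECT pg_replication_slot_advance('myslot', pg_current_wal_lsn());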

I didn't look at the release-side messages you're adding, but I suppose
it should be symmetrical.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)




Re: Multi-Master Logical Replication

2022-04-29 Thread vignesh C
On Fri, Apr 29, 2022 at 2:35 PM Peter Smith  wrote:
>
> On Fri, Apr 29, 2022 at 2:16 PM Yura Sokolov  wrote:
> >
> > On Thu, 28/04/2022 at 17:37 +0530, vignesh C wrote:
> > > On Thu, Apr 28, 2022 at 4:24 PM Yura Sokolov  
> > > wrote:
> > > On Thu, 28/04/2022 at 09:49 +1000, Peter Smith wrote:
> > > >
> > > > > 1.1 ADVANTAGES OF MMLR
> > > > >
> > > > > - Increases write scalability (e.g., all nodes can write arbitrary 
> > > > > data).
> > > >
> > > > I've never heard how transaction-aware multimaster increases
> > > > write scalability. Moreover, usually even non-transactional
> > > > multimaster doesn't increase write scalability. At best it
> > > > doesn't decrease it.
> > > >
> > > > That is because all hosts have to write all changes anyway. But
> > > > side costs increase due to increased network interchange and
> > > > interlocking (for transaction-aware MM) and increased latency.
> > >
> > > I agree it won't increase in all cases, but it will be better in a few
> > > cases when the user works on different geographical regions operating
> > > on independent schemas in asynchronous mode. Since the write node is
> > > closer to the geographical zone, the performance will be better in a
> > > few cases.
> >
> > From the EnterpriseDB BDR page [1]:
> >
> > > Adding more master nodes to a BDR Group does not result in
> > > significant write throughput increase when most tables are
> > > replicated because BDR has to replay all the writes on all nodes.
> > > Because BDR writes are in general more effective than writes coming
> > > from Postgres clients via SQL, some performance increase can be
> > > achieved. Read throughput generally scales linearly with the number
> > > of nodes.
> >
> > And I'm sure EnterpriseDB does the best.
> >
> > > > On Thu, 28/04/2022 at 08:34 +0000, kuroda.hay...@fujitsu.com wrote:
> > > > > Dear Laurenz,
> > > > >
> > > > > Thank you for your interest in our works!
> > > > >
> > > > > > I am missing a discussion how replication conflicts are handled to
> > > > > > prevent replication from breaking
> > > > >
> > > > > Actually we don't have plans for developing the feature that avoids 
> > > > > conflict.
> > > > > We think that it should be done as core PUB/SUB feature, and
> > > > > this module will just use that.
> > > >
> > > > If you really want to have some proper isolation levels (
> > > > Read Committed? Repeatable Read?) and/or want to have
> > > > same data on each "master", there is no easy way. If you
> > > > think it will be "easy", you are already wrong.
> > >
> > > The synchronous_commit and synchronous_standby_names configuration
> > > parameters will help in getting the same data across the nodes. Can
> > > you give an example for the scenario where it will be difficult?
> >
> > So, synchronous or asynchronous?
> > Synchronous commit on every master, every alive master or on quorum
> > of masters?
> >
> > And it is not about synchronicity. It is about determinism at
> > conflicts.
> >
> > If you have fully deterministic conflict resolution that works
> > exactly the same way on each host, then it is possible to have the same
> > data on each host. (But it will not be transactional.) And it seems EDB BDR
> > has achieved this.
> >
> > Or if you have fully and correctly implemented one of distributed
> > transactions protocols.
> >
> > [1]  
> > https://www.enterprisedb.com/docs/bdr/latest/overview/#characterising-bdr-performance
> >
> > regards
> >
> > --
> >
> > Yura Sokolov
>
> Thanks for your feedback.
>
> This MMLR proposal was mostly just to create an interface making it
> easier to use PostgreSQL core logical replication CREATE
> PUBLICATION/SUBSCRIPTION for table sharing among a set of nodes.
> Otherwise, this is difficult for a user to do manually. (e.g.
> difficulties as mentioned in section 2.2 of the original post [1] -
> dealing with initial table data, coordinating the timing/locking to
> avoid concurrent updates, getting the SUBSCRIPTION options for
> copy_data exactly right etc)

The various problems and how to solve each scenario are described in detail in [1].
It gets even more complex when more nodes are involved; let's
consider the 3-node case:
To add a new node node3 to the existing nodes node1 and node2 when data is
already present on node1 and node2, the following steps are
required:
Create a publication in node3:
CREATE PUBLICATION pub_node3 for all tables;

Create a subscription in node1 to subscribe the changes from node3:
CREATE SUBSCRIPTION sub_node1_node3 CONNECTION 'dbname=foo host=node3
user=repuser' PUBLICATION pub_node3 WITH (copy_data = off, local_only
= on);

Create a subscription in node2 to subscribe the changes from node3:
CREATE SUBSCRIPTION sub_node2_node3 CONNECTION 'dbname=foo host=node3
user=repuser' PUBLICATION pub_node3 WITH (copy_data = off, local_only
= on);

Lock the database at node2 and wait until the walsender sends WAL to node1 (up to
the current lsn) to avoid any data loss because of node2's WAL not being
sent to node1. This lock needs 

Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-04-29 Thread Alvaro Herrera
Did we ever consider the idea of using a new pg_stat_wal_activity_progress
view or something like that, using the backend_progress.c functionality?
I don't see it mentioned in the thread.

It's not an *exact* fit, since this is not about one "command" being
executed by a "backend", but it seems a near enough fit, and it would be
unobtrusive enough that we don't need to worry about the progress-update
rate or have a GUC to determine how frequently to update (with a gauge
that's heavily workload-dependent and is likely to make little sense to
some users -- consider differently-sized WAL segments, for one thing).

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Handle infinite recursion in logical replication setup

2022-04-29 Thread vignesh C
On Fri, Apr 29, 2022 at 8:08 AM Peter Smith  wrote:
>
> Here are my review comments for the v11* patches.
>
> ==
>
> v11-0001 - no more comments. LGTM
>
> ==
>
> V11-0002
>
> 1. doc/src/sgml/logical-replication.sgml
>
> +   
> + Bidirectional replication is useful in creating multi master database
> + which helps in performing read/write operations from any of the nodes.
> + Setting up bidirectional logical replication between two nodes requires
> + creation of the publication in all the nodes, creating subscriptions in
> + each of the nodes that subscribes to data from all the nodes. The steps
> + to create a two-node bidirectional replication are given below:
> +   
>
> Wording: "creating multi master database" -> "creating a multi-master 
> database"
> Wording: "creation of the publication in all the nodes" -> "creation
> of a publication in all the nodes".

Modified

> ~~~
>
> 2. doc/src/sgml/logical-replication.sgml
>
> > +
> > +node2=# CREATE SUBSCRIPTION sub_node1_node2
> > +node2=# CONNECTION 'dbname=foo host=node1 user=repuser'
> > +node2=# PUBLICATION pub_node1
> > +node2=# WITH (copy_data = off, local_only = on);
> > +CREATE SUBSCRIPTION
> > +
> > + 
> >
> > IIUC the closing  should be on the same line as the
> > . I recall there was some recent github push about
> > this sometime in the last month - maybe you can check to confirm it.
>
> Vignesh: I have seen the programlisting across multiple places and noticed
> that there is no such guideline being followed. I have not made any
> change for this comment.
>
> FYI – I found the push [1] by PeterE that I was referring to so. I
> thought we perhaps should follow this, even if not all other code is
> doing so. But you can choose.

Modified

> ~~~
>
> 3. doc/src/sgml/logical-replication.sgml
>
> +   
> +Create a subscription in node3 to subscribe the 
> changes
> +from node1 and node2, here
> +copy_data is specified as force 
> when
> +creating a subscription to node1 so that the existing
> +table data is copied during initial sync:
> +
> +node3=# CREATE SUBSCRIPTION
> +node3-# sub_node3_node1 CONNECTION 'dbname=foo host=node1 user=repuser'
> +node3-# PUBLICATION pub_node1
> +node3-# WITH (copy_data = force, local_only = on);
> +CREATE SUBSCRIPTION
> +node3=# CREATE SUBSCRIPTION
> +node3-# sub_node3_node2 CONNECTION 'dbname=foo host=node2 user=repuser'
> +node3-# PUBLICATION pub_node2
> +node3-# WITH (copy_data = off, local_only = on);
> +CREATE SUBSCRIPTION
> +
>
> I think this part should be split into 2 separate program listings
> each for the different subscriptions. Then those descriptions can be
> described separately (e.g. why one is force and the other is not).
> Then it will also be more consistent with how you split/explained
> something similar in the previous section on this page.

Modified
The attached v12 patch includes these changes.

Regards,
Vignesh
From 28a94a0e335241f8cd60adb4cdeb359c69cecbfe Mon Sep 17 00:00:00 2001
From: Vigneshwaran C 
Date: Fri, 8 Apr 2022 11:10:05 +0530
Subject: [PATCH v12 1/2] Skip replication of non local data.

This patch adds a new SUBSCRIPTION boolean option
"local_only". The default is false. When a SUBSCRIPTION is
created with this option enabled, the publisher will only publish data
that originated at the publisher node.
Usage:
CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres port='
PUBLICATION pub1 with (local_only = true);
---
 doc/src/sgml/ref/alter_subscription.sgml  |   5 +-
 doc/src/sgml/ref/create_subscription.sgml |  12 ++
 src/backend/catalog/pg_subscription.c |   1 +
 src/backend/catalog/system_views.sql  |   4 +-
 src/backend/commands/subscriptioncmds.c   |  26 ++-
 .../libpqwalreceiver/libpqwalreceiver.c   |   5 +
 src/backend/replication/logical/worker.c  |   2 +
 src/backend/replication/pgoutput/pgoutput.c   |  23 +++
 src/bin/pg_dump/pg_dump.c |  17 +-
 src/bin/pg_dump/pg_dump.h |   1 +
 src/bin/psql/describe.c   |   8 +-
 src/bin/psql/tab-complete.c   |   4 +-
 src/include/catalog/pg_subscription.h |   3 +
 src/include/replication/logicalproto.h|  10 +-
 src/include/replication/pgoutput.h|   1 +
 src/include/replication/walreceiver.h |   1 +
 src/test/regress/expected/subscription.out| 142 ---
 src/test/regress/sql/subscription.sql |  10 ++
 src/test/subscription/t/032_localonly.pl  | 162 ++
 19 files changed, 365 insertions(+), 72 deletions(-)
 create mode 100644 src/test/subscription/t/032_localonly.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 353ea5def2..c5ebcf5500 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -207,8 +207,9 @@ ALTER SUBSCRIPTION name RENAME TO <
   information.  The parameters that can be a

Re: Skipping schema changes in publication

2022-04-29 Thread vignesh C
On Thu, Apr 28, 2022 at 4:50 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Wednesday, April 27, 2022 9:50 PM vignesh C  wrote:
> > Thanks for the comments, the attached v3 patch has the changes for the same.
> Hi
>
> Thank you for updating the patch. Several minor comments on v3.
>
> (1) commit message
>
> The new syntax allows specifying schemas. For example:
> CREATE PUBLICATION pub1 FOR ALL TABLES EXCEPT TABLE t1,t2;
> OR
> ALTER PUBLICATION pub1 ADD EXCEPT TABLE t1,t2;
>
> We have above sentence, but it looks better
> to make the description a bit more accurate.
>
> Kindly change
> From :
> "The new syntax allows specifying schemas"
> To :
> "The new syntax allows specifying excluded relations"
>
> Also, kindly change "OR" to "or",
> because this description is not syntax.

Slightly reworded and modified

> (2) publication_add_relation
>
> @@ -396,6 +400,9 @@ publication_add_relation(Oid pubid, PublicationRelInfo 
> *pri,
> ObjectIdGetDatum(pubid);
> values[Anum_pg_publication_rel_prrelid - 1] =
> ObjectIdGetDatum(relid);
> +   values[Anum_pg_publication_rel_prexcept - 1] =
> +   BoolGetDatum(pri->except);
> +
>
> /* Add qualifications, if available */
>
> It would be better to remove the blank line,
> because with this change, we'll have two blank
> lines in a row.

Modified

> (3) pg_dump.h & pg_dump_sort.c
>
> @@ -80,6 +80,7 @@ typedef enum
> DO_REFRESH_MATVIEW,
> DO_POLICY,
> DO_PUBLICATION,
> +   DO_PUBLICATION_EXCEPT_REL,
> DO_PUBLICATION_REL,
> DO_PUBLICATION_TABLE_IN_SCHEMA,
> DO_SUBSCRIPTION
>
> @@ -90,6 +90,7 @@ enum dbObjectTypePriorities
> PRIO_FK_CONSTRAINT,
> PRIO_POLICY,
> PRIO_PUBLICATION,
> +   PRIO_PUBLICATION_EXCEPT_REL,
> PRIO_PUBLICATION_REL,
> PRIO_PUBLICATION_TABLE_IN_SCHEMA,
> PRIO_SUBSCRIPTION,
> @@ -144,6 +145,7 @@ static const int dbObjectTypePriority[] =
> PRIO_REFRESH_MATVIEW,   /* DO_REFRESH_MATVIEW */
> PRIO_POLICY,/* DO_POLICY */
> PRIO_PUBLICATION,   /* DO_PUBLICATION */
> +   PRIO_PUBLICATION_EXCEPT_REL,/* DO_PUBLICATION_EXCEPT_REL */
> PRIO_PUBLICATION_REL,   /* DO_PUBLICATION_REL */
> PRIO_PUBLICATION_TABLE_IN_SCHEMA,   /* 
> DO_PUBLICATION_TABLE_IN_SCHEMA */
> PRIO_SUBSCRIPTION   /* DO_SUBSCRIPTION */
>
> How about having similar order between
> pg_dump.h and pg_dump_sort.c, like
> we'll add DO_PUBLICATION_EXCEPT_REL
> after DO_PUBLICATION_REL in pg_dump.h ?
>

Modified

> (4) GetAllTablesPublicationRelations
>
> +   /*
> +* pg_publication_rel and pg_publication_namespace  will only have 
> except
> +* tables in case of all tables publication, no need to pass except 
> flag
> +* to get the relations.
> +*/
> +   List   *exceptpubtablelist = GetPublicationRelations(pubid, 
> PUBLICATION_PART_ALL);
> +
>
> There is one unnecessary space in a comment
> "...pg_publication_namespace  will only have...". Kindly remove it.
>
> Then, how about dividing the variable declaration and
> the insertion of the return value of GetPublicationRelations?
> That might be aligned with other places in this file.

Modified

> (5) GetTopMostAncestorInPublication
>
>
> @@ -302,8 +303,9 @@ GetTopMostAncestorInPublication(Oid puboid, List 
> *ancestors, int *ancestor_level
> foreach(lc, ancestors)
> {
> Oid ancestor = lfirst_oid(lc);
> -   List   *apubids = GetRelationPublications(ancestor);
> +   List   *apubids = GetRelationPublications(ancestor, 
> false);
> List   *aschemaPubids = NIL;
> +   List   *aexceptpubids;
>
> level++;
>
> @@ -317,7 +319,9 @@ GetTopMostAncestorInPublication(Oid puboid, List 
> *ancestors, int *ancestor_level
> else
> {
> aschemaPubids = 
> GetSchemaPublications(get_rel_namespace(ancestor));
> -   if (list_member_oid(aschemaPubids, puboid))
> +   aexceptpubids = GetRelationPublications(ancestor, 
> true);
> +   if (list_member_oid(aschemaPubids, puboid) ||
> +   (puballtables && 
> !list_member_oid(aexceptpubids, puboid)))
> {
> topmost_relid = ancestor;
>
> It seems we forgot to call list_free for "aexceptpubids".

Modified

The attached v4 patch includes these changes.
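
For readers skimming the thread, the syntax under review is (proposed by this
patch series; not in core, and table names are placeholders):

CREATE PUBLICATION pub1 FOR ALL TABLES EXCEPT TABLE t1, t2;
ALTER PUBLICATION pub1 ADD EXCEPT TABLE t1, t2;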

Regards,
Vignesh
From b900c9468bc5c9deb3412361c7a7891352b0ffb6 Mon Sep 17 00:00:00 2001
From: Vigneshwaran C 
Date: Wed, 20 Apr 2022 11:19:50 +0530
Subject: [PATCH v4] Skip publishing the tables specified in EXCEPT TABLE.

A new option "EXCEPT TABLE" in Create/Alter Publication allows
one or more tables to be excl

Re: pg_walcleaner - new tool to detect, archive and delete the unneeded wal files (was Re: pg_archivecleanup - add the ability to detect, archive and delete the unneeded wal files on the primary)

2022-04-29 Thread Bharath Rupireddy
On Tue, Apr 26, 2022 at 12:09 AM Stephen Frost  wrote:
>
> I was thinking more specifically along the lines of "if there's > X GB
> of WAL that hasn't been archived, give up on archiving anything new"
> (which is how the pgbackrest option works).

IMO, this option is dangerous because users might lose the
unarchived WAL files completely. Instead, the archive_command should be
fixed either before the server runs out of disk space (if someone or some
monitoring detects the archive_command failure) or after the server
crashes with no space left.

> As archiving with this command is optional, it does present the same
> risk too.  Perhaps if we flipped it around to require the
> archive_command be provided then it'd be a bit better, though we would
> also need a way for users to ask for us to just delete the WAL without
> archiving it.  There again though ... the server already has a way of
> both archiving and removing archived WAL and also has now grown the
> archive_library option, something that this tool would be pretty hard to
> replicate, I feel like, as it wouldn't be loading the library into a PG
> backend anymore.  As we don't have any real archive libraries yet, it's
> hard to say if that's going to be an actual issue or not.  Something to
> consider though.

Actually, why are we assuming that WAL files grow only because of
archive_command failures (of course, that's the main cause even in a
well-configured server)? Imagine that checkpoints weren't happening
frequently for whatever reason, or that max_slot_wal_keep_size or
wal_keep_size weren't set appropriately (even if they are set properly,
someone -- possibly an attacker -- can reset them), or that high write
activity is generating huge amounts of WAL (especially on smaller
servers) - these too can be reasons for the server going down because of
WAL growth.

The main purpose of this tool is for DBAs or service engineers to use
it as a last-resort option after the server runs out of disk space, to
quickly bring the server back online so that the regular cleanup
activity can take place or storage scaling operations can be
performed if required. There's a dry-run option in pg_walcleaner to
see if it can free up some space at all. If required, we can provide an
option to archive and delete at most a specified number of
WAL files.
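
As a side note, while the server is still up one can get a rough idea of how
much WAL is sitting in pg_wal before reaching for an external tool; a sketch
using the core pg_ls_waldir() function:

SELECT count(*) AS wal_files, pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();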

I'm okay with making the archive_command mandatory; if archiving
isn't required when using the pg_walcleaner tool, users can set a
dummy value such as archive_command='/bin/true'.

> > The initial version of the patch doesn't check if the server crashed
> > or not before running it. I was thinking of looking at the
> > postmaster.pid or pg_control file (however they don't guarantee
> > whether the server is up or crashed because the server can crash
> > without deleting postmaster.pid or updating pg_control file). Another
> > idea is to let pg_walcleaner fire a sample query ('SELECT 1') to see
> > if the server is up and running, if yes, exit, otherwise proceed with
> > its work.
>
> All of which isn't an issue if we don't have an external tool trying to
> do this and instead have the server doing it as the server knows its
> internal status, that the archive command has been failing long enough
> to pass the configuration threshold, and that the WAL isn't needed for
> crash recovery.  The ability to avoid having to crash and go through
> that process is pretty valuable.  Still, a crash may still happen and
> it'd be useful to have a clean way to deal with it.  I'm not really a
> fan of having to essentially configure this external command as well as
> have the server configured.  Have we settled that there's no way to make
> the server archive while there's no space available and before trying to
> write out more data?

The pg_walcleaner tool isn't intrusive in the sense that it doesn't
delete the WAL files that are required for the server to come up (it
checks for the checkpoint redo WAL file). Apart from that, it has
archive_command too, so no WAL files are lost at all, unlike the
pgbackrest option.

> > Also, to avoid losing WAL permanently, we must recommend using
> > archive_command so that the WAL can be moved to an alternative
> > location (could be the same archive_location that the primary uses).
>
> I agree we should recommend using archive_command or archive_library, of
> course, but if that's been done and is working properly then this tool
> isn't really needed.  The use-case we're trying to address, I believe,
> is something like:
>
> 1) archive command starts failing for some reason
> 2) WAL piles up on the primary
> 3) primary runs out of disk space, crash happens
> 4) archive command gets 'fixed' in some fashion
> 5) WAL is archived and removed from primary
> 6) primary is restarted and able to go through crash recovery
> 7) server is online again
>
> Now, I was trying to suggest an approach to addressing the issue at #2,
> that is, avoid having WAL pile up without end on the primary and avoid
> t

Re: Building Postgres with lz4 on Visual Studio

2022-04-29 Thread Andrew Dunstan


On 2022-04-26 Tu 16:26, Robert Haas wrote:
> On Wed, Apr 13, 2022 at 10:22 AM Melih Mutlu  wrote:
>> What I realized is that c:\lz4\lib\liblz4.lib actually does not exist.
>> The latest versions of lz4, downloaded from [2], do not contain \liblz4.lib 
>> anymore, as opposed to what's written here [3]. Also there isn't a lib 
>> folder too.
>>
>> After those changes on lz4 side, AFAIU there seems like this line adds 
>> library from wrong path in Solution.pm file [4].
>>> $proj->AddIncludeDir($self->{options}->{lz4} . '\include');
>>> $proj->AddLibrary($self->{options}->{lz4} . '\lib\liblz4.lib');
>> Even if I spent some time on this problem and tried to fix some places, I'm 
>> not able to run a successful build yet.
>> This is also the case for zstd too. Enabling zstd gives the same error.
>>
>> Has anyone had this issue before? Is this something that anyone is aware of 
>> and somehow made it work?
>> I would appreciate any comment/suggestion etc.
> In Solution.pm we have this:
>
> if ($self->{options}->{lz4})
> {
> $proj->AddIncludeDir($self->{options}->{lz4} . '\include');
> $proj->AddLibrary($self->{options}->{lz4} . '\lib\liblz4.lib');
> }
> if ($self->{options}->{zstd})
> {
> $proj->AddIncludeDir($self->{options}->{zstd} . '\include');
> $proj->AddLibrary($self->{options}->{zstd} . '\lib\libzstd.lib');
> }
>
> I think what you're saying is that the relative pathnames here may not
> be correct, depending on which version of lz4/zstd you're using. The
> solution is probably to use perl's -e to test which files actually
> exists e.g.
>
> if ($self->{options}->{lz4})
> {
> $proj->AddIncludeDir($self->{options}->{lz4} . '\include');
> if (-e ($self->{options}->{lz4} . '\someplace\somelz4.lib'))
> {
> $proj->AddLibrary($self->{options}->{lz4} .
> '\someplace\somelz4.lib');
> }
> else
> {
> $proj->AddLibrary($self->{options}->{lz4} . '\lib\liblz4.lib');
> }
> }
>
> The trick, at least as it seems to me, is figuring out exactly what
> the right set of conditions is, based on what kinds of different
> builds exist out there and where they put stuff.


I agree that we should use perl's -e to test that the files actually
exist. But I don't think we should try to adjust to everything the zstd
and lz4 people put in their release files. They are just horribly
inconsistent.

What I did was to install the packages using vcpkg[1] which is a
standard framework created by Microsoft for installing package
libraries. It does install the .lib files in a sane place
(installdir/lib), but it doesn't use the lib suffix. Also it names the
lib file for zlib differently.

I got around those things by renaming the lib files, but that's a bit
ugly. So I came up with this (untested) patch.


cheers


andrew


[1] https://github.com/microsoft/vcpkg

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Building Postgres with lz4 on Visual Studio

2022-04-29 Thread Andrew Dunstan

On 2022-04-29 Fr 08:50, Andrew Dunstan wrote:
> On 2022-04-26 Tu 16:26, Robert Haas wrote:
>> On Wed, Apr 13, 2022 at 10:22 AM Melih Mutlu  wrote:
>>> What I realized is that c:\lz4\lib\liblz4.lib actually does not exist.
>>> The latest versions of lz4, downloaded from [2], do not contain \liblz4.lib 
>>> anymore, as opposed to what's written here [3]. Also there isn't a lib 
>>> folder too.
>>>
>>> After those changes on lz4 side, AFAIU there seems like this line adds 
>>> library from wrong path in Solution.pm file [4].
 $proj->AddIncludeDir($self->{options}->{lz4} . '\include');
 $proj->AddLibrary($self->{options}->{lz4} . '\lib\liblz4.lib');
>>> Even if I spent some time on this problem and tried to fix some places, I'm 
>>> not able to run a successful build yet.
>>> This is also the case for zstd too. Enabling zstd gives the same error.
>>>
>>> Has anyone had this issue before? Is this something that anyone is aware of 
>>> and somehow made it work?
>>> I would appreciate any comment/suggestion etc.
>> In Solution.pm we have this:
>>
>> if ($self->{options}->{lz4})
>> {
>> $proj->AddIncludeDir($self->{options}->{lz4} . '\include');
>> $proj->AddLibrary($self->{options}->{lz4} . '\lib\liblz4.lib');
>> }
>> if ($self->{options}->{zstd})
>> {
>> $proj->AddIncludeDir($self->{options}->{zstd} . '\include');
>> $proj->AddLibrary($self->{options}->{zstd} . '\lib\libzstd.lib');
>> }
>>
>> I think what you're saying is that the relative pathnames here may not
>> be correct, depending on which version of lz4/zstd you're using. The
>> solution is probably to use perl's -e to test which files actually
>> exists e.g.
>>
>> if ($self->{options}->{lz4})
>> {
>> $proj->AddIncludeDir($self->{options}->{lz4} . '\include');
>> if (-e ($self->{options}->{lz4} . '\someplace\somelz4.lib'))
>> {
>> $proj->AddLibrary($self->{options}->{lz4} .
>> '\someplace\somelz4.lib');
>> }
>> else
>> {
>> $proj->AddLibrary($self->{options}->{lz4} . '\lib\liblz4.lib');
>> }
>> }
>>
>> The trick, at least as it seems to me, is figuring out exactly what
>> the right set of conditions is, based on what kinds of different
>> builds exist out there and where they put stuff.
>
> I agree that we should use perl's -e to test that the files actually
> exist. But I don't think we should try to adjust to everything the zstd
> and lz4 people put in their release files. They are just horribly
> inconsistent.
>
> What I did was to install the packages using vcpkg[1] which is a
> standard framework created by Microsoft for installing package
> libraries. It does install the .lib files in a sane place
> (installdir/lib), but it doesn't use the lib suffix. Also it names the
> lib file for zlib differently.
>
> I got around those things by renaming the lib files, but that's a bit
> ugly. So I came up with this (untested) patch.
>
>

er this patch

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/tools/msvc/Project.pm b/src/tools/msvc/Project.pm
index d39c502f30..ec2b860462 100644
--- a/src/tools/msvc/Project.pm
+++ b/src/tools/msvc/Project.pm
@@ -175,6 +175,15 @@ sub AddLibrary
 {
 	my ($self, $lib, $dbgsuffix) = @_;
 
+	if ( ! -e $lib)
+	{
+		(my $libname = $lib) =~ s!.*[/\\]!!;
+		my $altlibname = $libname eq 'zdll.lib' ? 'zlib.lib' : "lib$libname";
+		(my $altlib = $lib) =~ s/$libname/$altlibname/;
+		die "cannot find $lib or $altlib" unless -e $altlib;
+		$lib = $altlib;
+	}
+
 	# quote lib name if it has spaces and isn't already quoted
 	if ($lib =~ m/\s/ && $lib !~ m/^[&]quot;/)
 	{
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index cb2ad6cd29..d0a949b391 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -1089,12 +1089,12 @@ sub AddProject
 	if ($self->{options}->{lz4})
 	{
 		$proj->AddIncludeDir($self->{options}->{lz4} . '\include');
-		$proj->AddLibrary($self->{options}->{lz4} . '\lib\liblz4.lib');
+		$proj->AddLibrary($self->{options}->{lz4} . '\lib\lz4.lib');
 	}
 	if ($self->{options}->{zstd})
 	{
 		$proj->AddIncludeDir($self->{options}->{zstd} . '\include');
-		$proj->AddLibrary($self->{options}->{zstd} . '\lib\libzstd.lib');
+		$proj->AddLibrary($self->{options}->{zstd} . '\lib\zstd.lib');
 	}
 	if ($self->{options}->{uuid})
 	{


Re: SQL JSON compliance

2022-04-29 Thread Andrew Dunstan


On 2022-04-29 Fr 04:13, Peter Eisentraut wrote:
> On 13.04.22 22:43, Andrew Dunstan wrote:
>> Simon has just pointed out to me that as a result of recent commits, a
>> number of things now should move from the unsupported table to the
>> supported table in features.sgml. In particular, it looks to me like all
>> of these should move:
>
> This all looks correct to me.  Please go ahead.


Thanks, done.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Assorted small doc patches

2022-04-29 Thread David G. Johnston
Updated status of the set.

On Wed, Apr 20, 2022 at 5:59 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

>
> v0001-database-default-name (-bugs, with a related cleanup suggestion as
> well)
>
> https://www.postgresql.org/message-id/flat/CAKFQuwZvHH1HVSOu7EYjvshynk4pnDwC5RwkF%3DVfZJvmUskwrQ%40mail.gmail.com#0e6d799478d88aee93402bec35fa64a2
>
>
> v0002-doc-extension-dependent-routine-behavior (-general, reply to user
> confusion)
>
> https://www.postgresql.org/message-id/CAKFQuwb_QtY25feLeh%3D8uNdnyo1H%3DcN4R3vENsUwQzJP4-0xZg%40mail.gmail.com
>
>
> v0001-doc-savepoint-name-reuse (-docs, reply to user request for
> improvement)
>
> https://www.postgresql.org/message-id/CAKFQuwYzSb9OW5qTFgc0v9RWMN8bX83wpe8okQ7x6vtcmfA2KQ%40mail.gmail.com
>

Pending discussion of alternate presentation of transaction sequence; if
not favorable, I can just go with a simple factual fix of the mistake in
v0001 (see this thread).


>
> v0001-on-conflict-excluded-is-name-not-table (-docs, figured out while
> trying to improve the docs to reduce user confusion in this area)
>
> https://www.postgresql.org/message-id/flat/CAKFQuwYN20c0%2B7kKvm3PBgibu77BzxDvk9RvoXBb1%3Dj1mDODPw%40mail.gmail.com#ea79c88b55fdccecbd2c4fe549f321c9
>
>
> v0001-doc-make-row-estimation-example-match-prose (-docs, reply to user
> pointing of an inconsistency)
>
> https://www.postgresql.org/message-id/CAKFQuwax7V5R_rw%3DEOWmy%3DTBON6v3sveBx_WvwsENskCL5CLQQ%40mail.gmail.com
>

David J.


Re: fix cost subqueryscan wrong parallel cost

2022-04-29 Thread Tom Lane
Richard Guo  writes:
> Currently subquery scan is using rel->rows (if no parameterization),
> which I believe is not correct. That's not the size the subquery scan
> node in each worker needs to handle, as the rows have been divided
> across workers by the parallel-aware node.

Really?  Maybe I misunderstand the case under consideration, but
what I think will be happening is that each worker will re-execute
the pushed-down subquery in full.  Otherwise it can't compute the
correct answer.  What gets divided across the set of workers is
the total *number of executions* of the subquery, which should be
independent of the number of workers, so that the cost is (more
or less) the same as the non-parallel case.

At least that's true for a standard correlated subplan, which is
normally run again for each row processed by the parent node.
For hashed subplans and initplans, what would have been "execute
once" semantics becomes "execute once per worker", creating a
strict cost disadvantage for parallelization.  I don't know
whether the current costing model accounts for that.  But if it
does that wrong, arbitrarily altering the number of rows won't
make it better.

regards, tom lane




Re: fix cost subqueryscan wrong parallel cost

2022-04-29 Thread David G. Johnston
On Fri, Apr 29, 2022 at 7:02 AM Tom Lane  wrote:

> Richard Guo  writes:
> > Currently subquery scan is using rel->rows (if no parameterization),
> > which I believe is not correct. That's not the size the subquery scan
> > node in each worker needs to handle, as the rows have been divided
> > across workers by the parallel-aware node.
>
> Really?  Maybe I misunderstand the case under consideration, but
> what I think will be happening is that each worker will re-execute
> the pushed-down subquery in full.  Otherwise it can't compute the
> correct answer.  What gets divided across the set of workers is
> the total *number of executions* of the subquery, which should be
> independent of the number of workers, so that the cost is (more
> or less) the same as the non-parallel case.
>
>
I've been operating under the belief that a subquery node (or, rather, all
nodes) in a path get their row count estimates from self-selectivity * rows
provided by the next node down in the path.  In a partial path, eventually
the parallel aware node at the bottom of the path divides its row count
estimate by the parallel divisor (2.4 for two workers).  That value then
returns back up through the path until it hits a gather.  Every node in
between, which are almost exclusively parallel-safe (not parallel aware),
can just use the row count being percolated up the path (i.e., they get the
divided row count in their partial pathing and full row counts in the
normal pathing).  The gather node, realizing that the row count it is
dealing with has been divided by the parallel divisor, undoes the division
in order to provide the correct row count to the non-parallel executing
nodes above it.

So a subquery is only ever executed once in a path - but the number of
returned rows depends on the number of planned workers done under the
assumption that a query executed among 2 workers costs 1/2.4 the amount of
the same query done with just the leader.  It is an independent factor
compared to everything else going on and so the multiplication can simply
be done first to the original row count.



allpaths.c@2974-2975 (generate_gather_paths)
(L: 2993 as well)
(L: 3167 - generate_useful_gather_paths)

rows = cheapest_partial_path->rows * cheapest_partial_path->parallel_workers;

Shouldn't that be:

rows = cheapest_partial_path->rows * (get_parallel_divisor(cheapest_partial_path))

?

David J.


Re: fix cost subqueryscan wrong parallel cost

2022-04-29 Thread Tom Lane
I wrote:
> Really?  Maybe I misunderstand the case under consideration, but
> what I think will be happening is that each worker will re-execute
> the pushed-down subquery in full.

Oh ... nevermind that: what I was thinking about was the SubLink/SubPlan
case, which is unrelated to SubqueryScan.  -ENOCAFFEINE, sorry about that.

regards, tom lane




Re: fix cost subqueryscan wrong parallel cost

2022-04-29 Thread Tom Lane
I wrote:
> -ENOCAFFEINE, sorry about that.

As penance for that blunder, I spent a little time looking into this
issue (responding to Robert's upthread complaint that it wasn't
explained clearly).  See the test case in parallel_subquery.sql,
attached, which is adapted from the test in incremental_sort.sql.
It produces these plans:

explain select * from t union select * from t;

 Unique  (cost=29344.85..30544.85 rows=120000 width=12)
   ->  Sort  (cost=29344.85..29644.85 rows=120000 width=12)
         Sort Key: t.a, t.b, t.c
         ->  Append  (cost=0.00..2950.00 rows=120000 width=12)
               ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                     Workers Planned: 2
                     ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12)
               ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                     Workers Planned: 2
                     ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12)

explain select * from t union all select * from t;

 Gather  (cost=0.00..1400.00 rows=120000 width=12)
   Workers Planned: 2
   ->  Parallel Append  (cost=0.00..1400.00 rows=50000 width=12)
         ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12)
         ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12)

I take no position on whether the second plan's costs are correct;
but if they are, then the Gather-atop-Append structure is clearly
cheaper than the Append-atop-Gathers structure, so the planner made
the wrong choice in the first case.

I then modified setrefs.c to not remove SubqueryScan nodes
(just make trivial_subqueryscan() return constant false)
and got this output from the UNION case:

 Unique  (cost=29344.85..30544.85 rows=120000 width=12)
   ->  Sort  (cost=29344.85..29644.85 rows=120000 width=12)
         Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
         ->  Append  (cost=0.00..2950.00 rows=120000 width=12)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..1175.00 rows=60000 width=12)
                     ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..1175.00 rows=60000 width=12)
                     ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12)

The UNION ALL plan doesn't change, because no SubqueryScans are ever
created in that case (we flattened the UNION ALL to an appendrel
early on).  Removing the test case's settings to allow a parallel
plan, the non-parallelized plan for the UNION case looks like

 Unique  (cost=30044.85..31244.85 rows=120000 width=12)
   ->  Sort  (cost=30044.85..30344.85 rows=120000 width=12)
         Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
         ->  Append  (cost=0.00..3650.00 rows=120000 width=12)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..1525.00 rows=60000 width=12)
                     ->  Seq Scan on t  (cost=0.00..925.00 rows=60000 width=12)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..1525.00 rows=60000 width=12)
                     ->  Seq Scan on t t_1  (cost=0.00..925.00 rows=60000 width=12)

So: the row counts for SubqueryScan are correct, thus the upthread
proposal to change them is not.  The cost estimates, however, seem
pretty bogus.  What we're seeing here is that we're charging 600
cost units to pass the data through SubqueryScan, and that's
consistent across the parallel and non-parallel cases, so it's
correct on its own terms.  But actually it's totally wrong because
we're going to elide that node later.

So I think the actual problem here is that we leave the decision
to elide no-op SubqueryScan nodes till setrefs.c.  We should detect
that earlier, and when it applies, label the SubqueryScanPath with
the exact cost of its child.

(I think the current implementation might've been all right when
it was devised, on the grounds that we had no real planning
flexibility for UNION operations anyway.  But now we do, and the
bogus cost charge is causing visible problems.)

regards, tom lane

drop table if exists t;
create table t (a int, b int, c int);
insert into t select mod(i,10),mod(i,10),i from generate_series(1,60000) s(i);
create index on t (a);
analyze t;

set min_parallel_table_scan_size = '1kB';
set min_parallel_index_scan_size = '1kB';
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set max_parallel_workers_per_gather = 2;

set enable_hashagg to off;

explain select * from t union select * from t;
explain select * from t union all select * from t;


Re: Support logical replication of DDLs

2022-04-29 Thread Zheng Li
Hello,

> You should forbid it. Unless you can decompose the command into multiple SQL
> commands to make it a safe operation for logical replication.
>
> Let's say you want to add a column with a volatile default.
>
> ALTER TABLE foo ADD COLUMN bar double precision DEFAULT random();
>
> If you replicate the DDL command as is, you will have different data
> downstream. You should forbid it. However, this operation can be supported if
> the DDL command is decomposed in multiple steps.
>
> -- add a new column without DEFAULT to avoid rewrite
> ALTER TABLE foo ADD COLUMN bar double precision;
>
> -- future rows could use the DEFAULT expression
> -- it also doesn't rewrite the table
> ALTER TABLE foo ALTER COLUMN bar SET DEFAULT random();
>
> -- it effectively rewrites the table
> -- all rows are built from one source node
> -- data will be the same on all nodes
> UPDATE foo SET bar = random();

I looked more into this. In order to support statements like "ALTER
TABLE foo ADD COLUMN bar double precision DEFAULT random();", we have
two potential solutions, but both of them are non-trivial to
implement:

1. As Euler pointed out, we could decompose the statement on the
publisher into multiple statements so that the table rewrite (using
volatile function) is handled by a DML sub-command. The decomposition
requires changes in parse analysis/transform. We also need new logic
to assemble the decomposed DDL commands string back from the parse
trees so we can log them for logical replication.

2. Force skipping table rewrite when executing the same command on the
subscriber, and let DML replication replicate the table rewrite from
the publisher. The problem is table rewrite is not replicated at all
today, and it doesn't seem easy to just enable it for logical
replication. Table rewrite is an expensive operation involving heap
file swap, details can be found in ATRewriteTables().

In light of this, I propose to temporarily block replication of such
DDL command on the replication worker until we figure out a better
solution. This is implemented in patch
0008-Fail-replication-worker-on-DDL-command-that-rewrites.patch.
Notice that only DDL statements that rewrite the table using a VOLATILE
expression will be blocked. I don't see a problem replicating
non-volatile expressions.
Here is the github commit of the same patch:
https://github.com/zli236/postgres/commit/1e6115cb99a1286a61cb0a6a088f7476da29d0b9
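
To make the distinction concrete, here is a sketch of the two cases (column
names are illustrative; the error in the volatile case reflects the behavior
proposed by patch 0008, not current core behavior):

-- Non-volatile default: no per-node differences, so the DDL can be
-- replicated as is.
ALTER TABLE foo ADD COLUMN c1 integer DEFAULT 42;
-- Volatile default: forces a table rewrite whose values would differ per
-- node, so the apply worker would raise an error under patch 0008.
ALTER TABLE foo ADD COLUMN c2 double precision DEFAULT random();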

> The ALTER TABLE ... ALTER COLUMN ... TYPE has a similar issue. This DDL 
> command
> can be decomposed to avoid the rewrite. If you are changing the data type, in
> general, you add a new column and updates all rows doing the proper 
> conversion.
> (If you are updating in batches, you usually add a trigger to automatically
> adjust the new column value for INSERTs and UPDATEs. Another case is when you
> are reducing the the typmod (for example, varchar(100) to varchar(20)). In 
> this
> case, the DDL command can de decomposed removing the typmod information (ALTER
> TABLE ... ALTER COLUMN ... TYPE varchar) and replacing it with a CHECK
> constraint.

I tested ALTER TABLE ... ALTER COLUMN ... TYPE. It seems to be working
fine. Is there a particular case you're concerned about?

> --
> Euler Taveira
> EDB   https://www.enterprisedb.com/
>

Regards,
Zheng Li


0008-Fail-replication-worker-on-DDL-command-that-rewrites.patch
Description: Binary data


Re: [RFC] building postgres with meson -v8

2022-04-29 Thread Andres Freund
Hi,

On 2022-04-20 15:09:31 +0200, Peter Eisentraut wrote:
> On 13.04.22 12:26, Peter Eisentraut wrote:
> > Some feedback and patches for your branch at
> > 3274198960c139328fef3c725cee1468bbfff469:
> 
> Here is another patch.  It adds support for building ecpg.

Cool!

I have merged this, with a few changes (split parse.pl change out, changed its
invocation in Solution.pm, indentation, explicitly using shared_library()
rather than library(), indentation).

But there's need for some more - exports.txt handling is needed for windows
(and everywhere else, but not as crucially) - hence CI currently being broken
on windows. I've done that in a VM, and it indeed fixes the issues. But it
needs to be generalized, I just copied and pasted stuff around...

Greetings,

Andres Freund




Re: [RFC] building postgres with meson -v8

2022-04-29 Thread Andres Freund
Hi,

On 2022-04-27 21:56:27 +0200, Peter Eisentraut wrote:
> Here is a patch that adds in NLS.

Cool! I know very little about translations, so I was reluctant to tackle
this...


> For example, we could move the list of languages from the meson.build files
> into separate LINGUAS files, which could be shared with the makefile-based
> build system.  I need to research this a bit more.

Yea, that'd be nice.


> The annoying thing is that the i18n module doesn't appear to have a way to
> communicate with feature options or dependencies, so there isn't a way to
> tell it to only do its things when some option is enabled, or conversely to
> check whether the module found the things it needs and to enable or disable
> an option based on that.  So right now for example if you explicitly disable
> the 'nls' option, the binaries are built without NLS but the .mo files are
> still built and installed.

One partial way to deal with that, I think, would be to change all the
subdir('po') invocations to subdir('po', if_found: libintl). If we don't want
that for some reason, is there a reason a simple if libintl.found() wouldn't
work?


> In any case, this works for the main use cases and gets us a step forward,
> so it's worth considering.

Agreed.

Greetings,

Andres Freund




Re: fix cost subqueryscan wrong parallel cost

2022-04-29 Thread Tom Lane
I wrote:
> So I think the actual problem here is that we leave the decision
> to elide no-op SubqueryScan nodes till setrefs.c.  We should detect
> that earlier, and when it applies, label the SubqueryScanPath with
> the exact cost of its child.

Hmm ... actually, while doing that seems like it'd be a good idea,
it doesn't have much bearing on the case at hand.  I approximated
the results of such a change on this test case by just removing
the "cpu_tuple_cost" component from cost_subqueryscan:

@@ -1420,7 +1420,7 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo 
*root,
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
 
startup_cost = qpqual_cost.startup;
-   cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+   cpu_per_tuple = qpqual_cost.per_tuple;
run_cost = cpu_per_tuple * baserel->tuples;
 
/* tlist eval costs are paid per output row, not per tuple scanned */

and what I got was

 Unique  (cost=28144.85..29344.85 rows=120000 width=12)
   ->  Sort  (cost=28144.85..28444.85 rows=120000 width=12)
         Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
         ->  Append  (cost=0.00..1750.00 rows=120000 width=12)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..575.00 rows=60000 width=12)
                     ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..575.00 rows=60000 width=12)
                     ->  Gather  (cost=0.00..575.00 rows=60000 width=12)
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12)

The subqueryscans are being costed the way we want now, but it's still
going for the append-atop-gathers plan.  So I dug a bit deeper, and
found that generate_union_paths does also create the gather-atop-append
plan, but *it's costed at 1750 units*, exactly the same as the other
path.  So add_path is just making an arbitrary choice between two paths
of identical cost, and it happens to pick this one.

(If I don't make the above tweak to cost_subqueryscan, the same thing
happens, although the two competing paths now each have cost 2950.)

So: why do we come out with a cost of 1750 for the very same plan
that in the UNION ALL case is costed at 1400?  AFAICS it's because
the UNION ALL case thinks that the two inputs of the parallel append
produce 25000 rows apiece so the parallel append produces 50000 rows,
and it costs the append's overhead on that basis.  But in the UNION
case, the parallel append sees two inputs that are claimed to return
60000 rows, so the append produces 120000 rows, meaning more append
overhead.  We can't readily get EXPLAIN to print this tree since
it's rejected by add_path, but what I'm seeing (with the above
costing hack) is

   {GATHERPATH 
   :rows 120000 
   :startup_cost 0.00 
   :total_cost 1750.00 
   :subpath 
  {APPENDPATH 
  :parallel_aware true 
  :parallel_safe true 
  :parallel_workers 2 
  :rows 120000 
  :startup_cost 0.00 
  :total_cost 1750.00 
  :subpaths (
 {SUBQUERYSCANPATH 
 :rows 60000 
 :startup_cost 0.00 
 :total_cost 575.00 
 :subpath 
{PATH 
:parallel_aware true 
:parallel_safe true 
:parallel_workers 2 
:rows 25000 
:startup_cost 0.00 
:total_cost 575.00 
}
 }
 {SUBQUERYSCANPATH 
 :rows 60000 
 :startup_cost 0.00 
 :total_cost 575.00 
 :subpath 
{PATH 
:parallel_aware true 
:parallel_safe true 
:parallel_workers 2 
:rows 25000 
:startup_cost 0.00 
:total_cost 575.00 
}
 }
  )
  }
   }
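
(As a cross-check on those totals: assuming the default cpu_tuple_cost of
0.01 and costsize.c's APPEND_CPU_COST_MULTIPLIER of 0.5, the Append's own
overhead works out to 0.005 per output row, so

    575 + 575 + 0.005 *  50000 = 1400        (UNION ALL shape)
    575 + 575 + 0.005 * 120000 = 1750        (this shape)

which matches the numbers quoted above.)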

In short, these SubqueryScans are being labeled as producing 60000 rows
when their input only produces 25000 rows, which is surely insane.

So: even though the SubqueryScan itself isn't parallel-aware, the number
of rows it processes has to be de-rated according to the number of workers
involved.  Perhaps something like the original patch in this thread is
indeed correct.  But I can't help feeling that this definition of
path_rows is mighty unintuitive and error-prone, and that if
cost_subqueryscan is wrong then it's likely got lots of company.
Maybe we need to take two steps back and rethink the whole approach.

regards, tom lane




Re: allow specifying action when standby encounters incompatible parameter settings

2022-04-29 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hello

The patch applied and tested fine. I think for this kind of exception, it 
really is up to the administrator how to proceed, depending on the business 
application. Leaving this configurable by the user is a nice and modest change. 
I also like that you define the parameter as an enum, so it is possible to 
extend it later with other actions, such as automatically adjusting to match 
the new value. 


-
Cary Huang
HighGo Software Canada

Re: fix cost subqueryscan wrong parallel cost

2022-04-29 Thread David G. Johnston
On Fri, Apr 29, 2022 at 11:09 AM Tom Lane  wrote:

>
> In short, these SubqueryScans are being labeled as producing 60000 rows
> when their input only produces 25000 rows, which is surely insane.
>
> So: even though the SubqueryScan itself isn't parallel-aware, the number
> of rows it processes has to be de-rated according to the number of workers
> involved.


Right, so why does baserel.rows show 60,000 here when path->subpath->rows
only shows 25,000?  Because if you substitute path->subpath->rows for
baserel.rows in cost_subquery you get (with your cost change above):

 Incremental Sort  (cost=27875.50..45577.57 rows=120000 width=12) (actual time=165.285..235.749 rows=60000 loops=1)
   Sort Key: "*SELECT* 1".a, "*SELECT* 1".c
   Presorted Key: "*SELECT* 1".a
   Full-sort Groups: 10  Sort Method: quicksort  Average Memory: 28kB  Peak Memory: 28kB
   Pre-sorted Groups: 10  Sort Method: quicksort  Average Memory: 521kB  Peak Memory: 521kB
   ->  Unique  (cost=27794.85..28994.85 rows=120000 width=12) (actual time=157.882..220.501 rows=60000 loops=1)
         ->  Sort  (cost=27794.85..28094.85 rows=120000 width=12) (actual time=157.881..187.232 rows=120000 loops=1)
               Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
               Sort Method: external merge  Disk: 2600kB
               ->  Gather  (cost=0.00..1400.00 rows=120000 width=12) (actual time=0.197..22.705 rows=120000 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Parallel Append  (cost=0.00..1400.00 rows=50000 width=12) (actual time=0.015..13.101 rows=40000 loops=3)
                           ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..575.00 rows=25000 width=12) (actual time=0.014..6.864 rows=30000 loops=2)
                                 ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12) (actual time=0.014..3.708 rows=30000 loops=2)
                           ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..575.00 rows=25000 width=12) (actual time=0.010..6.918 rows=30000 loops=2)
                                 ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12) (actual time=0.010..3.769 rows=30000 loops=2)
 Planning Time: 0.137 ms
 Execution Time: 239.958 ms
(19 rows)

Which shows your 1400 cost goal from union all, and the expected row
counts, for gather-atop-append.
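
(For clarity, the substitution tried here is nothing more than a throwaway
hack in cost_subqueryscan(), not a proposed fix:

    /* experiment only: use the derated subpath estimate directly */
    path->path.rows = path->subpath->rows;    /* instead of baserel->rows */

with the param_info branch left alone.)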

The fact that (baserel.rows > path->subpath->rows) here seems like a
straight bug: there are no filters involved in this case but in the
presence of filters baserel->rows should be strictly (<=
path->subpath->rows), right?

David J.


Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2022-04-29 Thread Andres Freund
Hi,

On 2020-12-16 18:12:39 +0900, Fujii Masao wrote:
> - /* Wait to be signaled by UnpinBuffer() */
> + /*
> +  * Wait to be signaled by UnpinBuffer().
> +  *
> +  * We assume that only UnpinBuffer() and the timeout requests 
> established
> +  * above can wake us up here. WakeupRecovery() called by walreceiver or
> +  * SIGHUP signal handler, etc cannot do that because it uses the 
> different
> +  * latch from that ProcWaitForSignal() waits on.
> +  */
>   ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
>  
>   /*

Isn't this comment bogus?  The latch could e.g. be set by
procsignal_sigusr1_handler(), which the startup process uses. Or it could
already be set, when entering ResolveRecoveryConflictWithBufferPin().

Why is it even relevant that we only get woken up by UnpinBuffer()?

Greetings,

Andres Freund




Re: fix cost subqueryscan wrong parallel cost

2022-04-29 Thread Tom Lane
"David G. Johnston"  writes:
> The fact that (baserel.rows > path->subpath->rows) here seems like a
> straight bug: there are no filters involved in this case but in the
> presence of filters baserel->rows should be strictly (<=
> path->subpath->rows), right?

No, because the subpath's rowcount has been derated to represent the
number of rows any one worker is expected to process.  Since the
SubqueryScan is below the Gather, it has to represent that number
of rows too.  Like I said, this design is pretty confusing; though
I do not have a better alternative to suggest.

Anyway, after chewing on this for awhile, it strikes me that the
sanest way to proceed is for cost_subqueryscan to compute its rows
estimate as the subpath's rows estimate times the selectivity of
the subqueryscan's quals (if any).  That'd be the natural way to
proceed for most sorts of non-bottom-level paths to begin with.
I think there are a few reasons why cost_subqueryscan currently
does it the way it does:

* By analogy to other sorts of relation-scan nodes.  But those don't
have any subpath that they could consult instead.  This analogy is
really a bit faulty, since SubqueryScan isn't a relation scan node
in any ordinary meaning of that term.

* To ensure that all paths for the same relation have the same rowcount
estimate (modulo different parameterization or parallelism).  But we'll
have that anyway because the only possible path type for an unflattened
RTE_SUBQUERY rel is a SubqueryScan, so there's no risk of different
cost_xxx functions arriving at slightly different estimates.

* To avoid redundant computation of the quals' selectivity.
This is slightly annoying; but in practice the quals will always
be RestrictInfos which will cache the per-qual selectivities,
so it's not going to cost us very much to perform that calculation
over again.

So perhaps we should do it more like the attached, which produces
this plan for the UNION case:

 Unique  (cost=28994.85..30194.85 rows=120000 width=12)
   ->  Sort  (cost=28994.85..29294.85 rows=120000 width=12)
         Sort Key: t.a, t.b, t.c
         ->  Gather  (cost=0.00..2600.00 rows=120000 width=12)
               Workers Planned: 2
               ->  Parallel Append  (cost=0.00..2600.00 rows=50000 width=12)
                     ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12)
                     ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12)

It'd still be a good idea to fix the cost estimation to not charge
extra for SubqueryScans that will be elided later, because this
Parallel Append's cost should be 1400 not 2600.  But as I showed
upthread, that won't affect the plan choice for this particular test
case.

regards, tom lane

#text/x-diff; name="change-cost_subqueryscan-rowcount-estimate-2.patch" 
[change-cost_subqueryscan-rowcount-estimate-2.patch] /home/postgres/zzz




Re: fix cost subqueryscan wrong parallel cost

2022-04-29 Thread Tom Lane
I wrote:
> So perhaps we should do it more like the attached, which produces
> this plan for the UNION case:

sigh ... actually attached this time.

regards, tom lane

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b787c6f81a..18749e842d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1395,6 +1395,7 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
 {
 	Cost		startup_cost;
 	Cost		run_cost;
+	List	   *qpquals;
 	QualCost	qpqual_cost;
 	Cost		cpu_per_tuple;
 
@@ -1402,11 +1403,24 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
 	Assert(baserel->relid > 0);
 	Assert(baserel->rtekind == RTE_SUBQUERY);
 
-	/* Mark the path with the correct row estimate */
+	/*
+	 * We compute the rowcount estimate as the subplan's estimate times the
+	 * selectivity of relevant restriction clauses.  In simple cases this will
+	 * come out the same as baserel->rows; but when dealing with parallelized
+	 * paths we must do it like this to get the right answer.
+	 */
 	if (param_info)
-		path->path.rows = param_info->ppi_rows;
+		qpquals = list_concat_copy(param_info->ppi_clauses,
+   baserel->baserestrictinfo);
 	else
-		path->path.rows = baserel->rows;
+		qpquals = baserel->baserestrictinfo;
+
+	path->path.rows = clamp_row_est(path->subpath->rows *
+	clauselist_selectivity(root,
+		   qpquals,
+		   0,
+		   JOIN_INNER,
+		   NULL));
 
 	/*
 	 * Cost of path is cost of evaluating the subplan, plus cost of evaluating
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 21c429226f..0d8d77140a 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -1487,14 +1487,12 @@ explain (costs off) select * from t union select * from t order by 1,3;
->  Unique
  ->  Sort
Sort Key: t.a, t.b, t.c
-   ->  Append
- ->  Gather
-   Workers Planned: 2
+   ->  Gather
+ Workers Planned: 2
+ ->  Parallel Append
->  Parallel Seq Scan on t
- ->  Gather
-   Workers Planned: 2
->  Parallel Seq Scan on t t_1
-(13 rows)
+(11 rows)
 
 -- Full sort, not just incremental sort can be pushed below a gather merge path
 -- by generate_useful_gather_paths.


Re: failures in t/031_recovery_conflict.pl on CI

2022-04-29 Thread Andres Freund
Hi,

Attached are patches for this issue.

It adds a test case for deadlock conflicts to make sure that case isn't
broken. I also tested the recovery conflict tests in the back branches, and
they work there with a reasonably small set of changes.

Questions:
- I'm planning to backpatch the test as 031_recovery_conflict.pl, even though
  preceding numbers are unused. It seems way more problematic to use a
  different number in the backbranches than have gaps?

- The test uses pump_until() and wait_for_log(), which don't exist in the
  backbranches. For now I've just inlined the implementation, but I guess we
  could also backpatch their introduction?

- There's a few incompatibilities in the test with older branches:
  - older branches don't have allow_in_place_tablespaces - I think just
skipping tablespace conflicts is fine, they're comparatively
simple.

Eventually it might make sense to backpatch allow_in_place_tablespaces,
our test coverage in the area is quite poor.

  - the stats tests can't easily be made reliable in the backbranches - which is
ok, as the conflict itself is verified via the log

  - some branches don't have log_recovery_conflict_waits, since it's not
critical to the test, it's ok to just not include it there

  I played with the idea of handling the differences using version comparisons
  in the code, and having the test be identical across branches. Since it's
  something we haven't done so far, I'm leaning against it, but ...


> - For HEAD we have to replace the disable_all_timeouts() calls, it breaks the
>   replay progress reporting. Is there a reason to keep them in the
>   backbranches? Hard to see how an extension or something could rely on it,
>   but ...?

I've left it as is for now, will start a separate thread.


> - There's the following comment in ResolveRecoveryConflictWithBufferPin():
>
>   "We assume that only UnpinBuffer() and the timeout requests established
>above can wake us up here."
>
>   That's bogus afaict? There are plenty of other things that can cause MyProc->latch
>   to be set. Is it worth doing something about this at the same time? Right
>   now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid
>   succession initially.

The comment is more recent than I had realized. I raised this separately in
https://postgr.es/m/20220429191815.xewxjlpmq7mxhsr2%40alap3.anarazel.de


pgindent uses some crazy formatting nearby:
SendRecoveryConflictWithBufferPin(
  
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

I'm tempted to clean that up in passing by having just one
SendRecoveryConflictWithBufferPin() call instead of two, storing the type of
conflict in a local variable? Doesn't look entirely pretty either, but ...
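
Roughly this shape, I mean (just a sketch, untested; the deadlock-only
handling would stay inside its own branch):

	if (got_standby_delay_timeout || got_standby_deadlock_timeout)
	{
		ProcSignalReason reason;

		if (got_standby_deadlock_timeout)
			reason = PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK;
		else
			reason = PROCSIG_RECOVERY_CONFLICT_BUFFERPIN;

		SendRecoveryConflictWithBufferPin(reason);
	}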


I'm very doubtful of this claim above ResolveRecoveryConflictWithBufferPin(),
btw. But that'd be a non-backpatchable cleanup, I think:
 * The ProcWaitForSignal() sleep normally done in LockBufferForCleanup()
 * (when not InHotStandby) is performed here, for code clarity.


Greetings,

Andres Freund
>From 08b71cceefabe48ca80b1b15752031b27d05229d Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 29 Apr 2022 12:50:10 -0700
Subject: [PATCH vHEAD 1/2] Fix possibility of self-deadlock in
 ResolveRecoveryConflictWithBufferPin().

Author:
Reviewed-By:
Discussion: https://postgr.es/m/20220413002626.udl7lll7f3o7n...@alap3.anarazel.de
Backpatch:
---
 src/backend/storage/ipc/standby.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 2850867323b..8c5e8432e73 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -46,6 +46,7 @@ static HTAB *RecoveryLockLists;
 
 /* Flags set by timeout handlers */
 static volatile sig_atomic_t got_standby_deadlock_timeout = false;
+static volatile sig_atomic_t got_standby_delay_timeout = false;
 static volatile sig_atomic_t got_standby_lock_timeout = false;
 
 static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
@@ -793,7 +794,8 @@ ResolveRecoveryConflictWithBufferPin(void)
 	}
 
 	/*
-	 * Wait to be signaled by UnpinBuffer().
+	 * Wait to be signaled by UnpinBuffer() or for the wait to be interrupted
+	 * by one of the timeouts established above.
 	 *
 	 * We assume that only UnpinBuffer() and the timeout requests established
 	 * above can wake us up here. WakeupRecovery() called by walreceiver or
@@ -802,7 +804,9 @@ ResolveRecoveryConflictWithBufferPin(void)
 	 */
 	ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
 
-	if (got_standby_deadlock_timeout)
+	if (got_standby_delay_timeout)
+		SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+	else if (got_standby_deadlock_timeout)
 	{
 		/*
 		 * Send out a request for hot-standby backends to check themselves for
@@ -828,6 +832,7 @@ ResolveRecoveryConflictWithBufferPin(void)
 	 * individually, but that'd be slower

Re: [RFC] building postgres with meson -v8

2022-04-29 Thread Andres Freund
Hi,

On 2022-04-29 11:00:43 -0700, Andres Freund wrote:
> On 2022-04-27 21:56:27 +0200, Peter Eisentraut wrote:
> > Here is a patch that adds in NLS.
> 
> Cool! I know very little about translations, so I was reticent to tackle
> this...

> > The annoying thing is that the i18n module doesn't appear to have a way to
> > communicate with feature options or dependencies, so there isn't a way to
> > tell it to only do its things when some option is enabled, or conversely to
> > check whether the module found the things it needs and to enable or disable
> > an option based on that.  So right now for example if you explicitly disable
> > the 'nls' option, the binaries are built without NLS but the .mo files are
> > still built and installed.
> 
> One partial way to deal with that, I think, would be to change all the
> subdir('po') invocations to subdir('po', if_found: libintl). If we don't want
> that for some reason, is there a reason a simple if libintl.found() wouldn't
> work?

Merged into my tree now, using if_found. I've also made the intl check work
with older meson versions, since I didn't include your version requirement
upgrades.


For now I "fixed" the ecpg issue on windows by just not building ecpg
there. Ecpg also needs tests ported...

Greetings,

Andres Freund




Re: fix cost subqueryscan wrong parallel cost

2022-04-29 Thread David G. Johnston
On Fri, Apr 29, 2022 at 12:31 PM Tom Lane  wrote:

> "David G. Johnston"  writes:
> > The fact that (baserel.rows > path->subpath->rows) here seems like a
> > straight bug: there are no filters involved in this case but in the
> > presence of filters baserel->rows should be strictly (<=
> > path->subpath->rows), right?
>
> No, because the subpath's rowcount has been derated to represent the
> number of rows any one worker is expected to process.  Since the
> SubqueryScan is below the Gather, it has to represent that number
> of rows too.  Like I said, this design is pretty confusing; though
> I do not have a better alternative to suggest.
>
> Anyway, after chewing on this for awhile, it strikes me that the
> sanest way to proceed is for cost_subqueryscan to compute its rows
> estimate as the subpath's rows estimate times the selectivity of
> the subqueryscan's quals (if any).


This is what Robert was getting at, and I followed-up on.

The question I ended up at is: why doesn't baserel->rows already produce the
value you now propose to calculate directly within cost_subquery?

set_baserel_size_estimates (multiplies rel->tuples - which is derated - by
selectivity, sets rel->rows)
set_subquery_size_estimates
rel->subroot = subquery_planner(...)
  // my expectation is that
sub_final_rel->cheapest_total_path->rows is the derated number of rows;
  // the fact you can reach the derated amount later by using
path->subpath->rows seems to affirm this.
sets rel->tuples from sub_final_rel->cheapest_total_path->rows)
set_subquery_pathlist (executes the sizing call stack above, then proceeds
to create_subqueryscan_path which in turn calls cost_subquery)
set_rel_size
...


> * By analogy to other sorts of relation-scan nodes.  But those don't
> have any subpath that they could consult instead.  This analogy is
> really a bit faulty, since SubqueryScan isn't a relation scan node
> in any ordinary meaning of that term.
>

I did observe this copy-and-paste dynamic; I take it this is why we cannot
or would not just change all of the baserel->rows usages to
path->subpath->rows.

>
> So perhaps we should do it more like the attached, which produces
> this plan for the UNION case:
>
>
The fact that this changes row counts in a costing function is bothersome - it
would be nice to go the other direction and remove the if block.  More
generally, path->path.rows should be read-only by the time we get to
costing. But I'm not out to start a revolution here either.  Still, it feels
like we are just papering over a bug in how baserel->rows is computed, per
my analysis above.

In short, the solution seems like it should, and in fact does here, fix the
observed problem.  I'm fine with that.

David J.


Re: bogus: logical replication rows/cols combinations

2022-04-29 Thread Tomas Vondra
On 4/29/22 06:48, Amit Kapila wrote:
> On Thu, Apr 28, 2022 at 11:00 PM Tomas Vondra
>  wrote:
>>
>> On 4/28/22 14:26, Peter Eisentraut wrote:
>>> On 27.04.22 12:33, Amit Kapila wrote:
>>>
>>> I wonder how we handle the combination of
>>>
>>> pub1: publish=insert WHERE (foo)
>>> pub2: publish=update WHERE (bar)
>>>
>>> I think it would be incorrect if the combination is
>>>
>>> pub1, pub2: publish=insert,update WHERE (foo OR bar).
>>
>> That's a good question, actually. No, we don't combine the publications
>> like this, the row filters are kept "per action".
>>
> 
> Right, and it won't work even if try to combine in this case because
> of replica identity restrictions.
> 
>> But the exact behavior
>> turns out to be rather confusing in this case.
>>
>> (Note: This has nothing to do with column lists.)
>>
>> Consider an example similar to what Alvaro posted earlier:
>>
>>   create table uno (a int primary key, b int, c int);
>>
>>   create publication uno for table uno where (a > 0)
>> with (publish='insert');
>>
>>   create publication dos for table uno where (a < 0)
>> with (publish='update');
>>
>> And do this:
>>
>>   insert into uno values (1, 2, 3), (-1, 3, 4)
>>
>> which on the subscriber produces just one row, because (a<0) replicates
>> only updates:
>>
>>a | b | c
>>   ---+---+---
>>1 | 2 | 3
>>   (1 row)
>>
>> Now, let's update the (a<0) row.
>>
>>   update uno set a = 2 where a = -1;
>>
>> It might seem reasonable to expect the updated row (2,3,4) to appear on
>> the subscriber, but no - that's not what happens. Because we have (a<0)
>> for UPDATE, and we evaluate this on the old row (matches) and new row
>> (does not match). And pgoutput_row_filter() decides the update needs to
>> be converted to DELETE, despite the old row was not replicated at all.
>>
> 
> Right, but we don't know what previously would have happened maybe the
> user would have altered the publication action after the initial row
> is published in which case this DELETE is required as is shown in the
> example below. We can only make the decision based on the current
> tuple. For example:
> 
> create table uno (a int primary key, b int, c int);
> 
>   create publication uno for table uno where (a > 0)
> with (publish='insert');
> 
>   create publication dos for table uno where (a < 0)
> with (publish='insert');
> 
> -- create subscription for both these publications.
> 
> insert into uno values (1, 2, 3), (-1, 3, 4);
> 
> Alter publication dos set (publish='update');
> 
> update uno set a = 2 where a = -1;
> 
> Now, in this case, the old row was replicated and we would need a
> DELETE corresponding to it.
> 

I think such issues due to ALTER of the publication are somewhat
expected, and I think users will understand they might need to resync
the subscription or something like that.

A similar example might be just changing the where condition,

  create publication p for table t where (a > 10);

and then

  alter publication p set table t where (a > 15);

If we replicated any rows with (a > 10) and (a <= 15), we'll just stop
replicating them. But if we re-create the subscription, we end up with a
different set of rows on the subscriber, omitting rows with (a <= 15).

In principle we'd need to replicate the ALTER somehow, to delete or
insert the rows that start/stop matching the row filter. It's a bit
similar to not replicating DDL, perhaps.

But I think the issue I've described is different, because you don't
have to change the subscriptions at all and you'll still have the
problem. I mean, imagine doing this:

-- publisher
create table t (a int primary key, b int);
create publication p for table t where (a > 10) with (publish='update');

-- subscriber
create table t (a int primary key, b int);
create subscription s connection '...' publication p;

-- publisher
insert into t select i, i from generate_series(1,20) s(i);
update t set b = b * 10;

-- subscriber
--> has no rows in "t"
--> recreate the subscription
drop subscription s;
create subscription s connection '...' publication p;

--> now it has all the rows with (a>10), because tablesync ignores
publication actions


The reason why I find this really annoying is that it makes it almost
impossible to setup two logical replicas that'd be "consistent", unless
you create them at the same time (= without any writes in between). And
it's damn difficult to think about the inconsistencies.


IMHO this all stems from allowing row filters and restricting pubactions
at the same time (notice this only used a single publication). So maybe
the best option would be to disallow combining these two features? That
would ensure the row filter is always applied to all actions in a
consistent manner, preventing all these issues.

Maybe that's not possible - maybe there are valid use cases that would
need such combination, and you mentioned replica identity might be an
issue (and maybe requiring RIF with row filters is not desirable).

So maybe we should at least warn against this in the documentation?

Re: failures in t/031_recovery_conflict.pl on CI

2022-04-29 Thread Tom Lane
Andres Freund  writes:
> Questions:
> - I'm planning to backpatch the test as 031_recovery_conflict.pl, even though
>   preceding numbers are unused. It seems way more problematic to use a
>   different number in the backbranches than have gaps?

+1

> - The test uses pump_until() and wait_for_log(), which don't exist in the
>   backbranches. For now I've just inlined the implementation, but I guess we
>   could also backpatch their introduction?

I'd backpatch --- seems unlikely this will be the last need for 'em.

> pgindent uses some crazy formatting nearby:
> SendRecoveryConflictWithBufferPin(
>   
> PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

I do not believe that that line break is pgindent's fault.
If you just fold it into one line it should stay that way.

regards, tom lane




Re: bogus: logical replication rows/cols combinations

2022-04-29 Thread Amit Kapila
On Sat, Apr 30, 2022 at 2:02 AM Tomas Vondra
 wrote:
>
> On 4/29/22 06:48, Amit Kapila wrote:
> > On Thu, Apr 28, 2022 at 11:00 PM Tomas Vondra
>
> I think such issues due to ALTER of the publication are somewhat
> expected, and I think users will understand they might need to resync
> the subscription or something like that.
>
> A similar example might be just changing the where condition,
>
>   create publication p for table t where (a > 10);
>
> and then
>
>   alter publication p set table t where (a > 15);
>
> If we replicated any rows with (a > 10) and (a <= 15), we'll just stop
> replicating them. But if we re-create the subscription, we end up with a
> different set of rows on the subscriber, omitting rows with (a <= 15).
>
> In principle we'd need to replicate the ALTER somehow, to delete or
> insert the rows that start/stop matching the row filter. It's a bit
> similar to not replicating DDL, perhaps.
>
> But I think the issue I've described is different, because you don't
> have to change the subscriptions at all and you'll still have the
> problem. I mean, imagine doing this:
>
> -- publisher
> create table t (a int primary key, b int);
> create publication p for table t where (a > 10) with (publish='update');
>
> -- subscriber
> create table t (a int primary key, b int);
> create subscription s connection '...' publication p;
>
> -- publisher
> insert into t select i, i from generate_series(1,20) s(i);
> update t set b = b * 10;
>
> -- subscriber
> --> has no rows in "t"
> --> recreate the subscription
> drop subscription s;
> create subscription s connection '...' publication p;
>
> --> now it has all the rows with (a>10), because tablesync ignores
> publication actions
>
>
> The reason why I find this really annoying is that it makes it almost
> impossible to setup two logical replicas that'd be "consistent", unless
> you create them at the same time (= without any writes in between). And
> it's damn difficult to think about the inconsistencies.
>

I understood your case related to the initial sync, and it occurs with or
without a row filter.

>
> IMHO this all stems from allowing row filters and restricting pubactions
> at the same time (notice this only used a single publication). So maybe
> the best option would be to disallow combining these two features? That
> would ensure the row filter is always applied to all actions in a
> consistent manner, preventing all these issues.
>
> Maybe that's not possible - maybe there are valid use cases that would
> need such combination, and you mentioned replica identity might be an
> issue
>

Yes, that is the reason we can't combine the row filters for all pubactions.

> (and maybe requiring RIF with row filters is not desirable).
>
> So maybe we should at least warn against this in the documentation?
>

Yeah, I find this the most suitable way to address your
concern. I would like to add this information to the 'Initial
Snapshot' page with some examples (both with and without a row
filter).

> >
> > True, I think to some extent we rely on users to define it sanely
> > otherwise currently also it can easily lead to even replication being
> > stuck. This can happen when the user is trying to operate on the same
> > table and define publication/subscription on multiple nodes for it.
> > See [1] where we trying to deal with such a problem.
> >
> > [1] - https://commitfest.postgresql.org/38/3610/
> >
>
> That seems to deal with a circular replication, i.e. two logical
> replication links - a bit like a multi-master. Not sure how is that
> related to the issue we're discussing here?
>

It is not directly related to what we are discussing here, but I was
trying to emphasize the point that users need to define logical
replication via pub/sub sanely; otherwise they might see some weird
behaviors like that.

-- 
With Regards,
Amit Kapila.




Progress report removal of temp files and temp relation files using ereport_startup_progress

2022-04-29 Thread Bharath Rupireddy
Hi,

At times, there can be many temp files (under pgsql_tmp) and temp
relation files to remove, which after a crash may take a long time, during
which users have no clue about what's going on in the server before it
comes up online.

Here's a proposal to use ereport_startup_progress to report the
progress of the file removal.
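
For illustration, something roughly like this in RemovePgTempFilesInDir()
(untested sketch; the message wording is arbitrary, and it assumes the
startup progress timeout machinery is usable wherever the removal runs):

    while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)
    {
        snprintf(rm_path, sizeof(rm_path), "%s/%s",
                 tmpdirname, temp_de->d_name);

        /* report elapsed time and the path currently being processed */
        ereport_startup_progress("removing temporary files, elapsed time: %ld.%02d s, current path: %s",
                                 rm_path);

        /* ... existing removal logic unchanged ... */
    }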

Thoughts?

Regards,
Bharath Rupireddy.


v1-0001-Progress-report-removal-of-temp-files-and-temp-re.patch
Description: Binary data