Re: pg_receivewal starting position

2021-10-21 Thread Michael Paquier
On Thu, Oct 21, 2021 at 08:29:54AM +0200, Ronan Dunklau wrote:
> Ok, do you want me to propose a different patch for previous versions ?

That's not necessary.  Thanks for proposing.

> Do you mean restart_lsn as the pointer argument to the function, or
> restart_lsn as the field returned by the command ? If it's the first, I'll
> change it but if it's the latter it is expected that we sometime run this
> on a slot where WAL has never been reserved yet.

restart_lsn as the pointer of the function.
--
Michael




Re: Failed transaction statistics to measure the logical replication progress

2021-10-21 Thread Amit Kapila
On Thu, Oct 21, 2021 at 6:49 AM Masahiko Sawada  wrote:
>
> On Mon, Oct 18, 2021 at 7:03 PM Amit Kapila  wrote:
> >
> > On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Thursday, September 30, 2021 12:15 PM Amit Kapila 
> > > 
> > > >
> > > > These views are all related to stats untransmitted to the collector, but what
> > > > we really need is a view similar to pg_stat_archiver or
> > > > pg_stat_bgwriter which gives information about background workers.
> > > > Now, the problem as I see is if we go that route then
> > > > pg_stat_subscription will no longer remain dynamic view and one might
> > > > consider that as a compatibility break. The other idea I have shared
> > > > is that we display these stats under the new view introduced by
> > > > Sawada-San's patch [1] and probably rename that view as
> > > > pg_stat_subscription_worker where all the stats (xact info and last
> > > > failure information) about each worker will be displayed. Do you have
> > > > any opinion on that idea or do you see any problem with it?
> > >
> > > Personally, I think it seems reasonable to merge the xact stats into the
> > > view from Sawada-san's patch.
> > >
> > > One problem I noticed is that pg_stat_subscription_error
> > > currently has a 'count' column which shows how many times the last error
> > > happened. The xact stats here also have a similar value 'xact_error'. I
> > > think we might need to rename it or merge them into one in some way.
> > >
> > > Besides, if we decide to merge the xact stats into pg_stat_subscription_error,
> > > some columns seem to need renaming. Maybe like:
> > > error_message => last_error_message, command => last_error_command..
> > >
> >
> > Don't you think that keeping the view name as
> > pg_stat_subscription_error would be a bit confusing if it has to
> > display xact_info? Isn't it better to change it to
> > pg_stat_subscription_worker or some other worker-specific generic
> > name?
>
> I agree that it'd be better to include the xact info in the
> pg_stat_subscription_errors view rather than making
> pg_stat_subscription a cumulative view. It would be simpler and more
> consistent.
>
> The user might want to reset only either error stats or xact stats but
> we can do that by passing a value to the reset function. For example,
> pg_stat_reset_subscription_worker(10, 20, ‘error') for resetting only
> error stats whereas pg_stat_reset_subscription_worker(10, 20, ‘xact’)
> for resetting only xact stats etc (passing NULL or omitting the third
> argument means resetting all stats).
>

Sounds reasonable.

> I'll change the view name in the
> next version patch.
>

Thanks, that will be helpful.

-- 
With Regards,
Amit Kapila.




Re: pg_receivewal starting position

2021-10-21 Thread Ronan Dunklau
On Thursday 21 October 2021 at 09:21:44 CEST, Michael Paquier wrote:
> On Thu, Oct 21, 2021 at 08:29:54AM +0200, Ronan Dunklau wrote:
> > Ok, do you want me to propose a different patch for previous versions ?
> 
> That's not necessary.  Thanks for proposing.
> 
> > Do you mean restart_lsn as the pointer argument to the function, or
> > restart_lsn as the field returned by the command ? If it's the first, I'll
> > change it but if it's the latter it is expected that we sometime run this
> > on a slot where WAL has never been reserved yet.
> 
> restart_lsn as the pointer of the function.

Done. I haven't touched the timeline switch test patch for now, but I still 
include it here for completeness.





-- 
Ronan Dunklau
>From b3703868d230f773378742a0deb4360f6cc7343a Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 21 Oct 2021 13:21:43 +0900
Subject: [PATCH v10 1/4] doc: Describe calculation method of streaming start
 for pg_receivewal

The documentation was imprecise about the fact that the current WAL
flush location is used if nothing can be found in the local archive
directory, independently of the compression used by each
segment (ZLIB or uncompressed).
---
 doc/src/sgml/ref/pg_receivewal.sgml | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 45b544cf49..9adbddb657 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -75,6 +75,29 @@ PostgreSQL documentation
one session available for the stream.
   
 
+  
+   The starting point of the write-ahead log streaming is calculated when
+   pg_receivewal starts:
+   
+
+ 
+  First, scan the directory where the WAL segment files are written and
+  find the newest completed segment, using as starting point the beginning
+  of the next WAL segment file. This is calculated independently of the
+  compression method used to compress each segment.
+ 
+
+
+
+ 
+  If a starting point cannot be calculated with the previous method,
+  the latest WAL flush location is used as reported by the server from
+  a IDENTIFY_SYSTEM command.
+ 
+
+   
+  
+
   
If the connection is lost, or if it cannot be initially established,
with a non-fatal error, pg_receivewal will
-- 
2.33.0

>From 067353a88803e96b295f425167c1195ccf984352 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 21 Oct 2021 14:23:41 +0900
Subject: [PATCH v10 2/4] Add READ_REPLICATION_SLOT

---
 doc/src/sgml/protocol.sgml  | 48 +++
 src/backend/replication/repl_gram.y | 16 +++-
 src/backend/replication/repl_scanner.l  |  1 +
 src/backend/replication/walsender.c | 93 +
 src/include/nodes/nodes.h   |  1 +
 src/include/nodes/replnodes.h   | 11 +++
 src/test/recovery/t/001_stream_rep.pl   | 32 ++-
 src/test/recovery/t/006_logical_decoding.pl | 11 ++-
 src/tools/pgindent/typedefs.list|  1 +
 9 files changed, 211 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index b95cc88599..37c26ec8ae 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2067,6 +2067,54 @@ The commands accepted in replication mode are:
 
   
 
+  
+READ_REPLICATION_SLOT slot_name
+  READ_REPLICATION_SLOT
+
+
+ 
+  Read some information associated to a replication slot. Returns a tuple
+  with NULL values if the replication slot does not
+  exist. This command is currently only supported for physical replication
+  slots.
+ 
+ 
+  In response to this command, the server will return a one-row result set,
+  containing the following fields:
+  
+   
+slot_type (text)
+
+ 
+  The replication slot's type, either physical or
+  NULL.
+ 
+
+   
+
+   
+restart_lsn (text)
+
+ 
+  The replication slot's restart_lsn.
+ 
+
+   
+
+   
+restart_tli (int4)
+
+ 
+  The timeline ID associated to restart_lsn,
+  following the current timeline history.
+ 
+
+   
+  
+ 
+
+  
+
   
 START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ]
  START_REPLICATION
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 126380e2df..dcb1108579 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -64,6 +64,7 @@ static SQLCmd *make_sqlcmd(void);
 /* Keyword tokens. */
 %token K_BASE_BACKUP
 %token K_IDENTIFY_SYSTEM
+%token K_READ_REPLICATION_SLOT
 %token K_SHOW
 %token K_START_REPLICATION
 %token K_CREATE_REPLICATION_SLOT
@@ -94,7 +95,7 @@ static SQLCmd *make_sqlcmd(void);
 %type 	command
 %type 	base_bac

Re: LogicalChanges* and LogicalSubxact* wait events are never reported

2021-10-21 Thread Amit Kapila
On Wed, Oct 20, 2021 at 3:46 PM Masahiro Ikeda  wrote:
>
> On 2021/10/20 18:17, Amit Kapila wrote:
> > On Wed, Oct 20, 2021 at 10:50 AM Michael Paquier  
> > wrote:
> >>
> >> On Wed, Oct 20, 2021 at 02:12:20PM +0900, Masahiro Ikeda wrote:
> >>> If my understanding is right, it's better to remove them since they make
> >>> users confused. Please see the attached patch. I confirmed that make
> >>> check-world passes all tests.
> >>
> >> Yeah, I don't see the point in keeping these events around if they are
> >> not used.  Perhaps Amit has some plans for them, though.
> >>
> >
> > No, there is no plan for these. As far as I remember, during
> > development, we decided to use the BufFile interface and forgot to
> > remove these events, which were required by previous versions of
> > the patch. I'll take care of this.
> >
> > Thanks for the report and patch!
> Thanks for your replies and for handling it!
>

Pushed!

-- 
With Regards,
Amit Kapila.




Re: relation OID in ReorderBufferToastReplace error message

2021-10-21 Thread Amit Kapila
On Mon, Oct 18, 2021 at 3:59 PM Amit Kapila  wrote:
>
> On Fri, Oct 15, 2021 at 3:40 AM Bossart, Nathan  wrote:
> >
> > On 9/23/21, 11:26 AM, "Alvaro Herrera"  wrote:
> > > On 2021-Sep-23, Jeremy Schneider wrote:
> > >
> > >> On 9/22/21 20:11, Amit Kapila wrote:
> > >> >
> > >> > On Thu, Sep 23, 2021 at 3:06 AM Jeremy Schneider  
> > >> > wrote:
> > >> >>
> > >> >> Any chance of back-patching this?
> > >> >
> > >> > Normally, we don't back-patch code improvements unless they fix some
> > >> > bug or avoid future back-patch efforts. So, I am not inclined to
> > >> > back-patch this but if others also feel strongly about this then we
> > >> > can consider it.
> > >>
> > >> The original thread about the logical replication bugs spawned a few
> > >> different threads and code changes. The other code changes coming out of
> > >> those threads were all back-patched, but I guess I can see arguments
> > >> both ways on this one.
> > >
> > > I think that for patches that are simple debugging aids we do
> > > backpatch, with the intent to get them deployed in users' systems as
> > > soon and as widely as possible.  I did that in this one, for example
> >
> > +1 for back-patching
> >
>
> I can take care of backpatching this in the next few days unless there
> is any objection.
>

Done.

-- 
With Regards,
Amit Kapila.




Re: Data is copied twice when specifying both child and parent table in publication

2021-10-21 Thread Amit Kapila
On Wed, Oct 20, 2021 at 7:11 PM Greg Nancarrow  wrote:
>
> On Wed, Oct 20, 2021 at 9:19 PM Amit Kapila  wrote:
> >
> > I don't see why data need to be replicated again even in that case.
> > Can you see any such duplicate data replicated for non-partitioned
> > tables?
> >
>
> If my example is slightly modified to use the same-named tables on the
> subscriber side, but without partitioning, i.e.:
>
> PUB:
>
> CREATE SCHEMA sch;
> CREATE SCHEMA sch1;
> CREATE TABLE sch.sale (sale_date date not null, country_code text,
> product_sku text, units integer) PARTITION BY RANGE (sale_date);
> CREATE TABLE sch1.sale_201901 PARTITION OF sch.sale FOR VALUES FROM
> ('2019-01-01') TO ('2019-02-01');
> CREATE TABLE sch1.sale_201902 PARTITION OF sch.sale FOR VALUES FROM
> ('2019-02-01') TO ('2019-03-01');
>
>
> SUB:
>
> CREATE SCHEMA sch;
> CREATE SCHEMA sch1;
> CREATE TABLE sch.sale (sale_date date not null, country_code text,
> product_sku text, units integer);
> CREATE TABLE sch1.sale_201901 (sale_date date not null, country_code
> text, product_sku text, units integer);
> CREATE TABLE sch1.sale_201902 (sale_date date not null, country_code
> text, product_sku text, units integer);
>
> then the INSERTed data on the publisher side gets replicated to the
> subscriber's "sch1.sale_201901" and "sch1.sale_201902" tables (only),
> depending on the date values.
> Now if the partitioned table is then added to the publication and
> ALTER SUBSCRIPTION ... REFRESH PUBLICATION done by the subscriber,
> then the current functionality is that the existing sch.sale data is
> replicated (only) to the subscriber's "sch.sale" table (even though
> data had been replicated previously to the "sch1.sale_201901" and
> "sch1.sale_201902" tables, only).
> So, just to be clear, you think that this current functionality isn't
> correct (i.e. no data should be replicated on the REFRESH in this
> case)?
>

Right, I don't think it is correct because it will behave differently
when the tables on the subscriber are partitioned. Also, the idea I
speculated in one of my above emails should be able to deal with this
case.

> I think it's debatable because here copy_data=true and sch.sale was
> not a previously-subscribed table (so pre-existing data in that table
> should be copied, in accordance with the current documentation).
>

What about the partition (child) table? In this case, the same data
will be present in two tables, sch.sale and sch1.sale_201901, after you
have refreshed the publication, and any future insertions will be
applied only to the parent table sch.sale, which doesn't sound
consistent. The bigger problem is that it will lead to duplicate
data when the tables are partitioned. I think if the user really wants to
do it in the way you are describing, there is no need to keep the sub-tables
(*_201901 and *_201902). I understand that it depends on the use case,
but we should also behave sanely when tables/partitions are created in
the same way on both publisher and subscriber, which I guess will most
likely be the case.

-- 
With Regards,
Amit Kapila.




Re: Added schema level support for publication.

2021-10-21 Thread Greg Nancarrow
On Thu, Oct 21, 2021 at 3:25 AM vignesh C  wrote:
>
> Attached v44 patch as the fixes for the same.
>

In the v44-0001 patch, I have some doubts about the condition guarding
the following code in pg_get_publication_tables():

+ if (schemarelids)
+ {
+/*
+ * If the publication publishes partition changes via their
+ * respective root partitioned tables, we must exclude
+ * partitions in favor of including the root partitioned
+ * tables. Otherwise, the function could return both the child
+ * and parent tables which could cause data of the child table
+ * to be double-published on the subscriber side.
+ *
+ * XXX As of now, we do this when a publication has associated
+ * schema or for all tables publication. See
+ * GetAllTablesPublicationRelations().
+ */
+tables = list_concat_unique_oid(relids, schemarelids);
+if (publication->pubviaroot)
+   tables = filter_partitions(tables);
+ }

Shouldn't a partition be filtered out only if publication->pubviaroot is
set and the partition belongs to a schema (i.e. ALL TABLES IN SCHEMA)
included in the publication?
The current code seems to filter out partitions of partitioned tables
included in the publication whenever ANY schema is included as part of
the publication (which could be a schema containing no partitioned
tables or partitions at all).

Regards,
Greg Nancarrow
Fujitsu Australia




Re: [PATCH] improve the pg_upgrade error message

2021-10-21 Thread Daniel Gustafsson
> On 14 Jul 2021, at 07:27, Suraj Kharage  
> wrote:

> Overall patch looks good to me.

Agreed, I think this is a good change and in line with how the check functions
work in general.

> Instead of giving suggestion about updating the pg_database catalog, can we 
> give "ALTER DATABASE  ALLOW_CONNECTIONS true;" command?

I would actually prefer not to give any suggestions at all; we typically don't
in these error messages.  Since there are many ways to do it (dropping the
database being one), I think leaving that to the user is in keeping with the
application's style.

> Also, it would be good if we give 2 spaces after full stop in an error 
> message.

Correct, fixed in the attached which also tweaks the language slightly to match
other errors.

I propose to commit the attached, which also adds a function comment while
there, unless there are objections.

--
Daniel Gustafsson   https://vmware.com/



0001-List-offending-databases-in-pg_upgrade-datallowconn-.patch
Description: Binary data


Re: Partial aggregates pushdown

2021-10-21 Thread Alexander Pyhalov

Tomas Vondra wrote on 2021-10-19 16:25:

On 10/19/21 08:56, Alexander Pyhalov wrote:

Hi.

Tomas Vondra wrote on 2021-10-15 17:56:

As for the proposed approach, it's probably good enough for the first
version to restrict this to aggregates where the aggregate result is
sufficient, i.e. we don't need any new export/import procedures.

But it's very unlikely we'd want to restrict it the way the patch does
it, i.e. based on aggregate name. That's both fragile (people can
create new aggregates with such a name) and against the PostgreSQL
extensibility (people may implement custom aggregates, but won't be
able to benefit from this just because of the name).

So for v0 maybe, but I think there needs to be a way to relax this in
some way, for example we could add a new flag to pg_aggregate to mark
aggregates supporting this.



Updated patch to mark aggregates as pushdown-safe in pg_aggregate.

So far I have no solution for aggregates with internal aggtranstype.


Hi. Updated patch.
Now aggregates with internal states can be pushed down if they are
marked as pushdown-safe (this flag is set to true for min/max/sum),
have internal states, and have associated converters. Converters are called
locally; they transform the aggregate result into its serialized internal
representation.
As converters don't have access to the internal aggregate state, partial
aggregates like avg() are still not pushable.


For now the overall logic is quite simple. We now also call
add_foreign_grouping_paths() for partial aggregation.  In
foreign_expr_walker() we check whether an aggregate is pushable (which means
that it is simple, marked as pushable and, if it has 'internal' as its
aggtranstype, has an associated converter).
If it is pushable, we proceed as with usual aggregates (but forbid HAVING
pushdown). During postgresGetForeignPlan() we produce a list of
converters for the aggregates. As converters have a different input argument
type from their result (bytea), we have to generate alternative
metadata, which is used by make_tuple_from_result_row().
If make_tuple_from_result_row() encounters a field with a converter, it
calls the converter and returns its result. For now we expect a converter to
have only one input argument and one output argument. Existing converters just
transform the input value to the internal representation and return its serialized form.
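
As an illustration, the kind of query this targets (just a sketch, with
hypothetical table names) is a partitionwise aggregate over postgres_fdw
partitions, where each remote server returns a partial aggregate instead of
raw rows:

-- assumes "sales" is partitioned and its partitions are foreign tables
SET enable_partitionwise_aggregate = on;
EXPLAIN (VERBOSE, COSTS OFF)
SELECT region, sum(amount) FROM sales GROUP BY region;
-- with the patch, each remote query can compute a partial sum(amount) per
-- region, and the local Finalize Aggregate combines the per-server results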


--
Best regards,
Alexander Pyhalov,
Postgres Professional
From 52cd61fdb5cb5fceeacd832462468d8676f57ca6 Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov 
Date: Thu, 14 Oct 2021 17:30:34 +0300
Subject: [PATCH] Partial aggregates push down

---
 contrib/postgres_fdw/deparse.c|  49 -
 .../postgres_fdw/expected/postgres_fdw.out| 185 -
 contrib/postgres_fdw/postgres_fdw.c   | 195 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql |  27 ++-
 src/backend/catalog/pg_aggregate.c|  28 ++-
 src/backend/commands/aggregatecmds.c  |  23 ++-
 src/backend/utils/adt/numeric.c   |  96 +
 src/bin/pg_dump/pg_dump.c |  21 +-
 src/include/catalog/catversion.h  |   2 +-
 src/include/catalog/pg_aggregate.dat  | 106 +-
 src/include/catalog/pg_aggregate.h|  10 +-
 src/include/catalog/pg_proc.dat   |   6 +
 src/test/regress/expected/oidjoins.out|   1 +
 13 files changed, 666 insertions(+), 83 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index d98bd666818..50ef1009b97 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -196,6 +196,7 @@ static bool is_subquery_var(Var *node, RelOptInfo *foreignrel,
 static void get_relation_column_alias_ids(Var *node, RelOptInfo *foreignrel,
 		  int *relno, int *colno);
 
+static bool partial_agg_ok(Aggref *agg);
 
 /*
  * Examine each qual clause in input_conds, and classify them into two groups,
@@ -831,8 +832,10 @@ foreign_expr_walker(Node *node,
 if (!IS_UPPER_REL(glob_cxt->foreignrel))
 	return false;
 
-/* Only non-split aggregates are pushable. */
-if (agg->aggsplit != AGGSPLIT_SIMPLE)
+if ((agg->aggsplit != AGGSPLIT_SIMPLE) && (agg->aggsplit != AGGSPLIT_INITIAL_SERIAL))
+	return false;
+
+if (agg->aggsplit == AGGSPLIT_INITIAL_SERIAL && !partial_agg_ok(agg))
 	return false;
 
 /* As usual, it must be shippable. */
@@ -3249,7 +3252,7 @@ deparseAggref(Aggref *node, deparse_expr_cxt *context)
 	bool		use_variadic;
 
 	/* Only basic, non-split aggregation accepted. */
-	Assert(node->aggsplit == AGGSPLIT_SIMPLE);
+	Assert(node->aggsplit == AGGSPLIT_SIMPLE || node->aggsplit == AGGSPLIT_INITIAL_SERIAL);
 
 	/* Check if need to print VARIADIC (cf. ruleutils.c) */
 	use_variadic = node->aggvariadic;
@@ -3719,3 +3722,43 @@ get_relation_column_alias_ids(Var *node, RelOptInfo *foreignrel,
 	/* Shouldn't get here */
 	elog(ERROR, "unexpected expression in subquery output");
 }
+
+/*
+ * Check that partial aggregate agg is fine to push down
+ */
+st

Re: [Bug] Logical Replication failing if the DateStyle is different in Publisher & Subscriber

2021-10-21 Thread Japin Li


On Thu, 21 Oct 2021 at 14:04, Dilip Kumar  wrote:
> On Thu, Oct 21, 2021 at 11:16 AM Masahiko Sawada  
> wrote:
>>
>> On Wed, Oct 20, 2021 at 8:12 PM Japin Li  wrote:
>> >
>> >
>> > On Mon, 18 Oct 2021 at 17:27, Dilip Kumar  wrote:
>> > > On Mon, Oct 18, 2021 at 1:41 PM Japin Li  wrote:
>> > >
>> > >> I attached v3 patch that set IntervalStyle to 'postgres' when the
>> > >> server backend is walsender, and this problem has gone.
>> > >
>> > >> I test that set IntervalStyle to 'sql_standard' on publisher and
>> > >> 'iso_8601' on subscriber, it works fine.
>> > >
>> > >> Please try v3 patch and let me know if they work as unexpected.
>> > >> Thanks in advance.
>> > >
>> > > I think the idea of setting the standard DateStyle and the
>> > > IntervalStyle on the walsender process looks fine to me.  As this will
>> > > avoid extra network round trips as Tom mentioned.
>> >
>> > After some test, I find we also should set the extra_float_digits to avoid
>> > precision lossing.
>>
>> Thank you for the patch!
>>
>> --- a/src/backend/postmaster/postmaster.c
>> +++ b/src/backend/postmaster/postmaster.c
>> @@ -2223,6 +2223,24 @@ retry1:
>> {
>> am_walsender = true;
>> am_db_walsender = true;
>> +
>> +   /*
>> +* Force assorted GUC parameters to settings that ensure
>> +* that we'll output data values in a form that is
>> +* unambiguous to the walreceiver.
>> +*/
>> +   port->guc_options = lappend(port->guc_options,
>> +  pstrdup("datestyle"));
>> +   port->guc_options = lappend(port->guc_options,
>> +  pstrdup("ISO"));
>> +   port->guc_options = lappend(port->guc_options,
>> +  pstrdup("intervalstyle"));
>> +   port->guc_options = lappend(port->guc_options,
>> +  pstrdup("postgres"));
>> +   port->guc_options = lappend(port->guc_options,
>> +  pstrdup("extra_float_digits"));
>> +   port->guc_options = lappend(port->guc_options,
>> +  pstrdup("3"));
>> }
>>
>> I'm concerned that it sets parameters too early since wal senders end
>> up setting the parameters regardless of logical decoding plugins. It
>> might be better to force the parameters within the plugin for logical
>> replication, pgoutput, in order to avoid affecting other plugins? On
>> the other hand, if we do so, we will need to handle table sync worker
>> cases separately since they copy data via COPY executed by the wal
>> sender process. For example, we can have table sync workers set the
>> parameters.
>
> You mean table sync worker to set over the replication connection
> right?  I think that was the first solution where normal workers, as
> well as table sync workers, were setting over the replication
> connection, but Tom suggested that setting on the walsender is a
> better option as we can avoid the network round trip.
>
> If we want to set it over the replication connection then do it for
> both as Japin's first patch is doing, otherwise, I am not seeing any
> big issue in setting it early in the walsender also.  I think it is
> good to let walsender always send in the standard format which can be
> understood by other node, no?

+1

I'm inclined to let walsender set the parameters.

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: [Bug] Logical Replication failing if the DateStyle is different in Publisher & Subscriber

2021-10-21 Thread Masahiko Sawada
On Thu, Oct 21, 2021 at 3:04 PM Dilip Kumar  wrote:
>
> On Thu, Oct 21, 2021 at 11:16 AM Masahiko Sawada  
> wrote:
> >
> > On Wed, Oct 20, 2021 at 8:12 PM Japin Li  wrote:
> > >
> > >
> > > On Mon, 18 Oct 2021 at 17:27, Dilip Kumar  wrote:
> > > > On Mon, Oct 18, 2021 at 1:41 PM Japin Li  wrote:
> > > >
> > > >> I attached v3 patch that set IntervalStyle to 'postgres' when the
> > > >> server backend is walsender, and this problem has gone.
> > > >
> > > >> I test that set IntervalStyle to 'sql_standard' on publisher and
> > > >> 'iso_8601' on subscriber, it works fine.
> > > >
> > > >> Please try v3 patch and let me know if they work as unexpected.
> > > >> Thanks in advance.
> > > >
> > > > I think the idea of setting the standard DateStyle and the
> > > > IntervalStyle on the walsender process looks fine to me.  As this will
> > > > avoid extra network round trips as Tom mentioned.
> > >
> > > After some test, I find we also should set the extra_float_digits to avoid
> > > precision lossing.
> >
> > Thank you for the patch!
> >
> > --- a/src/backend/postmaster/postmaster.c
> > +++ b/src/backend/postmaster/postmaster.c
> > @@ -2223,6 +2223,24 @@ retry1:
> > {
> > am_walsender = true;
> > am_db_walsender = true;
> > +
> > +   /*
> > +* Force assorted GUC parameters to settings that ensure
> > +* that we'll output data values in a form that is
> > +* unambiguous to the walreceiver.
> > +*/
> > +   port->guc_options = lappend(port->guc_options,
> > +  pstrdup("datestyle"));
> > +   port->guc_options = lappend(port->guc_options,
> > +  pstrdup("ISO"));
> > +   port->guc_options = lappend(port->guc_options,
> > +  pstrdup("intervalstyle"));
> > +   port->guc_options = lappend(port->guc_options,
> > +  pstrdup("postgres"));
> > +   port->guc_options = lappend(port->guc_options,
> > +  pstrdup("extra_float_digits"));
> > +   port->guc_options = lappend(port->guc_options,
> > +  pstrdup("3"));
> > }
> >
> > I'm concerned that it sets parameters too early since wal senders end
> > up setting the parameters regardless of logical decoding plugins. It
> > might be better to force the parameters within the plugin for logical
> > replication, pgoutput, in order to avoid affecting other plugins? On
> > the other hand, if we do so, we will need to handle table sync worker
> > cases separately since they copy data via COPY executed by the wal
> > sender process. For example, we can have table sync workers set the
> > parameters.
>
> You mean table sync worker to set over the replication connection
> right?  I think that was the first solution where normal workers, as
> well as table sync workers, were setting over the replication
> connection, but Tom suggested that setting on the walsender is a
> better option as we can avoid the network round trip.

Right.

BTW I think we can set the parameters from the subscriber side without
additional network round trips by specifying the "options" parameter
in the connection string, no?
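
Something like the following is what I mean (just a sketch, with hypothetical
connection details; the "options" keyword passes the settings at connection
time, so no extra round trips are needed):

CREATE SUBSCRIPTION sub
  CONNECTION 'host=pubhost dbname=postgres options=''-c datestyle=ISO -c intervalstyle=postgres -c extra_float_digits=3'''
  PUBLICATION pub1;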

> If we want to set it over the replication connection then do it for
> both as Japin's first patch is doing, otherwise, I am not seeing any
> big issue in setting it early in the walsender also.  I think it is
> good to let walsender always send in the standard format which can be
> understood by other node, no?

Yeah, probably the change on HEAD is fine, but I'm a bit concerned
about possible issues on back branches: if the user expects to get
date data in the style of the server's DateStyle setting via
pg_recvlogical, this change could break it.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [RFC] speed up count(*)

2021-10-21 Thread Joe Conway

On 10/20/21 2:33 PM, John Naylor wrote:


On Wed, Oct 20, 2021 at 2:23 PM Tomas Vondra  wrote:

 >
 > Couldn't we simply inspect the visibility map, use the index data only
 > for fully visible/summarized ranges, and inspect the heap for the
 > remaining pages? That'd still be a huge improvement for tables with only
 > a few pages modified recently, which is a pretty common case.
 >
 > I think the bigger issue is that people rarely do COUNT(*) on the whole
 > table. There are usually other conditions and/or GROUP BY, and I'm not
 > sure how would that work.

Right. My (possibly hazy) recollection is that people don't have quite 
as high an expectation for queries with more complex predicates and/or 
grouping. It would be interesting to see what the balance is.


I think you are exactly correct. People seem to understand that with a 
predicate it is harder, but they expect


 select count(*) from foo;

to be nearly instantaneous, and they don't really need it to be exact. 
The stock answer for that has been to do


 select reltuples from pg_class
 where relname = 'foo';

But that is unsatisfying because the problem is often with some 
benchmark or another that cannot be changed.


I'm sure this idea will be shot down in flames, but what if we had a
default "off" GUC which could be turned on, causing the former to be
transparently rewritten into the latter?


Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: Added schema level support for publication.

2021-10-21 Thread vignesh C
On Thu, Oct 21, 2021 at 3:29 PM Greg Nancarrow  wrote:
>
> On Thu, Oct 21, 2021 at 3:25 AM vignesh C  wrote:
> >
> > Attached v44 patch as the fixes for the same.
> >
>
> In the v44-0001 patch, I have some doubts about the condition guarding
> the following code in pg_get_publication_tables():
>
> + if (schemarelids)
> + {
> +/*
> + * If the publication publishes partition changes via their
> + * respective root partitioned tables, we must exclude
> + * partitions in favor of including the root partitioned
> + * tables. Otherwise, the function could return both the child
> + * and parent tables which could cause data of the child table
> + * to be double-published on the subscriber side.
> + *
> + * XXX As of now, we do this when a publication has associated
> + * schema or for all tables publication. See
> + * GetAllTablesPublicationRelations().
> + */
> +tables = list_concat_unique_oid(relids, schemarelids);
> +if (publication->pubviaroot)
> +   tables = filter_partitions(tables);
> + }
>
> Shouldn't a partition be filtered out only if publication->pubviaroot
> and the partition belongs to a schema (i.e. ALL TABLES IN SCHEMA)
> included in the publication?
> The current code seems to filter out partitions of partitioned tables
> included in the publication if ANY schemas are included as part of the
> publication (e.g. which could be a schema not including any
> partitioned tables or partitions).

I could reproduce the issue by using the following test:
--- Setup
create schema sch1;
create schema sch2;
create table sch1.tbl1 (a int) partition by range (a);
create table sch2.tbl1_part1 partition of sch1.tbl1 for values from (1) to (10);
create table sch2.tbl1_part2 partition of sch1.tbl1 for values from
(10) to (20);
create schema sch3;
create table sch3.t1(c1 int);

--- Publication
create publication pub1 for all tables in schema sch3, table
sch1.tbl1, table sch2.tbl1_part1 with ( publish_via_partition_root
=on);
insert into sch1.tbl1 values(1);
insert into sch1.tbl1 values(11);
insert into sch3.t1 values(1);

--- Subscription
CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres host=localhost
port=5432' PUBLICATION pub1;

The patch posted at [1] has the fix for the same.
[1] - 
https://www.postgresql.org/message-id/CALDaNm1onqBEr0WE_e7%3DCNw3bURfrGRmbMjX31d-nx3FGLS10A%40mail.gmail.com

Regards,
Vignesh




Re: ThisTimeLineID can be used uninitialized

2021-10-21 Thread Alvaro Herrera
On 2021-Oct-21, Michael Paquier wrote:

> There is already an assumption in walsender.c where an invalid
> timeline is 0, by the way?  See sendTimeLineNextTLI and sendTimeLine.
> Asserting here and there looks like a good thing to do for code paths
> where the timeline should, or should not, be set.

Sure, but as Robert suggested, let's make that value a known and obvious
constant InvalidTimeLineId rather than magic value 0.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)




wait event and archive_command

2021-10-21 Thread Fujii Masao

Hi,

I'd like to propose adding a new wait event that is reported while the archiver
process is executing archive_command. This would be helpful to observe
what the archiver is doing and to check whether it has run into trouble.
Thoughts? PoC patch attached.

Also how about adding wait events for other commands like
archive_cleanup_command, restore_command and recovery_end_command?
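
With the PoC applied, the new wait event should be visible like any other one,
e.g. (a sketch; it assumes the archiver row shows up in pg_stat_activity with
backend_type = 'archiver'):

SELECT pid, backend_type, wait_event_type, wait_event
  FROM pg_stat_activity
 WHERE backend_type = 'archiver';
-- while archive_command runs, wait_event should show 'ArchiveCommand' (IPC class)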

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3173ec2566..13c7e991d0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1569,7 +1569,12 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
   Waiting for subplan nodes of an Append plan
node to be ready.
  
-
+ 
+  ArchiveCommand
+  Waiting for  to
+   complete.
+ 
+ 
   BackendTermination
   Waiting for the termination of another backend.
  
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c4d0..2201f069ea 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -515,7 +515,10 @@ pgarch_archiveXlog(char *xlog)
snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog);
set_ps_display(activitymsg);
 
+   pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
rc = system(xlogarchcmd);
+   pgstat_report_wait_end();
+
if (rc != 0)
{
/*
diff --git a/src/backend/utils/activity/wait_event.c 
b/src/backend/utils/activity/wait_event.c
index 4a5b7502f5..d8e3688ab5 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -313,6 +313,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_APPEND_READY:
event_name = "AppendReady";
break;
+   case WAIT_EVENT_ARCHIVE_COMMAND:
+   event_name = "ArchiveCommand";
+   break;
case WAIT_EVENT_BACKEND_TERMINATION:
event_name = "BackendTermination";
break;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index c22142365f..2edfdd3e51 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -80,6 +80,7 @@ typedef enum
 typedef enum
 {
WAIT_EVENT_APPEND_READY = PG_WAIT_IPC,
+   WAIT_EVENT_ARCHIVE_COMMAND,
WAIT_EVENT_BACKEND_TERMINATION,
WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE,
WAIT_EVENT_BGWORKER_SHUTDOWN,


Re: [Bug] Logical Replication failing if the DateStyle is different in Publisher & Subscriber

2021-10-21 Thread Japin Li


On Thu, 21 Oct 2021 at 19:54, Masahiko Sawada  wrote:
> On Thu, Oct 21, 2021 at 3:04 PM Dilip Kumar  wrote:
>>
>> On Thu, Oct 21, 2021 at 11:16 AM Masahiko Sawada  
>> wrote:
>> >
>> > On Wed, Oct 20, 2021 at 8:12 PM Japin Li  wrote:
>> > >
>> > >
>> > > On Mon, 18 Oct 2021 at 17:27, Dilip Kumar  wrote:
>> > > > On Mon, Oct 18, 2021 at 1:41 PM Japin Li  wrote:
>> > > >
>> > > >> I attached v3 patch that set IntervalStyle to 'postgres' when the
>> > > >> server backend is walsender, and this problem has gone.
>> > > >
>> > > >> I test that set IntervalStyle to 'sql_standard' on publisher and
>> > > >> 'iso_8601' on subscriber, it works fine.
>> > > >
>> > > >> Please try v3 patch and let me know if they work as unexpected.
>> > > >> Thanks in advance.
>> > > >
>> > > > I think the idea of setting the standard DateStyle and the
>> > > > IntervalStyle on the walsender process looks fine to me.  As this will
>> > > > avoid extra network round trips as Tom mentioned.
>> > >
>> > > After some test, I find we also should set the extra_float_digits to 
>> > > avoid
>> > > precision lossing.
>> >
>> > I'm concerned that it sets parameters too early since wal senders end
>> > up setting the parameters regardless of logical decoding plugins. It
>> > might be better to force the parameters within the plugin for logical
>> > replication, pgoutput, in order to avoid affecting other plugins? On
>> > the other hand, if we do so, we will need to handle table sync worker
>> > cases separately since they copy data via COPY executed by the wal
>> > sender process. For example, we can have table sync workers set the
>> > parameters.
>>
>> You mean table sync worker to set over the replication connection
>> right?  I think that was the first solution where normal workers, as
>> well as table sync workers, were setting over the replication
>> connection, but Tom suggested that setting on the walsender is a
>> better option as we can avoid the network round trip.
>
> Right.
>
> BTW I think we can set the parameters from the subscriber side without
> additional network round trips by specifying the "options" parameter
> in the connection string, no?
>

Yes, we can.  However, each client would then have to worry about the
datestyle setting; IMO that is tedious.

>> If we want to set it over the replication connection then do it for
>> both as Japin's first patch is doing, otherwise, I am not seeing any
>> big issue in setting it early in the walsender also.  I think it is
>> good to let walsender always send in the standard format which can be
>> understood by other node, no?
>
> Yeah, probably the change on HEAD is fine but I'm a bit concerned
> about possible issues on back branches like if the user expects to get
> date data in the style of DateStyle setting on the server via
> pg_recvlogical, this change could break it.
>

How does it break?  The user can also specify "options" to get date data
in the style they want. Right?

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: lastOverflowedXid does not handle transaction ID wraparound

2021-10-21 Thread Stan Hu
On Wed, Oct 20, 2021 at 9:01 PM Kyotaro Horiguchi
 wrote:
>
> lastOverflowedXid is the smallest subxid that possibly exists but is
> possibly not known to the standby. So if all top-level transactions
> older than lastOverflowedXid end, that means that all the
> subtransactions in doubt are known to have ended.

Thanks for the patch! I verified that it appears to reset
lastOverflowedXid properly.

I may not be understanding
https://github.com/postgres/postgres/blob/dc899146dbf0e1d23fb24155a5155826ddce34c9/src/backend/storage/ipc/procarray.c#L1326-L1327
correctly, but isn't lastOverflowedXid the last subxid for a given
top-level XID, so isn't it actually the largest subxid that possibly
exists?




Re: Assorted improvements in pg_dump

2021-10-21 Thread Justin Pryzby
On Wed, Oct 20, 2021 at 05:14:45PM -0400, Tom Lane wrote:
> Lastly, patch 0003 addresses the concern I raised at [3] that it's
> unsafe to call pg_get_partkeydef() and pg_get_expr(relpartbound)
> in getTables().  Looking closer I realized that we can't cast
> pg_class.reloftype to regtype at that point either, since regtypeout
> is going to notice if the type has been concurrently dropped.
> 
> In [3] I'd imagined that we could just delay those calls to a second
> query in getTables(), but that doesn't work at all: if we apply
> these functions to every row of pg_class, we still risk failure
> against any relation that we didn't lock.  So there seems little
> alternative but to push these functions out to secondary queries
> executed later.
> 
> Arguably, 0003 is a bug fix that we should consider back-patching.
> However, I've not heard field reports of the problems it fixes,
> so maybe there's no need to bother.

> [3] https://www.postgresql.org/message-id/1462940.1634496313%40sss.pgh.pa.us

FYI, I see this issue happen in production environment.

Grepping logfiles on ~40 servers, I see it happened at least 3 times since
October 1.

Our backup script is probably particularly sensitive to this: it separately
dumps each "old" partition once and for all.  We're more likely to hit this
since pg_dump is called in a loop.

I never reported it, since I think it's a documented issue, and it's no great
problem, so long as it runs the next day.  But it'd be a pain to find that the
backup was incomplete when we needed it.  Also, I use the backups to migrate to
new servers, and it would be a pain to start the job at a calculated time
expecting it to finish at the beginning of a coordinated maintenance window,
but then discover that it had failed and needs to be rerun with fingers
crossed.

On Sun, Oct 17, 2021 at 02:45:13PM -0400, Tom Lane wrote:
> While poking at pg_dump for some work I'll show later, I grew quite
> unhappy with the extent to which people have ignored this advice
> given near the head of getTables():
> 
>  * Note: in this phase we should collect only a minimal amount of
>  * information about each table, basically just enough to decide if it is
>  * interesting. We must fetch all tables in this phase because otherwise
>  * we cannot correctly identify inherited columns, owned sequences, etc.
> 
> Far from collecting "a minimal amount of information", we have
> repeatedly stuffed extra joins and expensive sub-selects into this
> query.  That's fairly inefficient if we aren't interested in dumping
> every table, but it's much worse than that: we are collecting all this
> info *before we have lock* on the tables.  That means we are at
> serious risk of data skew in any place where we consult server-side
> functions, eg pg_get_partkeydef().  For example:
> 
> regression=# create table foo(f1 int) partition by range(f1);
> CREATE TABLE
> regression=# begin; drop table foo;
> BEGIN
> DROP TABLE
> 
> Now start a pg_dump in another shell session, wait a second or two,
> and
> 
> regression=*# commit;
> COMMIT
> 
> and the pg_dump blows up:
> 
> $ pg_dump -s regression
> pg_dump: error: query failed: ERROR:  could not open relation with OID 37796
...




Re: [Bug] Logical Replication failing if the DateStyle is different in Publisher & Subscriber

2021-10-21 Thread Tom Lane
Japin Li  writes:
> On Thu, 21 Oct 2021 at 19:54, Masahiko Sawada  wrote:
>> BTW I think we can set the parameters from the subscriber side without
>> additional network round trips by specifying the "options" parameter
>> in the connection string, no?

> Yes, we can.  However, each client should be concerned the style for
> datestyle, IMO it is boring.

There's another issue here: the subscriber can run user-defined code
(in triggers), while AFAIK the sender cannot.  People might be surprised
if their triggers run with a datestyle setting different from the
database's prevailing setting.  So while I think it should be okay
to set-and-forget the datestyle on the sender side, we could not get
away with that in the subscriber.  We'd have to set and unset for
each row, much as (e.g.) postgres_fdw has to do.

regards, tom lane




Re: Assorted improvements in pg_dump

2021-10-21 Thread Tom Lane
Justin Pryzby  writes:
> On Wed, Oct 20, 2021 at 05:14:45PM -0400, Tom Lane wrote:
>> Arguably, 0003 is a bug fix that we should consider back-patching.
>> However, I've not heard field reports of the problems it fixes,
>> so maybe there's no need to bother.

> FYI, I see this issue happen in production environment.
> I never reported it, since I think it's a documented issue, and it's no great
> problem, so long as it runs the next day.  But it'd be a pain to find that the
> backup was incomplete when we needed it.  Also, I use the backups to migrate 
> to
> new servers, and it would be a pain to start the job at a calculated time
> expecting it to finish at the beginning of a coordinated maintenance window,
> but then discover that it had failed and needs to be rerun with fingers
> crossed.

Yeah, if you're dropping tables all the time, pg_dump is going to have
a problem with that.  The fix I'm suggesting here would only ensure
that you get a "clean" failure at the LOCK TABLE command --- but from
an operational standpoint, that's little improvement.

The natural response is to consider retrying the whole dump after a lock
failure.  I'm not sure if it'd be practical to do that within pg_dump
itself, as opposed to putting a loop in whatever script calls it.

regards, tom lane




Re: [PATCH] Fix memory corruption in pg_shdepend.c

2021-10-21 Thread Alvaro Herrera
On 2021-Oct-21, Michael Paquier wrote:

> On Wed, Oct 20, 2021 at 09:19:51AM -0300, Alvaro Herrera wrote:
> > Ouch ... this means that pg_shdepends contents are broken for databases
> > created with 14.0?  hmm ... yes.
> 
> Yes, it means so :(

For the upcoming release notes in 14.1 I think we'd do well to document
how to find out if you're affected by this; and if you are, how to fix
it.

I suppose pg_describe_object can be used on the contents of pg_shdepend
to detect it.  I'm less sure what to do to correct it -- delete the
bogus entries and regenerate them with some bulk query?
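
For instance, something along these lines (a sketch; it only checks role
references, which are the common case, and needs superuser to read pg_authid)
might flag suspect rows for the current database:

SELECT s.*
  FROM pg_shdepend s
  LEFT JOIN pg_authid a ON a.oid = s.refobjid
 WHERE s.dbid = (SELECT oid FROM pg_database WHERE datname = current_database())
   AND s.refclassid = 'pg_authid'::regclass
   AND a.oid IS NULL;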

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Ninguna manada de bestias tiene una voz tan horrible como la humana" (Orual)




Re: wait event and archive_command

2021-10-21 Thread Bharath Rupireddy
On Thu, Oct 21, 2021 at 7:28 PM Fujii Masao  wrote:
>
> Hi,
>
> I'd like to propose to add new wait event reported while archiver process
> is executing archive_command. This would be helpful to observe
> what archiver is doing and check whether it has some troubles or not.
> Thought? PoC patch attached.
>
> Also how about adding wait events for other commands like
> archive_cleanup_command, restore_command and recovery_end_command?

+1 for the wait event.

The following activitymsg that are being set to ps display in
XLogFileRead and pgarch_archiveXlog have come up for one of our
internal team discussions recently:

snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
 xlogfname);
set_ps_display(activitymsg);

snprintf(activitymsg, sizeof(activitymsg), "recovering %s",
 xlogfname);
set_ps_display(activitymsg);

snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog);
set_ps_display(activitymsg);

The ps display info might be useful if we run postgres on a stand-alone
box and someone is monitoring the ps output, but it doesn't help
debugging after an issue has occurred. How about having the following
statements, which would be useful for someone looking at the server logs
to know what was/is happening during recovery and archiving? IMO, we
should also have the elog statements.

elog(LOG, "waiting for %s", xlogfname);
elog(LOG, "recovering %s"", xlogfname);
elog(LOG, "archiving %s", xlog);

Another idea could be to have a hook emitting the above info to
outside components, but a hook just for this purpose isn't a great
idea IMO.

Thoughts?

Regards,
Bharath Rupireddy.




Re: [Bug] Logical Replication failing if the DateStyle is different in Publisher & Subscriber

2021-10-21 Thread Japin Li


On Thu, 21 Oct 2021 at 22:46, Tom Lane  wrote:
> Japin Li  writes:
>> On Thu, 21 Oct 2021 at 19:54, Masahiko Sawada  wrote:
>>> BTW I think we can set the parameters from the subscriber side without
>>> additional network round trips by specifying the "options" parameter
>>> in the connection string, no?
>
>> Yes, we can.  However, each client should be concerned the style for
>> datestyle, IMO it is boring.
>
> There's another issue here: the subscriber can run user-defined code
> (in triggers), while AFAIK the sender cannot.

Sorry, I'm not sure about this.  Could you give me an example?

> People might be surprised
> if their triggers run with a datestyle setting different from the
> database's prevailing setting.  So while I think it should be okay
> to set-and-forget the datestyle on the sender side, we could not get
> away with that in the subscriber.  We'd have to set and unset for
> each row, much as (e.g.) postgres_fdw has to do.
>

Yeah! As Masahiko said, we can avoid the network round trips by specifying
the "options" parameter in the connection string.

If this way is more accepted, I'll update the patch later.

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: [Bug] Logical Replication failing if the DateStyle is different in Publisher & Subscriber

2021-10-21 Thread Tom Lane
Japin Li  writes:
> On Thu, 21 Oct 2021 at 22:46, Tom Lane  wrote:
>> There's another issue here: the subscriber can run user-defined code
>> (in triggers), while AFAIK the sender cannot.

> Sorry, I'm not sure about this.  Could you give me an example?

If you're doing logical replication into a table that has triggers,
the replication worker has to execute those triggers.

regards, tom lane




Re: [Bug] Logical Replication failing if the DateStyle is different in Publisher & Subscriber

2021-10-21 Thread Japin Li


On Thu, 21 Oct 2021 at 23:10, Tom Lane  wrote:
> Japin Li  writes:
>> On Thu, 21 Oct 2021 at 22:46, Tom Lane  wrote:
>>> There's another issue here: the subscriber can run user-defined code
>>> (in triggers), while AFAIK the sender cannot.
>
>> Sorry, I'm not sure about this.  Could you give me an example?
>
> If you're doing logical replication into a table that has triggers,
> the replication worker has to execute those triggers.
>

Does that mean we should use the subscriber's settings to set the
replication parameters (e.g. datestyle)?  If we do this, it might
lose precision (for example: extra_float_digits on the publisher is 3
and on the subscriber is -4); is this accepted?

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: [PATCH] Fix memory corruption in pg_shdepend.c

2021-10-21 Thread Tom Lane
Alvaro Herrera  writes:
> I suppose pg_describe_object can be used on the contents of pg_shdepend
> to detect it.  I'm less sure what to do to correct it -- delete the
> bogus entries and regenerate them with some bulk query?

Seems that what copyTemplateDependencies wants to do can easily be
modeled by a SQL query, assuming you know which DB was cloned to
which other one, and that the source's shdeps didn't change since
then.  However, I'm not sure how we can get rid of existing bogus
entries, especially if we'd like to preserve not-bogus ones
(which very likely have gotten added to the destination DB since
it was created).
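
Roughly this sort of thing, I mean (a sketch with hypothetical database names,
written as a plain SELECT of the rows copyTemplateDependencies would have
produced, not as something to feed back into the catalog):

SELECT (SELECT oid FROM pg_database WHERE datname = 'newdb') AS dbid,
       classid, objid, objsubid, refclassid, refobjid, deptype
  FROM pg_shdepend
 WHERE dbid = (SELECT oid FROM pg_database WHERE datname = 'srcdb');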

On the whole I'm afraid that people messing with this manually are
likely to do more harm than good.  pg_shdepend entries that don't
match any object probably won't cause a problem, and the lack of
protection against untimely dropping a role is unlikely to be much
of an issue for a role you're referencing in a template database.
So I suspect that practical issues will be rare.  We're fortunate
that cloning a nonempty template database is rare already.

BTW, I think there is an additional bug in copyTemplateDependencies:
I do not see it initializing slot->tts_isnull[] anywhere.  It
probably accidentally works (at least in devel builds) because we zero
that memory somewhere else, but surely this code shouldn't assume that?

regards, tom lane




Re: Adding CI to our tree

2021-10-21 Thread Matthias van de Meent
On Sat, 2 Oct 2021 at 17:05, Tom Lane  wrote:
>
> Andres Freund  writes:
> > It's not like this forces you to use cirrus or anything. For people that 
> > don't
> > want to use CI, It'll make cfbot a bit more effective (because people can
> > adjust what it tests as appropriate for $patch), but that's it.

I don't disagree on that part, but I fail to see what makes the
situations of an unused CI config file in the tree and an unused
`/.idea/` or `/.vs/` specifier in the .gitignore [0][1] distinct
enough for it to be resolved differently. Both are quality-of-life
additions for those that use that tool, while non-users of that tool
can ignore those configuration entries.

> Yeah.  I cannot see any reason to object to Andres' 0002 patch: you can
> just ignore those files if you don't want to use cirrus.  It does set a
> precedent that we'd also accept infrastructure for other CI systems,
> but as long as they're similarly noninvasive, why not?  (Maybe there
> needs to be one more directory level though, ie ci/cirrus/whatever.
> I don't want to end up with one toplevel directory per CI platform.)

With the provided arguments I won't object to the addition of these
config files, but I would appreciate it if a clear policy could be
provided on the inclusion of configurations for external tools that
are not expected to be used by all users of the repository, such as
CI, editors and IDEs.

Kind regards,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/flat/OS3PR01MB71593D78DD857C2BBA9FB824F2A69%40OS3PR01MB7159.jpnprd01.prod.outlook.com
[1] 
https://www.postgresql.org/message-id/flat/15BFD11D-5D72-46B2-BDB1-2DF4E049C13D%40me.com




Re: [PATCH] Fix memory corruption in pg_shdepend.c

2021-10-21 Thread David G. Johnston
On Thu, Oct 21, 2021 at 8:52 AM Tom Lane  wrote:

> We're fortunate
> that cloning a nonempty template database is rare already.
>
>
That, and a major use case for doing so is to quickly stage up testing data
in a new database (i.e., not a production use case).  Though I could see
tenant-based products using this to bootstrap new clients, I'd hope that is
a minority case.

David J.


Re: Adding CI to our tree

2021-10-21 Thread Tom Lane
Matthias van de Meent  writes:
> I don't disagree on that part, but I fail to see what makes the
> situations of an unused CI config file in the tree and an unused
> `/.idea/` or `/.vs/` specifier in the .gitignore [0][1] distinct
> enough for it to be resolved differently. Both are quality-of-life
> additions for those that use that tool, while non-users of that tool
> can ignore those configuration entries.

Um ... I don't see a connection at all.  One is talking about files
we put into the git tree, and one is talking about files that are
*not* in the tree.

We do have a policy that files that are created by a supported build
process should be .gitignore'd, so that might lead to more .gitignore
entries as this idea moves ahead.  I'm not on board though with the
idea of .gitignore'ing anything that anybody anywhere thinks is junk.
That's more likely to lead to conflicts and mistakes than anything
useful.  We expect developers to have personal excludesfile lists
that block off editor backup files and other cruft from the tools
that they personally use but are not required by the build.

regards, tom lane




Re: Adding CI to our tree

2021-10-21 Thread Andreas Karlsson

On 10/21/21 5:55 PM, Matthias van de Meent wrote:

On Sat, 2 Oct 2021 at 17:05, Tom Lane  wrote:


Andres Freund  writes:

It's not like this forces you to use cirrus or anything. For people that don't
want to use CI, It'll make cfbot a bit more effective (because people can
adjust what it tests as appropriate for $patch), but that's it.


I don't disagree on that part, but I fail to see what makes the
situations of an unused CI config file in the tree and an unused
`/.idea/` or `/.vs/` specifier in the .gitignore [0][1] distinct
enough for it to be resolved differently. Both are quality-of-life
additions for those that use that tool, while non-users of that tool
can ignore those configuration entries.


There is a better solution to that. Just add those files to the global 
gitignore on your machine. You will want to ignore those files in all 
git repositories on your machine anyway. On the other hand the 
configuration files for the CI are relevant to just the PostgreSQL repo.


Andreas




Re: Drop replslot after pgstat_shutdown cause assert coredump

2021-10-21 Thread Fujii Masao




On 2021/10/11 22:15, Greg Nancarrow wrote:

On Mon, Oct 11, 2021 at 6:55 PM houzj.f...@fujitsu.com
 wrote:


I can see the walsender tried to release a not-quite-ready replication slot
that was created when creating a subscription. But pgstat had been shut down
before invoking ReplicationSlotRelease().

The stack is as follows:

#2  in ExceptionalCondition (pgstat_is_initialized && !pgstat_is_shutdown)
#3  in pgstat_assert_is_up () at pgstat.c:4852
#4  in pgstat_send (msg=msg@entry=0x7ffe716f7470, len=len@entry=144) at 
pgstat.c:3075
#5  in pgstat_report_replslot_drop (slotname=slotname@entry=0x7fbcf57a3c98 
"sub") at pgstat.c:1869
#6  in ReplicationSlotDropPtr (slot=0x7fbcf57a3c80) at slot.c:696
#7  in ReplicationSlotDropAcquired () at slot.c:585
#8  in ReplicationSlotRelease () at slot.c:482
#9  in ProcKill (code=, arg=) at proc.c:852
#10 in shmem_exit (code=code@entry=0) at ipc.c:272
#11 in proc_exit_prepare (code=code@entry=0) at ipc.c:194
#12 in proc_exit (code=code@entry=0) at ipc.c:107
#13 in ProcessRepliesIfAny () at walsender.c:1807
#14 in WalSndWaitForWal (loc=loc@entry=22087632) at walsender.c:1417
#15 in logical_read_xlog_page (state=0x2f8c600, targetPagePtr=22085632,
 reqLen=, targetRecPtr=, cur_page=0x2f6c1e0 "\016\321\005") at 
walsender.c:821
#16 in ReadPageInternal (state=state@entry=0x2f8c600,
 pageptr=pageptr@entry=22085632, reqLen=reqLen@entry=2000) at 
xlogreader.c:667
#17 in XLogReadRecord (state=0x2f8c600,
 errormsg=errormsg@entry=0x7ffe716f7f98) at xlogreader.c:337
#18 in DecodingContextFindStartpoint (ctx=ctx@entry=0x2f8c240)
 at logical.c:606
#19 in CreateReplicationSlot (cmd=cmd@entry=0x2f1aef0)

Is this behavior expected?



I'd say it's not!


Yes. I think this is a bug.



Just looking at the stacktrace, I'm thinking that the following commit
may have had a bearing on this problem, by causing pgstat to be
shutdown earlier:

commit fb2c5028e63589c01fbdf8b031be824766dc7a68
Author: Andres Freund 
Date:   Fri Aug 6 10:05:57 2021 -0700

 pgstat: Schedule per-backend pgstat shutdown via before_shmem_exit().


Can you see if the problem can be reproduced prior to this commit?


Even prior to the commit, pgstat_shutdown_hook() can be called
before ProcKill() at the backend exit, so ISTM that the problem can
be reproduced.

Probably we need to make sure that pgstat_shutdown_hook() is called
after ProcKill(), e.g., by registering pgstat_shutdown_hook() into
on_proc_exit_list (I'm not sure if this change is safe, though).
Or maybe pgstat logic for replication slot drop needs to be overhauled.
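
A sketch of the first idea, just to illustrate the registration change (not a
tested patch; it assumes pgstat_shutdown_hook() keeps the usual
(int code, Datum arg) callback shape):

  /*
   * on_proc_exit callbacks run from proc_exit() only after shmem_exit()
   * has finished, i.e. after ProcKill(), so the slot release would no
   * longer find pgstat already shut down.
   */
  -	before_shmem_exit(pgstat_shutdown_hook, 0);
  +	on_proc_exit(pgstat_shutdown_hook, 0);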

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pgstat_assert_is_up() can fail in walsender

2021-10-21 Thread Fujii Masao




On 2021/10/19 22:14, Amit Langote wrote:

Hi,

I can (almost) consistently reproduce $subject by executing the
attached shell script, which I was using while constructing a test
case for another thread.


This seems the same issue that was reported at the thread [1].

[1]
https://www.postgresql.org/message-id/os0pr01mb571621b206eeb17d8ab171f094...@os0pr01mb5716.jpnprd01.prod.outlook.com

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: XTS cipher mode for cluster file encryption

2021-10-21 Thread Stephen Frost
Greetings,

* Sasasu (i...@sasa.su) wrote:
> On 2021/10/20 20:24, Stephen Frost wrote:
> > PG does have a block-based IO API, it's just not exposed as hooks.  In
> > particular, take a look at md.c, though perhaps you'd be more interested
> > in the higher level bufmgr.c routines.  For the specific places where
> > encryption may hook in, looking at the DataChecksumsEnabled() call sites
> > may be informative (there aren't that many of them).
> 
> md.c is great, easy to understand, but PG does not have a unified API. There
> are many unexpected pread()/pwrite() calls in many corners. md.c is only for
> heap tables, bufmgr.c only for buffered heap tables.

There's certainly other calls out there, yes.

> eg: XLogWrite() looks like a block API, but it is a range write, equivalent
> to append(2).

XLog is certainly another thing that has to be dealt with, of course,
but I don't see us trying to shoehorn that into using md.c somehow.

> eg: ALTER DATABASE ... SET TABLESPACE, the movedb() call, uses copy_file() on
> heap tables, which is just pread()/pwrite() with 8*BLCKSZ.
> eg: all front-end tools use pread() to read heap tables; in particular,
> pg_rewind writes heap tables by offset.
> eg: in contrib, pg_standby uses system("cp") to copy WAL.

None of these are actually working with or changing the data though,
they're just copying it.  I don't think we'd actually want these to
decrypt and reencrypt the data.

> On 2021/10/20 20:24, Stephen Frost wrote:
> > Breaking our ondisk format explicitly means that pg_upgrade won't work
> > any longer and folks won't be able to do in-place upgrades. That's a
> > pretty huge deal and it's something we've not done in over a decade.
> > I doubt that's going to fly.
> 
> I completely agree.

Great.

> On 2021/10/20 20:24, Stephen Frost wrote:
> > Yes, using another fork for this is something that's been considered but
> > it's not without its own drawbacks, in particular having to do another
> > write and later fsync when a page changes.
> >
> > Further, none of this is necessary for XTS, but only for GCM. This is
> > why it was put forward that GCM involves a lot more changes to the
> > system and means that we won't be able to do things like binary
> > replication to switch from an unencrypted to encrypted cluster. Those
> > are good reasons to consider an XTS implementation first and then later,
> > perhaps, implement GCM.
> 
> Same as Robert Haas: I wish PG could do some infrastructure first, adding more
> abstract layers like md.c (maybe a block-based API with an on-disk format
> version field), so people can dive in without understanding the things that
> are isolated by the abstract layer.

I really don't think this is necessary.  Similar to PageSetChecksumCopy
and PageSetChecksumInplace, I'm sure we would have functions which are
called in the appropriate spots to do encryption (such as 'encrypt_page'
and 'encrypt_block' in the Cybertec patch), and folks could review those
in relative isolation from the rest.  Dealing with blocks in PG is already
pretty well handled; the infrastructure that needs to be added is around
handling temporary files and is being actively worked on ... if we could
move past this debate about whether we should be adding support for XTS or
whether only GCM-SIV would be accepted.

> On 2021/10/20 20:24, Stephen Frost wrote:
> > What's the point of using GCM if we aren't going to actually verify the
> > tag? Also, the Cybertec patch didn't add an extra reserved field to the
> > page format, and it used CTR anyway..
> 
> Oh, I am wrong: the Cybertec patch cannot use XTS, because WAL may not be
> aligned to 16 bytes; WAL needs a stream cipher. The CTR implementation is
> still correct.

No, the CTR approach isn't great because, as has been discussed quite a
bit already, using the LSN as the IV means that different data might be
re-encrypted with the same LSN and that's not an acceptable thing to
have happen with CTR.

> CTR with hash(offset) as the IV is basically equal to XTS. If you use another
> AES key to encrypt the hash(offset), and the block size is 16 bytes, it is XTS.

I don't understand why we're talking about CTR+other-stuff.  Maybe if
you use CTR and then do other things then it's equivalent in some
fashion to XTS ... but then it's not CTR anymore and we shouldn't be
calling it that.  Saying that we should do "CTR+other stuff" (which
happens to make it equivalent to XTS) instead of just saying we should
use "XTS" is very confusing and further means that we're starting down
the path of trying to come up with our own hack on existing encryption
schemes, and regardless of who shows up on this list to claim that doing
so makes sense and is "right", I'm going to be extremely skeptical.

> The point is that we cannot save a random IV for WAL without adding a
> reserved field, no matter whether we use GCM or CTR.

Yes, it's correct that we can't use a random IV for the WAL without
figuring out how to keep track of that random IV.  Thankfully, for WAL
(unlike heap and index blocks) we don't really have that issue- 

Re: ThisTimeLineID can be used uninitialized

2021-10-21 Thread Robert Haas
On Tue, Oct 19, 2021 at 4:44 PM Andres Freund  wrote:
> It's quite confusing. It's *really* not helped by physical replication using
> but not really using an xlogreader to keep state. Which presumably isn't
> actually used during a physical CreateReplicationSlot(), but is referenced by
> a comment :/

I can't figure out what you're referring to here. I can't find
CreateReplicationSlot() using an xlogreader in the logical replication
case, or a comment that refers to it doing so.

> Istm we should introduce an InvalidTimeLineID, and explicitly initialize
> sendTimeLine to that, and assert that it's valid / invalid in a bunch of
> places?

I think the correct fix for this particular problem is just to delete
the initialization, as in the attached patch. I spent a long time
studying this today and eventually convinced myself that there's just
no way these initializations can ever do anything (details in proposed
commit message).  While it is important that we do not access the
global variable when it's uninitialized, here there is no reason to
access it in the first place.

Regarding the more general problem, I think we should consider (1)
reducing the number of places that access ThisTimeLineID directly,
preferring to add TimeLineID arguments to functions and pass the
relevant timeline value around explicitly and then (2) changing all of
the remaining accesses to ThisTimeLineID to function calls instead,
e.g. by inventing a function GetCurrentTimeLineID(). Once we do that,
I think this kind of problem just goes away. On the one hand,
GetCurrentTimeLineID() could assert that the value is valid before
returning it, and then we would have centralized checking that we're
not using a bogus value. But, there's no reason to stop there. If all
the callers are using this function rather than accessing the global
variable directly, then the function can just initialize the value
from shared memory as required! Or it can forget about having a local
copy stored in a global variable and just always read the current
value from shared memory! With a little thought, I think this approach
can avoid this sort of unfortunate coding:

if (!RecoveryInProgress())
    read_upto = GetFlushRecPtr();
else
    read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
tli = ThisTimeLineID;

What is going on here? Well, if we're not still in recovery, then the
call to RecoveryInProgress() will initialize ThisTimeLineID as a side
effect, and after that it can't change. If we are still in recovery
then GetXLogReplayRecPtr()'s will update the global variable as a side
effect on every trip through the function. Either way, read_upto is
the end of WAL in the way that's relevant to whichever operating mode
is current. But imagine that we could code this in a way that didn't
depend on global variables getting updated as a side effect. For
example:

if (!RecoveryInProgress())
    read_upto = GetFlushRecPtr();
else
    read_upto = GetXLogReplayRecPtr();
currTLI = GetCurrentTimeLineID();

Or perhaps:

if (!RecoveryInProgress())
    read_upto = GetFlushRecPtr(&currTLI);
else
    read_upto = GetXLogReplayRecPtr(&currTLI);

My point here is that the current idiom only makes sense if you
realize that RecoveryInProgress() has a side effect of updating
ThisTimeLineID, and on the other hand that the only reason we're using
ThisTimeLineID instead of a local variable here is that that's what
RecoveryInProgress() updates. It's just two mutually-reinforcing bad
decisions.
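
A minimal sketch of what such an accessor could look like (illustrative only,
not a worked-out patch; 0 is standing in here for the explicit
InvalidTimeLineID suggested upthread):

  TimeLineID
  GetCurrentTimeLineID(void)
  {
      /* one central place to validate, or to fetch from shared memory */
      Assert(ThisTimeLineID != 0);
      return ThisTimeLineID;
  }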

-- 
Robert Haas
EDB: http://www.enterprisedb.com


0001-Remove-useless-code-from-CreateReplicationSlot.patch
Description: Binary data


Re: [RFC] speed up count(*)

2021-10-21 Thread Robert Haas
On Thu, Oct 21, 2021 at 9:09 AM Joe Conway  wrote:
> I think you are exactly correct. People seem to understand that with a
> predicate it is harder, but they expect
>
>   select count(*) from foo;
>
> to be nearly instantaneous, and they don't really need it to be exact.
> The stock answer for that has been to do
>
>   select reltuples from pg_class
>   where relname = 'foo';
>
> But that is unsatisfying because the problem is often with some
> benchmark or another that cannot be changed.

I would also expect it to almost always give the wrong answer.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: CREATE ROLE IF NOT EXISTS

2021-10-21 Thread David Christensen
On Tue, Oct 19, 2021 at 4:29 PM Isaac Morland 
wrote:

> On Tue, 19 Oct 2021 at 16:12, David Christensen <
> david.christen...@crunchydata.com> wrote:
>
>> Greetings -hackers,
>>
>> Enclosed is a patch that implements CREATE ROLE IF NOT EXISTS (along with
>> the same support for USER/GROUP).  This is a fairly straightforward
>> approach in that we do no validation of anything other than existence, with
>> the user needing to ensure that permissions/grants are set up in the proper
>> way.
>>
>
> One little tricky aspect that occurs to me is the ALTER ROLE to set the
> role flag options: it really needs to mention *all* the available options
> if it is to leave the role in a specific state regardless of how it started
> out. For example, if the existing role has BYPASSRLS but you want the
> default NOBYPASSRLS you have to say so explicitly.
>
> Because of this, I think my preference, based just on thinking about
> setting the flag options, would be for CREATE OR REPLACE.
>
> However, I'm wondering about the role name options: IN ROLE, ROLE, ADMIN.
> With OR REPLACE should they replace the set of memberships or augment it?
> Either seems potentially problematic to me. By contrast it’s absolutely
> clear what IF NOT EXISTS should do with these.
>
> So I’m not sure what I think overall.
>

Sure, the ambiguity here for merging options was exactly the reason I went
with the IF NOT EXISTS route.  Whatever concerns with merging already exist
with ALTER ROLE, so nothing new is introduced by this functionality, at
least that was my original thought.
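
For reference, the proposed form from the patch is simply this (role name made
up; as with other IF NOT EXISTS variants, an existing role is presumably left
untouched apart from a notice):

  CREATE ROLE IF NOT EXISTS app_reader LOGIN;
  -- run again with different options: the existing role is not modified
  CREATE ROLE IF NOT EXISTS app_reader LOGIN CREATEDB;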

David


Re: [RFC] speed up count(*)

2021-10-21 Thread Joe Conway

On 10/21/21 4:06 PM, Robert Haas wrote:

On Thu, Oct 21, 2021 at 9:09 AM Joe Conway  wrote:

I think you are exactly correct. People seem to understand that with a
predicate it is harder, but they expect

  select count(*) from foo;

to be nearly instantaneous, and they don't really need it to be exact.
The stock answer for that has been to do

  select reltuples from pg_class
  where relname = 'foo';

But that is unsatisfying because the problem is often with some
benchmark or another that cannot be changed.


I would also expect it to almost always give the wrong answer.



That is a grossly overstated position. When I have looked, it is often 
not that terribly far off. And for many use cases that I have heard of 
at least, quite adequate.


Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: parallelizing the archiver

2021-10-21 Thread Robert Haas
On Tue, Oct 19, 2021 at 2:50 PM Stephen Frost  wrote:
> I keep seeing this thrown around and I don't quite get why we feel this
> is the case.  I'm not completely against trying to maintain backwards
> compatibility, but at the same time, we just went through changing quite
> a bit around in v12 with the restore command and that's the other half
> of this.  Why are we so concerned about backwards compatibility here
> when there was hardly any complaint raised about breaking it in the
> restore case?

There are 0 references to restore_command in the v12 release notes.
Just in case you had the version number wrong in this email, I
compared the documentation for restore_command in v10 to the
documentation in v14. The differences seem to be only cosmetic. So I'm
not sure what functional change you think we made. It was probably
less significant than what was being discussed here in regards to
archive_command.

Also, more to the point, when there's a need to break backward
compatibility in order to get some improvement, it's worth
considering, but here there just isn't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [RFC] speed up count(*)

2021-10-21 Thread Robert Haas
On Thu, Oct 21, 2021 at 4:19 PM Joe Conway  wrote:
> That is a grossly overstated position. When I have looked, it is often
> not that terribly far off. And for many use cases that I have heard of
> at least, quite adequate.

I don't think it's grossly overstated. If you need an approximation it
may be good enough, but I don't think it will often be exactly correct
- probably only if the table is small and rarely modified.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: parallelizing the archiver

2021-10-21 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Oct 19, 2021 at 2:50 PM Stephen Frost  wrote:
> > I keep seeing this thrown around and I don't quite get why we feel this
> > is the case.  I'm not completely against trying to maintain backwards
> > compatibility, but at the same time, we just went through changing quite
> > a bit around in v12 with the restore command and that's the other half
> > of this.  Why are we so concerned about backwards compatibility here
> > when there was hardly any complaint raised about breaking it in the
> > restore case?
> 
> There are 0 references to restore_command in the v12 release notes.
> Just in case you had the version number wrong in this email, I
> compared the documentation for restore_command in v10 to the
> documentation in v14. The differences seem to be only cosmetic. So I'm
> not sure what functional change you think we made. It was probably
> less significant than what was being discussed here in regards to
> archive_command.

restore_command used to be in recovery.conf, which disappeared with v12
and it now has to go into postgresql.auto.conf or postgresql.conf.

That's a huge breaking change.

> Also, more to the point, when there's a need to break backward
> compatibility in order to get some improvement, it's worth
> considering, but here there just isn't.

There won't be any thought towards a backwards-incompatible capability
if everyone is saying that we can't possibly break it.  That's why I was
commenting on it.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [RFC] speed up count(*)

2021-10-21 Thread Joe Conway

On 10/21/21 4:23 PM, Robert Haas wrote:

On Thu, Oct 21, 2021 at 4:19 PM Joe Conway  wrote:

That is a grossly overstated position. When I have looked, it is often
not that terribly far off. And for many use cases that I have heard of
at least, quite adequate.


I don't think it's grossly overstated. If you need an approximation it
may be good enough, but I don't think it will often be exactly correct
- probably only if the table is small and rarely modified.


meh -- the people who expect this to be impossibly fast don't typically 
need or expect it to be exactly correct, and there is no way to make it 
"exactly correct" in someone's snapshot without doing all the work.


That is why I didn't suggest making it the default. If you flip the 
switch, you would get a very fast approximation. If you care about 
accuracy, you accept it has to be slow.


Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: [RFC] speed up count(*)

2021-10-21 Thread Andrew Dunstan


On 10/21/21 4:29 PM, Joe Conway wrote:
> On 10/21/21 4:23 PM, Robert Haas wrote:
>> On Thu, Oct 21, 2021 at 4:19 PM Joe Conway  wrote:
>>> That is a grossly overstated position. When I have looked, it is often
>>> not that terribly far off. And for many use cases that I have heard of
>>> at least, quite adequate.
>>
>> I don't think it's grossly overstated. If you need an approximation it
>> may be good enough, but I don't think it will often be exactly correct
>> - probably only if the table is small and rarely modified.
>
> meh -- the people who expect this to be impossibly fast don't
> typically need or expect it to be exactly correct, and there is no way
> to make it "exactly correct" in someone's snapshot without doing all
> the work.
>
> That is why I didn't suggest making it the default. If you flip the
> switch, you would get a very fast approximation. If you care about
> accuracy, you accept it has to be slow.
>

I don't think we really want a switch for "inaccurate results
acceptable", and I doubt the standard would accept an approximation for
count(*).

But something else that gave a fast approximate answer
("count_estimate(*)"?) would be useful to many. Not portable but still
useful, if someone could come up with a reasonable implementation.
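
A rough sketch of the kind of thing people already hand-roll for this today,
pulling the planner's row estimate out of EXPLAIN (illustrative only; the
function name and lack of error handling are made up):

  CREATE FUNCTION count_estimate(query text) RETURNS bigint
  LANGUAGE plpgsql AS $$
  DECLARE
      plan json;
  BEGIN
      EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
      RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
  END;
  $$;

  SELECT count_estimate('SELECT * FROM foo');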


cheers


andrew

 

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [RFC] speed up count(*)

2021-10-21 Thread Robert Haas
On Thu, Oct 21, 2021 at 4:29 PM Joe Conway  wrote:
> meh -- the people who expect this to be impossibly fast don't typically
> need or expect it to be exactly correct, and there is no way to make it
> "exactly correct" in someone's snapshot without doing all the work.

I think it could actually be WAY faster than it is if, as Andres says,
we had the ability to push the count operation inside the heap AM. I
believe we have a tendency to attribute complaints like this to people
having unreasonable expectations, but here I'm not sure the expectation
is unreasonable. I vaguely recall writing special-purpose code to
count the number of tuples in a relation years ago, and IIRC it was
blazingly fast compared to letting our executor do it.  I agree,
however, that an approximation can be faster still.

> That is why I didn't suggest making it the default. If you flip the
> switch, you would get a very fast approximation. If you care about
> accuracy, you accept it has to be slow.

I'm not really here to take a position on the proposal. It doesn't
excite me, because I have not run across any users in the combination
of circumstances you mention: query can't be changed, exact answer not
actually required, whole table being counted. But I am not here to
call you a liar either. If you run across users in that situation all
the time, then you do.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: parallelizing the archiver

2021-10-21 Thread Robert Haas
On Thu, Oct 21, 2021 at 4:29 PM Stephen Frost  wrote:
> restore_command used to be in recovery.conf, which disappeared with v12
> and it now has to go into postgresql.auto.conf or postgresql.conf.
>
> That's a huge breaking change.

Not in the same sense. Moving the functionality to a different
configuration file can and probably did cause a lot of problems for
people, but the same basic functionality was still available.

(Also, I'm pretty sure that the recovery.conf changes would have
happened years earlier if there hadn't been backward compatibility
concerns, from Simon in particular. So saying that there was "hardly
any complaint raised" in that case doesn't seem to me to be entirely
accurate.)

> > Also, more to the point, when there's a need to break backward
> > compatibility in order to get some improvement, it's worth
> > considering, but here there just isn't.
>
> There won't be any thought towards a backwards-incompatible capability
> if everyone is saying that we can't possibly break it.  That's why I was
> commenting on it.

I can't speak for anyone else, but that is not what I am saying. I am
open to the idea of breaking it if we thereby get some valuable
benefit which cannot be obtained otherwise. But Nathan has now
implemented something which, from the sound of it, will allow us to
obtain all of the available benefits with no incompatibilities. If we
think of additional benefits that we cannot obtain without
incompatibilities, then we can consider that situation when it arises.
In the meantime, there's no need to go looking for reasons to break
stuff that works in existing releases.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Thinking about ANALYZE stats and autovacuum and large non-uniform tables

2021-10-21 Thread Greg Stark
One problem I've seen in multiple databases is when a table has a
mixture of data sets within it. E.g. a queue table where 99% of the
entries are "done" but most queries are working with the 1% that are
"new" or in other states. Often the statistics are skewed by the
"done" entries and give bad estimates for query planning when the
query is actually looking at the other rows.

We've always talked about this as a "skewed distribution" or
"intercolumn correlation" problem. And we've developed some tools for
dealing with those issues. But I've been thinking that's not the only
problem with these cases.

The problem I'm finding is that the distribution of these small
subsets can swing quickly. And understanding intercolumn correlations
even if we could do it perfectly would be no help at all.

Consider a table with millions of rows that are "done" but none that
are "pending". Inserting just a few hundred or thousand new pending
rows makes any estimates based on the existing statistics entirely
incorrect. Even if we had perfect statistics capable of making perfect
estimates they would be entirely wrong once a few inserts of pending
rows are done.

Worse, this is kind of true even for n_dead_tup, n_mod_since_analyze,
etc., which are affected by this too. It's easy (at least on older
versions, maybe Peter's work has fixed this for btree) to get severe
index bloat because vacuum doesn't run for a long time relative to the
size of the busy portion of a table.

I'm imagining that to really tackle this we should be doing something like
noticing when inserts, updates, deletes are affecting key values that
are "rare" according to the statistics and triggering autovacuum
ANALYZE statements that use indexes to only update the statistics for
the relevant key ranges.

Obviously this could get complex quickly. Perhaps it should be
something users could declare. Some kind of "partitioned statistics"
where you declare a where clause and we generate statistics for the
table where that where clause is true. Then we could fairly easily
also count things like n_mod_since_analyze for that where clause too.
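
Purely as an illustration of the declared variant (this WHERE clause does not
exist in CREATE STATISTICS today, and the table and column names are invented):

  CREATE STATISTICS work_queue_pending_stats ON status, priority
      FROM work_queue
      WHERE status IN ('new', 'pending');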

And yes, partitioning the table could be a solution to this in some
cases. I think there are reasons why it isn't always going to work for
these issues though, not least that users will likely have other ways
they want to partition the data already.


-- 
greg




Re: Partial aggregates pushdown

2021-10-21 Thread Zhihong Yu
Hi,
w.r.t. 0001-Partial-aggregates-push-down-v03.patch

For partial_agg_ok(),

+   if (agg->aggdistinct || agg->aggvariadic || agg->aggkind !=
AGGKIND_NORMAL || agg->aggorder != NIL)
+   ok = false;

Since SearchSysCache1() is not called yet, you can return false directly.

+   if (aggform->aggpartialpushdownsafe != true)

The above can be written as:

   if (!aggform->aggpartialpushdownsafe)

For build_conv_list():

+   Oid converter_oid = InvalidOid;
+
+   if (IsA(tlentry->expr, Aggref))
...
+   }
+   convlist = lappend_oid(convlist, converter_oid);

Do you intend to append InvalidOid to convlist (when tlentry->expr is
not Aggref) ?

Cheers


Re: Thinking about ANALYZE stats and autovacuum and large non-uniform tables

2021-10-21 Thread Thomas Munro
On Fri, Oct 22, 2021 at 10:13 AM Greg Stark  wrote:
> Obviously this could get complex quickly. Perhaps it should be
> something users could declare. Some kind of "partitioned statistics"
> where you declare a where clause and we generate statistics for the
> table where that where clause is true. Then we could fairly easily
> also count things like n_mod_since_analyze for that where clause too.

It's a different thing, but somehow related and maybe worth
mentioning, that in DB2 you can declare a table to be VOLATILE.  In
that case, by some unspecified different heuristics, it'll prefer
index scans over table scans, and it's intended to give stable
performance for queue-like tables by defending against automatically
scheduled stats being collected at a bad time.  It's been a while
since I ran busy queue-like workloads on DB2 but I seem to recall it
was more about the dangers of tables that sometimes have say 10 rows
and something 42 million, rather than the case of 42 million DONE rows
and 0-10 PENDING rows, but not a million miles off.




Re: should we allow users with a predefined role to access pg_backend_memory_contexts view and pg_log_backend_memory_contexts function?

2021-10-21 Thread Bossart, Nathan
On 10/20/21, 11:44 PM, "Bharath Rupireddy" 
 wrote:
> I would like to confine this thread to allowing non-superusers with a
> predefined role (earlier suggestion was to use pg_read_all_stats) to
> access views pg_backend_memory_contexts and pg_shmem_allocations and
> functions pg_get_backend_memory_contexts and pg_get_shmem_allocations.
> Attaching the previous v2 patch here for further review and thoughts.

I took a look at the new patch.  The changes to system_views.sql look
good to me.  Let's be sure to update doc/src/sgml/catalogs.sgml as
well.

-SELECT * FROM pg_log_backend_memory_contexts(pg_backend_pid());
+SELECT pg_log_backend_memory_contexts(pg_backend_pid());

nitpick: Do we need to remove the "* FROM" here?  This seems like an
unrelated change.

+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_backend_memory_contexts and function pg_log_backend_memory_contexts.

I think the comment needs to be updated to remove the reference to
pg_log_backend_memory_contexts.  It doesn't appear to be tested here.

+SELECT name, ident, parent, level, total_bytes >= free_bytes
+  FROM pg_backend_memory_contexts WHERE level = 0; -- permission denied error
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations; -- permission denied 
error

Since we're really just checking the basic permissions, could we just
do the "count(*) >= 0" check for both views?

Nathan



Re: Thinking about ANALYZE stats and autovacuum and large non-uniform tables

2021-10-21 Thread Peter Geoghegan
On Thu, Oct 21, 2021 at 2:13 PM Greg Stark  wrote:
> The problem I'm finding is that the distribution of these small
> subsets can swing quickly. And understanding intercolumn correlations
> even if we could do it perfectly would be no help at all.
>
> Consider a table with millions of rows that are "done" but none that
> are "pending". Inserting just a few hundred or thousand new pending
> rows makes any estimates based on the existing statistics entirely
> incorrect. Even if we had perfect statistics capable of making perfect
> estimates they would be entirely wrong once a few inserts of pending
> rows are done.

I am very sympathetic to this view of things. Because this asymmetry
obviously exists, and matters. There is no getting around that.

> Worse, this is kind of true for even n_dead_tup, n_mod_since_analyze,
> etc are kind of affected by this. It's easy (at least on older
> versions, maybe Peter's work has fixed this for btree) to get severe
> index bloat because vacuum doesn't run for a long time relative to the
> size of the busy portion of a table.

My work (especially in 14) has definitely helped a great deal with
index bloat, by cleaning it up in a targeted fashion, based on
page-level considerations. This is just the only thing that can work;
we can never expect VACUUM to be able to deal with that, no matter
what. Simply because it's totally normal and expected for index bloat
to grow at an uneven rate over time.

I do still think that there is an unsolved issue here, which leads to
problems with index bloat when there isn't "B-Tree keyspace
concentration" of garbage index tuples. That problem is with the
statistics that drive VACUUM themselves; they just don't work very
well in certain cases [1]. Statistics that drive autovacuum usually
come from ANALYZE, of course. The entire intellectual justification
for database statistics doesn't really carry over to VACUUM. There are
certain "physical database" implementation details that bleed into the
way ANALYZE counts dead rows. For example, most dead tuples are
usually LP_DEAD stub line pointers (not even tuples). They're only 4
bytes, whereas live tuples are about 30 bytes at a minimum (depending
on how you count it). This leads to the ANALYZE block-based sampling
becoming confused.

This confusion seems related to the fact that ANALYZE is really a
"logical database" thing. It's slightly amazing that statistics from
ANALYZE work as well as they do for query planning, so we shouldn't be
too surprised.

Note that the TPC-C issue I describe in [1] involves a table that's a
little bit like a queue table, but with lots of non-HOT updates (lots
overall, but only one update per logical row, ever). This might tie
things to what Thomas just said about DB2 and queue tables.

> I'm imagining to really tackle this we should be doing something like
> noticing when inserts, updates, deletes are affecting key values that
> are "rare" according to the statistics and triggering autovacuum
> ANALYZE statements that use indexes to only update the statistics for
> the relevant key ranges.

I'm not sure. I tend to think that the most promising approaches all
involve some kind of execution time smarts about the statistics, and
their inherent unreliability. Somehow query execution itself should
become less gullible, at least in cases where we really can have high
confidence in the statistics being wrong at this exact time, for this
exact key space.

[1] 
https://postgr.es/m/CAH2-Wz=9r83wcwzcpuh4fvpedm4znzbzmvp3rt21+xhqwmu...@mail.gmail.com
--
Peter Geoghegan




Re: [RFC] building postgres with meson

2021-10-21 Thread Andres Freund
Hi,

On 2021-10-12 15:55:22 -0400, John Naylor wrote:
> Also, could utility makefile targets be made to work? I'm thinking in
> particular of update-unicode and reformat-dat-files, for example.

Implementing reformat-dat-files was trivial:
https://github.com/anarazel/postgres/commit/29c1ce1ad4731290714978da5ce81e99ef051bec


However, update-unicode is a bit harder.  Partially not directly because of
meson, but because update-unicode as-is afaict doesn't support VPATH builds,
and meson enforces those.

make update-unicode
...
make -C src/common/unicode update-unicode
'/usr/bin/perl' generate-unicode_norm_table.pl
Can't open perl script "generate-unicode_norm_table.pl": No such file or 
directory

It's not too hard to fix. See attached for the minimal stuff that I
immediately found to be needed. There's likely more,
e.g. src/backend/utils/mb/Unicode - but I didn't immediately see where that's
invoked from.


The slightly bigger issue making update-unicode work with meson is that meson
doesn't provide support for invoking build targets in specific directories
(because it doesn't map nicely to e.g. msbuild). But scripts like
src/common/unicode/generate-unicode_norm_table.pl rely on CWD. It's not hard
to work around that, but IMO it's better for such scripts to not rely on CWD.


Greetings,

Andres Freund
diff --git i/src/common/unicode/Makefile w/src/common/unicode/Makefile
index a3683dd86b9..e69054d4671 100644
--- i/src/common/unicode/Makefile
+++ w/src/common/unicode/Makefile
@@ -12,14 +12,14 @@ subdir = src/common/unicode
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-override CPPFLAGS := -DFRONTEND $(CPPFLAGS)
+override CPPFLAGS := -DFRONTEND -I$(abs_top_builddir)/src/common/unicode $(CPPFLAGS)
 LIBS += $(PTHREAD_LIBS)
 
 # By default, do nothing.
 all:
 
 update-unicode: unicode_norm_table.h unicode_combining_table.h unicode_east_asian_fw_table.h unicode_normprops_table.h unicode_norm_hashfunc.h
-	mv $^ ../../../src/include/common/
+	mv $^ $(top_srcdir)/src/include/common/
 	$(MAKE) normalization-check
 
 # These files are part of the Unicode Character Database. Download
@@ -33,7 +33,7 @@ UnicodeData.txt EastAsianWidth.txt DerivedNormalizationProps.txt CompositionExcl
 unicode_norm_hashfunc.h: unicode_norm_table.h
 
 unicode_norm_table.h: generate-unicode_norm_table.pl UnicodeData.txt CompositionExclusions.txt
-	$(PERL) generate-unicode_norm_table.pl
+	$(PERL) $^
 
 unicode_combining_table.h: generate-unicode_combining_table.pl UnicodeData.txt
 	$(PERL) $^ >$@
@@ -58,7 +58,7 @@ submake-common:
 	$(MAKE) -C .. all
 
 norm_test_table.h: generate-norm_test_table.pl NormalizationTest.txt
-	perl generate-norm_test_table.pl NormalizationTest.txt $@
+	perl $^ $@
 
 .PHONY: normalization-check
 
diff --git i/contrib/unaccent/Makefile w/contrib/unaccent/Makefile
index b8307d1601e..d6c466e07ad 100644
--- i/contrib/unaccent/Makefile
+++ w/contrib/unaccent/Makefile
@@ -27,12 +27,12 @@ include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
 
-update-unicode: unaccent.rules
+update-unicode: $(srcdir)/unaccent.rules
 
 # Allow running this even without --with-python
 PYTHON ?= python
 
-unaccent.rules: generate_unaccent_rules.py ../../src/common/unicode/UnicodeData.txt Latin-ASCII.xml
+$(srcdir)/unaccent.rules: generate_unaccent_rules.py ../../src/common/unicode/UnicodeData.txt Latin-ASCII.xml
 	$(PYTHON) $< --unicode-data-file $(word 2,$^) --latin-ascii-file $(word 3,$^) >$@
 
 # Only download it once; dependencies must match src/common/unicode/


Re: Fixing WAL instability in various TAP tests

2021-10-21 Thread Bossart, Nathan
On 9/28/21, 8:17 PM, "Michael Paquier"  wrote:
> On Tue, Sep 28, 2021 at 03:00:13PM -0400, Tom Lane wrote:
>> Should we back-patch 0002?  I'm inclined to think so.  Should
>> we then also back-patch enablement of the bloom test?  Less
>> sure about that, but I'd lean to doing so.  A test that appears
>> to be there but isn't actually invoked is pretty misleading.
>
> A backpatch sounds adapted to me for both patches.  The only risk that
> I could see here is somebody implementing a new test by copy-pasting
> this one if we were to keep things as they are on stable branches.

I found this thread via the Commitfest entry
(https://commitfest.postgresql.org/35//), and I also see that the
following patches have been committed:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7d1aa6b
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6bc6bd4

However, it looks like there are a couple of other patches upthread
[0] that attempt to ensure the tests pass for different settings of
max_wal_size.  Do we intend to proceed with those, or should we just
close out the Commitfest entry?

Nathan

[0] https://postgr.es/m/C1D227C2-C271-4310-8C85-C5368C298622%40enterprisedb.com



Experimenting with hash tables inside pg_dump

2021-10-21 Thread Tom Lane
Today, pg_dump does a lot of internal lookups via binary search
in presorted arrays.  I thought it might improve matters
to replace those binary searches with hash tables, theoretically
converting O(log N) searches into O(1) searches.  So I tried making
a hash table indexed by CatalogId (tableoid+oid) with simplehash.h,
and replacing as many data structures as I could with that.

This makes the code shorter and (IMO anyway) cleaner, but

(a) the executable size increases by a few KB --- apparently, even
the minimum subset of simplehash.h's functionality is code-wasteful.

(b) I couldn't measure any change in performance at all.  I tried
it on the regression database and on a toy DB with 1 simple
tables.  Maybe on a really large DB you'd notice some difference,
but I'm not very optimistic now.

So this experiment feels like a failure, but I thought I'd post
the patch and results for the archives' sake.  Maybe somebody
will think of a way to improve matters.  Or maybe it's worth
doing just to shorten the code?

regards, tom lane

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index 1f24e79665..49932fd598 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -18,6 +18,14 @@
 #include 
 
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_collation_d.h"
+#include "catalog/pg_extension_d.h"
+#include "catalog/pg_namespace_d.h"
+#include "catalog/pg_operator_d.h"
+#include "catalog/pg_proc_d.h"
+#include "catalog/pg_publication_d.h"
+#include "catalog/pg_type_d.h"
+#include "common/hashfn.h"
 #include "fe_utils/string_utils.h"
 #include "pg_backup_archiver.h"
 #include "pg_backup_utils.h"
@@ -31,54 +39,54 @@ static int	allocedDumpIds = 0;
 static DumpId lastDumpId = 0;	/* Note: 0 is InvalidDumpId */
 
 /*
- * Variables for mapping CatalogId to DumpableObject
- */
-static bool catalogIdMapValid = false;
-static DumpableObject **catalogIdMap = NULL;
-static int	numCatalogIds = 0;
-
-/*
- * These variables are static to avoid the notational cruft of having to pass
- * them into findTableByOid() and friends.  For each of these arrays, we build
- * a sorted-by-OID index array immediately after the objects are fetched,
- * and then we use binary search in findTableByOid() and friends.  (qsort'ing
- * the object arrays themselves would be simpler, but it doesn't work because
- * pg_dump.c may have already established pointers between items.)
+ * Infrastructure for mapping CatalogId to DumpableObject
+ *
+ * We use a hash table generated by simplehash.h.  That infrastructure
+ * requires all the hash table entries to be the same size, and it also
+ * expects that it can move them around when resizing the table.  So we
+ * cannot make the DumpableObjects be elements of the hash table directly;
+ * instead, the hash table elements contain pointers to DumpableObjects.
+ *
+ * It turns out to be convenient to also use this data structure to map
+ * CatalogIds to owning extensions, if any.  Since extension membership
+ * data is read before creating most DumpableObjects, either one of dobj
+ * and ext could be NULL.
  */
-static DumpableObject **tblinfoindex;
-static DumpableObject **typinfoindex;
-static DumpableObject **funinfoindex;
-static DumpableObject **oprinfoindex;
-static DumpableObject **collinfoindex;
-static DumpableObject **nspinfoindex;
-static DumpableObject **extinfoindex;
-static DumpableObject **pubinfoindex;
-static int	numTables;
-static int	numTypes;
-static int	numFuncs;
-static int	numOperators;
-static int	numCollations;
-static int	numNamespaces;
-static int	numExtensions;
-static int	numPublications;
-
-/* This is an array of object identities, not actual DumpableObjects */
-static ExtensionMemberId *extmembers;
-static int	numextmembers;
+typedef struct _catalogIdMapEntry
+{
+	CatalogId	catId;			/* the indexed CatalogId */
+	uint32		status;			/* hash status */
+	uint32		hashval;		/* hash code for the CatalogId */
+	DumpableObject *dobj;		/* the associated DumpableObject, if any */
+	ExtensionInfo *ext;			/* owning extension, if any */
+} CatalogIdMapEntry;
+
+#define SH_PREFIX		catalogid
+#define SH_ELEMENT_TYPE	CatalogIdMapEntry
+#define SH_KEY_TYPE		CatalogId
+#define	SH_KEY			catId
+#define SH_HASH_KEY(tb, key)	hash_bytes((const unsigned char *) &(key), sizeof(CatalogId))
+#define SH_EQUAL(tb, a, b)		((a).oid == (b).oid && (a).tableoid == (b).tableoid)
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) (a)->hashval
+#define	SH_SCOPE		static inline
+#define SH_RAW_ALLOCATOR	pg_malloc0
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
#define CATALOGIDHASH_INITIAL_SIZE	10000
+
+static catalogid_hash *catalogIdHash = NULL;
 
 static void flagInhTables(Archive *fout, TableInfo *tbinfo, int numTables,
 		  InhInfo *inhinfo, int numInherits);
 static void flagInhIndexes(Archive *fout, TableInfo *tblinfo, int numTables);
 static void flagInhAttrs(DumpOptions *dopt, TableInfo *tblinfo, int numTables);
-static Du

Re: CREATEROLE and role ownership hierarchies

2021-10-21 Thread Bossart, Nathan
On 10/20/21, 11:46 AM, "Mark Dilger"  wrote:
> The purpose of these patches is to fix the CREATEROLE escalation
> attack vector misfeature.  (Not everyone will see CREATEROLE that
> way, but the perceived value of the patch set likely depends on how
> much you see CREATEROLE in that light.)

Regarding the "attack vector misfeature" comment, I remember being
surprised when I first learned how much roles with CREATEROLE can do.
When I describe CREATEROLE to others, I am sure to emphasize the note
in the docs about such roles being "almost-superuser" roles.
CREATEROLE is a rather big hammer at the moment, so I certainly think
there is value in reducing its almost-superuser-ness.

I mentioned this in the other thread [0] already, but the first thing
that comes to mind when I look at these patches is how upgrades might
work.  Will we just make the bootstrap superuser the owner for all
roles when you first upgrade to v15?  Also, are we just going to strip
the current CREATEROLE roles of much of their powers?  Maybe it's
worth keeping a legacy CREATEROLE role attribute for upgraded clusters
that could eventually be removed down the road.

I'd also like to bring up my note about allowing users to transfer
role ownership.  When I tested the patches earlier, REASSIGN OWNED BY
was failing with an "unexpected classid" ERROR.  Besides REASSIGN
OWNED BY, perhaps there should be another mechanism for transferring
ownership on a role-by-role basis (i.e., ALTER ROLE OWNER TO).  I
haven't looked at this new patch set too closely, so my apologies if
this has already been added.
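
Concretely, something along these lines (the ALTER ROLE ... OWNER TO form is
hypothetical syntax for the proposed feature; REASSIGN OWNED BY already exists
but currently errors as described above):

  ALTER ROLE alice OWNER TO admin_role;
  REASSIGN OWNED BY alice TO admin_role;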

Nathan

[0] https://postgr.es/m/53C7DF4C-8463-4647-9DFD-779B5E1861C4%40amazon.com



Re: Experimenting with hash tables inside pg_dump

2021-10-21 Thread Bossart, Nathan
On 10/21/21, 3:29 PM, "Tom Lane"  wrote:
> (b) I couldn't measure any change in performance at all.  I tried
> it on the regression database and on a toy DB with 1 simple
> tables.  Maybe on a really large DB you'd notice some difference,
> but I'm not very optimistic now.

I wonder how many tables you'd need to start seeing a difference.
There are certainly databases out there with many more than 10,000
tables.  I'll look into this...

Nathan



Re: CREATEROLE and role ownership hierarchies

2021-10-21 Thread Mark Dilger



> On Oct 21, 2021, at 4:04 PM, Bossart, Nathan  wrote:
> 
> Regarding the "attack vector misfeature" comment, I remember being
> surprised when I first learned how much roles with CREATEROLE can do.
> When I describe CREATEROLE to others, I am sure to emphasize the note
> in the docs about such roles being "almost-superuser" roles.
> CREATEROLE is a rather big hammer at the moment, so I certainly think
> there is value in reducing its almost-superuser-ness.

It is hard to know how many people are using CREATEROLE currently.  There isn't 
much reason to give it out, since if you care enough about security to not give 
out superuser, you probably care too much about security to give away 
CREATEROLE.

> I mentioned this in the other thread [0] already, but the first thing
> that comes to mind when I look at these patches is how upgrades might
> work.  Will we just make the bootstrap superuser the owner for all
> roles when you first upgrade to v15?

Yes, that's the idea.  After upgrade, all roles will form a tree, with the 
bootstrap superuser at the root of the tree.  The initial tree structure isn't 
very interesting, with all other roles directly owned by it, but from there the 
superuser can rearrange the tree, and after that non-superuser roles can manage 
whatever subtree of roles they are the root of.

>  Also, are we just going to strip
> the current CREATEROLE roles of much of their powers?  Maybe it's
> worth keeping a legacy CREATEROLE role attribute for upgraded clusters
> that could eventually be removed down the road.

The patch as written drastically reduces the power of the CREATEROLE attribute, 
in a non-backwards compatible way.  I wondered if there would be complaints 
about that.  If so, we could instead leave CREATEROLE alone, and create some 
other privileged role for the same thing, but it does start to look funny 
having a CREATEROLE privilege bit and also a privileged role named, perhaps, 
pg_can_create_roles.

> I'd also like to bring up my note about allowing users to transfer
> role ownership.  When I tested the patches earlier, REASSIGN OWNED BY
> was failing with an "unexpected classid" ERROR.  Besides REASSIGN
> OWNED BY, perhaps there should be another mechanism for transferring
> ownership on a role-by-role basis (i.e., ALTER ROLE OWNER TO).  I
> haven't looked at this new patch set too closely, so my apologies if
> this has already been added.

Yes, I completely agree with you on that.  Both REASSIGN OWNED BY and ALTER 
ROLE OWNER TO should work.  I'll take a look at the patches and repost with any 
adjustments that I find necessary to make those work.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Experimenting with hash tables inside pg_dump

2021-10-21 Thread Andres Freund
Hi,

On 2021-10-21 18:27:25 -0400, Tom Lane wrote:
> Today, pg_dump does a lot of internal lookups via binary search
> in presorted arrays.  I thought it might improve matters
> to replace those binary searches with hash tables, theoretically
> converting O(log N) searches into O(1) searches.  So I tried making
> a hash table indexed by CatalogId (tableoid+oid) with simplehash.h,
> and replacing as many data structures as I could with that.

That does sound like a good idea in theory...


> This makes the code shorter and (IMO anyway) cleaner, but
> 
> (a) the executable size increases by a few KB --- apparently, even
> the minimum subset of simplehash.h's functionality is code-wasteful.

Hm. Surprised a bit by that. In an optimized build the difference is
smaller, at least.

optimized:
   text    data     bss     dec     hex filename
 448066    7048    1368  456482   6f722 src/bin/pg_dump/pg_dump
 447530    7048    1496  456074   6f58a src/bin/pg_dump/pg_dump.orig

debug:
   text    data     bss     dec     hex filename
 516883    7024    1352  525259   803cb src/bin/pg_dump/pg_dump
 509819    7024    1480  518323   7e8b3 src/bin/pg_dump/pg_dump.orig

The fact that optimization plays such a role makes me wonder if a good chunk
of the difference is the slightly more complicated find{Type,Func,...}ByOid()
functions.


> (b) I couldn't measure any change in performance at all.  I tried
> it on the regression database and on a toy DB with 10,000 simple
> tables.  Maybe on a really large DB you'd notice some difference,
> but I'm not very optimistic now.

Did you measure runtime of pg_dump, or how much CPU it used?  I think a lot of
the time the backend is a bigger bottleneck than pg_dump...

For the regression test DB the majority of the time seems to be spent below
two things:
1) libpq
2) sortDumpableObjects().

I don't think 2) hits the binary search / hashtable path?


It does seem interesting that a substantial part of the time is spent in/below
PQexec() and PQfnumber(). Especially the latter shouldn't be too hard to
optimize away...
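
For example, resolving column numbers once per result set instead of per
row and field is the obvious first step (sketch only, not pg_dump's actual
code):

  int   i_oid = PQfnumber(res, "oid");
  int   i_relname = PQfnumber(res, "relname");

  for (int row = 0; row < PQntuples(res); row++)
  {
      /* PQgetvalue() with a cached column index avoids repeated lookups */
      char *oid = PQgetvalue(res, row, i_oid);
      char *relname = PQgetvalue(res, row, i_relname);

      /* ... build up the corresponding DumpableObject ... */
  }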


Greetings,

Andres Freund




Re: Experimenting with hash tables inside pg_dump

2021-10-21 Thread Bossart, Nathan
On 10/21/21, 4:14 PM, "Bossart, Nathan"  wrote:
> On 10/21/21, 3:29 PM, "Tom Lane"  wrote:
>> (b) I couldn't measure any change in performance at all.  I tried
>> it on the regression database and on a toy DB with 10,000 simple
>> tables.  Maybe on a really large DB you'd notice some difference,
>> but I'm not very optimistic now.
>
> I wonder how many tables you'd need to start seeing a difference.
> There are certainly databases out there with many more than 10,000
> tables.  I'll look into this...

Well, I tested with 200,000 tables and saw no difference with this.

Nathan



Re: Experimenting with hash tables inside pg_dump

2021-10-21 Thread Tom Lane
Andres Freund  writes:
> Did you measure runtime of pg_dump, or how much CPU it used?

I was looking mostly at wall-clock runtime, though I did notice
that the CPU time looked about the same too.

> I think a lot of
> the time the backend is a bigger bottleneck than pg_dump...

Yeah, that.  I tried doing a system-wide "perf" measurement, and soon
realized that a big fraction of the time for a "pg_dump -s" run is
being spent in the planner :-(.  I'm currently experimenting with
PREPARE'ing pg_dump's repetitive queries, and it's looking very
promising.  More later.
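
The shape of the idea is roughly this (illustrative query only, not one of
pg_dump's actual statements): plan the repetitive per-object query once,
then execute it per table.

  PREPARE dump_attrs(oid) AS
      SELECT attname, atttypid, attnotnull
      FROM pg_catalog.pg_attribute
      WHERE attrelid = $1 AND attnum > 0 AND NOT attisdropped
      ORDER BY attnum;

  EXECUTE dump_attrs('public.some_table'::regclass);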

regards, tom lane




Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached"

2021-10-21 Thread Jeff Davis
On Wed, 2021-10-20 at 21:35 +0530, Bharath Rupireddy wrote:
> The FATAL error "recovery ended before configured recovery target was
> reached" introduced by commit at [1] in PG 14 is causing the standby
> to go down after having spent a good amount of time in recovery. There
> can be cases where the arrival of required WAL (for reaching recovery
> target) from the archive location to the standby may take time and
> meanwhile the standby failing with the FATAL error isn't good.
> Instead, how about we make the standby wait for a certain amount of
> time (with a GUC) so that it can keep looking for the required WAL.

How is archiving configured, and would it be possible to introduce
logic into the restore_command to handle slow-to-arrive WAL?
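
One shape such logic could take is a small wrapper script that retries before
giving up (everything here is illustrative: paths, retry count, and sleep are
made up, and waiting too long in restore_command has its own downsides for
ending recovery):

  #!/bin/sh
  # restore_command = '/usr/local/bin/restore_with_wait.sh %f %p'
  f="$1"; p="$2"
  for i in $(seq 1 30); do
      [ -f "/mnt/archive/$f" ] && exec cp "/mnt/archive/$f" "$p"
      sleep 10
  done
  exit 1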

Regards,
Jeff Davis






Re: [Bug] Logical Replication failing if the DateStyle is different in Publisher & Subscriber

2021-10-21 Thread Masahiko Sawada
On Thu, Oct 21, 2021 at 11:18 PM Japin Li  wrote:
>
>
> On Thu, 21 Oct 2021 at 19:54, Masahiko Sawada  wrote:
> > On Thu, Oct 21, 2021 at 3:04 PM Dilip Kumar  wrote:
> >>
> >> On Thu, Oct 21, 2021 at 11:16 AM Masahiko Sawada  
> >> wrote:
> >> >
> >> > On Wed, Oct 20, 2021 at 8:12 PM Japin Li  wrote:
> >> > >
> >> > >
> >> > > On Mon, 18 Oct 2021 at 17:27, Dilip Kumar  
> >> > > wrote:
> >> > > > On Mon, Oct 18, 2021 at 1:41 PM Japin Li  wrote:
> >> > > >
> >> > > >> I attached v3 patch that set IntervalStyle to 'postgres' when the
> >> > > >> server backend is walsender, and this problem has gone.
> >> > > >
> >> > > >> I test that set IntervalStyle to 'sql_standard' on publisher and
> >> > > >> 'iso_8601' on subscriber, it works fine.
> >> > > >
> >> > > >> Please try v3 patch and let me know if they work as unexpected.
> >> > > >> Thanks in advance.
> >> > > >
> >> > > > I think the idea of setting the standard DateStyle and the
> >> > > > IntervalStyle on the walsender process looks fine to me.  As this 
> >> > > > will
> >> > > > avoid extra network round trips as Tom mentioned.
> >> > >
> >> > > After some testing, I find we should also set extra_float_digits to
> >> > > avoid losing precision.
> >> >
> >> > I'm concerned that it sets parameters too early since wal senders end
> >> > up setting the parameters regardless of logical decoding plugins. It
> >> > might be better to force the parameters within the plugin for logical
> >> > replication, pgoutput, in order to avoid affecting other plugins? On
> >> > the other hand, if we do so, we will need to handle table sync worker
> >> > cases separately since they copy data via COPY executed by the wal
> >> > sender process. For example, we can have table sync workers set the
> >> > parameters.
> >>
> >> You mean table sync worker to set over the replication connection
> >> right?  I think that was the first solution where normal workers, as
> >> well as table sync workers, were setting over the replication
> >> connection, but Tom suggested that setting on the walsender is a
> >> better option as we can avoid the network round trip.
> >
> > Right.
> >
> > BTW I think we can set the parameters from the subscriber side without
> > additional network round trips by specifying the "options" parameter
> > in the connection string, no?
> >
>
> Yes, we can.  However, each client would then have to concern itself with
> the DateStyle setting, which IMO is tedious.
>
> >> If we want to set it over the replication connection then do it for
> >> both as Japin's first patch is doing, otherwise, I am not seeing any
> >> big issue in setting it early in the walsender also.  I think it is
> >> good to let walsender always send in the standard format which can be
> >> understood by other node, no?
> >
> > Yeah, probably the change on HEAD is fine, but I'm a bit concerned
> > about possible issues on back branches; for example, if the user expects
> > to get date data in the style of the DateStyle setting on the server via
> > pg_recvlogical, this change could break it.
> >
>
> How does it break?

I don't know of a real case, but for example, if an application gets
changes via pg_recvlogical with a decoding plugin (say wal2json) from
a database whose DateStyle setting is "SQL, MDY", it expects that
the date values in the streamed data are in the style of "SQL, MDY".
But with this change, it will get date values in the style of "ISO",
which could lead to a parse error in the application.

>  The user can also specify "options" to get date data
> in the style they want. Right?

Right. But doesn't that mean breaking compatibility?
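
(For reference, the connection-string route mentioned above would look roughly
like this; the host and names are made up, and "options" just passes
startup-time GUC settings through libpq to the walsender:)

  CREATE SUBSCRIPTION mysub
      CONNECTION 'host=pub.example.com dbname=postgres user=repuser
                  options=''-c datestyle=ISO -c intervalstyle=postgres -c extra_float_digits=3'''
      PUBLICATION mypub;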

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: LogicalChanges* and LogicalSubxact* wait events are never reported

2021-10-21 Thread Masahiro Ikeda



On 2021/10/21 17:40, Amit Kapila wrote:
> On Wed, Oct 20, 2021 at 3:46 PM Masahiro Ikeda  
> wrote:
>>
>> On 2021/10/20 18:17, Amit Kapila wrote:
>>> On Wed, Oct 20, 2021 at 10:50 AM Michael Paquier  
>>> wrote:

 On Wed, Oct 20, 2021 at 02:12:20PM +0900, Masahiro Ikeda wrote:
> If my understanding is right, it's better to remove them since they make
> users confused. Please see the attached patch. I confirmed that to make
> check-world passes all tests.

 Yeah, I don't see the point in keeping these events around if they are
 not used.  Perhaps Amit has some plans for them, though.

>>>
>>> No, there is no plan for these. As far as I remember, during
>>> development, we have decided to use BufFile interface and forgot to
>>> remove these events which were required by the previous versions of
>>> the patch. I'll take care of this.
>>>
>>> Thanks for the report and patch!
>> Thanks for your replies and for handling it!
>>
> 
> Pushed!

Thanks!

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION




Re: Drop replslot after pgstat_shutdown cause assert coredump

2021-10-21 Thread Kyotaro Horiguchi
At Fri, 22 Oct 2021 02:10:21 +0900, Fujii Masao  
wrote in 
> Even in prior to the commit, pgstat_shutdown_hook() can be called
> before ProcKill() at the backend exit, so ISTM that the problem can
> be reproduced.
> 
> Probably we need to make sure that pgstat_shutdown_hook() is called
> after ProcKill(), e.g., by registering pgstat_shutdown_hook() into

Considering the coming shared-memory based stats collector, pgstat
must be shut down before shared memory is torn down.  Every operation
that requires the stats collector must also be shut down before the
pgstat shutdown. A naive solution would be to have a
before-pgstat-shutdown hook, but I'm not sure it's the right direction.

> on_proc_exit_list (I'm not sure if this change is safe, though).
> Or maybe pgstat logic for replication slot drop needs to be
> overhauled.

I think we don't want to lose the stats numbers of the to-be-dropped
slot. So the slot-drop must happen before pgstat shutdown.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Experimenting with hash tables inside pg_dump

2021-10-21 Thread Andres Freund
Hi,

On 2021-10-21 16:37:57 -0700, Andres Freund wrote:
> On 2021-10-21 18:27:25 -0400, Tom Lane wrote:
> > (a) the executable size increases by a few KB --- apparently, even
> > the minimum subset of simplehash.h's functionality is code-wasteful.
> 
> Hm. Surprised a bit by that. In an optimized build the difference is
> smaller, at least.
> 
> optimized:
>text  data bss dec hex filename
>  448066  70481368  456482   6f722 src/bin/pg_dump/pg_dump
>  447530  70481496  456074   6f58a src/bin/pg_dump/pg_dump.orig
> 
> debug:
>text  data bss dec hex filename
>  516883  70241352  525259   803cb src/bin/pg_dump/pg_dump
>  509819  70241480  518323   7e8b3 src/bin/pg_dump/pg_dump.orig
> 
> The fact that optimization plays such a role makes me wonder if a good chunk
> of the difference is the slightly more complicated find{Type,Func,...}ByOid()
> functions.

It's not that.

In a debug build a good chunk of it is due to a bunch of Assert()s. Another
part is that trivial helper functions like SH_PREV() don't get inlined.

The increase for an optimized build seems to boil down to pg_log_error()
invocations. If I replace those with an exit(1), the resulting binaries are
within 100 bytes.

If I prevent the compiler from inlining findObjectByCatalogId() in all the
find*ByOid() routines, your version is smaller than master even without other
changes.

Greetings,

Andres Freund




Re: Experimenting with hash tables inside pg_dump

2021-10-21 Thread Andres Freund
Hi,

On 2021-10-21 20:22:56 -0400, Tom Lane wrote:
> Andres Freund  writes:
> Yeah, that.  I tried doing a system-wide "perf" measurement, and soon
> realized that a big fraction of the time for a "pg_dump -s" run is
> being spent in the planner :-(.

A trick for seeing the proportions of this easily in perf is to start both
postgres and pg_dump pinned to a specific CPU, and profile that cpu. That gets
rid of most of the noise of other programs etc.



> I'm currently experimenting with
> PREPARE'ing pg_dump's repetitive queries, and it's looking very
> promising.  More later.

Good idea.

I wonder though if for some of them we should instead replace the per-object
queries with one query returning the information for all objects of a type. It
doesn't make all that much sense that we build and send one query for each
table and index.

Greetings,

Andres Freund




Re: Added schema level support for publication.

2021-10-21 Thread Greg Nancarrow
On Fri, Oct 22, 2021 at 12:19 AM vignesh C  wrote:
>
> I could reproduce the issue by using the following test:
> --- Setup
> create schema sch1;
> create schema sch2;
> create table sch1.tbl1 (a int) partition by range (a);
> create table sch2.tbl1_part1 partition of sch1.tbl1 for values from (1) to 
> (10);
> create table sch2.tbl1_part2 partition of sch1.tbl1 for values from
> (10) to (20);
> create schema sch3;
> create table sch3.t1(c1 int);
>
> --- Publication
> create publication pub1 for all tables in schema sch3, table
> sch1.tbl1, table sch2.tbl1_part1 with ( publish_via_partition_root
> =on);
> insert into sch1.tbl1 values(1);
> insert into sch1.tbl1 values(11);
> insert into sch3.t1 values(1);
>
> --- Subscription
> CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres host=localhost
> port=5432' PUBLICATION pub1;
>
> The patch posted at [1] has the fix for the same.
> [1] - 
> https://www.postgresql.org/message-id/CALDaNm1onqBEr0WE_e7%3DCNw3bURfrGRmbMjX31d-nx3FGLS10A%40mail.gmail.com
>

Thanks for addressing this.
I checked the updated code and did some more testing and the fix LGTM.

I was also previously concerned about what the behavior should be when
only including just the partitions of a partitioned table in a
publication using ALL TABLES IN SCHEMA and having
publish_via_partition_root=true. It seems to implicitly include the
partitioned table in the publication. But I did some testing and found
that this is the current behavior when only the partitions are
individually included in a publication using TABLE, so it seems to be
OK.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: should we allow users with a predefined role to access pg_backend_memory_contexts view and pg_log_backend_memory_contexts function?

2021-10-21 Thread Bharath Rupireddy
On Fri, Oct 22, 2021 at 3:15 AM Bossart, Nathan  wrote:
>
> On 10/20/21, 11:44 PM, "Bharath Rupireddy" 
>  wrote:
> > I would like to confine this thread to allowing non-superusers with a
> > predefined role (earlier suggestion was to use pg_read_all_stats) to
> > access views pg_backend_memory_contexts and pg_shmem_allocations and
> > functions pg_get_backend_memory_contexts and pg_get_shmem_allocations.
> > Attaching the previous v2 patch here for further review and thoughts.
>
> I took a look at the new patch.  The changes to system_views.sql look
> good to me.

Thanks for reviewing.

> Let's be sure to update doc/src/sgml/catalogs.sgml as
> well.

Added.

> -SELECT * FROM pg_log_backend_memory_contexts(pg_backend_pid());
> +SELECT pg_log_backend_memory_contexts(pg_backend_pid());
>
> nitpick: Do we need to remove the "* FROM" here?  This seems like an
> unrelated change.

Yes, it's not mandatory, but while we are on this I thought we could
combine them; I've also noted this in the commit message. IMO, we
can leave it to the committer.

> +-- test to check privileges of system views pg_shmem_allocations,
> +-- pg_backend_memory_contexts and function pg_log_backend_memory_contexts.
>
> I think the comment needs to be updated to remove the reference to
> pg_log_backend_memory_contexts.  It doesn't appear to be tested here.

Removed.

> +SELECT name, ident, parent, level, total_bytes >= free_bytes
> +  FROM pg_backend_memory_contexts WHERE level = 0; -- permission denied error
> +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations; -- permission denied 
> error
>
> Since we're really just checking the basic permissions, could we just
> do the "count(*) >= 0" check for both views?

Done.
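
For reference, here is a rough sketch of the behavior the patch aims for
(this assumes the v3 patch is applied; the role name is made up):

CREATE ROLE regress_stats_user;
GRANT pg_read_all_stats TO regress_stats_user;
SET ROLE regress_stats_user;
SELECT count(*) >= 0 AS ok FROM pg_backend_memory_contexts;
SELECT count(*) >= 0 AS ok FROM pg_shmem_allocations;
RESET ROLE;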

Here's v3 for further review.

Regards,
Bharath Rupireddy.


v3-0001-change-privileges-of-pg_backend_memory_contexts-a.patch
Description: Binary data


RE: Data is copied twice when specifying both child and parent table in publication

2021-10-21 Thread shiy.f...@fujitsu.com
On Tuesday, October 19, 2021 10:47 AM houzj.f...@fujitsu.com 
 wrote:
> 
> On Monday, October 18, 2021 5:03 PM Amit Langote
>  wrote:
> > I can imagine that the behavior seen here may look surprising, but not
> > sure if I would call it a bug as such.  I do remember thinking about
> > this case and the current behavior is how I may have coded it to be.
> >
> > Looking at this command in Hou-san's email:
> >
> >   create publication pub for table tbl1, tbl1_part1 with
> > (publish_via_partition_root=on);
> >
> > It's adding both the root partitioned table and the leaf partition
> > *explicitly*, and it's not clear to me if the latter's inclusion in
> > the publication should be assumed because the former is found to have
> > been added to the publication, that is, as far as the latter's
> > visibility to the subscriber is concerned.  It's not a stretch to
> > imagine that a user may write the command this way to account for a
> > subscriber node on which tbl1 and tbl1_part1 are unrelated tables.
> >
> > I don't think we assume anything on the publisher side regarding the
> > state/configuration of tables on the subscriber side, at least with
> > publication commands where tables are added to a publication
> > explicitly, so it is up to the user to make sure that the tables are
> > not added duplicatively.  One may however argue that the way we've
> > decided to handle FOR ALL TABLES does assume something about
> > partitions where it skips advertising them to subscribers when
> > publish_via_partition_root flag is set to true, but that is exactly to
> > avoid the duplication of data that goes to a subscriber.
> 
> Hi,
> 
> Thanks for the explanation.
> 
> I think one reason that I consider this behavior a bug is that: If we add
> both the root partitioned table and the leaf partition explicitly to the
> publication (and set publish_via_partition_root = on), the behavior of the
> apply worker is inconsistent with the behavior of table sync worker.
> 
> In this case, all changes in the leaf the partition will be applied using the
> identity and schema of the partitioned(root) table. But for the table sync, it
> will execute table sync for both the leaf and the root table which cause
> duplication of data.
> 
> Wouldn't it be better to make the behavior consistent here ?
> 

I agree with this point. 

About this case,

> >   create publication pub for table tbl1, tbl1_part1 with
> > (publish_via_partition_root=on);

As a user, even though a partitioned table includes its partitions, publishing
both the partitioned table and its partition is allowed. So I think we should
take this case into consideration. Copying the initial data only once, via the
parent table, seems reasonable.
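
To make the scenario concrete (only the publication command comes from the
quoted example; the table definitions below are assumed):

create table tbl1 (a int) partition by range (a);
create table tbl1_part1 partition of tbl1 for values from (1) to (10);
create publication pub for table tbl1, tbl1_part1 with (publish_via_partition_root=on);
-- with the current code, initial table sync on the subscriber copies rows
-- for both tbl1 and tbl1_part1, so the partition's rows arrive twice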

Regards
Shi yu


MDAM techniques and Index Skip Scan patch

2021-10-21 Thread Peter Geoghegan
I returned to the 1995 paper "Efficient Search of Multidimensional
B-Trees" [1] as part of the process of reviewing v39 of the skip scan
patch, which was posted back in May. It's a great paper, and anybody
involved in the skip scan effort should read it thoroughly (if they
haven't already). It's easy to see why people get excited about skip
scan [2]. But there is a bigger picture here.

I don't necessarily expect to come away from this discussion with a
much better high level architecture for the patch, or any kind of
deeper insight, or even a frame of reference for further discussion. I
just think that we ought to *try* to impose some order on this stuff.

Like many difficult patches, the skip scan patch is not so much
troubled by problems with the implementation as it is troubled by
*ambiguity* about the design. Particularly concerning how skip scan
meshes with existing designs, as well as future designs --
particularly designs for other MDAM techniques. I've started this
thread to have a big picture conversation about how to think about
these things. Many other MDAM techniques also seem highly appealing.
Much of the MDAM stuff is for data warehousing use-cases, while skip
scan/loose index scan is seen as more of an OLTP thing. But they are
still related, clearly.

I'd like to also talk about another patch, that ISTM had that same
quality -- it was also held back by high level design uncertainty. Back in 2018,
Tom abandoned a patch that transformed a star-schema style query with
left outer joins on dimension tables with OR conditions, into an
equivalent query that UNIONs together 2 distinct queries [3][4].

Believe it or not, I am now reminded of that patch by the example of
"IN() Lists", from page 5 of the paper. We see this example SQL query:

SELECT date, item_class, store, sum(total_sales)
FROM sales
WHERE date between '06/01/95' and '06/30/95' and
item_class IN (20,35,50) and
store IN (200,250)
GROUP BY dept, date, item_class, store;

Granted, this SQL might not seem directly relevant to Tom's patch at
first -- there is no join for the optimizer to even try to eliminate,
which was the whole basis of Jim Nasby's original complaint, which is
what spurred Tom to write the patch in the first place. But hear me
out: there is still a fact table (the sales table) with some
dimensions (the 'D' from 'MDAM') shown in the predicate. Moreover, the
table (and this SQL query) drives discussion of an optimization
involving transforming a predicate with many ORs (which is explicitly
said to be logically/semantically equivalent to the IN() lists from
the query). They transform the query into a bunch of disjunct clauses
that can easily be independently executed, and combined at the end
(see also "General OR Optimization" on page 6 of the paper).

Also...I'm not entirely sure that the intended underlying "physical
plan" is truly free of join-like scans. If you squint just right, you
might see something that you could think of as a "physical join" (at
least very informally). The whole point of this particular "IN()
Lists" example is that we get to the following, for each distinct
"dept" and "date" in the table:

dept=1, date='06/04/95', item_class=20, store=200
dept=1, date='06/04/95', item_class=20, store=250
dept=1, date='06/04/95', item_class=35, store=200
dept=1, date='06/04/95', item_class=35, store=250
dept=1, date='06/04/95', item_class=50, store=200
dept=1, date='06/04/95', item_class=50, store=250

There are 2400 such accesses in total after transformation -- imagine
additional lines like these, for every distinct combination of dept
and date (only for those dates that actually had sales, which they
enumerate up-front), for store 200 and 250, and item_class 20, 35, and
50. This adds up to 2400 lines in total. Even 2400 index probes will
be much faster than a full table scan, given that this is a large fact
table. The "sales" table is a clustered index whose keys are on the
columns "(dept, date, item_class, store)", per note at the top of page
4. The whole point is to avoid having any secondary indexes on this
fact table, without getting a full scan. We can just probe the primary
key 2400 times instead, following this transformation. No need for
secondary indexes.

The plan can be thought of as a DAG, at least informally. This is also
somewhat similar to what Tom was thinking about back in 2018. Tom had
to deduplicate rows during execution (IIRC using a UNION style ad-hoc
approach that sorted on TIDs), whereas I think that they can get away
with skipping that extra step. Page 7 says "MDAM removes duplicates
before reading the data, so it does not have to do any post read
operations to accomplish duplicate elimination (a common problem with
OR optimization)".

My general concern is that the skip scan patch may currently be
structured in a way that paints us into a corner, MDAM-wise.

Note that the MDAM paper treats skipping a prefix of columns as a case
where the prefix is handled by pretending that there is a c

Re: Experimenting with hash tables inside pg_dump

2021-10-21 Thread Tom Lane
Andres Freund  writes:
> I wonder though if for some of them we should instead replace the per-object
> queries with one query returning the information for all objects of a type. It
> doesn't make all that much sense that we build and send one query for each
> table and index.

The trick is the problem I alluded to in another thread: it's not safe to
do stuff like pg_get_expr() on tables we don't have lock on.

I've thought about doing something like

SELECT unsafe-functions FROM pg_class WHERE oid IN (someoid, someoid, ...)

but in cases with tens of thousands of tables, it seems unlikely that
that's going to behave all that nicely.

The *real* fix, I suppose, would be to fix all those catalog-inspection
functions so that they operate with respect to the query's snapshot.
But that's not a job I'm volunteering for.  Besides which, pg_dump
still has to cope with back-rev servers where it wouldn't be safe.

regards, tom lane




Re: pgstat_assert_is_up() can fail in walsender

2021-10-21 Thread Amit Langote
On Fri, Oct 22, 2021 at 2:14 AM Fujii Masao  wrote:
> On 2021/10/19 22:14, Amit Langote wrote:
> > Hi,
> >
> > I can (almost) consistently reproduce $subject by executing the
> > attached shell script, which I was using while constructing a test
> > case for another thread.
>
> This seems the same issue that was reported at the thread [1].
>
> [1]
> https://www.postgresql.org/message-id/os0pr01mb571621b206eeb17d8ab171f094...@os0pr01mb5716.jpnprd01.prod.outlook.com

Ah, indeed.  Thank you.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: [Bug] Logical Replication failing if the DateStyle is different in Publisher & Subscriber

2021-10-21 Thread Japin Li


On Fri, 22 Oct 2021 at 08:26, Masahiko Sawada  wrote:
> On Thu, Oct 21, 2021 at 11:18 PM Japin Li  wrote:
>>
>> How it breaks?
>
> I don't know the real case but for example, if an application gets
> changes via pg_recvlogical with a decoding plugin (say wal2json) from
> the database whose DateStyle setting is "SQL, MDY", it expects that
> the date values in the streamed data are in the style of "SQL, MDY".
> But with this change, it will get date values in the style of "ISO"
> which could lead to a parse error in the application.
>
>>  The user also can specify the "options" to get date data
>> in the style which they are wanted. Right?
>
> Right. But doesn't it mean breaking the compatibility?
>

Yeah, it might break compatibility.

In conclusion, there are two ways to fix this bug:

1. Set the parameters on the publisher; this might break compatibility.
2. Set the parameters on the subscriber. In my first patch, I tried to set
   the parameters after establishing the connection, which leads to more
   network round trips. Instead, we can set the parameters when connecting
   to the walsender using "options".

For the second way, should we set the parameters to the same values as on
the subscriber, or use fixed parameters (e.g. datestyle = "ISO") like
postgres_fdw's set_transmission_modes()?
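
For illustration, here is a minimal sketch of the second way, assuming we
force the same settings postgres_fdw's set_transmission_modes() uses (the
host/port and object names below are placeholders):

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=localhost port=5432 dbname=postgres options=''-c datestyle=ISO -c intervalstyle=postgres -c extra_float_digits=3'''
    PUBLICATION mypub;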

Any thoughts?

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Improve logging when using Huge Pages

2021-10-21 Thread Justin Pryzby
+   ereport(LOG, (errmsg("Anonymous shared memory was allocated %s 
huge pages.", with_hugepages ? "with" : "without")));

You shouldn't break a sentence into pieces like this, since it breaks
translation.  You don't want an untranslated "without" to appear in the middle
of the translated message.

There are cases where a component *shouldn't* be translated, like this one:
where "numeric" should not be translated.

src/backend/utils/adt/numeric.c: 
errmsg("invalid input syntax for type %s: \"%s\"",
src/backend/utils/adt/numeric.c-
"numeric", str)));

-- 
Justin




Re: Drop replslot after pgstat_shutdown cause assert coredump

2021-10-21 Thread Kyotaro Horiguchi
I said:
> Considering the coming shared-memory based stats collector, pgstat
> must be shutdown before shared memory shutdown.  Every operation that
> requires stats collector also must be shut down before the pgstat
> shutdown. A naive solution would be having before-pgstat-shutdown hook
> but I'm not sure it's the right direction.

For this particular issue, we can add an explicit per-backend
initialization phase for replication slots, which simply registers a
before_shmem_exit callback.  It would work fine unless we carelessly
place the initialization before the pgstat_initialize() (not
pgstat_init()) call.

(Honestly, I haven't been able to reproduce the issue itself for
 myself yet..)

> > on_proc_exit_list (I'm not sure if this change is safe, though).
> > Or maybe pgstat logic for replication slot drop needs to be
> > overhauled.
> 
> I think we don't want to lose the stats numbers of the to-be-dropped
> slot. So the slot-drop must happen before pgstat shutdown.

I haven't looked for other similar issues yet. I'm going to check
whether they, if any, can be fixed the same way.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7d0fbaefd..13762f82af 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -306,6 +306,8 @@ static bool pgstat_is_initialized = false;
 static bool pgstat_is_shutdown = false;
 #endif
 
+/* per-backend variable for assertion */
+bool pgstat_initialized PG_USED_FOR_ASSERTS_ONLY = false;
 
 /* --
  * Local function forward declarations
@@ -3036,6 +3038,7 @@ pgstat_initialize(void)
 
/* Set up a process-exit hook to clean up */
before_shmem_exit(pgstat_shutdown_hook, 0);
+   pgstat_initialized = true;
 
 #ifdef USE_ASSERT_CHECKING
pgstat_is_initialized = true;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1c6c0c7ce2..e0430aefa9 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -46,6 +46,7 @@
 #include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
@@ -160,6 +161,33 @@ ReplicationSlotsShmemInit(void)
}
 }
 
+/*
+ * Exit hook to cleanup replication slots.
+ */
+static void
+ReplicationSlotShutdown(int code, Datum arg)
+{
+   /* Make sure active replication slots are released */
+   if (MyReplicationSlot != NULL)
+   ReplicationSlotRelease();
+
+   /* Also cleanup all the temporary slots. */
+   ReplicationSlotCleanup();
+}
+
+/*
+ * Initialize of replication slot facility per backend.
+ */
+void
+ReplicationSlotInit(void)
+{
+   if (max_replication_slots > 0)
+   {
+   assert_pgstat_initialized();
+   before_shmem_exit(ReplicationSlotShutdown, (Datum) 0);
+   }
+}
+
 /*
  * Check whether the passed slot name is valid and report errors at elevel.
  *
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b7d9da0aa9..b593ec8964 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -41,7 +41,6 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
-#include "replication/slot.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
@@ -847,13 +846,6 @@ ProcKill(int code, Datum arg)
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
 
-   /* Make sure active replication slots are released */
-   if (MyReplicationSlot != NULL)
-   ReplicationSlotRelease();
-
-   /* Also cleanup all the temporary slots. */
-   ReplicationSlotCleanup();
-
/*
 * Detach from any lock group of which we are a member.  If the leader
 * exist before all other group members, its PGPROC will remain 
allocated
diff --git a/src/backend/utils/init/postinit.c 
b/src/backend/utils/init/postinit.c
index 78bc64671e..dd83864b54 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -40,6 +40,7 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "replication/slot.h"
 #include "replication/walsender.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
@@ -531,6 +532,12 @@ BaseInit(void)
 */
pgstat_initialize();
 
+   /*
+* Initialize replication slot. This must be after pgstat_initialize() 
so
+* that the cleanup happnes before the shutdown of pgstat facility.
+*/
+   ReplicationSlotInit();
+
/* Do local initialization of storage and buffer managers */
InitSync();
smgrinit();
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bcd3588ea2..f06810c115 100644
--- a/src/include/pgstat.h
+++ b/src

Re: XTS cipher mode for cluster file encryption

2021-10-21 Thread Sasasu

On 2021/10/22 01:28, Stephen Frost wrote:

None of these are actually working with or changing the data though,
they're just copying it.  I don't think we'd actually want these to
decrypt and reencrypt the data.


OK, but why does ALTER TABLE SET TABLESPACE use smgrread() and
smgrextend() instead of copy_file()?

TDE needs to modify these code paths, which makes the patch bigger.

On 2021/10/22 01:28, Stephen Frost wrote:
> No, the CTR approach isn't great because, as has been discussed quite a
> bit already, using the LSN as the IV means that different data might be
> re-encrypted with the same LSN and that's not an acceptable thing to
> have happen with CTR.
On 2021/10/22 01:28, Stephen Frost wrote:
> Thankfully, for WAL
> (unlike heap and index blocks) we don't really have that issue- we
> hopefully aren't going to write different WAL records at the same LSN
> and so using the LSN there should be alright.
On 2021/10/22 01:28, Stephen Frost wrote:
> We've discussed at length how using CTR for heap isn't a good idea even
> if we're using the LSN for the IV, while if we use XTS then we don't
> have the issues that CTR has with IV re-use and using the LSN (plus
> block number and perhaps other things).  Nothing in what has been
> discussed here has really changed anything there that I can see and so
> it's unclear to me why we continue to go round and round with it.

I am not re-discussing using CTR for heap tables. I mean using some CTR-like
algorithm *only* for WAL encryption. My idea is exactly the same as what you
wrote: "we hopefully aren't going to write different WAL records at the same
LSN and so using the LSN there should be alright."


The point of disagreement between you and me is only on the block-based API.

On 2021/10/22 01:28, Stephen Frost wrote:
> it's unclear to me why we continue to go round and round with it.

Same for me. I have been following this thread for about 9 months, watching
the discussion of key management/CBC/CTR/GCM go round and round.


OpenPGP_0x4E72AF09097DAE2E.asc
Description: OpenPGP public key


OpenPGP_signature
Description: OpenPGP digital signature


Re: row filtering for logical replication

2021-10-21 Thread Peter Smith
On Thu, Sep 23, 2021 at 10:33 PM Tomas Vondra
 wrote:
>
> 11) extra (unnecessary) parens in the deparsed expression
>
> test=# alter publication p add table t where ((b < 100) and (c < 100));
> ALTER PUBLICATION
> test=# \dRp+ p
>Publication p
>   Owner | All tables | Inserts | Updates | Deletes | Truncates | Via root
> ---++-+-+-+---+--
>   user  | f  | t   | t   | t   | t | f
> Tables:
>  "public.t" WHERE (((b < 100) AND (c < 100)))
>

I also reported the same as this some months back, but at that time it
was rejected citing some pg_dump patch. (Please see [1] #14).

--
[1] 
https://www.postgresql.org/message-id/532a18d8-ce90--8570-8a9fcf09f329%40www.fastmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Drop replslot after pgstat_shutdown cause assert coredump

2021-10-21 Thread Kyotaro Horiguchi
At Fri, 22 Oct 2021 11:43:08 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> (Honestly, I haven't been able to reproduce the issue itself for
>  myself yet..)

I managed to reproduce it myself.

psql "dbname=postgres replication=database"
postgres=# CREATE_REPLICATION_SLOT "ts1" TEMPORARY LOGICAL "pgoutput";
postgres=# C-d
(crash)

And confirmed that it doesn't happen with the fix.

> I haven't looked for other similar issues yet. I'm going to check
> whether they, if any, can be fixed the same way.

FileClose calls pgstat_report_tempfile() via
BeforeShmemExit_Files. It is already registered after pgstat.
I added a call to assert_pgstat_initialized() to
InitTemporaryFileAccess() as well.

All other pgstat functions seem to be called outside shmem_exit.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7d0fbaefd..13762f82af 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -306,6 +306,8 @@ static bool pgstat_is_initialized = false;
 static bool pgstat_is_shutdown = false;
 #endif
 
+/* per-backend variable for assertion */
+bool pgstat_initialized PG_USED_FOR_ASSERTS_ONLY = false;
 
 /* --
  * Local function forward declarations
@@ -3036,6 +3038,7 @@ pgstat_initialize(void)
 
/* Set up a process-exit hook to clean up */
before_shmem_exit(pgstat_shutdown_hook, 0);
+   pgstat_initialized = true;
 
 #ifdef USE_ASSERT_CHECKING
pgstat_is_initialized = true;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1c6c0c7ce2..b2c719d31e 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -46,6 +46,7 @@
 #include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
@@ -160,6 +161,33 @@ ReplicationSlotsShmemInit(void)
}
 }
 
+/*
+ * Exit hook to cleanup replication slots.
+ */
+static void
+ReplicationSlotShutdown(int code, Datum arg)
+{
+   /* Make sure active replication slots are released */
+   if (MyReplicationSlot != NULL)
+   ReplicationSlotRelease();
+
+   /* Also cleanup all the temporary slots. */
+   ReplicationSlotCleanup();
+}
+
+/*
+ * Initialize replication slot facility per backend.
+ */
+void
+ReplicationSlotInit(void)
+{
+   if (max_replication_slots < 1)
+   return;
+
+   assert_pgstat_initialized();/* the callback requires pgstat */
+   before_shmem_exit(ReplicationSlotShutdown, (Datum) 0);
+}
+
 /*
  * Check whether the passed slot name is valid and report errors at elevel.
  *
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f9cda6906d..8fbacdc86c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -917,6 +917,7 @@ InitTemporaryFileAccess(void)
 * Register before-shmem-exit hook to ensure temp files are dropped 
while
 * we can still report stats.
 */
+   assert_pgstat_initialized();/* the callback requires pgstat */
before_shmem_exit(BeforeShmemExit_Files, 0);
 
 #ifdef USE_ASSERT_CHECKING
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b7d9da0aa9..b593ec8964 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -41,7 +41,6 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
-#include "replication/slot.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
@@ -847,13 +846,6 @@ ProcKill(int code, Datum arg)
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
 
-   /* Make sure active replication slots are released */
-   if (MyReplicationSlot != NULL)
-   ReplicationSlotRelease();
-
-   /* Also cleanup all the temporary slots. */
-   ReplicationSlotCleanup();
-
/*
 * Detach from any lock group of which we are a member.  If the leader
 * exist before all other group members, its PGPROC will remain 
allocated
diff --git a/src/backend/utils/init/postinit.c 
b/src/backend/utils/init/postinit.c
index 78bc64671e..b7c1a400f5 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -40,6 +40,7 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "replication/slot.h"
 #include "replication/walsender.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
@@ -531,6 +532,12 @@ BaseInit(void)
 */
pgstat_initialize();
 
+   /*
+* Initialize replication slot. This must be after pgstat_initialize() 
so
+* that the cleanup happens before the shutdown of pgstat facility.
+*/
+   ReplicationSlotInit();
+
/* Do local initialization of storage an

Re: Added schema level support for publication.

2021-10-21 Thread Amit Kapila
On Thu, Oct 21, 2021 at 6:47 PM vignesh C  wrote:
>
>
> Thanks for the comments, the attached v45 patch has the fix for the same.
>

The first patch is mostly looking good to me apart from the below
minor comments:

1.
+  
+   The catalog pg_publication_namespace contains the
+   mapping between schemas and publications in the database.  This is a
+   many-to-many mapping.

There are extra spaces after mapping at the end which are not required.

2.
+   CREATE privilege on the database.  Also, the new owner
+   of a FOR ALL TABLES publication must be a superuser.

I think we can modify the second line as: "Also, the new owner of a
FOR ALL TABLES or FOR ALL TABLES IN
SCHEMA publication must be a superuser.

3.
/table's schema as part of specified schema is not supported./table's
schema as part of the specified schema is not supported.

4.
+  
+   Create a publication that publishes all changes for tables
+   users, departments and
+   that publishes all changes for all the tables present in the schema
+   production:

I don't think '...that publishes...' is required twice in the above sentence.

5.
+static List *OpenReliIdList(List *relids);
 static List *OpenTableList(List *tables);
 static void CloseTableList(List *rels);
 static void PublicationAddTables(Oid pubid, List *rels, bool if_not_exists,
  AlterPublicationStmt *stmt);
 static void PublicationDropTables(Oid pubid, List *rels, bool missing_ok);
+static void LockSchemaList(List *schemalist);
+static void PublicationAddSchemas(Oid pubid, List *schemas, bool if_not_exists,
+   AlterPublicationStmt *stmt);
+static void PublicationDropSchemas(Oid pubid, List *schemas, bool missing_ok);

Keep the later definitions also in this order. I suggest move
LockSchemaList() just after CloseTableList() both in declaration and
definition.

6.
+/*
+ * Convert the PublicationObjSpecType list into schema oid list and rangevar
+ * list.
+ */

I think you need to say publication table instead of rangevar in the
above comment.

7.
+ /*
+ * It is quite possible that for the SET case user has not specified any
+ * schema in which case we need to remove all the existing schemas.
+ */

/schema/schemas

8.
+/*
+ * Open relations specified by a RangeVar list.

/RangeVar/PublicationTable

9.
+static bool
+_equalPublicationObject(const PublicationObjSpec *a,
+ const PublicationObjSpec *b)
+{
+ COMPARE_SCALAR_FIELD(pubobjtype);
+ COMPARE_STRING_FIELD(name);
+ COMPARE_NODE_FIELD(pubtable);
+ COMPARE_LOCATION_FIELD(location);
+
+ return true;
+}
+

Let's define this immediately before _equalPublicationTable as all
publication functions are defined there. Also, make the handling of
T_PublicationObjSpec before T_PublicationTable in equal() function as
that is the way nodes are defined.

-- 
With Regards,
Amit Kapila.




Re: Parallel vacuum workers prevent the oldest xmin from advancing

2021-10-21 Thread Masahiko Sawada
On Wed, Oct 20, 2021 at 9:27 AM Masahiko Sawada  wrote:
>
> On Wed, Oct 20, 2021 at 3:07 AM Alvaro Herrera  
> wrote:
> >
> > On 2021-Oct-19, Alvaro Herrera wrote:
> >
>
> Thank you for the comment.
>
> > > Hmm, I think this should happen before the transaction snapshot is
> > > established in the worker; perhaps immediately after calling
> > > StartParallelWorkerTransaction(), or anyway not after
> > > SetTransactionSnapshot.  In fact, since SetTransactionSnapshot receives
> > > a 'sourceproc' argument, why not do it exactly there? ISTM that
> > > ProcArrayInstallRestoredXmin() is where this should happen.
> >
> > ... and there is a question about the lock strength used for
> > ProcArrayLock.  The current routine uses LW_SHARED, but there's no
> > clarity that we can modify proc->statusFlags and ProcGlobal->statusFlags
> > without LW_EXCLUSIVE.
> >
> > Maybe we can change ProcArrayInstallRestoredXmin so that if it sees that
> > proc->statusFlags is not zero, then it grabs LW_EXCLUSIVE (and copies),
> > otherwise it keeps using LW_SHARED as it does now (and does not copy.)
>
> Initially, I've considered copying statusFlags in
> ProcArrayInstallRestoredXmin() but I hesitated to do that because
> statusFlags is not relevant with xmin and snapshot stuff. But I agree
> that copying statusFlags should happen before restoring the snapshot.
>
> If we copy statusFlags in ProcArrayInstallRestoredXmin() there is
> still little window that the restored snapshot holds back the oldest
> xmin?

That's wrong, I'd misunderstood.

I agree to copy statusFlags in ProcArrayInstallRestoredXmin(). I've
updated the patch accordingly.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


copy_status_flags_v3.patch
Description: Binary data


Re: Experimenting with hash tables inside pg_dump

2021-10-21 Thread Andres Freund
Hi,

On 2021-10-21 22:13:22 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I wonder though if for some of them we should instead replace the per-object
> > queries with one query returning the information for all objects of a type. 
> > It
> > doesn't make all that much sense that we build and send one query for each
> > table and index.
> 
> The trick is the problem I alluded to in another thread: it's not safe to
> do stuff like pg_get_expr() on tables we don't have lock on.

I was looking at getTableAttrs() - sending one query instead of #tables
queries yields a quite substantial speedup in a quick prototype. And I don't
think it changes anything around locking semantics.


> I've thought about doing something like
> 
> SELECT unsafe-functions FROM pg_class WHERE oid IN (someoid, someoid, ...)
> 
> but in cases with tens of thousands of tables, it seems unlikely that
> that's going to behave all that nicely.

That's kinda what I'm doing in the quick hack. But instead of using IN(...) I
made it unnest('{oid, oid, ...}'), that scales much better.
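
As a rough illustration of the query shape (a trimmed column list and
placeholder OIDs, not the prototype's actual query):

SELECT a.attrelid, a.attnum, a.attname, a.atttypid, a.attnotnull
FROM unnest('{16384,16385,16386}'::pg_catalog.oid[]) AS src(relid)
JOIN pg_catalog.pg_attribute a ON a.attrelid = src.relid
WHERE a.attnum > 0 AND NOT a.attisdropped
ORDER BY a.attrelid, a.attnum;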

A pg_dump --schema-only of the regression database goes from

real0m0.675s
user0m0.039s
sys 0m0.029s

to

real0m0.477s
user0m0.037s
sys 0m0.020s

which isn't half-bad.

There are a few more cases like this, I think. But most are harder because the
dumping happens one-by-one from dumpDumpableObject(). The relatively easy but
substantial cases I could find quickly were getIndexes(), getConstraints(),
and getTriggers().


To see where it's worth putting in time it'd be useful if getSchemaData() in
verbose mode printed timing information...


> The *real* fix, I suppose, would be to fix all those catalog-inspection
> functions so that they operate with respect to the query's snapshot.
> But that's not a job I'm volunteering for.  Besides which, pg_dump
> still has to cope with back-rev servers where it wouldn't be safe.

Yea, that's not a small change :(. I suspect that we'd need a bunch of new
caching infrastructure to make that reasonably performant, since this
presumably couldn't use syscache etc.

Greetings,

Andres Freund
diff --git c/src/bin/pg_dump/pg_dump.c i/src/bin/pg_dump/pg_dump.c
index ed8ed2f266e..7ec84428e9c 100644
--- c/src/bin/pg_dump/pg_dump.c
+++ i/src/bin/pg_dump/pg_dump.c
@@ -43,6 +43,7 @@
 #include "access/transam.h"
 #include "catalog/pg_aggregate_d.h"
 #include "catalog/pg_am_d.h"
+#include "catalog/pg_attrdef.h"
 #include "catalog/pg_attribute_d.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_cast_d.h"
@@ -8397,6 +8398,14 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 {
 	DumpOptions *dopt = fout->dopt;
 	PQExpBuffer q = createPQExpBuffer();
+	PQExpBuffer tblidx = createPQExpBuffer();
+	PQExpBuffer tbloid = createPQExpBuffer();
+
+	PGresult   *res;
+	bool		first;
+	int			i_tbloid;
+	int			i_idx;
+	int			i_natts;
 	int			i_attnum;
 	int			i_attname;
 	int			i_atttypname;
@@ -8417,13 +8426,28 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 	int			i_attfdwoptions;
 	int			i_attmissingval;
 	int			i_atthasdef;
+	int			i_attdef;
+	int			i_attdefoid;
 
+	int			ntups;
+	int			tblidx_prev = -1;
+	Oid			tbloid_prev = InvalidOid;
+	int			natts_prev = -1;
+	int			natt_prev = -1;
+	TableInfo  *tbinfo = NULL;
+	int			tbstart = -1;
+
+	resetPQExpBuffer(tblidx);
+	resetPQExpBuffer(tbloid);
+
+	appendPQExpBufferStr(tblidx, "'{");
+	appendPQExpBufferStr(tbloid, "'{");
+
+	/* find all the user attributes and their types */
+	first = true;
 	for (int i = 0; i < numTables; i++)
 	{
 		TableInfo  *tbinfo = &tblinfo[i];
-		PGresult   *res;
-		int			ntups;
-		bool		hasdefaults;
 
 		/* Don't bother to collect info for sequences */
 		if (tbinfo->relkind == RELKIND_SEQUENCE)
@@ -8433,301 +8457,376 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 		if (!tbinfo->interesting)
 			continue;
 
-		/* find all the user attributes and their types */
-
-		/*
-		 * we must read the attribute names in attribute number order! because
-		 * we will use the attnum to index into the attnames array later.
-		 */
-		pg_log_info("finding the columns and types of table \"%s.%s\"",
-	tbinfo->dobj.namespace->dobj.name,
-	tbinfo->dobj.name);
-
-		resetPQExpBuffer(q);
-
-		appendPQExpBufferStr(q,
-			 "SELECT\n"
-			 "a.attnum,\n"
-			 "a.attname,\n"
-			 "a.atttypmod,\n"
-			 "a.attstattarget,\n"
-			 "a.attstorage,\n"
-			 "t.typstorage,\n"
-			 "a.attnotnull,\n"
-			 "a.atthasdef,\n"
-			 "a.attisdropped,\n"
-			 "a.attlen,\n"
-			 "a.attalign,\n"
-			 "a.attislocal,\n"
-			 "pg_catalog.format_type(t.oid, a.atttypmod) AS atttypname,\n");
-
-		if (fout->remoteVersion >= 9)
-			appendPQExpBufferStr(q,
- "array_to_string(a.attoptions, ', ') AS attoptions,\n");
+		if (first)
+		{
+			appendPQExpBuffer(tblidx, "%u", i);
+			appendPQExpBuffer(tbloid, "%u", tbinfo->dobj.catId.oid);
+			first = false;
+		}
 		else
-			appendPQE

Re: Partial aggregates pushdown

2021-10-21 Thread Alexander Pyhalov

Zhihong Yu wrote on 2021-10-22 00:43:

Hi,
w.r.t. 0001-Partial-aggregates-push-down-v03.patch



Hi.


For partial_agg_ok(),

+   if (agg->aggdistinct || agg->aggvariadic || agg->aggkind !=
AGGKIND_NORMAL || agg->aggorder != NIL)
+   ok = false;

Since SearchSysCache1() is not called yet, you can return false
directly.


Fixed.



+   if (aggform->aggpartialpushdownsafe != true)

The above can be written as:

   if (!aggform->aggpartialpushdownsafe)


Fixed.



For build_conv_list():

+   Oid converter_oid = InvalidOid;
+
+   if (IsA(tlentry->expr, Aggref))

...
+   }
+   convlist = lappend_oid(convlist, converter_oid);

Do you intend to append InvalidOid to convlist (when tlentry->expr is
not Aggref) ?


Yes, for each tlist member (which matches fpinfo->grouped_tlist in the
case when foreignrel is an UPPER_REL) we need to find the corresponding
converter. If we don't append InvalidOid, we can't find the convlist
member corresponding to a tlist member. Added comments to build_conv_list.


Also fixed an error in pg_dump.c (we selected '0' when
aggpartialconverterfn was not defined in the schema, but checked for '-').


--
Best regards,
Alexander Pyhalov,
Postgres ProfessionalFrom a18e2ff8de00592797e7c3ccb8d6cd536a2e4e72 Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov 
Date: Thu, 14 Oct 2021 17:30:34 +0300
Subject: [PATCH] Partial aggregates push down

---
 contrib/postgres_fdw/deparse.c|  46 +++-
 .../postgres_fdw/expected/postgres_fdw.out| 185 +++-
 contrib/postgres_fdw/postgres_fdw.c   | 204 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql |  27 ++-
 src/backend/catalog/pg_aggregate.c|  28 ++-
 src/backend/commands/aggregatecmds.c  |  23 +-
 src/backend/utils/adt/numeric.c   |  96 +
 src/bin/pg_dump/pg_dump.c |  21 +-
 src/include/catalog/catversion.h  |   2 +-
 src/include/catalog/pg_aggregate.dat  | 106 -
 src/include/catalog/pg_aggregate.h|  10 +-
 src/include/catalog/pg_proc.dat   |   6 +
 src/test/regress/expected/oidjoins.out|   1 +
 13 files changed, 672 insertions(+), 83 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index d98bd666818..fa9f487d66d 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -196,6 +196,7 @@ static bool is_subquery_var(Var *node, RelOptInfo *foreignrel,
 static void get_relation_column_alias_ids(Var *node, RelOptInfo *foreignrel,
 		  int *relno, int *colno);
 
+static bool partial_agg_ok(Aggref *agg);
 
 /*
  * Examine each qual clause in input_conds, and classify them into two groups,
@@ -831,8 +832,10 @@ foreign_expr_walker(Node *node,
 if (!IS_UPPER_REL(glob_cxt->foreignrel))
 	return false;
 
-/* Only non-split aggregates are pushable. */
-if (agg->aggsplit != AGGSPLIT_SIMPLE)
+if ((agg->aggsplit != AGGSPLIT_SIMPLE) && (agg->aggsplit != AGGSPLIT_INITIAL_SERIAL))
+	return false;
+
+if (agg->aggsplit == AGGSPLIT_INITIAL_SERIAL && !partial_agg_ok(agg))
 	return false;
 
 /* As usual, it must be shippable. */
@@ -3249,7 +3252,7 @@ deparseAggref(Aggref *node, deparse_expr_cxt *context)
 	bool		use_variadic;
 
 	/* Only basic, non-split aggregation accepted. */
-	Assert(node->aggsplit == AGGSPLIT_SIMPLE);
+	Assert(node->aggsplit == AGGSPLIT_SIMPLE || node->aggsplit == AGGSPLIT_INITIAL_SERIAL);
 
 	/* Check if need to print VARIADIC (cf. ruleutils.c) */
 	use_variadic = node->aggvariadic;
@@ -3719,3 +3722,40 @@ get_relation_column_alias_ids(Var *node, RelOptInfo *foreignrel,
 	/* Shouldn't get here */
 	elog(ERROR, "unexpected expression in subquery output");
 }
+
+/*
+ * Check that partial aggregate agg is fine to push down
+ */
+static bool
+partial_agg_ok(Aggref *agg)
+{
+	HeapTuple	aggtup;
+	Form_pg_aggregate aggform;
+	bool		ok = true;
+
+	Assert(agg->aggsplit == AGGSPLIT_INITIAL_SERIAL);
+
+	/* We don't support complex partial aggregates */
+	if (agg->aggdistinct || agg->aggvariadic || agg->aggkind != AGGKIND_NORMAL || agg->aggorder != NIL)
+		return false;
+
+	aggtup = SearchSysCache1(AGGFNOID, ObjectIdGetDatum(agg->aggfnoid));
+	if (!HeapTupleIsValid(aggtup))
+		elog(ERROR, "cache lookup failed for function %u", agg->aggfnoid);
+	aggform = (Form_pg_aggregate) GETSTRUCT(aggtup);
+
+	/* Only aggregates, marked as pushdown safe, are allowed */
+	if (!aggform->aggpartialpushdownsafe)
+		ok = false;
+
+	/*
+	 * But if an aggregate requires serialization/deserialization, partial
+	 * converter should be defined
+	 */
+	if (ok && agg->aggtranstype == INTERNALOID)
+		ok = (aggform->aggpartialconverterfn != InvalidOid);
+
+	ReleaseSysCache(aggtup);
+
+	return ok;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 44c4367b8f9..3e1b997875e 100644
--- a/contrib/postgres_fdw/expected/postgres_

Re: Added schema level support for publication.

2021-10-21 Thread Greg Nancarrow
On Fri, Oct 22, 2021 at 12:41 PM Greg Nancarrow  wrote:
>
> I was also previously concerned about what the behavior should be when
> only including just the partitions of a partitioned table in a
> publication using ALL TABLES IN SCHEMA and having
> publish_via_partition_root=true. It seems to implicitly include the
> partitioned table in the publication. But I did some testing and found
> that this is the current behavior when only the partitions are
> individually included in a publication using TABLE, so it seems to be
> OK.
>

Thinking some more about this, it may still be confusing to the
user if not explicitly stated in the ALL TABLES IN SCHEMA case.
How about adding some additional explanation to the end of the
following paragraph:

+ 
+  When a partitioned table is published via schema level publication, all
+  of its existing and future partitions irrespective of it being from the
+  publication schema or not are implicitly considered to be part of the
+  publication. So, even operations that are performed directly on a
+  partition are also published via publications that its ancestors are
+  part of.
+ 

Something like:

Similarly, if a partition is published via schema level publication
and publish_via_partition_root=true, the parent partitioned table is
implicitly considered to be part of the publication, irrespective of
it being from the publication schema or not.
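
For example (a sketch based on the syntax from the patch; the object names
are made up):

create schema sch_parent;
create schema sch_part;
create table sch_parent.tbl (a int) partition by range (a);
create table sch_part.tbl_p1 partition of sch_parent.tbl for values from (1) to (10);
create publication pub_part for all tables in schema sch_part
    with (publish_via_partition_root = true);
-- changes to sch_part.tbl_p1 would be published using the identity and
-- schema of sch_parent.tbl, even though sch_parent is not part of the
-- publication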


Regards,
Greg Nancarrow
Fujitsu Australia