Re: Asynchronous Append on postgres_fdw nodes.

2020-06-11 Thread Andrey V. Lepikhov

On 6/10/20 8:05 AM, Kyotaro Horiguchi wrote:

Hello, Andrey.

At Tue, 9 Jun 2020 14:20:42 +0500, Andrey Lepikhov  
wrote in

On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
2. Total cost of an Append node is a sum of the subplans. Maybe in the
case of asynchronous append we need to use some reduce factor?


Yes.  For the reason mentioned above, foreign subpaths don't affect
the startup cost of Append as far as any sync subpaths exist.  If no
sync subpaths exist, the Append's startup cost is the minimum startup
cost among the async subpaths.
I mean that you could possibly change the computation of the total cost of the 
async Append node. It may affect the planner's choice between a ForeignScan 
(followed by executing the JOIN locally) and partitionwise join 
strategies.


Have you also considered the possibility of a dynamic choice between 
synchronous and asynchronous Append (during optimization)? This may be useful 
for a query with a LIMIT clause.


--
Andrey Lepikhov
Postgres Professional




Re: Internal key management system

2020-06-11 Thread Fabien COELHO



Hello Bruce,

Sorry for the length (yet again) of this answer, I'm trying to make my 
point as clear as possible.



Obviously it requires some more thinking and design, but my point is that
postgres should not hold a KEK, ever, nor presume how DEK are to be managed
by a DMS, and that is not very difficult to achieve by putting it outside of
pg and defining how interactions take place. Providing a reference/example
implementation would be nice as well, and Masahiko-san code can be rewrapped
quite easily.


Well, the decrypted keys are already stored in backend memory,


The fact that if the pg process is compromised then the DEK and data 
encrypted are compromised is more or less unavoidable (maybe only the data 
could be compromised though, but not the DEK, depending on how the 
encryption/decryption operates).



so what risk does having the KEK in memory for a brief period avoid?


My understanding is that the KEK does not protect one key, but all keys, 
thus all data, possibly even past or future, so its loss impacts more than 
the here-and-now local process.


If the KEK is ever present in pg process, it presumes that the threat 
model being addressed allows its loss if the process is compromised, i.e. 
all (past, present, future) security properties are void once the process 
is compromised.


This may be okay for some use cases, but I can easily foresee that it 
would not be for all. I can think of use cases where the user/auditor says 
that the KEK should be elsewhere, and I would tend to agree.


So my point is that the implementation should allow it, i.e. define a 
simple interface, and possibly a reference implementation with good 
properties which might fit some/typical security requirements. The patch 
mostly fits that need, except for the possible isolation of the KEK.


ISTM that a reasonable point of comparison is the design of kerberos, 
with a central authority/server which authenticates parties and allows them 
to authenticate one another and communicate securely.


The design means that any compromised host/service would compromise all 
its interactions with other parties, but not communications between third 
parties. The compromise stays local, with the exception of the kerberos 
server itself, which somehow holds all the keys.


For me the KEK is basically the kerberos server: you should provide the means 
to allow the user to isolate it where they think it should be, and not 
enforce that it is within the postgres process.


Another point is that what I suggest does not seem very hard from an 
implementation point of view, and allows extensibility, which is also a 
win.


Lastly, I still think that all this, whatever the design, should be 
packaged as an extension, unless it is really impossible to do so. I would 
also try to separate the extension for KMS interaction from the extension 
for actually using the keys, so that the user could change the underlying 
cryptographic primitives as they see fit.


--
Fabien.




Re: Add support for INDEX_CLEANUP and TRUNCATE to vacuumdb

2020-06-11 Thread Michael Paquier
On Thu, Jun 11, 2020 at 12:41:17AM +, Bossart, Nathan wrote:
> I quickly put together a patch to add INDEX_CLEANUP and TRUNCATE to
> vacuumdb before noticing a previous thread for it [0].  My  take on it
> was to just name the options --skip-index-cleanup and --skip-truncate.
> While that does not give you a direct mapping to the corresponding
> VACUUM options, it simplifies the patch by avoiding the boolean
> parameter parsing stuff altogether.

Cannot blame you for that.  There is little sense in having a pure
mapping of the options here with some custom boolean parsing.  What
about naming them --no-index-cleanup and --no-truncate instead?  I
would suggest tracking the option values with variables named like
do_truncate and do_index_cleanup.  That would be similar to what we
do with --no-sync, for example.
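
To be clear about the shape I have in mind, here is a tiny standalone
sketch using plain getopt_long(), with the option values tracked in
do_index_cleanup/do_truncate booleans.  This is only an illustration of
the naming scheme, not the actual vacuumdb.c code:

#include <stdbool.h>
#include <stdio.h>
#include <getopt.h>

int
main(int argc, char *argv[])
{
	/* default is to let VACUUM do both actions, as it does today */
	bool		do_index_cleanup = true;
	bool		do_truncate = true;
	int			c;

	static const struct option long_options[] = {
		{"no-index-cleanup", no_argument, NULL, 1},
		{"no-truncate", no_argument, NULL, 2},
		{NULL, 0, NULL, 0}
	};

	while ((c = getopt_long(argc, argv, "", long_options, NULL)) != -1)
	{
		switch (c)
		{
			case 1:
				do_index_cleanup = false;
				break;
			case 2:
				do_truncate = false;
				break;
			default:
				fprintf(stderr, "unknown option\n");
				return 1;
		}
	}

	printf("INDEX_CLEANUP %s, TRUNCATE %s\n",
		   do_index_cleanup ? "true" : "false",
		   do_truncate ? "true" : "false");
	return 0;
}
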
--
Michael


signature.asc
Description: PGP signature


Re: Add tap test for --extra-float-digits option

2020-06-11 Thread Michael Paquier
On Thu, Jun 11, 2020 at 02:25:37PM +0900, Dong Wook Lee wrote:
> Hi hackers! I added TAP test code for the pg_dump --extra-float-digits option
> because it hadn't been tested.  There was no problem when writing the test code
> and running the TAP tests.

If we go down that route (there is a test for --compression), what about
--rows-per-insert?
--
Michael


signature.asc
Description: PGP signature


Re: proposal: possibility to read dumped table's name from file

2020-06-11 Thread Pavel Stehule
st 10. 6. 2020 v 0:30 odesílatel Justin Pryzby 
napsal:

> On Tue, Jun 09, 2020 at 11:46:24AM +0200, Pavel Stehule wrote:
> > po 8. 6. 2020 v 23:30 odesílatel Justin Pryzby 
> napsal:
>
> > > I still wonder if a better syntax would use a unified --filter option, whose
> > > argument would allow including/excluding any type of object:
> > I tried to implement simple format "[+-][tndf] objectname"
>
> Thanks.
>
> > + /* ignore empty rows */
> > + if (*line != '\0')
>
> Maybe: if line=='\0': continue
>

ok


> We should also support comments.
>
> > + bool include_filter = false;
> > + bool exclude_filter = false;
>
> I think we only need one bool.
> You could call it: bool is_exclude = false
>
>
ok

> > +
> > + if (chars < 4)
> > +     invalid_filter_format(optarg, line, lineno);
>
> I think that check is too lax.
> I think it's ok if we require the first char to be [-+] and the 2nd char
> to be
> [dntf]
>

> > + objecttype = line[1];
>
> ... but I think this is inadequately "liberal in what it accepts"; I think
> it
> should skip spaces.  In my proposed scheme, someone might reasonably write:
>
> > +
> > + objectname = &line[3];
> > +
> > + /* skip initial spaces */
> > + while (*objectname == ' ')
> > +     objectname++;
>
> I suggest to use isspace()
>

ok


> I think we should check that *objectname != '\0', rather than chars>=4,
> above.
>

done


> > + if (include_filter)
> > + {
> > +     simple_string_list_append(&table_include_patterns, objectname);
> > +     dopt.include_everything = false;
> > + }
> > + else if (exclude_filter)
> > +     simple_string_list_append(&table_exclude_patterns, objectname);
>
> If you use bool is_exclude, then this becomes "else" and you don't need to
> think about checking if (!include && !exclude).
>
> > + else if (objecttype == 'f')
> > + {
> > +     if (include_filter)
> > +         simple_string_list_append(&foreign_servers_include_patterns, objectname);
> > +     else if (exclude_filter)
> > +         invalid_filter_format(optarg, line, lineno);
> > + }
>
> I would handle invalid object types as "else: invalid_filter_format()"
> here,
> rather than duplicating above as: !=ALL('d','n','t','f')
>

good idea


> > +
> > + if (ferror(f))
> > +     fatal("could not read from file \"%s\": %s",
> > +           f == stdin ? "stdin" : optarg,
> > +           strerror(errno));
>
> I think we're allowed to use %m here ?
>

changed


> > + printf(_("  --filter=FILENAME            read object names from file\n"));
>
> Object name filter expression, or something..
>

yes, it is not object names now


> > + * getline is originaly GNU function, and should not be everywhere still.
> originally
>
> > + * Use own reduced implementation.
>
> Did you "reduce" this from another implementation?  Where?
> What is its license ?
>
> Maybe a line-reader already exists in the frontend (?) .. or maybe it
> should.
>

Everywhere else the function fgets is used. Currently pg_getline is used in
just one place, so I think moving it to some common part would be premature.

Maybe it can be used as a replacement for some fgets calls, but that is a
different topic, I think.
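
For what it's worth, here is a tiny standalone sketch of the per-line rules
discussed above (a single is_exclude flag, a [dntf] type letter, isspace()
skipping).  The helper name, its error handling and the sample object name
are made up for illustration; this is not the patch code:

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parse one "[+-][dntf] objectname" line; return false on bad format. */
static bool
parse_filter_line(const char *line, bool *is_exclude, char *objecttype,
				  const char **objectname)
{
	if (line[0] != '+' && line[0] != '-')
		return false;
	*is_exclude = (line[0] == '-');

	if (line[1] == '\0' || strchr("dntf", line[1]) == NULL)
		return false;
	*objecttype = line[1];

	/* skip whitespace before the object name */
	line += 2;
	while (isspace((unsigned char) *line))
		line++;
	if (*line == '\0')
		return false;

	*objectname = line;
	return true;
}

int
main(void)
{
	bool		is_exclude;
	char		objecttype;
	const char *name;

	if (parse_filter_line("+t public.measurement", &is_exclude, &objecttype, &name))
		printf("%s %c \"%s\"\n", is_exclude ? "exclude" : "include",
			   objecttype, name);
	return 0;
}
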

Thank you for the comments; an updated patch is attached.

Regards

Pavel


> --
> Justin
>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 2f0807e912..332f0c312f 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -813,6 +813,15 @@ PostgreSQL documentation
   
  
 
+ 
+  --filter=filename
+  
+   
+Read filters from file. Format "(+|-)(tnfd) objectname:
+   
+  
+ 
+
  
   --load-via-partition-root
   
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index dfe43968b8..9e76189635 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -290,6 +290,8 @@ static voi

Re: Index Skip Scan (new UniqueKeys)

2020-06-11 Thread Andy Fan
On Tue, Jun 9, 2020 at 6:20 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> Hi,
>
> Here is a new version of index skip scan patch, based on v8 patch for
> UniqueKeys implementation from [1]. I want to start a new thread to
> simplify navigation, hopefully I didn't forget anyone who actively
> participated in the discussion.
>
> To simplify reviewing I've split it into several parts:
>
> * First two are taken from [1] just for the reference and to make cfbot
> happy.
>
> * Extend-UniqueKeys consists changes that needs to be done for
>   UniqueKeys to be used in skip scan. Essentially this is a reduced
>   version of previous implementation from Jesper & David, based on the
>   new UniqueKeys infrastructure, as it does the very same thing.
>
> * Index-Skip-Scan contains not am specific code and the overall
>   infrastructure, including configuration elements and so on.
>
> * Btree-implementation contains btree specific code to implement amskip,
>   introduced in the previous patch.
>
> * The last one contains just documentation bits.
>
> Interesting enough with a new UniqueKey implementation skipping is
> applied in some tests unexpectedly. For those I've disabled
> indexskipscan to avoid confusion.
>
> [1]:
> https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com
>

Thanks for the patch.

I just got a rough idea of the patch; it looks like we have to narrow down
the use cases where we can use this method.  Consider the below example:

create table j1(i int, im5 int,  im100 int, im1000 int);
insert into j1 select i, i%5, i%100, i%1000 from generate_series(1,
1000)i;
create index on j1(im5, i);
insert into j1 values (1, 1, 0, 0);
analyze j1;

demo=# select distinct * from j1 where i < 2;
 i | im5 | im100 | im1000
---+-----+-------+--------
 1 |   1 |     1 |      1
(1 row)

demo=# set enable_indexskipscan to off;
SET
demo=# select distinct * from j1 where i < 2;
 i | im5 | im100 | im1000
---+-----+-------+--------
 1 |   1 |     0 |      0
 1 |   1 |     1 |      1
(2 rows)

drop index j1_im5_i_idx;

create index on j1(im5, i, im100);
demo=# select distinct im5, i, im100 from j1 where i < 2;
 im5 | i | im100
-----+---+-------
   1 | 1 |     0
   1 | 1 |     1
(2 rows)
demo=# set enable_indexskipscan to on;
SET
demo=# select distinct im5, i, im100 from j1 where i < 2;
 im5 | i | im100
-----+---+-------
   1 | 1 |     0
(1 row)


-- 
Best Regards
Andy Fan


Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-11 Thread Masahiko Sawada
On Thu, 11 Jun 2020 at 12:30, Amit Kapila  wrote:
>
> On Thu, Jun 11, 2020 at 12:35 AM Magnus Hagander  wrote:
> >
> > On Wed, Jun 10, 2020 at 9:01 AM Masahiko Sawada 
> >  wrote:
> >>
> >
> >> If we move these values to replication slots, I think we can change
> >> the code so that these statistics are managed by replication slots
> >> (e.g. ReplicationSlot struct). Once ReplicationSlot has these values,
> >> we can keep them beyond reconnections or multiple calls of SQL
> >> interface functions. Of course, these values don’t need to be
> >> persisted.
> >
> >
> > Eh, why should they not be persisted?
> >
>
> Because these stats are corresponding to a ReoderBuffer (logical
> decoding) which will be allocated fresh after restart and have no
> relation with what has happened before restart.

I thought the same. But I now think there is no difference between
reconnecting replication and a server restart in terms of the logical
decoding context. Even if we persist these values, we might want to
reset them after crash recovery, like the stats collector does.

>
> Now, thinking about this again, I am not sure if these stats are
> directly related to slots. These are stats for logical decoding which
> can be performed either via WALSender or decoding plugin (via APIs).
> So, why not have them displayed in a new view like pg_stat_logical (or
> pg_stat_logical_decoding/pg_stat_logical_replication)?   In future, we
> will need to add similar stats for streaming of in-progress
> transactions as well (see patch 0007-Track-statistics-for-streaming at
> [1]), so having a separate view for these doesn't sound illogical.
>

I think we need to decide how long we want to remain these statistics
values. That is, if we were to have such pg_stat_logical view, these
values would remain until logical decoding finished since I think the
view would display only running logical decoding. OTOH, if we were to
correspond these stats to slots, these values would remain beyond
multiple logical decoding SQL API calls.

I think one of the main use-cases of these statistics is the tuning of
logical_decoding_work_mem. This seems similar to a combination of
pg_stat_database.temp_files/temp_bytes and work_mem. From this
perspective, I guess it’s useful for users if these values remain
until either the slots are removed or the server crashes. Given that the
kinds of logical decoding statistics might grow, having a separate view
dedicated to replication slots makes sense to me.

For updating these statistics, if we associate them with
logical decoding or replication slots, we can change the strategy of
updating statistics so that it doesn’t depend on the logical decoding
plugin implementation. If updating statistics doesn’t affect
performance much, it’s better to track the statistics regardless of
plugins.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Internal key management system

2020-06-11 Thread Masahiko Sawada
On Thu, 11 Jun 2020 at 16:07, Fabien COELHO  wrote:
>
>
> Hello Bruce,
>
> Sorry for the length (yet again) of this answer, I'm trying to make my
> point as clear as possible.

Thank you for your explanation!

>
> >> Obviously it requires some more thinking and design, but my point is that
> >> postgres should not hold a KEK, ever, nor presume how DEK are to be managed
> >> by a DMS, and that is not very difficult to achieve by putting it outside 
> >> of
> >> pg and defining how interactions take place. Providing a reference/example
> >> implementation would be nice as well, and Masahiko-san code can be 
> >> rewrapped
> >> quite easily.
> >
> > Well, the decrypted keys are already stored in backend memory,
>
> The fact that if the pg process is compromised then the DEK and data
> encrypted are compromised is more or less unavoidable (maybe only the data
> could be compromised though, but not the DEK, depending on how the
> encryption/decryption operates).
>
> > so what risk does having the KEK in memory for a brief period avoid?
>
> My understanding is that the KEK does not protect one key, but all keys,
> thus all data, possibly even past or future, so its loss impacts more than
> the here-and-now local process.
>
> If the KEK is ever present in pg process, it presumes that the threat
> model being addressed allows its loss if the process is compromised, i.e.
> all (past, present, future) security properties are void once the process
> is compromised.

Why should we not put the KEK in a pg process when it's okay for other
processes? I guess you're talking about a threat where a malicious user
is logged in to the OS (or at least has access), but I thought there is no
difference between the pg process and other processes in terms of the
process being compromised. So the solution, in that case, would be to
outsource encryption/decryption to external servers as Bruce
mentioned.

Regards,


--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add support for INDEX_CLEANUP and TRUNCATE to vacuumdb

2020-06-11 Thread Masahiko Sawada
On Thu, 11 Jun 2020 at 09:41, Bossart, Nathan  wrote:
>
> Hi hackers,
>
> I quickly put together a patch to add INDEX_CLEANUP and TRUNCATE to
> vacuumdb before noticing a previous thread for it [0].  My  take on it
> was to just name the options --skip-index-cleanup and --skip-truncate.
> While that does not give you a direct mapping to the corresponding
> VACUUM options, it simplifies the patch by avoiding the boolean
> parameter parsing stuff altogether.
>

Thank you for updating the patch!

I looked at this patch.

@@ -412,6 +434,13 @@ vacuum_one_database(const char *dbname,
vacuumingOptions *vacopts,
exit(1);
}

+   if (vacopts->skip_index_cleanup && PQserverVersion(conn) < 12)
+   {
+   PQfinish(conn);
+   pg_log_error("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+                "skip-index-cleanup", "12");
+   }
+
if (vacopts->skip_locked && PQserverVersion(conn) < 12)
{
PQfinish(conn);

exit(1) is missing after pg_log_error().

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




problem with RETURNING and update row movement

2020-06-11 Thread Amit Langote
Hi,

While working on [1], I came across a bug.

Reproduction steps:

create table foo (a int, b int) partition by list (a);
create table foo1 (c int, b int, a int);
alter table foo1 drop c;
alter table foo attach partition foo1 for values in (1);
create table foo2 partition of foo for values in (2);
create table foo3 partition of foo for values in (3);
create or replace function trigfunc () returns trigger language
plpgsql as $$ begin new.b := 2; return new; end; $$;
create trigger trig before insert on foo2 for each row execute
function trigfunc();
insert into foo values (1, 1), (2, 2), (3, 3);
update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x returning *;
ERROR:  attribute 5 of type record has wrong type
DETAIL:  Table has type record, but query expects integer.

The error occurs when projecting the RETURNING list.  The problem is
that the projection being used when the error occurs belongs to result
relation foo2 which is the destination partition of a row movement
operation, but it's trying to access a column in the tuple produced by
the plan belonging to foo1, the source partition of the row movement.
 foo2's RETURNING projection can only work correctly when it's being
fed tuples from the plan belonging to foo2.

Note that the targetlists of the plans belonging to different result
relations can be different depending on the respective relation's
tuple descriptors, so are not interchangeable.  Also, each result
relation's RETURNING list is made to refer to its own plan's output.
Without row movement, there is only one result relation to consider,
so there's no confusion regarding which RETURNING list to compute.
With row movement however, while there is only one plan tuple, there
are two result relations to consider each with its own RETURNING list.
I think we should be computing the *source* relation's RETURNING list,
because only that one of the two can consume the plan tuple correctly.
Attached is a patch that fixes things to be that way.

By the way, the problem exists since PG 11 when UPDATE row movement
feature went in.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] 
https://www.postgresql.org/message-id/CA%2BHiwqH-2sq-3Zq-CtuWjfRSyrGPXJBf1nCKKvTHuGVyfQ1OYA%40mail.gmail.com


v1-0001-Fix-a-bug-with-RETURNING-when-UPDATE-moves-tuple.patch
Description: Binary data


Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-11 Thread Amit Kapila
On Thu, Jun 11, 2020 at 1:46 PM Masahiko Sawada
 wrote:
>
> On Thu, 11 Jun 2020 at 12:30, Amit Kapila  wrote:
> >
> >
> > Now, thinking about this again, I am not sure if these stats are
> > directly related to slots. These are stats for logical decoding which
> > can be performed either via WALSender or decoding plugin (via APIs).
> > So, why not have them displayed in a new view like pg_stat_logical (or
> > pg_stat_logical_decoding/pg_stat_logical_replication)?   In future, we
> > will need to add similar stats for streaming of in-progress
> > transactions as well (see patch 0007-Track-statistics-for-streaming at
> > [1]), so having a separate view for these doesn't sound illogical.
> >
>
> I think we need to decide how long we want to remain these statistics
> values. That is, if we were to have such pg_stat_logical view, these
> values would remain until logical decoding finished since I think the
> view would display only running logical decoding. OTOH, if we were to
> correspond these stats to slots, these values would remain beyond
> multiple logical decoding SQL API calls.
>

I thought of having these till the process that performs these
operations exist.  So for WALSender, the stats will be valid till it
is not restarted due to some reason or when performed via backend, the
stats will be valid till the corresponding backend exits.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-11 Thread Masahiko Sawada
On Thu, 11 Jun 2020 at 18:11, Amit Kapila  wrote:
>
> On Thu, Jun 11, 2020 at 1:46 PM Masahiko Sawada
>  wrote:
> >
> > On Thu, 11 Jun 2020 at 12:30, Amit Kapila  wrote:
> > >
> > >
> > > Now, thinking about this again, I am not sure if these stats are
> > > directly related to slots. These are stats for logical decoding which
> > > can be performed either via WALSender or decoding plugin (via APIs).
> > > So, why not have them displayed in a new view like pg_stat_logical (or
> > > pg_stat_logical_decoding/pg_stat_logical_replication)?   In future, we
> > > will need to add similar stats for streaming of in-progress
> > > transactions as well (see patch 0007-Track-statistics-for-streaming at
> > > [1]), so having a separate view for these doesn't sound illogical.
> > >
> >
> > I think we need to decide how long we want to remain these statistics
> > values. That is, if we were to have such pg_stat_logical view, these
> > values would remain until logical decoding finished since I think the
> > view would display only running logical decoding. OTOH, if we were to
> > correspond these stats to slots, these values would remain beyond
> > multiple logical decoding SQL API calls.
> >
>
> I thought of having these till the process that performs these
> operations exist.  So for WALSender, the stats will be valid till it
> is not restarted due to some reason or when performed via backend, the
> stats will be valid till the corresponding backend exits.
>

The number of rows of that view could be up to (max_backends +
max_wal_senders). Is that right? What if different backends used the
same replication slot one after the other?

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: typos in comments referring to macros

2020-06-11 Thread Amit Kapila
On Wed, Jun 10, 2020 at 3:19 PM Amit Kapila  wrote:
>
> On Wed, Jun 10, 2020 at 2:47 PM John Naylor  
> wrote:
> >
> > It should be BLCKSZ and LOBLKSIZE, as in the attached.
> >
>
> LGTM on first look. I'll push either later today or tomorrow.
>

Pushed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-11 Thread Amit Kapila
On Thu, Jun 11, 2020 at 3:07 PM Masahiko Sawada
 wrote:
>
> On Thu, 11 Jun 2020 at 18:11, Amit Kapila  wrote:
> >
> > On Thu, Jun 11, 2020 at 1:46 PM Masahiko Sawada
> >  wrote:
> > >
> > > On Thu, 11 Jun 2020 at 12:30, Amit Kapila  wrote:
> > > >
> > > >
> > > > Now, thinking about this again, I am not sure if these stats are
> > > > directly related to slots. These are stats for logical decoding which
> > > > can be performed either via WALSender or decoding plugin (via APIs).
> > > > So, why not have them displayed in a new view like pg_stat_logical (or
> > > > pg_stat_logical_decoding/pg_stat_logical_replication)?   In future, we
> > > > will need to add similar stats for streaming of in-progress
> > > > transactions as well (see patch 0007-Track-statistics-for-streaming at
> > > > [1]), so having a separate view for these doesn't sound illogical.
> > > >
> > >
> > > I think we need to decide how long we want to remain these statistics
> > > values. That is, if we were to have such pg_stat_logical view, these
> > > values would remain until logical decoding finished since I think the
> > > view would display only running logical decoding. OTOH, if we were to
> > > correspond these stats to slots, these values would remain beyond
> > > multiple logical decoding SQL API calls.
> > >
> >
> > I thought of having these till the process that performs these
> > operations exist.  So for WALSender, the stats will be valid till it
> > is not restarted due to some reason or when performed via backend, the
> > stats will be valid till the corresponding backend exits.
> >
>
> The number of rows of that view could be up to (max_backends +
> max_wal_senders). Is that right? What if different backends used the
> same replication slot one after the other?
>

Yeah, it would be tricky if multiple slots are used by the same
backend.  We could probably track the number of times decoding has
happened by the session that will probably help us in averaging the
spill amount.  If we think that the aim is to help users to tune
logical_decoding_work_mem to avoid frequent spilling or streaming then
how would maintaining at slot level will help?  As you said previously
we could track it only for running logical decoding context but if we
do that then data will be temporary and the user needs to constantly
monitor the same to make sense of it but maybe that is fine.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Internal key management system

2020-06-11 Thread Fabien COELHO



Hello Masahiko-san,


If the KEK is ever present in pg process, it presumes that the threat
model being addressed allows its loss if the process is compromised, i.e.
all (past, present, future) security properties are void once the process
is compromised.


Why should we not put the KEK in a pg process when it's okay for other
processes?


My point is "elsewhere".

Indeed, it could be on another process on the same host, in which case I'd 
rather have the process run under a different uid, which means another 
compromission would be required if pg is compromissed locally ; it could 
also be in a process on another host ; it could be on some special 
hardware. Your imagination is the limit.


I guess you're talking about a threat where a malicious user is logged in to 
the OS (or at least has access), but I thought there is no difference between 
the pg process and other processes in terms of the process being compromised.


Processes are isolated based on uid, unless root is compromised. Once an id 
is compromised (e.g. "postgres"), the attacker basically has access to all 
files and processes accessible to that id.


So the solution, in that case, would be to outsource 
encryption/decryption to external servers as Bruce mentioned.


Hosting stuff (keys, encryption) on another server is indeed an option if 
"elsewhere" is implemented.



From a design point of view:


 0. KEK, DEK & crypto are managed by pg

 1. DEK & crypto are managed by pg,
but KEK is outside pg.

 2. everything is managed out of pg.

I think that both 1 & 2 are valid options, which do not require the same 
interface. If you have 1, you can do 0 by giving KEK to a pg process.


How DEKs are identified and created with the KEK should also be something 
left open to the implementation; the interface should not need to know.
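
To make the "interface" idea a little more concrete, here is a deliberately
tiny sketch: a provider structure that pg would call to obtain a DEK by
identifier, with the KEK (if any) hidden behind it.  All names and
signatures here are hypothetical, just to illustrate options 1/2 above:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical provider interface: where the KEK lives is its business. */
typedef struct DekProvider
{
	/* fetch (or derive) the DEK identified by dek_id into buf */
	bool		(*get_dek) (const char *dek_id, unsigned char *buf, size_t len);
} DekProvider;

/*
 * Dummy local provider, standing in for "elsewhere": another process,
 * another host, some special hardware...
 */
static bool
dummy_get_dek(const char *dek_id, unsigned char *buf, size_t len)
{
	/* a real provider would unwrap the DEK with a KEK it never discloses */
	(void) dek_id;
	memset(buf, 0x2a, len);
	return true;
}

int
main(void)
{
	DekProvider provider = {dummy_get_dek};
	unsigned char dek[32];

	if (provider.get_dek("wal-encryption-key", dek, sizeof(dek)))
		printf("got a %zu-byte DEK (dummy)\n", sizeof(dek));
	return 0;
}
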


--
Fabien.




Re: Recording test runtimes with the buildfarm

2020-06-11 Thread Joe Conway
On 6/10/20 6:00 PM, Andres Freund wrote:
> On June 10, 2020 2:13:51 PM PDT, David Rowley  wrote:
>>On Thu, 11 Jun 2020 at 02:13, Tom Lane  wrote:
>>> I have in the past scraped the latter results and tried to make sense
>>of
>>> them.  They are *mighty* noisy, even when considering just one animal
>>> that I know to be running on a machine with little else to do.
>>
>>Do you recall if you looked at the parallel results or the serially
>>executed ones?
>>
>>I imagine that the parallel ones will have much more noise since we
>>run the tests up to 20 at a time. I imagine probably none, or at most
>>not many of the animals have enough CPU cores not to be context
>>switching a lot during those the parallel runs.  I thought the serial
>>ones would be better but didn't have an idea of they'd be good enough
>>to be useful.
> 
> I'd assume that a rolling average (maybe 10 runs or so) would hide noise 
> enough to see at least some trends even for parallel runs.
> 
> We should be able to prototype this with a few queries over the bf database, 
> right?


This seems to me like a perfect use case for control charts:

  https://en.wikipedia.org/wiki/Control_chart

They are designed specifically to detect systematic changes in an environment
with random noise.
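
As a toy illustration of the idea (made-up runtime numbers, not tied to any
buildfarm schema): keep a rolling window of recent runtimes and flag any run
that exceeds the 3-sigma upper control limit of that window:

#include <math.h>
#include <stdio.h>

#define WINDOW 10

int
main(void)
{
	/* fabricated runtimes in ms; a step change starts at the 13th value */
	double		runtimes[] = {510, 495, 523, 507, 515, 498, 520, 511, 503, 517,
							  509, 514, 620, 635, 628, 641, 633, 629};
	int			n = sizeof(runtimes) / sizeof(runtimes[0]);

	for (int i = WINDOW; i < n; i++)
	{
		double		sum = 0.0, sumsq = 0.0;

		/* mean and standard deviation over the previous WINDOW runs only */
		for (int j = i - WINDOW; j < i; j++)
		{
			sum += runtimes[j];
			sumsq += runtimes[j] * runtimes[j];
		}

		double		mean = sum / WINDOW;
		double		sd = sqrt(sumsq / WINDOW - mean * mean);
		double		ucl = mean + 3.0 * sd;	/* upper control limit */

		if (runtimes[i] > ucl)
			printf("run %d: %.0f ms exceeds UCL %.1f (window mean %.1f)\n",
				   i, runtimes[i], ucl, mean);
	}
	return 0;
}
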

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: Parallel Seq Scan vs kernel read ahead

2020-06-11 Thread Amit Kapila
On Thu, Jun 11, 2020 at 10:13 AM David Rowley  wrote:
>
> On Thu, 11 Jun 2020 at 16:03, Amit Kapila  wrote:
> > I think something on these lines would be a good idea especially
> > keeping step-size proportional to relation size.  However, I am not
> > completely sure if doubling the step-size with equal increase in
> > relation size (ex. what is happening between 16MB~8192MB) is the best
> > idea.  Why not double the step-size when relation size increases by
> > four times?  Will some more tests help us to identify this?  I also
> > don't know what is the right answer here so just trying to brainstorm.
>
> Brainstorming sounds good. I'm by no means under any illusion that the
> formula is correct.
>
> But, why four times?
>

Just trying to see if we can optimize such that we use a bigger
step-size for bigger relations and a smaller step-size for smaller
relations.

>  The way I did it tries to keep the number of
> chunks roughly the same each time. I think the key is the number of
> chunks more than the size of the chunks. Having fewer chunks increases
> the chances of an imbalance of work between workers, and with what you
> mention, the number of chunks will vary more than what I have proposed
>

But I think it will lead to a larger number of chunks for smaller relations.

> The code I showed above will produce something between 512-1024 chunks
> for all cases until we 2^20 pages, then we start capping the chunk
> size to 1024.  I could probably get onboard with making it depend on
> the number of parallel workers, but perhaps it would be better just to
> divide by, say, 16384 rather than 1024, as I proposed above. That way
> we'll be more fine-grained, but we'll still read in larger than 1024
> chunk sizes when the relation gets beyond 128GB.
>

I think increasing step-size might be okay for very large relations.

Another point I am thinking about is that whatever formula we come up with
here might not be a good fit for every case.  For example, as you mentioned
above, a larger step-size can impact performance depending on the
qualification; similarly, there could be other things like having a
target list or qual with some function which takes more time for
certain tuples and less for others, especially if function evaluation
is based on some column values.  So, can we think of providing a
reloption for step-size?
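
Just so we have something concrete to poke at, below is a toy standalone
version of one way to read the numbers mentioned above (a power-of-two
chunk size of roughly rel_pages / 1024, capped at 1024 blocks).  The real
patch may well compute this differently; this is only to help the
brainstorming:

#include <stdio.h>
#include <stdint.h>

/* round up to the next power of two (toy helper, not pg_nextpower2_32) */
static uint32_t
next_pow2(uint32_t v)
{
	uint32_t	r = 1;

	while (r < v)
		r <<= 1;
	return r;
}

/* chunk size ~ rel_pages / 1024, rounded up to a power of two, capped at 1024 */
static uint32_t
chunk_size(uint32_t rel_pages)
{
	uint32_t	size = next_pow2((rel_pages + 1023) / 1024);

	return (size > 1024) ? 1024 : size;
}

int
main(void)
{
	uint32_t	sizes[] = {2048, 16384, 131072, 1048576, 16777216};	/* pages */

	for (int i = 0; i < 5; i++)
		printf("%10u pages -> chunk %4u blocks, ~%u chunks\n",
			   sizes[i], chunk_size(sizes[i]),
			   sizes[i] / chunk_size(sizes[i]));
	return 0;
}
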
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




jsonpath versus NaN

2020-06-11 Thread Tom Lane
Commit 72b646033 inserted this into convertJsonbScalar:

break;
 
case jbvNumeric:
+   /* replace numeric NaN with string "NaN" */
+   if (numeric_is_nan(scalarVal->val.numeric))
+   {
+   appendToBuffer(buffer, "NaN", 3);
+   *jentry = 3;
+   break;
+   }
+
numlen = VARSIZE_ANY(scalarVal->val.numeric);
padlen = padBufferToInt(buffer);

To characterize this as hack, slash, and burn programming would be
charitable.  It is entirely clear from the code, the documentation,
and the relevant RFCs that JSONB does not allow NaNs as numeric
values.  So it should be impossible for this code to do anything.

I tried taking it out, and found that this test case from the same
commit fails:

+select jsonb_path_query('"nan"', '$.double()');
+ jsonb_path_query 
+--
+ "NaN"
+(1 row)

However, seeing that the JSON RFC disallows NaNs, I do not understand
why it's important to accept this.  The adjacent test case showing
that 'inf' isn't accepted:

+select jsonb_path_query('"inf"', '$.double()');
+ERROR:  non-numeric SQL/JSON item
+DETAIL:  jsonpath item method .double() can only be applied to a numeric value

seems like a saner approach.

In short, I think we should rip out the above code snippet and adjust
executeItemOptUnwrapTarget, at about line jsonpath_exec.c:1076 as of HEAD,
to reject NaNs the same way it already rejects infinities.  Can you
explain why it was done like this?

(The reason I came across this was that I'm working on extending
type numeric to allow infinities, and it was not clear what to
do here.  But allowing a jsonb to contain a numeric NaN, even
transiently, seems like a completely horrid idea.)

regards, tom lane




[PATCH] fix two shadow vars (src/backend/commands/sequence.c)

2020-06-11 Thread Ranier Vilela
Hi,
src/backend/commands/sequence.c has two shadowed variables (buf), with two
unnecessary extra declarations.

For readability, the variable names declared in the prototypes were also
corrected.
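
For anyone unfamiliar with the warning, here is a generic standalone
illustration of the pattern (not the actual sequence.c code): the inner buf
hides the outer one, and the fix is to drop or rename the inner declaration.

#include <stdio.h>

int
main(void)
{
	char		buf[64];

	snprintf(buf, sizeof(buf), "outer");

	{
		/* -Wshadow: this declaration hides the outer "buf" */
		char		buf[64];

		snprintf(buf, sizeof(buf), "inner");
		printf("%s\n", buf);	/* uses the inner, shadowing variable */
	}

	printf("%s\n", buf);		/* the outer buf was never reused */
	return 0;
}
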

regards,
Ranier Vilela


fix_shadows_buf_var.patch
Description: Binary data


Windows regress fails (latest HEAD)

2020-06-11 Thread Ranier Vilela
Hi,
Latest HEAD, fails with windows regress tests.

 float8   ... FAILED  517 ms
 partition_prune  ... FAILED 3085 ms

regards,
Ranier VIlela


regression.diffs
Description: Binary data


regression.out
Description: Binary data


Re: Windows regress fails (latest HEAD)

2020-06-11 Thread Andrew Dunstan


On 6/11/20 8:52 AM, Ranier Vilela wrote:
> Hi,
> Latest HEAD, fails with windows regress tests.
>
>  float8                       ... FAILED      517 ms
>  partition_prune              ... FAILED     3085 ms
>
>

Ranier,


The first thing you should do when you find this is to see if there is a
buildfarm report of the failure. If there isn't then try to work out why.

Also, when making a report like this, it is essential to let us know
things like:

  * which commit is causing the failure (git bisect is good for finding
this)
  * what Windows version you're testing on
  * which compiler you're using
  * which configuration settings you're using
  * what are the regression differences you get in the failing tests

Without that we can't do anything with your report because it just
doesn't have enough information.


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-11 Thread Amit Kapila
On Fri, Jun 5, 2020 at 3:16 PM Masahiko Sawada
 wrote:
>
> On Thu, 4 Jun 2020 at 12:46, Amit Kapila  wrote:
> >
> >
> > +
> > + This parameter can be changed at any time; the behavior for any 
> > one
> > + transaction is determined by the setting in effect when it 
> > commits.
> > +
> >
> > This is written w.r.t foreign_twophase_commit.  If one changes this
> > between prepare and commit, will it have any impact?
>
> Since the distributed transaction commit automatically uses 2pc when
> executing COMMIT, it's not possible to change foreign_twophase_commit
> between prepare and commit. So I'd like to explain the case where a
> user executes PREPARE and then COMMIT PREPARED while changing
> foreign_twophase_commit.
>
> PREPARE can run only when foreign_twophase_commit is 'required' (or
> 'prefer') and all foreign servers involved with the transaction
> support 2pc. We prepare all foreign transactions no matter what the
> number of servers and modified or not. If either
> foreign_twophase_commit is 'disabled' or the transaction modifies data
> on a foreign server that doesn't support 2pc, it raises an error. At
> COMMIT (or ROLLBACK) PREPARED, similarly foreign_twophase_commit needs
> to be set to 'required'. It raises an error if the distributed
> transaction has a foreign transaction and foreign_twophase_commit is
> 'disabled'.
>

So, IIUC, it will raise an error if foreign_twophase_commit is
'disabled' (or one of the foreign servers involved doesn't support 2PC),
and the error can be raised both when the user issues PREPARE and COMMIT
(or ROLLBACK) PREPARED.  If so, isn't it strange that we raise such an
error after PREPARE?  What kind of use-case requires this?

>
> >
> > 4.
> > +  in_doubt
> > +  boolean
> > +  
> > +  
> > +   If true this foreign transaction is
> > in-doubt status and
> > +   needs to be resolved by calling 
> > pg_resolve_fdwxact
> > +   function.
> > +  
> >
> > It would be better if you can add an additional sentence to say when
> > and or how can foreign transactions reach in-doubt state.
> >

+   If true this foreign transaction is in-doubt status.
+   A foreign transaction becomes in-doubt status when user canceled the
+   query during transaction commit or the server crashed during transaction
+   commit.

Can we reword the second sentence as: "A foreign transaction can have
this status when the user has cancelled the statement or the server
crashes during transaction commit."?   I have another question about
this field: why can't it be one of the statuses ('preparing',
'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
separate field?  Also, isn't it more suitable to name the 'status' field
'state', because these appear to be more like different states of a
transaction?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Windows regress fails (latest HEAD)

2020-06-11 Thread Ranier Vilela
Em qui., 11 de jun. de 2020 às 10:01, Andrew Dunstan <
andrew.duns...@2ndquadrant.com> escreveu:

>
> On 6/11/20 8:52 AM, Ranier Vilela wrote:
> > Hi,
> > Latest HEAD, fails with windows regress tests.
> >
> >  float8   ... FAILED  517 ms
> >  partition_prune  ... FAILED 3085 ms
> >
> >
>
> The first thing you should do when you find this is to see if there is a
> buildfarm report of the failure. If there isn't then try to work out why.
>
Sorry, I will have to research the buildfarm, I have no reference to it.


>
> Also, when making a report like this, it is essential to let us know
> things like:
>
>   * which commit is causing the failure (git bisect is good for finding
> this)
>
Thanks for the hint (git bisect).


>   * what Windows version you're testing on
>
Windows 10 (2004)

  * which compiler you're using
>
msvc 2019 (64 bits)

  * which configuration settings you're using
>
None. Only build with default configuration (release is default).

  * what are the regression differences you get in the failing tests
>
It was sent as an attachment.

regards,
Ranier Vilela


Re: Windows regress fails (latest HEAD)

2020-06-11 Thread Andrew Dunstan


On 6/11/20 9:28 AM, Ranier Vilela wrote:
> Em qui., 11 de jun. de 2020 às 10:01, Andrew Dunstan
> <andrew.duns...@2ndquadrant.com> escreveu:
>
>
> On 6/11/20 8:52 AM, Ranier Vilela wrote:
> > Hi,
> > Latest HEAD, fails with windows regress tests.
> >
> >  float8                       ... FAILED      517 ms
> >  partition_prune              ... FAILED     3085 ms
> >
> >
>
> The first thing you should do when you find this is to see if
> there is a
> buildfarm report of the failure. If there isn't then try to work
> out why.
>
> Sorry, I will have to research the buildfarm, I have no reference to it.
>  


See https://buildfarm.postgresql.org


>
>   * which configuration settings you're using
>
> None. Only build with default configuration (release is default).


We would also need to see the contents of your config.pl


>
>   * what are the regression differences you get in the failing tests
>
> It was sent as an attachment.
>
>

If you send attachments please refer to them in the body of your email,
otherwise they are easily missed, as I did.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Windows regress fails (latest HEAD)

2020-06-11 Thread Ranier Vilela
Em qui., 11 de jun. de 2020 às 10:36, Andrew Dunstan <
andrew.duns...@2ndquadrant.com> escreveu:

>
> On 6/11/20 9:28 AM, Ranier Vilela wrote:
> > Em qui., 11 de jun. de 2020 às 10:01, Andrew Dunstan
> > <andrew.duns...@2ndquadrant.com> escreveu:
> >
> >
> > On 6/11/20 8:52 AM, Ranier Vilela wrote:
> > > Hi,
> > > Latest HEAD, fails with windows regress tests.
> > >
> > >  float8   ... FAILED  517 ms
> > >  partition_prune  ... FAILED 3085 ms
> > >
> > >
> >
> > The first thing you should do when you find this is to see if
> > there is a
> > buildfarm report of the failure. If there isn't then try to work
> > out why.
> >
> > Sorry, I will have to research the buildfarm, I have no reference to it.
> >
>
>
> See https://buildfarm.postgresql.org

Ok.


>
>
> >
> >   * which configuration settings you're using
> >
> > None. Only build with default configuration (release is default).
>
>
> We would also need to see the contents of your config.pl

It seems that I am missing something, there is no config.pl, anywhere in
the postgres directory.


>
>
>
> >
> >   * what are the regression differences you get in the failing tests
> >
> > It was sent as an attachment.
> >
> >
>
> If you send attachments please refer to them in the body of your email,
> otherwise they are easily missed, as I did.
>
Ah yes, ok, I'll remember that.

regards,
Ranier Vilela


Re: remove some STATUS_* symbols

2020-06-11 Thread Peter Eisentraut

On 2020-01-16 13:56, Robert Haas wrote:

On Thu, Jan 16, 2020 at 12:50 AM Michael Paquier  wrote:

Thanks, that looks fine.  I am still not sure whether the second patch
adding an enum via ProcWaitStatus improves the code readability
though, so my take would be to discard it for now.  Perhaps others
think differently, I don't know.


IMHO, custom enums for each particular case would be a big improvement
over supposedly-generic STATUS codes. It makes it clearer which values
are possible in each code path, and it comes out nicer in the
debugger, too.


Given this feedback, I would like to re-propose the original patch, 
attached again here.


After this, the use of the remaining STATUS_* symbols will be contained 
to the frontend and backend libpq code, so it'll be more coherent.
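
For reference, the three values used in the hunks below correspond to an enum 
along these lines (reproduced here for readability; the full definition is in 
the proc.h hunk not visible in this excerpt):

/* possible values for PGPROC.waitStatus */
typedef enum ProcWaitStatus
{
	PROC_WAIT_STATUS_OK,
	PROC_WAIT_STATUS_WAITING,
	PROC_WAIT_STATUS_ERROR
} ProcWaitStatus;
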


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 477515ea4b112e860587640f255cbfcdfee1755c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 29 Dec 2019 09:09:20 +0100
Subject: [PATCH v2 1/4] Remove STATUS_WAITING

Add a separate enum for use in the locking APIs, which were the only
user.

Discussion: 
https://www.postgresql.org/message-id/flat/a6f91ead-0ce4-2a34-062b-7ab9813ea308%402ndquadrant.com
---
 src/backend/access/transam/twophase.c |  2 +-
 src/backend/storage/lmgr/lock.c   |  8 ++---
 src/backend/storage/lmgr/proc.c   | 46 +--
 src/include/c.h   |  1 -
 src/include/storage/proc.h| 13 ++--
 5 files changed, 38 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/transam/twophase.c 
b/src/backend/access/transam/twophase.c
index e1904877fa..54fb6cc047 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -460,7 +460,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId 
xid, const char *gid,
MemSet(proc, 0, sizeof(PGPROC));
proc->pgprocno = gxact->pgprocno;
SHMQueueElemInit(&(proc->links));
-   proc->waitStatus = STATUS_OK;
+   proc->waitStatus = PROC_WAIT_STATUS_OK;
/* We set up the gxact's VXID as InvalidBackendId/XID */
proc->lxid = (LocalTransactionId) xid;
pgxact->xid = xid;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 7fecb38162..95989ce79b 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1856,7 +1856,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
 */
PG_TRY();
{
-   if (ProcSleep(locallock, lockMethodTable) != STATUS_OK)
+   if (ProcSleep(locallock, lockMethodTable) != 
PROC_WAIT_STATUS_OK)
{
/*
 * We failed as a result of a deadlock, see 
CheckDeadLock(). Quit
@@ -1907,7 +1907,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
 /*
  * Remove a proc from the wait-queue it is on (caller must know it is on one).
  * This is only used when the proc has failed to get the lock, so we set its
- * waitStatus to STATUS_ERROR.
+ * waitStatus to PROC_WAIT_STATUS_ERROR.
  *
  * Appropriate partition lock must be held by caller.  Also, caller is
  * responsible for signaling the proc if needed.
@@ -1923,7 +1923,7 @@ RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
 
/* Make sure proc is waiting */
-   Assert(proc->waitStatus == STATUS_WAITING);
+   Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
Assert(proc->links.next != NULL);
Assert(waitLock);
Assert(waitLock->waitProcs.size > 0);
@@ -1946,7 +1946,7 @@ RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
/* Clean up the proc's own state, and pass it the ok/fail signal */
proc->waitLock = NULL;
proc->waitProcLock = NULL;
-   proc->waitStatus = STATUS_ERROR;
+   proc->waitStatus = PROC_WAIT_STATUS_ERROR;
 
/*
 * Delete the proclock immediately if it represents no already-held 
locks.
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f5eef6fa4e..e57fcd2538 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -383,7 +383,7 @@ InitProcess(void)
 * initialized by InitProcGlobal.
 */
SHMQueueElemInit(&(MyProc->links));
-   MyProc->waitStatus = STATUS_OK;
+   MyProc->waitStatus = PROC_WAIT_STATUS_OK;
MyProc->lxid = InvalidLocalTransactionId;
MyProc->fpVXIDLock = false;
MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
@@ -567,7 +567,7 @@ InitAuxiliaryProcess(void)
 * initialized by InitProcGlobal.
 */
SHMQueueElemInit(&(MyProc->links));
-   MyProc->waitStatus = STATUS_OK;
+   MyProc->waitStatus = PROC_WAIT_STATUS_OK;
MyProc->lxid = InvalidLocalTransactionId;
MyProc-

Re: Terminate the idle sessions

2020-06-11 Thread Li Japin


On Jun 10, 2020, at 10:27 PM, Adam Brusselback
<adambrusselb...@gmail.com> wrote:

My use case is, I have a primary application that connects to the DB, most 
users work through that (setting is useless for this scenario, app manages its 
connections well enough). I also have a number of internal users who deal with 
data ingestion and connect to the DB directly to work, and those users 
sometimes leave query windows open for days accidentally. Generally not an 
issue, but would be nice to be able to time those connections out.

If there is no big impact, I think we might add it as a built-in feature.

Japin Li


Re: Windows regress fails (latest HEAD)

2020-06-11 Thread Andrew Dunstan


On 6/11/20 9:40 AM, Ranier Vilela wrote:
> Em qui., 11 de jun. de 2020 às 10:36, Andrew Dunstan
> <andrew.duns...@2ndquadrant.com> escreveu:
>
>
>
>
>
> >
> >       * which configuration settings you're using
> >
> > None. Only build with default configuration (release is default).
>
>
> We would also need to see the contents of your config.pl
> 
>
> It seems that I am missing something, there is no config.pl, anywhere in
> the postgres directory.



If there isn't one it uses config_default.pl.


See src/tools/msvc/README which includes this snippet:


The tools for building PostgreSQL using Microsoft Visual Studio
currently
consist of the following files:

- Configuration files -
config_default.pl  default configuration arguments

A typical build environment has two more files, buildenv.pl and
config.pl
that contain the user's build environment settings and configuration
arguments.

cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Windows regress fails (latest HEAD)

2020-06-11 Thread Ranier Vilela
Em qui., 11 de jun. de 2020 às 11:00, Andrew Dunstan <
andrew.duns...@2ndquadrant.com> escreveu:

>
> On 6/11/20 9:40 AM, Ranier Vilela wrote:
> > Em qui., 11 de jun. de 2020 às 10:36, Andrew Dunstan
> > <andrew.duns...@2ndquadrant.com> escreveu:
> >
> >
> >
> >
> >
> > >
> > >   * which configuration settings you're using
> > >
> > > None. Only build with default configuration (release is default).
> >
> >
> > We would also need to see the contents of your config.pl
> > 
> >
> > It seems that I am missing something, there is no config.pl, anywhere in
> > the postgres directory.
>
>
>
> If there isn't one it uses config_default.pl.
>
I see.
As I did a clean build, from github (git clone), there is no config.pl, so,
was using the same file.
(
https://github.com/postgres/postgres/blob/master/src/tools/msvc/config_default.pl
)

regards,
Ranier VIlela


Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-11 Thread Masahiko Sawada
On Thu, 11 Jun 2020 at 20:02, Amit Kapila  wrote:
>
> On Thu, Jun 11, 2020 at 3:07 PM Masahiko Sawada
>  wrote:
> >
> > On Thu, 11 Jun 2020 at 18:11, Amit Kapila  wrote:
> > >
> > > On Thu, Jun 11, 2020 at 1:46 PM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Thu, 11 Jun 2020 at 12:30, Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > Now, thinking about this again, I am not sure if these stats are
> > > > > directly related to slots. These are stats for logical decoding which
> > > > > can be performed either via WALSender or decoding plugin (via APIs).
> > > > > So, why not have them displayed in a new view like pg_stat_logical (or
> > > > > pg_stat_logical_decoding/pg_stat_logical_replication)?   In future, we
> > > > > will need to add similar stats for streaming of in-progress
> > > > > transactions as well (see patch 0007-Track-statistics-for-streaming at
> > > > > [1]), so having a separate view for these doesn't sound illogical.
> > > > >
> > > >
> > > > I think we need to decide how long we want to remain these statistics
> > > > values. That is, if we were to have such pg_stat_logical view, these
> > > > values would remain until logical decoding finished since I think the
> > > > view would display only running logical decoding. OTOH, if we were to
> > > > correspond these stats to slots, these values would remain beyond
> > > > multiple logical decoding SQL API calls.
> > > >
> > >
> > > I thought of having these till the process that performs these
> > > operations exist.  So for WALSender, the stats will be valid till it
> > > is not restarted due to some reason or when performed via backend, the
> > > stats will be valid till the corresponding backend exits.
> > >
> >
> > The number of rows of that view could be up to (max_backends +
> > max_wal_senders). Is that right? What if different backends used the
> > same replication slot one after the other?
> >
>
> Yeah, it would be tricky if multiple slots are used by the same
> backend.  We could probably track the number of times decoding has
> happened by the session that will probably help us in averaging the
> spill amount.  If we think that the aim is to help users to tune
> logical_decoding_work_mem to avoid frequent spilling or streaming then
> how would maintaining at slot level will help?

Since the logical decoding intermediate files are written to a per-slot
directory, I thought that associating these statistics with
replication slots would also be understandable for users. I was thinking of
something like a pg_stat_logical_replication_slot view, which shows
slot_name and the statistics of only logical replication slots. The view
would always show as many rows as there are replication slots, regardless
of whether logical decoding is running. I think there is no big difference
in how users use these statistics values between maintaining them at the
slot level and at the logical decoding level.

In the logical replication case, since we generally don’t support setting
a different logical_decoding_work_mem per wal sender, every wal sender
will decode the same WAL stream with the same setting, meaning they
will spill intermediate files similarly. Maybe the same is true for the
statistics of streaming. So having these statistics per logical
replication connection might not help as of now.

> As you said previously
> we could track it only for running logical decoding context but if we
> do that then data will be temporary and the user needs to constantly
> monitor the same to make sense of it but maybe that is fine.

Agreed, in general, it's better not to frequently reset cumulative
values. I personally would like these statistics to be valid even
after the process executing logical decoding exits (or even after
a server restart), and to reset them only as needed.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-06-11 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Wed, Jun 10, 2020 at 01:23:37PM +0900, Michael Paquier wrote:
> > Okay.  After sleeping on it, it looks like would be better to move
> > this new fe_archive.c to src/fe_utils/.  I'll try to do that tomorrow,
> > and added an open item so as we don't forget about it.
> 
> Applied this part now as of a3b2bf1.

Uh, doesn't this pull these functions out of libpgcommon, where they
might have been used by utilities outside of our client tools?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Recording test runtimes with the buildfarm

2020-06-11 Thread Stephen Frost
Greetings,

* David Rowley (dgrowle...@gmail.com) wrote:
> On Thu, 11 Jun 2020 at 10:02, Tom Lane  wrote:
> > Thomas Munro  writes:
> > > I've been doing that in a little database that pulls down the results
> > > and analyses them with primitive regexes.  First I wanted to know the
> > > pass/fail history for each individual regression, isolation and TAP
> > > script, then I wanted to build something that could identify tests
> > > that are 'flapping', and work out when the started and stopped
> > > flapping etc.  I soon realised it was all too noisy, but then I
> > > figured that I could fix that by detecting crashes.  So I classify
> > > every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
> > > top level run was CRASH, than I can disregard the individual per
> > > script results, because they're all BS.
> >
> > If you can pin the crash on a particular test script, it'd be useful
> > to track that as a kind of failure.  In general, though, both crashes
> > and non-crash failures tend to cause collateral damage to later test
> > scripts --- if you can't filter that out then the later scripts will
> > have high false-positive rates.
> 
> I guess the fact that you've both needed to do analysis on individual
> tests shows that there might be a call for this beyond just recording
> the test's runtime.
> 
> If we had a table that stored the individual test details, pass/fail
> and just stored the timing information along with that, then, even if
> the timing was unstable, it could still be useful for some analysis.
> I'd be happy enough even if that was only available as a csv file
> download.  I imagine the buildfarm does not need to provide us with
> any tools for doing analysis on this. Ideally, there would be some
> run_id that we could link it back to the test run which would give us
> the commit SHA, and the animal that it ran on. Joining to details
> about the animal could be useful too, e.g perhaps a certain test
> always fails on 32-bit machines.
> 
> I suppose that maybe we could modify pg_regress to add a command line
> option to have it write out a machine-readable file, e.g:
> testname,result,runtime\n, then just have the buildfarm client ship
> that off to the buildfarm server to record in the database.

That seems like it'd be the best approach to me, though I'd defer to
Andrew on it.

By the way, if you'd like access to the buildfarm archive server where
all this stuff is stored, that can certainly be arranged, just let me
know.

Thanks,

Stephen




Re: Recording test runtimes with the buildfarm

2020-06-11 Thread Andrew Dunstan


On 6/11/20 10:21 AM, Stephen Frost wrote:
> Greetings,
>
> * David Rowley (dgrowle...@gmail.com) wrote:
>> On Thu, 11 Jun 2020 at 10:02, Tom Lane  wrote:
>>> Thomas Munro  writes:
>>>> I've been doing that in a little database that pulls down the results
>>>> and analyses them with primitive regexes.  First I wanted to know the
>>>> pass/fail history for each individual regression, isolation and TAP
>>>> script, then I wanted to build something that could identify tests
>>>> that are 'flapping', and work out when they started and stopped
>>>> flapping etc.  I soon realised it was all too noisy, but then I
>>>> figured that I could fix that by detecting crashes.  So I classify
>>>> every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
>>>> top level run was CRASH, then I can disregard the individual per
>>>> script results, because they're all BS.
>>> If you can pin the crash on a particular test script, it'd be useful
>>> to track that as a kind of failure.  In general, though, both crashes
>>> and non-crash failures tend to cause collateral damage to later test
>>> scripts --- if you can't filter that out then the later scripts will
>>> have high false-positive rates.
>> I guess the fact that you've both needed to do analysis on individual
>> tests shows that there might be a call for this beyond just recording
>> the test's runtime.
>>
>> If we had a table that stored the individual test details, pass/fail
>> and just stored the timing information along with that, then, even if
>> the timing was unstable, it could still be useful for some analysis.
>> I'd be happy enough even if that was only available as a csv file
>> download.  I imagine the buildfarm does not need to provide us with
>> any tools for doing analysis on this. Ideally, there would be some
>> run_id that we could link it back to the test run which would give us
>> the commit SHA, and the animal that it ran on. Joining to details
>> about the animal could be useful too, e.g perhaps a certain test
>> always fails on 32-bit machines.
>>
>> I suppose that maybe we could modify pg_regress to add a command line
>> option to have it write out a machine-readable file, e.g:
>> testname,result,runtime\n, then just have the buildfarm client ship
>> that off to the buildfarm server to record in the database.
> That seems like it'd be the best approach to me, though I'd defer to
> Andrew on it.
>
> By the way, if you'd like access to the buildfarm archive server where
> all this stuff is stored, that can certainly be arranged, just let me
> know.
>


Yeah, we'll need to work out where to stash the file. The client will
pick up anything in src/regress/log for "make check", but would need
adjusting for other steps that invoke pg_regress. I'm getting close to
cutting a new client release, but I can delay it till we settle this.


On the server side, we could add a table with a key of <animal, snapshot,
branch, step, testname>, but we'd need to make sure those test names were
unique. Maybe we need a way of telling pg_regress to prepend a module name
(e.g. btree_gist or plperl) to the test name.
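
A sketch of such a table (column names and types are placeholders, not a
settled design):

CREATE TABLE test_results (
    animal     text,          -- buildfarm member name
    snapshot   timestamptz,   -- run snapshot
    branch     text,
    step       text,          -- e.g. "check", "contrib-check"
    testname   text,          -- possibly module-qualified, e.g. "btree_gist/int4"
    result     text,          -- pass/fail
    runtime_ms integer,
    PRIMARY KEY (animal, snapshot, branch, step, testname)
);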


cheers


andrew


-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Windows regress fails (latest HEAD)

2020-06-11 Thread Andrew Gierth
> "Ranier" == Ranier Vilela  writes:

 Ranier> Hi,
 Ranier> Latest HEAD, fails with windows regress tests.

 Ranier>   three |  f1  |sqrt_f1
 Ranier>  ---+--+---
 Ranier> |   1004.3 |  31.6906926399535
 Ranier> -   | 1.2345678901234e+200 | 1.1110611109e+100
 Ranier> +   | 1.2345678901234e+200 | 1.1110611108e+100
 Ranier> | 1.2345678901234e-200 | 1.1110611109e-100
 Ranier>  (3 rows)

This error is a surprisingly large one. Normally one expects sqrt to be
accurate to within half an ulp, i.e. accurate to the limits of the
format, though the regression test avoids actually making this
assumption. But in this case the true output we expect is:

1.11106111085536...e+100

for which the closest representable float8 is

1.11106111085583...e+100  (= 0x1.451DCD2E3ACAFp+332)

which should round (since we're doing this test with
extra_float_digits=0) to

1.1110611109e+100

The nearest value that would round to 1.1110611108e+100 would be
1.111061110848e+100 (= 0x1.451DCD2E3ACABp+332), which is a
difference of not less than 4 ulps from the expected value.
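
For anyone wanting to reproduce this outside the regression suite, the
comparison boils down to the following (a sketch; the input value is the one
from the failing test, and extra_float_digits = 0 matches what the test
uses):

SET extra_float_digits = 0;
SELECT sqrt('1.2345678901234e+200'::float8);
-- expected:                      1.1110611109e+100
-- reported by the failing
-- Windows animal:                1.1110611108e+100  (not less than 4 ulps off)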

-- 
Andrew (irc:RhodiumToad)




Re: Recording test runtimes with the buildfarm

2020-06-11 Thread Magnus Hagander
On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
andrew.duns...@2ndquadrant.com> wrote:

>
> On 6/11/20 10:21 AM, Stephen Frost wrote:
> > Greetings,
> >
> > * David Rowley (dgrowle...@gmail.com) wrote:
> >> On Thu, 11 Jun 2020 at 10:02, Tom Lane  wrote:
> >>> Thomas Munro  writes:
> >>>> I've been doing that in a little database that pulls down the results
> >>>> and analyses them with primitive regexes.  First I wanted to know the
> >>>> pass/fail history for each individual regression, isolation and TAP
> >>>> script, then I wanted to build something that could identify tests
> >>>> that are 'flapping', and work out when they started and stopped
> >>>> flapping etc.  I soon realised it was all too noisy, but then I
> >>>> figured that I could fix that by detecting crashes.  So I classify
> >>>> every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
> >>>> top level run was CRASH, then I can disregard the individual per
> >>>> script results, because they're all BS.
> >>> If you can pin the crash on a particular test script, it'd be useful
> >>> to track that as a kind of failure.  In general, though, both crashes
> >>> and non-crash failures tend to cause collateral damage to later test
> >>> scripts --- if you can't filter that out then the later scripts will
> >>> have high false-positive rates.
> >> I guess the fact that you've both needed to do analysis on individual
> >> tests shows that there might be a call for this beyond just recording
> >> the test's runtime.
> >>
> >> If we had a table that stored the individual test details, pass/fail
> >> and just stored the timing information along with that, then, even if
> >> the timing was unstable, it could still be useful for some analysis.
> >> I'd be happy enough even if that was only available as a csv file
> >> download.  I imagine the buildfarm does not need to provide us with
> >> any tools for doing analysis on this. Ideally, there would be some
> >> run_id that we could link it back to the test run which would give us
> >> the commit SHA, and the animal that it ran on. Joining to details
> >> about the animal could be useful too, e.g perhaps a certain test
> >> always fails on 32-bit machines.
> >>
> >> I suppose that maybe we could modify pg_regress to add a command line
> >> option to have it write out a machine-readable file, e.g:
> >> testname,result,runtime\n, then just have the buildfarm client ship
> >> that off to the buildfarm server to record in the database.
> > That seems like it'd be the best approach to me, though I'd defer to
> > Andrew on it.
> >
> > By the way, if you'd like access to the buildfarm archive server where
> > all this stuff is stored, that can certainly be arranged, just let me
> > know.
> >
>
>
> Yeah, we'll need to work out where to stash the file. The client will
> pick up anything in src/regress/log for "make check", but would need
> adjusting for other steps that invoke pg_regress. I'm getting close to
> cutting a new client release, but I can delay it till we settle this.
>
>
> On the server side, we could add a table with a key of <animal, snapshot,
> branch, step, testname>, but we'd need to make sure those test names were
> unique. Maybe we need a way of telling pg_regress to prepend a module name
> (e.g. btree_gist or plperl) to the test name.
>

It seems pretty trivial to for example get all the steps out of check.log
and their timing with a regexp. I just used '^(?:test)?\s+(\S+)\s+\.\.\.
ok\s+(\d+) ms$' as the regexp. Running that against a few hundred build
runs in the db generally looks fine, though I didn't look into it in
detail.

Of course, that only looked at check.log, and more logic would be needed if
we want to look into the other areas as well, but as long as it's
pg_regress output I think it should be easy?
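
For what it's worth, the same extraction can be done in SQL once the raw
check.log lines are loaded somewhere; a minimal sketch, assuming a
hypothetical table buildfarm_log(run_id, line) with one log line per row
(the regexp is the one quoted above):

SELECT run_id,
       m[1]      AS testname,
       m[2]::int AS runtime_ms
FROM buildfarm_log,
     LATERAL regexp_match(line,
         '^(?:test)?\s+(\S+)\s+\.\.\. ok\s+(\d+) ms$') AS m
WHERE m IS NOT NULL;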

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: how to create index concurrently on partitioned table

2020-06-11 Thread Justin Pryzby
On Sun, Jun 07, 2020 at 01:04:48PM -0500, Justin Pryzby wrote:
> On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> > On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
> > > Partitioning is necessary for very large tables.  However, I found that
> > > postgresql does not support create index concurrently on partitioned
> > > tables.  The document show that we need to create an index on each
> > > partition individually and then finally create the partitioned index
> > > non-concurrently.  This is undoubtedly a complex operation for DBA,
> > > especially when there are many partitions. 

I added functionality for C-I-C, REINDEX-CONCURRENTLY, and CLUSTER of
partitioned tables.  We already recursively handle VACUUM and ANALYZE since
v10.

And added here:
https://commitfest.postgresql.org/28/2584/

Adger, if you're familiar with compilation and patching, do you want to try the
patch ?

Note, you could do this now using psql like:
SELECT format('CREATE INDEX CONCURRENTLY ... ON %s(col)', a::regclass) FROM 
pg_partition_ancestors() AS a;
\gexec

-- 
Justin
>From b889658992c20011c10ecb05fd43e59ac5475d9e Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 6 Jun 2020 17:42:23 -0500
Subject: [PATCH v2 1/3] Allow CREATE INDEX CONCURRENTLY on partitioned table

---
 doc/src/sgml/ref/create_index.sgml |  9 
 src/backend/commands/indexcmds.c   | 68 ++
 src/test/regress/expected/indexing.out | 14 +-
 src/test/regress/sql/indexing.sql  |  3 +-
 4 files changed, 52 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index ff87b2d28f..e1298b8523 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -657,15 +657,6 @@ Indexes:
 cannot.

 
-   
-Concurrent builds for indexes on partitioned tables are currently not
-supported.  However, you may concurrently build the index on each
-partition individually and then finally create the partitioned index
-non-concurrently in order to reduce the time where writes to the
-partitioned table will be locked out.  In this case, building the
-partitioned index is a metadata only operation.
-   
-
   
  
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2baca12c5f..344cabaf52 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -652,17 +652,6 @@ DefineIndex(Oid relationId,
 	partitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
 	if (partitioned)
 	{
-		/*
-		 * Note: we check 'stmt->concurrent' rather than 'concurrent', so that
-		 * the error is thrown also for temporary tables.  Seems better to be
-		 * consistent, even though we could do it on temporary table because
-		 * we're not actually doing it concurrently.
-		 */
-		if (stmt->concurrent)
-			ereport(ERROR,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("cannot create index on partitioned table \"%s\" concurrently",
-			RelationGetRelationName(rel))));
 		if (stmt->excludeOpNames)
 			ereport(ERROR,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -1139,8 +1128,26 @@ DefineIndex(Oid relationId,
 		CreateComments(indexRelationId, RelationRelationId, 0,
 	   stmt->idxcomment);
 
+	/* save lockrelid and locktag for below */
+	heaprelid = rel->rd_lockInfo.lockRelId;
+	SET_LOCKTAG_RELATION(heaplocktag, heaprelid.dbId, heaprelid.relId);
+
 	if (partitioned)
 	{
+		/*
+		 * Need to close the relation before recursing into children, so copy
+		 * needed data.
+		 */
+		PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+		int			nparts = partdesc->nparts;
+		TupleDesc	parentDesc = CreateTupleDescCopy(RelationGetDescr(rel));
+		char		*relname = pstrdup(RelationGetRelationName(rel));
+		Oid			*part_oids;
+
+		part_oids = palloc(sizeof(Oid) * nparts);
+		memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
+		table_close(rel, NoLock);
+
 		/*
 		 * Unless caller specified to skip this step (via ONLY), process each
 		 * partition to make sure they all contain a corresponding index.
@@ -1149,19 +1156,11 @@ DefineIndex(Oid relationId,
 		 */
 		if (!stmt->relation || stmt->relation->inh)
 		{
-			PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-			int			nparts = partdesc->nparts;
-			Oid		   *part_oids = palloc(sizeof(Oid) * nparts);
 			bool		invalidate_parent = false;
-			TupleDesc	parentDesc;
 			Oid		   *opfamOids;
 
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
 		 nparts);
-
-			memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
-
-			parentDesc = RelationGetDescr(rel);
 			opfamOids = palloc(sizeof(Oid) * numberOfKeyAttributes);
 			for (i = 0; i < numberOfKeyAttributes; i++)
 opfamOids[i] = get_opclass_family(classObjectId[i]);
@@ -1196,9 +1195,9 @@ DefineIndex(Oid relationId,
 		ereport(ERROR,
 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
  errmsg("cannot cre

pg_dump, gzwrite, and errno

2020-06-11 Thread Justin Pryzby
While testing Pavel's patch for pg_dump --filter, I got:

pg_dump: error: could not write to output file: Success
[pryzbyj@database postgresql]$ echo $?
1

I see we tried to fix it few years ago:
https://www.postgresql.org/message-id/flat/1498120508308.9826%40infotecs.ru
https://www.postgresql.org/message-id/flat/20160125143008.2539.2878%40wrigleys.postgresql.org
https://www.postgresql.org/message-id/20160307.174354.251049100.horiguchi.kyot...@lab.ntt.co.jp
https://www.postgresql.org/message-id/20150608174336.gm133...@postgresql.org

Commits:
4d57e83816778c6f61ea35c697f937a6f9c3c3de
9a3b5d3ad0f1c19c47e2ee65b372344cb0616c9a

This patch fixes it for me
pg_dump: error: could not write to output file: No space left on device

--- a/src/bin/pg_dump/pg_backup_directory.c
+++ b/src/bin/pg_dump/pg_backup_directory.c
@@ -347,8 +347,12 @@ _WriteData(ArchiveHandle *AH, const void *data, size_t 
dLen)
lclContext *ctx = (lclContext *) AH->formatData;
 
if (dLen > 0 && cfwrite(data, dLen, ctx->dataFH) != dLen)
+   {
+   if (errno == 0)
+   errno = ENOSPC;
fatal("could not write to output file: %s",
  get_cfp_error(ctx->dataFH));
+   }
 }


PS. Due to $UserError, I originally sent this message with inaccurate RFC822
headers..




Re: new heapcheck contrib module

2020-06-11 Thread Dilip Kumar
On Mon, May 11, 2020 at 10:51 PM Mark Dilger
 wrote:
>
> Here is v5 of the patch.  Major changes in this version include:
>
> 1) A new module, pg_amcheck, which includes a command line client for 
> checking a database or subset of a database.  Internally it functions by 
> querying the database for a list of tables which are appropriate given the 
> command line switches, and then calls amcheck's functions to validate each 
> table and/or index.  The options for selecting/excluding tables and schemas 
> is patterned on pg_dump, on the assumption that interface is already familiar 
> to users.
>
> 2) amcheck's btree checking functions have been refactored to be able to 
> operate in two modes; the original mode in which all errors are reported via 
> ereport, and a new mode for returning errors as rows from a set returning 
> function.  The new mode is used by a new function verify_btreeam(), analogous 
> to verify_heapam(), both of which are used by the pg_amcheck command line 
> tool.
>
> 3) The regression test which generates corruption within a table uses the 
> pageinspect module to determine the location of each tuple on disk for 
> corrupting.  This was suggested upthread.
>
> Testing on the command line shows that the pre-existing btree checking code 
> could use some hardening, as it currently crashes the backend on certain 
> corruptions.  When I corrupt relation files for tables and indexes in the 
> backend and then use pg_amcheck to check all objects in the database, I keep 
> getting assertions from the btree checking code.  I think I need to harden 
> this code, but wanted to post an updated patch and solicit opinions before 
> doing so.  Here are some example problems I'm seeing.  Note the stack trace 
> when calling from the command line tool includes the new verify_btreeam 
> function, but you can get the same crashes using the old interface via psql:
>
> From psql, first error:
>
> test=# select bt_index_parent_check('corrupted_idx', true, true);
> TRAP: FailedAssertion("_bt_check_natts(rel, key->heapkeyspace, page, 
> offnum)", File: "nbtsearch.c", Line: 663)
> 0   postgres0x000106872977 
> ExceptionalCondition + 103
> 1   postgres0x0001063a33e2 _bt_compare + 1090
> 2   amcheck.so  0x000106d62921 
> bt_target_page_check + 6033
> 3   amcheck.so  0x000106d5fd2f 
> bt_index_check_internal + 2847
> 4   amcheck.so  0x000106d60433 
> bt_index_parent_check + 67
> 5   postgres0x0001064d6762 ExecInterpExpr + 
> 1634
> 6   postgres0x00010650d071 ExecResult + 321
> 7   postgres0x0001064ddc3d 
> standard_ExecutorRun + 301
> 8   postgres0x0001066600c5 PortalRunSelect + 
> 389
> 9   postgres0x00010665fc7f PortalRun + 527
> 10  postgres0x00010665ed59 exec_simple_query 
> + 1641
> 11  postgres0x00010665c99d PostgresMain + 3661
> 12  postgres0x0001065d6a8a BackendRun + 410
> 13  postgres0x0001065d61c4 ServerLoop + 3044
> 14  postgres0x0001065d2fe9 PostmasterMain + 
> 3769
> 15  postgres0x00010652e3b0 help + 0
> 16  libdyld.dylib   0x7fff6725fcc9 start + 1
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: 2020-05-11 
> 10:11:47.394 PDT [41091] LOG:  server process (PID 41309) was terminated by 
> signal 6: Abort trap: 6
>
>
>
> From commandline, second error:
>
> pgtest % pg_amcheck -i test
> (relname=corrupted,blkno=0,offnum=16,lp_off=7680,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmin = 3289393 is in the future
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmax = 0 precedes relation relminmxid = 1
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmin = 12593 is in the future
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
>
> 
>
> (relname=corrupted,blkno=107,offnum=20,lp_off=7392,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmin = 306 precedes relation relfrozenxid = 487
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmax = 0 precedes relation relminmxid = 1
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmin = 305 precedes relation relfrozenxid = 487
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> t_hoff > lp_len (54 > 34)
> (relname=corrupted

Re: Recording test runtimes with the buildfarm

2020-06-11 Thread Tom Lane
Magnus Hagander  writes:
> On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
> andrew.duns...@2ndquadrant.com> wrote:
>> Yeah, we'll need to work out where to stash the file. The client will
>> pick up anything in src/regress/log for "make check", but would need
>> adjusting for other steps that invoke pg_regress. I'm getting close to
>> cutting a new client release, but I can delay it till we settle this.

> It seems pretty trivial to for example get all the steps out of check.log
> and their timing with a regexp.

Yeah, I don't see why we can't scrape this data from the existing
buildfarm output, at least for the core regression tests.  New
infrastructure could make it easier/cheaper, but I don't think we
should invest in that until we've got a provably useful tool.

regards, tom lane




Re: Recording test runtimes with the buildfarm

2020-06-11 Thread Magnus Hagander
On Thu, Jun 11, 2020 at 6:32 PM Tom Lane  wrote:

> Magnus Hagander  writes:
> > On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
> > andrew.duns...@2ndquadrant.com> wrote:
> >> Yeah, we'll need to work out where to stash the file. The client will
> >> pick up anything in src/regress/log for "make check", but would need
> >> adjusting for other steps that invoke pg_regress. I'm getting close to
> >> cutting a new client release, but I can delay it till we settle this.
>
> > It seems pretty trivial to for example get all the steps out of check.log
> > and their timing with a regexp.
>
> Yeah, I don't see why we can't scrape this data from the existing
> buildfarm output, at least for the core regression tests.  New
> infrastructure could make it easier/cheaper, but I don't think we
> should invest in that until we've got a provably useful tool.
>

So spending a few minutes to look at my data, it is not quite that easy if
we want to look at contrib checks for example. Both btree_gin and
btree_gist have checks called "int4" for example, and the aforementioned
regexp won't pick that up. But that's surely a fixable problem.

And perhaps we should at least start off data for just "make check" to see
if it's useful, before spending too much work?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pg_upgrade fails with non-standard ACL

2020-06-11 Thread Anastasia Lubennikova

On 08.06.2020 19:31, Alvaro Herrera wrote:

On 2020-Jun-08, Anastasia Lubennikova wrote:


In this version I rebased the test patches, added some more comments, fixed
a memory allocation issue, and also removed the code that handles ACLs on
languages. They require a lot of specific handling, while I doubt that their
signatures, which consist of the language name only, are subject to change
in any future versions.

I'm thinking what's a good way to have a test that's committable.  Maybe
it would work to add a TAP test to pg_upgrade that runs initdb, does a
few GRANTS as per your attachment, then runs pg_upgrade?  Currently the
pg_upgrade tests are not TAP ... we'd have to revive
https://postgr.es/m/20180126080026.gi17...@paquier.xyz
(Some progress was already made on the buildfarm front to permit this)


I would be glad to add some test, but it seems to me that the infrastructure
changes for a cross-version pg_upgrade test are a much more complicated task
than this modest bugfix.  Besides, I've read the discussion and it seems
that Michael is not going to continue this work.

The attached v10 patch contains one more fix for an uninitialized variable.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



pg_upgrade_ACL_test.sh
Description: application/shellscript
commit f98c26fca3d80134a7a3cf8424d8c23891c5f2b9
Author: anastasia 
Date:   Mon Jun 8 18:34:14 2020 +0300

pg_upgrade_ACL_check_v10.patch

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 00aef855dc..94e8528a3a 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -16,6 +16,7 @@
 
 static void check_new_cluster_is_empty(void);
 static void check_databases_are_compatible(void);
+static void check_for_changed_signatures(void);
 static void check_locale_and_encoding(DbInfo *olddb, DbInfo *newdb);
 static bool equivalent_locale(int category, const char *loca, const char *locb);
 static void check_is_install_user(ClusterInfo *cluster);
@@ -142,6 +143,8 @@ check_and_dump_old_cluster(bool live_check)
 	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 804)
 		new_9_0_populate_pg_largeobject_metadata(&old_cluster, true);
 
+	get_non_default_acl_infos(&old_cluster);
+
 	/*
 	 * While not a check option, we do this now because this is the only time
 	 * the old server is running.
@@ -161,6 +164,7 @@ check_new_cluster(void)
 
 	check_new_cluster_is_empty();
 	check_databases_are_compatible();
+	check_for_changed_signatures();
 
 	check_loadable_libraries();
 
@@ -443,6 +447,223 @@ check_databases_are_compatible(void)
 	}
 }
 
+/*
+ * Find the location of the last dot, return NULL if not found.
+ */
+static char *
+last_dot_location(const char *identity)
+{
+	const char *p,
+			   *ret = NULL;
+
+	for (p = identity; *p; p++)
+		if (*p == '.')
+			ret = p;
+	return unconstify(char *, ret);
+}
+
+/*
+ * check_for_changed_signatures()
+ *
+ * Check that the old cluster doesn't have non-default ACL's for system objects
+ * (relations, attributes, functions and procedures) which have different
+ * signatures in the new cluster. Otherwise generate revoke_objects.sql.
+ */
+static void
+check_for_changed_signatures(void)
+{
+	PGconn	   *conn;
+	char		subquery[QUERY_ALLOC];
+	PGresult   *res;
+	int			ntups;
+	int			i_obj_ident;
+	int			dbnum;
+	bool		need_check = false;
+	FILE	   *script = NULL;
+	bool		found_changed = false;
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for system objects with non-default ACL");
+
+	for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
+		if (old_cluster.dbarr.dbs[dbnum].non_def_acl_arr.nacls > 0)
+		{
+			need_check = true;
+			break;
+		}
+	/*
+	 * The old cluster doesn't have system objects with non-default ACL so
+	 * quickly exit.
+	 */
+	if (!need_check)
+	{
+		check_ok();
+		return;
+	}
+
+	snprintf(output_path, sizeof(output_path), "revoke_objects.sql");
+
+	snprintf(subquery, sizeof(subquery),
+		/* Get system relations which created in pg_catalog */
+		"SELECT 'pg_class'::regclass classid, oid objid, 0 objsubid "
+		"FROM pg_catalog.pg_class "
+		"WHERE relnamespace = 'pg_catalog'::regnamespace "
+		"UNION ALL "
+		/* Get system relations attributes which created in pg_catalog */
+		"SELECT 'pg_class'::regclass, att.attrelid, att.attnum "
+		"FROM pg_catalog.pg_class rel "
+		"INNER JOIN pg_catalog.pg_attribute att ON  rel.oid = att.attrelid "
+		"WHERE rel.relnamespace = 'pg_catalog'::regnamespace "
+		"UNION ALL "
+		/* Get system functions and procedure which created in pg_catalog */
+		"SELECT 'pg_proc'::regclass, oid, 0 "
+		"FROM pg_catalog.pg_proc "
+		"WHERE pronamespace = 'pg_catalog'::regnamespace ");
+
+	conn = connectToServer(&new_cluster, "template1");
+	res = executeQueryOrDie(conn,
+		"SELECT ident.type, ident.identity "
+		"FROM (%s) obj, "
+		"LATERAL pg_catalog.pg_identify_object("
+		"obj.classid, obj.objid, obj.objsubid) ident "
+		/*
+		 * Don't rely on database collation, since we use strcmp
+		 * comparis

Re: Atomic operations within spinlocks

2020-06-11 Thread Andres Freund
Hi,

On 2020-06-10 07:26:32 -0400, Robert Haas wrote:
> On Fri, Jun 5, 2020 at 8:19 PM Andres Freund  wrote:
> > Randomly noticed while looking at the code:
> > uint64  flagbit = UINT64CONST(1) << (uint64) type;
> >
> > that shouldn't be 64bit, right?
> 
> I'm going to admit ignorance here. What's the proper coding rule?

Well, pss_barrierCheckMask member is just 32bit, so it seems odd to
declare the local variable 64bit?

uint64  flagbit = UINT64CONST(1) << (uint64) type;
...
pg_atomic_fetch_or_u32(&slot->pss_barrierCheckMask, flagbit);


Greetings,

Andres Freund




Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-11 Thread Andres Freund
Hi,

On 2020-06-10 13:37:59 -0400, Robert Haas wrote:
> On Tue, Jun 9, 2020 at 6:54 PM Andres Freund  wrote:
> > What do you think about my idea of having a BEGIN/END_SIGNAL_HANDLER?
> > That'd make it much easier to write assertions forbidding palloc, 64bit
> > atomics, ...
> 
> I must have missed the previous place where you suggested this, but I
> think it's a good idea.

https://www.postgresql.org/message-id/20200606023103.avzrctgv7476xj7i%40alap3.anarazel.de

It'd be neat if we could do that entirely within pqsignal(). But that'd
require some additional state (I think an array of handlers, indexed by
signum).

Greetings,

Andres Freund




Re: WIP/PoC for parallel backup

2020-06-11 Thread Hamid Akhtar
As far as I understand, parallel backup is not a mandatory performance
feature but rather one at the user's discretion. This IMHO indicates that it
will benefit some users and may not benefit others.

Taking a backup is an I/O-intensive workload, so parallelizing it across
multiple worker threads/processes creates an overhead of its own.
So what precisely are we optimizing here?  Looking at a running database
system in any environment, I see the following potential scenarios playing
out. These are probably clear to everyone here, but I'm listing these for
completeness and clarity.

Locally Running Backup:
(1) Server has no clients connected other than base backup.
(2) Server has other clients connected which are actively performing
operations causing disk I/O.

Remotely Running Backup:
(3) Server has no clients connected other than remote base backup.
(4) Server has other clients connected which are actively performing
operations causing disk I/O.

Others:
(5) Server or the system running base backup has other processes competing
for disk or network bandwidth.

Generally speaking, I see that parallelization could potentially benefit in
scenarios (2), (4) and (5) with the reason being that having more than one
thread increases the likelihood that backup will now get a bigger time
slice for disk I/O and network bandwidth. With (1) and (3), since there are
no competing processes, addition of multiple threads or processes will only
increase CPU overhead whilst still getting the same network and disk time
slice. In this particular case, the performance will degrade.

IMHO, that’s why adding other load on the server, perhaps by running
pgbench simultaneously, may show improved performance for parallel backup.
Also, running parallel backup on a local laptop more often than not yields
improved performance.

There are obviously other factors that may impact the performance like the
type of I/O scheduler being used whether CFQ or some other.

IMHO, parallel backup has obvious performance benefits, but we need to
ensure that users understand that there is potential for slower backup if
there is no competition for resources.



On Fri, May 22, 2020 at 11:03 AM Suraj Kharage <
suraj.khar...@enterprisedb.com> wrote:

>
> On Thu, May 21, 2020 at 7:12 PM Robert Haas  wrote:
>
>>
>> So, basically, when we go from 1 process to 4, the additional
>> processes spend all of their time waiting rather than doing any useful
>> work, and that's why there is no performance benefit. Presumably, the
>> reason they spend all their time waiting for ClientRead/ClientWrite is
>> because the network between the two machines is saturated, so adding
>> more processes that are trying to use it at maximum speed just leads
>> to spending more time waiting for it to be available.
>>
>> Do we have the same results for the local backup case, where the patch
>> helped?
>>
>
> Here is the result for local backup case (100GB data). Attaching the
> captured logs.
>
> The total number of events (pg_stat_activity) captured during local runs:
> - 82 events for normal backups
> - 31 events for parallel backups (-j 4)
>
> BaseBackupRead wait event numbers: (newly added)
> 24 - in normal backups
> 14 - in parallel backup (-j 4)
>
> ClientWrite wait event numbers:
> 8 - in normal backup
> 43 - in parallel backups
>
> ClientRead wait event numbers:
> 0 - ClientRead in normal backup
> 32 - ClientRead in parallel backups for diff processes.
>
>
> --
> --
>
> Thanks & Regards,
> Suraj kharage,
> EnterpriseDB Corporation,
> The Postgres Database Company.
>


-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus


Re: hashagg slowdown due to spill changes

2020-06-11 Thread Andres Freund
Hi, 

On June 10, 2020 6:15:39 PM PDT, Jeff Davis  wrote:
>On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
>> Hi,
>> 
>> While preparing my pgcon talk I noticed that our hash-agg performance
>> degraded noticeably. Looks to me like it's due to the spilling-
>> hashagg
>> changes.
>
>Attached a proposed fix. (Might require some minor cleanup).
>
>The only awkward part is that LookupTupleHashEntry() needs a new out
>parameter to pass the hash value back to the caller. Ordinarily, the
>caller can get that from the returned entry, but if isnew==NULL, then
>the function might return NULL (and the caller wouldn't have an entry
>from which to read the hash value).

Great!

Did you run any performance tests?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Definitional issue: stddev_pop (and related) for 1 input

2020-06-11 Thread Tom Lane
Before v12, stddev_pop() had the following behavior with just a
single input value:

regression=# SELECT stddev_pop('42'::float8);
 stddev_pop 
------------
          0
(1 row)

regression=# SELECT stddev_pop('inf'::float8);
 stddev_pop 
------------
        NaN
(1 row)

regression=# SELECT stddev_pop('nan'::float8);
 stddev_pop 
------------
        NaN
(1 row)

As of v12, though, all three cases produce 0.  I am not sure what
to think about that with respect to an infinity input, but I'm
quite sure I don't like it for NaN input.

It looks like the culprit is the introduction of the "Youngs-Cramer"
algorithm in float8_accum: nothing is done to Sxx at the first iteration,
even if the input is inf or NaN.  I'd be inclined to force Sxx to NaN
when the first input is NaN, and perhaps also when it's Inf.
Alternatively we could clean up in the finalization routine by noting
that Sx is Inf/NaN, but that seems messier.  Thoughts?

(I came across this by noting that the results don't agree with
numeric accumulation, which isn't using Youngs-Cramer.)
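
Spelled out, the discrepancy looks like this (a sketch; the float8 result
is the new v12 behavior, the numeric result is unchanged):

SELECT stddev_pop('nan'::float8)  AS float8_result,   -- 0 as of v12
       stddev_pop('nan'::numeric) AS numeric_result;  -- still NaN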

regards, tom lane




Re: hashagg slowdown due to spill changes

2020-06-11 Thread Jeff Davis
On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
> Did you run any performance tests?

Yes, I reproduced your ~12% regression from V12, and this patch nearly
eliminated it for me.

Regards,
Jeff Davis






Re: jsonpath versus NaN

2020-06-11 Thread Alexander Korotkov
Hi Tom,

Thank you for raising this issue.

On Thu, Jun 11, 2020 at 3:45 PM Tom Lane  wrote:
> Commit 72b646033 inserted this into convertJsonbScalar:
>
> break;
>
> case jbvNumeric:
> +   /* replace numeric NaN with string "NaN" */
> +   if (numeric_is_nan(scalarVal->val.numeric))
> +   {
> +   appendToBuffer(buffer, "NaN", 3);
> +   *jentry = 3;
> +   break;
> +   }
> +
> numlen = VARSIZE_ANY(scalarVal->val.numeric);
> padlen = padBufferToInt(buffer);
>
> To characterize this as hack, slash, and burn programming would be
> charitable.  It is entirely clear from the code, the documentation,
> and the relevant RFCs that JSONB does not allow NaNs as numeric
> values.

The JSONB itself doesn't store number NaNs.  It stores the string "NaN".

I found the relevant part of the standard.  Unfortunately, I can't
post the full standard here due to its license, but I think I can cite
the relevant part.

1) If JM specifies double, then For all j, 1 (one) ≤ j ≤ n,
Case:
a) If Ij is not a number or character string, then let ST be data
exception — non-numeric SQL/JSON item.
b) Otherwise, let X be an SQL variable whose value is Ij. Let Vj be
the result of
CAST (X AS DOUBLE PRECISION)
If this conversion results in an exception condition, then let ST be
that exception condition.

So, when we apply the .double() method to a string, the result
should be the same as if we cast the string to double in SQL.  In SQL we
convert the string 'NaN' to numeric NaN.  So, the standard requires us to do
the same in SQL/JSON.
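
For illustration, here is what that reading gives today (a sketch; per the
description above, the numeric NaN produced by .double() is serialized back
as the jsonb string "NaN"):

SELECT jsonb_path_query('"NaN"', '$.double()');
-- returns the jsonb string "NaN" (one row)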

I haven't yet found what the standard says about serializing NaNs back
to JSON.  I'll keep you posted.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: jsonpath versus NaN

2020-06-11 Thread Tom Lane
Alexander Korotkov  writes:
> On Thu, Jun 11, 2020 at 3:45 PM Tom Lane  wrote:
>> It is entirely clear from the code, the documentation,
>> and the relevant RFCs that JSONB does not allow NaNs as numeric
>> values.

> The JSONB itself doesn't store number NaNs.  It stores the string "NaN".

Yeah, but you have a numeric NaN within the JsonbValue tree between
executeItemOptUnwrapTarget and convertJsonbScalar.  Who's to say that
that illegal-per-the-data-type structure won't escape to somewhere else?
Or perhaps more likely, that we'll need additional warts in other random
places in the JSON code to keep from spitting up on the transiently
invalid structure.

> I found the relevant part of the standard.  Unfortunately, I can't
> post the full standard here due to its license, but I think I can cite
> the relevant part.

I don't think this is very relevant.  The SQL standard has not got the
concepts of Inf or NaN either (see 4.4.2 Characteristics of numbers),
therefore their definition is only envisioning that a string representing
a normal finite number should be castable to DOUBLE PRECISION.  Thus,
both of the relevant standards think that "numbers" are just finite
numbers.

So when neither JSON nor SQL consider that "NaN" is an allowed sort
of number, why are you doing violence to the code to allow it in a
jsonpath?  And if you insist on doing such violence, why didn't you
do some more and kluge it to the point where "Inf" would work too?
(It would require slightly less klugery in the wake of the infinities-
in-numeric patch that I'm going to post soon ... but that doesn't make
it a good idea.)

regards, tom lane




Re: new heapcheck contrib module

2020-06-11 Thread Mark Dilger


> On Jun 11, 2020, at 9:14 AM, Dilip Kumar  wrote:
> 
> I have just browsed through the patch and the idea is quite
> interesting.  I think we can expand it to check whether the flags
> set in the infomask are sane or not w.r.t. other flags and xid status.
> Some examples are
> 
> - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED
> should not be set in new_infomask2.
> - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we
> actually cross verify the transaction status from the CLOG and check
> whether is matching the hint bit or not.
> 
> While browsing through the code I could not find that we are doing
> this kind of check,  ignore if we are already checking this.

Thanks for taking a look!

Having both of those bits set simultaneously appears to fall into a different 
category than what I wrote verify_heapam.c to detect.  It doesn't violate any 
assertion in the backend, nor does it cause the code to crash.  (At least, I 
don't immediately see how it does either of those things.)  At first glance it 
appears invalid to have those bits both set simultaneously, but I'm hesitant to 
enforce that without good reason.  If it is a good thing to enforce, should we 
also change the backend code to Assert?

I integrated your idea into one of the regression tests.  It now sets these two 
bits in the header of one of the rows in a table.  The verify_heapam check 
output (which includes all detected corruptions) does not change, which 
verifies your observation that verify_heapam is not checking for this.  I've 
attached that as a patch to this email.  Note that this patch should be applied 
atop the v6 patch recently posted in another email.
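
For reference, that combination is easy to look for with pageinspect (a
sketch; the masks are HEAP_XMAX_LOCK_ONLY = 0x0080 in t_infomask and
HEAP_KEYS_UPDATED = 0x2000 in t_infomask2, per htup_details.h):

-- requires: CREATE EXTENSION pageinspect;
SELECT lp, t_infomask, t_infomask2
FROM heap_page_items(get_raw_page('corrupted', 0))
WHERE (t_infomask  & x'0080'::int) <> 0    -- HEAP_XMAX_LOCK_ONLY
  AND (t_infomask2 & x'2000'::int) <> 0;   -- HEAP_KEYS_UPDATED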



WIP_dilip_kumar_idea.patch
Description: Binary data
 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: Atomic operations within spinlocks

2020-06-11 Thread Robert Haas
On Thu, Jun 11, 2020 at 1:26 PM Andres Freund  wrote:
> Well, pss_barrierCheckMask member is just 32bit, so it seems odd to
> declare the local variable 64bit?
>
> uint64  flagbit = UINT64CONST(1) << (uint64) type;
> ...
> pg_atomic_fetch_or_u32(&slot->pss_barrierCheckMask, flagbit);

Oops.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] fix two shadow vars (src/backend/commands/sequence.c)

2020-06-11 Thread Alvaro Herrera
On 2020-Jun-11, Ranier Vilela wrote:

> Hi,
> src/backend/commands/sequence.c
> Has two shadows (buf var), with two unnecessary variables declared.

These are not unnecessary -- removing them breaks translatability of
those messages.  If these were ssize_t you could use '%zd' (see commit
ac4ef637ad2f) but I don't think you can in this case.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-06-11 Thread Jeff Davis
On Wed, 2020-06-10 at 11:39 -0700, Jeff Davis wrote:
> 1. Testing the spilling of hashed grouping sets: I'm inclined to just
> get rid of enable_groupingsets_hash_disk and use Melanie's stats-
> hacking approach instead.

Fixed in 92c58fd9.

> think the names you suggested quite fit, but the idea to use a more
> interesting GUC value might help express the behavior. Perhaps making
> enable_hashagg a ternary "enable_hashagg=on|off|avoid_disk"? The word
> "reject" is too definite for the planner, which is working with
> imperfect information.

I renamed enable_hashagg_disk to hashagg_avoid_disk_plan, which I think
satisfies the concerns raised here. Also in 92c58fd9.

There is still the original topic of this thread, which is whether we
need to change the default value of this GUC, or penalize disk-based
HashAgg in some way, to be more conservative about plan changes in v13.
I think we can wait a little longer to make a decision there.

Regards,
Jeff Davis






extensible options syntax for replication parser?

2020-06-11 Thread Robert Haas
Hi,

It seems to me that we're making the same mistake with the replication
parser that we've made in various places in the regular parser: using a
syntax for options that requires that every potential option be a
keyword, and every potential option requires modification of the
grammar. In particular, the syntax for the BASE_BACKUP command has
accreted a whole lot of cruft already and I think that trend is likely
to continue. I don't think that trying to keep people from adding
options is a good strategy, so instead I'd like to have a better way
to do it. Attached is v1 of a patch to refactor things so that parts
of the BASE_BACKUP and CREATE_REPLICATION_SLOT are replaced with a
flexible options syntax. There are some debatable decisions here, so
I'd be happy to get some feedback on whether to go further with this,
or less far, or maybe even just abandon the idea altogether. I doubt
the last one is the right course, though: ISTM something's got to be
done about the BASE_BACKUP case, at least.

This version of the patch does not include documentation, but
hopefully it's fairly clear from reading the code what it's all about.
If people agree with the basic approach, I'll write docs. The
intention is that we'd continue to support the existing syntax for the
existing options, but the client tools would be adjusted to use the
new syntax if the server's new enough, and any new options would be
supported only through the new syntax. At some point in the distant
future we could retire the old syntax, when we've stopped caring about
compatibility with pre-14 versions.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v1-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch
Description: Binary data


Re: Parallel Seq Scan vs kernel read ahead

2020-06-11 Thread David Rowley
On Thu, 11 Jun 2020 at 23:35, Amit Kapila  wrote:
> Another point I am thinking about is that whatever formula we come up with
> here might not be a good fit for every case.  For example, as you mentioned
> above, a larger step-size can impact performance depending on the
> qualification; similarly there could be other things, like a target list or
> qual containing some function which takes more time for certain tuples and
> less for others, especially if function evaluation depends on some column
> values.  So, can we think of providing a rel_option for step-size?

I think someone at some point is not going to like the automatic
choice. So perhaps a reloption to allow users to override it is a
good idea. -1 should most likely mean use the automatic choice based
on relation size.  I think parallel seq scans that filter out a large
portion of the records most likely need some sort of index, but there
are perhaps some genuine cases for not having one, e.g. perhaps the
query is just not run often enough for an index to be worthwhile. In
that case, the performance is likely less critical, but at least the
reloption would allow users to get the old behaviour.
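
If we go that way, usage might look something like this (purely
illustrative; the reloption and its name are hypothetical, nothing of the
sort exists yet):

-- hypothetical reloption, name invented for illustration only
ALTER TABLE big_table SET (parallel_scan_step_size = -1);    -- automatic choice based on relation size
ALTER TABLE big_table SET (parallel_scan_step_size = 1024);  -- fixed chunk of 1024 blocks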

David




Re: Recording test runtimes with the buildfarm

2020-06-11 Thread Andrew Dunstan


On 6/11/20 12:32 PM, Tom Lane wrote:
> Magnus Hagander  writes:
>> On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
>> andrew.duns...@2ndquadrant.com> wrote:
>>> Yeah, we'll need to work out where to stash the file. The client will
>>> pick up anything in src/regress/log for "make check", but would need
>>> adjusting for other steps that invoke pg_regress. I'm getting close to
>>> cutting a new client release, but I can delay it till we settle this.
>> It seems pretty trivial to for example get all the steps out of check.log
>> and their timing with a regexp.
> Yeah, I don't see why we can't scrape this data from the existing
> buildfarm output, at least for the core regression tests.  New
> infrastructure could make it easier/cheaper, but I don't think we
> should invest in that until we've got a provably useful tool.


OK, that makes my life fairly easy. Are you wanting this to be
automated, or just left as an exercise for researchers for now?


cheers


andrew


-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Find query characters in respect of optimizer for develop purpose

2020-06-11 Thread Melanie Plageman
On Mon, May 18, 2020 at 1:30 AM Andy Fan  wrote:

> Hello:
>
> Before I pay attention to some optimizer features, I want to estimate how
> much benefit they can create for customers, at least for our currently
> running customers. So I want to have some basic idea of what kind of
> queries are running now with respect to the optimizer.
>
>
You are imagining this to be collected during planning on a live
customer system as a form of telemetry?
I was inspired to search the hackers mailing list archive for the word
"telemetry" and didn't get many hits, which surprised me.


>
> My basic idea is that we can track it with the struct below (every backend
> has one global variable to record it).
>
> +typedef struct
> +{
> +   int subplan_count;
> +   int subquery_count;
> +   int join_count;
> +   bool hasagg;
> +   bool hasgroup;
> +} QueryCharacters;
>
> It will be reset at the beginning of standard_planner, and the values are
> increased in make_subplan, set_subquery_pathlist, make_one_rel, and
> create_grouping_paths. Later it can be tracked and viewed in
> pg_stat_statements.
>
>
I think the natural reaction to this idea is: isn't there a 3rd party
tool that does this? Or can't you use one of the hooks and write an
extension, to, for example, examine the parse,query,and plan trees?

However, it does seem like keeping track of this information would be
much easier during planning since planner will be examining the query
tree and making the plan anyway.

On the other hand, I think that depends a lot on what specific
information you want to collect. Out of the fields you listed, it is
unclear what some of them would mean.
Does join_count count the number of explicit joins in the original query
or does it count the number of joins in the final plan? Does
subquery_count count all sub-selects in the original query or does it
only count subqueries that become SubqueryScans or SubPlans? What about
subqueries that become InitPlans?

One concern I have is that it seems like this struct would have to be
updated throughout planning and that it would be easy to break it with
the addition of new code. Couldn't every new optimization added to
planner potentially affect the accuracy of the information in the
struct?

-- 
Melanie Plageman


Re: Recording test runtimes with the buildfarm

2020-06-11 Thread Tom Lane
Andrew Dunstan  writes:
> On 6/11/20 12:32 PM, Tom Lane wrote:
>> Yeah, I don't see why we can't scrape this data from the existing
>> buildfarm output, at least for the core regression tests.  New
>> infrastructure could make it easier/cheaper, but I don't think we
>> should invest in that until we've got a provably useful tool.

> OK, that makes my life fairly easy. Are you wanting this to be
> automated, or just left as an exercise for researchers for now?

I'm envisioning the latter --- I think somebody should prove that
useful results can be obtained before we spend any effort on making it
easier to gather the input numbers.  Mind you, I'm not saying that
I don't believe it's possible to get good results.  But building
infrastructure in advance of a solid use-case is a recipe for
building the wrong thing.

regards, tom lane




Re: [PATCH] fix two shadow vars (src/backend/commands/sequence.c)

2020-06-11 Thread Ranier Vilela
Em qui., 11 de jun. de 2020 às 17:19, Alvaro Herrera <
alvhe...@2ndquadrant.com> escreveu:

> On 2020-Jun-11, Ranier Vilela wrote:
>
> > Hi,
> > src/backend/commands/sequence.c
> > Has two shadows (buf var), with two unnecessary variables declared.
>
> These are not unnecessary -- removing them breaks translatability of
> those messages.  If these were ssize_t you could use '%zd' (see commit
> ac4ef637ad2f) but I don't think you can in this case.
>
Hi Alvaro, thanks for reply.

File backend\utils\sort\tuplesort.c:
elog(LOG, "worker %d using " INT64_FORMAT " KB of memory for read buffers
among %d input tapes",
File backend\storage\ipc\shm_toc.c:
elog(ERROR, "could not find key " UINT64_FORMAT " in shm TOC at %p",
File backend\storage\large_object\inv_api.c:
* use errmsg_internal here because we don't want to expose INT64_FORMAT
errmsg_internal("invalid large object seek target: " INT64_FORMAT,

elog and errmsg_internal permit the usage proposed by the patch;
does that mean that errmsg does not allow it and does not do the same job as
snprintf?

regards,
Ranier Vilela


Re: [PATCH] Leading minus for negative time interval in ISO 8601

2020-06-11 Thread Bruce Momjian
On Tue, Jun  9, 2020 at 11:18:20PM -0500, Mikhail Titov wrote:
> On Wed, Jun  3, 2020 at 11:25 PM, Tom Lane  wrote:
> > ...
> > Maybe we should just take the position that negative intervals aren't
> > standardized, and if you want to transport them using ISO format then
> > you first need to lobby ISO to fix that.
> 
> Apparently ISO did "fix" this. I managed to get a copy of ISO
> 8601-2:2019(E) and I insist on reconsidering the patch. Here is an
> excerpt from page 12 of the standard:

This shows the problem of trying to honor a standard which is not
publicly available.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [PATCH] fix two shadow vars (src/backend/commands/sequence.c)

2020-06-11 Thread Tom Lane
Ranier Vilela  writes:
> elog and errmsg_internal, permits use as proposed by the patch,
> does it mean that errmsg, does not allow and does not do the same job as
> snprintf?

Yes.  errmsg() strings are captured for translation.  If they contain
platform-dependent substrings, that's a problem, because only one variant
will get captured.  And INT64_FORMAT is platform-dependent.

We have of late decided that it's safe to use %lld (or %llu) to format
int64s everywhere, but you then have to cast the printf argument to
match that explicitly.  See commit 6a1cd8b92 for precedent.

regards, tom lane




exp() versus the POSIX standard

2020-06-11 Thread Tom Lane
The POSIX standard says this about the exp(3) function:

If x is -Inf, +0 shall be returned.

At least on my Linux box, our version does no such thing:

regression=# select exp('-inf'::float8);
ERROR:  value out of range: underflow

Does anyone disagree that that's a bug?  Should we back-patch
a fix, or just change it in HEAD?  Given the lack of user
complaints, I lean a bit towards the latter, but am not sure.

regards, tom lane




Re: exp() versus the POSIX standard

2020-06-11 Thread Tom Lane
I wrote:
> The POSIX standard says this about the exp(3) function:
>   If x is -Inf, +0 shall be returned.
> At least on my Linux box, our version does no such thing:
> regression=# select exp('-inf'::float8);
> ERROR:  value out of range: underflow

Now that I look, power() has similar issues:

regression=# select power('1.1'::float8, '-inf');
ERROR:  value out of range: underflow
regression=# select power('0.1'::float8, 'inf');
ERROR:  value out of range: underflow
regression=# select power('-inf'::float8, '-3');
ERROR:  value out of range: underflow
regression=# select power('-inf'::float8, '-4');
ERROR:  value out of range: underflow

contradicting POSIX which says

For |x| > 1, if y is -Inf, +0 shall be returned.

For |x| < 1, if y is +Inf, +0 shall be returned.

For y an odd integer < 0, if x is -Inf, -0 shall be returned.

For y < 0 and not an odd integer, if x is -Inf, +0 shall be returned.
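
Put in terms of the queries above, a POSIX-conforming implementation would
be expected to return the following (a sketch of expected results, not what
the current code does):

SELECT exp('-inf'::float8);            -- expected: 0
SELECT power('1.1'::float8, '-inf');   -- expected: 0   (|x| > 1, y = -Inf)
SELECT power('0.1'::float8, 'inf');    -- expected: 0   (|x| < 1, y = +Inf)
SELECT power('-inf'::float8, '-3');    -- expected: -0  (x = -Inf, y odd integer < 0)
SELECT power('-inf'::float8, '-4');    -- expected: 0   (x = -Inf, y < 0 and not odd)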

regards, tom lane




Re: exp() versus the POSIX standard

2020-06-11 Thread Komяpa
On Fri, 12 Jun 2020 at 02:57, Tom Lane  wrote:

> I wrote:
> > The POSIX standard says this about the exp(3) function:
> >   If x is -Inf, +0 shall be returned.
> > At least on my Linux box, our version does no such thing:
> > regression=# select exp('-inf'::float8);
> > ERROR:  value out of range: underflow
>
> Now that I look, power() has similar issues:
>
> regression=# select power('1.1'::float8, '-inf');
> ERROR:  value out of range: underflow
> regression=# select power('0.1'::float8, 'inf');
> ERROR:  value out of range: underflow
> regression=# select power('-inf'::float8, '-3');
> ERROR:  value out of range: underflow
> regression=# select power('-inf'::float8, '-4');
> ERROR:  value out of range: underflow
>
> contradicting POSIX which says
>
> For |x| > 1, if y is -Inf, +0 shall be returned.
>
> For |x| < 1, if y is +Inf, +0 shall be returned.
>
> For y an odd integer < 0, if x is -Inf, -0 shall be returned.
>
> For y < 0 and not an odd integer, if x is -Inf, +0 shall be returned.
>


I've had the same issue with multiplying two tiny numbers. SELECT
2e-300::float * 2e-300::float gives an underflow error, and that is not a
wanted thing. This looks like a handmade implementation of IEEE 754's
underflow exception, which should be an optional status flag alongside a
well-defined result, but it became a stop-the-world error instead. In the
past I had to build a custom Postgres with that logic ripped out just to be
able to multiply numbers. I will be happy if that "underflow" (and overflow)
behavior is removed.

If in doubt about whether this exception should be removed: to follow the
spec fully in this way, you would also have to raise an exception on any
inexact result of float operations.







> regards, tom lane
>
>
>


Infinities in type numeric

2020-06-11 Thread Tom Lane
We had a discussion recently about how it'd be a good idea to support
infinity values in type numeric [1].  Here's a draft patch enabling
that, using the idea suggested in that thread of commandeering some
unused bits in the representation of numeric NaNs.  AFAICT we've been
careful to ensure those bits are always zero, so that this will work
without creating any pg_upgrade problems.

This is just WIP, partly because I haven't touched the SGML docs
yet, but also because there are some loose ends to be resolved:

* I believe I made all the functions that correspond to POSIX-standard
functions do what POSIX says for infinite inputs.  However, this does
not always match what our existing float8 functions do [2].  I'm
assuming that we'll change those functions to match POSIX; but if we
don't, this might need another look.

* I had to invent some semantics for non-standardized functions,
particularly numeric_mod, numeric_gcd, numeric_lcm.  This area
could use review to be sure that I chose desirable behaviors.

* I'm only about 50% sure that I understand what the sort abbreviation
code is doing.  A quick look from Peter or some other expert would be
helpful.

* It seems to me that the existing behavior of numeric_stddev_internal
is not quite right for the case of a single input value that is a NaN,
when in "sample" mode.  Per the comment "Sample stddev and variance are
undefined when N <= 1", ISTM that we ought to return NULL in this case,
but actually you get a NaN because the check for "NaNcount > 0" is made
before considering that.  I think that's the wrong way round --- in some
sense NULL is "less defined" than NaN, so that's what we ought to use.
Moreover, the float8 stddev code agrees: in HEAD you get

regression=# SELECT stddev_samp('nan'::float8);
 stddev_samp 
-------------

(1 row)

regression=# SELECT stddev_samp('nan'::numeric);
 stddev_samp 
-------------
 NaN
(1 row)

So I think we ought to make the numeric code match the former, and have
done that here.  However, the float8 code has its own issues for the
population case [3], and depending on what we do about that, this might
need further changes to agree.  (There's also the question of whether to
back-patch any such bug fixes.)

* The jsonpath code is inconsistent about how it handles NaN vs Inf [4].
I'm assuming here that we'll fix that by rejecting NaNs in that code,
but if we conclude that we do need to allow non-finite double values
there, probably we need to allow Infs too.

* It seems like there might be a use-case for isfinite() and maybe
isnan() SQL functions.  On the other hand, we don't have those for
float4/float8 either.  These could be a follow-on addition, anyway.

I'll stick this in the queue for review.

regards, tom lane

[1] https://www.postgresql.org/message-id/27490.1590414212%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/flat/582552.1591917752%40sss.pgh.pa.us
[3] https://www.postgresql.org/message-id/353062.1591898766%40sss.pgh.pa.us
[4] https://www.postgresql.org/message-id/flat/203949.1591879542%40sss.pgh.pa.us

diff --git a/contrib/jsonb_plperl/jsonb_plperl.c b/contrib/jsonb_plperl/jsonb_plperl.c
index ed361efbe2..b81ba54b80 100644
--- a/contrib/jsonb_plperl/jsonb_plperl.c
+++ b/contrib/jsonb_plperl/jsonb_plperl.c
@@ -227,10 +227,8 @@ SV_to_JsonbValue(SV *in, JsonbParseState **jsonb_state, bool is_elem)
 /*
  * jsonb doesn't allow infinity or NaN (per JSON
  * specification), but the numeric type that is used for the
- * storage accepts NaN, so we have to prevent it here
- * explicitly.  We don't really have to check for isinf()
- * here, as numeric doesn't allow it and it would be caught
- * later, but it makes for a nicer error message.
+ * storage accepts those, so we have to reject them here
+ * explicitly.
  */
 if (isinf(nval))
 	ereport(ERROR,
diff --git a/contrib/jsonb_plpython/jsonb_plpython.c b/contrib/jsonb_plpython/jsonb_plpython.c
index e09308daf0..836c178770 100644
--- a/contrib/jsonb_plpython/jsonb_plpython.c
+++ b/contrib/jsonb_plpython/jsonb_plpython.c
@@ -387,14 +387,17 @@ PLyNumber_ToJsonbValue(PyObject *obj, JsonbValue *jbvNum)
 	pfree(str);
 
 	/*
-	 * jsonb doesn't allow NaN (per JSON specification), so we have to prevent
-	 * it here explicitly.  (Infinity is also not allowed in jsonb, but
-	 * numeric_in above already catches that.)
+	 * jsonb doesn't allow NaN or infinity (per JSON specification), so we
+	 * have to reject those here explicitly.
 	 */
 	if (numeric_is_nan(num))
 		ereport(ERROR,
 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
  errmsg("cannot convert NaN to jsonb")));
+	if (numeric_is_inf(num))
+		ereport(ERROR,
+(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("cannot convert infinity to jsonb")));
 
 	jbvNum->type = jbvNumeric;
 	jbvNum->val.numeric = num;
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 16768b28c3

Re: [PATCH] fix two shadow vars (src/backend/commands/sequence.c)

2020-06-11 Thread Ranier Vilela
On Thu, Jun 11, 2020 at 7:54 PM Tom Lane  wrote:

> Ranier Vilela  writes:
> > elog and errmsg_internal, permits use as proposed by the patch,
> > does it mean that errmsg, does not allow and does not do the same job as
> > snprintf?
>
> Yes.  errmsg() strings are captured for translation.  If they contain
> platform-dependent substrings, that's a problem, because only one variant
> will get captured.  And INT64_FORMAT is platform-dependent.
>
> We have of late decided that it's safe to use %lld (or %llu) to format
> int64s everywhere, but you then have to cast the printf argument to
> match that explicitly.  See commit 6a1cd8b92 for precedent.
>
Hi Tom, thank you for the detailed explanation.

I looked at commit 6a1cd8b92, and I think this is the same case as
basebackup.c (total_checksum_failures): maxv and minv are int64
(INT64_FORMAT).

%lld -> (long long int) maxv
%lld -> (long long int) minv

Attached new patch, with fixes from commit 6a1cd8b92.
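
For anyone following along, the pattern from 6a1cd8b92 applied to the two
shadowed buffers looks roughly like this (an illustrative sketch, not the
exact hunk from the attached patch; the error code and message text are
placeholders):

    -   char        bufm[100], bufx[100];
    -
    -   snprintf(bufm, sizeof(bufm), INT64_FORMAT, minv);
    -   snprintf(bufx, sizeof(bufx), INT64_FORMAT, maxv);
    -   ereport(ERROR,
    -           (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
    -            errmsg("MINVALUE (%s) must be less than MAXVALUE (%s)",
    -                   bufm, bufx)));
    +   ereport(ERROR,
    +           (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
    +            errmsg("MINVALUE (%lld) must be less than MAXVALUE (%lld)",
    +                   (long long) minv, (long long) maxv)));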

regards,
Ranier Vilela


>
> regards, tom lane
>


fix_shadows_buf_var_v2.patch
Description: Binary data


Re: exp() versus the POSIX standard

2020-06-11 Thread Tom Lane
Darafei "Komяpa" Praliaskouski  writes:
> I've had the same issue with multiplying two tiny numbers. Select
> 2e-300::float * 2e-300::float gives an underflow, and it is not a wanted
> thing. This looks like handmade implementation of IEEE754's underflow
> exception that should be an optional return flag in addition to well
> defined number, but became a stop-the-world exception instead.

Solving that problem is very far outside the scope of what I'm interested
in here.  I think that we'd probably regret it if we try to support IEEE
subnormals, for example --- I know that all modern hardware is probably
good with those, but I'd bet against different platforms' libc functions
all behaving the same.  I don't see a sane way to offer user control over
whether we throw underflow errors or not, either.  (Do you really want "+"
to stop being immutable?)  The darker corners of IEEE754, like inexactness
exceptions, are even less likely to be implemented consistently
everywhere.

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-11 Thread Masahiko Sawada
On Thu, 11 Jun 2020 at 22:21, Amit Kapila  wrote:
>
> On Fri, Jun 5, 2020 at 3:16 PM Masahiko Sawada
>  wrote:
> >
> > On Thu, 4 Jun 2020 at 12:46, Amit Kapila  wrote:
> > >
> > >
> > > +
> > > + This parameter can be changed at any time; the behavior for any 
> > > one
> > > + transaction is determined by the setting in effect when it 
> > > commits.
> > > +
> > >
> > > This is written w.r.t foreign_twophase_commit.  If one changes this
> > > between prepare and commit, will it have any impact?
> >
> > Since the distributed transaction commit automatically uses 2pc when
> > executing COMMIT, it's not possible to change foreign_twophase_commit
> > between prepare and commit. So I'd like to explain the case where a
> > user executes PREPARE and then COMMIT PREPARED while changing
> > foreign_twophase_commit.
> >
> > PREPARE can run only when foreign_twophase_commit is 'required' (or
> > 'prefer') and all foreign servers involved with the transaction
> > support 2pc. We prepare all foreign transactions no matter what the
> > number of servers and modified or not. If either
> > foreign_twophase_commit is 'disabled' or the transaction modifies data
> > on a foreign server that doesn't support 2pc, it raises an error. At
> > COMMIT (or ROLLBACK) PREPARED, similarly foreign_twophase_commit needs
> > to be set to 'required'. It raises an error if the distributed
> > transaction has a foreign transaction and foreign_twophase_commit is
> > 'disabled'.
> >
>
> So, IIUC, it will raise an error if foreign_twophase_commit is
> 'disabled' (or one of the foreign server involved doesn't support 2PC)
> and the error can be raised both when user issues PREPARE or COMMIT
> (or ROLLBACK) PREPARED.  If so, isn't it strange that we raise such an
> error after PREPARE?  What kind of use-case required this?
>

I don't have a concrete use-case, but the reason it raises an error when a
user who has set foreign_twophase_commit to 'disabled' executes COMMIT (or
ROLLBACK) PREPARED within a transaction involving at least one foreign
server is that I wanted it to behave in a similar way to the plain COMMIT
case. I mean, if a user executes just COMMIT, the distributed transaction
is committed in two phases, but the value of foreign_twophase_commit cannot
change between those two phases. So I wanted to require the user to set
foreign_twophase_commit to 'required' both when executing PREPARE and when
executing COMMIT (or ROLLBACK) PREPARED. The implementation can also become
simpler because we can assume that foreign_twophase_commit is always
enabled when a transaction requires foreign transaction preparation and
resolution.
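
To make that concrete with a hypothetical session under the proposed
behavior (assuming ft1 is a foreign table on a server whose FDW supports
2PC):

    SET foreign_twophase_commit = 'required';
    BEGIN;
    UPDATE ft1 SET val = val + 1;
    PREPARE TRANSACTION 'fx1';     -- the foreign transaction is prepared too
    SET foreign_twophase_commit = 'disabled';
    COMMIT PREPARED 'fx1';         -- raises an error under the proposed patch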

> >
> > >
> > > 4.
> > > +  in_doubt
> > > +  boolean
> > > +  
> > > +  
> > > +   If true this foreign transaction is
> > > in-doubt status and
> > > +   needs to be resolved by calling 
> > > pg_resolve_fdwxact
> > > +   function.
> > > +  
> > >
> > > It would be better if you can add an additional sentence to say when
> > > and or how can foreign transactions reach in-doubt state.
> > >
>
> +   If true this foreign transaction is in-doubt 
> status.
> +   A foreign transaction becomes in-doubt status when user canceled the
> +   query during transaction commit or the server crashed during 
> transaction
> +   commit.
>
> Can we reword the second sentence as: "A foreign transaction can have
> this status when the user has cancelled the statement or the server
> crashes during transaction commit."?

Agreed. Updated in my local branch.

>  I have another question about
> this field, why can't it be one of the status ('preparing',
> 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
> separate field?

Because I'm also using the in-doubt field to check whether the foreign
transaction entry can be resolved manually, i.e. via
pg_resolve_foreign_xact(). For instance, a foreign transaction with
status = 'prepared' and in-doubt = 'true' can be resolved either by a
foreign transaction resolver or by pg_resolve_foreign_xact(). When a user
executes pg_resolve_foreign_xact() against the foreign transaction, it
sets status = 'committing' (or 'rollbacking') by checking the transaction
status in the clog. The user might cancel pg_resolve_foreign_xact() during
resolution. In that case, the foreign transaction still has status =
'committing' and in-doubt = 'true'. Then, if a foreign transaction
resolver process processes the foreign transaction, it can commit it
without looking at the clog.

> Also, isn't it more suitable to name 'status' field
> as 'state' because these appear to be more like different states of
> transaction?

Agreed.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: doc review for v13

2020-06-11 Thread Justin Pryzby
Some new bits,
And some old ones.

-- 
Justin
>From b74bef6f9d36a2fac71045ab707ed4be65730ba7 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 29 Apr 2020 13:13:29 -0500
Subject: [PATCH v5 1/5] Fix comments for WITH OIDs, removed at 578b22971

Previously mentioned here:
https://www.postgresql.org/message-id/20200429194622.ga25...@telsasoft.com
---
 src/backend/access/common/tupdesc.c | 4 +---
 src/backend/access/transam/varsup.c | 3 +--
 src/backend/commands/tablecmds.c| 5 ++---
 3 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 1e743d7d86..ce84e22cbd 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -786,9 +786,7 @@ TupleDescInitEntryCollation(TupleDesc desc,
  *
  * Given a relation schema (list of ColumnDef nodes), build a TupleDesc.
  *
- * Note: the default assumption is no OIDs; caller may modify the returned
- * TupleDesc if it wants OIDs.  Also, tdtypeid will need to be filled in
- * later on.
+ * tdtypeid will need to be filled in later on.
  */
 TupleDesc
 BuildDescForRelation(List *schema)
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index e14b53bf9e..8328b8050a 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -527,8 +527,7 @@ GetNewObjectId(void)
 	 * The first time through this routine after normal postmaster start, the
 	 * counter will be forced up to FirstNormalObjectId.  This mechanism
 	 * leaves the OIDs between FirstBootstrapObjectId and FirstNormalObjectId
-	 * available for automatic assignment during initdb, while ensuring they
-	 * will never conflict with user-assigned OIDs.
+	 * available for automatic assignment during initdb.
 	 */
 	if (ShmemVariableCache->nextOid < ((Oid) FirstNormalObjectId))
 	{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2ab02e01a0..35945bce73 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4829,7 +4829,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
 			continue;
 
 		/*
-		 * If we change column data types or add/remove OIDs, the operation
+		 * If we change column data types, the operation
 		 * has to be propagated to tables that use this table's rowtype as a
 		 * column type.  tab->newvals will also be non-NULL in the case where
 		 * we're adding a column with a default.  We choose to forbid that
@@ -4853,8 +4853,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
 
 		/*
 		 * We only need to rewrite the table if at least one column needs to
-		 * be recomputed, we are adding/removing the OID column, or we are
-		 * changing its persistence.
+		 * be recomputed, or we are changing its persistence.
 		 *
 		 * There are two reasons for requiring a rewrite when changing
 		 * persistence: on one hand, we need to ensure that the buffers
-- 
2.17.0

>From eb85fabed31d7c3a8fa1396ddfd2ea733c5c8baf Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 10 May 2020 17:30:14 -0500
Subject: [PATCH v5 2/5] Remove mention of double timestamps..

..which were removed in v10.

This was previously discussed here:
https://www.postgresql.org/message-id/20191230074721.GM12890%40telsasoft.com
---
 src/backend/utils/adt/selfuncs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 0f02f32ffd..be08eb4814 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -4018,8 +4018,8 @@ estimate_multivariate_ndistinct(PlannerInfo *root, RelOptInfo *rel,
  * to be treated separately.
  *
  * The several datatypes representing absolute times are all converted
- * to Timestamp, which is actually a double, and then we just use that
- * double value.  Note this will give correct results even for the "special"
+ * to Timestamp, which is actually an int64, and then we promote that to
+ * a double.  Note this will give correct results even for the "special"
  * values of Timestamp, since those are chosen to compare correctly;
  * see timestamp_cmp.
  *
-- 
2.17.0

>From c0501535b1ec3709abb8334ddd29f2603bece5ca Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 3 May 2020 19:57:32 -0500
Subject: [PATCH v5 3/5] possessive: its (and more)

---
 src/backend/replication/logical/origin.c | 2 +-
 src/backend/utils/mmgr/README| 2 +-
 src/bin/pg_dump/pg_backup_archiver.c | 6 +++---
 src/bin/pg_dump/pg_backup_custom.c   | 8 +++-
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 1ca4479605..dec9e95119 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -427,7 +427,7 @@ restart:
 
 
 /*
- * Lookup replication ori

Re: Default setting for enable_hashagg_disk

2020-06-11 Thread Justin Pryzby
On Thu, Jun 11, 2020 at 01:22:57PM -0700, Jeff Davis wrote:
> On Wed, 2020-06-10 at 11:39 -0700, Jeff Davis wrote:
> > 1. Testing the spilling of hashed grouping sets: I'm inclined to just
> > get rid of enable_groupingsets_hash_disk and use Melanie's stats-
> > hacking approach instead.
> 
> Fixed in 92c58fd9.
> 
> > think the names you suggested quite fit, but the idea to use a more
> > interesting GUC value might help express the behavior. Perhaps making
> > enable_hashagg a ternary "enable_hashagg=on|off|avoid_disk"? The word
> > "reject" is too definite for the planner, which is working with
> > imperfect information.
> 
> I renamed enable_hashagg_disk to hashagg_avoid_disk_plan, which I think
> satisfies the concerns raised here. Also in 92c58fd9.

Thanks for considering :)

I saw you updated the Open Items page, but put the items into "Older Bugs /
Fixed".

I moved them underneath "Resolved" since they're all new in v13.
https://wiki.postgresql.org/index.php?title=PostgreSQL_13_Open_Items&diff=34995&oldid=34994

-- 
Justin




Re: exp() versus the POSIX standard

2020-06-11 Thread Komяpa
Hi,

On Fri, Jun 12, 2020 at 4:25 AM Tom Lane  wrote:

> Darafei "Komяpa" Praliaskouski  writes:
> > I've had the same issue with multiplying two tiny numbers. Select
> > 2e-300::float * 2e-300::float gives an underflow, and it is not a wanted
> > thing. This looks like handmade implementation of IEEE754's underflow
> > exception that should be an optional return flag in addition to well
> > defined number, but became a stop-the-world exception instead.
>
> Solving that problem is very far outside the scope of what I'm interested
> in here.


They're essentially the same issue.

Generally, it exists from the very beginning of git and seems to be a
series of misunderstandings.

Initially, somewhere around 1996, someone thought that a double goes only
from DBL_MIN to DBL_MAX, just like INT_MIN and INT_MAX, while they aren't
exactly that:
https://github.com/postgres/postgres/blame/8fecd4febf8357f3cc20383ed29ced484877d5ac/src/backend/utils/adt/float.c#L525

That logic seems sane in the float4 case (where the computation is done in
64 bit and then checked to fit into 32 bit without overflow).
It feels like the float8 case got there just by copy-paste, but maybe it
was also used to avoid handling NaNs - it's not there in the comparison
functions yet.

Later in 2007 Bruce Momjian removed the limitation on Infinities, but kept
the general structure - now subnormals are accepted, as DBL_MIN is no
longer used, but there is still a check that underflow occurred.
https://github.com/postgres/postgres/commit/f9ac414c35ea084ff70c564ab2c32adb06d5296f#diff-7068290137a01263be83308699042f1fR58
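
For reference, the check being discussed now lives in
src/include/utils/float.h and looks roughly like this (paraphrased from
memory, so treat it as a sketch):

    static inline void
    check_float8_val(const float8 val, const bool inf_is_valid,
                     const bool zero_is_valid)
    {
        if (!inf_is_valid && unlikely(isinf(val)))
            float_overflow_error();
        if (!zero_is_valid && unlikely(val == 0.0))
            float_underflow_error();   /* the underflow error at issue */
    }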



> I think that we'd probably regret it if we try to support IEEE
> subnormals, for example --- I know that all modern hardware is probably
> good with those, but I'd bet against different platforms' libc functions
> all behaving the same.


You don't need to support them. You just have them already.


> I don't see a sane way to offer user control over
> whether we throw underflow errors or not, either.


IEEE754 talks about CPU design. "Exception" there is not a postgres
exception, that's an exceptional case in computation that may raise a flag.
For all those exceptional cases there is a well defined description of what
value should be returned.
https://en.wikipedia.org/wiki/IEEE_754#Exception_handling

The current code looks like a misreading of what an IEEE754 exception is,
but on closer inspection it looks more like a mutated misunderstanding of
what FLT_MIN is for (FLT_TRUE_MIN, which would fit there, unfortunately
only appeared in C11).


> (Do you really want "+" to stop being immutable?)


No, no kind of GUC switch is needed. Just drop underflow/overflow checks.
You'll get 0 or Infinity in expected places, and infinities are okay since
2007.

The darker corners of IEEE754, like inexactness
> exceptions, are even less likely to be implemented consistently
> everywhere.
>
> regards, tom lane
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-11 Thread Amit Kapila
On Thu, Jun 11, 2020 at 7:39 PM Masahiko Sawada
 wrote:
>
> On Thu, 11 Jun 2020 at 20:02, Amit Kapila  wrote:
> >
> > On Thu, Jun 11, 2020 at 3:07 PM Masahiko Sawada
> >  wrote:
> > >
> > > On Thu, 11 Jun 2020 at 18:11, Amit Kapila  wrote:
> > > >
> > > > On Thu, Jun 11, 2020 at 1:46 PM Masahiko Sawada
> > > >  wrote:
> > > > >
> > > > > On Thu, 11 Jun 2020 at 12:30, Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > Now, thinking about this again, I am not sure if these stats are
> > > > > > directly related to slots. These are stats for logical decoding 
> > > > > > which
> > > > > > can be performed either via WALSender or decoding plugin (via APIs).
> > > > > > So, why not have them displayed in a new view like pg_stat_logical 
> > > > > > (or
> > > > > > pg_stat_logical_decoding/pg_stat_logical_replication)?   In future, 
> > > > > > we
> > > > > > will need to add similar stats for streaming of in-progress
> > > > > > transactions as well (see patch 0007-Track-statistics-for-streaming 
> > > > > > at
> > > > > > [1]), so having a separate view for these doesn't sound illogical.
> > > > > >
> > > > >
> > > > > I think we need to decide how long we want to remain these statistics
> > > > > values. That is, if we were to have such pg_stat_logical view, these
> > > > > values would remain until logical decoding finished since I think the
> > > > > view would display only running logical decoding. OTOH, if we were to
> > > > > correspond these stats to slots, these values would remain beyond
> > > > > multiple logical decoding SQL API calls.
> > > > >
> > > >
> > > > I thought of having these till the process that performs these
> > > > operations exist.  So for WALSender, the stats will be valid till it
> > > > is not restarted due to some reason or when performed via backend, the
> > > > stats will be valid till the corresponding backend exits.
> > > >
> > >
> > > The number of rows of that view could be up to (max_backends +
> > > max_wal_senders). Is that right? What if different backends used the
> > > same replication slot one after the other?
> > >
> >
> > Yeah, it would be tricky if multiple slots are used by the same
> > backend.  We could probably track the number of times decoding has
> > happened by the session that will probably help us in averaging the
> > spill amount.  If we think that the aim is to help users to tune
> > logical_decoding_work_mem to avoid frequent spilling or streaming then
> > how would maintaining at slot level will help?
>
> Since the logical decoding intermediate files are written at per slots
> directory, I thought that corresponding these statistics to
> replication slots is also understandable for users.
>

What I wanted to know is how will it help users to tune
logical_decoding_work_mem?  Different backends can process from the
same slot, so it is not clear how user will be able to make any
meaning out of those stats.  OTOH, it is easier to see how to make
meaning of these stats if we display them w.r.t process.  Basically,
we have spill_count and spill_size which can be used to tune
logical_decoding_work_mem and also the activity of spilling happens at
process level, so it sounds like one-to-one mapping.  I am not telling
to rule out maintaining a slot level but trying to see if we can come
up with a clear definition.

> I was thinking
> something like pg_stat_logical_replication_slot view which shows
> slot_name and statistics of only logical replication slots. The view
> always shows rows as many as existing replication slots regardless of
> logical decoding being running. I think there is no big difference in
> how users use these statistics values between maintaining at slot
> level and at logical decoding level.
>
> In logical replication case, since we generally don’t support setting
> different logical_decoding_work_mem per wal senders, every wal sender
> will decode the same WAL stream with the same setting, meaning they
> will similarly spill intermediate files.
>

I am not sure this will be true in every case.  We do have a
slot_advance functionality, so some plugin might use that and that
will lead to different files getting spilled.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel Seq Scan vs kernel read ahead

2020-06-11 Thread Amit Kapila
On Fri, Jun 12, 2020 at 2:24 AM David Rowley  wrote:
>
> On Thu, 11 Jun 2020 at 23:35, Amit Kapila  wrote:
> > Another point I am thinking is that whatever formula we come up here
> > might not be a good fit for every case.  For ex. as you mentioned
> > above that larger step-size can impact the performance based on
> > qualification, similarly there could be other things like having a
> > target list or qual having some function which takes more time for
> > certain tuples and lesser for others especially if function evaluation
> > is based on some column values.  So, can we think of providing a
> > rel_option for step-size?
>
> I think someone at some point is not going to like the automatic
> choice. So perhaps a reloption to allow users to overwrite it is a
> good idea. -1 should most likely mean use the automatic choice based
> on relation size.  I think for parallel seq scans that filter a large
> portion of the records most likely need some sort of index, but there
> are perhaps some genuine cases for not having one. e.g perhaps the
> query is just not run often enough for an index to be worthwhile. In
> that case, the performance is likely less critical, but at least the
> reloption would allow users to get the old behaviour.
>

makes sense to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Default setting for enable_hashagg_disk

2020-06-11 Thread Greg Stark
On Thu, 9 Apr 2020 at 13:24, Justin Pryzby  wrote:

> On Thu, Apr 09, 2020 at 01:48:55PM +0200, Tomas Vondra wrote:
>
> > It it really any different from our enable_* GUCs? Even if you do e.g.
> > enable_sort=off, we may still do a sort. Same for enable_groupagg etc.
>
> Those show that the GUC was disabled by showing disable_cost.  That's
> what's
> different about this one.
>

Fwiw in the past this was seen not so much as a positive thing but a bug to
be fixed. We've talked about carrying a boolean "disabled plan" flag which
would be treated as a large cost penalty but not actually be added to the
cost in the plan.

The problems with the disable_cost in the cost are (at least):

1) It causes the resulting costs to be useless for comparing the plan costs
with other plans.

2) It can cause other planning decisions to be distorted in strange
non-linear ways.
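
For example, assuming a table t with no usable index (a sketch; the exact
numbers depend on the table, but disable_cost is 1.0e10):

    SET enable_seqscan = off;
    EXPLAIN SELECT count(*) FROM t;
    -- the unavoidable Seq Scan is reported with cost >= 10000000000, so the
    -- total cost is useless for comparison and every upper node built on
    -- top of it inherits the distortion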


-- 
greg


Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-11 Thread Amit Kapila
On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada
 wrote:
>
> On Thu, 11 Jun 2020 at 22:21, Amit Kapila  wrote:
> >
>
> >  I have another question about
> > this field, why can't it be one of the status ('preparing',
> > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
> > separate field?
>
> Because I'm using in-doubt field also for checking if the foreign
> transaction entry can also be resolved manually, i.g.
> pg_resolve_foreign_xact(). For instance, a foreign transaction which
> status = 'prepared' and in-doubt = 'true' can be resolved either
> foreign transaction resolver or pg_resolve_foreign_xact(). When a user
> execute pg_resolve_foreign_xact() against the foreign transaction, it
> sets status = 'committing' (or 'rollbacking') by checking transaction
> status in clog. The user might cancel pg_resolve_foreign_xact() during
> resolution. In this case, the foreign transaction is still status =
> 'committing' and in-doubt = 'true'. Then if a foreign transaction
> resolver process processes the foreign transaction, it can commit it
> without clog looking.
>

I think this is a corner case and it is better to simplify the state
recording of foreign transactions then to save a CLOG lookup.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-11 Thread Fujii Masao




On 2020/06/12 12:21, Amit Kapila wrote:

On Thu, Jun 11, 2020 at 7:39 PM Masahiko Sawada
 wrote:


On Thu, 11 Jun 2020 at 20:02, Amit Kapila  wrote:


On Thu, Jun 11, 2020 at 3:07 PM Masahiko Sawada
 wrote:


On Thu, 11 Jun 2020 at 18:11, Amit Kapila  wrote:


On Thu, Jun 11, 2020 at 1:46 PM Masahiko Sawada
 wrote:


On Thu, 11 Jun 2020 at 12:30, Amit Kapila  wrote:



Now, thinking about this again, I am not sure if these stats are
directly related to slots. These are stats for logical decoding which
can be performed either via WALSender or decoding plugin (via APIs).
So, why not have them displayed in a new view like pg_stat_logical (or
pg_stat_logical_decoding/pg_stat_logical_replication)?   In future, we
will need to add similar stats for streaming of in-progress
transactions as well (see patch 0007-Track-statistics-for-streaming at
[1]), so having a separate view for these doesn't sound illogical.



I think we need to decide how long we want to remain these statistics
values. That is, if we were to have such pg_stat_logical view, these
values would remain until logical decoding finished since I think the
view would display only running logical decoding. OTOH, if we were to
correspond these stats to slots, these values would remain beyond
multiple logical decoding SQL API calls.



I thought of having these till the process that performs these
operations exist.  So for WALSender, the stats will be valid till it
is not restarted due to some reason or when performed via backend, the
stats will be valid till the corresponding backend exits.



The number of rows of that view could be up to (max_backends +
max_wal_senders). Is that right? What if different backends used the
same replication slot one after the other?



Yeah, it would be tricky if multiple slots are used by the same
backend.  We could probably track the number of times decoding has
happened by the session that will probably help us in averaging the
spill amount.  If we think that the aim is to help users to tune
logical_decoding_work_mem to avoid frequent spilling or streaming then
how would maintaining at slot level will help?


Since the logical decoding intermediate files are written at per slots
directory, I thought that corresponding these statistics to
replication slots is also understandable for users.



What I wanted to know is how will it help users to tune
logical_decoding_work_mem?  Different backends can process from the
same slot, so it is not clear how user will be able to make any
meaning out of those stats.  OTOH, it is easier to see how to make
meaning of these stats if we display them w.r.t process.  Basically,
we have spill_count and spill_size which can be used to tune
logical_decoding_work_mem and also the activity of spilling happens at
process level, so it sounds like one-to-one mapping.  I am not telling
to rule out maintaining a slot level but trying to see if we can come
up with a clear definition.


I was thinking
something like pg_stat_logical_replication_slot view which shows
slot_name and statistics of only logical replication slots. The view
always shows rows as many as existing replication slots regardless of
logical decoding being running. I think there is no big difference in
how users use these statistics values between maintaining at slot
level and at logical decoding level.

In logical replication case, since we generally don’t support setting
different logical_decoding_work_mem per wal senders, every wal sender
will decode the same WAL stream with the same setting, meaning they
will similarly spill intermediate files.


I was thinking we support that. We can create multiple replication users
with different logical_decoding_work_mem settings. Also each walsender
can use logical_decoding_work_mem configured in its user. No?
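
For example (a sketch, with made-up role names):

    ALTER ROLE walsender_a SET logical_decoding_work_mem = '64MB';
    ALTER ROLE walsender_b SET logical_decoding_work_mem = '1GB';

Each walsender then decodes with the value configured for the role it
connects as.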

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-06-11 Thread Andy Fan
On Wed, Jun 3, 2020 at 10:36 AM Andy Fan  wrote:

>
>> Thanks for running those tests.  I had a quick look at the results and
>> I think to say that all 4 are better is not quite right. One is
>> actually a tiny bit slower and one is only faster due to a plan
>> change.
>>
>>
> Yes..  Thanks for pointing it out.
>
>
>> Q18 uses a result cache for 2 x nested loop joins and has a 0% hit
>> ratio.  The execution time is reduced to 91% of the original time only
>> because the planner uses a different plan, which just happens to be
>> faster by chance.
>>
>
This case appears to be caused by a wrong row estimate for the condition
o_orderkey in (select l_orderkey from lineitem group by l_orderkey having
sum(l_quantity) > 312).  The estimate is 123766 rows, but the actual number
is 10 rows.  That estimation is hard, and I don't think we should address
the issue in this patch.


>> Q20 uses a result cache for the subplan and has a 0% hit ratio.  The
>> execution time is 100.27% of the original time. There are 8620 cache
>> misses.
>>
>
>
This is by design in the current implementation.

> For subplans, since we plan subplans before we're done planning the
> outer plan, there's very little information to go on about the number
> of times that the cache will be looked up. For now, I've coded things
> so the cache is always used for EXPR_SUBLINK type subplans. "

I first tried to see whether we can have a row estimate before the subplan
is created, and it looks very complex.  The subplan is created during
preprocess_qual_conditions; at that time we haven't even created the base
RelOptInfo, to say nothing of the join rel, whose row estimation happens
much later.

Then I looked at whether we can delay the cache decision until we have the
row estimate; ExecInitSubPlan may be a candidate.  At that point we can't
add a new ResultCache node, but we can add a cost-based cache function to
the SubPlan node.  However, the number of distinct values for the
parameterized variable can't be calculated, which I still leave as an open
issue.

-- 
Best Regards
Andy Fan


Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-11 Thread Masahiko Sawada
On Fri, 12 Jun 2020 at 12:40, Amit Kapila  wrote:
>
> On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada
>  wrote:
> >
> > On Thu, 11 Jun 2020 at 22:21, Amit Kapila  wrote:
> > >
> >
> > >  I have another question about
> > > this field, why can't it be one of the status ('preparing',
> > > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
> > > separate field?
> >
> > Because I'm using in-doubt field also for checking if the foreign
> > transaction entry can also be resolved manually, i.g.
> > pg_resolve_foreign_xact(). For instance, a foreign transaction which
> > status = 'prepared' and in-doubt = 'true' can be resolved either
> > foreign transaction resolver or pg_resolve_foreign_xact(). When a user
> > execute pg_resolve_foreign_xact() against the foreign transaction, it
> > sets status = 'committing' (or 'rollbacking') by checking transaction
> > status in clog. The user might cancel pg_resolve_foreign_xact() during
> > resolution. In this case, the foreign transaction is still status =
> > 'committing' and in-doubt = 'true'. Then if a foreign transaction
> > resolver process processes the foreign transaction, it can commit it
> > without clog looking.
> >
>
> I think this is a corner case and it is better to simplify the state
> recording of foreign transactions then to save a CLOG lookup.
>

The main usage of the in-doubt flag is to distinguish in-doubt
transactions from transactions that have a waiter (which I call on-line
transactions).  If one foreign server goes down for a long time after the
server crashed during a distributed transaction commit, the foreign
transaction resolver tries to resolve the foreign transaction but fails
because the foreign server doesn't respond. We'd like to avoid the
situation where a resolver process keeps picking up that foreign
transaction while other on-line transactions waiting to be resolved
cannot move forward. Therefore, a resolver process prioritizes on-line
transactions. Once the shmem queue holding on-line transactions becomes
empty, a resolver process looks at the array of foreign transaction
states to find in-doubt transactions to resolve. I think we should not
process in-doubt transactions and on-line transactions in the same way.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-11 Thread Masahiko Sawada
On Fri, 12 Jun 2020 at 12:56, Fujii Masao  wrote:
>
>
>
> On 2020/06/12 12:21, Amit Kapila wrote:
> > On Thu, Jun 11, 2020 at 7:39 PM Masahiko Sawada
> >  wrote:
> >>
> >> On Thu, 11 Jun 2020 at 20:02, Amit Kapila  wrote:
> >>>
> >>> On Thu, Jun 11, 2020 at 3:07 PM Masahiko Sawada
> >>>  wrote:
> 
>  On Thu, 11 Jun 2020 at 18:11, Amit Kapila  
>  wrote:
> >
> > On Thu, Jun 11, 2020 at 1:46 PM Masahiko Sawada
> >  wrote:
> >>
> >> On Thu, 11 Jun 2020 at 12:30, Amit Kapila  
> >> wrote:
> >>>
> >>>
> >>> Now, thinking about this again, I am not sure if these stats are
> >>> directly related to slots. These are stats for logical decoding which
> >>> can be performed either via WALSender or decoding plugin (via APIs).
> >>> So, why not have them displayed in a new view like pg_stat_logical (or
> >>> pg_stat_logical_decoding/pg_stat_logical_replication)?   In future, we
> >>> will need to add similar stats for streaming of in-progress
> >>> transactions as well (see patch 0007-Track-statistics-for-streaming at
> >>> [1]), so having a separate view for these doesn't sound illogical.
> >>>
> >>
> >> I think we need to decide how long we want to remain these statistics
> >> values. That is, if we were to have such pg_stat_logical view, these
> >> values would remain until logical decoding finished since I think the
> >> view would display only running logical decoding. OTOH, if we were to
> >> correspond these stats to slots, these values would remain beyond
> >> multiple logical decoding SQL API calls.
> >>
> >
> > I thought of having these till the process that performs these
> > operations exist.  So for WALSender, the stats will be valid till it
> > is not restarted due to some reason or when performed via backend, the
> > stats will be valid till the corresponding backend exits.
> >
> 
>  The number of rows of that view could be up to (max_backends +
>  max_wal_senders). Is that right? What if different backends used the
>  same replication slot one after the other?
> 
> >>>
> >>> Yeah, it would be tricky if multiple slots are used by the same
> >>> backend.  We could probably track the number of times decoding has
> >>> happened by the session that will probably help us in averaging the
> >>> spill amount.  If we think that the aim is to help users to tune
> >>> logical_decoding_work_mem to avoid frequent spilling or streaming then
> >>> how would maintaining at slot level will help?
> >>
> >> Since the logical decoding intermediate files are written at per slots
> >> directory, I thought that corresponding these statistics to
> >> replication slots is also understandable for users.
> >>
> >
> > What I wanted to know is how will it help users to tune
> > logical_decoding_work_mem?  Different backends can process from the
> > same slot, so it is not clear how user will be able to make any
> > meaning out of those stats.  OTOH, it is easier to see how to make
> > meaning of these stats if we display them w.r.t process.  Basically,
> > we have spill_count and spill_size which can be used to tune
> > logical_decoding_work_mem and also the activity of spilling happens at
> > process level, so it sounds like one-to-one mapping.  I am not telling
> > to rule out maintaining a slot level but trying to see if we can come
> > up with a clear definition.
> >
> >> I was thinking
> >> something like pg_stat_logical_replication_slot view which shows
> >> slot_name and statistics of only logical replication slots. The view
> >> always shows rows as many as existing replication slots regardless of
> >> logical decoding being running. I think there is no big difference in
> >> how users use these statistics values between maintaining at slot
> >> level and at logical decoding level.
> >>
> >> In logical replication case, since we generally don’t support setting
> >> different logical_decoding_work_mem per wal senders, every wal sender
> >> will decode the same WAL stream with the same setting, meaning they
> >> will similarly spill intermediate files.
>
> I was thinking we support that. We can create multiple replication users
> with different logical_decoding_work_mem settings. Also each walsender
> can use logical_decoding_work_mem configured in its user. No?
>

Yes, you're right. I had missed that way.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel copy

2020-06-11 Thread vignesh C
On Thu, Jun 4, 2020 at 12:44 AM Andres Freund  wrote
>
>
> Hm. you don't explicitly mention that in your design, but given how
> small the benefits going from 0-1 workers is, I assume the leader
> doesn't do any "chunk processing" on its own?
>

Yes, you are right: the leader does not do any chunk processing. The
leader's work is mainly to populate the shared memory with the offset
information for each record.

>
>
> > Design of the Parallel Copy: The backend, to which the "COPY FROM" query is
> > submitted acts as leader with the responsibility of reading data from the
> > file/stdin, launching at most n number of workers as specified with
> > PARALLEL 'n' option in the "COPY FROM" query. The leader populates the
> > common data required for the workers execution in the DSM and shares it
> > with the workers. The leader then executes before statement triggers if
> > there exists any. Leader populates DSM chunks which includes the start
> > offset and chunk size, while populating the chunks it reads as many blocks
> > as required into the DSM data blocks from the file. Each block is of 64K
> > size. The leader parses the data to identify a chunk, the existing logic
> > from CopyReadLineText which identifies the chunks with some changes was
> > used for this. Leader checks if a free chunk is available to copy the
> > information, if there is no free chunk it waits till the required chunk is
> > freed up by the worker and then copies the identified chunks information
> > (offset & chunk size) into the DSM chunks. This process is repeated till
> > the complete file is processed. Simultaneously, the workers cache the
> > chunks(50) locally into the local memory and release the chunks to the
> > leader for further populating. Each worker processes the chunk that it
> > cached and inserts it into the table. The leader waits till all the chunks
> > populated are processed by the workers and exits.
>
> Why do we need the local copy of 50 chunks? Copying memory around is far
> from free. I don't see why it'd be better to add per-process caching,
> rather than making the DSM bigger? I can see some benefit in marking
> multiple chunks as being processed with one lock acquisition, but I
> don't think adding a memory copy is a good idea.

We ran a performance test with a 5.1GB csv data file, 10 million tuples,
and 2 indexes on integer columns; the results are given below. We noticed
that in some cases performance is better if we copy the 50 records locally
and release the shared memory, and the benefit grows as the number of
workers increases. Thoughts?

Workers | Exec time (with local copying of  | Exec time (without copying,
        | 50 records & releasing the shared | processing record by record)
        | memory)                           |
--------+-----------------------------------+------------------------------
      0 | 1162.772 (1X)                     | 1152.684 (1X)
      2 |  635.249 (1.83X)                  |  647.894 (1.78X)
      4 |  336.835 (3.45X)                  |  335.534 (3.43X)
      8 |  188.577 (6.17X)                  |  189.461 (6.08X)
     16 |  126.819 (9.17X)                  |  142.730 (8.07X)
     20 |  117.845 (9.87X)                  |  146.533 (7.87X)
     30 |  127.554 (9.11X)                  |  160.307 (7.19X)

> This patch *desperately* needs to be split up. It imo is close to
> unreviewable, due to a large amount of changes that just move code
> around without other functional changes being mixed in with the actual
> new stuff.

I have split the patch, the new split patches are attached.

>
>
>
> >  /*
> > + * State of the chunk.
> > + */
> > +typedef enum ChunkState
> > +{
> > + CHUNK_INIT, /* initial state of 
> > chunk */
> > + CHUNK_LEADER_POPULATING,/* leader processing chunk */
> > + CHUNK_LEADER_POPULATED, /* leader completed populating chunk 
> > */
> > + CHUNK_WORKER_PROCESSING,/* worker processing chunk */
> > + CHUNK_WORKER_PROCESSED  /* worker completed processing chunk 
> > */
> > +}ChunkState;
> > +
> > +#define RAW_BUF_SIZE 65536   /* we palloc RAW_BUF_SIZE+1 bytes */
> > +
> > +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> > +#define RINGSIZE (10 * 1000)
> > +#define MAX_BLOCKS_COUNT 1000
> > +#define WORKER_CHUNK_COUNT 50/* should be mod of RINGSIZE */
> > +
> > +#define  IsParallelCopy()(cstate->is_parallel)
> > +#define IsLeader()   (cstate->pcdata->is_leader)
> > +#define IsHeaderLine()   (cstate->header_line && 
> > cstate->cur_lineno == 1)
> > +
> > +/*
> > + * Copy data block information.
> > + */
> > +typedef struct CopyDataBlock
> > +{
> > + /* The number of unprocessed chunks in the current 

Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-11 Thread Masahiko Sawada
On Fri, 12 Jun 2020 at 12:21, Amit Kapila  wrote:
>
> On Thu, Jun 11, 2020 at 7:39 PM Masahiko Sawada
>  wrote:
> >
> > On Thu, 11 Jun 2020 at 20:02, Amit Kapila  wrote:
> > >
> > > On Thu, Jun 11, 2020 at 3:07 PM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Thu, 11 Jun 2020 at 18:11, Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Thu, Jun 11, 2020 at 1:46 PM Masahiko Sawada
> > > > >  wrote:
> > > > > >
> > > > > > On Thu, 11 Jun 2020 at 12:30, Amit Kapila  
> > > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > Now, thinking about this again, I am not sure if these stats are
> > > > > > > directly related to slots. These are stats for logical decoding 
> > > > > > > which
> > > > > > > can be performed either via WALSender or decoding plugin (via 
> > > > > > > APIs).
> > > > > > > So, why not have them displayed in a new view like 
> > > > > > > pg_stat_logical (or
> > > > > > > pg_stat_logical_decoding/pg_stat_logical_replication)?   In 
> > > > > > > future, we
> > > > > > > will need to add similar stats for streaming of in-progress
> > > > > > > transactions as well (see patch 
> > > > > > > 0007-Track-statistics-for-streaming at
> > > > > > > [1]), so having a separate view for these doesn't sound illogical.
> > > > > > >
> > > > > >
> > > > > > I think we need to decide how long we want to remain these 
> > > > > > statistics
> > > > > > values. That is, if we were to have such pg_stat_logical view, these
> > > > > > values would remain until logical decoding finished since I think 
> > > > > > the
> > > > > > view would display only running logical decoding. OTOH, if we were 
> > > > > > to
> > > > > > correspond these stats to slots, these values would remain beyond
> > > > > > multiple logical decoding SQL API calls.
> > > > > >
> > > > >
> > > > > I thought of having these till the process that performs these
> > > > > operations exist.  So for WALSender, the stats will be valid till it
> > > > > is not restarted due to some reason or when performed via backend, the
> > > > > stats will be valid till the corresponding backend exits.
> > > > >
> > > >
> > > > The number of rows of that view could be up to (max_backends +
> > > > max_wal_senders). Is that right? What if different backends used the
> > > > same replication slot one after the other?
> > > >
> > >
> > > Yeah, it would be tricky if multiple slots are used by the same
> > > backend.  We could probably track the number of times decoding has
> > > happened by the session that will probably help us in averaging the
> > > spill amount.  If we think that the aim is to help users to tune
> > > logical_decoding_work_mem to avoid frequent spilling or streaming then
> > > how would maintaining at slot level will help?
> >
> > Since the logical decoding intermediate files are written at per slots
> > directory, I thought that corresponding these statistics to
> > replication slots is also understandable for users.
> >
>
> What I wanted to know is how will it help users to tune
> logical_decoding_work_mem?  Different backends can process from the
> same slot, so it is not clear how user will be able to make any
> meaning out of those stats.

I thought that the user would need to monitor them continuously while a
process is executing logical decoding, and watch the increments. I might
not fully understand, but I guess the same is true for displaying them
w.r.t. process. Since a process can perform logical decoding several times
using the same slot with different settings, the user will need to monitor
them several times.

> OTOH, it is easier to see how to make
> meaning of these stats if we display them w.r.t process.  Basically,
> we have spill_count and spill_size which can be used to tune
> logical_decoding_work_mem and also the activity of spilling happens at
> process level, so it sounds like one-to-one mapping.

Displaying them w.r.t. process also seems a good idea, but I'm still
unclear on what to display and how long these values remain valid. For
example, would the view have the following columns?

* pid
* slot_name
* spill_txns
* spill_count
* spill_bytes
* exec_count
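
A monitoring query against such a view might then look like this (purely a
sketch of the idea; neither the view name nor the columns exist yet):

    SELECT pid, slot_name, spill_txns, spill_count, spill_bytes, exec_count
      FROM pg_stat_logical;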

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Internal key management system

2020-06-11 Thread Masahiko Sawada
On Thu, 11 Jun 2020 at 20:03, Fabien COELHO  wrote:
>
>
> Hello Masahiko-san,
>
> >> If the KEK is ever present in pg process, it presumes that the threat
> >> model being addressed allows its loss if the process is compromised, i.e.
> >> all (past, present, future) security properties are void once the process
> >> is compromised.
> >
> > Why we should not put KEK in pg process but it's okay for other
> > processes?
>
> My point is "elsewhere".
>
> Indeed, it could be on another process on the same host, in which case I'd
> rather have the process run under a different uid, which means another
> compromission would be required if pg is compromissed locally ; it could
> also be in a process on another host ; it could be on some special
> hardware. Your imagination is the limit.
>
> > I guess you're talking about a threat when a malicious user logged in OS
> > (or at least accessible) but I thought there is no difference between pg
> > process and other processes in terms of the process being compromised.
>
> Processes are isolated based on uid, unless root is compromised. Once a id
> is compromised (eg "postgres"), the hacker has basically access to all
> files and processes accessible to that id.
>
> > So the solution, in that case, would be to outsource
> > encryption/decryption to external servers as Bruce mentioned.
>
> Hosting stuff (keys, encryption) on another server is indeed an option if
> "elsewhere" is implemented.

If I understand your idea correctly, we put both the DEK and the KEK
"elsewhere", and a postgres process gets only the DEK from it. It seems to
me this idea assumes that the place storing the encryption keys employs a
two-tier key hierarchy or something similar. What if the user wants to, or
has to, manage a single encryption key? For example, storing the
encryption key for PostgreSQL TDE in a file on a safe server instead of
using a KMS with DEK and KEK, because of budget or other requirements. In
this case, if the user does key rotation, that encryption key would need
to be rotated, and the user would then need to re-encrypt all database
data encrypted with the old key. It should work, but what do you think
about how postgres should do key rotation and re-encryption?

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: BUG #16481: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events

2020-06-11 Thread Vianello Fabio
Just to help those who find themselves in the same situation, there is a simple 
application-level workaround: the application on the replica side connects to 
the master and listens for the notifications there (and vice versa).

We are programming in .NET and we use a code like this:

Code in the replica side:

    var csb = new NpgsqlConnectionStringBuilder
    {
        Host = "master",
        Database = "MasterDB",
        Port = 5432,
        Username = "postgres",
        Password = "XXX",
    };
    var connection = new NpgsqlConnection(csb.ConnectionString);
    connection.Open();
    using (var command = new NpgsqlCommand("listen \"Table\"", connection))
    {
        command.ExecuteNonQuery();
    }
    connection.Notification += PostgresNotification;

So from the replica side you can listen for every change raised by a trigger 
on the master on the table "Table".

CREATE TRIGGER master_trigger
AFTER INSERT OR DELETE OR UPDATE
ON public."TABLE"
FOR EACH ROW
EXECUTE PROCEDURE public.master_notice();

ALTER TABLE public."TABLE"
ENABLE ALWAYS TRIGGER master_trigger;

CREATE FUNCTION public.master_notice()
RETURNS trigger
LANGUAGE 'plpgsql'
COST 100
VOLATILE NOT LEAKPROOF
AS $BODY$
  BEGIN
  PERFORM  pg_notify('Table', Cast(NEW."ID"  as text));
  RETURN NEW;
  END;
  $BODY$;

ALTER FUNCTION public.master_notice()
OWNER TO postgres;

I hope that helps someone, because the bug has been around for years. I tried 
versions 10, 11 and 12, so it has been present since 2017-10-05, and I can't 
see any fix in the 13 beta either.

Best Regards.
Fabio.


-Original Message-
From: Vianello Fabio
Sent: lunedì 8 giugno 2020 11:14
To: Kyotaro Horiguchi 
Cc: pgsql-b...@lists.postgresql.org; pgsql-hackers@lists.postgresql.org
Subject: RE: BUG #16481: Stored Procedure Triggered by Logical Replication is 
Unable to use Notification Events

Hi Kyotaro Horiguchi, thanks for your help.
We have a question about the bug: why isn't there any fix in HEAD?

This bug has existed since version 10.4, and I can't understand why it is not 
fixed in HEAD yet.

BR.
Fabio Vianello.


-Original Message-
From: Kyotaro Horiguchi [mailto:horikyota@gmail.com]
Sent: lunedì 8 giugno 2020 10:28
To: Vianello Fabio ; 
pgsql-b...@lists.postgresql.org; pgsql-hackers@lists.postgresql.org
Subject: Re: BUG #16481: Stored Procedure Triggered by Logical Replication is 
Unable to use Notification Events

Hello.

It seems to me a bug.

At Fri, 05 Jun 2020 11:05:14 +, PG Bug reporting form 
 wrote in
> The following bug has been logged on the website:
>
> Bug reference:  16481
> Logged by:  Fabio Vianello
> Email address:  fabio.viane...@salvagninigroup.com
> PostgreSQL version: 12.3
> Operating system:   Windows 10
> Description:
>
> About the bug BUG #15293, on PostgreSQL version 10.4 and 11.2 as
> describe below, we found the same issue on the PostgreSQL version 12.3.

The HEAD behaves the same way.

> Is it a feature?
> Becasue in the documentation we didn't found any constraint that says
> that we can not use NOTIFY/LISTEN on logical replication tables.
>
> "When using logical replication a stored procedure executed on the
> replica is unable to use NOTIFY to send messages to other listeners.
> The stored procedure does execute as expected but the pg_notify()
> doesn't appear to have any effect. If an insert is run on the replica
> side the trigger executes the stored procedure as expected and the
> NOTIFY correctly notifies listeners.

The message is actually queued, but the logical replication worker doesn't 
signal that to listener backends. If any ordinary session sent a message to the 
same listener after that, both messages would be shown at once.

That can be fixed by calling ProcessCompletedNotifies() in apply_handle_commit. 
The function has code to write out notifications to connected clients, but it 
currently does nothing in logical replication workers.
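
That is, something along these lines in
src/backend/replication/logical/worker.c (a sketch; the exact placement is
my assumption):

    /*
     * At the end of apply_handle_commit(), after the local transaction
     * has been committed:
     */
    CommitTransactionCommand();

    /* Send queued NOTIFYs and signal listening backends. */
    ProcessCompletedNotifies();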

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Re: new heapcheck contrib module

2020-06-11 Thread Dilip Kumar
On Fri, Jun 12, 2020 at 12:40 AM Mark Dilger
 wrote:
>
>
>
> > On Jun 11, 2020, at 9:14 AM, Dilip Kumar  wrote:
> >
> > I have just browsed through the patch and the idea is quite
> > interesting.  I think we can expand it to check that whether the flags
> > set in the infomask are sane or not w.r.t other flags and xid status.
> > Some examples are
> >
> > - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED
> > should not be set in new_infomask2.
> > - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we
> > actually cross verify the transaction status from the CLOG and check
> > whether is matching the hint bit or not.
> >
> > While browsing through the code I could not find that we are doing
> > this kind of check,  ignore if we are already checking this.
>
> Thanks for taking a look!
>
> Having both of those bits set simultaneously appears to fall into a different 
> category than what I wrote verify_heapam.c to detect.

Ok

> It doesn't violate any assertion in the backend, nor does it cause
> the code to crash.  (At least, I don't immediately see how it does
> either of those things.)  At first glance it appears invalid to have
> those bits both set simultaneously, but I'm hesitant to enforce that
> without good reason.  If it is a good thing to enforce, should we also
> change the backend code to Assert?

Yeah, it may not hit an assert or crash, but it could lead to a wrong
result.  I agree that it could be an assertion in the backend code.
What about the other check, where the hint bit says the transaction is
committed but, according to the clog, the status is something else?  I
think it is hard to check such things in the backend during normal
processing, because if the hint bit says the transaction is committed we
check its visibility against the snapshot directly.  A corruption
checker may be a good tool for catching such anomalies.
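
Just to illustrate where such a check would surface, a hypothetical
invocation of the checker could look like the sketch below (the function
signature and the table name are only illustrative; the actual interface
is whatever the patch exposes):

-- Assuming verify_heapam() is exposed as a set-returning SQL function:
SELECT * FROM verify_heapam('public.my_table');
-- Each returned row describes one detected corruption; a hint bit that
-- disagrees with the clog status could be reported as one more such row.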

> I integrated your idea into one of the regression tests.  It now sets these 
> two bits in the header of one of the rows in a table.  The verify_heapam 
> check output (which includes all detected corruptions) does not change, which 
> verifies your observation that verify_heapam is not checking for this.  
> I've attached that as a patch to this email.  Note that this patch should be 
> applied atop the v6 patch recently posted in another email.

Ok.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-11 Thread Amit Kapila
On Fri, Jun 12, 2020 at 9:54 AM Masahiko Sawada
 wrote:
>
> On Fri, 12 Jun 2020 at 12:40, Amit Kapila  wrote:
> >
> > On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Thu, 11 Jun 2020 at 22:21, Amit Kapila  wrote:
> > > >
> > >
> > > >  I have another question about
> > > > this field, why can't it be one of the status ('preparing',
> > > > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
> > > > separate field?
> > >
> > > Because I'm using the in-doubt field also for checking whether the foreign
> > > transaction entry can be resolved manually, i.e. via
> > > pg_resolve_foreign_xact(). For instance, a foreign transaction with
> > > status = 'prepared' and in-doubt = 'true' can be resolved either by a
> > > foreign transaction resolver or by pg_resolve_foreign_xact(). When a user
> > > executes pg_resolve_foreign_xact() against the foreign transaction, it
> > > sets status = 'committing' (or 'rollbacking') by checking the transaction
> > > status in the clog. The user might cancel pg_resolve_foreign_xact() during
> > > resolution. In this case, the foreign transaction still has status =
> > > 'committing' and in-doubt = 'true'. Then, if a foreign transaction
> > > resolver process processes the foreign transaction, it can commit it
> > > without looking at the clog.
> > >
> >
> > I think this is a corner case, and it is better to simplify the state
> > recording of foreign transactions than to save a CLOG lookup.
> >
>
> The main usage of the in-doubt flag is to distinguish between in-doubt
> transactions and other transactions that have their waiter (which I call
> on-line transactions).
>

Which are these other on-line transactions?  I had assumed that the
foreign transaction resolver process is there to resolve in-doubt
transactions, but it seems it is also used for some other purpose.  That
was anyway the next question I had while reviewing other sections of the
docs, but let's clarify it now that it has come up.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com