Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

2020-06-24 Thread Maciek Sakrejda
On Tue, Jun 23, 2020 at 7:55 PM Amit Kapila  wrote:
> > I don't see any other reason for
> > looping over the NL node itself in this plan. The Gather itself
> > doesn't do any real looping, right?
>
> It is right that Gather doesn't do looping but Parallel Seq Scan node does so.

Sorry, I still don't follow. How does a Parallel Seq Scan do looping?
I looked at the parallel plan docs but I don't see looping mentioned
anywhere[1]. Also, is looping not normally indicated on children,
rather than on the node doing the looping? E.g., with a standard
Nested Loop, the outer child will have loops=1 and the inner child
will have loops equal to the row count produced by the outer child
(and the Nested Loop itself will have loops=1 unless it also is being
looped over by a parent node), right?

But even aside from that, why do I need to divide by the number of
loops here, when normally Postgres presents a per-loop average?

[1]: https://www.postgresql.org/docs/current/parallel-plans.html




RE: [PoC] Non-volatile WAL buffer

2020-06-24 Thread Takashi Menjo
Dear hackers,

I have updated my non-volatile WAL buffer patchset to v3.  Now it can be used in 
streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly into the non-volatile WAL buffer
if applicable.

- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto the non-volatile WAL buffer
if you run it in "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option.  The path will
be written to postgresql.auto.conf or recovery.conf.  The size of the new NVWAL
is the same as the master's.
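
For example, a backup in that mode might be taken like this (a hypothetical
invocation; host, directories, and the NVWAL path are placeholders):

  $ pg_basebackup -h primary.example.com -U repl_user -D /path/to/backup \
        -R -Fn --nvwal-path=/mnt/pmem0/nvwal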


Best regards,
Takashi

-- 
Takashi Menjo 
NTT Software Innovation Center

> -Original Message-
> From: Takashi Menjo 
> Sent: Wednesday, March 18, 2020 5:59 PM
> To: 'PostgreSQL-development' 
> Cc: 'Robert Haas' ; 'Heikki Linnakangas' 
> ; 'Amit Langote'
> 
> Subject: RE: [PoC] Non-volatile WAL buffer
> 
> Dear hackers,
> 
> I rebased my non-volatile WAL buffer's patchset onto master.  A new v2 
> patchset is attached to this mail.
> 
> I also measured performance before and after patchset, varying -c/--client 
> and -j/--jobs options of pgbench, for
> each scaling factor s = 50 or 1000.  The results are presented in the 
> following tables and the attached charts.
> Conditions, steps, and other details will be shown later.
> 
> 
> Results (s=50)
> ==
>            Throughput [10^3 TPS]     Average latency [ms]
> ( c, j)    before  after             before  after
> -------    ------  ------------      ------  --------------
> ( 8, 8)    35.7    37.1 (+3.9%)      0.224   0.216 (-3.6%)
> (18,18)    70.9    74.7 (+5.3%)      0.254   0.241 (-5.1%)
> (36,18)    76.0    80.8 (+6.3%)      0.473   0.446 (-5.7%)
> (54,18)    75.5    81.8 (+8.3%)      0.715   0.660 (-7.7%)
> 
> 
> Results (s=1000)
> 
>            Throughput [10^3 TPS]     Average latency [ms]
> ( c, j)    before  after             before  after
> -------    ------  ------------      ------  --------------
> ( 8, 8)    37.4    40.1 (+7.3%)      0.214   0.199 (-7.0%)
> (18,18)    79.3    86.7 (+9.3%)      0.227   0.208 (-8.4%)
> (36,18)    87.2    95.5 (+9.5%)      0.413   0.377 (-8.7%)
> (54,18)    86.8    94.8 (+9.3%)      0.622   0.569 (-8.5%)
> 
> 
> Both throughput and average latency are improved for each scaling factor.
> Throughput seemed to almost reach its upper limit at (c,j)=(36,18).
> 
> The percentage improvement in the s=1000 case looks larger than in the s=50
> case.  I think a larger scaling factor leads to less contention on the same
> tables and/or indexes, that is, fewer lock and unlock operations.  In such a
> situation, write-ahead logging appears to be more significant for performance.
> 
> 
> Conditions
> ==
> - Use one physical server having 2 NUMA nodes (node 0 and 1)
>   - Pin postgres (server processes) to node 0 and pgbench to node 1
>   - 18 cores and 192GiB DRAM per node
> - Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
>   - Both are installed on the server-side node, that is, node 0
>   - Both are formatted with ext4
>   - NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
> - Use the attached postgresql.conf
>   - Two new items nvwal_path and nvwal_size are used only after patch
> 
> 
> Steps
> =
> For each (c,j) pair, I did the following steps three times, then took the
> median of the three as the final result shown in the tables above.
> 
> (1) Run initdb with proper -D and -X options; and also give --nvwal-path and 
> --nvwal-size options after patch
> (2) Start postgres and create a database for pgbench tables
> (3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
> (4) Stop postgres, remount filesystems, and start postgres again
> (5) Execute pg_prewarm extension for all the four pgbench tables
> (6) Run pgbench during 30 minutes
> 
> 
> pgbench command line
> 
> $ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ 
> dbname
> 
> I gave no -b option to use the built-in "TPC-B (sort-of)" query.
> 
> 
> Software
> 
> - Distro: Ubuntu 18.04
> - Kernel: Linux 5.4 (vanilla kernel)
> - C Compiler: gcc 7.4.0
> - PMDK: 1.7
> - PostgreSQL: d677550 (master on Mar 3, 2020)
> 
> 
> Hardware
> 
> - System: HPE ProLiant DL380 Gen10
> - CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
> - DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
> - NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
> - NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA
> 
> 
> Best regards,
> Takashi
> 
> --
> Takashi Menjo  NTT Software Innovation Center
> 
> > -Original Message-
> > From: Takashi Menjo 
> > Sent: Thursday, February 20, 2020 6:30 PM
> > To: 'Amit Langote' 
> > Cc: 'Robert Haas' ; 'Heikki Linnakangas' 
> > ;
> 'PostgreSQL-development'
> > 
> > Subject: RE: [PoC] Non-volatile WAL buffer
> >
> > Dear Amit,
> >
> > Thank you for your advice.  Exactly, it's so to speak "do as the hackers do 
> > when in pgsql"...

Re: Improve handling of parameter differences in physical replication

2020-06-24 Thread Peter Eisentraut

Here is another stab at this subject.

This is a much simplified variant:  When encountering a parameter change 
in the WAL that is higher than the standby's current setting, we log a 
warning (instead of an error, as until now) and pause recovery.  If you 
resume (unpause) recovery, the instance shuts down as before.


This allows you to keep your standbys running for a bit (depending on 
lag requirements) and schedule the required restart more deliberately.
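
To spell out the intended workflow on the standby (a sketch using the existing
recovery-control functions; the parameter named here is just an example):

  -- recovery is paused after the offending parameter-change record is seen
  SELECT pg_is_wal_replay_paused();   -- returns true
  -- read-only queries keep working; raise e.g. max_connections in
  -- postgresql.conf, then either restart in a maintenance window or resume,
  -- which (with this patch) shuts the instance down as before:
  SELECT pg_wal_replay_resume();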


I had previously suggested making this new behavior configurable, but 
there didn't seem to be much interest in that, so I have not included 
that here.


The documentation changes are mostly carried over from previous patch 
versions (but adjusted for the actual behavior of the patch).


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 2d3ffb221ae9418ebd7e1fc7fb771213014ca216 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 24 Jun 2020 08:49:42 +0200
Subject: [PATCH v3 1/2] Replace a macro by a function

Using a macro is ugly and not justified here.
---
 src/backend/access/transam/xlog.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index a1256a103b..dcec2111b3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6232,16 +6232,17 @@ GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream)
  * Note that text field supplied is a parameter name and does not require
  * translation
  */
-#define RecoveryRequiresIntParameter(param_name, currValue, minValue) \
-do { \
-   if ((currValue) < (minValue)) \
-   ereport(ERROR, \
-   (errcode(ERRCODE_INVALID_PARAMETER_VALUE), \
-errmsg("hot standby is not possible because %s 
= %d is a lower setting than on the master server (its value was %d)", \
-   param_name, \
-   currValue, \
-   minValue))); \
-} while(0)
+static void
+RecoveryRequiresIntParameter(const char *param_name, int currValue, int 
minValue)
+{
+   if (currValue < minValue)
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("hot standby is not possible because %s 
= %d is a lower setting than on the master server (its value was %d)",
+   param_name,
+   currValue,
+   minValue)));
+}
 
 /*
  * Check to see if required parameters are set high enough on this server
-- 
2.27.0

From 18740a021a90207423032cb7201f54650b81c15a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 24 Jun 2020 09:18:21 +0200
Subject: [PATCH v3 2/2] Pause recovery for insufficient parameter settings

When certain parameters are changed on a physical replication primary,
this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL
record.  The standby then checks whether its own settings are at least
as big as the ones on the primary.  If not, the standby shuts down
with a fatal error.

This patch changes this behavior to pause recovery at that point
instead.  That allows read traffic on the standby to continue while
database administrators figure out next steps.  When recovery is
unpaused, the server shuts down (as before).  The idea is to fix the
parameters while recovery is paused and then restart when there is a
maintenance window.
---
 doc/src/sgml/high-availability.sgml | 47 +
 src/backend/access/transam/xlog.c   | 17 ++-
 2 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/high-availability.sgml 
b/doc/src/sgml/high-availability.sgml
index 65c3fc62a9..d32ec6d9b3 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2159,18 +2159,14 @@ Administrator's Overview

 

-The setting of some parameters on the standby will need reconfiguration
-if they have been changed on the primary. For these parameters,
-the value on the standby must
-be equal to or greater than the value on the primary.
-Therefore, if you want to increase these values, you should do so on all
-standby servers first, before applying the changes to the primary server.
-Conversely, if you want to decrease these values, you should do so on the
-primary server first, before applying the changes to all standby servers.
-If these parameters
-are not set high enough then the standby will refuse to start.
-Higher values can then be supplied and the server
-restarted to begin recovery again.  These parameters are:
+The settings of some parameters determine the s

Re: Allow CURRENT_ROLE in GRANTED BY

2020-06-24 Thread Vik Fearing
On 6/24/20 8:35 AM, Peter Eisentraut wrote:
> I was checking some loose ends in SQL conformance, when I noticed: We
> support GRANT role ... GRANTED BY CURRENT_USER, but we don't support
> CURRENT_ROLE in that place, even though in PostgreSQL they are
> equivalent.  Here is a trivial patch to add that.
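
For reference, the syntax in question is of this shape (role names here are
just placeholders):

  GRANT some_role TO another_role GRANTED BY CURRENT_ROLE;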


The only thing that isn't dead-obvious about this patch is the commit
message says "[PATCH 1/2]".  What is in the other part?

Assuming that's just a remnant of development, this LGTM.
-- 
Vik Fearing




Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-06-24 Thread Inoue, Hiroshi

Hi Michael,

Where are you testing, on Windows or on *nix?
How are you testing it there?

regards,
Hiroshi Inoue

On 2020/06/24 11:11, Michael Paquier wrote:

On Tue, Jun 23, 2020 at 02:02:33PM +0900, Michael Paquier wrote:

Actually, while reviewing the code, the only code path where we use
currtid2() involves positioned_load() and LATEST_TUPLE_LOAD.  And the
only location where this happens is in SC_pos_reload_with_key(), where
I don't actually see how it would be possible to not have a keyset and
still use a CTID, which would lead to LATEST_TUPLE_LOAD being used.  So
could it be possible that the code paths of currtid2() are actually
just dead code?

I have dug more into this one, and we actually stressed this code path
quite a lot up to commit d9cb23f in the ODBC driver, with tests
cursor-block-delete, positioned-update and bulkoperations particularly
when calling SQLSetPos().  However, 86e2e7a has reworked the code in
such a way that we visibly don't use anymore CTIDs if we don't have a
keyset, and that combinations of various options like UseDeclareFetch
or UpdatableCursors don't trigger this code path anymore.  In short,
currtid2() does not get used.  Inoue-san, Saito-san, what do you
think?  I am adding also Tsunakawa-san in CC who has some experience
in this area.
--
Michael






Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-24 Thread tushar

On 6/22/20 11:59 AM, Amul Sul wrote:

2. Now skipping the startup checkpoint if the system is read-only mode, as
discussed [2].


I am not able to get pg_checksums output after shutting down my server 
in read-only mode.


Steps -

1.initdb (./initdb -k -D data)
2.start the server(./pg_ctl -D data start)
3.connect to psql (./psql postgres)
4.Fire query (alter system read only;)
5.shutdown the server(./pg_ctl -D data stop)
6.pg_checksums

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
pg_checksums: error: cluster must be shut down
[edb@tushar-ldap-docker bin]$

Result - (when the server is not in read-only mode)

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
Checksum operation completed
Files scanned:  916
Blocks scanned: 2976
Bad checksums:  0
Data checksum version: 1

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: should libpq also require TLSv1.2 by default?

2020-06-24 Thread Daniel Gustafsson
> On 24 Jun 2020, at 08:39, Peter Eisentraut  
> wrote:
> 
> In PG13, we raised the server-side default of ssl_min_protocol_version to 
> TLSv1.2.  We also added a connection setting named ssl_min_protocol_version 
> to libpq.  But AFAICT, the default value of the libpq setting is empty, so 
> any protocol version will be accepted.  Is this what we wanted?  Should we 
> raise the default in libpq as well?

This was discussed [0] when the connection settings were introduced, and the
consensus was to leave them alone [1] to allow, for example, a new pg_dump to
work against an old server.  Re-reading the thread, I think the argument still
holds, but I was about to respond "yes, let's do this" before refreshing my
memory.  Perhaps we should add a comment explaining this along the lines of the
attached?

cheers ./daniel

[0] 
https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org
[1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us



libpq_minmaxproto_doc.diff
Description: Binary data


Re: Parallel copy

2020-06-24 Thread Bharath Rupireddy
Hi,

It looks like the parsing of the newly introduced "PARALLEL" option for the
COPY FROM command has an issue (in
0002-Framework-for-leader-worker-in-parallel-copy.patch):
specifying PARALLEL '4ar2eteid' would pass with 4 workers, since
atoi() is being used for converting the string to an integer and it just
returns 4, ignoring the trailing characters.

I used strtol(), added error checks, and introduced the error
"improper use of argument to option "parallel"" for such cases.

parallel '4ar2eteid');
ERROR:  improper use of argument to option "parallel"
LINE 5: parallel '1\');
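
For reference, the check I added is along these lines (a standalone sketch of
the idea, not the exact patch code; the function name and the bool return are
only illustrative):

  #include <errno.h>
  #include <limits.h>
  #include <stdbool.h>
  #include <stdlib.h>

  static bool
  parse_parallel_workers(const char *value, int *nworkers)
  {
      char   *endptr;
      long    n;

      errno = 0;
      n = strtol(value, &endptr, 10);

      /* reject empty input, trailing garbage such as '4ar2eteid', and overflow */
      if (endptr == value || *endptr != '\0' || errno == ERANGE ||
          n <= 0 || n > INT_MAX)
          return false;   /* caller raises the "improper use of argument" error */

      *nworkers = (int) n;
      return true;
  }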

Along with the updated patch
0002-Framework-for-leader-worker-in-parallel-copy.patch, also
attaching all the latest patches from [1].

[1] - 
https://www.postgresql.org/message-id/CALj2ACW94icER3WrWapon7JkcX8j0TGRue5ycWMTEvgA3X7fOg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

On Tue, Jun 23, 2020 at 12:22 PM vignesh C  wrote:
>
> On Tue, Jun 23, 2020 at 8:07 AM vignesh C  wrote:
> > I have attached the patch for the same with the fixes.
>
> The patches were not applying on head; attached are the patches that can be 
> applied on head.
> I have added a commitfest entry[1] for this feature.
>
> [1] - https://commitfest.postgresql.org/28/2610/
>
>
> On Tue, Jun 23, 2020 at 8:07 AM vignesh C  wrote:
>>
>> Thanks Ashutosh For your review, my comments are inline.
>> On Fri, Jun 19, 2020 at 5:41 PM Ashutosh Sharma  
>> wrote:
>> >
>> > Hi,
>> >
>> > I just got some time to review the first patch in the list i.e. 
>> > 0001-Copy-code-readjustment-to-support-parallel-copy.patch. As the patch 
>> > name suggests, it is just trying to reshuffle the existing code for COPY 
>> > command here and there. There is no extra changes added in the patch as 
>> > such, but still I do have some review comments, please have a look:
>> >
>> > 1) Can you please add some comments atop the new function 
>> > PopulateAttributes() describing its functionality in detail. Further, this 
>> > new function contains the code from BeginCopy() to set attribute level 
>> > options used with COPY FROM such as FORCE_QUOTE, FORCE_NOT_NULL, 
>> > FORCE_NULL etc. in cstate and along with that it also copies the code from 
>> > BeginCopy() to set other infos such as client encoding type, encoding 
>> > conversion etc. Hence, I think it would be good to give it some better 
>> > name, basically something that matches with what actually it is doing.
>> >
>>
>> There is no new code added in this function; some part of the code from
>> BeginCopy was made into a new function, as this part of the code will also
>> be required by the parallel copy workers before they start the
>> actual copy operation.  This code was made into a function to avoid
>> duplication.  Changed the function name to PopulateGlobalsForCopyFrom &
>> added a few comments.
>>
>> > 2) Again, the name for the new function CheckCopyFromValidity() doesn't 
>> > look good to me. From the function name it appears as if it does the 
>> > sanity check of the entire COPY FROM command, but actually it is just 
>> > doing the sanity check for the target relation specified with COPY FROM. 
>> > So, probably something like CheckTargetRelValidity would look more 
>> > sensible, I think? TBH, I am not good at naming the functions so you can 
>> > always ignore my suggestions about function and variable names :)
>> >
>>
>> Changed as suggested.
>> > 3) Any reason for not making CheckCopyFromValidity as a macro instead of a 
>> > new function. It is just doing the sanity check for the target relation.
>> >
>>
>> I felt there is a reasonable number of lines in the function and it is not
>> in a performance-intensive path, so I preferred a function over a macro.
>> Your thoughts?
>>
>> > 4) Earlier in CopyReadLine() function while trying to clear the EOL marker 
>> > from cstate->line_buf.data (copied data), we were not checking if the line 
>> > read by CopyReadLineText() function is a header line or not, but I can see 
>> > that your patch checks that before clearing the EOL marker. Any reason for 
>> > this extra check?
>> >
>>
>> If you look at the caller of CopyReadLine, i.e. NextCopyFromRawFields, it
>> does nothing for the header line; the server basically calls CopyReadLine
>> again, so it is a kind of small optimization. Anyway, the server is not
>> going to do anything with the header line, so I felt there is no need to
>> clear the EOL marker for header lines.
>> /* on input just throw the header line away */
>> if (cstate->cur_lineno == 0 && cstate->header_line)
>> {
>> cstate->cur_lineno++;
>> if (CopyReadLine(cstate))
>> return false; /* done */
>> }
>>
>> cstate->cur_lineno++;
>>
>> /* Actually read the line into memory here */
>> done = CopyReadLine(cstate);
>> I think no need to make a fix for this. Your thoughts?
>>
>> > 5) I noticed the below spurious line removal in the patch.
>> >
>> > @@ -3839,7 +3953,6 @@ static bool
>> >  CopyReadLine(CopyState cstate)
>> >  {
>> > boolresult;
>> > -
>> >
>>
>

Re: should libpq also require TLSv1.2 by default?

2020-06-24 Thread Magnus Hagander
On Wed, Jun 24, 2020 at 10:33 AM Daniel Gustafsson  wrote:

> > On 24 Jun 2020, at 08:39, Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
> >
> > In PG13, we raised the server-side default of ssl_min_protocol_version
> to TLSv1.2.  We also added a connection setting named
> ssl_min_protocol_version to libpq.  But AFAICT, the default value of the
> libpq setting is empty, so any protocol version will be accepted.  Is this
> what we wanted?  Should we raise the default in libpq as well?
>
> This was discussed [0] when the connection settings were introduced, and
> the
> consensus was to leave them alone [1] to allow for example a new pg_dump to
> work against an old server.  Re-reading the thread I think the argument
> still
> holds, but I was about to respond "yes, let's do this" before refreshing my
> memory.  Perhaps we should add a comment explaining this along the lines
> of the
> attached?
>
>
Another argument for not changing the default is that if you want to use
SSL in any meaningful way you have to *already* change the connection
string (with sslmode=require or verify-*), so it's not unreasonable to make
that consideration at the same time.

It might also be worth noting that it's not really "any protocol version",
it means it will be "whatever the openssl configuration says", I think? For
example, debian buster sets:

[system_default_sect]
MinProtocol = TLSv1.2

Which I believe means that if your libpq app is running on debian buster,
it will be min v1.2 already (and it would likely be more useful to use
ssl_min_protocol_version to *lower* that when connecting to older servers).
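
For example (hostnames made up):

  # ask libpq for at least TLSv1.2 on this connection
  psql "host=db.example.com sslmode=require ssl_min_protocol_version=TLSv1.2"

  # or set a lower floor when talking to an old server, subject to whatever
  # the local OpenSSL configuration still allows
  psql "host=legacy.example.com sslmode=require ssl_min_protocol_version=TLSv1"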

//Magnus


Re: pg_resetwal --next-transaction-id may cause database failed to restart.

2020-06-24 Thread movead...@highgo.ca

>Yeah, the normal workaround is to create the necessary file manually in
>order to let the system start after such an operation; they are
>sometimes necessary to enable testing weird cases with wraparound and
>such.  So a total rejection to work for these cases would be unhelpful
>precisely for the scenario that those switches were intended to serve.
I think these words should appear in the pg_resetwal documentation if we decide
to do nothing for this issue.
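
Something like the following could be documented as the manual workaround (a
rough sketch; the XID and the pg_xact segment name are only an example, the
segment name being the new XID divided by 1048576, written in hex):

  $ pg_resetwal -x 500000000 -D data
  $ # startup may fail with "could not access status of transaction ..."
  $ # because the matching pg_xact (clog) segment does not exist yet;
  $ # create it as a zero-filled file of one SLRU segment (32 x 8kB pages):
  $ dd if=/dev/zero of=data/pg_xact/01DC bs=8192 count=32
  $ pg_ctl -D data start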

>Maybe a better answer is to have a new switch in postmaster that creates
>any needed files (incl. producing associated WAL etc); so you'd run
>pg_resetwal -x some-value
>postmaster --create-special-stuff
>then start your server and off you go.
As shown in the documentation, it looks like the intent is to enforce a safe
input, so I think it's better to enforce that, and add an option to force
writing an unsafe value if necessary.
 
>Now maybe this is too much complication for a mechanism that really
>isn't for general consumption anyway.  I mean, if you're using
>pg_resetwal, you're already playing with fire.
Yes, that's true, I have always heard "you'd better not use pg_resetwal".
But since the tool is part of the PG code, it's better to improve it than to do nothing.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-06-24 Thread Michael Paquier
Hi Inoue-san,

On Wed, Jun 24, 2020 at 05:20:42PM +0900, Inoue, Hiroshi wrote:
> Where do you test, on Windows or on *nix?
> How do you test there?

I have been testing the driver on macos only, with various backend
versions, from 11 to 14.

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: should libpq also require TLSv1.2 by default?

2020-06-24 Thread Daniel Gustafsson
> On 24 Jun 2020, at 10:46, Magnus Hagander  wrote:

> It might also be worth noting that it's not really "any protocol version", it 
> means it will be "whatever the openssl configuration says", I think? For 
> example, debian buster sets:
> 
> [system_default_sect]
> MinProtocol = TLSv1.2
> 
> Which I believe means that if your libpq app is running on debian buster, it 
> will be min v1.2 already

Correct. That being said, I'm not sure how common it is for distributions to set
a default protocol version.  The macOS versions I have handy don't enforce a
default version, nor do Ubuntu 20.04, FreeBSD 12 or OpenBSD 6.5, AFAICT.

> (and it would likely be more useful to use ssl_min_protocol_version to 
> *lower* that when connecting to older servers).

That is indeed one use-case for the connection parameter.

cheers ./daniel



Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Bruce Momjian
On Wed, Jun 24, 2020 at 02:11:57PM +1200, David Rowley wrote:
> On Tue, 23 Jun 2020 at 08:24, Jeff Davis  wrote:
> > Another way of looking at it is that the weird behavior is already
> > there in v12, so there are already users relying on this weird behavior
> > as a crutch for some other planner mistake. The question is whether we
> > want to:
> >
> > (a) take the weird behavior away now as a consequence of implementing
> > disk-based HashAgg; or
> > (b) support the weird behavior forever; or
> > (c) introduce a GUC now to help transition away from the weird behavior
> >
> > The danger with (c) is that it gives users more time to become more
> > reliant on the weird behavior; and worse, a GUC could be seen as an
> > endorsement of the weird behavior rather than a path to eliminating it.
> > So we could intend to do (c) and end up with (b). We can mitigate this
> > with documentation warnings, perhaps.
> 
> So, I have a few thoughts on this subject. I understand both problem
> cases have been mentioned before on this thread, but just to reiterate
> the two problem cases that we really would rather people didn't hit.

I appreciated this summary since I wasn't fully following the issues.

> As for GUCs to try to help the group of users who, *I'm certain*, will
> have problems with PG13's plan choice. I think the overloaded
> enable_hashagg option is a really nice compromise.   We don't really
> have any other executor node type that has multiple GUCs controlling
> its behaviour, so I believe it would be nice to keep it that way.

So, in trying to anticipate how users will be affected by an API change,
I try to look at similar cases where we already have this behavior, and
how users react to this.  Looking at the available join methods, I think
we have one.  We currently support:

* nested loop with sequential scan
* nested loop with index scan
* hash join
* merge join

It would seem merge join has almost the same complexities as the new
hash join code, since it can spill to disk doing sorts for merge joins,
and adjusting work_mem is the only way to control that spill to disk.  I
don't remember anyone complaining about spills to disk during merge
join, so I am unclear why we would need such a control for hash join.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





PostgreSQL and big data - FDW

2020-06-24 Thread ROS Didier
Hi
I would like to use a Foreign Data Wrapper (FDW) to connect to a HADOOP cluster 
which uses KERBEROS authentication.
Is it possible to achieve this?  Which FDW should be used?

Thanks in advance

Best Regards
Didier ROS
EDF





Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

2020-06-24 Thread Amit Kapila
On Wed, Jun 24, 2020 at 12:41 PM Maciek Sakrejda  wrote:
>
> On Tue, Jun 23, 2020 at 7:55 PM Amit Kapila  wrote:
> > > I don't see any other reason for
> > > looping over the NL node itself in this plan. The Gather itself
> > > doesn't do any real looping, right?
> >
> > It is right that Gather doesn't do looping but Parallel Seq Scan node does 
> > so.
>
> Sorry, I still don't follow. How does a Parallel Seq Scan do looping?

Sorry, I intended to say that Parallel Seq Scan is involved in looping.
Let me try by example:

Gather (actual time=6.444..722.642 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Nested Loop (actual time=0.046..705.936 rows=5000 loops=2)
 ->  Parallel Seq Scan on t1 (actual time=0.010..45.423
rows=25 loops=2)
 ->  Index Scan using idx_t2 on t2 (actual time=0.002..0.002
rows=0 loops=50)
   Index Cond: (c1 = t1.c1)

In the above plan, each of the workers runs
NestLoop
   -> Parallel Seq Scan on t1
   -> Index Scan using idx_t2 on t2

So that leads to loops=2 on the "Parallel Seq Scan" and "Nested Loop"
nodes.  Does this make sense now?

> I looked at the parallel plan docs but I don't see looping mentioned
> anywhere[1]. Also, is looping not normally indicated on children,
> rather than on the node doing the looping? E.g., with a standard
> Nested Loop, the outer child will have loops=1 and the inner child
> will have loops equal to the row count produced by the outer child
> (and the Nested Loop itself will have loops=1 unless it also is being
> looped over by a parent node), right?
>

Yeah, I hope the above has clarified it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: min_safe_lsn column in pg_replication_slots view

2020-06-24 Thread Fujii Masao




On 2020/06/23 15:27, Amit Kapila wrote:

On Tue, Jun 23, 2020 at 7:47 AM Fujii Masao  wrote:


On 2020/06/23 10:10, Kyotaro Horiguchi wrote:

At Mon, 22 Jun 2020 22:02:51 +0900, Fujii Masao  
wrote in



I feel such a function is good to have but I am not sure if there is a
need to tie it with the removal of min_safe_lsn column.


We should expose the LSN calculated from
"the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)"?
This indicates the minimum LSN of WAL files that are guaranteed to be
currently retained by wal_keep_segments and max_slot_wal_keep_size.
That is, if a checkpoint occurs when the restart_lsn of a replication slot is
smaller than that minimum LSN, some required WAL files may be removed.
So DBAs can periodically monitor and compare restart_lsn and that minimum
LSN. If they frequently see that the difference of those LSNs is very small,
they can decide to increase wal_keep_segments or max_slot_wal_keep_size,
to prevent required WAL files from being removed. Thoughts?


I'm not sure about the consensus here about showing that number in the
view. It is similar to "remain" in the earlier versions of this patch
but a bit simpler. It would be usable in a similar way. I can live
with either number.


It's useless to display this value for each replication slot in the view.
I'm thinking of exposing it as a function.



Having a separate function for this seems like a good idea but can we
consider displaying it in a view like pg_stat_replication_slots, as we
are discussing in a nearby thread to have such a view for other things.
I think ultimately this information is required to check whether some
slot can be invalidated or not, so having it displayed along with
other slot information might not be a bad idea.


"the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)"
is the same value between all the replication slots. But you think it's better
to display that same value for every slots in the view?

Or you're thinking to display the difference of that LSN value and
restart_lsn as Horiguchi-san suggested? That diff varies each replication slot,
so it seems ok to display it for every rows.
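
(Just for illustration, until something is exposed, the per-slot distance and
the keep limit being discussed can be approximated with a query like the
following sketch; it does not special-case max_slot_wal_keep_size = -1,
i.e. no limit.)

  SELECT slot_name,
         pg_current_wal_lsn() - restart_lsn AS retained_bytes,
         greatest(current_setting('wal_keep_segments')::bigint * 16 * 1024 * 1024,
                  pg_size_bytes(current_setting('max_slot_wal_keep_size'))) AS keep_limit
  FROM pg_replication_slots;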

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: min_safe_lsn column in pg_replication_slots view

2020-06-24 Thread Fujii Masao




On 2020/06/24 8:39, Alvaro Herrera wrote:

On 2020-Jun-23, Kyotaro Horiguchi wrote:


At Tue, 23 Jun 2020 11:50:34 +0530, Amit Kapila  wrote 
in

On Mon, Jun 22, 2020 at 6:32 PM Fujii Masao  wrote:



We should expose the LSN calculated from
"the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)"?
This indicates the minimum LSN of WAL files that are guaranteed to be
currently retained by wal_keep_segments and max_slot_wal_keep_size.
That is, if a checkpoint occurs when the restart_lsn of a replication slot is
smaller than that minimum LSN, some required WAL files may be removed.

So DBAs can periodically monitor and compare restart_lsn and that minimum
LSN. If they frequently see that the difference of those LSNs is very small,
they can decide to increase wal_keep_segments or max_slot_wal_keep_size,
to prevent required WAL files from being removed. Thoughts?


+1.  This sounds like a good and useful stat for users.


+1 for showing a number that does not involve lastRemovedSegNo. It is
like returning to the initial version of this patch. That version showed a
number like ((the value suggested above) minus restart_lsn). The number is
different for each slot, so it fits in the view.

The number is usable for the same purpose, so I'm OK with it.


I think we should publish the value from wal_keep_segments separately
from max_slot_wal_keep_size.  ISTM that the user might decide to change
or remove wal_keep_segments and be suddenly at risk of losing slots
because of overlooking that it was wal_keep_segments, not
max_slot_wal_keep_size, that was protecting them.


You mean to have two functions that return

1. "current WAL LSN - wal_keep_segments * 16MB"
2. "current WAL LSN - max_slot_wal_keep_size"

Right?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: PostgreSQL and big data - FDW

2020-06-24 Thread Bruce Momjian
On Wed, Jun 24, 2020 at 09:05:30AM +, ROS Didier wrote:
> Hi
> 
> I would like to use a Foreign Data Wrapper (FDW) to connect to a HADOOP 
> cluster
> which uses KERBEROS authentication.
> 
> is it possible to achieve this ? which FDW should be used ?

Well, I would use the Hadoop FDW:

https://github.com/EnterpriseDB/hdfs_fdw

and it only supports these authentication methods:

Authentication Support

The FDW supports NOSASL and LDAP authentication modes. In order to use
NOSASL do not specify any OPTIONS while creating user mapping. For LDAP
username and password must be specified in OPTIONS while creating user 
mapping.

Not every FDW supports every Postgres server authentication method.
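
A minimal LDAP setup would look roughly like this (a sketch only; the username
and password user-mapping options follow the README text above, while the
server name and the host/port options are placeholders for whatever your
hdfs_fdw build expects):

  CREATE EXTENSION hdfs_fdw;
  CREATE SERVER hdfs_server
      FOREIGN DATA WRAPPER hdfs_fdw
      OPTIONS (host 'hiveserver.example.com', port '10000');
  CREATE USER MAPPING FOR CURRENT_USER
      SERVER hdfs_server
      OPTIONS (username 'ldap_user', password 'secret');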

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Assertion failure in pg_copy_logical_replication_slot()

2020-06-24 Thread Fujii Masao



On 2020/06/24 9:38, Alvaro Herrera wrote:

On 2020-Jun-23, Fujii Masao wrote:


If restart_lsn of logical replication slot gets behind more than
max_slot_wal_keep_size from the current LSN, the logical replication
slot would be invalidated and its restart_lsn is reset to an invalid LSN.
If this logical replication slot with an invalid restart_lsn is specified
as the source slot in pg_copy_logical_replication_slot(), the function
causes the following assertion failure.

 TRAP: FailedAssertion("!logical_slot", File: "slotfuncs.c", Line: 727)


Oops.


This assertion failure is caused by

/* Copying non-reserved slot doesn't make sense */
if (XLogRecPtrIsInvalid(src_restart_lsn))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("cannot copy a replication slot that doesn't 
reserve WAL")));


Heh, you pasted the code after your patch rather than the original.


oh sorry.



I think the errcode is a bit bogus considering the new case.
IMO ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is more appropriate.


Agreed. So I updated the patch so this errcode is used instead.
Patch attached.



One could argue that the error message could also be different for the
case of a logical slot (or even a physical slot that has the upcoming
"invalidated_at" LSN set), maybe "cannot copy a replication slot that
has been invalidated" but maybe that's a pointless distinction.
I don't object to the patch as presented.


I have no strong opinion about this, but for now I kept the message as it is.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/slotfuncs.c 
b/src/backend/replication/slotfuncs.c
index 06e4955de7..ce17d44227 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -723,12 +723,9 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool 
logical_slot)
 
/* Copying non-reserved slot doesn't make sense */
if (XLogRecPtrIsInvalid(src_restart_lsn))
-   {
-   Assert(!logical_slot);
ereport(ERROR,
-   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+   
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 errmsg("cannot copy a replication slot that 
doesn't reserve WAL")));
-   }
 
/* Overwrite params from optional arguments */
if (PG_NARGS() >= 3)


Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

2020-06-24 Thread Amit Kapila
On Tue, Jun 23, 2020 at 12:55 AM Maciek Sakrejda  wrote:
>
> Hello,
>
> I had some questions about the behavior of some accounting in parallel
> EXPLAIN plans. Take the following plan:
>
>
..
>
> The Nested Loop here aggregates data for metrics like `buffers read`
> from its workers, and to calculate a metric like `buffers read` for
> the parallel leader, we can subtract the values recorded in each
> individual worker. This happens in the Seq Scan and Index Scan
> children, as well. However, the Gather node appears to only include
> values from its direct parallel leader child (excluding that child's
> workers).
>

I have tried to check a similar plan and, for me, the values at the Gather
node seem to include the values from all workers and the leader
(aka whatever is displayed at the "Nested Loop" node), see below.  I have
tried the test on HEAD.  Which version of PostgreSQL are you using?
If you are also using the latest version then it is possible that in
some cases it is not displaying correct data. If that turns out to be
the case, then feel free to share the test case.  Sorry for the confusion
caused by my previous reply.

 Gather (actual time=2.083..550.093 rows=1 loops=1)
   Output: t1.c1, t1.c2, t2.c1, t2.c2
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=1012621 read=84
   I/O Timings: read=0.819
   ->  Nested Loop (actual time=0.084..541.882 rows= loops=3)
 Output: t1.c1, t1.c2, t2.c1, t2.c2
 Buffers: shared hit=1012621 read=84
 I/O Timings: read=0.819
 Worker 0:  actual time=0.069..541.249 rows=3155 loops=1
   Buffers: shared hit=326529 read=29
   I/O Timings: read=0.325
 Worker 1:  actual time=0.063..541.376 rows=3330 loops=1
   Buffers: shared hit=352045 read=26
   I/O Timings: read=0.179
 ->  Parallel Seq Scan on public.t1 (actual time=0.011..34.250
rows=17 loops=3)
   Output: t1.c1, t1.c2
   Buffers: shared hit=2703
   Worker 0:  actual time=0.011..33.785 rows=161265 loops=1
 Buffers: shared hit=872
   Worker 1:  actual time=0.009..34.582 rows=173900 loops=1
 Buffers: shared hit=940
 ->  Index Scan using idx_t2 on public.t2 (actual
time=0.003..0.003 rows=0 loops=50)
   Output: t2.c1, t2.c2
   Index Cond: (t2.c1 = t1.c1)
   Buffers: shared hit=1009918 read=84
   I/O Timings: read=0.819
   Worker 0:  actual time=0.003..0.003 rows=0 loops=161265
 Buffers: shared hit=325657 read=29
 I/O Timings: read=0.325
   Worker 1:  actual time=0.002..0.002 rows=0 loops=173900
 Buffers: shared hit=351105 read=26
 I/O Timings: read=0.179
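
So, working from the numbers above, the leader's own contribution at the
Nested Loop can be derived by subtracting the per-worker values, e.g. for
shared buffer hits:

  leader = 1012621 - (326529 + 352045) = 334047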

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Fujii Masao




On 2020/06/24 11:56, Kyotaro Horiguchi wrote:

At Tue, 23 Jun 2020 10:51:40 +0900, Michael Paquier  wrote 
in

On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote:

I still maintain that adding restrictions here is a bad idea. Even
disregarding the discussion of running normal queries interspersed, it's
useful to be able to both request WAL and receive logical changes over
the same connection. E.g. for creating a logical replica by first doing
a physical base backup (vastly faster), or fetching WAL for decoding
large transactions onto a standby.

And I just don't see any reasons to disallow it. There's basically no
reduction in complexity by doing so.


Yeah, I still stand by the same opinion here to do nothing.  I suspect
that we have good chances to annoy people and some cases we are
overlooking here, that used to work.


In logical replication, a replication role is intended to be
accessible only to the GRANTed databases.  On the other hand the same
role can create a dead copy of the whole cluster, including
non-granted databases.  It seems like a sieve missing a mesh screen.


Personally I'd like to disallow physical replication commands
when I explicitly reject physical replication connection
(i.e., set "host replication user x.x.x.x/x reject") in pg_hba.conf,
whether on physical or logical replication connection.



I agree that that doesn't harm as far as roles are strictly managed so
I don't insist so strongly on inhibiting the behavior. However, the
documentation at least needs amendment.


+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




[PATCH] COPY command's data format option allows only lowercase csv, text or binary

2020-06-24 Thread Bharath Rupireddy
Hi,

The COPY command's FORMAT option allows only all-lowercase csv, text or
binary; this is because strcmp is being used while parsing these
values.

It would be nice if uppercase or mixed-case format options such as
CSV, TEXT, BINARY, Csv, Text, Binary, and so on were also allowed.

To achieve this, pg_strcasecmp() is used instead of strcmp().
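
The gist of the change is simply swapping the comparison function, roughly
like this (a sketch of the idea, not the exact patch hunk):

  if (pg_strcasecmp(fmt, "text") == 0)
       /* default text format */ ;
  else if (pg_strcasecmp(fmt, "csv") == 0)
      cstate->csv_mode = true;
  else if (pg_strcasecmp(fmt, "binary") == 0)
      cstate->binary = true;
  else
      ereport(ERROR, ...);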

Attached is a patch with the above changes.

Request the community to review the patch, if it makes sense.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-COPY-command-s-data-format-option-allows-only-low.patch
Description: Binary data


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-24 Thread Amit Kapila
On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar  wrote:
>
> Here is the POC patch to discuss the idea of a cleanup of shared
> fileset on proc exit.  As discussed offlist,  here I am maintaining
> the list of shared fileset.  First time when the list is NULL I am
> registering the cleanup function with on_proc_exit routine.  After
> that for subsequent fileset, I am just appending it to filesetlist.
> There is also an interface to unregister the shared file set from the
> cleanup list and that is done by the caller whenever we are deleting
> the shared fileset manually.  While explaining it here, I think there
> could be one issue if we delete all the element from the list will
> become NULL and on next SharedFileSetInit we will again register the
> function.  Maybe that is not a problem but we can avoid registering
> multiple times by using some flag in the file
>

I don't understand what you mean by "using some flag in the file".

Review comments on various patches.

poc_shared_fileset_cleanup_on_procexit
=
1.
- ent->subxact_fileset =
- MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+ MemoryContext oldctx;

+ /* Shared fileset handle must be allocated in the persistent context */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+ ent->subxact_fileset = palloc(sizeof(SharedFileSet));
  SharedFileSetInit(ent->subxact_fileset, NULL);
+ MemoryContextSwitchTo(oldctx);
  fd = BufFileCreateShared(ent->subxact_fileset, path);

Why is this change required for this patch, and why do we only cover
SharedFileSetInit in the Apply context and not BufFileCreateShared?
The comment is also not very clear on this point.

2.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+ bool found = false;
+ ListCell *l;
+
+ Assert(filesetlist != NULL);
+
+ /* Loop over all the pending shared fileset entry */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+ /* remove the entry from the list and delete the underlying files */
+ if (input_fileset->number == fileset->number)
+ {
+ SharedFileSetDeleteAll(fileset);
+ filesetlist = list_delete_cell(filesetlist, l);

Why are we calling SharedFileSetDeleteAll here when in the caller we
have already deleted the fileset as per below code?
BufFileDeleteShared(ent->stream_fileset, path);
+ SharedFileSetUnregister(ent->stream_fileset);

I think it will be good if somehow we can remove the fileset from
filesetlist during BufFileDeleteShared.  If that is possible, then we
don't need a separate API for SharedFileSetUnregister.

3.
+static List * filesetlist = NULL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const
char *name);
 static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
  /* Register our cleanup callback. */
  if (seg)
  on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ else
+ {
+ if (filesetlist == NULL)
+ on_proc_exit(SharedFileSetOnProcExit, 0);

We use NIL for list initialization and comparison.  See lock_files usage.

4.
+SharedFileSetOnProcExit(int status, Datum arg)
+{
+ ListCell *l;
+
+ /* Loop over all the pending  shared fileset entry */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+ SharedFileSetDeleteAll(fileset);
+ }

We can initialize filesetlist as NIL after the for loop as it will
make the code look clean.

Comments on other patches:
=
5.
> 3. On concurrent abort we are truncating all the changes including
> some incomplete changes,  so later when we get the complete changes we
> don't have the previous changes,  e.g, if we had specinsert in the
> last stream and due to concurrent abort detection if we delete that
> changes later we will get spec_confirm without spec insert.  We could
> have simply avoided deleting all the changes, but I think the better
> fix is once we detect the concurrent abort for any transaction, then
> why do we need to collect the changes for that, we can simply avoid
> that.  So I have put that fix. (0006)
>

On similar lines, I think we need to skip processing message, see else
part of code in ReorderBufferQueueMessage.

6.
In v29-0002-Issue-individual-invalidations-with-wal_level-lo,
xact_desc_invalidations seems to be a subset of
standby_desc_invalidations, can we have a common code for them?

7.
I think we can avoid sending v29-0007-Track-statistics-for-streaming
this each time.  We can do this after the main patch is complete.
Also, we might need to change how and where these stats will be
tracked.  See the related discussion [1].

8. In v29-0005-Implement-streaming-mode-in-ReorderBuffer,
  * Return oldest transaction in reorderbuffer
@

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-24 Thread Amit Kapila
On Mon, Jun 22, 2020 at 11:56 AM Dilip Kumar  wrote:
>
> On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila  wrote:
> >
>
> > 8.
> > + /*
> > + * Start a transaction on stream start, this transaction will be committed
> > + * on the stream stop.  We need the transaction for handling the buffile,
> > + * used for serializing the streaming data and subxact info.
> > + */
> > + ensure_transaction();
> >
> > I think we need this for PrepareTempTablespaces to set the
> > temptablespaces.  Also, isn't it required for a cleanup of buffile
> > resources at the transaction end?  Are there any other reasons for it
> > as well?  The comment should be a bit more clear for why we need a
> > transaction here.
>
> I am not sure that will it make sense to add a comment here that why
> buffile and sharedfileset need a transaction?
>

You can say usage of BufFile interface expects us to be in the
transaction for so and so reason

  Do you think that we
> should add comment in buffile/shared fileset API that it should be
> called under a transaction?
>

I am fine with that as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-24 Thread Dilip Kumar
 iOn Wed, Jun 24, 2020 at 4:04 PM Amit Kapila  wrote:
>
> On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar  wrote:
> >
> > Here is the POC patch to discuss the idea of a cleanup of shared
> > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > the list of shared fileset.  First time when the list is NULL I am
> > registering the cleanup function with on_proc_exit routine.  After
> > that for subsequent fileset, I am just appending it to filesetlist.
> > There is also an interface to unregister the shared file set from the
> > cleanup list and that is done by the caller whenever we are deleting
> > the shared fileset manually.  While explaining it here, I think there
> > could be one issue if we delete all the element from the list will
> > become NULL and on next SharedFileSetInit we will again register the
> > function.  Maybe that is not a problem but we can avoid registering
> > multiple times by using some flag in the file
> >
>
> I don't understand what you mean by "using some flag in the file".

Basically, in the POC, as shown in the code snippet below, we are checking
that only if "filesetlist" is NULL do we register the on_proc_exit
function.  But, as described above, if all the items are deleted the
list will become NULL again.  So what I meant is that instead of checking
whether filesetlist is NULL, we can have just a boolean variable recording
whether we have already registered the callback, so we don't do it again.

@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
  /* Register our cleanup callback. */
  if (seg)
  on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ else
+ {
+ if (filesetlist == NULL)
+ on_proc_exit(SharedFileSetOnProcExit, 0);
+
+ filesetlist = lcons((void *)fileset, filesetlist);
+ }
 }

>
> Review comments on various patches.
>
> poc_shared_fileset_cleanup_on_procexit
> =
> 1.
> - ent->subxact_fileset =
> - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> + MemoryContext oldctx;
>
> + /* Shared fileset handle must be allocated in the persistent context */
> + oldctx = MemoryContextSwitchTo(ApplyContext);
> + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
>   SharedFileSetInit(ent->subxact_fileset, NULL);
> + MemoryContextSwitchTo(oldctx);
>   fd = BufFileCreateShared(ent->subxact_fileset, path);
>
> Why is this change required for this patch and why we only cover
> SharedFileSetInit in the Apply context and not BufFileCreateShared?
> The comment is also not very clear on this point.

Because only the sharedfileset and the filesetlist, which are allocated
under SharedFileSetInit, are required in the permanent context.
BufFileCreateShared only creates the BufFile and VFD, which will be
required only within the current stream, so the transaction context is
enough.

> 2.
> +void
> +SharedFileSetUnregister(SharedFileSet *input_fileset)
> +{
> + bool found = false;
> + ListCell *l;
> +
> + Assert(filesetlist != NULL);
> +
> + /* Loop over all the pending shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> +
> + /* remove the entry from the list and delete the underlying files */
> + if (input_fileset->number == fileset->number)
> + {
> + SharedFileSetDeleteAll(fileset);
> + filesetlist = list_delete_cell(filesetlist, l);
>
> Why are we calling SharedFileSetDeleteAll here when in the caller we
> have already deleted the fileset as per below code?
> BufFileDeleteShared(ent->stream_fileset, path);
> + SharedFileSetUnregister(ent->stream_fileset);
>
> I think it will be good if somehow we can remove the fileset from
> filesetlist during BufFileDeleteShared.  If that is possible, then we
> don't need a separate API for SharedFileSetUnregister.

But the filesetlist is maintained at the sharedfileset level, so even
if we delete from BufFileDeleteShared, we need to call an API from the
sharedfileset layer to unregister the fileset.  Am I missing
something?

> 3.
> +static List * filesetlist = NULL;
> +
>  static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
> +static void SharedFileSetOnProcExit(int status, Datum arg);
>  static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
> tablespace);
>  static void SharedFilePath(char *path, SharedFileSet *fileset, const
> char *name);
>  static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
> @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
>   /* Register our cleanup callback. */
>   if (seg)
>   on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> + else
> + {
> + if (filesetlist == NULL)
> + on_proc_exit(SharedFileSetOnProcExit, 0);
>
> We use NIL for list initialization and comparison.  See lock_files usage.

Right.

> 4.
> +SharedFileSetOnProcExit(int status, Datum arg)
> +{
> + ListCell *l;
> +
> + /* Loop over all the pending  shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (Sh

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-24 Thread Amit Kapila
On Wed, Jun 24, 2020 at 4:27 PM Dilip Kumar  wrote:
>
>  iOn Wed, Jun 24, 2020 at 4:04 PM Amit Kapila  wrote:
> >
> > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar  wrote:
> > >
> > > Here is the POC patch to discuss the idea of a cleanup of shared
> > > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > > the list of shared fileset.  First time when the list is NULL I am
> > > registering the cleanup function with on_proc_exit routine.  After
> > > that for subsequent fileset, I am just appending it to filesetlist.
> > > There is also an interface to unregister the shared file set from the
> > > cleanup list and that is done by the caller whenever we are deleting
> > > the shared fileset manually.  While explaining it here, I think there
> > > could be one issue if we delete all the element from the list will
> > > become NULL and on next SharedFileSetInit we will again register the
> > > function.  Maybe that is not a problem but we can avoid registering
> > > multiple times by using some flag in the file
> > >
> >
> > I don't understand what you mean by "using some flag in the file".
>
> Basically, in POC as shown in below code snippet,  We are checking
> that if the "filesetlist" is NULL then only register the on_proc_exit
> function.  But, as described above if all the items are deleted the
> list will be NULL.  So I told that instead of checking the filesetlist
> is NULL,  we can have just a boolean variable that if we have
> registered the callback then don't do it again.
>

Check if there is any precedent of the same in the code?

>
> >
> > Review comments on various patches.
> >
> > poc_shared_fileset_cleanup_on_procexit
> > =
> > 1.
> > - ent->subxact_fileset =
> > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > + MemoryContext oldctx;
> >
> > + /* Shared fileset handle must be allocated in the persistent context */
> > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > + MemoryContextSwitchTo(oldctx);
> >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> >
> > Why is this change required for this patch and why we only cover
> > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > The comment is also not very clear on this point.
>
> Because only the sharedfileset and the filesetlist, which are allocated
> under SharedFileSetInit, are required in the permanent context.
> BufFileCreateShared only creates the BufFile and VFD, which are
> required only within the current stream, so the transaction context is
> enough.
>

Okay, then add some more comments to explain it or if you have
explained it elsewhere, then add a reference for the same.

> > 2.
> > +void
> > +SharedFileSetUnregister(SharedFileSet *input_fileset)
> > +{
> > + bool found = false;
> > + ListCell *l;
> > +
> > + Assert(filesetlist != NULL);
> > +
> > + /* Loop over all the pending shared fileset entry */
> > + foreach (l, filesetlist)
> > + {
> > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> > +
> > + /* remove the entry from the list and delete the underlying files */
> > + if (input_fileset->number == fileset->number)
> > + {
> > + SharedFileSetDeleteAll(fileset);
> > + filesetlist = list_delete_cell(filesetlist, l);
> >
> > Why are we calling SharedFileSetDeleteAll here when in the caller we
> > have already deleted the fileset as per below code?
> > BufFileDeleteShared(ent->stream_fileset, path);
> > + SharedFileSetUnregister(ent->stream_fileset);
> >
> > I think it will be good if somehow we can remove the fileset from
> > filesetlist during BufFileDeleteShared.  If that is possible, then we
> > don't need a separate API for SharedFileSetUnregister.
>
> But the filesetlist is maintained at the sharedfileset level, so even
> if we delete from BufFileDeleteShared, we need to call an API from the
> sharedfileset layer to unregister the fileset.
>

Sure, but isn't it better if we can call such an API from BufFileDeleteShared?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2020-06-24 Thread Robert Haas
On Tue, Jun 23, 2020 at 4:00 PM Andres Freund  wrote:
> https://postgr.es/m/20130621000900.GA12425%40alap2.anarazel.de is a
> thread with more information / patches further along.
>
> I confused this patch with the approach in
> https://www.postgresql.org/message-id/d8576096-76ba-487d-515b-44fdedba8bb5%402ndquadrant.com
> sorry for that.  It obviously still differs by not having lower space
> overhead (by virtue of not having a 4 byte 'va_cmid', but no additional
> space for two methods, and then 1 byte overhead for 256 more), but
> that's not that fundamental a difference.

Wait a minute. Are we saying there are three (3) dueling patches for
adding an alternate TOAST algorithm? It seems like there are:

This "custom compression methods" thread - vintage 2017 - Original
code by Nikita Glukhov, later work by Ildus Kurbangaliev
The "pluggable compression support" thread - vintage 2013 - Andres Freund
The "plgz performance" thread - vintage 2019 - Petr Jelinek

Anyone want to point to a FOURTH implementation of this feature?

I guess the next thing to do is figure out which one is the best basis
for further work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Why forbid "INSERT INTO t () VALUES ();"

2020-06-24 Thread Fabien COELHO



Hello devs,

I would like to create an "all defaults" row, i.e. a row composed of the 
default values for all attributes, so I wrote:


  INSERT INTO t() VALUES ();

This is forbidden by postgres, and also sqlite.

Is there any good reason why this should be the case?

--
Fabien.




Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-24 Thread Thomas Kellerer
Fabien COELHO wrote on 24.06.2020 at 14:18:
> I would like to create an "all defaults" row, i.e. a row composed of the 
> default values for all attributes, so I wrote:
>
>   INSERT INTO t() VALUES ();
>
> This is forbidden by postgres, and also sqlite.
>
> Is there any good reason why this should be the case?
>

Maybe because

   insert into t default values;

exists (and is standard SQL if I'm not mistaken)

Thomas





Re: Default setting for enable_hashagg_disk

2020-06-24 Thread David Rowley
On Wed, 24 Jun 2020 at 21:06, Bruce Momjian  wrote:
> I
> don't remember anyone complaining about spills to disk during merge
> join, so I am unclear why we would need such a control for hash join.

Hash aggregate, you mean?   The reason is that upgrading to PG13 can
cause a performance regression for an underestimated ndistinct on the
GROUP BY clause and cause hash aggregate to spill to disk where it
previously did everything in RAM.   Sure, that behaviour was never
what we wanted to happen, Jeff has fixed that now, but the fact
remains that this does happen in the real world quite often and people
often get away with it, likely because work_mem is generally set to
some very conservative value.  Of course, there's also a bunch of
people that have been bitten by OOM due to this too. The "neverspill"
wouldn't be for those people.  Certainly, it's possible that we just
tell these people to increase work_mem for this query, that way they
can set it to something reasonable and still get spilling if it's
really needed to save them from OOM, but the problem there is that
it's not very easy to go and set work_mem for a certain query.

FWIW, I wish that I wasn't suggesting we do this, but I am because it
seems simple enough to implement and it removes a pretty big roadblock
that might exist for a small subset of people wishing to upgrade to
PG13. It seems lightweight enough to maintain, at least until we
invent some better management of how many executor nodes we can have
allocating work_mem at once.

The suggestion I made was just based on asking myself the following
set of questions:

Since Hash Aggregate has been able to overflow work_mem since day 1,
and now that we've completely changed that fact in PG13,  is that
likely to upset anyone?  If so, should we go to the trouble of giving
those people a way of getting the old behaviour back? If we do want to
help those people, what's the best way to make those options available
to them in a way that we can remove the special options with the least
pain in some future version of PostgreSQL?

I'd certainly be interested in hearing how other people would answer
those question.

David




Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Justin Pryzby
On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote:
> On Wed, Jun 24, 2020 at 02:11:57PM +1200, David Rowley wrote:
> > On Tue, 23 Jun 2020 at 08:24, Jeff Davis  wrote:
> > > Another way of looking at it is that the weird behavior is already
> > > there in v12, so there are already users relying on this weird behavior
> > > as a crutch for some other planner mistake. The question is whether we
> > > want to:
> > >
> > > (a) take the weird behavior away now as a consequence of implementing
> > > disk-based HashAgg; or
> > > (b) support the weird behavior forever; or
> > > (c) introduce a GUC now to help transition away from the weird behavior
> > >
> > > The danger with (c) is that it gives users more time to become more
> > > reliant on the weird behavior; and worse, a GUC could be seen as an
> > > endorsement of the weird behavior rather than a path to eliminating it.
> > > So we could intend to do (c) and end up with (b). We can mitigate this
> > > with documentation warnings, perhaps.
> > 
> > So, I have a few thoughts on this subject. I understand both problem
> > cases have been mentioned before on this thread, but just to reiterate
> > the two problem cases that we really would rather people didn't hit.
> 
> I appreciated this summary since I wasn't fully following the issues.
> 
> > As for GUCs to try to help the group of users who, *I'm certain*, will
> > have problems with PG13's plan choice. I think the overloaded
> > enable_hashagg option is a really nice compromise.   We don't really
> > have any other executor node type that has multiple GUCs controlling
> > its behaviour, so I believe it would be nice to keep it that way.
...
> It would seem merge join has almost the same complexities as the new
> hash join code, since it can spill to disk doing sorts for merge joins,
> and adjusting work_mem is the only way to control that spill to disk.  I
> don't remember anyone complaining about spills to disk during merge
> join, so I am unclear why we would need such a control for hash join.

It looks like merge join was new in 8.3.  I don't think that's a good analogy,
since the old behavior was still available with enable_mergejoin=off.

I think a better analogy would be if we now changed sort nodes beneath merge
join to use at most 0.5*work_mem, with no way of going back to using
1.0*work_mem.

-- 
Justin




RE: PostgreSQL and big data - FDW

2020-06-24 Thread ROS Didier
Hi Bruce

In the following link : 
https://www.enterprisedb.com/blog/connecting-hadoop-and-edb-postgres-shrink-big-data-challenges
We can see : 
"Support for various authentication methods (i.e. Kerberos, NOSASL, etc.)"

So HDFS_FDW supports Kerberos authentication. How can we be sure of that?
Could EDB make a clear statement on this point?

If so, how do we implement this method? Is there any documentation on this subject?

Thanks in advance.
Best Regards

Didier ROS
didier@edf.fr
Tél. : +33 6 49 51 11 88




-----Original Message-----
From: br...@momjian.us [mailto:br...@momjian.us]
Sent: Wednesday, June 24, 2020 11:13
To: ROS Didier 
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: PostgreSQL and big data - FDW

On Wed, Jun 24, 2020 at 09:05:30AM +, ROS Didier wrote:
> Hi
> 
> I would like to use a Foreign Data Wrapper (FDW) to connect to a 
> HADOOP cluster which uses KERBEROS authentication.
> 
> is it possible to achieve this ? which FDW should be used ?

Well, I would use the Hadoop FDW:

https://github.com/EnterpriseDB/hdfs_fdw

and it only supports these authentication methods:

Authentication Support

The FDW supports NOSASL and LDAP authentication modes. In order to use
NOSASL do not specify any OPTIONS while creating user mapping. For LDAP
username and password must be specified in OPTIONS while creating user 
mapping.

Not every FDW supports every Postgres server authentication method.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee











Re: min_safe_lsn column in pg_replication_slots view

2020-06-24 Thread Amit Kapila
On Wed, Jun 24, 2020 at 2:37 PM Fujii Masao  wrote:
>
> On 2020/06/23 15:27, Amit Kapila wrote:
> >
> > Having a separate function for this seems like a good idea but can we
> > consider displaying it in a view like pg_stat_replication_slots as we
> > are discussing a nearby thread to have such a view for other things.
> > I think ultimately this information is required to check whether some
> > slot can be invalidated or not, so having it displayed along with
> > other slot information might not be a bad idea.
>
> "the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)"
> is the same value for all the replication slots. But you think it's better
> to display that same value for every slot in the view?
>
> Or you're thinking to display the difference of that LSN value and
> restart_lsn as Horiguchi-san suggested?
>

I see value in Horiguchi-San's proposal.  IIUC, it will help
DBAs/users know whether any particular slot will get invalidated soon.

> That diff varies for each replication slot,
> so it seems ok to display it for every row.
>

Yes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary

2020-06-24 Thread Tom Lane
Bharath Rupireddy  writes:
> The COPY command's FORMAT option allows only all-lowercase csv, text or
> binary; this is because strcmp is being used while parsing these
> values.

This is nonsense, actually:

regression=# create table foo (f1 int);
CREATE TABLE
regression=# copy foo from stdin (format CSV);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.

As that shows, there's already a round of lowercasing done by the parser.
The only way that strcasecmp in copy.c would be useful is if you wanted to
accept things like
copy foo from stdin (format "CSV");
I don't find that to be a terribly good idea.  The normal implication
of quoting is that it *prevents* case folding, so why should that
happen anyway?

More generally, though, why would we want to change this policy only
here?  I believe we're reasonably consistent about letting the parser
do any required down-casing and then just checking keyword matches
with strcmp.
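
To make that concrete, here is a tiny standalone sketch (not the actual
copy.c code) of why a plain strcmp suffices once the parser has down-cased an
unquoted option value:

#include <stdbool.h>
#include <string.h>

/*
 * Illustrative only: by the time the option value reaches copy.c, the grammar
 * has already down-cased an unquoted word, so FORMAT CSV, format csv and
 * FoRmAt CsV all arrive here as "csv".  A double-quoted "CSV" keeps its case
 * and is rejected, which is the intended behaviour.
 */
static bool
format_is_csv(const char *fmt)
{
	return strcmp(fmt, "csv") == 0;
}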

regards, tom lane




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-24, Kyotaro Horiguchi wrote:

> In logical replication, a replication role is intended to be
> accessible only to the GRANTed databases.  On the other hand the same
> role can create a dead copy of the whole cluster, including
> non-granted databases.

In other words -- essentially, if you grant replication access to a role
only to a specific database, they can steal the whole cluster.

I don't see what's so great about that, but apparently people like it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Assertion failure in pg_copy_logical_replication_slot()

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-24, Fujii Masao wrote:

> > I think the errcode is a bit bogus considering the new case.
> > IMO ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is more appropriate.
> 
> Agreed. So I updated the patch so this errcode is used instead.
> Patch attached.

LGTM.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_resetwal --next-transaction-id may cause database failed to restart.

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-24, movead...@highgo.ca wrote:

> >Maybe a better answer is to have a new switch in postmaster that creates
> >any needed files (incl. producing associated WAL etc); so you'd run
> >pg_resetwal -x some-value
> >postmaster --create-special-stuff
> >then start your server and off you go.
>
> As shown in the document, it looks like the intent is to enforce a safe
> input value, so I think it's better to enforce that and add an option to
> force writing an unsafe value if necessary.

ISTM that a reasonable compromise is that if you use -x (or -c, -m, -O)
and the input value is outside the range supported by existing files,
then it's a fatal error; unless you use --force, which turns it into
just a warning.
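
A rough sketch of that compromise, with entirely generic error reporting (the
real pg_resetwal has its own logging helpers; the function and flag names here
are illustrative):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: reject out-of-range values unless --force is given. */
static void
check_requested_value(bool within_existing_files, bool force)
{
	if (within_existing_files)
		return;

	if (!force)
	{
		fprintf(stderr, "error: requested value is outside the range covered by existing files\n");
		fprintf(stderr, "hint: use --force to override\n");
		exit(1);
	}

	fprintf(stderr, "warning: requested value is outside the range covered by existing files\n");
}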

> >Now maybe this is too much complication for a mechanism that really
> >isn't for general consumption anyway.  I mean, if you're using
> >pg_resetwal, you're already playing with fire.
> Yes, that's true; I have always heard the advice "You'd better not use pg_resetwal".
> But since the tool is part of the PG code, it's better to improve it than do nothing.

Sure.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: min_safe_lsn column in pg_replication_slots view

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-24, Fujii Masao wrote:

> On 2020/06/24 8:39, Alvaro Herrera wrote:

> > I think we should publish the value from wal_keep_segments separately
> > from max_slot_wal_keep_size.  ISTM that the user might decide to change
> > or remove wal_keep_segments and be suddenly at risk of losing slots
> > because of overlooking that it was wal_keep_segments, not
> > max_slot_wal_keep_size, that was protecting them.
> 
> You mean to have two functions that returns
> 
> 1. "current WAL LSN - wal_keep_segments * 16MB"
> 2. "current WAL LSN - max_slot_wal_keep_size"

Hmm, but all the values there are easily findable.  What would be the
point in repeating it?

Maybe we should disregard this line of thinking and go back to
Horiguchi-san's original proposal, to wit use the "distance to
breakage", as also supported now by Amit Kapila[1] (unless I
misunderstand him).

[1] 
https://postgr.es/m/CAA4eK1L2oJ7T1cESdc5w4J9L3Q_hhvWqTigdAXKfnsJy4=v...@mail.gmail.com
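
For readers following along, a rough standalone sketch of the per-slot
"distance to breakage" idea (simplified: it ignores wal_keep_segments and
segment-boundary rounding, and all names are illustrative):

#include <stdint.h>

typedef uint64_t XLogRecPtrSketch;	/* stand-in for the real XLogRecPtr */

/*
 * Bytes of additional WAL that can be generated before this slot's
 * restart_lsn falls behind the removal horizon implied by
 * max_slot_wal_keep_size; zero means the slot is already at risk.
 */
static uint64_t
distance_to_breakage(XLogRecPtrSketch current_lsn,
					 XLogRecPtrSketch restart_lsn,
					 uint64_t max_slot_wal_keep_size_bytes)
{
	XLogRecPtrSketch horizon;

	if (current_lsn <= max_slot_wal_keep_size_bytes)
		return UINT64_MAX;		/* no WAL can be removed yet */

	horizon = current_lsn - max_slot_wal_keep_size_bytes;
	return (restart_lsn > horizon) ? (restart_lsn - horizon) : 0;
}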

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PostgreSQL and big data - FDW

2020-06-24 Thread Jeevan Ladhe
On Wed, Jun 24, 2020 at 6:09 PM ROS Didier  wrote:

> Hi Bruce
>
> In the following link :
> https://www.enterprisedb.com/blog/connecting-hadoop-and-edb-postgres-shrink-big-data-challenges
> We can see :
> "Support for various authentication methods (i.e. Kerberos, NOSASL, etc.)"
>
> So HDFS_FDW supports Kerberos authentication. How can we be sure of that?
> Could EDB make a clear statement on this point?
>

HDFS_FDW does not support Kerberos authentication.
The sentence you have pasted above is from the wish list, or rather the TODO
list; here is what it says:
"Currently the HDFS_FDW only provides READ capabilities but EDB is planning
the following additional functionality:"

The functionality was not implemented. I think part of the confusion might be
due to the formatting of the list in the blog.

You can follow the README[1] of HDFS_FDW to get an idea of how to use it.

[1] https://github.com/EnterpriseDB/hdfs_fdw/blob/master/README.md

Regards,
Jeevan


Re: extensible options syntax for replication parser?

2020-06-24 Thread Robert Haas
On Sun, Jun 14, 2020 at 3:15 AM Fabien COELHO  wrote:
> > so instead I'd like to have a better way to do it.
>
> > Attached is v1 of a patch to refactor things so that parts of the
> > BASE_BACKUP and CREATE_REPLICATION_SLOT are replaced with a flexible
> > options syntax.
>
> Patch applies cleanly, however compilation fails on:
>
>repl_gram.y:271:106: error: expected ‘;’ before ‘}’

Oops. I'm surprised my compiler didn't complain.

> Getting rid of "ident_or_keyword", some day, would be a relief.

Actually, I think that particular thing is a sign that things are
being done correctly. You need something like that if you have
contexts where you want to treat keywords and non-keywords the same
way, and it's generally good to have such places. In fact, this could
probably be profitably used in more places in the replication grammar.

> For boolean options, you are generating (EXPORT_SNAPSHOT TRUE) where
> (EXPORT_SNAPSHOT) would do.

True, but it doesn't seem to matter much one way or the other. I
thought this way looked a little clearer.

> Maybe allowing (!EXPORT_SNAPSHOT) for (FOO FALSE) would be nice, if it is
> not allowed yet. That would also allow to get rid of FOO/NOFOO variants if
> any for FOO & !FOO, so one keyword is enough for a concept.

Well, the goal for this is not to need ANY new keywords for a new
concept, but !FOO would be inconsistent with other places where we do
similar things (e.g. EXPLAIN, VACUUM, COPY) so I don't think that's
the way to go.

> > There are some debatable decisions here, so I'd be happy to get some
> > feedback on whether to go further with this, or less far, or maybe even
> > just abandon the idea altogether. I doubt the last one is the right
> > course, though: ISTM something's got to be done about the BASE_BACKUP
> > case, at least.
>
> ISTM that it would be better to generalize the approach to all commands
> which accept options, so that the syntax is homogeneous.

As a general principle, sure, but it's always debatable how far to
take things in particular cases. For instance, in the cases of
EXPLAIN, VACUUM, and COPY, the relation name is given as a dedicated
piece of syntax, not an option. It could be given as an option, but
since it's mandatory and important, we didn't. In the case of COPY,
the source file is also specified via dedicated syntax, rather than an
option. So we have to make the same kinds of decisions here. For
example, for CREATE_REPLICATION_SLOT, one could argue that PHYSICAL
and LOGICAL should be moved to the extensible options list instead of
being kept as separate syntax. However, that seems somewhat
inconsistent with the long-standing syntax for START_REPLICATION,
which already does use extensible options:

START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ ( option_name
[option_value] [, ... ] ) ]
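
For context, consuming such an extensible options list in the backend
typically looks something like the following sketch (illustrative only,
assuming the usual backend headers; the option names and the function are not
from the patch):

static void
parse_sketch_options(List *options, bool *fast, char *label, Size label_sz)
{
	ListCell   *lc;

	foreach(lc, options)
	{
		DefElem    *defel = (DefElem *) lfirst(lc);

		if (strcmp(defel->defname, "checkpoint") == 0)
			*fast = (strcmp(defGetString(defel), "fast") == 0);
		else if (strcmp(defel->defname, "label") == 0)
			strlcpy(label, defGetString(defel), label_sz);
		else
			ereport(ERROR,
					(errcode(ERRCODE_SYNTAX_ERROR),
					 errmsg("unrecognized option \"%s\"", defel->defname)));
	}
}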

On balance I am comfortable with what the patch does, but other people
might have a different take.

> Just wondering: ISTM that the patch implies that dumping a v14 db
> generates the new syntax, which makes sense. Now I see 4 use cases wrt to
> version.
>
>   #  sourcetarget   comment
>   1  v < 14v < 14   probably the dump would use one of the older version
>   2  v < 14v >= 14  upgrade
>   3  v >= 14   v < 14   downgrade: oops, the output uses the new syntax
>   4  v >= 14   v >= 14  ok
>
> Both cross version usages may be legitimate. In particular, 3 (oops,
> hardware issue, I have to move the db to a server where pg has not been
> upgraded) seems not possible because the generated syntax uses the new
> approach. Should/could there be some option to tell "please generate vXXX
> syntax" to allow that?

I don't think dumping a DB is really affected by any of this. AFAIK,
replication commands aren't used in pg_dump output. It just affects
pg_basebackup and the server, and you'll notice that I have taken
pains to allow the server to continue to accept the current format,
and to allow pg_basebackup to generate that format when talking to an
older server.

Thanks for the review. v2 attached, hopefully fixing the compilation
issue you mentioned.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v2-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch
Description: Binary data


Re: Avoiding hash join batch explosions with extreme skew and weird stats

2020-06-24 Thread Jesse Zhang
Hi Tomas,

On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra wrote:
>
> Now, a couple comments / questions about the code.
>
>
> nodeHash.c
> --
>
>
> 1) MultiExecPrivateHash says this
>
>/*
> * Not subject to skew optimization, so either insert normally
> * or save to batch file if it belongs to another stripe
> */
>
> I wonder what it means to "belong to another stripe". I understand what
> that means for batches, which are identified by batchno computed from
> the hash value. But I thought "stripes" are just work_mem-sized pieces
> of a batch, so I don't quite understand this. Especially when the code
> does not actually check "which stripe" the row belongs to.

I have to concur that "stripe" did inspire a RAID vibe when I heard it,
but it seemed to be a better name than what it replaces.

> 3) I'm a bit puzzled about this formula in ExecHashIncreaseNumBatches
>
>childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | 
> hashtable->curbatch;
>
> and also about this comment
>
>/*
> * TODO: what to do about tuples that don't go to the child
> * batch or stay in the current batch? (this is why we are
> * counting tuples to child and curbatch with two diff
> * variables in case the tuples go to a batch that isn't the
> * child)
> */
>if (batchno == childbatch)
>  childbatch_outgoing_tuples++;
>
> I thought each old batch is split into two new ones, and the tuples
> either stay in the current one, or are moved to the new one - which I
> presume is the childbatch, although I haven't tried to decode that
> formula. So where else could the tuple go, as the comment tried to
> suggest?

True, every old batch is split into two new ones, if you only consider
tuples coming from the batch file that _still belong in there_. i.e.
there are tuples in the old batch file that belong to a future batch. As
an example, if the current nbatch = 8, and we want to expand to nbatch =
16, (old) batch 1 will split into (new) batch 1 and batch 9, but it can
already contain tuples that need to go into (current) batches 3, 5, and
7 (soon-to-be batches 11, 13, and 15).
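
As a side note for readers, here is a tiny standalone sketch of the
batch-number arithmetic when nbatch doubles (generic hash-partitioning math;
the real code derives batchno from the hash value slightly differently):

#include <stdint.h>
#include <stdio.h>

/* With nbatch a power of two, batchno is just the low bits of the hash. */
static uint32_t
batch_for(uint32_t hash, uint32_t nbatch)
{
	return hash & (nbatch - 1);
}

int
main(void)
{
	uint32_t hash = 0xE9;	/* arbitrary example hash value */
	uint32_t oldb = batch_for(hash, 8);		/* -> 1 */
	uint32_t newb = batch_for(hash, 16);	/* -> 9 */

	/* When nbatch doubles, a tuple either stays put or moves to oldb + 8. */
	printf("old batch %u -> new batch %u\n", oldb, newb);
	return 0;
}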

Cheers,
Jesse




Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary

2020-06-24 Thread Robert Haas
On Wed, Jun 24, 2020 at 10:27 AM Tom Lane  wrote:
> More generally, though, why would we want to change this policy only
> here?  I believe we're reasonably consistent about letting the parser
> do any required down-casing and then just checking keyword matches
> with strcmp.

I've had the feeling in the past that our use of pg_strcasecmp() was a
bit random. Looking through the output of 'git grep pg_strcasecmp', it
seems like we don't typically don't use it on option names, but
sometimes we use it on option values. For instance, DefineCollation()
uses pg_strcasecmp() on the collproviderstr, and DefineType() uses it
on storageEl; and also, not to be overlooked, defGetBoolean() uses it
when matching true/false/on/off, which probably affects a lot of
places. On the other hand, ExplainQuery() matches the format using
plain old strcmp(), despite indirectly using pg_strcasecmp() for the
Boolean parameters. So I guess I'm not really convinced that there is
all that much consistency here.

Mind you, I'm not sure whether or not anything really needs to be
changed, or exactly what ought to be done. I'm just making the
observation that it might not be as consistent as you may think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Stephen Frost
Greetings,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> On 2020-Jun-24, Kyotaro Horiguchi wrote:
> 
> > In logical replication, a replication role is intended to be
> > accessible only to the GRANTed databases.  On the other hand the same
> > role can create a dead copy of the whole cluster, including
> > non-granted databases.
> 
> In other words -- essentially, if you grant replication access to a role
> only to a specific database, they can steal the whole cluster.
> 
> I don't see what's so great about that, but apparently people like it.

Sure, people who aren't in charge of security I'm sure like the ease of
use.

Doesn't mean it makes sense or that we should be supporting that.  What
we should have is a way to allow administrators to configure a system
for exactly what they want to allow, and it doesn't seem like we're
doing that today and therefore we should fix it.  This isn't the only
area we have that issue in.

Thanks,

Stephen




Re: PostgreSQL and big data - FDW

2020-06-24 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Wed, Jun 24, 2020 at 09:05:30AM +, ROS Didier wrote:
> > I would like to use a Foreign Data Wrapper (FDW) to connect to a HADOOP 
> > cluster
> > which uses KERBEROS authentication.

Sadly, not really.

> > is it possible to achieve this ? which FDW should be used ?
> 
> Well, I would use the Hadoop FDW:
> 
>   https://github.com/EnterpriseDB/hdfs_fdw
> 
> and it only supports these authentication methods:
> 
>   Authentication Support
> 
>   The FDW supports NOSASL and LDAP authentication modes. In order to use
>   NOSASL do not specify any OPTIONS while creating user mapping. For LDAP
>   username and password must be specified in OPTIONS while creating user 
> mapping.
> 
> Not every FDW supports every Postgres server authentication method.

That isn't really the issue here, the problem is really that the GSSAPI
support in PG today doesn't support credential delegation- if it did,
then the HDFS FDW (and the postgres FDW) could be easily extended to
leverage those delegated credentials to connect.

That's been something that's been on my personal todo list of things to
work on but unfortunately I've not, as yet, had time to go implement.  I
don't actually think it would be very hard- if someone writes it, I'd
definitely review it.

Thanks,

Stephen




Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary

2020-06-24 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jun 24, 2020 at 10:27 AM Tom Lane  wrote:
>> More generally, though, why would we want to change this policy only
>> here?  I believe we're reasonably consistent about letting the parser
>> do any required down-casing and then just checking keyword matches
>> with strcmp.

> ... Mind you, I'm not sure whether or not anything really needs to be
> changed, or exactly what ought to be done. I'm just making the
> observation that it might not be as consistent as you may think.

Yeah, I'm sure there are a few inconsistencies.  We previously made a
pass to get rid of pg_strcasecmp for anything that had been through
the parser's downcasing (commit fb8697b31) but I wouldn't be surprised
if that missed a few cases, or if new ones have snuck in.  Anyway,
"don't use pg_strcasecmp unnecessarily" was definitely the agreed-to
policy as of Jan 2018.

My vague recollection is that there are a few exceptions (defGetBoolean
may well be one of them) where pg_strcasecmp still seemed necessary
because the input might not have come through the parser in some usages.

regards, tom lane




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-24, Stephen Frost wrote:

> Doesn't mean it makes sense or that we should be supporting that.  What
> we should have is a way to allow administrators to configure a system
> for exactly what they want to allow, and it doesn't seem like we're
> doing that today and therefore we should fix it.  This isn't the only
> area we have that issue in.

The way to do that, for the case under discussion, is to reject using a
logical replication connection for physical replication commands.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Bruce Momjian
On Wed, Jun 24, 2020 at 07:38:43AM -0500, Justin Pryzby wrote:
> On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote:
> > It would seem merge join has almost the same complexities as the new
> > hash join code, since it can spill to disk doing sorts for merge joins,
> > and adjusting work_mem is the only way to control that spill to disk.  I
> > don't remember anyone complaining about spills to disk during merge
> join, so I am unclear why we would need such a control for hash join.
> 
> It looks like merge join was new in 8.3.  I don't think that's a good 
> analogy,
> since the old behavior was still available with enable_mergejoin=off.

Uh, we don't guarantee backward compatibility in the optimizer.  You can
turn off hashagg if you want.  That doesn't get you back the pre-PG 13
behavior, but we don't guarantee that.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Bruce Momjian
On Thu, Jun 25, 2020 at 12:24:29AM +1200, David Rowley wrote:
> On Wed, 24 Jun 2020 at 21:06, Bruce Momjian  wrote:
> > I
> > don't remember anyone complaining about spills to disk during merge
> > join, so I am unclear why we would need such a control for hash join.
> 
> Hash aggregate, you mean?   The reason is that upgrading to PG13 can

Yes, sorry.

> cause a performance regression for an underestimated ndistinct on the
> GROUP BY clause and cause hash aggregate to spill to disk where it
> previously did everything in RAM.   Sure, that behaviour was never
> what we wanted to happen, Jeff has fixed that now, but the fact
> remains that this does happen in the real world quite often and people
> often get away with it, likely because work_mem is generally set to
> some very conservative value.  Of course, there's also a bunch of
> people that have been bitten by OOM due to this too. The "neverspill"
> wouldn't be for those people.  Certainly, it's possible that we just
> tell these people to increase work_mem for this query, that way they
> can set it to something reasonable and still get spilling if it's
> really needed to save them from OOM, but the problem there is that
> it's not very easy to go and set work_mem for a certain query.

Well, my point is that merge join works that way, and no one has needed
a knob to avoid mergejoin if it is going to spill to disk.  If they are
adjusting work_mem to prevent spill of merge join, they can do the same
for hash agg.  We just need to document this in the release notes.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: xid wraparound danger due to INDEX_CLEANUP false

2020-06-24 Thread Robert Haas
On Fri, May 22, 2020 at 4:40 PM Peter Geoghegan  wrote:
> On Mon, May 18, 2020 at 7:32 PM Masahiko Sawada
>  wrote:
> > I've attached WIP patch for HEAD. With this patch, the core pass
> > index_cleanup to bulkdelete and vacuumcleanup callbacks so that they
> > can make decision whether run vacuum or not.
> >
> > If the direction of this patch seems good, I'll change the patch so
> > that we pass something information to these callbacks indicating
> > whether this vacuum is anti-wraparound vacuum. This is necessary
> > because it's enough to invoke index cleanup before XID wraparound as
> > per discussion so far.
>
> It. seems like the right direction to me. Robert?

Sorry, I'm so far behind on my email. Argh.

I think, especially given the blog post you linked, that we should aim to
have INDEX_CLEANUP OFF mode do the minimum possible amount of work
while still keeping us safe against transaction ID wraparound. So if,
for example, it's desirable but not imperative for BRIN to
resummarize, then it's OK normally but should be skipped with
INDEX_CLEANUP OFF.

I find the patch itself confusing and the comments inadequate,
especially the changes to lazy_scan_heap(). Before the INDEX_CLEANUP
patch went into the tree, LVRelStats had a member hasindex indicating
whether or not the table had any indexes. The patch changed that
member to useindex, indicating whether or not we were going to do
index vacuuming; thus, it would be false if either there were no
indexes or if we were going to ignore them. This patch redefines
useindex to mean whether or not the table has any indexes, but without
renaming the variable. There's also really no comments anywhere in the
vacuumlazy.c explaining the overall scheme here; what are we actually
doing? Apparently, what we're really doing here is that even when
INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples.
That seems sad, but if it's what we have to do then it at least needs
comments explaining it.
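
For clarity, a small sketch of the distinction being asked for (the struct and
field names are illustrative, not a proposal for the actual LVRelStats
layout):

#include <stdbool.h>

/* Illustrative only: keep "has indexes" separate from "will vacuum them". */
typedef struct LVRelStatsSketch
{
	bool	hasindex;	/* the table has at least one index */
	bool	useindex;	/* this VACUUM will actually vacuum/clean them */
} LVRelStatsSketch;

static void
set_index_flags(LVRelStatsSketch *stats, int nindexes, bool index_cleanup)
{
	stats->hasindex = (nindexes > 0);
	stats->useindex = stats->hasindex && index_cleanup;
}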

As for the btree portion of the change, I expect you understand this
better than I do, so I'm reluctant to stick my neck out, but it seems
that what the patch does is force index cleanup to happen even when
INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and
the btree index has at least 1 recyclable page. My first reaction is
to wonder whether that doesn't nerf this feature into oblivion. Isn't
it likely that an index that is being vacuumed for wraparound will
have a recyclable page someplace? If the presence of that 1 recyclable
page causes the "help, I'm about to run out of XIDs, please do the
least possible work" flag to become a no-op, I don't think users are
going to be too happy with that. Maybe I am misunderstanding.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Tom Lane
Justin Pryzby  writes:
> On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote:
>> It would seem merge join has almost the same complexities as the new
>> hash join code, since it can spill to disk doing sorts for merge joins,
>> and adjusting work_mem is the only way to control that spill to disk.  I
>> don't remember anyone complaining about spills to disk during merge
>> join, so I am unclear why we would need such a control for hash join.

> It looks like merge join was new in 8.3.  I don't think that's a good 
> analogy,
> since the old behavior was still available with enable_mergejoin=off.

Uh, what?  A look into our git history shows immediately that
nodeMergejoin.c has been there since the initial code import in 1996.

I tend to agree with Bruce that it's not very obvious that we need
another GUC knob here ... especially not one as ugly as this.
I'm especially against the "neverspill" option, because that makes a
single GUC that affects both the planner and executor independently.

If we feel we need something to let people have the v12 behavior
back, let's have
(1) enable_hashagg on/off --- controls planner, same as it ever was
(2) enable_hashagg_spill on/off --- controls executor by disabling spill

But I'm not really convinced that we need (2).

regards, tom lane




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-06-24 Thread James Coleman
On Fri, Jun 19, 2020 at 12:04 AM Justin Pryzby  wrote:
>
> On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> > On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby  wrote:
> > > On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
> > > > I've pushed the fist part of this patch series - I've reorganized it a
> > >
> > > I scanned through this again post-commit.  Find attached some suggestions.
> > >
> > > Shouldn't non-text explain output always show both disk *and* mem, 
> > > including
> > > zeros ?
> >
> > Could you give more context on this? Is there a standard to follow?
> > Regular sort nodes only ever report one type, so there's not a good
> > parallel there.
>
> The change I proposed was like:
>
> @@ -2829,7 +2829,6 @@ 
> show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
>
> ExplainPropertyList("Sort Methods Used", methodNames, es);
>
> -   if (groupInfo->maxMemorySpaceUsed > 0)
> {
> longavgSpace = 
> groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
> const char *spaceTypeName;
> ...
> -   if (groupInfo->maxDiskSpaceUsed > 0)
> {
> ...
>
> To show in non-text format *both* disk and memory space used, even if zero.
>
> I still think that's what's desirable.

I'm of the opposite opinion. I believe showing both unnecessarily is confusing.

> If it's important to show *whether* a sort space was used, then I think there
> should be a boolean, or an int 0/1.  But I don't think it's actually needed,
> since someone parsing the explain output could just check
> |if _dict['Peak Sort Space Used'] > 0: ...
> the same as you're doing, without having to write some variation on:
> |if 'Peak Sort Space Used' in _dict and _dict['Peak Sort Space Used'] > 0: ...

I think it's desirable for code to be explicit about the type having
been used rather than implicitly assuming it based on 0/non-zero
values.

James




Re: should libpq also require TLSv1.2 by default?

2020-06-24 Thread Peter Eisentraut

On 2020-06-24 10:33, Daniel Gustafsson wrote:
>> In PG13, we raised the server-side default of ssl_min_protocol_version to
>> TLSv1.2.  We also added a connection setting named ssl_min_protocol_version to
>> libpq.  But AFAICT, the default value of the libpq setting is empty, so any
>> protocol version will be accepted.  Is this what we wanted?  Should we raise
>> the default in libpq as well?
> 
> This was discussed [0] when the connection settings were introduced, and the
> consensus was to leave them alone [1] to allow for example a new pg_dump to
> work against an old server.  Re-reading the thread I think the argument still
> holds, but I was about to respond "yes, let's do this" before refreshing my
> memory.  Perhaps we should add a comment explaining this along the lines of the
> attached?
> 
> [0] https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org
> [1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us


ISTM that these discussions went through the same questions and 
arguments that were made regarding the server-side change but arrived at 
a different conclusion.  So I suggest to reconsider this so that we 
don't ship with contradictory results.


That doesn't necessarily mean that we have to make a change, but we 
should make sure our rationale is sound.


Note that all OpenSSL versions that do not support TLSv1.2 also do not 
support TLSv1.1.  So by saying, in effect, that TLSv1.2 is too new to 
require, we are saying that we need to keep supporting TLSv1.0 -- which 
is heavily deprecated.  Also note that the first OpenSSL version with 
support for TLSv1.2 shipped on March 14, 2012.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow CURRENT_ROLE in GRANTED BY

2020-06-24 Thread Peter Eisentraut

On 2020-06-24 10:12, Vik Fearing wrote:
> On 6/24/20 8:35 AM, Peter Eisentraut wrote:
>> I was checking some loose ends in SQL conformance, when I noticed: We
>> support GRANT role ... GRANTED BY CURRENT_USER, but we don't support
>> CURRENT_ROLE in that place, even though in PostgreSQL they are
>> equivalent.  Here is a trivial patch to add that.
> 
> The only thing that isn't dead-obvious about this patch is the commit
> message says "[PATCH 1/2]".  What is in the other part?


Hehe.  The second patch is some in-progress work to add the GRANTED BY 
clause to the regular GRANT command.  More on that perhaps at a later date.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: should libpq also require TLSv1.2 by default?

2020-06-24 Thread Bruce Momjian
On Wed, Jun 24, 2020 at 07:57:31PM +0200, Peter Eisentraut wrote:
> On 2020-06-24 10:33, Daniel Gustafsson wrote:
> > > In PG13, we raised the server-side default of ssl_min_protocol_version to 
> > > TLSv1.2.  We also added a connection setting named 
> > > ssl_min_protocol_version to libpq.  But AFAICT, the default value of the 
> > > libpq setting is empty, so any protocol version will be accepted.  Is 
> > > this what we wanted?  Should we raise the default in libpq as well?
> > 
> > This was discussed [0] when the connection settings were introduced, and the
> > consensus was to leave them alone [1] to allow for example a new pg_dump to
> > work against an old server.  Re-reading the thread I think the argument 
> > still
> > holds, but I was about to respond "yes, let's do this" before refreshing my
> > memory.  Perhaps we should add a comment explaining this along the lines of 
> > the
> > attached?
> > 
> > [0] 
> > https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org
> > [1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us
> 
> ISTM that these discussions went through the same questions and arguments
> that were made regarding the server-side change but arrived at a different
> conclusion.  So I suggest to reconsider this so that we don't ship with
> contradictory results.
> 
> That doesn't necessarily mean that we have to make a change, but we should
> make sure our rationale is sound.
> 
> Note that all OpenSSL versions that do not support TLSv1.2 also do not
> support TLSv1.1.  So by saying, in effect, that TLSv1.2 is too new to
> require, we are saying that we need to keep supporting TLSv1.0 -- which is
> heavily deprecated.  Also note that the first OpenSSL version with support
> for TLSv1.2 shipped on March 14, 2012.

I do think mismatched SSL requirements between client and server are
confusing, though I can see the back-version pg_dump being an issue. 
Maybe a clear error message would help here.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Review for GetWALAvailability()

2020-06-24 Thread Alvaro Herrera
Thanks for those corrections.

I have pushed this.  I think all problems Masao-san reported have been
dealt with, so we're done here.

Thanks!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Tomas Vondra

On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
>> On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote:
>>> It would seem merge join has almost the same complexities as the new
>>> hash join code, since it can spill to disk doing sorts for merge joins,
>>> and adjusting work_mem is the only way to control that spill to disk.  I
>>> don't remember anyone complaining about spills to disk during merge
>>> join, so I am unclear why we would need such a control for hash join.
> 
>> It looks like merge join was new in 8.3.  I don't think that's a good analogy,
>> since the old behavior was still available with enable_mergejoin=off.
> 
> Uh, what?  A look into our git history shows immediately that
> nodeMergejoin.c has been there since the initial code import in 1996.
> 
> I tend to agree with Bruce that it's not very obvious that we need
> another GUC knob here ... especially not one as ugly as this.
> I'm especially against the "neverspill" option, because that makes a
> single GUC that affects both the planner and executor independently.
> 
> If we feel we need something to let people have the v12 behavior
> back, let's have
> (1) enable_hashagg on/off --- controls planner, same as it ever was
> (2) enable_hashagg_spill on/off --- controls executor by disabling spill



What if a user specifies

  enable_hashagg = on
  enable_hashagg_spill = off

and the estimates say the hashagg would need to spill to disk. Should
that disable the query (in which case the second GUC affects both
executor and planner) or run it (in which case we knowingly ignore
work_mem, which seems wrong).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Tom Lane
Tomas Vondra  writes:
> On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote:
>> If we feel we need something to let people have the v12 behavior
>> back, let's have
>> (1) enable_hashagg on/off --- controls planner, same as it ever was
>> (2) enable_hashagg_spill on/off --- controls executor by disabling spill

> What if a user specifies
>enable_hashagg = on
>enable_hashagg_spill = off

It would probably be reasonable for the planner to behave as it did
pre-v13, that is not choose hashagg if it estimates that work_mem
would be exceeded.  (So, okay, that means enable_hashagg_spill
affects both planner and executor ... but ISTM it's just one
behavior not two.)
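
A minimal standalone sketch of that planner-side gate (enable_hashagg and
work_mem exist today; enable_hashagg_spill is the hypothetical knob from this
proposal, and the function itself is illustrative only):

#include <stdbool.h>

/* Stand-ins for the GUCs; see above for which ones are hypothetical. */
static bool enable_hashagg = true;
static bool enable_hashagg_spill = true;
static int  work_mem = 4096;		/* kilobytes, as in guc.c */

static bool
hashagg_allowed(double estimated_hashtable_bytes)
{
	if (!enable_hashagg)
		return false;

	/* Pre-v13 behaviour: don't pick a hash agg expected to exceed work_mem. */
	if (!enable_hashagg_spill &&
		estimated_hashtable_bytes > (double) work_mem * 1024.0)
		return false;

	return true;
}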

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Andres Freund
Hi,

On 2020-06-24 14:11:57 +1200, David Rowley wrote:
> 1. Statistics underestimation can cause hashagg to be selected. The
> executor will spill to disk in PG13.  Users may find performance
> suffers as previously the query may have just overshot work_mem
> without causing any OOM issues. Their I/O performance might be
> terrible.

> 2. We might now choose to hash aggregate where pre PG13, we didn't
> choose that because the hash table was estimated to be bigger than
> work_mem. Hash agg might not be the best plan for the job.

> For #1. We know users are forced to run smaller work_mems than they
> might like as they need to allow for that random moment where all
> backends happen to be doing that 5-way hash join all at the same time.
> It seems reasonable that someone might want the old behaviour. They
> may well be sitting on a timebomb that's about to OOM, but it would be
> sad if someone's upgrade to PG13 was blocked on this, especially if
> it's just due to some query that runs once per month but needs to
> perform quickly.

I'm quite concerned about this one. I think this isn't just going to hit
when the planner mis-estimates ndistinct, but also when transition
values use a bit more space. We'll now start spilling in cases where the
< v13 planner did everything right.

That's great for cases where we'd otherwise OOM, but for a lot of other
cases where there actually is more than sufficient RAM to overrun
work_mem by a single-digit factor, it can cause a pretty massive
increase of IO over < v13.


FWIW, my gut feeling is that we'll end up having to separate the
"execution time" spilling from using plain work_mem, because it'll
trigger spilling too often. E.g. if the plan isn't expected to spill,
only spill at 10 x work_mem or something like that.  Or we'll need
better management of temp file data when there's plenty memory
available.


> For #2. This seems like a very legitimate requirement to me.  If a
> user is unhappy that PG13 now hashaggs where before it sorted and
> group aggregated, but they're unhappy, not because there's some issue
> with hashagg spilling, but because that causes the node above the agg
> to becomes a Hash Join rather than a Merge Join and that's bad for
> some existing reason. Our planner doing the wrong thing based on
> either; lack of, inaccurate or out-of-date statistics is not Jeff's
> fault. Having the ability to switch off a certain planner feature is
> just following along with what we do today for many other node types.

This one concerns me a bit less, fwiw. There's a lot more "pressure" in
the planner to choose hash agg or sorted agg, compared to e.g. a bunch
of aggregate states taking up a bit more space (can't estimate that at
all for ma.


> As for GUCs to try to help the group of users who, *I'm certain*, will
> have problems with PG13's plan choice. I think the overloaded
> enable_hashagg option is a really nice compromise.   We don't really
> have any other executor node type that has multiple GUCs controlling
> its behaviour, so I believe it would be nice to keep it that way.
> 
> How about:
> 
> enable_hashagg = "on" -- enables hashagg allowing it to freely spill
> to disk as it pleases.
> enable_hashagg = "trynospill" -- Planner will only choose hash_agg if
> it thinks it won't spill (pre PG13 planner behaviour)
> enable_hashagg = "neverspill" -- executor will *never* spill to disk
> and can still OOM (NOT RECOMMENDED, but does give pre PG13 planner and
> executor behaviour)
> enable_hashagg = "off" -- planner does not consider hash agg, ever.
> Same as what PG12 did for this setting.
> 
> Now, it's a bit weird to have "neverspill" as this is controlling
> what's done in the executor from a planner GUC.  Likely we can just
> work around that by having a new "allowhashspill" bool field in the
> "Agg" struct that's set by the planner, say during createplan that
> controls if nodeAgg.c is allowed to spill or not.  That'll also allow
> PREPAREd plans to continue to do what they had planned to do already.
> 
> The thing I like about doing it this way is that:
> 
> a) it does not add any new GUCs
> b) it semi hides the weird values that we really wish nobody would
> ever have to set in a GUC that people have become used it just
> allowing the values "on" and "off".
> 
> The thing I don't quite like about this idea is:
> a) I wish the planner was perfect and we didn't need to do this.
> b) It's a bit weird to overload a GUC that has a very booleanish name
> to not be bool.
> 
> However, I also think it's pretty lightweight to support this. I
> imagine a dozen lines of docs and likely about half a dozen lines per
> GUC option in the planner.

That'd work for me, but I honestly don't particularly care about the
specific naming, as long as we provide users an escape hatch from the
increased amount of IO.


Greetings,

Andres Freund




Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Andres Freund
Hi,

On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote:
> Well, my point is that merge join works that way, and no one has needed
> a knob to avoid mergejoin if it is going to spill to disk.  If they are
> adjusting work_mem to prevent spill of merge join, they can do the same
> for hash agg.  We just need to document this in the release notes.

I don't think this is comparable. For starters, the IO indirectly
triggered by mergejoin actually leads to plenty people just straight out
disabling it. For lots of workloads there's never a valid reason to use
a mergejoin (and often the planner will never choose one). Secondly, the
planner has better information about estimating the memory usage for the
to-be-sorted data than it has about the size of the transition
values. And lastly, there's a difference between a long existing cause
for bad IO behaviour and one that suddenly kicks in after a major
version upgrade, to which there's no escape hatch (it's rarely realistic
to disable hash aggs, in contrast to merge joins).

Greetings,

Andres Freund




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Robert Haas
On Wed, Jun 24, 2020 at 1:06 PM Alvaro Herrera  wrote:
> On 2020-Jun-24, Stephen Frost wrote:
> > Doesn't mean it makes sense or that we should be supporting that.  What
> > we should have is a way to allow administrators to configure a system
> > for exactly what they want to allow, and it doesn't seem like we're
> > doing that today and therefore we should fix it.  This isn't the only
> > area we have that issue in.
>
> The way to do that, for the case under discussion, is to reject using a
> logical replication connection for physical replication commands.

Reading over this discussion, I see basically three arguments:

1. Andres argues that being able to execute physical replication
commands from the same connection as SQL queries is useful, and that
people may be relying on it, and that we shouldn't break it without
need.

2. Fujii Masao argues that the current situation makes it impossible
to write a pg_hba.conf rule that disallows all physical replication
connections, because people could get around it by using a logical
replication connection for physical replication.

3. Various people argue that it's only accidental that physical
replication on a replication=database connection ever worked at all,
and therefore we ought to block it.

I find argument #1 most convincing, #2 less convincing, and #3 least
convincing. In my view, the problem with argument #3 is that just
because some feature combination was unintentional doesn't mean it's
unuseful or unused. As for #2, suppose someone were to propose a
design for logical replication that allowed it to take place without a
database connection, so that it could be done with just a regular
replication connection. Such a feature would create the same problem
Fujii Masao mentions here, but it seems inconceivable that we would
for that reason reject it; we make decisions about features based on
their usefulness, not their knock-on effects on pg_hba.conf rules. We
can always add new kinds of access control restrictions if they are
needed; that is a better approach than removing features so that the
existing pg_hba.conf facilities can be used to accomplish some
particular goal. So really I think this turns on #1: is it plausible
that people are using this feature, however inadvertent it may be, and
is it potentially useful? I don't see that anybody's made an argument
against either of those things. Unless someone can do so, I think we
shouldn't disable this.

That having been said, I think that the fact that you can execute SQL
queries in replication=database connections is horrifying. I really
hate that feature. I think it's a bad design, and a bad
implementation, and a recipe for tons of bugs. But, blocking physical
replication commands on such connections isn't going to solve any of
that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Robert Haas
On Wed, Jun 24, 2020 at 3:14 PM Andres Freund  wrote:
> FWIW, my gut feeling is that we'll end up have to separate the
> "execution time" spilling from using plain work mem, because it'll
> trigger spilling too often. E.g. if the plan isn't expected to spill,
> only spill at 10 x work_mem or something like that.  Or we'll need
> better management of temp file data when there's plenty memory
> available.

So, I don't think we can wire in a constant like 10x. That's really
unprincipled and I think it's a bad idea. What we could do, though, is
replace the existing Boolean-valued GUC with a new GUC that controls
the size at which the aggregate spills. The default could be -1,
meaning work_mem, but a user could configure a larger value if desired
(presumably, we would just treat a value smaller than work_mem as
work_mem, and document the same).

I think that's actually pretty appealing. Separating the memory we
plan to use from the memory we're willing to use before spilling seems
like a good idea in general, and I think we should probably also do it
in other places - like sorts.
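
To make that concrete, here is a sketch of what such a knob could look like.
The GUC name is invented for illustration only; nothing like it exists in any
released version:

    -- hypothetical setting, shown only to illustrate the proposal
    SET hashagg_spill_threshold = -1;       -- default: spill once work_mem is exceeded
    SET hashagg_spill_threshold = '512MB';  -- let the hash table grow to 512MB before spilling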

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Andres Freund
Hi,

On 2020-06-24 14:40:50 -0400, Tom Lane wrote:
> Tomas Vondra  writes:
> > On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote:
> >> If we feel we need something to let people have the v12 behavior
> >> back, let's have
> >> (1) enable_hashagg on/off --- controls planner, same as it ever was
> >> (2) enable_hashagg_spill on/off --- controls executor by disabling spill
>
> > What if a user specifies
> >enable_hashagg = on
> >enable_hashagg_spill = off
>
> It would probably be reasonable for the planner to behave as it did
> pre-v13, that is not choose hashagg if it estimates that work_mem
> would be exceeded.  (So, okay, that means enable_hashagg_spill
> affects both planner and executor ... but ISTM it's just one
> behavior not two.)

There's two different reasons for spilling in the executor right now:

1) The planner estimated that we'd need to spill, and that turns out to
   be true. There seems no reason to not spill in that case (as long as
   it's enabled/chosen in the planner).

2) The planner didn't think we'd need to spill, but we end up using more
   than work_mem memory.

nodeAgg.c already treats those separately:

void
hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
                    Size *mem_limit, uint64 *ngroups_limit,
                    int *num_partitions)
{
    int         npartitions;
    Size        partition_mem;

    /* if not expected to spill, use all of work_mem */
    if (input_groups * hashentrysize < work_mem * 1024L)
    {
        if (num_partitions != NULL)
            *num_partitions = 0;
        *mem_limit = work_mem * 1024L;
        *ngroups_limit = *mem_limit / hashentrysize;
        return;
    }

We can't sensibly disable spilling when chosen at plan time, because
that'd lead to *more* OOMS than in v12.

ISTM that we should have one option that controls whether 1) is done,
and one that controls whether 2) is done. Even if the option for 2 is
off, we still should spill when the option for 1) chooses a spilling
plan.  I don't think it makes sense for one of those options to
influence the other implicitly.

So maybe enable_hashagg_spilling_plan for 1) and
hashagg_spill_on_exhaust for 2).
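
Spelled out as a sketch, with the names floated above (neither GUC exists; this
is purely illustrative of the two behaviours):

    SET enable_hashagg_spilling_plan = on;  -- 1) planner may choose a plan that expects to spill
    SET hashagg_spill_on_exhaust = off;     -- 2) don't spill on an unplanned overflow (v12-like behaviour)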

Greetings,

Andres Freund




Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-24 Thread Fabien COELHO


Hallo Thomas,


>>   INSERT INTO t() VALUES ();
>>
>> This is forbidden by postgres, and also sqlite.
>>
>> Is there any good reason why this should be the case?
>
> Maybe because
>
>   insert into t default values;
>
> exists (and is standard SQL if I'm not mistaken)

That's a nice alternative I did not notice. Well, not an alternative as 
the other one does not work.

I'm still unclear why it would be forbidden though, it seems logical to 
try that, whereas the working one is quite away from the usual syntax.
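
For the record, the two spellings side by side, on a throwaway table:

  CREATE TABLE t (a int DEFAULT 1, b text DEFAULT 'x');

  INSERT INTO t DEFAULT VALUES;   -- standard SQL, accepted by postgres
  INSERT INTO t() VALUES ();      -- rejected by postgres (and by the standard)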


--
Fabien.

Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Andres Freund
Hi,

On 2020-06-24 15:28:47 -0400, Robert Haas wrote:
> On Wed, Jun 24, 2020 at 3:14 PM Andres Freund  wrote:
> > FWIW, my gut feeling is that we'll end up have to separate the
> > "execution time" spilling from using plain work mem, because it'll
> > trigger spilling too often. E.g. if the plan isn't expected to spill,
> > only spill at 10 x work_mem or something like that.  Or we'll need
> > better management of temp file data when there's plenty memory
> > available.
> 
> So, I don't think we can wire in a constant like 10x. That's really
> unprincipled and I think it's a bad idea. What we could do, though, is
> replace the existing Boolean-valued GUC with a new GUC that controls
> the size at which the aggregate spills. The default could be -1,
> meaning work_mem, but a user could configure a larger value if desired
> (presumably, we would just treat a value smaller than work_mem as
> work_mem, and document the same).

To be clear, I wasn't actually thinking of hard-coding 10x, but having a
config option that specifies a factor of work_mem. A factor seems better
because it'll work reasonably for different values of work_mem, whereas
a concrete size wouldn't.


> I think that's actually pretty appealing. Separating the memory we
> plan to use from the memory we're willing to use before spilling seems
> like a good idea in general, and I think we should probably also do it
> in other places - like sorts.

Indeed. And then perhaps we could eventually add some reporting /
monitoring infrastructure for the cases where plan time and execution
time memory estimate/usage widely differs.

Greetings,

Andres Freund




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-24, Robert Haas wrote:

> So really I think this turns on #1: is it plausible
> that people are using this feature, however inadvertent it may be, and
> is it potentially useful? I don't see that anybody's made an argument
> against either of those things. Unless someone can do so, I think we
> shouldn't disable this.

People (specifically the jdbc driver) *are* using this feature in this
way, but they didn't realize they were doing it.  It was an accident and
they didn't notice.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Dave Cramer
On Wed, 24 Jun 2020 at 15:41, Alvaro Herrera 
wrote:

> On 2020-Jun-24, Robert Haas wrote:
>
> > So really I think this turns on #1: is it plausible
> > that people are using this feature, however inadvertent it may be, and
> > is it potentially useful? I don't see that anybody's made an argument
> > against either of those things. Unless someone can do so, I think we
> > shouldn't disable this.
>
> People (specifically the jdbc driver) *are* using this feature in this
> way, but they didn't realize they were doing it.  It was an accident and
> they didn't notice.
>
>
Not sure we are using it as much as we accidentally did it that way. It would
be trivial to fix.

That said I think we should fix the security hole this opens and leave the
functionality.

Dave


Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-24 Thread Andres Freund
Hi,

On 2020-06-24 15:41:14 -0400, Alvaro Herrera wrote:
> On 2020-Jun-24, Robert Haas wrote:
> 
> > So really I think this turns on #1: is it plausible
> > that people are using this feature, however inadvertent it may be, and
> > is it potentially useful? I don't see that anybody's made an argument
> > against either of those things. Unless someone can do so, I think we
> > shouldn't disable this.
> 
> People (specifically the jdbc driver) *are* using this feature in this
> way, but they didn't realize they were doing it.  It was an accident and
> they didn't notice.

As I said before, I've utilized being able to do both over a single
connection (among other things, to initialize a logical replica using a base
backup). And I've seen at least one other codebase (developed without my
input) doing so. I really don't understand how you just dismiss this
without any sort of actual argument.  Yes, those uses can be fixed to
reconnect with a different replication parameter, but that's code that
needs to be adjusted and it requires adjustments to pg_hba.conf etc.

And obviously you'd lock out older versions of jdbc, and possibly other
drivers.

Obviously we should allow more granular permissions here, I don't think
anybody is arguing against that.

Greetings,

Andres Freund




Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-24 Thread Tom Lane
Fabien COELHO  writes:
>>>   INSERT INTO t() VALUES ();

> I'm still unclear why it would be forbidden though, it seems logical to 
> try that, whereas the working one is quite away from the usual syntax.

It's forbidden because the SQL standard forbids it.

We allow zero-column syntaxes in some other places where SQL forbids
them, but that's only because there is no reasonable alternative.
In this case, there's a perfectly good, standards-compliant alternative.
So why encourage people to write unportable code?

regards, tom lane




Re: xid wraparound danger due to INDEX_CLEANUP false

2020-06-24 Thread Peter Geoghegan
On Wed, Jun 24, 2020 at 10:21 AM Robert Haas  wrote:
> Sorry, I'm so far behind on my email. Argh.

That's okay.

> I think, especially on the blog post you linked, that we should aim to
> have INDEX_CLEANUP OFF mode do the minimum possible amount of work
> while still keeping us safe against transaction ID wraparound. So if,
> for example, it's desirable but not imperative for BRIN to
> resummarize, then it's OK normally but should be skipped with
> INDEX_CLEANUP OFF.

I agree that that's very important.
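
(For anyone following along, this is the VACUUM option added in v12; a minimal
invocation, with a placeholder table name:

  VACUUM (INDEX_CLEANUP FALSE) mytable;
  -- the same behavior can also be set per table through the reloption:
  ALTER TABLE mytable SET (vacuum_index_cleanup = false);
)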

> Apparently, what we're really doing here is that even when
> INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples.
> That seems sad, but if it's what we have to do then it at least needs
> comments explaining it.

+1. Though I think that AMs should technically have the right to
consider it advisory.

> As for the btree portion of the change, I expect you understand this
> better than I do, so I'm reluctant to stick my neck out, but it seems
> that what the patch does is force index cleanup to happen even when
> INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and
> the btree index has at least 1 recyclable page. My first reaction is
> to wonder whether that doesn't nerf this feature into oblivion. Isn't
> it likely that an index that is being vacuumed for wraparound will
> have a recyclable page someplace? If the presence of that 1 recyclable
> page causes the "help, I'm about to run out of XIDs, please do the
> least possible work" flag to become a no-op, I don't think users are
> going to be too happy with that. Maybe I am misunderstanding.

I have mixed feelings about it myself. These are valid concerns.

This is a problem to the extent that the user has a table that they'd
like to use INDEX_CLEANUP with, that has indexes that regularly
require cleanup due to page deletion. ISTM, then, that the really
relevant high level design questions for this patch are:

1. How often is that likely to happen in The Real World™?

2. If we fail to do cleanup and leak already-deleted pages, how bad is
that? ( Both in general, and in the worst case.)

I'll hazard a guess for 1: I think that it might not come up that
often. Page deletion is something that we hardly ever need. And,
unlike some DB systems, we only do it when pages are fully empty
(which, as it turns out, isn't necessarily better than our simple
approach [1]). I tend to think it's unlikely to happen in cases where
INDEX_CLEANUP is used, because those are cases that also must not have
that much index churn to begin with.

Then there's question 2. The intuition behind the approach from
Sawada-san's patch was that allowing wraparound here feels wrong -- it
should be in the AM's hands. However, it's not like I can point to
some ironclad guarantee about not leaking deleted pages that existed
before the INDEX_CLEANUP feature. We know that the FSM is not crash
safe, and that's that. Is this really all that different? Maybe it is,
but it seems like a quantitative difference to me.

I'm kind of arguing against myself even as I try to advance my
original argument. If workloads that use INDEX_CLEANUP don't need to
delete and recycle pages in any case, then why should we care that
those same workloads might leak pages on account of the wraparound
hazard? There's nothing to leak! Maybe some compromise is possible, at
least for the backbranches. Perhaps nbtree is told about the
requirements in a bit more detail than we'd imagined. It's not just an
INDEX_CLEANUP boolean. It could be an enum or something. Not sure,
though.

The real reason that I want to push the mechanism down into index
access methods is because that design is clearly better overall; it
just so happens that the specific way in which we currently defer
recycling in nbtree makes very little sense, so it's harder to see the
big picture. The xid-cleanup design that we have was approximately the
easiest way to do it, so that's what we got. We should figure out a
way to recycle the pages at something close to the earliest possible
opportunity, without having to perform a full scan on the index
relation within btvacuumscan(). Maybe we can use the autovacuum work
item mechanism for that. For indexes that get VACUUMed once a week on
average, it makes zero sense to wait another week to recycle the pages
that get deleted, in a staggered fashion. It should be possible to
recycle the pages a minute or two after VACUUM proper finishes, with
extra work that's proportionate to the number of deleted pages. This
is still conservative. I am currently very busy with an unrelated
B-Tree prototype, so I might not get around to it this year. Maybe
Sawada-san can think about this?

Also, handling of the vacuum_cleanup_index_scale_factor stuff (added
to Postgres 11 by commit 857f9c36) should live next to code for the
confusingly-similar INDEX_CLEANUP stuff (added to Postgres 12 by
commit a96c41feec6), on general principle. I think that that
organization is a lot easier to follow.

[1] http

Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Tomas Vondra

On Wed, Jun 24, 2020 at 12:36:24PM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-06-24 15:28:47 -0400, Robert Haas wrote:
> > On Wed, Jun 24, 2020 at 3:14 PM Andres Freund  wrote:
> > > FWIW, my gut feeling is that we'll end up have to separate the
> > > "execution time" spilling from using plain work mem, because it'll
> > > trigger spilling too often. E.g. if the plan isn't expected to spill,
> > > only spill at 10 x work_mem or something like that.  Or we'll need
> > > better management of temp file data when there's plenty memory
> > > available.
> >
> > So, I don't think we can wire in a constant like 10x. That's really
> > unprincipled and I think it's a bad idea. What we could do, though, is
> > replace the existing Boolean-valued GUC with a new GUC that controls
> > the size at which the aggregate spills. The default could be -1,
> > meaning work_mem, but a user could configure a larger value if desired
> > (presumably, we would just treat a value smaller than work_mem as
> > work_mem, and document the same).
>
> To be clear, I wasn't actually thinking of hard-coding 10x, but having a
> config option that specifies a factor of work_mem. A factor seems better
> because it'll work reasonably for different values of work_mem, whereas
> a concrete size wouldn't.

I'm not quite convinced we need/should introduce a new memory limit.
It's true keeping it equal to work_mem by default makes this less of an
issue, but it's still another moving part the users will need to learn
how to use.

But if we do introduce a new limit, I very much think it should be a
plain limit, not a factor. That just makes it even more complicated, and
we don't have any such limit yet.

> > I think that's actually pretty appealing. Separating the memory we
> > plan to use from the memory we're willing to use before spilling seems
> > like a good idea in general, and I think we should probably also do it
> > in other places - like sorts.
>
> Indeed. And then perhaps we could eventually add some reporting /
> monitoring infrastructure for the cases where plan time and execution
> time memory estimate/usage widely differs.

I wouldn't mind something like that in general - not just for hashagg,
but for various other nodes.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow CURRENT_ROLE in GRANTED BY

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-24, Peter Eisentraut wrote:

> I was checking some loose ends in SQL conformance, when I noticed: We
> support GRANT role ... GRANTED BY CURRENT_USER, but we don't support
> CURRENT_ROLE in that place, even though in PostgreSQL they are equivalent.
> Here is a trivial patch to add that.

Hmm, since this adds to RoleSpec, this change makes every place that
uses that production also take CURRENT_ROLE, so we'd need to document in
all those places.  For example, alter_role.sgml, create_schema.sgml,
etc.

This also affects role_list (but maybe the docs for those are already
vague enough -- eg. ALTER INDEX .. OWNED BY only says "role_name" with
no further explanation, even though it does take "current_user".)
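
For concreteness, the kind of statement the patch would newly accept (role
names invented here for illustration):

GRANT some_role TO some_user GRANTED BY CURRENT_USER;  -- already accepted
GRANT some_role TO some_user GRANTED BY CURRENT_ROLE;  -- accepted only with the proposed patch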

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Strange behavior with polygon and NaN

2020-06-24 Thread Jesse Zhang
Hi hackers,

While working with Chris Hajas on merging Postgres 12 with Greenplum
Database we stumbled upon the following strange behavior in the geometry
type polygon:

-- >8 

CREATE TEMP TABLE foo (p point);
CREATE INDEX ON foo USING gist(p);

INSERT INTO foo VALUES ('0,0'), ('1,1'), ('NaN,NaN');

SELECT $q$
SELECT * FROM foo WHERE p <@ polygon '(0,0), (0, 100), (100, 100), (100, 0)'
$q$ AS qry \gset

BEGIN;
SAVEPOINT yolo;
SET LOCAL enable_seqscan TO off;
:qry;

ROLLBACK TO SAVEPOINT yolo;
SET LOCAL enable_indexscan TO off;
SET LOCAL enable_bitmapscan TO off;
:qry;

-- 8< 

If you run the above repro SQL in HEAD (and 12, and likely all older
versions), you get the following output:

CREATE TABLE
CREATE INDEX
INSERT 0 3
BEGIN
SAVEPOINT
SET
     p
-----------
 (0,0)
 (1,1)
(2 rows)

ROLLBACK
SET
SET
     p
-----------
 (0,0)
 (1,1)
 (NaN,NaN)
(3 rows)


At first glance, you'd think this is the gist AM's bad, but on a second
thought, something else is strange here. The following query returns
true:

SELECT point '(NaN, NaN)' <@ polygon '(0,0), (0, 100), (100, 100), (100, 0)'

The above behavior of the "contained in" operator is surprising, and
it's probably not what the GiST AM is expecting. I took a look at
point_inside() in geo_ops.c, and it doesn't seem well equipped to handle
NaN. Similarly ill-equipped is dist_ppoly_internal(), which underlies the
distance operator for polygon. It gives the following interesting
output:

SELECT *, c <-> polygon '(0,0),(0,100),(100,100),(100,0)' as distance
FROM (
  SELECT circle(point(100 * i, 'NaN'), 50) AS c
  FROM generate_series(-2, 4) i
) t(c)
ORDER BY 2;

        c        | distance
-----------------+----------
 <(-200,NaN),50> |        0
 <(-100,NaN),50> |        0
 <(0,NaN),50>    |        0
 <(100,NaN),50>  |        0
 <(200,NaN),50>  |      NaN
 <(300,NaN),50>  |      NaN
 <(400,NaN),50>  |      NaN
(7 rows)

Should they all be NaN? Am I alone in thinking the index is right but
the operators are wrong? Or should we call the indexes wrong here?
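
For comparison, the core float8 comparisons do define the NaN cases (NaN is
treated as equal to itself and greater than everything else, so that btree
indexes work); it looks like the fuzzy comparisons in the geometric code simply
don't follow that convention. A quick illustration of the float8 side:

SELECT 'NaN'::float8 = 'NaN'::float8 AS nan_eq,   -- true
       'NaN'::float8 > 100::float8   AS nan_gt;   -- true: NaN sorts above everything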

Cheers,
Jesse and Chris




Re: should libpq also require TLSv1.2 by default?

2020-06-24 Thread Daniel Gustafsson
> On 24 Jun 2020, at 19:57, Peter Eisentraut  
> wrote:
> 
> On 2020-06-24 10:33, Daniel Gustafsson wrote:
>>> In PG13, we raised the server-side default of ssl_min_protocol_version to 
>>> TLSv1.2.  We also added a connection setting named ssl_min_protocol_version 
>>> to libpq.  But AFAICT, the default value of the libpq setting is empty, so 
>>> any protocol version will be accepted.  Is this what we wanted?  Should we 
>>> raise the default in libpq as well?
>> This was discussed [0] when the connection settings were introduced, and the
>> concensus was to leave them alone [1] to allow for example a new pg_dump to
>> work against an old server.  Re-reading the thread I think the argument still
>> holds, but I was about to respond "yes, let's do this" before refreshing my
>> memory.  Perhaps we should add a comment explaining this along the lines of 
>> the
>> attached?
>> [0] 
>> https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org
>> [1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us
> 
> ISTM that these discussions went through the same questions and arguments 
> that were made regarding the server-side change but arrived at a different 
> conclusion.  So I suggest to reconsider this so that we don't ship with 
> contradictory results.

I don't think anyone argues against safe defaults for communication between
upgraded clients and upgraded servers.  That being said; out of the box, an
upgraded client *will* use TLSv1.2 when connecting to a upgraded server due to
the server defaults requirements (assuming the server hasn't been reconfigured
with a lower TLS version, but since we're talking defaults we have to assume
that).

The problem comes when an updated client needs to talk to an old server which
hasn't been upgraded and which uses an older OpenSSL version.  If we set TLSv1.2
as the minimum client version, the user will have to amend a connection string
which used to work when talking to a server which hasn't changed.  If we don't
raise the default, a user who wants nothing lower than TLSv1.2 will have to
amend the connection string.
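
For reference, both sides already have a knob, so the examples below just spell
out the v13 state rather than propose anything new.  On the server:

ALTER SYSTEM SET ssl_min_protocol_version = 'TLSv1.2';  -- the v13 default
SELECT pg_reload_conf();

On the client, a user who wants to insist on TLSv1.2 can already add
ssl_min_protocol_version=TLSv1.2 to the connection string without waiting for a
new libpq default.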

> That doesn't necessarily mean that we have to make a change, but we should 
> make sure our rationale is sound.

Totally agree.  I think I, FWIW, still vote for keeping it at 1.0 to not break
connections to old servers, since upgraded/new servers will enforce 1.2
anyways.

As mentioned elsewhere in the thread, maybe this is also something which can be
done more easily if we improve the error reporting?  Right now it's fairly
cryptic IMO.

> Note that all OpenSSL versions that do not support TLSv1.2 also do not 
> support TLSv1.1.  So by saying, in effect, that TLSv1.2 is too new to 
> require, we are saying that we need to keep supporting TLSv1.0 -- which is 
> heavily deprecated.  Also note that the first OpenSSL version with support 
> for TLSv1.2 shipped on March 14, 2012.

Correct, this being the 1.0.1 release which is referred to.

cheers ./daniel



Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-24 Thread Dagfinn Ilmari Mannsåker
Tom Lane  writes:

> Fabien COELHO  writes:
>>>>   INSERT INTO t() VALUES ();
>
>> I'm still unclear why it would be forbidden though, it seems logical to 
>> try that, whereas the working one is quite away from the usual syntax.
>
> It's forbidden because the SQL standard forbids it.
>
> We allow zero-column syntaxes in some other places where SQL forbids
> them, but that's only because there is no reasonable alternative.
> In this case, there's a perfectly good, standards-compliant alternative.
> So why encourage people to write unportable code?

FWIW, MySQL (and MariaDB) only support INSERT INTO t () VALUES (), not
DEFAULT VALUES.  We have added syntax for MySQL compatibility in the
past, e.g. the CONCAT() function.

- ilmari
-- 
"A disappointingly low fraction of the human race is,
 at any given time, on fire." - Stig Sandbeck Mathisen




Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-24 Thread David G. Johnston
On Wed, Jun 24, 2020 at 3:31 PM Dagfinn Ilmari Mannsåker 
wrote:

> FWIW, MySQL (and MariaDB) only support INSERT INTO t () VALUES (), not
> DEFAULT VALUES.


We have added syntax for MySQL compatibility in the
> past, e.g. the CONCAT() function.
>

I don't see the similarities.  IIUC there isn't a standard mandated
function that provides the behavior that the concat function does.  There
is an operator but the treatment of NULL is different.  So for concat we
decided to add a custom function modelled on another DB's custom function.
Adding custom syntax here when an identically behaving standard syntax
already exists has considerably less going for it.  I would say that
accepting the compatibility hit while being the ones that are
standard-compliant is in line with project values.

David J.


Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Bruce Momjian
On Wed, Jun 24, 2020 at 11:02:10PM +0200, Tomas Vondra wrote:
> > Indeed. And then perhaps we could eventually add some reporting /
> > monitoring infrastructure for the cases where plan time and execution
> > time memory estimate/usage widely differs.
> > 
> 
> I wouldn't mind something like that in general - not just for hashagg,
> but for various other nodes.

Well, other than worrying about problems with pre-13 queries, how is
this different from any other spill to disk when we exceed work_mem,
like sorts for merge join.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Bruce Momjian
On Wed, Jun 24, 2020 at 12:19:00PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote:
> > Well, my point is that merge join works that way, and no one has needed
> > a knob to avoid mergejoin if it is going to spill to disk.  If they are
> > adjusting work_mem to prevent spill of merge join, they can do the same
> > for hash agg.  We just need to document this in the release notes.
> 
> I don't think this is comparable. For starters, the IO indirectly
> triggered by mergejoin actually leads to plenty people just straight out
> disabling it. For lots of workloads there's never a valid reason to use
> a mergejoin (and often the planner will never choose one). Secondly, the
> planner has better information about estimating the memory usage for the
> to-be-sorted data than it has about the size of the transition
> values. And lastly, there's a difference between a long existing cause
> for bad IO behaviour and one that's suddenly kicks in after a major
> version upgrade, to which there's no escape hatch (it's rarely realistic
> to disable hash aggs, in contrast to merge joins).

Well, this sounds like an issue of degree, rather than kind.  It sure
sounds like "ignore work_mem for this join type, but not the other".

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Default setting for enable_hashagg_disk

2020-06-24 Thread Bruce Momjian
On Wed, Jun 24, 2020 at 07:18:10PM -0400, Bruce Momjian wrote:
> On Wed, Jun 24, 2020 at 12:19:00PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote:
> > > Well, my point is that merge join works that way, and no one has needed
> > > a knob to avoid mergejoin if it is going to spill to disk.  If they are
> > > adjusting work_mem to prevent spill of merge join, they can do the same
> > > for hash agg.  We just need to document this in the release notes.
> > 
> > I don't think this is comparable. For starters, the IO indirectly
> > triggered by mergejoin actually leads to plenty people just straight out
> > disabling it. For lots of workloads there's never a valid reason to use
> > a mergejoin (and often the planner will never choose one). Secondly, the
> > planner has better information about estimating the memory usage for the
> > to-be-sorted data than it has about the size of the transition
> > values. And lastly, there's a difference between a long existing cause
> > for bad IO behaviour and one that's suddenly kicks in after a major
> > version upgrade, to which there's no escape hatch (it's rarely realistic
> > to disable hash aggs, in contrast to merge joins).
> 
> Well, this sounds like an issue of degree, rather than kind.  It sure
> sounds like "ignore work_mem for this join type, but not the other".

I think my main point is that work_mem was not being honored for
hash-agg before, but now that PG 13 can do it, we are again allowing
work_mem not to apply in certain cases.  I am wondering if our hard
limit for work_mem is the issue, and we should make that more flexible
for all uses.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: hashagg slowdown due to spill changes

2020-06-24 Thread Melanie Plageman
On Tue, Jun 23, 2020 at 10:06 AM Andres Freund  wrote:

> Hi,
>
> On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
> > On Mon, Jun 22, 2020 at 9:02 PM Andres Freund 
> wrote:
> > > It's not this patch's fault, but none, really none, of this stuff
> should
> > > be in the executor.
> > >
> > >
> > Were you thinking it could be done in grouping_planner() and then the
> > bitmaps could be saved in the PlannedStmt?
> > Or would you have to wait until query_planner()? Or are you imagining
> > somewhere else entirely?
>
> I haven't thought about it in too much detail, but I would say
> create_agg_plan() et al. I guess there's some argument to be made to do
> it in setrefs.c, because we already do convert_combining_aggrefs() there
> (but I don't like that much).
>
> There's no reason to do it before we actually decided on one specific
> path, so doing it earlier than create_plan() seems unnecessary. And
> having it in agg specific code seems better than putting it into global
> routines.
>
> There's probably an argument for having a bit more shared code between
> create_agg_plan(), create_group_plan() and
> create_groupingsets_plan(). But even just adding a new extract_*_cols()
> call to each of those would probably be ok.
>
>
So, my summary of this point in the context of the other discussion
upthread is:

Planner should extract the columns that hashagg will need later during
planning. Planner should not have HashAgg/MixedAgg nodes request smaller
targetlists from their children with CP_SMALL_TLIST to avoid unneeded
projection overhead.
Also, even this extraction should only be done when the number of groups
is large enough to suspect a spill.

So, I wrote a patch that extracts the columns the same way as in
ExecInitAgg but in create_agg_plan() and it doesn't work because we
haven't called set_plan_references().

Then, I wrote a patch that does this in set_upper_references(), and it
seems to work. I've attached that one.
It is basically Jeff's patch (based somewhat on my patch) which extracts
the columns in ExecInitAgg but I moved the functions over to setrefs.c
and gave them a different name.

It's not very elegant.
I shoved it in at the end of set_upper_references(), but I think there
should be a nice way to do it while setting the references for each var
instead of walking over the nodes again.
Also, I think that the bitmapsets for the colnos should maybe be put
somewhere less prominent (than in the main Agg plan node?), since they
are only used in one small place.
I tried putting both bitmaps in an array of two bitmaps in the Agg node
(since there will always be two) to make it look a bit neater, but it
was pretty confusing and error prone to remember which one was
aggregated and which one was unaggregated.

Note that I didn't do anything with costing like only extracting the
columns if there are a lot of groups.

Also, I didn't revert the CP_SMALL_TLIST change in create_agg_plan() or
create_groupingsets_plan().

Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
in create_hashjoin_plan() for the inner side sub-tree and the outer side
one if there are multiple batches. I wondered what was different about
that vs hashagg (i.e. why it is okay to do that there).

-- 
Melanie Plageman
From 0b40ad205cf8fc8a83b8c65f5f18502a51c47370 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 24 Jun 2020 17:07:14 -0700
Subject: [PATCH v1] Move extracting columns for hashagg to planner

Find the columns needed the executor will need to know about for hashagg
and bifurcate them into aggregated and unaggregated column bitmapsets.
Save them in the plan tree for use during execution.

This is done after set_plan_refs() so all the references are correct.
---
 src/backend/executor/nodeAgg.c   | 117 +--
 src/backend/nodes/copyfuncs.c|   2 +
 src/backend/nodes/outfuncs.c |   2 +
 src/backend/nodes/readfuncs.c|   2 +
 src/backend/optimizer/plan/setrefs.c |  69 
 src/include/nodes/execnodes.h|   8 +-
 src/include/nodes/plannodes.h|   2 +
 7 files changed, 140 insertions(+), 62 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a20554ae65..2f340e4031 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -359,6 +359,14 @@ typedef struct HashAggBatch
 	int64		input_tuples;	/* number of tuples in this batch */
 } HashAggBatch;
 
+/* used to find referenced colnos */
+typedef struct FindColsContext
+{
+	bool	   is_aggref;		/* is under an aggref */
+	Bitmapset *aggregated;		/* column references under an aggref */
+	Bitmapset *unaggregated;	/* other column references */
+} FindColsContext;
+
 static void select_current_set(AggState *aggstate, int setno, bool is_hash);
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -391,8 +399,6 @@ static void finalize_aggregates(AggSta

Re: [PATCH] Initial progress reporting for COPY command

2020-06-24 Thread vignesh C
On Tue, Jun 23, 2020 at 4:45 PM Tomas Vondra
 wrote:
>
> >>
> >> Anyway if you would like to make this view more user-friendly, I can add 
> >> that. Just ping me.
> >
> >I felt we could add pg_size_pretty to make the view more user friendly.
> >
>
> Please no. That'd make processing of the data (say, computing progress
> as processed/total) impossible. It's easy to add pg_size_pretty if you
> want it, it's impossible to undo it. I don't see a single pg_size_pretty
> call in system_views.sql.
>

I thought of including pg_size_pretty as there was no total_bytes
to compare with, but I'm ok without it too as there is an option for
user to always include it in the client side like "SELECT
pg_size_pretty(file_bytes_processed) from pg_stat_progress_copy;" if
required.
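
To spell that out, a client-side query against the view proposed in this patch
(column names as discussed in this thread, so they may still change; pid is
assumed to be present like in the other progress views):

SELECT pid, pg_size_pretty(file_bytes_processed) AS processed
  FROM pg_stat_progress_copy;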

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: should libpq also require TLSv1.2 by default?

2020-06-24 Thread Michael Paquier
On Thu, Jun 25, 2020 at 12:30:03AM +0200, Daniel Gustafsson wrote:
> I don't think anyone argues against safe defaults for communication between
> upgraded clients and upgraded servers.  That being said; out of the box, an
> upgraded client *will* use TLSv1.2 when connecting to a upgraded server due to
> the server defaults requirements (assuming the server hasn't been reconfigured
> with a lower TLS version, but since we're talking defaults we have to assume
> that).

My take here is to leave things as they are for libpq.  pg_dump is a very
good argument, because we allow backward compatibility with a newer
version of the tool, not upward compatibility.

> The problem comes when an updated client needs to talk to an old server which
> hasn't been upgraded and which use an older OpenSSL version.  If we set 
> TLSv1.2
> as the minimum client version, the user will have to amend a connection string
> which used to work when talking to a server which hasn't changed.  If we don't
> raise the default, a user to wants nothing lower than TLSv1.2 will have to
> amend the connection string.

Yeah, and I would not be surprised to see cases where people complain
to us about that when attempting to reach one of their old boxes,
breaking some stuff they have been relying on for years by forcing the
addition of a tls_min_server_protocol in the connection string.  It is
a more important step that we enforce TLSv1.2 on the server side
actually, and libpq just follows up automatically with that.

> As mentioned elsewhere in the thread, maybe this is also something which can 
> be
> done more easily if we improve the error reporting?  Right now it's fairly
> cryptic IMO.

This part may be tricky to get right I think, because the error comes
directly from OpenSSL when negotiating the protocol used between the
client and the server, like "no protocols available" or such.
--
Michael


signature.asc
Description: PGP signature


Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-06-24 Thread David Rowley
Back in [1] I experimented with a patch to coax compilers to build all
elog/ereport calls that were >= ERROR into a cold path away from the
function raising the error. At the time, I really just wanted to test
how much of a speedup we could get by doing this and ended up just
writing up a patch that basically changed all elog(ERROR) calls from:

if (<some condition>)
{
    <do stuff>;
    elog(ERROR, "...");
}

to add an unlikely() and become:

if (unlikely(<some condition>))
{
    <do stuff>;
    elog(ERROR, "...");
}

Per the report in [1] I did see some pretty good gains in performance
from doing this.  The problem was, that at the time I couldn't figure
out a way to do this without an invasive patch that changed the code
in the thousands of elog/ereport calls.

In the attached, I've come up with a way that works. Basically I've
just added a function named errstart_cold() that is attributed with
__attribute__((cold)), which will hint to the compiler to keep
branches which call this function in a cold path.  To make the
compiler properly mark just >= ERROR calls as cold, and only when the
elevel is a constant, I modified the ereport_domain macro to do:

if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
errstart_cold(elevel, domain) : \
errstart(elevel, domain)) \

I see no reason why the compiler shouldn't always fold that constant
expression at compile-time and select the correct version of
the function for the job.  (I also tried using __builtin_choose_expr()
but GCC complained when the elevel was not constant, even with the
__builtin_constant_p() test in the condition)

I sampled a few .s files to inspect what code had changed.  Looking at
mcxt.s, fmgr.s and xlog.s, the first two of these because I found in
[1] that elogs were being done from those files quite often and xlog.s
because it's pretty big.  As far as I could see, GCC correctly moved
all the error reporting stuff where the elevel was a constant and >=
ERROR into the cold path and left the lower-level or non-consts elevel
calls alone.

For clang, I didn't see any changes in the .s files. I suspect that
they might have a few smarts in there and see the
__builtin_unreachable() call and assume the path is cold already based
on that. That was with clang 10. Perhaps older versions are not as
smart.

Benchmarking:

For benchmarking, I've not done a huge amount to test the impacts of
this change.  However, I can say that I am seeing some fairly good
improvements.  There seems to be some small improvements to execution
speed using TPCH-Q1 and also some good results from a pgbench -S test.

For TPCH-Q1:

Master:
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5272.630 ms
latency average = 5258.610 ms
latency average = 5250.871 ms

Master + elog_ereport_attribute_cold.patch
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5182.761 ms
latency average = 5194.851 ms
latency average = 5183.128 ms

Which is about a 1.42% increase in performance. That's not exactly
groundbreaking, but pretty useful to have if that happens to apply
across the board for execution performance.

For pgbench -S:

My results were a bit noisier than the TPCH test, but the results I
obtained did show about a 10% increase in performance:

Master:
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 25245.903255 (excluding connections establishing)
tps = 26144.454208 (excluding connections establishing)
tps = 25931.850518 (excluding connections establishing)

Master + elog_ereport_attribute_cold.patch
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 28351.480631 (excluding connections establishing)
tps = 27763.656557 (excluding connections establishing)
tps = 28896.427929 (excluding connections establishing)

It would be useful if someone with some server-grade Intel hardware
could run a few tests on this.  The above results are all from AMD
hardware.

I've attached the patch for this. I'll add it to the July 'fest.

David

[1] 
https://www.postgresql.org/message-id/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68=3awbh8cjwtp9zxeo...@mail.gmail.com


elog_ereport_attribute_cold.patch
Description: Binary data


Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary

2020-06-24 Thread Michael Paquier
On Wed, Jun 24, 2020 at 12:55:22PM -0400, Tom Lane wrote:
> Yeah, I'm sure there are a few inconsistencies.  We previously made a
> pass to get rid of pg_strcasecmp for anything that had been through
> the parser's downcasing (commit fb8697b31) but I wouldn't be surprised
> if that missed a few cases, or if new ones have snuck in.  Anyway,
> "don't use pg_strcasecmp unnecessarily" was definitely the agreed-to
> policy as of Jan 2018.

0d8c9c1 has introduced some in parse_basebackup_options() for the
new manifest option, and fe30e7e for AlterType(), no?

> My vague recollection is that there are a few exceptions (defGetBoolean
> may well be one of them) where pg_strcasecmp still seemed necessary
> because the input might not have come through the parser in some usages.

Yep, there were a couple of exceptions.  What was done at this time
was a case-by-case lookup to check what came only from the parser.
--
Michael


signature.asc
Description: PGP signature


Re: Assertion failure in pg_copy_logical_replication_slot()

2020-06-24 Thread Fujii Masao




On 2020/06/24 23:58, Alvaro Herrera wrote:
> On 2020-Jun-24, Fujii Masao wrote:
>
>>> I think the errcode is a bit bogus considering the new case.
>>> IMO ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is more appropriate.
>>
>> Agreed. So I updated the patch so this errcode is used instead.
>> Patch attached.
>
> LGTM.

Thanks! Pushed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: should libpq also require TLSv1.2 by default?

2020-06-24 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Jun 25, 2020 at 12:30:03AM +0200, Daniel Gustafsson wrote:
>> As mentioned elsewhere in the thread, maybe this is also something which can 
>> be
>> done more easily if we improve the error reporting?  Right now it's fairly
>> cryptic IMO.

> This part may be tricky to get right I think, because the error comes
> directly from OpenSSL when negotiating the protocol used between the
> client and the server, like "no protocols available" or such.

Can we do something comparable to the backend's HINT protocol, where
we add on a comment that's only mostly-likely to be right?

regards, tom lane




Re: Review for GetWALAvailability()

2020-06-24 Thread Fujii Masao




On 2020/06/25 3:27, Alvaro Herrera wrote:
> Thanks for those corrections.
>
> I have pushed this.  I think all problems Masao-san reported have been
> dealt with, so we're done here.

Sorry for my late reply here...

Thanks for committing the patch and improving the feature!

    /*
     * Find the oldest extant segment file. We get 1 until checkpoint removes
     * the first WAL segment file since startup, which causes the status being
     * wrong under certain abnormal conditions but that doesn't actually harm.
     */
    oldestSeg = XLogGetLastRemovedSegno() + 1;

I see the point of the above comment, but this can cause wal_status to be
changed from "lost" to "unreserved" after the server restart. Isn't this
really confusing? At least it seems better to document that behavior.
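
For clarity, wal_status here is the new column in pg_replication_slots, so the
state change above is directly visible with:

SELECT slot_name, wal_status, restart_lsn FROM pg_replication_slots;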

Or if we *can ensure* that the slot with invalidated_at set always means
"lost" slot, we can judge that wal_status is "lost" without using fragile
XLogGetLastRemovedSegno(). Thought?

Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
value even after the restart?

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Review for GetWALAvailability()

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-25, Fujii Masao wrote:

>   /*
>* Find the oldest extant segment file. We get 1 until checkpoint 
> removes
>* the first WAL segment file since startup, which causes the status 
> being
>* wrong under certain abnormal conditions but that doesn't actually 
> harm.
>*/
>   oldestSeg = XLogGetLastRemovedSegno() + 1;
> 
> I see the point of the above comment, but this can cause wal_status to be
> changed from "lost" to "unreserved" after the server restart. Isn't this
> really confusing? At least it seems better to document that behavior.

Hmm.

> Or if we *can ensure* that the slot with invalidated_at set always means
> "lost" slot, we can judge that wal_status is "lost" without using fragile
> XLogGetLastRemovedSegno(). Thought?

Hmm, this sounds compelling -- I think it just means we need to ensure
we reset invalidated_at to zero if the slot's restart_lsn is set to a
correct position afterwards.  I don't think we have any operation that
does that, so it should be safe -- hopefully I didn't overlook anything?
Neither copy nor advance seem to work with a slot that has invalid
restart_lsn.

> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
> value even after the restart?

This seems more work to implement.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: should libpq also require TLSv1.2 by default?

2020-06-24 Thread Michael Paquier
On Wed, Jun 24, 2020 at 10:50:39PM -0400, Tom Lane wrote:
> Can we do something comparable to the backend's HINT protocol, where
> we add on a comment that's only mostly-likely to be right?

OpenSSL publishes its error codes as of openssl/sslerr.h, and it looks
like the two error codes we would need to worry about are
SSL_R_UNSUPPORTED_PROTOCOL and SSL_R_NO_PROTOCOLS_AVAILABLE.  So we
could for example amend open_client_SSL() when negotiating the SSL
connection in libpq with error messages or hints that help better than
the current state of things, but that also means an extra maintenance
on our side to make sure that we keep in sync with new error codes
coming from the OpenSSL world.
--
Michael


signature.asc
Description: PGP signature


Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-24 Thread Fabien COELHO



Hello Tom,


>>>>   INSERT INTO t() VALUES ();
>>
>> I'm still unclear why it would be forbidden though, it seems logical to
>> try that, whereas the working one is quite away from the usual syntax.
>
> It's forbidden because the SQL standard forbids it.

Ok, that is definitely a reason. I'm not sure it is a good reason, though.

Why would the standard forbid it? From the language design point of view, 
it is basically having a syntax for lists which would not work for empty 
lists, or a syntax for strings which would not work for empty strings.

It also means that if for some reason someone wants to insert several such 
all-default rows, they have to repeat the insert, as "VALUES (), ();" 
would not work, so it is also losing a corner case capability 
without obvious reason.

> We allow zero-column syntaxes in some other places where SQL forbids
> them,

Then forbidding it there just adds awkwardness: the same thing works in 
one place but not in another. That does not help users.

> but that's only because there is no reasonable alternative. In this
> case, there's a perfectly good, standards-compliant alternative. So why
> encourage people to write unportable code?

I doubt that people look at the (costly) standard when writing corner case 
queries, they just try something logical as I did.

As some other databases accept it, and if it is already allowed elsewhere 
in pg, encouraging portability is not the main issue here. I'd rather have 
logic and uniformity across commands.

If I'm annoyed enough to send a patch some day, would you veto it because 
it departs from the standard?


Anyway, thanks for the answer!

--
Fabien.




Re: Review for GetWALAvailability()

2020-06-24 Thread Fujii Masao




On 2020/06/25 12:57, Alvaro Herrera wrote:
> On 2020-Jun-25, Fujii Masao wrote:
>
>>     /*
>>      * Find the oldest extant segment file. We get 1 until checkpoint removes
>>      * the first WAL segment file since startup, which causes the status being
>>      * wrong under certain abnormal conditions but that doesn't actually harm.
>>      */
>>     oldestSeg = XLogGetLastRemovedSegno() + 1;
>>
>> I see the point of the above comment, but this can cause wal_status to be
>> changed from "lost" to "unreserved" after the server restart. Isn't this
>> really confusing? At least it seems better to document that behavior.
>
> Hmm.
>
>> Or if we *can ensure* that the slot with invalidated_at set always means
>> "lost" slot, we can judge that wal_status is "lost" without using fragile
>> XLogGetLastRemovedSegno(). Thought?
>
> Hmm, this sounds compelling -- I think it just means we need to ensure
> we reset invalidated_at to zero if the slot's restart_lsn is set to a
> correct position afterwards.

Yes.

> I don't think we have any operation that
> does that, so it should be safe -- hopefully I didn't overlook anything?

We need to call ReplicationSlotMarkDirty() and ReplicationSlotSave()
just after setting invalidated_at and restart_lsn in
InvalidateObsoleteReplicationSlots()?
Otherwise, restart_lsn can go back to the previous value after the restart.

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e8761f3a18..5584e5dd2c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1229,6 +1229,13 @@ restart:
 		s->data.invalidated_at = s->data.restart_lsn;
 		s->data.restart_lsn = InvalidXLogRecPtr;
 		SpinLockRelease(&s->mutex);
+
+		/*
+		 * Save this invalidated slot to disk, to ensure that the slot
+		 * is still invalid even after the server restart.
+		 */
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
 		ReplicationSlotRelease();
 
 		/* if we did anything, start from scratch */

Maybe we don't need to do this if the slot is temporary?

> Neither copy nor advance seem to work with a slot that has invalid
> restart_lsn.
>
>> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
>> value even after the restart?
>
> This seems more work to implement.

Yes.

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: archive status ".ready" files may be created too early

2020-06-24 Thread Kyotaro Horiguchi
Hello.  Matsumura-san.

I agree that WAL writer is not the place to notify segments. And the
direction you suggested would work.

At Fri, 19 Jun 2020 10:18:34 +, "matsumura@fujitsu.com" 
 wrote in 
> 1. Description in primary side
> 
> [Basic problem]
>   A process flushing WAL record doesn't know whether the flushed RecPtr is 
>   EndRecPtr of cross-segment-boundary WAL record or not because only process 
>   inserting the WAL record knows it and it never memorizes the information to 
> anywhere.
> 
> [Basic concept of the patch in primary]
>   A process inserting a cross-segment-boundary WAL record memorizes its 
> EndRecPtr
>   (I call it CrossBoundaryEndRecPtr) to a new structure in XLogCtl.
>   A flushing process creates .ready (Later, I call it just 'notify'.) against 
>   a segment that is previous one including CrossBoundaryEndRecPtr only when 
> its 
>   flushed RecPtr is equal or greater than the CrossBoundaryEndRecPtr.
...
>   See also the definition of XLogCtl, XLOGShmemSize(), and XLOGShmemInit() in 
> my patch.

I think we don't need most of that shmem stuff.  XLogWrite is called
after WAL buffer is filled up to the requested position. So when it
crosses segment boundary we know the all past corss segment-boundary
records are stable. That means all we need to remember is only the
position of the latest corss-boundary record.

> * Action of inserting process
>   A inserting process memorie its CrossBoundaryEndRecPtr to 
> CrossBoundaryEndRecPtr
>   array element calculated by XLogRecPtrToBufIdx with its 
> CrossBoundaryEndRecPtr.
>   If the WAL record crosses many segments, only element against last segment
>   including the EndRecPtr is set and others are not set.
>   See also CopyXLogRecordToWAL() in my patch.

If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
every time a record is written, most of which are wasteful.
XLogInsertRecord already has a code block executed only at every page
boundary.

> * Action of flushing process
>   Overview has been already written as the follwing.
> A flushing process creates .ready (Later, I call it just 'notify'.) 
> against 
> a segment that is previous one including CrossBoundaryEndRecPtr only when 
> its 
> flushed RecPtr is equal or greater than the CrossBoundaryEndRecPtr.
> 
>   An additional detail is as the following.  The flushing process may notify
>   many segments if the record crosses many segments, so the process memorizes
>   latest notified segment number to latestArchiveNotifiedSegNo in XLogCtl.
>   The process notifies from latestArchiveNotifiedSegNo + 1 to
>   flushing segment number - 1.
>
>   And latestArchiveNotifiedSegNo is set to EndOfLog after Startup process 
> exits
>   replay-loop.  Standby also set same timing (= before promoting).
> 
>   Mutual exlusion about latestArchiveNotifiedSegNo is not required because
>   the timing of accessing has been already included in WALWriteLock critical 
> section.

Looks reasonable.

> 2. Description in standby side
> 
> * Who notifies?
>   walreceiver also doesn't know whether the flushed RecPtr is EndRecPtr of
>   cross-segment-boundary WAL record or not.  In standby, only Startup process
>   knows the information because it is hidden in WAL record itself and only
>   Startup process reads and builds WAL record.

Standby doesn't write its own WAL records.  Even if the primary sent an
immature record on a segment boundary, the standby would just promote to a
new TLI and start its own history. Nothing breaks.  It could be a problem
if a standby that crashed in the problematic way were started as-is as a
primary, but such a scenario is out of our scope.

Now we can identify the stable portion of the WAL stream. It's enough to
prevent walsender from sending data that can be overwritten
afterwards. GetReplicationTargetRecPtr() in the attached does that.

> * Action of Statup process
>   Therefore, I implemented that walreceiver never notify and Startup process 
> does it.
>   In detail, when Startup process reads one full-length WAL record, it 
> notifies
>   from a segment that includes head(ReadRecPtr) of the record to a previous 
> segment that 
>   includes EndRecPtr of the record.

I don't like that archiving on standby relies on replay progress.  We
should avoid that, and fortunately I think we don't need it.

>   Now, we must pay attention about switching time line.
>   The last segment of previous TimeLineID must be notified before switching.
>   This case is considered when RM_XLOG_ID is detected.

That segment is archived after renamed as ".partial" later. We don't
archive the last incomplete segment of the previous timeline as-is.

> 3. About other notifying for segment
> Two notifyings for segment are remain.  They are not needed to fix.
> 
> (1) Notifying for partial segment
> It is not needed to be completed, so it's OK to notify without special 
> consideration.
> 
> (2) Re-notifying
> Currently, Checkpointer has notified through XLogArchiveCheckDone().
> It is a sa

Re: some more pg_dump refactoring

2020-06-24 Thread Fabien COELHO


Hallo Peter,

My 0.02 €:

Patch applies cleanly, compiles, make check and pg_dump tap tests ok. The 
refactoring is a definite improvement.


You changed the query strings to use "\n" instead of " ". I would not have 
changed that, because it departs from the style around, and I do not think 
it improves readability at the C code level.


I tried to check manually and randomly that the same query is built for 
the same version, although my check may have been partial, especially on 
the aggregate query which does not comment about what is changed between 
versions, and my eyes are not very good at diffing.


I've noticed that some attributes are given systematic replacements (eg 
proparallel), removing the need to test for presence afterwards. This 
looks fine to me.


However, on version < 8.4, ISTM that funcargs and funciargs are always 
added: is this intentional?


Would it make sense to accumulate in the other direction, older to newer, 
so that new attributes are added at the end of the select?


Should array_to_string be pg_catalog.array_to_string? All other calls seem 
to have an explicit schema.
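
That is, the schema-qualified form, which keeps the generated SQL independent 
of search_path; a trivial illustration:

  SELECT pg_catalog.array_to_string(ARRAY['a', 'b', 'c'], ', ');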


I'm fine with inlining most PQfnumber calls.

I do not have old versions at hand for testing.

> Here is a patch to reorganize dumpFunc() and dumpAgg() in pg_dump, similar to
> daa9fe8a5264a3f192efa5ddee8fb011ad9da365.  Instead of repeating the almost
> same large query in each version branch, use one query and add a few columns
> to the SELECT list depending on the version.  This saves a lot of duplication.
>
> I have tested this with various old versions of PostgreSQL I had available,
> but a bit more random testing with old versions would be welcome.


--
Fabien.