Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
On Tue, Jun 23, 2020 at 7:55 PM Amit Kapila wrote: > > I don't see any other reason for > > looping over the NL node itself in this plan. The Gather itself > > doesn't do any real looping, right? > > It is right that Gather doesn't do looping but Parallel Seq Scan node does so. Sorry, I still don't follow. How does a Parallel Seq Scan do looping? I looked at the parallel plan docs but I don't see looping mentioned anywhere[1]. Also, is looping not normally indicated on children, rather than on the node doing the looping? E.g., with a standard Nested Loop, the outer child will have loops=1 and the inner child will have loops equal to the row count produced by the outer child (and the Nested Loop itself will have loops=1 unless it also is being looped over by a parent node), right? But even aside from that, why do I need to divide by the number of loops here, when normally Postgres presents a per-loop average? [1]: https://www.postgresql.org/docs/current/parallel-plans.html
RE: [PoC] Non-volatile WAL buffer
Dear hackers,

I update my non-volatile WAL buffer's patchset to v3. Now we can use it in streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
  Now walreceiver stores received records directly to non-volatile WAL buffer if applicable.

- pg_basebackup supports non-volatile WAL buffer
  Now pg_basebackup copies received WAL segments onto non-volatile WAL buffer if you run it with "nvwal" mode (-Fn). You should specify a new NVWAL path with --nvwal-path option. The path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is same as the master's one.

Best regards,
Takashi

--
Takashi Menjo
NTT Software Innovation Center

> -----Original Message-----
> From: Takashi Menjo
> Sent: Wednesday, March 18, 2020 5:59 PM
> To: 'PostgreSQL-development'
> Cc: 'Robert Haas' ; 'Heikki Linnakangas' ; 'Amit Langote'
> Subject: RE: [PoC] Non-volatile WAL buffer
>
> Dear hackers,
>
> I rebased my non-volatile WAL buffer's patchset onto master. A new v2 patchset is attached to this mail.
>
> I also measured performance before and after patchset, varying -c/--client and -j/--jobs options of pgbench, for each scaling factor s = 50 or 1000. The results are presented in the following tables and the attached charts. Conditions, steps, and other details will be shown later.
>
> Results (s=50)
> ==============
>            Throughput [10^3 TPS]     Average latency [ms]
>  ( c, j)   before  after             before  after
>  -------   ------  -------------     ------  -------------
>  ( 8, 8)   35.7    37.1 (+3.9%)      0.224   0.216 (-3.6%)
>  (18,18)   70.9    74.7 (+5.3%)      0.254   0.241 (-5.1%)
>  (36,18)   76.0    80.8 (+6.3%)      0.473   0.446 (-5.7%)
>  (54,18)   75.5    81.8 (+8.3%)      0.715   0.660 (-7.7%)
>
> Results (s=1000)
> ================
>            Throughput [10^3 TPS]     Average latency [ms]
>  ( c, j)   before  after             before  after
>  -------   ------  -------------     ------  -------------
>  ( 8, 8)   37.4    40.1 (+7.3%)      0.214   0.199 (-7.0%)
>  (18,18)   79.3    86.7 (+9.3%)      0.227   0.208 (-8.4%)
>  (36,18)   87.2    95.5 (+9.5%)      0.413   0.377 (-8.7%)
>  (54,18)   86.8    94.8 (+9.3%)      0.622   0.569 (-8.5%)
>
> Both throughput and average latency are improved for each scaling factor. Throughput seemed to almost reach the upper limit when (c,j)=(36,18).
>
> The percentage in s=1000 case looks larger than in s=50 case. I think larger scaling factor leads to less contentions on the same tables and/or indexes, that is, less lock and unlock operations. In such a situation, write-ahead logging appears to be more significant for performance.
>
> Conditions
> ==========
> - Use one physical server having 2 NUMA nodes (node 0 and 1)
> - Pin postgres (server processes) to node 0 and pgbench to node 1
> - 18 cores and 192GiB DRAM per node
> - Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
> - Both are installed on the server-side node, that is, node 0
> - Both are formatted with ext4
> - NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
> - Use the attached postgresql.conf
> - Two new items nvwal_path and nvwal_size are used only after patch
>
> Steps
> =====
> For each (c,j) pair, I did the following steps three times then I found the median of the three as a final result shown in the tables above.
>
> (1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
> (2) Start postgres and create a database for pgbench tables
> (3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
> (4) Stop postgres, remount filesystems, and start postgres again
> (5) Execute pg_prewarm extension for all the four pgbench tables
> (6) Run pgbench during 30 minutes
>
> pgbench command line
> ====================
> $ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname
>
> I gave no -b option to use the built-in "TPC-B (sort-of)" query.
>
> Software
> ========
> - Distro: Ubuntu 18.04
> - Kernel: Linux 5.4 (vanilla kernel)
> - C Compiler: gcc 7.4.0
> - PMDK: 1.7
> - PostgreSQL: d677550 (master on Mar 3, 2020)
>
> Hardware
> ========
> - System: HPE ProLiant DL380 Gen10
> - CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
> - DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
> - NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
> - NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA
>
> Best regards,
> Takashi
>
> --
> Takashi Menjo
> NTT Software Innovation Center
>
> > -----Original Message-----
> > From: Takashi Menjo
> > Sent: Thursday, February 20, 2020 6:30 PM
> > To: 'Amit Langote'
> > Cc: 'Robert Haas' ; 'Heikki Linnakangas' ; 'PostgreSQL-development'
> > Subject: RE: [PoC] Non-volatile WAL buffer
> >
> > Dear Amit,
> >
> > Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...
Re: Improve handling of parameter differences in physical replication
Here is another stab at this subject. This is a much simplified variant: When encountering a parameter change in the WAL that is higher than the standby's current setting, we log a warning (instead of an error until now) and pause recovery. If you resume (unpause) recovery, the instance shuts down as before. This allows you to keep your standbys running for a bit (depending on lag requirements) and schedule the required restart more deliberately. I had previously suggested making this new behavior configurable, but there didn't seem to be much interest in that, so I have not included that there. The documentation changes are mostly carried over from previous patch versions (but adjusted for the actual behavior of the patch). -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 2d3ffb221ae9418ebd7e1fc7fb771213014ca216 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 24 Jun 2020 08:49:42 +0200 Subject: [PATCH v3 1/2] Replace a macro by a function Using a macro is ugly and not justified here. --- src/backend/access/transam/xlog.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index a1256a103b..dcec2111b3 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6232,16 +6232,17 @@ GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream) * Note that text field supplied is a parameter name and does not require * translation */ -#define RecoveryRequiresIntParameter(param_name, currValue, minValue) \ -do { \ - if ((currValue) < (minValue)) \ - ereport(ERROR, \ - (errcode(ERRCODE_INVALID_PARAMETER_VALUE), \ -errmsg("hot standby is not possible because %s = %d is a lower setting than on the master server (its value was %d)", \ - param_name, \ - currValue, \ - minValue))); \ -} while(0) +static void +RecoveryRequiresIntParameter(const char *param_name, int currValue, int minValue) +{ + if (currValue < minValue) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("hot standby is not possible because %s = %d is a lower setting than on the master server (its value was %d)", + param_name, + currValue, + minValue))); +} /* * Check to see if required parameters are set high enough on this server -- 2.27.0 From 18740a021a90207423032cb7201f54650b81c15a Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 24 Jun 2020 09:18:21 +0200 Subject: [PATCH v3 2/2] Pause recovery for insufficient parameter settings When certain parameters are changed on a physical replication primary, this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL record. The standby then checks whether its own settings are at least as big as the ones on the primary. If not, the standby shuts down with a fatal error. This patch changes this behavior to pause recovery at that point instead. That allows read traffic on the standby to continue while database administrators figure out next steps. When recovery is unpaused, the server shuts down (as before). The idea is to fix the parameters while recovery is paused and then restart when there is a maintenance window. 
--- doc/src/sgml/high-availability.sgml | 47 + src/backend/access/transam/xlog.c | 17 ++- 2 files changed, 51 insertions(+), 13 deletions(-) diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 65c3fc62a9..d32ec6d9b3 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -2159,18 +2159,14 @@ Administrator's Overview -The setting of some parameters on the standby will need reconfiguration -if they have been changed on the primary. For these parameters, -the value on the standby must -be equal to or greater than the value on the primary. -Therefore, if you want to increase these values, you should do so on all -standby servers first, before applying the changes to the primary server. -Conversely, if you want to decrease these values, you should do so on the -primary server first, before applying the changes to all standby servers. -If these parameters -are not set high enough then the standby will refuse to start. -Higher values can then be supplied and the server -restarted to begin recovery again. These parameters are: +The settings of some parameters determine the s
Re: Allow CURRENT_ROLE in GRANTED BY
On 6/24/20 8:35 AM, Peter Eisentraut wrote: > I was checking some loose ends in SQL conformance, when I noticed: We > support GRANT role ... GRANTED BY CURRENT_USER, but we don't support > CURRENT_ROLE in that place, even though in PostgreSQL they are > equivalent. Here is a trivial patch to add that. The only thing that isn't dead-obvious about this patch is the commit message says "[PATCH 1/2]". What is in the other part? Assuming that's just a remnant of development, this LGTM. -- Vik Fearing
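For reference, the syntax in question is tiny; the patch just accepts CURRENT_ROLE where CURRENT_USER is already allowed (role names below are hypothetical):

    -- already supported:
    GRANT admin TO alice GRANTED BY CURRENT_USER;
    -- newly accepted, and equivalent in PostgreSQL:
    GRANT admin TO alice GRANTED BY CURRENT_ROLE;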
Re: Removal of currtid()/currtid2() and some table AM cleanup
Hi Michael, Where do you test, on Windows or on *nix? How do you test there? regards, Hiroshi Inoue On 2020/06/24 11:11, Michael Paquier wrote: On Tue, Jun 23, 2020 at 02:02:33PM +0900, Michael Paquier wrote: Actually, while reviewing the code, the only code path where we use currtid2() involves positioned_load() and LATEST_TUPLE_LOAD. And the only location where this happens is in SC_pos_reload_with_key(), where I don't actually see how it would be possible to not have a keyset and still use a CTID, which would led to LATEST_TUPLE_LOAD being used. So could it be possible that the code paths of currtid2() are actually just dead code? I have dug more into this one, and we actually stressed this code path quite a lot up to commit d9cb23f in the ODBC driver, with tests cursor-block-delete, positioned-update and bulkoperations particularly when calling SQLSetPos(). However, 86e2e7a has reworked the code in such a way that we visibly don't use anymore CTIDs if we don't have a keyset, and that combinations of various options like UseDeclareFetch or UpdatableCursors don't trigger this code path anymore. In short, currtid2() does not get used. Inoue-san, Saito-san, what do you think? I am adding also Tsunakawa-san in CC who has some experience in this area. -- Michael
Re: [Patch] ALTER SYSTEM READ ONLY
On 6/22/20 11:59 AM, Amul Sul wrote:

  2. Now skipping the startup checkpoint if the system is read-only mode, as discussed [2].

I am not able to perform pg_checksums o/p after shutting down my server in read only mode.

Steps -
1. initdb (./initdb -k -D data)
2. start the server (./pg_ctl -D data start)
3. connect to psql (./psql postgres)
4. Fire query (alter system read only;)
5. shutdown the server (./pg_ctl -D data stop)
6. pg_checksums

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
pg_checksums: error: cluster must be shut down
[edb@tushar-ldap-docker bin]$

Result - (when server is not in read only)

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
Checksum operation completed
Files scanned: 916
Blocks scanned: 2976
Bad checksums: 0
Data checksum version: 1

--
regards, tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
Re: should libpq also require TLSv1.2 by default?
> On 24 Jun 2020, at 08:39, Peter Eisentraut > wrote: > > In PG13, we raised the server-side default of ssl_min_protocol_version to > TLSv1.2. We also added a connection setting named ssl_min_protocol_version > to libpq. But AFAICT, the default value of the libpq setting is empty, so > any protocol version will be accepted. Is this what we wanted? Should we > raise the default in libpq as well? This was discussed [0] when the connection settings were introduced, and the concensus was to leave them alone [1] to allow for example a new pg_dump to work against an old server. Re-reading the thread I think the argument still holds, but I was about to respond "yes, let's do this" before refreshing my memory. Perhaps we should add a comment explaining this along the lines of the attached? cheers ./daniel [0] https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org [1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us libpq_minmaxproto_doc.diff Description: Binary data
Re: Parallel copy
Hi, It looks like the parsing of newly introduced "PARALLEL" option for COPY FROM command has an issue(in the 0002-Framework-for-leader-worker-in-parallel-copy.patch), Mentioning PARALLEL '4ar2eteid'); would pass with 4 workers since atoi() is being used for converting string to integer which just returns 4, ignoring other strings. I used strtol(), added error checks and introduced the error " improper use of argument to option "parallel"" for the above cases. parallel '4ar2eteid'); ERROR: improper use of argument to option "parallel" LINE 5: parallel '1\'); Along with the updated patch 0002-Framework-for-leader-worker-in-parallel-copy.patch, also attaching all the latest patches from [1]. [1] - https://www.postgresql.org/message-id/CALj2ACW94icER3WrWapon7JkcX8j0TGRue5ycWMTEvgA3X7fOg%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com On Tue, Jun 23, 2020 at 12:22 PM vignesh C wrote: > > On Tue, Jun 23, 2020 at 8:07 AM vignesh C wrote: > > I have attached the patch for the same with the fixes. > > The patches were not applying on the head, attached the patches that can be > applied on head. > I have added a commitfest entry[1] for this feature. > > [1] - https://commitfest.postgresql.org/28/2610/ > > > On Tue, Jun 23, 2020 at 8:07 AM vignesh C wrote: >> >> Thanks Ashutosh For your review, my comments are inline. >> On Fri, Jun 19, 2020 at 5:41 PM Ashutosh Sharma >> wrote: >> > >> > Hi, >> > >> > I just got some time to review the first patch in the list i.e. >> > 0001-Copy-code-readjustment-to-support-parallel-copy.patch. As the patch >> > name suggests, it is just trying to reshuffle the existing code for COPY >> > command here and there. There is no extra changes added in the patch as >> > such, but still I do have some review comments, please have a look: >> > >> > 1) Can you please add some comments atop the new function >> > PopulateAttributes() describing its functionality in detail. Further, this >> > new function contains the code from BeginCopy() to set attribute level >> > options used with COPY FROM such as FORCE_QUOTE, FORCE_NOT_NULL, >> > FORCE_NULL etc. in cstate and along with that it also copies the code from >> > BeginCopy() to set other infos such as client encoding type, encoding >> > conversion etc. Hence, I think it would be good to give it some better >> > name, basically something that matches with what actually it is doing. >> > >> >> There is no new code added in this function, some part of code from >> BeginCopy was made in to a new function as this part of code will also >> be required for the parallel copy workers before the workers start the >> actual copy operation. This code was made into a function to avoid >> duplication. Changed the function name to PopulateGlobalsForCopyFrom & >> added few comments. >> >> > 2) Again, the name for the new function CheckCopyFromValidity() doesn't >> > look good to me. From the function name it appears as if it does the >> > sanity check of the entire COPY FROM command, but actually it is just >> > doing the sanity check for the target relation specified with COPY FROM. >> > So, probably something like CheckTargetRelValidity would look more >> > sensible, I think? TBH, I am not good at naming the functions so you can >> > always ignore my suggestions about function and variable names :) >> > >> >> Changed as suggested. >> > 3) Any reason for not making CheckCopyFromValidity as a macro instead of a >> > new function. It is just doing the sanity check for the target relation. 
>> > >> >> I felt there is reasonable number of lines in the function & it is not >> in performance intensive path, so I preferred function over macro. >> Your thoughts? >> >> > 4) Earlier in CopyReadLine() function while trying to clear the EOL marker >> > from cstate->line_buf.data (copied data), we were not checking if the line >> > read by CopyReadLineText() function is a header line or not, but I can see >> > that your patch checks that before clearing the EOL marker. Any reason for >> > this extra check? >> > >> >> If you see the caller of CopyReadLine, i.e. NextCopyFromRawFields does >> nothing for the header line, server basically calls CopyReadLine >> again, it is a kind of small optimization. Anyway server is not going >> to do anything with header line, I felt no need to clear EOL marker >> for header lines. >> /* on input just throw the header line away */ >> if (cstate->cur_lineno == 0 && cstate->header_line) >> { >> cstate->cur_lineno++; >> if (CopyReadLine(cstate)) >> return false; /* done */ >> } >> >> cstate->cur_lineno++; >> >> /* Actually read the line into memory here */ >> done = CopyReadLine(cstate); >> I think no need to make a fix for this. Your thoughts? >> >> > 5) I noticed the below spurious line removal in the patch. >> > >> > @@ -3839,7 +3953,6 @@ static bool >> > CopyReadLine(CopyState cstate) >> > { >> > boolresult; >> > - >> > >> >
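Coming back to the strtol() change described at the top of this mail, a minimal sketch of that style of validation is below; the function name, error code and message are illustrative, not necessarily what the patch uses:

    /*
     * Hypothetical helper: parse the PARALLEL option strictly, rejecting
     * trailing garbage such as '4ar2eteid' that atoi() silently accepts.
     */
    static int
    ParseParallelOption(const char *str)
    {
        char   *endptr;
        long    val;

        errno = 0;
        val = strtol(str, &endptr, 10);
        if (errno != 0 || endptr == str || *endptr != '\0' ||
            val <= 0 || val > INT_MAX)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("improper use of argument to option \"parallel\"")));

        return (int) val;
    }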
Re: should libpq also require TLSv1.2 by default?
On Wed, Jun 24, 2020 at 10:33 AM Daniel Gustafsson wrote: > > On 24 Jun 2020, at 08:39, Peter Eisentraut < > peter.eisentr...@2ndquadrant.com> wrote: > > > > In PG13, we raised the server-side default of ssl_min_protocol_version > to TLSv1.2. We also added a connection setting named > ssl_min_protocol_version to libpq. But AFAICT, the default value of the > libpq setting is empty, so any protocol version will be accepted. Is this > what we wanted? Should we raise the default in libpq as well? > > This was discussed [0] when the connection settings were introduced, and > the > concensus was to leave them alone [1] to allow for example a new pg_dump to > work against an old server. Re-reading the thread I think the argument > still > holds, but I was about to respond "yes, let's do this" before refreshing my > memory. Perhaps we should add a comment explaining this along the lines > of the > attached? > > Another argument for not changing the default is that if you want to use SSL in any meaningful way you have to *already* change the connection string (with sslmode=require or verify-*), so it's not unreasonable to make that consideration at the same time. It might also be worth noting that it's not really "any protocol version", it means it will be "whatever the openssl configuration says", I think? For example, debian buster sets: [system_default_sect] MinProtocol = TLSv1.2 Which I believe means that if your libpq app is running on debian buster, it will be min v1.2 already (and it would likely be more useful to use ssl_min_protocol_version to *lower* that when connecting to older servers). //Magnus
Re: pg_resetwal --next-transaction-id may cause database failed to restart.
>Yeah, the normal workaround is to create the necessary file manually in
>order to let the system start after such an operation; they are
>sometimes necessary to enable testing weird cases with wraparound and
>such. So a total rejection to work for these cases would be unhelpful
>precisely for the scenario that those switches were intended to serve.

I think these words should appear in the pg_resetwal documentation if we decide to do nothing for this issue.

>Maybe a better answer is to have a new switch in postmaster that creates
>any needed files (incl. producing associated WAL etc); so you'd run
>pg_resetwal -x some-value
>postmaster --create-special-stuff
>then start your server and off you go.

As shown in the document, the option looks intended to take a safe input, so I think it's better to enforce that, and add a separate option to force writing an unsafe value if necessary.

>Now maybe this is too much complication for a mechanism that really
>isn't for general consumption anyway. I mean, if you're using
>pg_resetwal, you're already playing with fire.

Yes, that's true; I have always heard the advice "You'd better not use pg_resetwal". But since the tool ships with PostgreSQL, it's better to improve it than to do nothing.

Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
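For context, the "create the necessary file manually" workaround mentioned above looks roughly like this; the XID and file name are hypothetical, and rely on each pg_xact segment being 32 pages of 8kB covering 1048576 (0x100000) transactions, so the segment name is the XID divided by 1048576, in hex:

    # example only: jump the next transaction ID far ahead
    pg_resetwal -x 10485760 "$PGDATA"
    # create the zero-filled pg_xact segment the server will expect at startup
    dd if=/dev/zero of="$PGDATA/pg_xact/000A" bs=8192 count=32
    pg_ctl -D "$PGDATA" start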
Re: Removal of currtid()/currtid2() and some table AM cleanup
Hi Inoue-san, On Wed, Jun 24, 2020 at 05:20:42PM +0900, Inoue, Hiroshi wrote: > Where do you test, on Windows or on *nix? > How do you test there? I have been testing the driver on macos only, with various backend versions, from 11 to 14. Thanks, -- Michael signature.asc Description: PGP signature
Re: should libpq also require TLSv1.2 by default?
> On 24 Jun 2020, at 10:46, Magnus Hagander wrote: > It might also be worth noting that it's not really "any protocol version", it > means it will be "whatever the openssl configuration says", I think? For > example, debian buster sets: > > [system_default_sect] > MinProtocol = TLSv1.2 > > Which I believe means that if your libpq app is running on debian buster, it > will be min v1.2 already Correct, that being said I'm not sure how common it is for distributions to set a default protocol version. The macOS versions I have handy doesn't enforce a default version, nor does Ubuntu 20.04, FreeBSD 12 or OpenBSD 6.5 AFAICT. > (and it would likely be more useful to use ssl_min_protocol_version to > *lower* that when connecting to older servers). That is indeed one use-case for the connection parameter. cheers ./daniel
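For anyone following along, the per-connection override already exists in 13, so a client that cares can pin the floor (or lower it) itself; host names below are made up:

    # insist on at least TLSv1.2 regardless of the libpq default:
    psql "host=db.example.com sslmode=require ssl_min_protocol_version=TLSv1.2"
    # or relax it when talking to an old server that only speaks TLSv1:
    psql "host=legacy.example.com sslmode=require ssl_min_protocol_version=TLSv1"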
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 02:11:57PM +1200, David Rowley wrote: > On Tue, 23 Jun 2020 at 08:24, Jeff Davis wrote: > > Another way of looking at it is that the weird behavior is already > > there in v12, so there are already users relying on this weird behavior > > as a crutch for some other planner mistake. The question is whether we > > want to: > > > > (a) take the weird behavior away now as a consequence of implementing > > disk-based HashAgg; or > > (b) support the weird behavior forever; or > > (c) introduce a GUC now to help transition away from the weird behavior > > > > The danger with (c) is that it gives users more time to become more > > reliant on the weird behavior; and worse, a GUC could be seen as an > > endorsement of the weird behavior rather than a path to eliminating it. > > So we could intend to do (c) and end up with (b). We can mitigate this > > with documentation warnings, perhaps. > > So, I have a few thoughts on this subject. I understand both problem > cases have been mentioned before on this thread, but just to reiterate > the two problem cases that we really would rather people didn't hit. I appreciated this summary since I wasn't fully following the issues. > As for GUCs to try to help the group of users who, *I'm certain*, will > have problems with PG13's plan choice. I think the overloaded > enable_hashagg option is a really nice compromise. We don't really > have any other executor node type that has multiple GUCs controlling > its behaviour, so I believe it would be nice to keep it that way. So, in trying to anticipate how users will be affected by an API change, I try to look at similar cases where we already have this behavior, and how users react to this. Looking at the available join methods, I think we have one. We currently support: * nested loop with sequential scan * nested loop with index scan * hash join * merge join It would seem merge join has almost the same complexities as the new hash join code, since it can spill to disk doing sorts for merge joins, and adjusting work_mem is the only way to control that spill to disk. I don't remember anyone complaining about spills to disk during merge join, so I am unclear why we would need a such control for hash join. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
PostgreSQL and big data - FDW
Hi I would like to use a Foreign Data Wrapper (FDW) to connect to a HADOOP cluster which uses KERBEROS authentication. is it possible to achieve this ? which FDW should be used ? Thanks in advance Best Regards Didier ROS EDF
Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
On Wed, Jun 24, 2020 at 12:41 PM Maciek Sakrejda wrote:
>
> On Tue, Jun 23, 2020 at 7:55 PM Amit Kapila wrote:
> > > I don't see any other reason for
> > > looping over the NL node itself in this plan. The Gather itself
> > > doesn't do any real looping, right?
> >
> > It is right that Gather doesn't do looping but Parallel Seq Scan node does
> > so.
>
> Sorry, I still don't follow. How does a Parallel Seq Scan do looping?

Sorry, I intend to say that Parallel Seq Scan is involved in looping. Let me try by example:

Gather (actual time=6.444..722.642 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Nested Loop (actual time=0.046..705.936 rows=5000 loops=2)
        ->  Parallel Seq Scan on t1 (actual time=0.010..45.423 rows=25 loops=2)
        ->  Index Scan using idx_t2 on t2 (actual time=0.002..0.002 rows=0 loops=50)
              Index Cond: (c1 = t1.c1)

In the above plan, each of the workers runs
  NestLoop
    -> Parallel Seq Scan on t1
    -> Index Scan using idx_t2 on t2
So, that leads to loops as 2 on the "Parallel Seq Scan" and "Nested Loop" nodes. Does this make sense now?

> I looked at the parallel plan docs but I don't see looping mentioned
> anywhere[1]. Also, is looping not normally indicated on children,
> rather than on the node doing the looping? E.g., with a standard
> Nested Loop, the outer child will have loops=1 and the inner child
> will have loops equal to the row count produced by the outer child
> (and the Nested Loop itself will have loops=1 unless it also is being
> looped over by a parent node), right?
>

Yeah, I hope the above has clarified it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: min_safe_lsn column in pg_replication_slots view
On 2020/06/23 15:27, Amit Kapila wrote: On Tue, Jun 23, 2020 at 7:47 AM Fujii Masao wrote: On 2020/06/23 10:10, Kyotaro Horiguchi wrote: At Mon, 22 Jun 2020 22:02:51 +0900, Fujii Masao wrote in I feel such a function is good to have but I am not sure if there is a need to tie it with the removal of min_safe_lsn column. We should expose the LSN calculated from "the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)"? This indicates the minimum LSN of WAL files that are guaraneed to be currently retained by wal_keep_segments and max_slot_wal_keep_size. That is, if checkpoint occurs when restart_lsn of replication slot is smaller than that minimum LSN, some required WAL files may be removed. So DBAs can periodically monitor and compare restart_lsn and that minimum LSN. If they see frequently that difference of those LSN is very small, they can decide to increase wal_keep_segments or max_slot_wal_keep_size, to prevent required WAL files from being removed. Thought? I'm not sure about the consensus here about showing that number in the view. It is similar to "remain" in the earlier versions of this patch but a bit simpler. It would be usable in a similar way. I can live with either numbers. It's useless to display this value in each replication slot in the view. I'm thinking to expose it as a function. Having a separate function for this seems like a good idea but can we consider displaying it in a view like pg_stat_replication_slots as we are discussing a nearby thread to have such a view for other things. I think ultimately this information is required to check whether some slot can be invalidated or not, so having it displayed along with other slot information might not be a bad idea. "the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)" is the same value between all the replication slots. But you think it's better to display that same value for every slots in the view? Or you're thinking to display the difference of that LSN value and restart_lsn as Horiguchi-san suggested? That diff varies each replication slot, so it seems ok to display it for every rows. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
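As a rough illustration of the kind of monitoring discussed above, something similar can already be approximated from SQL; this sketch only accounts for max_slot_wal_keep_size (it ignores wal_keep_segments and segment rounding, and assumes the setting is not -1):

    SELECT slot_name,
           pg_size_pretty(
               pg_size_bytes(current_setting('max_slot_wal_keep_size'))
               - pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS approx_margin
    FROM pg_replication_slots;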
Re: min_safe_lsn column in pg_replication_slots view
On 2020/06/24 8:39, Alvaro Herrera wrote: On 2020-Jun-23, Kyotaro Horiguchi wrote: At Tue, 23 Jun 2020 11:50:34 +0530, Amit Kapila wrote in On Mon, Jun 22, 2020 at 6:32 PM Fujii Masao wrote: We should expose the LSN calculated from "the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)"? This indicates the minimum LSN of WAL files that are guaraneed to be currently retained by wal_keep_segments and max_slot_wal_keep_size. That is, if checkpoint occurs when restart_lsn of replication slot is smaller than that minimum LSN, some required WAL files may be removed. So DBAs can periodically monitor and compare restart_lsn and that minimum LSN. If they see frequently that difference of those LSN is very small, they can decide to increase wal_keep_segments or max_slot_wal_keep_size, to prevent required WAL files from being removed. Thought? +1. This sounds like a good and useful stat for users. +1 for showing a number that is not involving lastRemovedSegNo. It is like returning to the initial version of this patch. It showed a number like ((the suggested above) minus restart_lsn). The number is different for each slot so they fit in the view. The number is usable for the same purpose so I'm ok with it. I think we should publish the value from wal_keep_segments separately from max_slot_wal_keep_size. ISTM that the user might decide to change or remove wal_keep_segments and be suddenly at risk of losing slots because of overlooking that it was wal_keep_segments, not max_slot_wal_keep_size, that was protecting them. You mean to have two functions that returns 1. "current WAL LSN - wal_keep_segments * 16MB" 2. "current WAL LSN - max_slot_wal_keep_size" Right? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: PostgreSQL and big data - FDW
On Wed, Jun 24, 2020 at 09:05:30AM +, ROS Didier wrote: > Hi > > I would like to use a Foreign Data Wrapper (FDW) to connect to a HADOOP > cluster > which uses KERBEROS authentication. > > is it possible to achieve this ? which FDW should be used ? Well, I would use the Hadoop FDW: https://github.com/EnterpriseDB/hdfs_fdw and it only supports these authentication methods: Authentication Support The FDW supports NOSASL and LDAP authentication modes. In order to use NOSASL do not specify any OPTIONS while creating user mapping. For LDAP username and password must be specified in OPTIONS while creating user mapping. Not every FDW supports every Postgres server authentication method. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
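For what it's worth, the LDAP mode quoted above is wired up roughly as below. The username/password user-mapping options come from the README text quoted above; the server-level option names (host, port, client_type, auth_type) are assumptions to be checked against the hdfs_fdw documentation:

    CREATE EXTENSION hdfs_fdw;
    CREATE SERVER hdfs_server FOREIGN DATA WRAPPER hdfs_fdw
        OPTIONS (host 'hiveserver.example.com', port '10000',   -- assumed option names
                 client_type 'hiveserver2', auth_type 'LDAP');
    CREATE USER MAPPING FOR CURRENT_USER SERVER hdfs_server
        OPTIONS (username 'ldap_user', password 'secret');
    -- Kerberos/GSSAPI is not among the authentication modes listed above.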
Re: Assertion failure in pg_copy_logical_replication_slot()
On 2020/06/24 9:38, Alvaro Herrera wrote: On 2020-Jun-23, Fujii Masao wrote: If restart_lsn of logical replication slot gets behind more than max_slot_wal_keep_size from the current LSN, the logical replication slot would be invalidated and its restart_lsn is reset to an invalid LSN. If this logical replication slot with an invalid restart_lsn is specified as the source slot in pg_copy_logical_replication_slot(), the function causes the following assertion failure. TRAP: FailedAssertion("!logical_slot", File: "slotfuncs.c", Line: 727) Oops. This assertion failure is caused by /* Copying non-reserved slot doesn't make sense */ if (XLogRecPtrIsInvalid(src_restart_lsn)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot copy a replication slot that doesn't reserve WAL"))); Heh, you pasted the code after your patch rather than the original. oh sorry. I think the errcode is a bit bogus considering the new case. IMO ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is more appropriate. Agreed. So I updated the patch so this errcode is used instead. Patch attached. One could argue that the error message could also be different for the case of a logical slot (or even a physical slot that has the upcoming "invalidated_at" LSN set), maybe "cannot copy a replication slot that has been invalidated" but maybe that's a pointless distinction. I don't object to the patch as presented. I have no strong opinion about this, but for now I kept the message as it is. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 06e4955de7..ce17d44227 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -723,12 +723,9 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot) /* Copying non-reserved slot doesn't make sense */ if (XLogRecPtrIsInvalid(src_restart_lsn)) - { - Assert(!logical_slot); ereport(ERROR, - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("cannot copy a replication slot that doesn't reserve WAL"))); - } /* Overwrite params from optional arguments */ if (PG_NARGS() >= 3)
Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
On Tue, Jun 23, 2020 at 12:55 AM Maciek Sakrejda wrote:
>
> Hello,
>
> I had some questions about the behavior of some accounting in parallel
> EXPLAIN plans. Take the following plan:
>
> ..
>
> The Nested Loop here aggregates data for metrics like `buffers read`
> from its workers, and to calculate a metric like `buffers read` for
> the parallel leader, we can subtract the values recorded in each
> individual worker. This happens in the Seq Scan and Index Scan
> children, as well. However, the Gather node appears to only include
> values from its direct parallel leader child (excluding that child's
> workers).
>

I have tried to check a similar plan and for me, the values at the Gather node seem to be considering the values from all workers and the leader (aka whatever is displayed at the "Nested Loop" node), see below. I have tried the test on HEAD. Which version of PostgreSQL are you using? If you are also using the latest version then it is possible that in some cases it is not displaying correct data. If that turns out to be the case, then feel free to share the test case. Sorry for the confusion caused by my previous reply.

Gather (actual time=2.083..550.093 rows=1 loops=1)
  Output: t1.c1, t1.c2, t2.c1, t2.c2
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=1012621 read=84
  I/O Timings: read=0.819
  ->  Nested Loop (actual time=0.084..541.882 rows= loops=3)
        Output: t1.c1, t1.c2, t2.c1, t2.c2
        Buffers: shared hit=1012621 read=84
        I/O Timings: read=0.819
        Worker 0: actual time=0.069..541.249 rows=3155 loops=1
          Buffers: shared hit=326529 read=29
          I/O Timings: read=0.325
        Worker 1: actual time=0.063..541.376 rows=3330 loops=1
          Buffers: shared hit=352045 read=26
          I/O Timings: read=0.179
        ->  Parallel Seq Scan on public.t1 (actual time=0.011..34.250 rows=17 loops=3)
              Output: t1.c1, t1.c2
              Buffers: shared hit=2703
              Worker 0: actual time=0.011..33.785 rows=161265 loops=1
                Buffers: shared hit=872
              Worker 1: actual time=0.009..34.582 rows=173900 loops=1
                Buffers: shared hit=940
        ->  Index Scan using idx_t2 on public.t2 (actual time=0.003..0.003 rows=0 loops=50)
              Output: t2.c1, t2.c2
              Index Cond: (t2.c1 = t1.c1)
              Buffers: shared hit=1009918 read=84
              I/O Timings: read=0.819
              Worker 0: actual time=0.003..0.003 rows=0 loops=161265
                Buffers: shared hit=325657 read=29
                I/O Timings: read=0.325
              Worker 1: actual time=0.002..0.002 rows=0 loops=173900
                Buffers: shared hit=351105 read=26
                I/O Timings: read=0.179

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020/06/24 11:56, Kyotaro Horiguchi wrote: At Tue, 23 Jun 2020 10:51:40 +0900, Michael Paquier wrote in On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote: I still maintain that adding restrictions here is a bad idea. Even disregarding the discussion of running normal queries interspersed, it's useful to be able to both request WAL and receive logical changes over the same connection. E.g. for creating a logical replica by first doing a physical base backup (vastly faster), or fetching WAL for decoding large transactions onto a standby. And I just don't see any reasons to disallow it. There's basically no reduction in complexity by doing so. Yeah, I still stand by the same opinion here to do nothing. I suspect that we have good chances to annoy people and some cases we are overlooking here, that used to work. In logical replication, a replication role is intended to be accessible only to the GRANTed databases. On the other hand the same role can create a dead copy of the whole cluster, including non-granted databases. It seems like a sieve missing a mesh screen. Personally I'd like to disallow physical replication commands when I explicitly reject physical replication connection (i.e., set "host replication user x.x.x.x/x reject") in pg_hba.conf, whether on physical or logical replication connection. I agree that that doesn't harm as far as roles are strictly managed so I don't insist so strongly on inhibiting the behavior. However, the documentation at least needs amendment. +1 Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
[PATCH] COPY command's data format option allows only lowercase csv, text or binary
Hi,

COPY command's FORMAT option currently allows only all-lowercase csv, text or binary; this is because strcmp is being used while parsing these values. It would be nice if uppercase or mixed-case spellings such as CSV, TEXT, BINARY, Csv, Text, Binary, and so on were also allowed. To achieve this, pg_strcasecmp() is used instead of strcmp.

Attached is a patch with the above changes. Request the community to review the patch, if it makes sense.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v1-0001-COPY-command-s-data-format-option-allows-only-low.patch
Description: Binary data
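For reviewers, the spot being changed in ProcessCopyOptions() is essentially of this shape; this is a sketch of the case-insensitive version, not the patch itself:

    char   *fmt = defGetString(defel);

    if (pg_strcasecmp(fmt, "text") == 0)
        /* default format, nothing to set */ ;
    else if (pg_strcasecmp(fmt, "csv") == 0)
        cstate->csv_mode = true;
    else if (pg_strcasecmp(fmt, "binary") == 0)
        cstate->binary = true;
    else
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("COPY format \"%s\" not recognized", fmt)));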
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar wrote: > > Here is the POC patch to discuss the idea of a cleanup of shared > fileset on proc exit. As discussed offlist, here I am maintaining > the list of shared fileset. First time when the list is NULL I am > registering the cleanup function with on_proc_exit routine. After > that for subsequent fileset, I am just appending it to filesetlist. > There is also an interface to unregister the shared file set from the > cleanup list and that is done by the caller whenever we are deleting > the shared fileset manually. While explaining it here, I think there > could be one issue if we delete all the element from the list will > become NULL and on next SharedFileSetInit we will again register the > function. Maybe that is not a problem but we can avoid registering > multiple times by using some flag in the file > I don't understand what you mean by "using some flag in the file". Review comments on various patches. poc_shared_fileset_cleanup_on_procexit = 1. - ent->subxact_fileset = - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); + MemoryContext oldctx; + /* Shared fileset handle must be allocated in the persistent context */ + oldctx = MemoryContextSwitchTo(ApplyContext); + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); SharedFileSetInit(ent->subxact_fileset, NULL); + MemoryContextSwitchTo(oldctx); fd = BufFileCreateShared(ent->subxact_fileset, path); Why is this change required for this patch and why we only cover SharedFileSetInit in the Apply context and not BufFileCreateShared? The comment is also not very clear on this point. 2. +void +SharedFileSetUnregister(SharedFileSet *input_fileset) +{ + bool found = false; + ListCell *l; + + Assert(filesetlist != NULL); + + /* Loop over all the pending shared fileset entry */ + foreach (l, filesetlist) + { + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); + + /* remove the entry from the list and delete the underlying files */ + if (input_fileset->number == fileset->number) + { + SharedFileSetDeleteAll(fileset); + filesetlist = list_delete_cell(filesetlist, l); Why are we calling SharedFileSetDeleteAll here when in the caller we have already deleted the fileset as per below code? BufFileDeleteShared(ent->stream_fileset, path); + SharedFileSetUnregister(ent->stream_fileset); I think it will be good if somehow we can remove the fileset from filesetlist during BufFileDeleteShared. If that is possible, then we don't need a separate API for SharedFileSetUnregister. 3. +static List * filesetlist = NULL; + static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum); +static void SharedFileSetOnProcExit(int status, Datum arg); static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace); static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name); static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name); @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) /* Register our cleanup callback. */ if (seg) on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); + else + { + if (filesetlist == NULL) + on_proc_exit(SharedFileSetOnProcExit, 0); We use NIL for list initialization and comparison. See lock_files usage. 4. 
+SharedFileSetOnProcExit(int status, Datum arg) +{ + ListCell *l; + + /* Loop over all the pending shared fileset entry */ + foreach (l, filesetlist) + { + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); + SharedFileSetDeleteAll(fileset); + } We can initialize filesetlist as NIL after the for loop as it will make the code look clean. Comments on other patches: = 5. > 3. On concurrent abort we are truncating all the changes including > some incomplete changes, so later when we get the complete changes we > don't have the previous changes, e.g, if we had specinsert in the > last stream and due to concurrent abort detection if we delete that > changes later we will get spec_confirm without spec insert. We could > have simply avoided deleting all the changes, but I think the better > fix is once we detect the concurrent abort for any transaction, then > why do we need to collect the changes for that, we can simply avoid > that. So I have put that fix. (0006) > On similar lines, I think we need to skip processing message, see else part of code in ReorderBufferQueueMessage. 6. In v29-0002-Issue-individual-invalidations-with-wal_level-lo, xact_desc_invalidations seems to be a subset of standby_desc_invalidations, can we have a common code for them? 7. I think we can avoid sending v29-0007-Track-statistics-for-streaming this each time. We can do this after the main patch is complete. Also, we might need to change how and where these stats will be tracked. See the related discussion [1]. 8. In v29-0005-Implement-streaming-mode-in-ReorderBuffer, * Return oldest transaction in reorderbuffer @
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Jun 22, 2020 at 11:56 AM Dilip Kumar wrote: > > On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila wrote: > > > > > 8. > > + /* > > + * Start a transaction on stream start, this transaction will be committed > > + * on the stream stop. We need the transaction for handling the buffile, > > + * used for serializing the streaming data and subxact info. > > + */ > > + ensure_transaction(); > > > > I think we need this for PrepareTempTablespaces to set the > > temptablespaces. Also, isn't it required for a cleanup of buffile > > resources at the transaction end? Are there any other reasons for it > > as well? The comment should be a bit more clear for why we need a > > transaction here. > > I am not sure that will it make sense to add a comment here that why > buffile and sharedfileset need a transaction? > You can say usage of BufFile interface expects us to be in the transaction for so and so reason Do you think that we > should add comment in buffile/shared fileset API that it should be > called under a transaction? > I am fine with that as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
iOn Wed, Jun 24, 2020 at 4:04 PM Amit Kapila wrote: > > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar wrote: > > > > Here is the POC patch to discuss the idea of a cleanup of shared > > fileset on proc exit. As discussed offlist, here I am maintaining > > the list of shared fileset. First time when the list is NULL I am > > registering the cleanup function with on_proc_exit routine. After > > that for subsequent fileset, I am just appending it to filesetlist. > > There is also an interface to unregister the shared file set from the > > cleanup list and that is done by the caller whenever we are deleting > > the shared fileset manually. While explaining it here, I think there > > could be one issue if we delete all the element from the list will > > become NULL and on next SharedFileSetInit we will again register the > > function. Maybe that is not a problem but we can avoid registering > > multiple times by using some flag in the file > > > > I don't understand what you mean by "using some flag in the file". Basically, in POC as shown in below code snippet, We are checking that if the "filesetlist" is NULL then only register the on_proc_exit function. But, as described above if all the items are deleted the list will be NULL. So I told that instead of checking the filesetlist is NULL, we can have just a boolean variable that if we have registered the callback then don't do it again. @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) /* Register our cleanup callback. */ if (seg) on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); + else + { + if (filesetlist == NULL) + on_proc_exit(SharedFileSetOnProcExit, 0); + + filesetlist = lcons((void *)fileset, filesetlist); + } } > > Review comments on various patches. > > poc_shared_fileset_cleanup_on_procexit > = > 1. > - ent->subxact_fileset = > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > + MemoryContext oldctx; > > + /* Shared fileset handle must be allocated in the persistent context */ > + oldctx = MemoryContextSwitchTo(ApplyContext); > + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); > SharedFileSetInit(ent->subxact_fileset, NULL); > + MemoryContextSwitchTo(oldctx); > fd = BufFileCreateShared(ent->subxact_fileset, path); > > Why is this change required for this patch and why we only cover > SharedFileSetInit in the Apply context and not BufFileCreateShared? > The comment is also not very clear on this point. Because only the sharedfileset and the filesetlist which is allocated under SharedFileSetInit, are required in the permanent context. BufFileCreateShared, only creates the Buffile and VFD which will be required only within the current stream so transaction context is enough. > 2. > +void > +SharedFileSetUnregister(SharedFileSet *input_fileset) > +{ > + bool found = false; > + ListCell *l; > + > + Assert(filesetlist != NULL); > + > + /* Loop over all the pending shared fileset entry */ > + foreach (l, filesetlist) > + { > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > + > + /* remove the entry from the list and delete the underlying files */ > + if (input_fileset->number == fileset->number) > + { > + SharedFileSetDeleteAll(fileset); > + filesetlist = list_delete_cell(filesetlist, l); > > Why are we calling SharedFileSetDeleteAll here when in the caller we > have already deleted the fileset as per below code? 
> BufFileDeleteShared(ent->stream_fileset, path); > + SharedFileSetUnregister(ent->stream_fileset); > > I think it will be good if somehow we can remove the fileset from > filesetlist during BufFileDeleteShared. If that is possible, then we > don't need a separate API for SharedFileSetUnregister. But the filesetlist is maintained at the sharedfileset level, so even if we delete from BufFileDeleteShared, we need to call an API from the sharedfileset layer to unregister the fileset. Am I missing something? > 3. > +static List * filesetlist = NULL; > + > static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum); > +static void SharedFileSetOnProcExit(int status, Datum arg); > static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid > tablespace); > static void SharedFilePath(char *path, SharedFileSet *fileset, const > char *name); > static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name); > @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) > /* Register our cleanup callback. */ > if (seg) > on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); > + else > + { > + if (filesetlist == NULL) > + on_proc_exit(SharedFileSetOnProcExit, 0); > > We use NIL for list initialization and comparison. See lock_files usage. Right. > 4. > +SharedFileSetOnProcExit(int status, Datum arg) > +{ > + ListCell *l; > + > + /* Loop over all the pending shared fileset entry */ > + foreach (l, filesetlist) > + { > + SharedFileSet *fileset = (Sh
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jun 24, 2020 at 4:27 PM Dilip Kumar wrote: > > iOn Wed, Jun 24, 2020 at 4:04 PM Amit Kapila wrote: > > > > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar wrote: > > > > > > Here is the POC patch to discuss the idea of a cleanup of shared > > > fileset on proc exit. As discussed offlist, here I am maintaining > > > the list of shared fileset. First time when the list is NULL I am > > > registering the cleanup function with on_proc_exit routine. After > > > that for subsequent fileset, I am just appending it to filesetlist. > > > There is also an interface to unregister the shared file set from the > > > cleanup list and that is done by the caller whenever we are deleting > > > the shared fileset manually. While explaining it here, I think there > > > could be one issue if we delete all the element from the list will > > > become NULL and on next SharedFileSetInit we will again register the > > > function. Maybe that is not a problem but we can avoid registering > > > multiple times by using some flag in the file > > > > > > > I don't understand what you mean by "using some flag in the file". > > Basically, in POC as shown in below code snippet, We are checking > that if the "filesetlist" is NULL then only register the on_proc_exit > function. But, as described above if all the items are deleted the > list will be NULL. So I told that instead of checking the filesetlist > is NULL, we can have just a boolean variable that if we have > registered the callback then don't do it again. > Check if there is any precedent of the same in the code? > > > > > Review comments on various patches. > > > > poc_shared_fileset_cleanup_on_procexit > > = > > 1. > > - ent->subxact_fileset = > > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > > + MemoryContext oldctx; > > > > + /* Shared fileset handle must be allocated in the persistent context */ > > + oldctx = MemoryContextSwitchTo(ApplyContext); > > + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); > > SharedFileSetInit(ent->subxact_fileset, NULL); > > + MemoryContextSwitchTo(oldctx); > > fd = BufFileCreateShared(ent->subxact_fileset, path); > > > > Why is this change required for this patch and why we only cover > > SharedFileSetInit in the Apply context and not BufFileCreateShared? > > The comment is also not very clear on this point. > > Because only the sharedfileset and the filesetlist which is allocated > under SharedFileSetInit, are required in the permanent context. > BufFileCreateShared, only creates the Buffile and VFD which will be > required only within the current stream so transaction context is > enough. > Okay, then add some more comments to explain it or if you have explained it elsewhere, then add a reference for the same. > > 2. > > +void > > +SharedFileSetUnregister(SharedFileSet *input_fileset) > > +{ > > + bool found = false; > > + ListCell *l; > > + > > + Assert(filesetlist != NULL); > > + > > + /* Loop over all the pending shared fileset entry */ > > + foreach (l, filesetlist) > > + { > > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > > + > > + /* remove the entry from the list and delete the underlying files */ > > + if (input_fileset->number == fileset->number) > > + { > > + SharedFileSetDeleteAll(fileset); > > + filesetlist = list_delete_cell(filesetlist, l); > > > > Why are we calling SharedFileSetDeleteAll here when in the caller we > > have already deleted the fileset as per below code? 
> > BufFileDeleteShared(ent->stream_fileset, path); > > + SharedFileSetUnregister(ent->stream_fileset); > > > > I think it will be good if somehow we can remove the fileset from > > filesetlist during BufFileDeleteShared. If that is possible, then we > > don't need a separate API for SharedFileSetUnregister. > > But the filesetlist is maintained at the sharedfileset level, so even > if we delete from BufFileDeleteShared, we need to call an API from the > sharedfileset layer to unregister the fileset. > Sure, but isn't it better if we can call such an API from BufFileDeleteShared? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods
On Tue, Jun 23, 2020 at 4:00 PM Andres Freund wrote: > https://postgr.es/m/20130621000900.GA12425%40alap2.anarazel.de is a > thread with more information / patches further along. > > I confused this patch with the approach in > https://www.postgresql.org/message-id/d8576096-76ba-487d-515b-44fdedba8bb5%402ndquadrant.com > sorry for that. It obviously still differs by not having lower space > overhead (by virtue of not having a 4 byte 'va_cmid', but no additional > space for two methods, and then 1 byte overhead for 256 more), but > that's not that fundamental a difference. Wait a minute. Are we saying there are three (3) dueling patches for adding an alternate TOAST algorithm? It seems like there are:
This "custom compression methods" thread - vintage 2017 - Original code by Nikita Glukhov, later work by Ildus Kurbangaliev
The "pluggable compression support" thread - vintage 2013 - Andres Freund
The "pglz performance" thread - vintage 2019 - Petr Jelinek
Anyone want to point to a FOURTH implementation of this feature? I guess the next thing to do is figure out which one is the best basis for further work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Why forbid "INSERT INTO t () VALUES ();"
Hello devs, I would like to create an "all defaults" row, i.e. a row composed of the default values for all attributes, so I wrote: INSERT INTO t() VALUES (); This is forbidden by postgres, and also sqlite. Is there any good reason why this should be the case? -- Fabien.
Re: Why forbid "INSERT INTO t () VALUES ();"
Fabien COELHO wrote on 24.06.2020 at 14:18: > I would like to create an "all defaults" row, i.e. a row composed of the > default values for all attributes, so I wrote: > > INSERT INTO t() VALUES (); > > This is forbidden by postgres, and also sqlite. > > Is there any good reason why this should be the case? >
Maybe because

    insert into t default values;

exists (and is standard SQL if I'm not mistaken)

Thomas
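A short sketch of the working spellings (assuming, for illustration only, that t has two columns):

    -- Standard SQL: insert one row consisting entirely of default values
    INSERT INTO t DEFAULT VALUES;
    -- Equivalent workaround: name the defaults explicitly, one per column
    INSERT INTO t VALUES (DEFAULT, DEFAULT);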
Re: Default setting for enable_hashagg_disk
On Wed, 24 Jun 2020 at 21:06, Bruce Momjian wrote: > I > don't remember anyone complaining about spills to disk during merge > join, so I am unclear why we would need such a control for hash join. Hash aggregate, you mean? The reason is that upgrading to PG13 can cause a performance regression for an underestimated ndistinct on the GROUP BY clause and cause hash aggregate to spill to disk where it previously did everything in RAM. Sure, that behaviour was never what we wanted to happen, Jeff has fixed that now, but the fact remains that this does happen in the real world quite often and people often get away with it, likely because work_mem is generally set to some very conservative value. Of course, there's also a bunch of people that have been bitten by OOM due to this too. The "neverspill" wouldn't be for those people. Certainly, it's possible that we just tell these people to increase work_mem for this query, that way they can set it to something reasonable and still get spilling if it's really needed to save them from OOM, but the problem there is that it's not very easy to go and set work_mem for a certain query. FWIW, I wish that I wasn't suggesting we do this, but I am because it seems simple enough to implement and it removes a pretty big roadblock that might exist for a small subset of people wishing to upgrade to PG13. It seems lightweight enough to maintain, at least until we invent some better management of how many executor nodes we can have allocating work_mem at once. The suggestion I made was just based on asking myself the following set of questions: Since Hash Aggregate has been able to overflow work_mem since day 1, and now that we've completely changed that fact in PG13, is that likely to upset anyone? If so, should we go to the trouble of giving those people a way of getting the old behaviour back? If we do want to help those people, what's the best way to make those options available to them in a way that we can remove the special options with the least pain in some future version of PostgreSQL? I'd certainly be interested in hearing how other people would answer those questions. David
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote: > On Wed, Jun 24, 2020 at 02:11:57PM +1200, David Rowley wrote: > > On Tue, 23 Jun 2020 at 08:24, Jeff Davis wrote: > > > Another way of looking at it is that the weird behavior is already > > > there in v12, so there are already users relying on this weird behavior > > > as a crutch for some other planner mistake. The question is whether we > > > want to: > > > > > > (a) take the weird behavior away now as a consequence of implementing > > > disk-based HashAgg; or > > > (b) support the weird behavior forever; or > > > (c) introduce a GUC now to help transition away from the weird behavior > > > > > > The danger with (c) is that it gives users more time to become more > > > reliant on the weird behavior; and worse, a GUC could be seen as an > > > endorsement of the weird behavior rather than a path to eliminating it. > > > So we could intend to do (c) and end up with (b). We can mitigate this > > > with documentation warnings, perhaps. > > > > So, I have a few thoughts on this subject. I understand both problem > > cases have been mentioned before on this thread, but just to reiterate > > the two problem cases that we really would rather people didn't hit. > > I appreciated this summary since I wasn't fully following the issues. > > > As for GUCs to try to help the group of users who, *I'm certain*, will > > have problems with PG13's plan choice. I think the overloaded > > enable_hashagg option is a really nice compromise. We don't really > > have any other executor node type that has multiple GUCs controlling > > its behaviour, so I believe it would be nice to keep it that way. ... > It would seem merge join has almost the same complexities as the new > hash join code, since it can spill to disk doing sorts for merge joins, > and adjusting work_mem is the only way to control that spill to disk. I > don't remember anyone complaining about spills to disk during merge > join, so I am unclear why we would need such a control for hash join. It looks like merge join was new in 8.3. I don't think that's a good analogy, since the old behavior was still available with enable_mergejoin=off. I think a better analogy would be if we now changed sort nodes beneath merge join to use at most 0.5*work_mem, with no way of going back to using 1.0*work_mem. -- Justin
RE: PostgreSQL and big data - FDW
Hi Bruce In the following link: https://www.enterprisedb.com/blog/connecting-hadoop-and-edb-postgres-shrink-big-data-challenges We can see: "Support for various authentication methods (i.e. Kerberos, NOSASL, etc.)" So HDFS_FDW supports Kerberos authentication. How can we be sure of that? Could EDB make a clear statement on this point? If so, how do we implement this method? Is there any documentation on this subject? Thanks in advance. Best Regards Didier ROS didier@edf.fr Tel.: +33 6 49 51 11 88 -Original Message- From: br...@momjian.us [mailto:br...@momjian.us] Sent: Wednesday, June 24, 2020 11:13 To: ROS Didier Cc: pgsql-hackers@lists.postgresql.org Subject: Re: PostgreSQL and big data - FDW On Wed, Jun 24, 2020 at 09:05:30AM +, ROS Didier wrote: > Hi > > I would like to use a Foreign Data Wrapper (FDW) to connect to a > HADOOP cluster which uses KERBEROS authentication. > > is it possible to achieve this ? which FDW should be used ? Well, I would use the Hadoop FDW: https://github.com/EnterpriseDB/hdfs_fdw and it only supports these authentication methods: Authentication Support The FDW supports NOSASL and LDAP authentication modes. In order to use NOSASL do not specify any OPTIONS while creating user mapping. For LDAP username and password must be specified in OPTIONS while creating user mapping. Not every FDW supports every Postgres server authentication method. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
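For illustration, an LDAP-mode user mapping as described in the README text quoted above might look like this (server name, role name, and credentials are placeholders; the option names follow the hdfs_fdw documentation):

    CREATE USER MAPPING FOR app_user SERVER hdfs_server
        OPTIONS (username 'ldap_user', password 'secret');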
Re: min_safe_lsn column in pg_replication_slots view
On Wed, Jun 24, 2020 at 2:37 PM Fujii Masao wrote: > > On 2020/06/23 15:27, Amit Kapila wrote: > > > > Having a separate function for this seems like a good idea but can we > > consider displaying it in a view like pg_stat_replication_slots as we > > are discussing a nearby thread to have such a view for other things. > > I think ultimately this information is required to check whether some > > slot can be invalidated or not, so having it displayed along with > > other slot information might not be a bad idea. > > "the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)" > is the same value between all the replication slots. But you think it's better > to display that same value for every slots in the view? > > Or you're thinking to display the difference of that LSN value and > restart_lsn as Horiguchi-san suggested? > I see value in Horiguchi-San's proposal. IIUC, it will help DBAs/users know whether any particular slot will get invalidated soon. > That diff varies each replication slot, > so it seems ok to display it for every rows. > Yes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary
Bharath Rupireddy writes: > COPY command's FORMAT option allows only all lowercase csv, text or > binary, this is true because strcmp is being used while parsing these > values. This is nonsense, actually: regression=# create table foo (f1 int); CREATE TABLE regression=# copy foo from stdin (format CSV); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. As that shows, there's already a round of lowercasing done by the parser. The only way that strcasecmp in copy.c would be useful is if you wanted to accept things like copy foo from stdin (format "CSV"); I don't find that to be a terribly good idea. The normal implication of quoting is that it *prevents* case folding, so why should that happen anyway? More generally, though, why would we want to change this policy only here? I believe we're reasonably consistent about letting the parser do any required down-casing and then just checking keyword matches with strcmp. regards, tom lane
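Spelling out the behaviour described above with the same table foo (a quick sketch; the second statement is expected to be rejected because quoting suppresses the parser's case folding, so the strcmp comparison sees "CSV" rather than "csv"):

    COPY foo FROM STDIN (FORMAT Csv);    -- accepted: the unquoted value is down-cased by the parser
    COPY foo FROM STDIN (FORMAT "CSV");  -- rejected: quoting prevents case folding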
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-24, Kyotaro Horiguchi wrote: > In logical replication, a replication role is intended to be > accessible only to the GRANTed databases. On the other hand the same > role can create a dead copy of the whole cluster, including > non-granted databases. In other words -- essentially, if you grant replication access to a role only to a specific database, they can steal the whole cluster. I don't see what's so great about that, but apparently people like it. -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Assertion failure in pg_copy_logical_replication_slot()
On 2020-Jun-24, Fujii Masao wrote: > > I think the errcode is a bit bogus considering the new case. > > IMO ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is more appropriate. > > Agreed. So I updated the patch so this errcode is used instead. > Patch attached. LGTM. -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_resetwal --next-transaction-id may cause database failed to restart.
On 2020-Jun-24, movead...@highgo.ca wrote: > >Maybe a better answer is to have a new switch in postmaster that creates > >any needed files (incl. producing associated WAL etc); so you'd run > >pg_resetwal -x some-value > >postmaster --create-special-stuff > >then start your server and off you go. > > As shown in the document, it looks like to rule a safe input, so I think it's better > to rule it and add an option to focus write an unsafe value if necessary. ISTM that a reasonable compromise is that if you use -x (or -c, -m, -O) and the input value is outside the range supported by existing files, then it's a fatal error; unless you use --force, which turns it into just a warning. > >Now maybe this is too much complication for a mechanism that really > >isn't for general consumption anyway. I mean, if you're using > >pg_resetwal, you're already playing with fire. > Yes, that's true, I always heard the word "You'd better not use pg_walreset". > But the tool appear in PG code, it's better to improve it than do nothing. Sure. -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: min_safe_lsn column in pg_replication_slots view
On 2020-Jun-24, Fujii Masao wrote: > On 2020/06/24 8:39, Alvaro Herrera wrote: > > I think we should publish the value from wal_keep_segments separately > > from max_slot_wal_keep_size. ISTM that the user might decide to change > > or remove wal_keep_segments and be suddenly at risk of losing slots > > because of overlooking that it was wal_keep_segments, not > > max_slot_wal_keep_size, that was protecting them. > > You mean to have two functions that returns > > 1. "current WAL LSN - wal_keep_segments * 16MB" > 2. "current WAL LSN - max_slot_wal_keep_size" Hmm, but all the values there are easily findable. What would be the point in repeating it? Maybe we should disregard this line of thinking and go back to Horiguchi-san's original proposal, to wit use the "distance to breakage", as also supported now by Amit Kapila[1] (unless I misunderstand him). [1] https://postgr.es/m/CAA4eK1L2oJ7T1cESdc5w4J9L3Q_hhvWqTigdAXKfnsJy4=v...@mail.gmail.com -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
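As a rough illustration of the per-slot quantity under discussion, the amount of WAL each slot currently retains can already be computed with existing functions (this only shows the raw distance from the current write position back to each slot's restart_lsn, not the proposed "distance to breakage" column itself):

    SELECT slot_name,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;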
Re: PostgreSQL and big data - FDW
On Wed, Jun 24, 2020 at 6:09 PM ROS Didier wrote: > Hi Bruce > > In the following link: > https://www.enterprisedb.com/blog/connecting-hadoop-and-edb-postgres-shrink-big-data-challenges > We can see: > "Support for various authentication methods (i.e. Kerberos, NOSASL, etc.)" > > So HDFS_FDW supports Kerberos authentication. How can we be sure of that? > Could EDB make a clear statement on this point? > HDFS_FDW does not support Kerberos authentication. The sentence you pasted above is from the wish list (i.e., the TODO list); here is what it says: "Currently the HDFS_FDW only provides READ capabilities but EDB is planning the following additional functionality:" That functionality was not implemented. I think part of the confusion might be due to the formatting of the list in the blog. You can follow the README[1] of HDFS_FDW to get an idea of how to use it. [1] https://github.com/EnterpriseDB/hdfs_fdw/blob/master/README.md Regards, Jeevan
Re: extensible options syntax for replication parser?
On Sun, Jun 14, 2020 at 3:15 AM Fabien COELHO wrote: > > so instead I'd like to have a better way to do it. > > > Attached is v1 of a patch to refactor things so that parts of the > > BASE_BACKUP and CREATE_REPLICATION_SLOT are replaced with a flexible > > options syntax. > > Patch applies cleanly, however compilation fails on: > >repl_gram.y:271:106: error: expected ‘;’ before ‘}’ Oops. I'm surprised my compiler didn't complain. > Getting rid of "ident_or_keyword", some day, would be a relief. Actually, I think that particular thing is a sign that things are being done correctly. You need something like that if you have contexts where you want to treat keywords and non-keywords the same way, and it's generally good to have such places. In fact, this could probably be profitably used in more places in the replication grammar. > For boolean options, you are generating (EXPORT_SNAPSHOT TRUE) where > (EXPORT_SNAPSHOT) would do. True, but it doesn't seem to matter much one way or the other. I thought this way looked a little clearer. > Maybe allowing (!EXPORT_SNAPSHOT) for (FOO FALSE) would be nice, if it is > not allowed yet. That would also allow to get rid of FOO/NOFOO variants if > any for FOO & !FOO, so one keyword is enough for a concept. Well, the goal for this is not to need ANY new keywords for a new concept, but !FOO would be inconsistent with other places where we do similar things (e.g. EXPLAIN, VACUUM, COPY) so I don't think that's the way to go. > > There are some debatable decisions here, so I'd be happy to get some > > feedback on whether to go further with this, or less far, or maybe even > > just abandon the idea altogether. I doubt the last one is the right > > course, though: ISTM something's got to be done about the BASE_BACKUP > > case, at least. > > ISTM that it would be better to generalize the approach to all commands > which accept options, so that the syntax is homogeneous. As a general principle, sure, but it's always debatable how far to take things in particular cases. For instance, in the cases of EXPLAIN, VACUUM, and COPY, the relation name is given as a dedicated piece of syntax, not an option. It could be given as an option, but since it's mandatory and important, we didn't. In the case of COPY, the source file is also specified via dedicated syntax, rather than an option. So we have to make the same kinds of decisions here. For example, for CREATE_REPLICATION_SLOT, one could argue that PHYSICAL and LOGICAL should be moved to the extensible options list instead of being kept as separate syntax. However, that seems somewhat inconsistent with the long-standing syntax for START_REPLICATION, which already does use extensible options: START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ ( option_name [option_value] [, ... ] ) ] On balance I am comfortable with what the patch does, but other people might have a different take. > Just wondering: ISTM that the patch implies that dumping a v14 db > generates the new syntax, which makes sense. Now I see 4 use cases wrt to > version. > > # sourcetarget comment > 1 v < 14v < 14 probably the dump would use one of the older version > 2 v < 14v >= 14 upgrade > 3 v >= 14 v < 14 downgrade: oops, the output uses the new syntax > 4 v >= 14 v >= 14 ok > > Both cross version usages may be legitimate. In particular, 3 (oops, > hardware issue, I have to move the db to a server where pg has not been > upgraded) seems not possible because the generated syntax uses the new > approach. 
> Should/could there be some option to tell "please generate vXXX > syntax" to allow that? I don't think dumping a DB is really affected by any of this. AFAIK, replication commands aren't used in pg_dump output. It just affects pg_basebackup and the server, and you'll notice that I have taken pains to allow the server to continue to accept the current format, and to allow pg_basebackup to generate that format when talking to an older server. Thanks for the review. v2 attached, hopefully fixing the compilation issue you mentioned. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company v2-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch Description: Binary data
Re: Avoiding hash join batch explosions with extreme skew and weird stats
Hi Tomas, On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra wrote: > > Now, a couple comments / questions about the code. > > > nodeHash.c > -- > > > 1) MultiExecPrivateHash says this > >/* > * Not subject to skew optimization, so either insert normally > * or save to batch file if it belongs to another stripe > */ > > I wonder what it means to "belong to another stripe". I understand what > that means for batches, which are identified by batchno computed from > the hash value. But I thought "stripes" are just work_mem-sized pieces > of a batch, so I don't quite understand this. Especially when the code > does not actually check "which stripe" the row belongs to. I have to concur that "stripe" did inspire a RAID vibe when I heard it, but it seemed to be a better name than what it replaces > 3) I'm a bit puzzled about this formula in ExecHashIncreaseNumBatches > >childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | > hashtable->curbatch; > > and also about this comment > >/* > * TODO: what to do about tuples that don't go to the child > * batch or stay in the current batch? (this is why we are > * counting tuples to child and curbatch with two diff > * variables in case the tuples go to a batch that isn't the > * child) > */ >if (batchno == childbatch) > childbatch_outgoing_tuples++; > > I thought each old batch is split into two new ones, and the tuples > either stay in the current one, or are moved to the new one - which I > presume is the childbatch, although I haven't tried to decode that > formula. So where else could the tuple go, as the comment tried to > suggest? True, every old batch is split into two new ones, if you only consider tuples coming from the batch file that _still belong in there_. i.e. there are tuples in the old batch file that belong to a future batch. As an example, if the current nbatch = 8, and we want to expand to nbatch = 16, (old) batch 1 will split into (new) batch 1 and batch 9, but it can already contain tuples that need to go into (current) batches 3, 5, and 7 (soon-to-be batches 11, 13, and 15). Cheers, Jesse
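A simplified sketch of that arithmetic, treating the batch number as hash % nbatch (the real code takes low-order bits of the hash value, which is equivalent when nbatch is a power of two):

    SELECT h AS hash,
           h % 8  AS batch_at_nbatch_8,
           h % 16 AS batch_at_nbatch_16
    FROM generate_series(0, 31) AS h
    WHERE h % 8 = 1;   -- tuples currently filed under batch 1
    -- batch_at_nbatch_16 is always 1 or 9: old batch 1 splits into new batches 1 and 9.
    -- Likewise, a batch file written back when nbatch was still 2 can hold tuples whose
    -- batch_at_nbatch_8 is 3, 5 or 7 -- the "future batch" tuples described above.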
Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary
On Wed, Jun 24, 2020 at 10:27 AM Tom Lane wrote: > More generally, though, why would we want to change this policy only > here? I believe we're reasonably consistent about letting the parser > do any required down-casing and then just checking keyword matches > with strcmp. I've had the feeling in the past that our use of pg_strcasecmp() was a bit random. Looking through the output of 'git grep pg_strcasecmp', it seems like we typically don't use it on option names, but sometimes we use it on option values. For instance, DefineCollation() uses pg_strcasecmp() on the collproviderstr, and DefineType() uses it on storageEl; and also, not to be overlooked, defGetBoolean() uses it when matching true/false/on/off, which probably affects a lot of places. On the other hand, ExplainQuery() matches the format using plain old strcmp(), despite indirectly using pg_strcasecmp() for the Boolean parameters. So I guess I'm not really convinced that there is all that much consistency here. Mind you, I'm not sure whether or not anything really needs to be changed, or exactly what ought to be done. I'm just making the observation that it might not be as consistent as you may think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Greetings, * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: > On 2020-Jun-24, Kyotaro Horiguchi wrote: > > > In logical replication, a replication role is intended to be > > accessible only to the GRANTed databases. On the other hand the same > > role can create a dead copy of the whole cluster, including > > non-granted databases. > > In other words -- essentially, if you grant replication access to a role > only to a specific database, they can steal the whole cluster. > > I don't see what's so great about that, but apparently people like it. Sure, people who aren't in charge of security I'm sure like the ease of use. Doesn't mean it makes sense or that we should be supporting that. What we should have is a way to allow administrators to configure a system for exactly what they want to allow, and it doesn't seem like we're doing that today and therefore we should fix it. This isn't the only area we have that issue in. Thanks, Stephen signature.asc Description: PGP signature
Re: PostgreSQL and big data - FDW
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Wed, Jun 24, 2020 at 09:05:30AM +, ROS Didier wrote: > > I would like to use a Foreign Data Wrapper (FDW) to connect to a HADOOP > > cluster > > which uses KERBEROS authentication. Sadly, not really. > > is it possible to achieve this ? which FDW should be used ? > > Well, I would use the Hadoop FDW: > > https://github.com/EnterpriseDB/hdfs_fdw > > and it only supports these authentication methods: > > Authentication Support > > The FDW supports NOSASL and LDAP authentication modes. In order to use > NOSASL do not specify any OPTIONS while creating user mapping. For LDAP > username and password must be specified in OPTIONS while creating user > mapping. > > Not every FDW supports every Postgres server authentication method. That isn't really the issue here, the problem is really that the GSSAPI support in PG today doesn't support credential delegation- if it did, then the HDFS FDW (and the postgres FDW) could be easily extended to leverage those delegated credentials to connect. That's been something that's been on my personal todo list of things to work on but unfortunately I've not, as yet, had time to go implement. I don't actually think it would be very hard- if someone writes it, I'd definitely review it. Thanks, Stephen signature.asc Description: PGP signature
Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary
Robert Haas writes: > On Wed, Jun 24, 2020 at 10:27 AM Tom Lane wrote: >> More generally, though, why would we want to change this policy only >> here? I believe we're reasonably consistent about letting the parser >> do any required down-casing and then just checking keyword matches >> with strcmp. > ... Mind you, I'm not sure whether or not anything really needs to be > changed, or exactly what ought to be done. I'm just making the > observation that it might not be as consistent as you may think. Yeah, I'm sure there are a few inconsistencies. We previously made a pass to get rid of pg_strcasecmp for anything that had been through the parser's downcasing (commit fb8697b31) but I wouldn't be surprised if that missed a few cases, or if new ones have snuck in. Anyway, "don't use pg_strcasecmp unnecessarily" was definitely the agreed-to policy as of Jan 2018. My vague recollection is that there are a few exceptions (defGetBoolean may well be one of them) where pg_strcasecmp still seemed necessary because the input might not have come through the parser in some usages. regards, tom lane
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-24, Stephen Frost wrote: > Doesn't mean it makes sense or that we should be supporting that. What > we should have is a way to allow administrators to configure a system > for exactly what they want to allow, and it doesn't seem like we're > doing that today and therefore we should fix it. This isn't the only > area we have that issue in. The way to do that, for the case under discussion, is to reject using a logical replication connection for physical replication commands. -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 07:38:43AM -0500, Justin Pryzby wrote: > On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote: > > It would seem merge join has almost the same complexities as the new > > hash join code, since it can spill to disk doing sorts for merge joins, > > and adjusting work_mem is the only way to control that spill to disk. I > > don't remember anyone complaining about spills to disk during merge > > join, so I am unclear why we would need such a control for hash join. > > It looks like merge join was new in 8.3. I don't think that's a good > analogy, > since the old behavior was still available with enable_mergejoin=off. Uh, we don't guarantee backward compatibility in the optimizer. You can turn off hashagg if you want. That doesn't get you to PG 13 behavior, but we don't guarantee that. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
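For reference, the escape hatch Bruce refers to is the existing planner GUC, e.g. at session level:

    SET enable_hashagg = off;   -- planner stops considering hash aggregation
    -- ... run the affected query ...
    RESET enable_hashagg;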
Re: Default setting for enable_hashagg_disk
On Thu, Jun 25, 2020 at 12:24:29AM +1200, David Rowley wrote: > On Wed, 24 Jun 2020 at 21:06, Bruce Momjian wrote: > > I > > don't remember anyone complaining about spills to disk during merge > > join, so I am unclear why we would need a such control for hash join. > > Hash aggregate, you mean? The reason is that upgrading to PG13 can Yes, sorry. > cause a performance regression for an underestimated ndistinct on the > GROUP BY clause and cause hash aggregate to spill to disk where it > previously did everything in RAM. Sure, that behaviour was never > what we wanted to happen, Jeff has fixed that now, but the fact > remains that this does happen in the real world quite often and people > often get away with it, likey because work_mem is generally set to > some very conservative value. Of course, there's also a bunch of > people that have been bitten by OOM due to this too. The "neverspill" > wouldn't be for those people. Certainly, it's possible that we just > tell these people to increase work_mem for this query, that way they > can set it to something reasonable and still get spilling if it's > really needed to save them from OOM, but the problem there is that > it's not very easy to go and set work_mem for a certain query. Well, my point is that merge join works that way, and no one has needed a knob to avoid mergejoin if it is going to spill to disk. If they are adjusting work_mem to prevent spill of merge join, they can do the same for hash agg. We just need to document this in the release notes. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
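A sketch of the work_mem approach for a single query (the value and the query below are purely illustrative):

    BEGIN;
    SET LOCAL work_mem = '256MB';          -- scoped to this transaction only
    SELECT customer_id, count(*)           -- hypothetical aggregate query
    FROM orders
    GROUP BY customer_id;
    COMMIT;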
Re: xid wraparound danger due to INDEX_CLEANUP false
On Fri, May 22, 2020 at 4:40 PM Peter Geoghegan wrote: > On Mon, May 18, 2020 at 7:32 PM Masahiko Sawada > wrote: > > I've attached WIP patch for HEAD. With this patch, the core pass > > index_cleanup to bulkdelete and vacuumcleanup callbacks so that they > > can make decision whether run vacuum or not. > > > > If the direction of this patch seems good, I'll change the patch so > > that we pass something information to these callbacks indicating > > whether this vacuum is anti-wraparound vacuum. This is necessary > > because it's enough to invoke index cleanup before XID wraparound as > > per discussion so far. > > It. seems like the right direction to me. Robert? Sorry, I'm so far behind on my email. Argh. I think, especially on the blog post you linked, that we should aim to have INDEX_CLEANUP OFF mode do the minimum possible amount of work while still keeping us safe against transaction ID wraparound. So if, for example, it's desirable but not imperative for BRIN to resummarize, then it's OK normally but should be skipped with INDEX_CLEANUP OFF. I find the patch itself confusing and the comments inadequate, especially the changes to lazy_scan_heap(). Before the INDEX_CLEANUP patch went into the tree, LVRelStats had a member hasindex indicating whether or not the table had any indexes. The patch changed that member to useindex, indicating whether or not we were going to do index vacuuming; thus, it would be false if either there were no indexes or if we were going to ignore them. This patch redefines useindex to mean whether or not the table has any indexes, but without renaming the variable. There's also really no comments anywhere in the vacuumlazy.c explaining the overall scheme here; what are we actually doing? Apparently, what we're really doing here is that even when INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples. That seems sad, but if it's what we have to do then it at least needs comments explaining it. As for the btree portion of the change, I expect you understand this better than I do, so I'm reluctant to stick my neck out, but it seems that what the patch does is force index cleanup to happen even when INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and the btree index has at least 1 recyclable page. My first reaction is to wonder whether that doesn't nerf this feature into oblivion. Isn't it likely that an index that is being vacuumed for wraparound will have a recyclable page someplace? If the presence of that 1 recyclable page causes the "help, I'm about to run out of XIDs, please do the least possible work" flag to become a no-op, I don't think users are going to be too happy with that. Maybe I am misunderstanding. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Default setting for enable_hashagg_disk
Justin Pryzby writes: > On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote: >> It would seem merge join has almost the same complexities as the new >> hash join code, since it can spill to disk doing sorts for merge joins, >> and adjusting work_mem is the only way to control that spill to disk. I >> don't remember anyone complaining about spills to disk during merge >> join, so I am unclear why we would need such a control for hash join. > It looks like merge join was new in 8.3. I don't think that's a good > analogy, > since the old behavior was still available with enable_mergejoin=off. Uh, what? A look into our git history shows immediately that nodeMergejoin.c has been there since the initial code import in 1996. I tend to agree with Bruce that it's not very obvious that we need another GUC knob here ... especially not one as ugly as this. I'm especially against the "neverspill" option, because that makes a single GUC that affects both the planner and executor independently. If we feel we need something to let people have the v12 behavior back, let's have
(1) enable_hashagg on/off --- controls planner, same as it ever was
(2) enable_hashagg_spill on/off --- controls executor by disabling spill
But I'm not really convinced that we need (2). regards, tom lane
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Fri, Jun 19, 2020 at 12:04 AM Justin Pryzby wrote: > > On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote: > > On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby wrote: > > > On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote: > > > > I've pushed the fist part of this patch series - I've reorganized it a > > > > > > I scanned through this again post-commit. Find attached some suggestions. > > > > > > Shouldn't non-text explain output always show both disk *and* mem, > > > including > > > zeros ? > > > > Could you give more context on this? Is there a standard to follow? > > Regular sort nodes only ever report one type, so there's not a good > > parallel there. > > The change I proposed was like: > > @@ -2829,7 +2829,6 @@ > show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo, > > ExplainPropertyList("Sort Methods Used", methodNames, es); > > - if (groupInfo->maxMemorySpaceUsed > 0) > { > longavgSpace = > groupInfo->totalMemorySpaceUsed / groupInfo->groupCount; > const char *spaceTypeName; > ... > - if (groupInfo->maxDiskSpaceUsed > 0) > { > ... > > To show in non-text format *both* disk and memory space used, even if zero. > > I still think that's what's desirable. I'm of the opposite opinion. I believe showing both unnecessarily is confusing. > If it's important to show *whether* a sort space was used, then I think there > should be a boolean, or an int 0/1. But I don't think it's actually needed, > since someone parsing the explain output could just check > |if _dict['Peak Sort Space Used'] > 0: ... > the same as you're doing, without having to write some variation on: > |if 'Peak Sort Space Used' in _dict and _dict['Peak Sort Space Used'] > 0: ... I think it's desirable for code to be explicitly about the type having been used rather than implicitly assuming it based on 0/non-zero values. James
Re: should libpq also require TLSv1.2 by default?
On 2020-06-24 10:33, Daniel Gustafsson wrote: In PG13, we raised the server-side default of ssl_min_protocol_version to TLSv1.2. We also added a connection setting named ssl_min_protocol_version to libpq. But AFAICT, the default value of the libpq setting is empty, so any protocol version will be accepted. Is this what we wanted? Should we raise the default in libpq as well? This was discussed [0] when the connection settings were introduced, and the concensus was to leave them alone [1] to allow for example a new pg_dump to work against an old server. Re-reading the thread I think the argument still holds, but I was about to respond "yes, let's do this" before refreshing my memory. Perhaps we should add a comment explaining this along the lines of the attached? [0] https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org [1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us ISTM that these discussions went through the same questions and arguments that were made regarding the server-side change but arrived at a different conclusion. So I suggest to reconsider this so that we don't ship with contradictory results. That doesn't necessarily mean that we have to make a change, but we should make sure our rationale is sound. Note that all OpenSSL versions that do not support TLSv1.2 also do not support TLSv1.1. So by saying, in effect, that TLSv1.2 is too new to require, we are saying that we need to keep supporting TLSv1.0 -- which is heavily deprecated. Also note that the first OpenSSL version with support for TLSv1.2 shipped on March 14, 2012. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
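For anyone who wants the stricter client-side behaviour today regardless of what the default ends up being, the libpq option discussed above can be set per connection (host and database names below are placeholders):

    psql "host=db.example.com dbname=app sslmode=require ssl_min_protocol_version=TLSv1.2"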
Re: Allow CURRENT_ROLE in GRANTED BY
On 2020-06-24 10:12, Vik Fearing wrote: On 6/24/20 8:35 AM, Peter Eisentraut wrote: I was checking some loose ends in SQL conformance, when I noticed: We support GRANT role ... GRANTED BY CURRENT_USER, but we don't support CURRENT_ROLE in that place, even though in PostgreSQL they are equivalent. Here is a trivial patch to add that. The only thing that isn't dead-obvious about this patch is the commit message says "[PATCH 1/2]". What is in the other part? Hehe. The second patch is some in-progress work to add the GRANTED BY clause to the regular GRANT command. More on that perhaps at a later date. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: should libpq also require TLSv1.2 by default?
On Wed, Jun 24, 2020 at 07:57:31PM +0200, Peter Eisentraut wrote: > On 2020-06-24 10:33, Daniel Gustafsson wrote: > > > In PG13, we raised the server-side default of ssl_min_protocol_version to > > > TLSv1.2. We also added a connection setting named > > > ssl_min_protocol_version to libpq. But AFAICT, the default value of the > > > libpq setting is empty, so any protocol version will be accepted. Is > > > this what we wanted? Should we raise the default in libpq as well? > > > > This was discussed [0] when the connection settings were introduced, and the > > concensus was to leave them alone [1] to allow for example a new pg_dump to > > work against an old server. Re-reading the thread I think the argument > > still > > holds, but I was about to respond "yes, let's do this" before refreshing my > > memory. Perhaps we should add a comment explaining this along the lines of > > the > > attached? > > > > [0] > > https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org > > [1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us > > ISTM that these discussions went through the same questions and arguments > that were made regarding the server-side change but arrived at a different > conclusion. So I suggest to reconsider this so that we don't ship with > contradictory results. > > That doesn't necessarily mean that we have to make a change, but we should > make sure our rationale is sound. > > Note that all OpenSSL versions that do not support TLSv1.2 also do not > support TLSv1.1. So by saying, in effect, that TLSv1.2 is too new to > require, we are saying that we need to keep supporting TLSv1.0 -- which is > heavily deprecated. Also note that the first OpenSSL version with support > for TLSv1.2 shipped on March 14, 2012. I do think mismatched SSL requirements between client and server is confusing, though I can see the back-version pg_dump being an issue. Maybe a clear error message would help here. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Review for GetWALAvailability()
Thanks for those corrections. I have pushed this. I think all problems Masao-san reported have been dealt with, so we're done here. Thanks! -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote: Justin Pryzby writes: On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote: It would seem merge join has almost the same complexities as the new hash join code, since it can spill to disk doing sorts for merge joins, and adjusting work_mem is the only way to control that spill to disk. I don't remember anyone complaining about spills to disk during merge join, so I am unclear why we would need a such control for hash join. It loooks like merge join was new in 8.3. I don't think that's a good analogy, since the old behavior was still available with enable_mergejoin=off. Uh, what? A look into our git history shows immediately that nodeMergejoin.c has been there since the initial code import in 1996. I tend to agree with Bruce that it's not very obvious that we need another GUC knob here ... especially not one as ugly as this. I'm especially against the "neverspill" option, because that makes a single GUC that affects both the planner and executor independently. If we feel we need something to let people have the v12 behavior back, let's have (1) enable_hashagg on/off --- controls planner, same as it ever was (2) enable_hashagg_spill on/off --- controls executor by disabling spill What if a user specifies enable_hashagg = on enable_hashagg_spill = off and the estimates say the hashagg would need to spill to disk. Should that disable the query (in which case the second GUC affects both executor and planner) or run it (in which case we knowingly ignore work_mem, which seems wrong). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
Tomas Vondra writes: > On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote: >> If we feel we need something to let people have the v12 behavior >> back, let's have >> (1) enable_hashagg on/off --- controls planner, same as it ever was >> (2) enable_hashagg_spill on/off --- controls executor by disabling spill > What if a user specifies >enable_hashagg = on >enable_hashagg_spill = off It would probably be reasonable for the planner to behave as it did pre-v13, that is not choose hashagg if it estimates that work_mem would be exceeded. (So, okay, that means enable_hashagg_spill affects both planner and executor ... but ISTM it's just one behavior not two.) regards, tom lane
Re: Default setting for enable_hashagg_disk
Hi, On 2020-06-24 14:11:57 +1200, David Rowley wrote: > 1. Statistics underestimation can cause hashagg to be selected. The > executor will spill to disk in PG13. Users may find performance > suffers as previously the query may have just overshot work_mem > without causing any OOM issues. Their I/O performance might be > terrible. > 2. We might now choose to hash aggregate where pre PG13, we didn't > choose that because the hash table was estimated to be bigger than > work_mem. Hash agg might not be the best plan for the job. > For #1. We know users are forced to run smaller work_mems than they > might like as they need to allow for that random moment where all > backends happen to be doing that 5-way hash join all at the same time. > It seems reasonable that someone might want the old behaviour. They > may well be sitting on a timebomb that's about to OOM, but it would be > sad if someone's upgrade to PG13 was blocked on this, especially if > it's just due to some query that runs once per month but needs to > perform quickly. I'm quite concerned about this one. I think this isn't just going to hit when the planner mis-estimates ndistinct, but also when transition values use a bit more space. We'll now start spilling in cases the < v13 planner did everything right. That's great for cases where we'd otherwise OOM, but for a lot of other cases where there actually is more than sufficient RAM to overrun work_mem by a single-digit factor, it can cause a pretty massive increase of IO over < v13. FWIW, my gut feeling is that we'll end up have to separate the "execution time" spilling from using plain work mem, because it'll trigger spilling too often. E.g. if the plan isn't expected to spill, only spill at 10 x work_mem or something like that. Or we'll need better management of temp file data when there's plenty memory available. > For #2. This seems like a very legitimate requirement to me. If a > user is unhappy that PG13 now hashaggs where before it sorted and > group aggregated, but they're unhappy, not because there's some issue > with hashagg spilling, but because that causes the node above the agg > to becomes a Hash Join rather than a Merge Join and that's bad for > some existing reason. Our planner doing the wrong thing based on > either; lack of, inaccurate or out-of-date statistics is not Jeff's > fault. Having the ability to switch off a certain planner feature is > just following along with what we do today for many other node types. This one concerns me a bit less, fwiw. There's a lot more "pressure" in the planner to choose hash agg or sorted agg, compared to e.g. a bunch of aggregate states taking up a bit more space (can't estimate that at all for ma. > As for GUCs to try to help the group of users who, *I'm certain*, will > have problems with PG13's plan choice. I think the overloaded > enable_hashagg option is a really nice compromise. We don't really > have any other executor node type that has multiple GUCs controlling > its behaviour, so I believe it would be nice to keep it that way. > > How about: > > enable_hashagg = "on" -- enables hashagg allowing it to freely spill > to disk as it pleases. > enable_hashagg = "trynospill" -- Planner will only choose hash_agg if > it thinks it won't spill (pre PG13 planner behaviour) > enable_hashagg = "neverspill" -- executor will *never* spill to disk > and can still OOM (NOT RECOMMENDED, but does give pre PG13 planner and > executor behaviour) > enable_hashagg = "off" -- planner does not consider hash agg, ever. 
> Same as what PG12 did for this setting. > > Now, it's a bit weird to have "neverspill" as this is controlling > what's done in the executor from a planner GUC. Likely we can just > work around that by having a new "allowhashspill" bool field in the > "Agg" struct that's set by the planner, say during createplan that > controls if nodeAgg.c is allowed to spill or not. That'll also allow > PREPAREd plans to continue to do what they had planned to do already. > > The thing I like about doing it this way is that: > > a) it does not add any new GUCs > b) it semi hides the weird values that we really wish nobody would > ever have to set in a GUC that people have become used it just > allowing the values "on" and "off". > > The thing I don't quite like about this idea is: > a) I wish the planner was perfect and we didn't need to do this. > b) It's a bit weird to overload a GUC that has a very booleanish name > to not be bool. > > However, I also think it's pretty lightweight to support this. I > imagine a dozen lines of docs and likely about half a dozen lines per > GUC option in the planner. That'd work for me, but I honestly don't particularly care about the specific naming, as long as we provide users an escape hatch from the increased amount of IO. Greetings, Andres Freund
Re: Default setting for enable_hashagg_disk
Hi, On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote: > Well, my point is that merge join works that way, and no one has needed > a knob to avoid mergejoin if it is going to spill to disk. If they are > adjusting work_mem to prevent spill of merge join, they can do the same > for hash agg. We just need to document this in the release notes. I don't think this is comparable. For starters, the IO indirectly triggered by mergejoin actually leads to plenty of people just straight out disabling it. For lots of workloads there's never a valid reason to use a mergejoin (and often the planner will never choose one). Secondly, the planner has better information about estimating the memory usage for the to-be-sorted data than it has about the size of the transition values. And lastly, there's a difference between a long-existing cause of bad IO behaviour and one that suddenly kicks in after a major version upgrade, to which there's no escape hatch (it's rarely realistic to disable hash aggs, in contrast to merge joins). Greetings, Andres Freund
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On Wed, Jun 24, 2020 at 1:06 PM Alvaro Herrera wrote: > On 2020-Jun-24, Stephen Frost wrote: > > Doesn't mean it makes sense or that we should be supporting that. What > > we should have is a way to allow administrators to configure a system > > for exactly what they want to allow, and it doesn't seem like we're > > doing that today and therefore we should fix it. This isn't the only > > area we have that issue in. > > The way to do that, for the case under discussion, is to reject using a > logical replication connection for physical replication commands. Reading over this discussion, I see basically three arguments: 1. Andres argues that being able to execute physical replication commands from the same connection as SQL queries is useful, and that people may be relying on it, and that we shouldn't break it without need. 2. Fujii Masao argues that the current situation makes it impossible to write a pg_hba.conf rule that disallows all physical replication connections, because people could get around it by using a logical replication connection for physical replication. 3. Various people argue that it's only accidental that physical replication on a replication=database connection ever worked at all, and therefore we ought to block it. I find argument #1 most convincing, #2 less convincing, and #3 least convincing. In my view, the problem with argument #3 is that just because some feature combination was unintentional doesn't mean it's unuseful or unused. As for #2, suppose someone were to propose a design for logical replication that allowed it to take place without a database connection, so that it could be done with just a regular replication connection. Such a feature would create the same problem Fujii Masao mentions here, but it seems inconceivable that we would for that reason reject it; we make decisions about features based on their usefulness, not their knock-on effects on pg_hba.conf rules. We can always add new kinds of access control restrictions if they are needed; that is a better approach than removing features so that the existing pg_hba.conf facilities can be used to accomplish some particular goal. So really I think this turns on #1: is it plausible that people are using this feature, however inadvertent it may be, and is it potentially useful? I don't see that anybody's made an argument against either of those things. Unless someone can do so, I think we shouldn't disable this. That having been said, I think that the fact that you can execute SQL queries in replication=database connections is horrifying. I really hate that feature. I think it's a bad design, and a bad implementation, and a recipe for tons of bugs. But, blocking physical replication commands on such connections isn't going to solve any of that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 3:14 PM Andres Freund wrote: > FWIW, my gut feeling is that we'll end up have to separate the > "execution time" spilling from using plain work mem, because it'll > trigger spilling too often. E.g. if the plan isn't expected to spill, > only spill at 10 x work_mem or something like that. Or we'll need > better management of temp file data when there's plenty memory > available. So, I don't think we can wire in a constant like 10x. That's really unprincipled and I think it's a bad idea. What we could do, though, is replace the existing Boolean-valued GUC with a new GUC that controls the size at which the aggregate spills. The default could be -1, meaning work_mem, but a user could configure a larger value if desired (presumably, we would just treat a value smaller than work_mem as work_mem, and document the same). I think that's actually pretty appealing. Separating the memory we plan to use from the memory we're willing to use before spilling seems like a good idea in general, and I think we should probably also do it in other places - like sorts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Default setting for enable_hashagg_disk
Hi, On 2020-06-24 14:40:50 -0400, Tom Lane wrote: > Tomas Vondra writes: > > On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote: > >> If we feel we need something to let people have the v12 behavior > >> back, let's have > >> (1) enable_hashagg on/off --- controls planner, same as it ever was > >> (2) enable_hashagg_spill on/off --- controls executor by disabling spill > > > What if a user specifies > >enable_hashagg = on > >enable_hashagg_spill = off > > It would probably be reasonable for the planner to behave as it did > pre-v13, that is not choose hashagg if it estimates that work_mem > would be exceeded. (So, okay, that means enable_hashagg_spill > affects both planner and executor ... but ISTM it's just one > behavior not two.) There's two different reasons for spilling in the executor right now: 1) The planner estimated that we'd need to spill, and that turns out to be true. There seems no reason to not spill in that case (as long as it's enabled/chosen in the planner). 2) The planner didn't think we'd need to spill, but we end up using more than work_mem memory. nodeAgg.c already treats those separately:

void
hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
                    Size *mem_limit, uint64 *ngroups_limit,
                    int *num_partitions)
{
    int         npartitions;
    Size        partition_mem;

    /* if not expected to spill, use all of work_mem */
    if (input_groups * hashentrysize < work_mem * 1024L)
    {
        if (num_partitions != NULL)
            *num_partitions = 0;
        *mem_limit = work_mem * 1024L;
        *ngroups_limit = *mem_limit / hashentrysize;
        return;
    }

We can't sensibly disable spilling when chosen at plan time, because that'd lead to *more* OOMS than in v12. ISTM that we should have one option that controls whether 1) is done, and one that controls whether 2) is done. Even if the option for 2 is off, we still should spill when the option for 1) chooses a spilling plan. I don't think it makes sense for one of those options to influence the other implicitly. So maybe enable_hashagg_spilling_plan for 1) and hashagg_spill_on_exhaust for 2). Greetings, Andres Freund
Re: Why forbid "INSERT INTO t () VALUES ();"
Hallo Thomas,

>> INSERT INTO t() VALUES ();
>>
>> This is forbidden by postgres, and also sqlite. Is there any good
>> reason why this should be the case?
>
> Maybe because
>   insert into t default values;
> exists (and is standard SQL if I'm not mistaken)

That's a nice alternative I did not notice. Well, not an alternative as
the other one does not work.

I'm still unclear why it would be forbidden though, it seems logical to
try that, whereas the working one is quite away from the usual syntax.

-- Fabien.
Re: Default setting for enable_hashagg_disk
Hi, On 2020-06-24 15:28:47 -0400, Robert Haas wrote: > On Wed, Jun 24, 2020 at 3:14 PM Andres Freund wrote: > > FWIW, my gut feeling is that we'll end up have to separate the > > "execution time" spilling from using plain work mem, because it'll > > trigger spilling too often. E.g. if the plan isn't expected to spill, > > only spill at 10 x work_mem or something like that. Or we'll need > > better management of temp file data when there's plenty memory > > available. > > So, I don't think we can wire in a constant like 10x. That's really > unprincipled and I think it's a bad idea. What we could do, though, is > replace the existing Boolean-valued GUC with a new GUC that controls > the size at which the aggregate spills. The default could be -1, > meaning work_mem, but a user could configure a larger value if desired > (presumably, we would just treat a value smaller than work_mem as > work_mem, and document the same). To be clear, I wasn't actually thinking of hard-coding 10x, but having a config option that specifies a factor of work_mem. A factor seems better because it'll work reasonably for different values of work_mem, whereas a concrete size wouldn't. > I think that's actually pretty appealing. Separating the memory we > plan to use from the memory we're willing to use before spilling seems > like a good idea in general, and I think we should probably also do it > in other places - like sorts. Indeed. And then perhaps we could eventually add some reporting / monitoring infrastructure for the cases where plan time and execution time memory estimate/usage widely differs. Greetings, Andres Freund
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-24, Robert Haas wrote:

> So really I think this turns on #1: is it plausible
> that people are using this feature, however inadvertent it may be, and
> is it potentially useful? I don't see that anybody's made an argument
> against either of those things. Unless someone can do so, I think we
> shouldn't disable this.

People (specifically the jdbc driver) *are* using this feature in this
way, but they didn't realize they were doing it. It was an accident and
they didn't notice.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On Wed, 24 Jun 2020 at 15:41, Alvaro Herrera wrote:

> On 2020-Jun-24, Robert Haas wrote:
>
> > So really I think this turns on #1: is it plausible
> > that people are using this feature, however inadvertent it may be, and
> > is it potentially useful? I don't see that anybody's made an argument
> > against either of those things. Unless someone can do so, I think we
> > shouldn't disable this.
>
> People (specifically the jdbc driver) *are* using this feature in this
> way, but they didn't realize they were doing it. It was an accident and
> they didn't notice.

I'm not sure we are using it so much as we accidentally did it that way.
It would be trivial to fix. That said, I think we should fix the
security hole this opens and leave the functionality.

Dave
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Hi, On 2020-06-24 15:41:14 -0400, Alvaro Herrera wrote: > On 2020-Jun-24, Robert Haas wrote: > > > So really I think this turns on #1: is it plausible > > that people are using this feature, however inadvertent it may be, and > > is it potentially useful? I don't see that anybody's made an argument > > against either of those things. Unless someone can do so, I think we > > shouldn't disable this. > > People (specifically the jdbc driver) *are* using this feature in this > way, but they didn't realize they were doing it. It was an accident and > they didn't notice. As I said before, I've utilized being able to do both over a single connection (among others to initialize a logical replica using a base backup). And I've seen at least one other codebase (developed without my input) doing so. I really don't understand how you just dismiss this without any sort of actual argument. Yes, those uses can be fixed to reconnect with a different replication parameter, but that's code that needs to be adjusted and it requires adjustments to pg_hba.conf etc. And obviously you'd lock out older versions of jdbc, and possibly other drivers. Obviously we should allow more granular permissions here, I don't think anybody is arguing against that. Greetings, Andres Freund
Re: Why forbid "INSERT INTO t () VALUES ();"
Fabien COELHO writes: >>> INSERT INTO t() VALUES (); > I'm still unclear why it would be forbidden though, it seems logical to > try that, whereas the working one is quite away from the usual syntax. It's forbidden because the SQL standard forbids it. We allow zero-column syntaxes in some other places where SQL forbids them, but that's only because there is no reasonable alternative. In this case, there's a perfectly good, standards-compliant alternative. So why encourage people to write unportable code? regards, tom lane
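For concreteness, assuming a table t whose columns all have usable defaults, the two forms under discussion are:

    INSERT INTO t () VALUES ();     -- rejected by PostgreSQL (and by the SQL standard)
    INSERT INTO t DEFAULT VALUES;   -- standard-compliant equivalent that PostgreSQL accepts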
Re: xid wraparound danger due to INDEX_CLEANUP false
On Wed, Jun 24, 2020 at 10:21 AM Robert Haas wrote: > Sorry, I'm so far behind on my email. Argh. That's okay. > I think, especially on the blog post you linked, that we should aim to > have INDEX_CLEANUP OFF mode do the minimum possible amount of work > while still keeping us safe against transaction ID wraparound. So if, > for example, it's desirable but not imperative for BRIN to > resummarize, then it's OK normally but should be skipped with > INDEX_CLEANUP OFF. I agree that that's very important. > Apparently, what we're really doing here is that even when > INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples. > That seems sad, but if it's what we have to do then it at least needs > comments explaining it. +1. Though I think that AMs should technically have the right to consider it advisory. > As for the btree portion of the change, I expect you understand this > better than I do, so I'm reluctant to stick my neck out, but it seems > that what the patch does is force index cleanup to happen even when > INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and > the btree index has at least 1 recyclable page. My first reaction is > to wonder whether that doesn't nerf this feature into oblivion. Isn't > it likely that an index that is being vacuumed for wraparound will > have a recyclable page someplace? If the presence of that 1 recyclable > page causes the "help, I'm about to run out of XIDs, please do the > least possible work" flag to become a no-op, I don't think users are > going to be too happy with that. Maybe I am misunderstanding. I have mixed feelings about it myself. These are valid concerns. This is a problem to the extent that the user has a table that they'd like to use INDEX_CLEANUP with, that has indexes that regularly require cleanup due to page deletion. ISTM, then, that the really relevant high level design questions for this patch are: 1. How often is that likely to happen in The Real World™? 2. If we fail to do cleanup and leak already-deleted pages, how bad is that? ( Both in general, and in the worst case.) I'll hazard a guess for 1: I think that it might not come up that often. Page deletion is often something that we hardly ever need. And, unlike some DB systems, we only do it when pages are fully empty (which, as it turns out, isn't necessarily better than our simple approach [1]). I tend to think it's unlikely to happen in cases where INDEX_CLEANUP is used, because those are cases that also must not have that much index churn to begin with. Then there's question 2. The intuition behind the approach from Sawada-san's patch was that allowing wraparound here feels wrong -- it should be in the AM's hands. However, it's not like I can point to some ironclad guarantee about not leaking deleted pages that existed before the INDEX_CLEANUP feature. We know that the FSM is not crash safe, and that's that. Is this really all that different? Maybe it is, but it seems like a quantitative difference to me. I'm kind of arguing against myself even as I try to advance my original argument. If workloads that use INDEX_CLEANUP don't need to delete and recycle pages in any case, then why should we care that those same workloads might leak pages on account of the wraparound hazard? There's nothing to leak! Maybe some compromise is possible, at least for the backbranches. Perhaps nbtree is told about the requirements in a bit more detail than we'd imagined. It's not just an INDEX_CLEANUP boolean. It could be an enum or something. Not sure, though. 
The real reason that I want to push the mechanism down into index access methods is because that design is clearly better overall; it just so happens that the specific way in which we currently defer recycling in nbtree makes very little sense, so it's harder to see the big picture. The xid-cleanup design that we have was approximately the easiest way to do it, so that's what we got. We should figure out a way to recycle the pages at something close to the earliest possible opportunity, without having to perform a full scan on the index relation within btvacuumscan(). Maybe we can use the autovacuum work item mechanism for that. For indexes that get VACUUMed once a week on average, it makes zero sense to wait another week to recycle the pages that get deleted, in a staggered fashion. It should be possible to recycle the pages a minute or two after VACUUM proper finishes, with extra work that's proportionate to the number of deleted pages. This is still conservative. I am currently very busy with an unrelated B-Tree prototype, so I might not get around to it this year. Maybe Sawada-san can think about this? Also, handling of the vacuum_cleanup_index_scale_factor stuff (added to Postgres 11 by commit 857f9c36) should live next to code for the confusingly-similar INDEX_CLEANUP stuff (added to Postgres 12 by commit a96c41feec6), on general principle. I think that that organization is a lot easier to follow. [1] http
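The switch being discussed is exposed as a VACUUM option (and a matching reloption) since PostgreSQL 12; a minimal example, with my_table standing in for any real table:

    -- ask for the least possible index work during an (anti-wraparound) vacuum
    VACUUM (FREEZE, INDEX_CLEANUP OFF) my_table;

    -- or make it the table's default
    ALTER TABLE my_table SET (vacuum_index_cleanup = off);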
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 12:36:24PM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-06-24 15:28:47 -0400, Robert Haas wrote:
> > On Wed, Jun 24, 2020 at 3:14 PM Andres Freund wrote:
> > > FWIW, my gut feeling is that we'll end up have to separate the
> > > "execution time" spilling from using plain work mem, because it'll
> > > trigger spilling too often. E.g. if the plan isn't expected to spill,
> > > only spill at 10 x work_mem or something like that. Or we'll need
> > > better management of temp file data when there's plenty memory
> > > available.
> >
> > So, I don't think we can wire in a constant like 10x. That's really
> > unprincipled and I think it's a bad idea. What we could do, though, is
> > replace the existing Boolean-valued GUC with a new GUC that controls
> > the size at which the aggregate spills. The default could be -1,
> > meaning work_mem, but a user could configure a larger value if desired
> > (presumably, we would just treat a value smaller than work_mem as
> > work_mem, and document the same).
>
> To be clear, I wasn't actually thinking of hard-coding 10x, but having
> a config option that specifies a factor of work_mem. A factor seems
> better because it'll work reasonably for different values of work_mem,
> whereas a concrete size wouldn't.

I'm not quite convinced we need/should introduce a new memory limit.
It's true keeping it equal to work_mem by default makes this less of an
issue, but it's still another moving part the users will need to learn
how to use.

But if we do introduce a new limit, I very much think it should be a
plain limit, not a factor. That just makes it even more complicated,
and we don't have any such limit yet.

> > I think that's actually pretty appealing. Separating the memory we
> > plan to use from the memory we're willing to use before spilling seems
> > like a good idea in general, and I think we should probably also do it
> > in other places - like sorts.
>
> Indeed. And then perhaps we could eventually add some reporting /
> monitoring infrastructure for the cases where plan time and execution
> time memory estimate/usage widely differs.

I wouldn't mind something like that in general - not just for hashagg,
but for various other nodes.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow CURRENT_ROLE in GRANTED BY
On 2020-Jun-24, Peter Eisentraut wrote:

> I was checking some loose ends in SQL conformance, when I noticed: We
> support GRANT role ... GRANTED BY CURRENT_USER, but we don't support
> CURRENT_ROLE in that place, even though in PostgreSQL they are
> equivalent. Here is a trivial patch to add that.

Hmm, since this adds to RoleSpec, this change makes every place that
uses that production also take CURRENT_ROLE, so we'd need to document
in all those places. For example, alter_role.sgml, create_schema.sgml,
etc.

This also affects role_list (but maybe the docs for those are already
vague enough -- eg. ALTER INDEX .. OWNED BY only says "role_name" with
no further explanation, even though it does take "current_user".)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
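The syntax the patch adds, with placeholder role names:

    GRANT some_role TO other_role GRANTED BY CURRENT_USER;  -- already accepted
    GRANT some_role TO other_role GRANTED BY CURRENT_ROLE;  -- accepted with the proposed patch;
                                                            -- equivalent in PostgreSQL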
Strange behavior with polygon and NaN
Hi hackers,

While working with Chris Hajas on merging Postgres 12 with Greenplum
Database we stumbled upon the following strange behavior in the geometry
type polygon:

-- >8
CREATE TEMP TABLE foo (p point);
CREATE INDEX ON foo USING gist(p);
INSERT INTO foo VALUES ('0,0'), ('1,1'), ('NaN,NaN');
SELECT $q$ SELECT * FROM foo WHERE p <@ polygon '(0,0), (0, 100), (100, 100), (100, 0)' $q$ AS qry \gset
BEGIN;
SAVEPOINT yolo;
SET LOCAL enable_seqscan TO off;
:qry;
ROLLBACK TO SAVEPOINT yolo;
SET LOCAL enable_indexscan TO off;
SET LOCAL enable_bitmapscan TO off;
:qry;
-- 8<

If you run the above repro SQL in HEAD (and 12, and likely all older
versions), you get the following output:

CREATE TABLE
CREATE INDEX
INSERT 0 3
BEGIN
SAVEPOINT
SET
     p
-----------
 (0,0)
 (1,1)
(2 rows)

ROLLBACK
SET
SET
     p
-----------
 (0,0)
 (1,1)
 (NaN,NaN)
(3 rows)

At first glance, you'd think this is the gist AM's bad, but on second
thought, something else is strange here. The following query returns
true:

SELECT point '(NaN, NaN)' <@ polygon '(0,0), (0, 100), (100, 100), (100, 0)'

The above behavior of the "contained in" operator is surprising, and
it's probably not what the GiST AM is expecting. I took a look at
point_inside() in geo_ops.c, and it doesn't seem well equipped to handle
NaN. Similarly ill-equipped is dist_ppoly_internal() which underlies the
distance operator for polygon. It gives the following interesting
output:

SELECT *, c <-> polygon '(0,0),(0,100),(100,100),(100,0)' as distance
FROM (
  SELECT circle(point(100 * i, 'NaN'), 50) AS c
  FROM generate_series(-2, 4) i
) t(c)
ORDER BY 2;

        c        | distance
-----------------+----------
 <(-200,NaN),50> |        0
 <(-100,NaN),50> |        0
 <(0,NaN),50>    |        0
 <(100,NaN),50>  |        0
 <(200,NaN),50>  |      NaN
 <(300,NaN),50>  |      NaN
 <(400,NaN),50>  |      NaN
(7 rows)

Should they all be NaN? Am I alone in thinking the index is right but
the operators are wrong? Or should we call the indexes wrong here?

Cheers,
Jesse and Chris
Re: should libpq also require TLSv1.2 by default?
> On 24 Jun 2020, at 19:57, Peter Eisentraut wrote:
>
> On 2020-06-24 10:33, Daniel Gustafsson wrote:
>>> In PG13, we raised the server-side default of ssl_min_protocol_version to
>>> TLSv1.2. We also added a connection setting named ssl_min_protocol_version
>>> to libpq. But AFAICT, the default value of the libpq setting is empty, so
>>> any protocol version will be accepted. Is this what we wanted? Should we
>>> raise the default in libpq as well?
>>
>> This was discussed [0] when the connection settings were introduced, and the
>> concensus was to leave them alone [1] to allow for example a new pg_dump to
>> work against an old server. Re-reading the thread I think the argument still
>> holds, but I was about to respond "yes, let's do this" before refreshing my
>> memory. Perhaps we should add a comment explaining this along the lines of
>> the attached?
>>
>> [0] https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org
>> [1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us
>
> ISTM that these discussions went through the same questions and arguments
> that were made regarding the server-side change but arrived at a different
> conclusion. So I suggest to reconsider this so that we don't ship with
> contradictory results.

I don't think anyone argues against safe defaults for communication
between upgraded clients and upgraded servers. That being said; out of
the box, an upgraded client *will* use TLSv1.2 when connecting to an
upgraded server due to the server's default requirements (assuming the
server hasn't been reconfigured with a lower TLS version, but since
we're talking defaults we have to assume that).

The problem comes when an updated client needs to talk to an old server
which hasn't been upgraded and which uses an older OpenSSL version. If
we set TLSv1.2 as the minimum client version, the user will have to
amend a connection string which used to work when talking to a server
which hasn't changed. If we don't raise the default, a user who wants
nothing lower than TLSv1.2 will have to amend the connection string.

> That doesn't necessarily mean that we have to make a change, but we should
> make sure our rationale is sound.

Totally agree. FWIW, I still vote for keeping it at 1.0 to not break
connections to old servers, since upgraded/new servers will enforce 1.2
anyway.

As mentioned elsewhere in the thread, maybe this is also something which
can be done more easily if we improve the error reporting? Right now
it's fairly cryptic IMO.

> Note that all OpenSSL versions that do not support TLSv1.2 also do not
> support TLSv1.1. So by saying, in effect, that TLSv1.2 is too new to
> require, we are saying that we need to keep supporting TLSv1.0 -- which is
> heavily deprecated. Also note that the first OpenSSL version with support
> for TLSv1.2 shipped on March 14, 2012.

Correct, this being the 1.0.1 release which is referred to.

cheers ./daniel
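To illustrate the trade-off in connection-string terms (host and database names are placeholders; ssl_min_protocol_version is the libpq parameter added in PostgreSQL 13):

    # a client that insists on TLSv1.2 today has to say so explicitly
    psql "host=db.example.com dbname=postgres sslmode=require ssl_min_protocol_version=TLSv1.2"

    # if the libpq default were raised instead, reaching an old TLSv1-only server
    # would require relaxing it explicitly
    psql "host=legacy.example.com dbname=postgres sslmode=require ssl_min_protocol_version=TLSv1"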
Re: Why forbid "INSERT INTO t () VALUES ();"
Tom Lane writes: > Fabien COELHO writes: INSERT INTO t() VALUES (); > >> I'm still unclear why it would be forbidden though, it seems logical to >> try that, whereas the working one is quite away from the usual syntax. > > It's forbidden because the SQL standard forbids it. > > We allow zero-column syntaxes in some other places where SQL forbids > them, but that's only because there is no reasonable alternative. > In this case, there's a perfectly good, standards-compliant alternative. > So why encourage people to write unportable code? FWIW, MySQL (and MariaDB) only support INSERT INTO t () VALUES (), not DEFAULT VALUES. We have added syntax for MySQL compatibility in the past, e.g. the CONCAT() function. - ilmari -- "A disappointingly low fraction of the human race is, at any given time, on fire." - Stig Sandbeck Mathisen
Re: Why forbid "INSERT INTO t () VALUES ();"
On Wed, Jun 24, 2020 at 3:31 PM Dagfinn Ilmari Mannsåker wrote: > FWIW, MySQL (and MariaDB) only support INSERT INTO t () VALUES (), not > DEFAULT VALUES. We have added syntax for MySQL compatibility in the > past, e.g. the CONCAT() function. > I don't see the similarities. IIUC there isn't a standard mandated function that provides the behavior that the concat function does. There is an operator but the treatment of NULL is different. So for concat we decided to add a custom function modelled on another DB's custom function. Adding custom syntax here when an identically behaving standard syntax already exists has considerably less going for it. I would say that accepting the compatibility hit while being the ones that are standard-compliant is in line with project values. David J.
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 11:02:10PM +0200, Tomas Vondra wrote: > > Indeed. And then perhaps we could eventually add some reporting / > > monitoring infrastructure for the cases where plan time and execution > > time memory estimate/usage widely differs. > > > > I wouldn't mind something like that in general - not just for hashagg, > but for various other nodes. Well, other than worrying about problems with pre-13 queries, how is this different from any other spill to disk when we exceed work_mem, like sorts for merge join. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 12:19:00PM -0700, Andres Freund wrote: > Hi, > > On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote: > > Well, my point is that merge join works that way, and no one has needed > > a knob to avoid mergejoin if it is going to spill to disk. If they are > > adjusting work_mem to prevent spill of merge join, they can do the same > > for hash agg. We just need to document this in the release notes. > > I don't think this is comparable. For starters, the IO indirectly > triggered by mergejoin actually leads to plenty people just straight out > disabling it. For lots of workloads there's never a valid reason to use > a mergejoin (and often the planner will never choose one). Secondly, the > planner has better information about estimating the memory usage for the > to-be-sorted data than it has about the size of the transition > values. And lastly, there's a difference between a long existing cause > for bad IO behaviour and one that's suddenly kicks in after a major > version upgrade, to which there's no escape hatch (it's rarely realistic > to disable hash aggs, in contrast to merge joins). Well, this sounds like an issue of degree, rather than kind. It sure sounds like "ignore work_mem for this join type, but not the other". -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 07:18:10PM -0400, Bruce Momjian wrote: > On Wed, Jun 24, 2020 at 12:19:00PM -0700, Andres Freund wrote: > > Hi, > > > > On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote: > > > Well, my point is that merge join works that way, and no one has needed > > > a knob to avoid mergejoin if it is going to spill to disk. If they are > > > adjusting work_mem to prevent spill of merge join, they can do the same > > > for hash agg. We just need to document this in the release notes. > > > > I don't think this is comparable. For starters, the IO indirectly > > triggered by mergejoin actually leads to plenty people just straight out > > disabling it. For lots of workloads there's never a valid reason to use > > a mergejoin (and often the planner will never choose one). Secondly, the > > planner has better information about estimating the memory usage for the > > to-be-sorted data than it has about the size of the transition > > values. And lastly, there's a difference between a long existing cause > > for bad IO behaviour and one that's suddenly kicks in after a major > > version upgrade, to which there's no escape hatch (it's rarely realistic > > to disable hash aggs, in contrast to merge joins). > > Well, this sounds like an issue of degree, rather than kind. It sure > sounds like "ignore work_mem for this join type, but not the other". I think my main point is that work_mem was not being honored for hash-agg before, but now that PG 13 can do it, we are again allowing work_mem not to apply in certain cases. I am wondering if our hard limit for work_mem is the issue, and we should make that more flexible for all uses. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
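For reference, work_mem is already adjustable per session, per role, or per function, though within a query it remains a hard per-node budget rather than a per-query one; for example (role name is a placeholder):

    SET work_mem = '256MB';
    ALTER ROLE reporting_user SET work_mem = '512MB';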
Re: hashagg slowdown due to spill changes
On Tue, Jun 23, 2020 at 10:06 AM Andres Freund wrote: > Hi, > > On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote: > > On Mon, Jun 22, 2020 at 9:02 PM Andres Freund > wrote: > > > It's not this patch's fault, but none, really none, of this stuff > should > > > be in the executor. > > > > > > > > Were you thinking it could be done in grouping_planner() and then the > > bitmaps could be saved in the PlannedStmt? > > Or would you have to wait until query_planner()? Or are you imagining > > somewhere else entirely? > > I haven't thought about it in too much detail, but I would say > create_agg_plan() et al. I guess there's some argument to be made to do > it in setrefs.c, because we already do convert_combining_aggrefs() there > (but I don't like that much). > > There's no reason to do it before we actually decided on one specific > path, so doing it earlier than create_plan() seems unnecessary. And > having it in agg specific code seems better than putting it into global > routines. > > There's probably an argument for having a bit more shared code between > create_agg_plan(), create_group_plan() and > create_groupingsets_plan(). But even just adding a new extract_*_cols() > call to each of those would probably be ok. > > So, my summary of this point in the context of the other discussion upthread is: Planner should extract the columns that hashagg will need later during planning. Planner should not have HashAgg/MixedAgg nodes request smaller targetlists from their children with CP_SMALL_TLIST to avoid unneeded projection overhead. Also, even this extraction should only be done when the number of groups is large enough to suspect a spill. So, I wrote a patch that extracts the columns the same way as in ExecInitAgg but in create_agg_plan() and it doesn't work because we haven't called set_plan_references(). Then, I wrote a patch that does this in set_upper_references(), and it seems to work. I've attached that one. It is basically Jeff's patch (based somewhat on my patch) which extracts the columns in ExecInitAgg but I moved the functions over to setrefs.c and gave them a different name. It's not very elegant. I shoved it in at the end of set_upper_references(), but I think there should be a nice way to do it while setting the references for each var instead of walking over the nodes again. Also, I think that the bitmapsets for the colnos should maybe be put somewhere less prominent (than in the main Agg plan node?), since they are only used in one small place. I tried putting both bitmaps in an array of two bitmaps in the Agg node (since there will always be two) to make it look a bit neater, but it was pretty confusing and error prone to remember which one was aggregated and which one was unaggregated. Note that I didn't do anything with costing like only extracting the columns if there are a lot of groups. Also, I didn't revert the CP_SMALL_TLIST change in create_agg_plan() or create_groupingsets_plan(). Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST in create_hashjoin_plan() for the inner side sub-tree and the outer side one if there are multiple batches. I wondered what was different about that vs hashagg (i.e. why it is okay to do that there). 
-- 
Melanie Plageman

From 0b40ad205cf8fc8a83b8c65f5f18502a51c47370 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Wed, 24 Jun 2020 17:07:14 -0700
Subject: [PATCH v1] Move extracting columns for hashagg to planner

Find the columns the executor will need to know about for hashagg and
bifurcate them into aggregated and unaggregated column bitmapsets. Save
them in the plan tree for use during execution. This is done after
set_plan_refs() so all the references are correct.
---
 src/backend/executor/nodeAgg.c       | 117 +--
 src/backend/nodes/copyfuncs.c        |   2 +
 src/backend/nodes/outfuncs.c         |   2 +
 src/backend/nodes/readfuncs.c        |   2 +
 src/backend/optimizer/plan/setrefs.c |  69
 src/include/nodes/execnodes.h        |   8 +-
 src/include/nodes/plannodes.h        |   2 +
 7 files changed, 140 insertions(+), 62 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a20554ae65..2f340e4031 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -359,6 +359,14 @@ typedef struct HashAggBatch
 	int64		input_tuples;	/* number of tuples in this batch */
 } HashAggBatch;
 
+/* used to find referenced colnos */
+typedef struct FindColsContext
+{
+	bool		is_aggref;		/* is under an aggref */
+	Bitmapset  *aggregated;		/* column references under an aggref */
+	Bitmapset  *unaggregated;	/* other column references */
+} FindColsContext;
+
 static void select_current_set(AggState *aggstate, int setno, bool is_hash);
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -391,8 +399,6 @@ static void finalize_aggregates(AggSta
Re: [PATCH] Initial progress reporting for COPY command
On Tue, Jun 23, 2020 at 4:45 PM Tomas Vondra wrote:
>
> >> Anyway if you would like to make this view more user-friendly, I can add
> >> that. Just ping me.
> >
> > I felt we could add pg_size_pretty to make the view more user friendly.
>
> Please no. That'd make processing of the data (say, computing progress
> as processed/total) impossible. It's easy to add pg_size_pretty if you
> want it, it's impossible to undo it. I don't see a single pg_size_pretty
> call in system_views.sql.

I thought of including pg_size_pretty as there was no total_bytes to
compare with, but I'm ok without it too, as there is an option for the
user to always include it on the client side like "SELECT
pg_size_pretty(file_bytes_processed) from pg_stat_progress_copy;" if
required.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: should libpq also require TLSv1.2 by default?
On Thu, Jun 25, 2020 at 12:30:03AM +0200, Daniel Gustafsson wrote: > I don't think anyone argues against safe defaults for communication between > upgraded clients and upgraded servers. That being said; out of the box, an > upgraded client *will* use TLSv1.2 when connecting to a upgraded server due to > the server defaults requirements (assuming the server hasn't been reconfigured > with a lower TLS version, but since we're talking defaults we have to assume > that). My take here is to let things as they are for libpq. pg_dump is a very good argument, because we allow backward compatibility with a newer version of the tool, not upward compatibility. > The problem comes when an updated client needs to talk to an old server which > hasn't been upgraded and which use an older OpenSSL version. If we set > TLSv1.2 > as the minimum client version, the user will have to amend a connection string > which used to work when talking to a server which hasn't changed. If we don't > raise the default, a user to wants nothing lower than TLSv1.2 will have to > amend the connection string. Yeah, and I would not be surprised to see cases where people complain to us about that when attempting to reach one of their old boxes, breaking some stuff they have been relying on for years by forcing the addition of a tls_min_server_protocol in the connection string. It is a more important step that we enforce TLSv1.2 on the server side actually, and libpq just follows up automatically with that. > As mentioned elsewhere in the thread, maybe this is also something which can > be > done more easily if we improve the error reporting? Right now it's fairly > cryptic IMO. This part may be tricky to get right I think, because the error comes directly from OpenSSL when negotiating the protocol used between the client and the server, like "no protocols available" or such. -- Michael signature.asc Description: PGP signature
Keep elog(ERROR) and ereport(ERROR) calls in the cold path
Back in [1] I experimented with a patch to coax compilers to build all
elog/ereport calls that were >= ERROR into a cold path away from the
function raising the error. At the time, I really just wanted to test
how much of a speedup we could get by doing this and ended up just
writing up a patch that basically changed all elog(ERROR) calls from:

if (<condition>)
{
    <...>;
    elog(ERROR, "...");
}

to add an unlikely() and become:

if (unlikely(<condition>))
{
    <...>;
    elog(ERROR, "...");
}

Per the report in [1] I did see some pretty good gains in performance
from doing this. The problem was that, at the time, I couldn't figure
out a way to do this without an invasive patch that changed the code in
the thousands of elog/ereport calls.

In the attached, I've come up with a way that works. Basically I've just
added a function named errstart_cold() that is attributed with
__attribute__((cold)), which will hint to the compiler to keep branches
which call this function in a cold path. To make the compiler properly
mark just >= ERROR calls as cold, and only when the elevel is a
constant, I modified the ereport_domain macro to do:

    if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
        errstart_cold(elevel, domain) : \
        errstart(elevel, domain)) \

I see no reason why the compiler shouldn't always fold that constant
expression at compile-time and select the correct version of the
function for the job. (I also tried using __builtin_choose_expr() but
GCC complained when the elevel was not constant, even with the
__builtin_constant_p() test in the condition)

I sampled a few .s files to inspect what code had changed. Looking at
mcxt.s, fmgr.s and xlog.s -- the first two because I found in [1] that
elogs were being done from those files quite often, and xlog.s because
it's pretty big. As far as I could see, GCC correctly moved all the
error reporting stuff where the elevel was a constant and >= ERROR into
the cold path and left the lower-level or non-const elevel calls alone.

For clang, I didn't see any changes in the .s files. I suspect that they
might have a few smarts in there and see the __builtin_unreachable()
call and assume the path is cold already based on that. That was with
clang 10. Perhaps older versions are not as smart.

Benchmarking:

For benchmarking, I've not done a huge amount to test the impacts of
this change. However, I can say that I am seeing some fairly good
improvements. There seem to be some small improvements to execution
speed using TPCH-Q1 and also some good results from a pgbench -S test.

For TPCH-Q1:

Master:
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5272.630 ms
latency average = 5258.610 ms
latency average = 5250.871 ms

Master + elog_ereport_attribute_cold.patch
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5182.761 ms
latency average = 5194.851 ms
latency average = 5183.128 ms

Which is about a 1.42% increase in performance. That's not exactly
groundbreaking, but pretty useful to have if that happens to apply
across the board for execution performance.

For pgbench -S:

My results were a bit noisier than the TPCH test, but the results I
obtained did show about a 10% increase in performance:

Master:
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 25245.903255 (excluding connections establishing)
tps = 26144.454208 (excluding connections establishing)
tps = 25931.850518 (excluding connections establishing)

Master + elog_ereport_attribute_cold.patch
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 28351.480631 (excluding connections establishing)
tps = 27763.656557 (excluding connections establishing)
tps = 28896.427929 (excluding connections establishing)

It would be useful if someone with some server-grade Intel hardware
could run a few tests on this. The above results are all from AMD
hardware.

I've attached the patch for this. I'll add it to the July 'fest.

David

[1] https://www.postgresql.org/message-id/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68=3awbh8cjwtp9zxeo...@mail.gmail.com

elog_ereport_attribute_cold.patch
Description: Binary data
Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary
On Wed, Jun 24, 2020 at 12:55:22PM -0400, Tom Lane wrote: > Yeah, I'm sure there are a few inconsistencies. We previously made a > pass to get rid of pg_strcasecmp for anything that had been through > the parser's downcasing (commit fb8697b31) but I wouldn't be surprised > if that missed a few cases, or if new ones have snuck in. Anyway, > "don't use pg_strcasecmp unnecessarily" was definitely the agreed-to > policy as of Jan 2018. 0d8c9c1 has introduced some in parse_basebackup_options() for the new manifest option, and fe30e7e for AlterType(), no? > My vague recollection is that there are a few exceptions (defGetBoolean > may well be one of them) where pg_strcasecmp still seemed necessary > because the input might not have come through the parser in some usages. Yep, there were a couple of exceptions. What was done at this time was a case-by-case lookup to check what came only from the parser. -- Michael signature.asc Description: PGP signature
Re: Assertion failure in pg_copy_logical_replication_slot()
On 2020/06/24 23:58, Alvaro Herrera wrote: On 2020-Jun-24, Fujii Masao wrote: I think the errcode is a bit bogus considering the new case. IMO ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is more appropriate. Agreed. So I updated the patch so this errcode is used instead. Patch attached. LGTM. Thanks! Pushed. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: should libpq also require TLSv1.2 by default?
Michael Paquier writes: > On Thu, Jun 25, 2020 at 12:30:03AM +0200, Daniel Gustafsson wrote: >> As mentioned elsewhere in the thread, maybe this is also something which can >> be >> done more easily if we improve the error reporting? Right now it's fairly >> cryptic IMO. > This part may be tricky to get right I think, because the error comes > directly from OpenSSL when negotiating the protocol used between the > client and the server, like "no protocols available" or such. Can we do something comparable to the backend's HINT protocol, where we add on a comment that's only mostly-likely to be right? regards, tom lane
Re: Review for GetWALAvailability()
On 2020/06/25 3:27, Alvaro Herrera wrote:
> Thanks for those corrections. I have pushed this. I think all problems
> Masao-san reported have been dealt with, so we're done here.

Sorry for my late reply here...
Thanks for committing the patch and improving the feature!

	/*
	 * Find the oldest extant segment file. We get 1 until checkpoint removes
	 * the first WAL segment file since startup, which causes the status being
	 * wrong under certain abnormal conditions but that doesn't actually harm.
	 */
	oldestSeg = XLogGetLastRemovedSegno() + 1;

I see the point of the above comment, but this can cause wal_status to be
changed from "lost" to "unreserved" after the server restart. Isn't this
really confusing? At least it seems better to document that behavior.

Or if we *can ensure* that the slot with invalidated_at set always means
"lost" slot, we can judge that wal_status is "lost" without using fragile
XLogGetLastRemovedSegno(). Thought?

Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
value even after the restart?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
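The wal_status column under discussion is visible in the pg_replication_slots view (PostgreSQL 13), so the transition described above can be observed directly:

    SELECT slot_name, restart_lsn, wal_status
      FROM pg_replication_slots;
    -- wal_status is one of: reserved, extended, unreserved, lost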
Re: Review for GetWALAvailability()
On 2020-Jun-25, Fujii Masao wrote:

> 	/*
> 	 * Find the oldest extant segment file. We get 1 until checkpoint removes
> 	 * the first WAL segment file since startup, which causes the status being
> 	 * wrong under certain abnormal conditions but that doesn't actually harm.
> 	 */
> 	oldestSeg = XLogGetLastRemovedSegno() + 1;
>
> I see the point of the above comment, but this can cause wal_status to be
> changed from "lost" to "unreserved" after the server restart. Isn't this
> really confusing? At least it seems better to document that behavior.

Hmm.

> Or if we *can ensure* that the slot with invalidated_at set always means
> "lost" slot, we can judge that wal_status is "lost" without using fragile
> XLogGetLastRemovedSegno(). Thought?

Hmm, this sounds compelling -- I think it just means we need to ensure
we reset invalidated_at to zero if the slot's restart_lsn is set to a
correct position afterwards. I don't think we have any operation that
does that, so it should be safe -- hopefully I didn't overlook anything?
Neither copy nor advance seem to work with a slot that has invalid
restart_lsn.

> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
> value even after the restart?

This seems more work to implement.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: should libpq also require TLSv1.2 by default?
On Wed, Jun 24, 2020 at 10:50:39PM -0400, Tom Lane wrote: > Can we do something comparable to the backend's HINT protocol, where > we add on a comment that's only mostly-likely to be right? OpenSSL publishes its error codes as of openssl/sslerr.h, and it looks like the two error codes we would need to worry about are SSL_R_UNSUPPORTED_PROTOCOL and SSL_R_NO_PROTOCOLS_AVAILABLE. So we could for example amend open_client_SSL() when negotiating the SSL connection in libpq with error messages or hints that help better than the current state of things, but that also means an extra maintenance on our side to make sure that we keep in sync with new error codes coming from the OpenSSL world. -- Michael signature.asc Description: PGP signature
Re: Why forbid "INSERT INTO t () VALUES ();"
Hello Tom,

>>>> INSERT INTO t() VALUES ();

>> I'm still unclear why it would be forbidden though, it seems logical to
>> try that, whereas the working one is quite away from the usual syntax.
>
> It's forbidden because the SQL standard forbids it.

Ok, that is definitely a reason. I'm not sure it is a good reason, though.

Why would the standard forbid it? From the language design point of view,
it is basically having a syntax for lists which would not work for empty
lists, or a syntax for strings which would not work for empty strings.

It also means that if for some reason someone wants to insert several such
all-default rows, they have to repeat the insert, as "VALUES (), ();"
would not work, so it is also losing a corner-corner case capability
without obvious reason.

> We allow zero-column syntaxes in some other places where SQL forbids
> them,

Then forbidding it there just adds awkwardness: the same thing works in
one place but not in another. That does not help users.

> but that's only because there is no reasonable alternative.
> In this case, there's a perfectly good, standards-compliant alternative.
> So why encourage people to write unportable code?

I doubt that people look at the (costly) standard when writing corner case
queries, they just try something logical, as I did. As some other
databases accept it, and if it is already allowed elsewhere in pg,
encouraging portability is not the main issue here. I'd rather have logic
and uniformity across commands.

If I'm annoyed enough to send a patch some day, would you veto it because
it departs from the standard?

Anyway, thanks for the answer!

-- Fabien.
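For the multi-row case mentioned above, one workaround PostgreSQL does accept is writing DEFAULT for each column inside VALUES, as long as the table has at least one column (the column name c below is a placeholder):

    -- insert three all-default rows in one statement
    INSERT INTO t (c) VALUES (DEFAULT), (DEFAULT), (DEFAULT);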
Re: Review for GetWALAvailability()
On 2020/06/25 12:57, Alvaro Herrera wrote:
> On 2020-Jun-25, Fujii Masao wrote:
>
>> 	/*
>> 	 * Find the oldest extant segment file. We get 1 until checkpoint removes
>> 	 * the first WAL segment file since startup, which causes the status being
>> 	 * wrong under certain abnormal conditions but that doesn't actually harm.
>> 	 */
>> 	oldestSeg = XLogGetLastRemovedSegno() + 1;
>>
>> I see the point of the above comment, but this can cause wal_status to be
>> changed from "lost" to "unreserved" after the server restart. Isn't this
>> really confusing? At least it seems better to document that behavior.
>
> Hmm.
>
>> Or if we *can ensure* that the slot with invalidated_at set always means
>> "lost" slot, we can judge that wal_status is "lost" without using fragile
>> XLogGetLastRemovedSegno(). Thought?
>
> Hmm, this sounds compelling -- I think it just means we need to ensure
> we reset invalidated_at to zero if the slot's restart_lsn is set to a
> correct position afterwards.

Yes.

> I don't think we have any operation that does that, so it should be
> safe -- hopefully I didn't overlook anything?

We need to call ReplicationSlotMarkDirty() and ReplicationSlotSave()
just after setting invalidated_at and restart_lsn in
InvalidateObsoleteReplicationSlots()? Otherwise, restart_lsn can go back
to the previous value after the restart.

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e8761f3a18..5584e5dd2c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1229,6 +1229,13 @@ restart:
 		s->data.invalidated_at = s->data.restart_lsn;
 		s->data.restart_lsn = InvalidXLogRecPtr;
 		SpinLockRelease(&s->mutex);
+
+		/*
+		 * Save this invalidated slot to disk, to ensure that the slot
+		 * is still invalid even after the server restart.
+		 */
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
 		ReplicationSlotRelease();
 
 		/* if we did anything, start from scratch */

Maybe we don't need to do this if the slot is temporary?

> Neither copy nor advance seem to work with a slot that has invalid
> restart_lsn.
>
>> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
>> value even after the restart?
>
> This seems more work to implement.

Yes.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: archive status ".ready" files may be created too early
Hello, Matsumura-san.

I agree that the WAL writer is not the place to notify segments. And the
direction you suggested would work.

At Fri, 19 Jun 2020 10:18:34 +, "matsumura@fujitsu.com" wrote in
> 1. Description in primary side
>
> [Basic problem]
> A process flushing WAL record doesn't know whether the flushed RecPtr is
> EndRecPtr of cross-segment-boundary WAL record or not because only process
> inserting the WAL record knows it and it never memorizes the information to
> anywhere.
>
> [Basic concept of the patch in primary]
> A process inserting a cross-segment-boundary WAL record memorizes its EndRecPtr
> (I call it CrossBoundaryEndRecPtr) to a new structure in XLogCtl.
> A flushing process creates .ready (Later, I call it just 'notify'.) against
> a segment that is previous one including CrossBoundaryEndRecPtr only when its
> flushed RecPtr is equal or greater than the CrossBoundaryEndRecPtr.
...
> See also the definition of XLogCtl, XLOGShmemSize(), and XLOGShmemInit() in
> my patch.

I think we don't need most of that shmem stuff. XLogWrite is called
after the WAL buffer is filled up to the requested position. So when it
crosses a segment boundary we know that all past cross-segment-boundary
records are stable. That means all we need to remember is only the
position of the latest cross-boundary record.

> * Action of inserting process
> A inserting process memorie its CrossBoundaryEndRecPtr to CrossBoundaryEndRecPtr
> array element calculated by XLogRecPtrToBufIdx with its CrossBoundaryEndRecPtr.
> If the WAL record crosses many segments, only element against last segment
> including the EndRecPtr is set and others are not set.
> See also CopyXLogRecordToWAL() in my patch.

If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
every time a record is written, most of which is wasteful.
XLogInsertRecord already has a code block executed only at every page
boundary.

> * Action of flushing process
> Overview has been already written as the follwing.
> A flushing process creates .ready (Later, I call it just 'notify'.) against
> a segment that is previous one including CrossBoundaryEndRecPtr only when its
> flushed RecPtr is equal or greater than the CrossBoundaryEndRecPtr.
>
> An additional detail is as the following. The flushing process may notify
> many segments if the record crosses many segments, so the process memorizes
> latest notified segment number to latestArchiveNotifiedSegNo in XLogCtl.
> The process notifies from latestArchiveNotifiedSegNo + 1 to
> flushing segment number - 1.
>
> And latestArchiveNotifiedSegNo is set to EndOfLog after Startup process exits
> replay-loop. Standby also set same timing (= before promoting).
>
> Mutual exlusion about latestArchiveNotifiedSegNo is not required because
> the timing of accessing has been already included in WALWriteLock critical
> section.

Looks reasonable.

> 2. Description in standby side
>
> * Who notifies?
> walreceiver also doesn't know whether the flushed RecPtr is EndRecPtr of
> cross-segment-boundary WAL record or not. In standby, only Startup process
> knows the information because it is hidden in WAL record itself and only
> Startup process reads and builds WAL record.

A standby doesn't write its own WAL records. Even if the primary sent an
incomplete record at a segment boundary, the standby would just promote
to a new TLI and start its own history. Nothing breaks. However, it
could be a problem if a standby that crashed in the problematic way were
started as-is as a primary, but such a scenario is out of our scope.
Now we can identify the stable portion of the WAL stream. It's enough to
prevent walsender from sending data that can be overwritten afterwards.
GetReplicationTargetRecPtr() in the attached does that.

> * Action of Statup process
> Therefore, I implemented that walreceiver never notify and Startup process
> does it.
> In detail, when Startup process reads one full-length WAL record, it notifies
> from a segment that includes head(ReadRecPtr) of the record to a previous
> segment that includes EndRecPtr of the record.

I don't like that archiving on standby relies on replay progress. We
should avoid that, and fortunately I think we don't need it.

> Now, we must pay attention about switching time line.
> The last segment of previous TimeLineID must be notified before switching.
> This case is considered when RM_XLOG_ID is detected.

That segment is archived later, after being renamed with a ".partial"
suffix. We don't archive the last incomplete segment of the previous
timeline as-is.

> 3. About other notifying for segment
> Two notifyings for segment are remain. They are not needed to fix.
>
> (1) Notifying for partial segment
> It is not needed to be completed, so it's OK to notify without special
> consideration.
>
> (2) Re-notifying
> Currently, Checkpointer has notified through XLogArchiveCheckDone().
> It is a sa
Re: some more pg_dump refactoring
Hallo Peter,

My 0.02 €:

Patch applies cleanly, compiles, make check and pg_dump tap tests ok.

The refactoring is a definite improvement.

You changed the query strings to use "\n" instead of " ". I would not
have changed that, because it departs from the style around, and I do
not think it improves readability at the C code level.

I tried to check manually and randomly that the same query is built for
the same version, although my check may have been partial, especially on
the aggregate query which does not comment about what is changed between
versions, and my eyes are not very good at diffing.

I've noticed that some attributes are given systematic replacements
(eg proparallel), removing the need to test for presence afterwards.
This looks fine to me. However, on version < 8.4, ISTM that funcargs and
funciargs are always added: is this intentional?

Would it make sense to accumulate in the other direction, older to
newer, so that new attributes are added at the end of the select?

Should array_to_string be pg_catalog.array_to_string? All other calls
seem to have an explicit schema.

I'm fine with inlining most PQfnumber calls.

I do not have old versions at hand for testing.

> Here is a patch to reorganize dumpFunc() and dumpAgg() in pg_dump,
> similar to daa9fe8a5264a3f192efa5ddee8fb011ad9da365. Instead of
> repeating the almost same large query in each version branch, use one
> query and add a few columns to the SELECT list depending on the
> version. This saves a lot of duplication.
>
> I have tested this with various old versions of PostgreSQL I had
> available, but a bit more random testing with old versions would be
> welcome.

-- Fabien.