Re: Removing unneeded self joins

2021-05-07 Thread Andrey Lepikhov

On 12/3/21 14:05, Hywel Carver wrote:

I've built and tested this, and it seems to function correctly to me. One question I have is 
whether the added "IS NOT NULL" filters can be omitted when they're unnecessary. Some of 
the resulting plans included an "IS NOT NULL" filter on a non-nullable column. To be 
clear, this is still an improvement (to me) without that.


Here is a new version of the feature. It is deeply refactored with the main goal
of reducing the code size, and rebased on current master.


I haven't worked on the 'unnecessary IS NOT NULL filter' issue here.

--
regards,
Andrey Lepikhov
Postgres Professional
From eee0c5f0de35d8cb83e6c4ca7749020acb18a4d1 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Wed, 28 Apr 2021 18:27:53 +0500
Subject: [PATCH] Remove self-joins.

Remove inner joins of a relation to itself if it can be proven that such
a join can be replaced with a scan. We can build the required proofs of
uniqueness using the existing innerrel_is_unique machinery.

We can remove a self-join if, for each outer row:
1. At most one inner row matches the join clauses.
2. If the join target list contains any inner vars, then the inner row
is (physically) the same row as the outer one.

In this patch we use Rowley's [1] approach to identify a self-join:
1. Collect all mergejoinable join quals that look like a.x = b.x.
2. Collect all other join quals.
3. Check innerrel_is_unique() for the qual list from (1). If it
returns true, then each outer row matches only the same row from the inner
relation. This proves that the join is a self-join and can be replaced by
a scan (an illustrative query is sketched below).
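
For illustration only (not part of the patch), this is the kind of query
the optimization targets; the table and column names are made up:

CREATE TABLE t (id int PRIMARY KEY, payload text);

-- Inner self-join on the primary key: each outer row matches at most one
-- inner row, and that inner row is physically the same row.
SELECT a.id, b.payload
  FROM t a JOIN t b ON a.id = b.id
 WHERE b.payload LIKE 'x%';

-- Equivalent single scan the planner can produce instead; the extra
-- "id IS NOT NULL" filter is the one discussed above.
SELECT a.id, a.payload
  FROM t a
 WHERE a.payload LIKE 'x%' AND a.id IS NOT NULL;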

Some regression tests changed due to self-join removal logic.

[1] 
https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com
---
 src/backend/optimizer/plan/analyzejoins.c | 942 +-
 src/backend/optimizer/plan/planmain.c |   5 +
 src/backend/optimizer/util/joininfo.c |   3 +
 src/backend/optimizer/util/relnode.c  |  26 +-
 src/backend/utils/misc/guc.c  |  10 +
 src/include/optimizer/pathnode.h  |   4 +
 src/include/optimizer/planmain.h  |   2 +
 src/test/regress/expected/equivclass.out  |  32 +
 src/test/regress/expected/join.out| 331 
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/sql/equivclass.sql   |  16 +
 src/test/regress/sql/join.sql | 166 
 12 files changed, 1511 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c 
b/src/backend/optimizer/plan/analyzejoins.c
index 37eb64bcef..dd5c4d2bd3 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,6 +22,7 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_class.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
@@ -32,10 +33,12 @@
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
 
+bool enable_self_join_removal;
+
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
- Relids 
joinrelids);
+ Relids 
joinrelids, int subst_relid);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int 
*nremoved);
 static bool rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel);
 static bool rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel,
@@ -47,6 +50,9 @@ static bool is_innerrel_unique_for(PlannerInfo *root,
   RelOptInfo 
*innerrel,
   JoinType 
jointype,
   List 
*restrictlist);
+static void change_rinfo(RestrictInfo* rinfo, Index from, Index to);
+static Bitmapset* change_relid(Relids relids, Index oldId, Index newId);
+static void change_varno(Expr *expr, Index oldRelid, Index newRelid);
 
 
 /*
@@ -86,7 +92,7 @@ restart:
 
remove_rel_from_query(root, innerrelid,
  
bms_union(sjinfo->min_lefthand,
-   
sjinfo->min_righthand));
+   
sjinfo->min_righthand), 0);
 
/* We verify that exactly one reference gets removed from 
joinlist */
nremoved = 0;
@@ -300,7 +306,10 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo 
*sjinfo)
 
 /*
  * Remove the target relid from the planner's data structures, having
- * determined that there is no need to include it in the query.
+ * determined that there is no need to include it in the query. Or replace
+ * with another relid.
+ * To reusability, this routine can wo

Re: few ideas for pgbench

2021-05-07 Thread Pavel Stehule
pá 7. 5. 2021 v 9:46 odesílatel Fabien COELHO <
fabien.coe...@mines-paristech.fr> napsal:

>
> Hello,
>
> >>> When I run pgbench, I usually work with more releases together, so the
> >>> server version is important info.
> >>
> >> Ok. Yes.
>
> Here is a putative patch for this simple part.
>

+1


> >>> What about CSV? Any run can produce one row.
> >>
> >> Yep, CSV is simple and nice. It depends on what information you would
> >> like. For instance for progress report (-P 1) or logs/sampling (-l)
> would
> >> be relevant candidates for CSV. Not so much for the final report,
> though.
> >
> > I think so there can be almost all information. We have to ensure
> > consistency of columns.
> >
> > The basic usage can be
> >
> > for 
> > do
> >  pg_bench ... >> logfile
> > done
> >
> > and log file can looks like
> >
> > start time, rowno, serverver, clientver, connections, scale, readonly,
> > jobs, tps, latency, ...
> >
> > The header row can be optional
>
> Hmmm. It is less clear how to do that with minimal impact on the
> code, as some options change the report structure, eg when using
> multiple scripts (-b/-f) or having detailed per-op information (-r), as
> shown below:
>
>   sh> pgbench -P 1 -T 10 -M prepared -c 2 -b se@9 -b si -r
>   starting vacuum...end.
>   progress: 1.0 s, 10666.9 tps, lat 0.186 ms stddev 0.454
>   progress: 2.0 s, 9928.0 tps, lat 0.201 ms stddev 0.466
>   progress: 3.0 s, 10314.8 tps, lat 0.193 ms stddev 0.469
>   progress: 4.0 s, 10042.7 tps, lat 0.198 ms stddev 0.466
>   progress: 5.0 s, 11084.3 tps, lat 0.180 ms stddev 0.408
>   progress: 6.0 s, 9804.1 tps, lat 0.203 ms stddev 0.474
>   progress: 7.0 s, 10271.5 tps, lat 0.194 ms stddev 0.463
>   progress: 8.0 s, 10511.5 tps, lat 0.190 ms stddev 0.424
>   progress: 9.0 s, 10005.7 tps, lat 0.199 ms stddev 0.501
>   progress: 10.0 s, 10512.4 tps, lat 0.190 ms stddev 0.428
>   pgbench (PostgreSQL) 14.0
>   server version: 13.2
>   transaction type: multiple scripts
>   scaling factor: 1
>   query mode: prepared
>   number of clients: 2
>   number of threads: 1
>   duration: 10 s
>   number of transactions actually processed: 103144
>   latency average = 0.193 ms
>   latency stddev = 0.455 ms
>   initial connection time = 5.043 ms
>   tps = 10319.361549 (without initial connection time)
>   SQL script 1: 
>   - weight: 9 (targets 90.0% of total)
>   - 92654 transactions (89.8% of total, tps = 9269.856947)
>   - latency average = 0.052 ms
>   - latency stddev = 0.018 ms
>   - statement latencies in milliseconds:
>   0.000  \set aid random(1, 10 * :scale)
>   0.052  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
>   SQL script 2: 
>   - weight: 1 (targets 10.0% of total)
>   - 10490 transactions (10.2% of total, tps = 1049.504602)
>   - latency average = 1.436 ms
>   - latency stddev = 0.562 ms
>   - statement latencies in milliseconds:
>   0.001  \set aid random(1, 10 * :scale)
>   0.000  \set bid random(1, 1 * :scale)
>   0.000  \set tid random(1, 10 * :scale)
>   0.000  \set delta random(-5000, 5000)
>   0.027  BEGIN;
>   0.065  UPDATE pgbench_accounts SET abalance = abalance + :delta
> WHERE aid = :aid;
>   0.045  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
>   0.048  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
> VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
>   1.249  END
>
> The nature of columns would change depending on options, eg "initial
> connection time" does not make sense under -C, so that under a loop around
> pgbench scenario rows would not necessarily be consistent…
>
> Also, I'm not sure whether such a report can/should include all inputs
> options.
>
> Finally it is unclear how to add such a feature with minimal impact on the
> source code.


It is a question whether this is possible without more changes or without a
compatibility break :( Probably not. All output would have to be centralized.


> What I usually do is to put each pgbench run output in a separate file and
> write a small shell/perl/python script to process these, possibly
> generating CSV on the way.
>

The goal of my proposal was to reduce the need to write auxiliary
scripts. The produced document doesn't need to be "nice", but it should be
very easy to import into analytical tools.

There is an analogy with Postgres's CSV logs; it is the same idea. We can
view the result of pgbench as a kind of log.

Pavel



>
> --
> Fabien.


Re: AlterSubscription_refresh "wrconn" wrong variable?

2021-05-07 Thread Peter Smith
On Fri, May 7, 2021 at 7:08 AM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > On 2021-May-06, Peter Smith wrote:
> >> PSA v3 of the patch. Same as before, but now also renames the global
> >> variable from "wrconn" to "lrep_worker_wrconn".
>
> > I think there are two patches here -- the changes to
> > AlterSubscription_refresh are a backpatchable bugfix, and the rest of it
> > can just be applied to master.
>
> The rename of that variable is just cosmetic, true, but I'd still be
> inclined to back-patch it.  If we don't do so then I'm afraid that
> future back-patched fixes might be bitten by the same confusion,
> possibly introducing new real bugs.
>
> Having said that, keeping the two aspects in separate patches might
> ease review and testing.

Done.

>
> > In my mind we make a bit of a distinction for global variables by using
> > CamelCase rather than undercore_separated_words.
>
> I think it's about 50/50, TBH.  I'd stick with whichever style is
> being used in nearby code.
>

The nearby code was a random mixture of Camels and Snakes, so instead
of flipping a coin I went with the suggestion from Alvaro.

~~

PSA v4 of the patch.

0001 - Fixes the AlterSubscription_refresh as before.
0002 - Renames the global var "wrconn" -> "LogRepWorkerWalRcvConn" as suggested.

--
Kind Regards,
Peter Smith
Fujitsu Australia


v4-0001-AlterSubscription_refresh-Use-stack-variable-for-.patch
Description: Binary data


v4-0002-Rename-the-logical-replication-global-wrconn.patch
Description: Binary data


Re: Race condition in recovery?

2021-05-07 Thread Kyotaro Horiguchi
Thanks.


At Fri, 7 May 2021 11:04:53 +0530, Dilip Kumar  wrote in 
> On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar  
> > wrote in
> > Could you please fix the test script so that it causes your issue
> > correctly? And/or elaborate a bit more?
> >
> > The attached first file is the debugging aid logging. The second is
> > the test script, to be placed in src/test/recovery/t.
> 
> I will look into your test case and try to see whether we can
> reproduce the issue.  But let me summarise what is the exact issue.
> Basically, the issue is that first in validateRecoveryParameters if
> the recovery target is the latest then we fetch the latest history
> file and set the recoveryTargetTLI timeline to the latest available
> timeline assume it's 2 but we delay updating the expectedTLEs (as per
> commit ee994272ca50f70b53074f0febaec97e28f83c4e).  Now, while reading

I think it is right up to here.

> the checkpoint record if we don't get the required WAL from the
> archive then we try to get from primary, and while getting checkpoint
> from primary we use "ControlFile->checkPointCopy.ThisTimeLineID"
> suppose that is older timeline 1.  Now after reading the checkpoint we
> will set the expectedTLEs based on the timeline from which we got the
> checkpoint record.

I doubt this point.  ReadCheckpointRecord finally calls
XLogFileReadAnyTLI, and it uses the content of the 0002.history file as
the local timeline entry list, since expectedTLEs is NIL and
recoveryTargetTLI has been updated to 2 by
validateRecoveryParameters(). But node 3 had only the segment
on TLI=1, so ReadCheckPointRecord() finds the wanted checkpoint record
on TLI=1.  XLogFileReadAnyTLI() copies the local TLE list based on
TLI=2 to expectedTLEs just after that, because the wanted checkpoint
record was available based on that list.

So I don't think checkPointCopy.ThisTimeLineID can affect this
logic, and I don't think expectedTLEs is left as NIL.  It would be helpful
if you could show the specific code path that causes that.

> See below Logic in WaitForWalToBecomeAvailable
> if (readFile < 0)
> {
> if (!expectedTLEs)
> expectedTLEs = 
> readTimeLineHistory(receiveTLI);
> 
> Now, the first problem is we are breaking the sanity of expectedTLEs
> because as per the definition it should already start with
> recoveryTargetTLI but it is starting with the older TLI.  Now, in

If my description above is correct, expectedTLEs has always been
filled from TLI=2's history, so readTimeLineHistory is not called there.

After that, things work as described in my previous mail, so
the following is not an issue, unless I'm missing something.


> rescanLatestTimeLine we are trying to fetch the latest TLI which is
> still 2, so this logic returns without reinitializing the expectedTLEs
> because it assumes that if recoveryTargetTLI is pointing to 2 then
> expectedTLEs must be correct and need not be changed.
> 
> See below logic:
> rescanLatestTimeLine(void)
> {
> 
> newtarget = findNewestTimeLine(recoveryTargetTLI);
> if (newtarget == recoveryTargetTLI)
> {
> /* No new timelines found */
> return false;
> }
> ...
> newExpectedTLEs = readTimeLineHistory(newtarget);
> ...
> expectedTLEs = newExpectedTLEs;
> 
> 
> Solution:
> 1. Find better way to fix the problem of commit
> (ee994272ca50f70b53074f0febaec97e28f83c4e) which is breaking the
> sanity of expectedTLEs.
> 2. Assume, we have to live with fix 1 and we have to initialize
> expectedTLEs with an older timeline for validating the checkpoint in
> absence of tl.hostory file (as this commit claims).  Then as soon as
> we read and validate the checkpoint, fix the expectedTLEs and set it
> based on the history file of recoveryTargetTLI.
> 
> Does this explanation make sense?  If not please let me know what part
> is not clear in the explanation so I can point to that code.

So, unfortunately not.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: few ideas for pgbench

2021-05-07 Thread Fabien COELHO




Finally it is unclear how to add such a feature with minimal impact on the
source code.



It is a question whether this is possible without more changes or without a
compatibility break :( Probably not. All output would have to be centralized.


Yes and no.

For some things we could have "void report_sometype(file, name, data)"
which appends "data," under csv but "name = data\n" under text, but this
does not work for nested data (eg -r -b/-f), which would rather require
some json/yaml/whatever format that can embed a hierarchy.



What I usually do is to put each pgbench run output in a separate file and
write a small shell/perl/python script to process these, possibly
generating CSV on the way.


The goal of my proposal was to reduce the need to write auxiliary
scripts. The produced document doesn't need to be "nice", but it should be
very easy to import into analytical tools.


Yes, I understood that. I tend to use CSV for that, importing the results
into pg or sqlite and analysing them with SQL.
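
For what it's worth, a minimal sketch of that workflow (the CSV layout and
column names here are hypothetical, since pgbench does not produce such a
file by itself today):

CREATE TABLE pgbench_runs (
    start_time  timestamptz,
    server_ver  text,
    client_ver  text,
    clients     int,
    scale       int,
    tps         numeric,
    latency_ms  numeric
);

-- load one row per run, as collected by a wrapper script
\copy pgbench_runs FROM 'runs.csv' WITH (FORMAT csv, HEADER true)

-- then analyse, e.g. compare server versions
SELECT server_ver, avg(tps) AS avg_tps
  FROM pgbench_runs
 GROUP BY server_ver
 ORDER BY avg_tps DESC;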



There is an analogy with Postgres's CSV logs; it is the same idea. We can
view the result of pgbench as a kind of log.


Sure, but this works for simple flat data, not for changing structures.

--
Fabien.




RE: Allow batched insert during cross-partition updates

2021-05-07 Thread houzj.f...@fujitsu.com
> > Thanks! It looks good!
> 
> Thanks for checking.  I'll mark this as RfC.

Hi,

The patch cannot be applied to the latest HEAD branch; it would be nice if you
could rebase it.
And when looking into the patch, I have some comments on it.

1)
IIRC, after commit c5b7ba4 the initialization of
mt_partition_tuple_routing was moved out of ExecInitModifyTable.
So the following if-test uses "proute", which is initialized at the beginning of
ExecModifyTable() and could therefore be out of date.
And the postgres_fdw regression test fails with the patch applied on top of commit
c5b7ba4.

+* If the query's main target relation is a partitioned table, any 
inserts
+* would have been performed using tuple routing, so use the appropriate
+* set of target relations.  Note that this also covers any inserts
+* performed during cross-partition UPDATEs that would have occurred
+* through tuple routing.
 */
if (proute)
...

It seems we should get the mt_partition_tuple_routing just before the if-test.

2)
+   foreach(lc, estate->es_opened_result_relations)
+   {
+   resultRelInfo = lfirst(lc);
+   if (resultRelInfo &&

I am not sure whether we need to check if resultRelInfo == NULL, because the
existing code did not check it.
And if we do need to check it, it might be better to use "if (resultRelInfo == NULL
&&..."

3)
+   if (fmstate && fmstate->aux_fmstate != NULL)
+   fmstate = fmstate->aux_fmstate;

It might be better to write it as "if (fmstate != NULL && fmstate->aux_fmstate
!= NULL)".

Best regards,
houzj




Re: few ideas for pgbench

2021-05-07 Thread Pavel Stehule
pá 7. 5. 2021 v 11:28 odesílatel Fabien COELHO  napsal:

>
> >> Finally it is unclear how to add such a feature with minimal impact on
> the
> >> source code.
> >
> >
> > It is a question if this is possible without more changes or without
> > compatibility break :( Probably not. All output should be centralized.
>
> Yes and no.
>
> For some things we could have "void report_sometype(file, name, data)"
> which append "data," under csv but "name = data\n" under text, but this
> does not work for nested data (eg -r -b/-f), which would rather require
> some json/yaml/whatever format which can embed a hierarchy.
>

It can work with nested data too, but the result would have to be denormalized.



> >> What I usually do is to put each pgbench run output in a separate file
> and
> >> write a small shell/perl/python script to process these, possibly
> >> generating CSV on the way.
> >
> > The goal of my proposal was a reduction of necessity to write auxiliary
> > scripts. The produced document should not be "nice", but should be very
> > easy to import it to some analytical tools.
>
> Yes, I understood that. I tend to use CSV for that, import results in pg
> or sqlite and analyse with SQL.
>
> > There is an analogy with Postgres's CSV logs. It is the same. We can see
> > the result of pgbench like some log.
>
> Sure, but this works for simple flat data, not changing structures.
>

Denormalized tables are common.  Although it can be ugly, it should work.



> --
> Fabien.
>


Re: Race condition in recovery?

2021-05-07 Thread Dilip Kumar
 On Fri, May 7, 2021 at 2:33 PM Kyotaro Horiguchi
 wrote:
>
> Thanks.
>
>
> At Fri, 7 May 2021 11:04:53 +0530, Dilip Kumar  wrote 
> in
> > On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar  
> > > wrote in
> > > Could you please fix the test script so that it causes your issue
> > > correctly? And/or elaborate a bit more?
> > >
> > > The attached first file is the debugging aid logging. The second is
> > > the test script, to be placed in src/test/recovery/t.
> >
> > I will look into your test case and try to see whether we can
> > reproduce the issue.  But let me summarise what is the exact issue.
> > Basically, the issue is that first in validateRecoveryParameters if
> > the recovery target is the latest then we fetch the latest history
> > file and set the recoveryTargetTLI timeline to the latest available
> > timeline assume it's 2 but we delay updating the expectedTLEs (as per
> > commit ee994272ca50f70b53074f0febaec97e28f83c4e).  Now, while reading
>
> I think it is right up to here.
>
> > the checkpoint record if we don't get the required WAL from the
> > archive then we try to get from primary, and while getting checkpoint
> > from primary we use "ControlFile->checkPointCopy.ThisTimeLineID"
> > suppose that is older timeline 1.  Now after reading the checkpoint we
> > will set the expectedTLEs based on the timeline from which we got the
> > checkpoint record.
>
> I doubt this point.  ReadCheckpointRecord finally calls
> XLogFileReadAnyTLI and it uses the content of the 0002.history as
> the local timeline entry list, since expectedTLEs is NIL and
> recoveryTargetTLI has been updated to 2 by
> validateRecoveryParameters(). But node 3 was having only the segment
> on TLI=1 so ReadCheckPointRecord() finds the wanted checkpoint recrod
> on TLI=1.  XLogFileReadAnyTLI() copies the local TLE list based on
> TLI=2 to expectedTLEs just after that because the wanted checkpoint
> record was available based on the list.

Okay, I got your point.  Now consider the scenario where we are trying
to get the checkpoint record in XLogFileReadAnyTLI; you are right that
it returns history file 0002.history.  I think I did not mention
one point: basically, while restarting node 3 after promoting
node 2, the tool deletes all the local WAL of node 3 (so that node 3 can
follow node 2).  So now node 3 doesn't have the checkpoint in the local
segment.  Suppose the checkpoint record was in segment
00010001, but after the TL switch 00010001
is renamed to 00010001.partial on node 2, so now we
practically don't have the 00010001 file anywhere.
However, if the TL switch happens mid-segment, then we copy that segment
with the new TL, so we have 00020001, which contains the checkpoint
record, but node 2 hasn't archived it yet.

So now we come out of XLogFileReadAnyTLI without reading the checkpoint
record and without setting expectedTLEs, because expectedTLEs is only
set if we are able to read the checkpoint record.  Does that make sense?

> So I don't think checkPointCopy.ThisTimeLineID cannot affect this
> logic, and don't think expectedTLEs is left with NIL.  It's helpful
> that you could show the specific code path to cause that.

So now expectedTLEs is still NULL, and we go to get the checkpoint
record from the primary using checkPointCopy.ThisTimeLineID.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-05-07 Thread Michail Nikolaev
Hello, Antonin.

> I'm trying to review the patch, but not sure if I understand this problem,
> please see my comment below.

Thanks a lot for your attention. I strongly recommend looking at
version N3 [1], because it is a much more elegant, simple, and reliable
solution :) But the minRecoveryPoint-related issue affects it anyway.

> Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by
> replaying the commit record. And if the standby happens to crash before the
> commit record could be replayed, no query should see the deletion and thus no
> hint bit should be set in the index.

minRecoveryPoint is not affected by replaying the commit record in
most cases. It is updated in a lazy way, roughly as
minRecoveryPoint = max LSN of any flushed page. Version 3 of the patch
contains a code_optional.patch to move minRecoveryPoint forward more
aggressively to get additional performance on the standby (based on
Peter's answer in [2]).

So, “minRecoveryPoint will go here” is not a consequence of “STANDBY INSERTS
NEW ROW IN INDEX”; it is just a random event.

Thanks,
Michail.

[1]: 
https://www.postgresql.org/message-id/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CAH2-WzkSUcuFukhJdSxHFgtL6zEQgNhgOzNBiTbP_4u%3Dk6igAg%40mail.gmail.com
(“Also, btw, do you know any reason to keep minRecoveryPoint at a low
value?”)




batch fdw insert bug (Postgres 14)

2021-05-07 Thread Pavel Stehule
Hi

I am testing the new features in Postgres 14, and I found a bug.

EXPLAIN ANALYZE VERBOSE for an insert into an FDW table with batch_size 1000
returns:

---
 Insert on public.vzdalena_tabulka2  (cost=0.00..175.00 rows=0 width=0)
(actual time=30.269..30.270 rows=0 loops=1)
   Remote SQL:
\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F
   Batch Size: 1000
   ->  Function Scan on pg_catalog.generate_series g  (cost=0.00..175.00
rows=1 width=36) (actual time=0.453..1.988 rows=10
 Output: g.i, ('AHOJ'::text || (g.i)::text)
 Function Call: generate_series(1, 1)
 Planning Time: 0.075 ms
 Execution Time: 31.032 ms
(8 rows)

Regards

Pavel


Re: Asynchronous Append on postgres_fdw nodes.

2021-05-07 Thread Andrey Lepikhov

On 6/5/21 22:12, Stephen Frost wrote:

* Etsuro Fujita (etsuro.fuj...@gmail.com) wrote:

I think the user should be careful about this.  How about adding a
note about it to the “Asynchronous Execution Options” section in
postgres-fdw.sgml, like the attached?

+1

... then again, it'd really be better if we could figure out a way to
just do the right thing here.  I haven't looked at this in depth but I
would think that the overhead of async would be well worth it just about
any time there's more than one foreign server involved.  Is it not
reasonable to have a heuristic where we disable async in the cases where
there's only one foreign server, but have it enabled all the other time?
While continuing to allow users to manage it explicitly if they want.
Benchmarking SELECT from foreign partitions hosted on the same server,
I see these results:


With async append:
1 partition - 178 ms; 4 - 263; 8 - 450; 16 - 860; 32 - 1740.

Without:
1 - 178 ms; 4 - 583; 8 - 1140; 16 - 2302; 32 - 4620.

So, these results show that we have a reason to use async append in the 
case where there's only one foreign server.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Bogus collation version recording in recordMultipleDependencies

2021-05-07 Thread Thomas Munro
On Thu, May 6, 2021 at 9:23 AM Andrew Dunstan  wrote:
> On 5/5/21 5:12 PM, Thomas Munro wrote:
> > On Thu, May 6, 2021 at 8:58 AM Andrew Dunstan  wrote:
> >> this is an open item for release 14 . The discussion seems to have gone
> >> silent for a couple of weeks. Are we in a position to make any
> >> decisions? I hear what Andres says, but is anyone acting on it?
> > I'm going to revert this and resubmit for 15.  That'll give proper
> > time to reconsider the question of whether pg_depend is right for
> > this, and come up with a non-rushed response to the composite type
> > problem etc.
>
> OK, thanks.

Reverted.  Rebasing notes:

1.  Commit b4c9695e moved toast table declarations so I adapted to the
new scheme, but commit 0cc99327 had taken the OIDs that pg_collation
was previously using, so I had to pick some new ones from the
temporary range for later reassignment.

2.  It took me quite a while to figure out that the collversion column
now needs BKI_DEFAULT(_null_), or the perl script wouldn't accept the
contents of pg_collation.dat.

3.  In a separate commit, I rescued a few sentences of text from the
documentation about libc collation versions and reinstated them in the
most obvious place, because although the per-index tracking has been
reverted, the per-collation version tracking (limited as it is) is now
back and works on more OSes than before.




Re: pg_receivewal makes a bad daemon

2021-05-07 Thread Magnus Hagander
On Wed, May 5, 2021 at 7:12 PM Robert Haas  wrote:
>
> On Wed, May 5, 2021 at 12:34 PM Magnus Hagander  wrote:
> > Is this really a problem we should fix ourselves? Most daemon-managers
> > today will happily be configured to automatically restart a daemon on
> > failure with a single setting since a long time now. E.g. in systemd
> > (which most linuxen uses now) you just set Restart=on-failure (or
> > maybe even Restart=always) and something like RestartSec=10.
> >
> > That said, it wouldn't cover an fsync() error -- they will always
> > restart. The way to handle that is for the operator to capture the
> > error message perhaps, and just "deal with it"?
>
> Maybe, but if that's really a non-problem, why does postgres itself
> restart, and have facilities to write and rotate log files? I feel
> like this argument boils down to "a manual transmission ought to be
> good enough for anyone, let's not have automatics." But over the years
> people have found that automatics are a lot easier to drive. It may be
> true that if you know just how to configure your system's daemon
> manager, you can make all of this work, but it's not like we document
> how to do any of that, and it's probably not the same on every
> platform - Windows in particular - and, really, why should people have
> to do this much work? If I want to run postgres in the background I
> can just type 'pg_ctl start'. I could even put 'pg_ctl start' in my
> crontab to make sure it gets restarted within a few minutes even if
> the postmaster dies. If I want to keep pg_receivewal running all the
> time ... I need a whole pile of extra mechanism to work around its
> inherent fragility. Documenting how that's typically done on modern
> systems, as you propose further on, would be great, but I can't do it,
> because I don't know how to make it work. Hence the thread.

If PostgreSQL was built today, I'm not sure we would've built that
functionality TBH.

The vast majority of people are not interested in manually starting
postgres and then putting in a crontab to "restart it if it fails".
That's not how anybody runs a server and hasn't for a long time.

It might be interesting for us as developers, but not to the vast
majority of our users. Most of those get their startup scripts from
our packagers -- so maybe we should encourage packagers to provide it,
like they do for PostgreSQL itself. But I don't think adding log
rotations and other independent functionality to pg_receivexyz would
help almost anybody in our user base.

In relation to the other thread about pid 1 handling and containers --
if anything, I bet a larger portion of our users would be interested
in running pg_receivewal in a dedicated container, than would want to
start it manually and verify it's running using crontab... By a large
margin.

It is true that Windows is a special case in this. But it is, I'd say,
equally true that adding something akin to "pg_ctl start" for
pg_receivewal would be equally useless on Windows.

We can certainly build and add such functionality. But my feeling is
that it's going to be added complexity for very little practical gain.
Much of the server world moved to "we don't want every single daemon
to implement it its own way, ever so slightly different".

I like your car analogy though. But I'd consider it more like "we used
to have to mix the right amount of oil into the gasoline manually. But
modern engines don't really require us to do that anymore, so most
people have stopped, only those who want very special cars do". Or
something along that line. (Reality is probably somewhere in between,
and I suck at car analogies)


> > Also, all the above also apply to pg_recvlogical, right? So if we do
> > want to invent our own daemon-init-system, we should probably do one
> > more generic that can handle both.
>
> Yeah. And I'm not really 100% convinced that trying to patch this
> functionality into pg_receive{wal,logical} is the best way forward ...

It does in a lot of ways amount to basically a daemon-init system. It
might be easier to just vendor one of the existing ones :) Or more
realistically, suggest they use something that's already on their
system. On linux that'll be systemd, on *bsd it'll probably be
something like supervisord, on mac it'll be launchd. But this is
really more a function of the operating system/distribution.

Windows is again the one that stands out. But PostgreSQL *already*
does a pretty weak job of solving that problem on Windows, so
duplicating that is not that strong a win.


> but I'm not entirely convinced that it isn't, either. I think one of
> the basic problems with trying to deploy PostgreSQL in 2021 is that it
> needs so much supporting infrastructure and so much babysitting.
> archive_command has to be a complicated, almost magical program we
> don't provide, and we don't even tell you in the documentation that
> you need it. If you don't want to use that, you can stream with
> pg_receivewal instead, but now you need a 

Re: pg_receivewal makes a bad daemon

2021-05-07 Thread Magnus Hagander
On Thu, May 6, 2021 at 5:43 AM Robert Haas  wrote:
>
> On Wed, May 5, 2021 at 10:42 PM David Fetter  wrote:
> > We do use at least one bit and piece from around the internet to make
> > our software usable, namely libreadline, the absence of which make
> > psql pretty much unusable.

FWIW, we did go with the idea of using readline. Which doesn't work
properly on Windows. So this is an excellent example of how we're
already not solving the problem for Windows users, but are apparently
OK with it in this case.


> I'm not talking about dependent libraries. We obviously have to depend
> on some external libraries; it would be crazy to write our own
> versions of libreadline, zlib, glibc, and everything else we use.

Why is that more crazy than building our own limited version of
supervisord? readline and glibc might be one thing, but zlib (at least
the parts we use) is probably less complex than building our own
cross-platform daemon management.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Asynchronous Append on postgres_fdw nodes.

2021-05-07 Thread Andrey Lepikhov

On 6/5/21 11:45, Etsuro Fujita wrote:

On Tue, Apr 27, 2021 at 9:31 PM Etsuro Fujita  wrote:
The patch fixes the issue, but I don’t think it’s the right way to go,
because it requires an extra ExecProcNode() call, which wouldn’t be
efficient.  Also, the patch wouldn’t address another issue I noticed
in EXPLAIN ANALYZE for async-capable nodes that the command wouldn’t
measure the time spent in such nodes accurately.  For the case of
async-capable node using postgres_fdw, it only measures the time spent
in ExecProcNode() in ExecAsyncRequest()/ExecAsyncNotify(), missing the
time spent in other things such as creating a cursor in
ExecAsyncRequest().  :-(.  To address both issues, I’d like to propose
the attached, in which I added instrumentation support to
ExecAsyncRequest()/ExecAsyncConfigureWait()/ExecAsyncNotify().  I
think this would not only address the reported issue more efficiently,
but allow to collect timing for async-capable nodes more accurately.


OK, I agree with the approach, but the following test case fails:

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM (
(SELECT * FROM f1) UNION ALL (SELECT * FROM f2)
) q1 LIMIT 100;
ERROR:  InstrUpdateTupleCount called on node not yet executed

The initialization script is in the attachment.

--
regards,
Andrey Lepikhov
Postgres Professional


t1.sql
Description: application/sql


Re: Asynchronous Append on postgres_fdw nodes.

2021-05-07 Thread Andrey Lepikhov

On 6/5/21 14:11, Etsuro Fujita wrote:

On Tue, Apr 27, 2021 at 3:57 PM Andrey V. Lepikhov
 wrote:

One more question. Append choose async plans at the stage of the Append
plan creation.
Later, the planner performs some optimizations, such as eliminating
trivial Subquery nodes. So, AsyncAppend is impossible in some
situations, for example:

(SELECT * FROM f1 WHERE a < 10)
UNION ALL
(SELECT * FROM f2 WHERE a < 10);

But works for the query:

SELECT *
FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
WHERE a < 10;

As far as I understand, this is not a hard limit.


I think so, but IMO I think this would be an improvement rather than a bug fix.


We can choose async
subplans at the beginning of the execution stage.
For a demo, I prepared the patch (see in attachment).
It solves the problem and passes the regression tests.


Thanks for the patch!  IIUC, another approach to this would be the
patch you proposed before [1].  Right?

Yes. I think the new solution will be better.

--
regards,
Andrey Lepikhov
Postgres Professional




Re: batch fdw insert bug (Postgres 14)

2021-05-07 Thread Pavel Stehule
pá 7. 5. 2021 v 11:48 odesílatel Pavel Stehule 
napsal:

> Hi
>
> I am testing new features in Postgres 14, and I found bug
>
> EXPLAIN ANALYZE VERBOSE  for insert to FDW table with batch_size 1000
> returns
>
>
> ---
>  Insert on public.vzdalena_tabulka2  (cost=0.00..175.00 rows=0 width=0)
> (actual time=30.269..30.270 rows=0 loops=1)
>Remote SQL:
> \x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F
>Batch Size: 1000
>->  Function Scan on pg_catalog.generate_series g  (cost=0.00..175.00
> rows=1 width=36) (actual time=0.453..1.988 rows=10
>  Output: g.i, ('AHOJ'::text || (g.i)::text)
>  Function Call: generate_series(1, 1)
>  Planning Time: 0.075 ms
>  Execution Time: 31.032 ms
> (8 rows)
>

reproducer

CREATE DATABASE omega;

\c omega

CREATE TABLE tabulka(a int, b varchar);

\c postgres

CREATE EXTENSION postgres_fdw;

CREATE SERVER omega_db
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (dbname 'omega');

CREATE USER MAPPING FOR "pavel.stehule"
  SERVER omega_db OPTIONS (user 'pavel.stehule');

CREATE FOREIGN TABLE vzdalena_tabulka(a int, b varchar)
  SERVER omega_db
  OPTIONS (table_name 'tabulka');

CREATE FOREIGN TABLE vzdalena_tabulka2(a int, b varchar)
  SERVER omega_db
  OPTIONS (table_name 'tabulka', batch_size '1000');

EXPLAIN ANALYZE VERBOSE INSERT INTO vzdalena_tabulka SELECT i, 'AHOJ' || i
FROM generate_series(1,1) g(i);
EXPLAIN ANALYZE VERBOSE INSERT INTO vzdalena_tabulka2 SELECT i, 'AHOJ' || i
FROM generate_series(1,1) g(i);

Pavel

>
> Regards
>
> Pavel
>


Query regarding RANGE Partitioning

2021-05-07 Thread Nitin Jadhav
Hi,

I am not convinced by the following behaviour of RANGE partitioning.
Kindly let me know whether this is expected behaviour or it should be changed.

*Case-1*:
postgres@68941=#create table r(a int, b int) partition by range(a,b);
CREATE TABLE
postgres@68941=#create table r1 partition of r for values from (100,0) to
(200,100);
CREATE TABLE
postgres@68941=#create table r2 partition of r for values from (400,200) to
(500,300);
CREATE TABLE
postgres@68941=#create table r3 partition of r for values from (0,100) to
(100,200);
ERROR:  partition "r3" would overlap partition "r1"
LINE 1: ...able r3 partition of r for values from (0,100) to (100,200);

As we can see here, I am trying to create a partition with the range
from (0,100) to (100,200),
which does not actually overlap with any of the existing partitions. But I
am getting an error saying
it overlaps with partition 'r1'.

*Case-2:*
postgres@68941=#\d+ r
  Partitioned table "public.r"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression
| Stats target | Description
+-+---+--+-+-+-+--+-
 a  | integer |   |  | | plain   |
|  |
 b  | integer |   |  | | plain   |
|  |
Partition key: RANGE (a, b)
Partitions: r1 FOR VALUES FROM (100, 0) TO (200, 100),
r2 FOR VALUES FROM (400, 200) TO (500, 300),
r3 FOR VALUES FROM (200, 100) TO (300, 200)

postgres@68941=#insert into r values(300, 50);
INSERT 0 1
postgres@68941=#select * from r3;
  a  |  b
-+-
 300 |  50
(2 rows)

As per my understanding, in a range-partitioned table the lower bound is
included and the upper bound is excluded,
and in the case of multi-column partition keys the row comparison operator is
used for tuple routing, which means
the columns are compared left to right. If the partition key value is equal
to the upper bound of that column, then
the next column will be considered.

So, in the case of inserting the row (300, 50), based on this understanding
partition 'r3' should have rejected it.
Kindly confirm whether the above is expected or not. If expected, kindly
explain.

Thanks and Regards,
Nitin Jadhav


Re: Query regarding RANGE Partitioning

2021-05-07 Thread Ashutosh Bapat
On Fri, May 7, 2021 at 4:21 PM Nitin Jadhav
 wrote:
>
> Hi,
>
> I am not convinced with the following behaviour of RANGE Partitioning.
> Kindly let me know if this is expected behaviour or it should be changed.
>
> Case-1:
> postgres@68941=#create table r(a int, b int) partition by range(a,b);
> CREATE TABLE
> postgres@68941=#create table r1 partition of r for values from (100,0) to 
> (200,100);
> CREATE TABLE
> postgres@68941=#create table r2 partition of r for values from (400,200) to 
> (500,300);
> CREATE TABLE
> postgres@68941=#create table r3 partition of r for values from (0,100) to 
> (100,200);
> ERROR:  partition "r3" would overlap partition "r1"
> LINE 1: ...able r3 partition of r for values from (0,100) to (100,200);
>
> As we can see here, I am trying to create a partition table with ranges from 
> (0,100) to (100,200)
> which is actually not overlapped with any of the existing partitions. But I 
> am getting error saying,
> it overlaps with partition 'r1'.

The overlapping range is (100, 0) to (100, 200).

>
> Case-2:
> postgres@68941=#\d+ r
>   Partitioned table "public.r"
>  Column |  Type   | Collation | Nullable | Default | Storage | Compression | 
> Stats target | Description
> +-+---+--+-+-+-+--+-
>  a  | integer |   |  | | plain   | |  
> |
>  b  | integer |   |  | | plain   | |  
> |
> Partition key: RANGE (a, b)
> Partitions: r1 FOR VALUES FROM (100, 0) TO (200, 100),
> r2 FOR VALUES FROM (400, 200) TO (500, 300),
> r3 FOR VALUES FROM (200, 100) TO (300, 200)
>
> postgres@68941=#insert into r values(300, 50);
> INSERT 0 1
> postgres@68941=#select * from r3;
>   a  |  b
> -+-
>  300 |  50
> (2 rows)
>
> As per my understanding, in the range partitioned table, lower bound is 
> included and upper bound is excluded.
> and in case of multi-column partition keys, the row comparison operator is 
> used for tuple routing which means
> the columns are compared left to right. If the partition key value is equal 
> to the upper bound of that column then
> the next column will be considered.
>
> So, In case of insertion of row (300, 50). Based on the understanding, 
> partition 'r3' should have rejected it.

r3 covers (300, 0) to (300, 200), which contains (300, 50). The first key,
300, is equal to the upper bound 300, so it compares 50, which is less than
the upper bound of the second column. Am I missing something?
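
You can see the row-wise comparison at work directly (just an illustration):

-- (300, 50) is >= r3's lower bound (200, 100) and < its upper bound
-- (300, 200) under row-wise comparison, so the row routes to r3.
SELECT (300, 50) >= (200, 100) AS above_lower,
       (300, 50) <  (300, 200) AS below_upper;
--  above_lower | below_upper
-- -------------+-------------
--  t           | t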


-- 
Best Wishes,
Ashutosh Bapat




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2021-05-07 Thread ahsan hadi
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

I have also seen this error with logical replication using the pglogical
extension; will this patch also address a similar problem with pglogical?

Re: Identify missing publications from publisher while create/alter subscription.

2021-05-07 Thread Bharath Rupireddy
On Fri, May 7, 2021 at 11:50 AM Dilip Kumar  wrote:
>
> On Thu, May 6, 2021 at 7:22 PM vignesh C  wrote:
> >
>
> Some comments:
> 1.
> I don't see any change in pg_dump.c, don't we need to dump this option?

I don't think it is necessary there, as the default value of
validate_publication is false, so even if pg_dump makes no mention
of the option, it is assumed to be false while restoring. Note
that the validate_publication option is transient (like other
options such as create_slot and copy_data), which means it can't be stored
in the pg_subscription catalogue. Therefore, the user-specified value can't be
fetched once the CREATE/ALTER SUBSCRIPTION command is finished. If we
were to dump the option, we would have to store it in the catalogue,
which I don't think is necessary. Thoughts?

> 2.
> + /* Try to connect to the publisher. */
> + wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
> + if (!wrconn)
> + ereport(ERROR,
> + (errmsg("could not connect to the publisher: %s", err)));
>
> Instead of using global wrconn, I think you should use a local variable?

Yeah, we should be using a local wrconn; otherwise there can be
consequences, see the patches at [1]. Thanks for pointing this out.

[1] - 
https://www.postgresql.org/message-id/CAHut%2BPuSwWmmeK%2Bfe6E2duep8588Jk82XXH73nE4dUxwDNkNUg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Identify missing publications from publisher while create/alter subscription.

2021-05-07 Thread Dilip Kumar
On Fri, May 7, 2021 at 5:38 PM Bharath Rupireddy
 wrote:
>
> On Fri, May 7, 2021 at 11:50 AM Dilip Kumar  wrote:
> >
> > On Thu, May 6, 2021 at 7:22 PM vignesh C  wrote:
> > >
> >
> > Some comments:
> > 1.
> > I don't see any change in pg_dump.c, don't we need to dump this option?
>
> I don't think it is necessary there as the default value of the
> validate_publication is false, so even if the pg_dump has no mention
> of the option, then it is assumed to be false while restoring. Note
> that the validate_publication option is transient (like with other
> options such as create_slot, copy_data) which means it can't be stored
> in pg_subscritpion catalogue. Therefore, user specified value can't be
> fetched once the CREATE/ALTER subscription command is finished. If we
> were to dump the option, we should be storing it in the catalogue,
> which I don't think is necessary. Thoughts?

If we are not storing it in the catalog then it does not need to be dumped.

> > 2.
> > + /* Try to connect to the publisher. */
> > + wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
> > + if (!wrconn)
> > + ereport(ERROR,
> > + (errmsg("could not connect to the publisher: %s", err)));
> >
> > Instead of using global wrconn, I think you should use a local variable?
>
> Yeah, we should be using local wrconn, otherwise there can be
> consequences, see the patches at [1]. Thanks for pointing out this.

Right.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Logical Replication - behavior of TRUNCATE ... CASCADE

2021-05-07 Thread Dilip Kumar
On Mon, May 3, 2021 at 6:08 PM Bharath Rupireddy
 wrote:
>
> Having said that, isn't it good if we can provide a subscription
> (CREATE/ALTER) level option say "cascade"(similar to other options
> such as binary, synchronous_commit, stream)  default being false, when
> set to true, we send upstream CASCADE option to ExecuteTruncateGuts in
> apply_handle_truncate? It will be useful to truncate all the dependent
> tables in the subscriber. Users will have to use it with caution
> though.

I think this could be a useful feature in some cases.  Suppose the
subscriber has a table that is dependent on the subscribed table; in
that case, if the main table gets truncated, the apply will always
error out on the subscriber, which is fine.  But if the user doesn't
want the error and is fine with the dependent table getting truncated
as well, there should be some option for that.  I think the
documentation should clearly state the impact of setting this to true.
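
Just to illustrate the proposal (this is only a sketch; the "cascade" option
does not exist today and the object names are made up):

-- subscriber side: a table with an FK on the subscribed table
CREATE TABLE orders (id int PRIMARY KEY);
CREATE TABLE order_lines (order_id int REFERENCES orders(id));

-- proposed subscription-level option, default false
CREATE SUBSCRIPTION sub_orders
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub_orders
    WITH (cascade = true);

-- With cascade = false, a replicated TRUNCATE of "orders" errors out on the
-- subscriber because "order_lines" references it; with cascade = true the
-- apply worker would pass CASCADE to ExecuteTruncateGuts and truncate
-- "order_lines" as well.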

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: batch fdw insert bug (Postgres 14)

2021-05-07 Thread houzj.f...@fujitsu.com

>  I am testing new features in Postgres 14, and I found bug 
> EXPLAIN ANALYZE VERBOSE  for insert to FDW table with batch_size 1000 returns
> ---
> Insert on public.vzdalena_tabulka2  (cost=0.00..175.00 rows=0 width=0) 
>(actual time=30.269..30.270 rows=0 loops=1)
>   Remote SQL: 
>\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F
>   Batch Size: 1000
>   ->  Function Scan on pg_catalog.generate_series g  (cost=0.00..175.00 
>rows=1 width=36) (actual time=0.453..1.988 rows=10
>         Output: g.i, ('AHOJ'::text || (g.i)::text)
>         Function Call: generate_series(1, 1)
> Planning Time: 0.075 ms
> Execution Time: 31.032 ms
> (8 rows)
> reproducer

I can reproduce the issue and did some basic analysis on it.

The "Remote SQL" is built from the following code:


char   *sql = strVal(list_nth(fdw_private,

  FdwModifyPrivateUpdateSql));

ExplainPropertyText("Remote SQL", sql, es);
---

It uses the query string stored in the fdw_private list.
However, "fmstate->query" will also point to the same string in fdw_private,
via postgresBeginForeignModify --> create_foreign_modify --> "fmstate->query =
query;".

And in execute_foreign_modify(), "fmstate->query" will be freed when rebuilding
the query string to do the batch insert, like the following:


if (operation == CMD_INSERT && fmstate->num_slots != *numSlots) 
{   
...   
/* Build INSERT string with numSlots records in its VALUES clause. */
initStringInfo(&sql);   
rebuildInsertSql(&sql, fmstate->orig_query, fmstate->values_end,
 fmstate->p_nums, *numSlots - 1)
**  pfree(fmstate->query);  
fmstate->query = sql.data;


So it outputs the freed pointer's contents as "Remote SQL".

As for the fix:
the query string can be rebuilt depending on numSlots,
so which query string should be output?
Should we just output the original query string, as in the attached patch,
or should we output the last one?

Best regards,
houzj


0001-fix-explain-info-about-fdw-batch-insert.patch
Description: 0001-fix-explain-info-about-fdw-batch-insert.patch


Re: Query regarding RANGE Partitioning

2021-05-07 Thread Jeevan Ladhe
Hi Nitin,

On Fri, May 7, 2021 at 4:21 PM Nitin Jadhav 
wrote:

> Hi,
>
> I am not convinced with the following behaviour of RANGE Partitioning.
> Kindly let me know if this is expected behaviour or it should be changed.
>
> *Case-1*:
> postgres@68941=#create table r(a int, b int) partition by range(a,b);
> CREATE TABLE
> postgres@68941=#create table r1 partition of r for values from (100,0) to
> (200,100);
> CREATE TABLE
> postgres@68941=#create table r2 partition of r for values from (400,200)
> to (500,300);
> CREATE TABLE
> postgres@68941=#create table r3 partition of r for values from (0,100) to
> (100,200);
> ERROR:  partition "r3" would overlap partition "r1"
> LINE 1: ...able r3 partition of r for values from (0,100) to (100,200);
>
> As we can see here, I am trying to create a partition table with ranges
> from (0,100) to (100,200)
> which is actually not overlapped with any of the existing partitions. But
> I am getting error saying,
> it overlaps with partition 'r1'.
>
>
*Case-2:*
> postgres@68941=#\d+ r
>   Partitioned table "public.r"
>  Column |  Type   | Collation | Nullable | Default | Storage | Compression
> | Stats target | Description
>
> +-+---+--+-+-+-+--+-
>  a  | integer |   |  | | plain   |
> |  |
>  b  | integer |   |  | | plain   |
> |  |
> Partition key: RANGE (a, b)
> Partitions: r1 FOR VALUES FROM (100, 0) TO (200, 100),
> r2 FOR VALUES FROM (400, 200) TO (500, 300),
> r3 FOR VALUES FROM (200, 100) TO (300, 200)
>
> postgres@68941=#insert into r values(300, 50);
> INSERT 0 1
> postgres@68941=#select * from r3;
>   a  |  b
> -+-
>  300 |  50
> (2 rows)
>
> As per my understanding, in the range partitioned table, lower bound is
> included and upper bound is excluded.
> and in case of multi-column partition keys, the row comparison operator is
> used for tuple routing which means
> the columns are compared left to right. If the partition key value is
> equal to the upper bound of that column then
> the next column will be considered.
>
> So, In case of insertion of row (300, 50). Based on the understanding,
> partition 'r3' should have rejected it.
>
> Kindly confirm whether the above is expected or not. If expected, kindly
> explain.
>

If you describe the partition r3, you can see the way partition
constraints are formed:

postgres=# \d+ r3
   Table "public.r3"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression
| Stats target | Description
+-+---+--+-+-+-+--+-
 a  | integer |   |  | | plain   |
|  |
 b  | integer |   |  | | plain   |
|  |
Partition of: r FOR VALUES FROM (200, 100) TO (300, 200)
Partition constraint: ((a IS NOT NULL) AND (b IS NOT NULL) AND ((a > 200)
OR ((a = 200) AND (b >= 100))) AND ((a < 300) OR ((a = 300) AND (b < 200
Access method: heap

The above constraint fits the tuple you are trying to insert, that is,
(a, b) = (300, 50) (where (a = 300) AND (b < 200)).

Also, the table partition syntax documentation [1] clarifies this (look
for "partition_bound_expr"):

"When creating a range partition, the lower bound specified with
FROM is an inclusive bound, whereas the upper bound specified with
TO is an exclusive bound. That is, the values specified in the FROM
list are valid values of the corresponding partition key columns
for this partition, whereas those in the TO list are not. Note that
this statement must be understood according to the rules of row-wise
comparison (Section 9.24.5). For example, given PARTITION BY RANGE
(x,y), a partition bound FROM (1, 2) TO (3, 4) allows x=1 with any y>=2,
x=2 with any non-null y, and x=3 with any y<4."

So, in your case the partition (a, b) for bound (200, 100) TO (300, 200)
would transform to allowing:
a = 200 with any b >= 100 OR
a > 200 and a < 300 with any non-null b
OR a=300 with any b<200

Your particular tuple (300, 50) fits in the last part of the OR, i.e.
(a=300 with any b<200).

So, IMHO, the range partitioning is behaving as expected.

Similarly, for the case-1 you mention above:
create table r1 partition of r for values from (100,0) to (200,100);
create table r3 partition of r for values from (0,100) to (100,200);
here, (100, 0) of r1 would overlap with (100, 200) of r3.
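
The collision is visible with a plain row comparison (illustration only):

-- r1's lower bound (100, 0) sorts below r3's proposed upper bound
-- (100, 200), so the ranges [(100,0),(200,100)) and [(0,100),(100,200))
-- share the interval [(100,0),(100,200)).
SELECT (100, 0) < (100, 200) AS r1_lower_below_r3_upper;
--  r1_lower_below_r3_upper
-- -------------------------
--  t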


[1] https://www.postgresql.org/docs/current/sql-createtable.html

Regards,
Jeevan Ladhe


Re: Identify missing publications from publisher while create/alter subscription.

2021-05-07 Thread vignesh C
On Fri, May 7, 2021 at 5:44 PM Dilip Kumar  wrote:
>
> On Fri, May 7, 2021 at 5:38 PM Bharath Rupireddy
>  wrote:
> >
> > On Fri, May 7, 2021 at 11:50 AM Dilip Kumar  wrote:
> > >
> > > On Thu, May 6, 2021 at 7:22 PM vignesh C  wrote:
> > > >
> > >
> > > Some comments:
> > > 1.
> > > I don't see any change in pg_dump.c, don't we need to dump this option?
> >
> > I don't think it is necessary there as the default value of the
> > validate_publication is false, so even if the pg_dump has no mention
> > of the option, then it is assumed to be false while restoring. Note
> > that the validate_publication option is transient (like with other
> > options such as create_slot, copy_data) which means it can't be stored
> > in pg_subscritpion catalogue. Therefore, user specified value can't be
> > fetched once the CREATE/ALTER subscription command is finished. If we
> > were to dump the option, we should be storing it in the catalogue,
> > which I don't think is necessary. Thoughts?
>
> If we are not storing it in the catalog then it does not need to be dumped.

I intentionally did not store this value; I felt we need not persist
this option's value. This value will be false while dumping, similar to
other non-stored parameters.

> > > 2.
> > > + /* Try to connect to the publisher. */
> > > + wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
> > > + if (!wrconn)
> > > + ereport(ERROR,
> > > + (errmsg("could not connect to the publisher: %s", err)));
> > >
> > > Instead of using global wrconn, I think you should use a local variable?
> >
> > Yeah, we should be using local wrconn, otherwise there can be
> > consequences, see the patches at [1]. Thanks for pointing out this.

Modified.

Thanks for the comments; the attached patch has fixes for them.

Regards,
Vignesh
From c6e4df57824309644eb486296e69741404186b9e Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Wed, 7 Apr 2021 22:05:53 +0530
Subject: [PATCH v10] Identify missing publications from publisher while
 create/alter subscription.

Creating/altering a subscription currently succeeds even when we specify a publication
which does not exist on the publisher. This patch checks whether the specified
publications are present on the publisher and throws an error if any of them
is missing.
---
 doc/src/sgml/ref/alter_subscription.sgml  |  13 ++
 doc/src/sgml/ref/create_subscription.sgml |  18 +-
 src/backend/commands/subscriptioncmds.c   | 231 +++---
 src/bin/psql/tab-complete.c   |   7 +-
 src/test/subscription/t/007_ddl.pl|  68 ++-
 5 files changed, 302 insertions(+), 35 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac814f4..81e156437b 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -160,6 +160,19 @@ ALTER SUBSCRIPTION name RENAME TO <
  
 

+
+   
+validate_publication (boolean)
+
+ 
+  When true, the command verifies if all the specified publications
+  that are being subscribed to are present in the publisher and throws
+  an error if any of the publication doesn't exist. The default is
+  false.
+ 
+
+   
+
   
 

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812beee37..cad9285c16 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -207,8 +207,9 @@ CREATE SUBSCRIPTION subscription_nameCREATE SUBSCRIPTION
   should connect to the publisher at all.  Setting this to
   false will change default values of
-  enabled, create_slot and
-  copy_data to false.
+  enabled, create_slot,
+  copy_data and
+  validate_publication to false.
  
 
  
@@ -239,6 +240,19 @@ CREATE SUBSCRIPTION subscription_name
 

+
+   
+validate_publication (boolean)
+
+ 
+  When true, the command verifies if all the specified publications
+  that are being subscribed to are present in the publisher and throws
+  an error if any of the publication doesn't exist. The default is
+  false.
+ 
+
+   
+
   
 

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 517c8edd3b..2352f7e71f 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,9 @@ parse_subscription_options(List *options,
 		   char **synchronous_commit,
 		   bool *refresh,
 		   bool *binary_given, bool *binary,
-		   bool *streaming_given, bool *streaming)
+		   bool *streaming_given, bool *streaming,
+		   bool *validate_publication_given,
+		   bool *validate_publication)
 {
 	ListCell   *lc;
 	bool		connect_g

Re: Use simplehash.h instead of dynahash in SMgr

2021-05-07 Thread Yura Sokolov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

I believe it is ready for committer.

The new status of this patch is: Ready for Committer


Re: Multi-Column List Partitioning

2021-05-07 Thread Jeevan Ladhe
> While reviewing one of the 'Table partitioning' related patches,
> I found that Postgres does not support multiple column based LIST
> partitioning. Based on this understanding, I have started working on
> this feature. I also feel that 'Multi-Column List Partitioning' can
> be benefited to the Postgres users in future.

+1 for the feature. I also think this can help users deal with some
useful cases.


> CREATE TABLE t2_1 PARTITION OF t2 FOR VALUES IN (1, 2), (1, 5), (2,
> 2),(2, 10);

IMHO, listing every single tuple like this might be a bit cumbersome for
the user. What about something like this:

...FOR VALUES IN (1, 2, 3, 4), (11, 22, 33, 44), where the first set
is the list of values for column A and the second list is for column B.
We can treat these lists as A X B possible values, or simply as (a1, b1),
(a2, b2) internally. However, I see that other proprietary databases
already have syntax similar to what you are proposing here, so I leave it
open for thoughts from the experts. Also, though what I propose might be
easy from a user perspective, it might not be that easy to implement,
given that for a larger number of columns in the partition list,
e.g. A X B X C X D, the lists become unmanageable.

I did not review the patch in detail, but a quick look at it leaves me
with following comments:

1.
> + * list. Then this function will continue the serach and return the
index of
Typo:
s/serach/search

2.
A compiler warning:
partprune.c: In function ‘get_matching_list_bounds’:
partprune.c:2731:20: error: passing argument 5 of ‘partition_list_bsearch’
makes pointer from integer without a cast [-Werror=int-conversion]
 2731 |   nvalues, value, &is_equal);
  |^
  ||
  |Datum {aka long unsigned int}
In file included from partprune.c:53:
../../../src/include/partitioning/partbounds.h:120:32: note: expected
‘Datum *’ {aka ‘long unsigned int *’} but argument is of type ‘Datum’ {aka
‘long unsigned int’}
  120 |int nvalues, Datum *value, bool *is_equal);
  | ~~~^

3.
And, a server crash with following case:
postgres=# CREATE TABLE t1 (a int) PARTITION BY LIST (a);
CREATE TABLE
postgres=# CREATE TABLE t1p1 PARTITION OF t1 FOR VALUES IN (1, 2, 3);
CREATE TABLE
postgres=# \d+ t1p1
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>

Stacktrace:
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f5d273c5859 in __GI_abort () at abort.c:79
#2  0x55779d2eb69d in ExceptionalCondition
(conditionName=0x55779d4978d8 "ptr == NULL || nodeTag(ptr) == type",
errorType=0x55779d4978c3 "FailedAssertion",
fileName=0x55779d4978a0 "../../../src/include/nodes/nodes.h",
lineNumber=603) at assert.c:69
#3  0x55779d03a684 in castNodeImpl (type=T_Const, ptr=0x55779e457b18)
at ../../../src/include/nodes/nodes.h:603
#4  0x55779d04368a in get_qual_for_list (parent=0x7f5d1df829b8,
spec=0x55779e457950) at partbounds.c:4155
#5  0x55779d03ac60 in get_qual_from_partbound (rel=0x7f5d1df82570,
parent=0x7f5d1df829b8, spec=0x55779e457950) at partbounds.c:272
#6  0x55779d2cf630 in generate_partition_qual (rel=0x7f5d1df82570) at
partcache.c:379
#7  0x55779d2cf468 in get_partition_qual_relid (relid=32771) at
partcache.c:308
#8  0x55779d2592bf in pg_get_partition_constraintdef
(fcinfo=0x55779e44ee50) at ruleutils.c:2019
#9  0x55779cec7221 in ExecInterpExpr (state=0x55779e44dfb0,
econtext=0x55779e407fe8, isnull=0x7ffddf9b109f) at execExprInterp.c:744
#10 0x55779cec954f in ExecInterpExprStillValid (state=0x55779e44dfb0,
econtext=0x55779e407fe8, isNull=0x7ffddf9b109f) at execExprInterp.c:1819
#11 0x55779cf1d58a in ExecEvalExprSwitchContext (state=0x55779e44dfb0,
econtext=0x55779e407fe8, isNull=0x7ffddf9b109f)
at ../../../src/include/executor/executor.h:338
#12 0x55779cf1d602 in ExecProject (projInfo=0x55779e44dfa8) at
../../../src/include/executor/executor.h:372
#13 0x55779cf1db2f in ExecNestLoop (pstate=0x55779e407ed0) at
nodeNestloop.c:241
#14 0x55779cedf136 in ExecProcNodeFirst (node=0x55779e407ed0) at
execProcnode.c:462
#15 0x55779ced3053 in ExecProcNode (node=0x55779e407ed0) at
../../../src/include/executor/executor.h:257
#16 0x55779ced5a87 in ExecutePlan (estate=0x55779e407c80,
planstate=0x55779e407ed0, use_parallel_mode=false, operation=CMD_SELECT,
sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x55779e425a88, execute_once=true)
at execMain.c:1551
#17 0x55779ced372d in standard_ExecutorRun (queryDesc=0x55779e453520,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:361
#18 0x55779ced353c in ExecutorRun (queryDesc=0x55779e453520,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:305
#19 0x

Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Tom Lane
I wrote:
> 1. No wonder we could not reproduce it anywhere else.  I've warned
> the cfarm admins that their machine may be having hardware issues.

I heard back from the machine's admin.  The time of the crash I observed
matches exactly to these events in the kernel log:

May 07 03:31:39 gcc202 kernel: dm-0: writeback error on inode 2148294407, 
offset 0, sector 159239256
May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
May 07 03:31:39 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, 
sector 157618896 op 0x1:(WRITE) flags 0x4800 phys_seg 16 prio class 0

So it's not a mirage.  The admin seems to think it might be a kernel
bug though.

regards, tom lane




Re: Copyright on perl files

2021-05-07 Thread Andrew Dunstan


On 4/20/21 10:09 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> I've just noticed that we have 41 perl files in our sources with
>> copyright notices of some sort and 161 without. Should we do something
>> about that?
> +1 for pasting the usual copyright notice on the rest.
>
>   



Done.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Why do we have perl and sed versions of Gen_dummy_probes?

2021-05-07 Thread Andrew Dunstan

On 5/6/21 9:55 AM, Andrew Dunstan wrote:
> On 5/6/21 12:59 AM, Andres Freund wrote:
>> Hi,
>>
>> On 2021-05-06 00:18:12 -0400, Tom Lane wrote:
>>> Andres Freund  writes:
 I understand why we don't want to rely on sed because of windows - but
 it's far from obvious why we can't just use the .pl variant all the
 time?
>>> Perl is not considered a hard build requirement on non-Windows.
>> Oops, forgot that.
>>
>>
>>> We could dodge that by shipping a pre-built dummy probes.h,
>>> but that doesn't really seem like a cleaner way than what's
>>> there now.
>> I tried to regenerate Gen_dummy_probes.pl using s2p - which doesn't seem
>> to exist for modern versions of perl anymore :(
>>
>>
>>> Also, as I read it, Gen_dummy_probes.sed is useful in any case as
>>> being the "source code" for Gen_dummy_probes.pl.  You'd need some
>>> other form of documentation if you removed it.
>
> I suggest we add a README that sets out
>
>
> a) why we do things this way
>
> b) that the sed script is what's authoritative
>
> c) how to regenerate the perl script if you change the sed script,
> including where to get s2p
>
>
> I can do that.
>
>


Here's a patch that adds the README and also adds a Makefile recipe for
regenerating Gen_dummy_probes.pl after the sed script is changed. On my
system at least the recipe is idempotent.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

>From 488d812f06fe9ebe53368228af78e7ec8d7478d5 Mon Sep 17 00:00:00 2001
From: Andrew Dunstan 
Date: Fri, 7 May 2021 10:08:01 -0400
Subject: [PATCH] Add a README and Makefile recipe for Gen_dummy_probes.pl

---
 src/backend/utils/Makefile| 10 +
 src/backend/utils/README.Gen_dummy_probes | 25 +++
 2 files changed, 35 insertions(+)
 create mode 100644 src/backend/utils/README.Gen_dummy_probes

diff --git a/src/backend/utils/Makefile b/src/backend/utils/Makefile
index 59b2255260..bcf9dd41ad 100644
--- a/src/backend/utils/Makefile
+++ b/src/backend/utils/Makefile
@@ -88,6 +88,16 @@ $(top_builddir)/src/include/utils/probes.h: probes.h
 	cd '$(dir $@)' && rm -f $(notdir $@) && \
 	$(LN_S) "../../../$(subdir)/probes.h" .
 
+# Recipe for rebuilding the Perl version of Gen_dummy_probes
+# Nothing depends on it, so it will never be called unless explicitly requested
+# The last two lines of the recipe format the script according to  our
+# standard and put back some blank lines for improved readability.
+Gen_dummy_probes.pl: Gen_dummy_probes.sed
+	perl -ni -e ' print; exit if /^\$$0/;' $@
+	s2p -f $<  | sed -e 1,4d -e '/# #/d' -e '$$d' >> $@
+	perltidy --profile=../../tools/pgindent/perltidyrc $@
+	perl -pi -e '!$$lb && ( /^\t+#/  || /^# prototypes/ ) && print qq{\n};'\
+		-e '$$lb = m/^\n/; ' $@
 
 .PHONY: install-data
 install-data: errcodes.txt installdirs
diff --git a/src/backend/utils/README.Gen_dummy_probes b/src/backend/utils/README.Gen_dummy_probes
new file mode 100644
index 00..90fec37bce
--- /dev/null
+++ b/src/backend/utils/README.Gen_dummy_probes
@@ -0,0 +1,25 @@
+# Generating dummy probes
+
+If Postgres isn't configured with dtrace enabled, we need to generate
+dummy probes for the entries in probes.d, that do nothing.
+
+This is accomplished in Unix via the sed script `Gen_dummy_probes.sed`. We
+used to use this in MSVC builds using the perl utility `psed`, which mimicked
+sed. However, that utility disappeared from Windows perl distributions and so
+we converted the sed script to a perl script to be used in MSVC builds.
+
+We still keep the sed script as the authoritative source for generating
+these dummy probes because except on Windows perl is not a hard requirement
+when building from a tarball.
+
+So, if you need to change the way dummy probes are generated, first change
+the sed script, and when it's working generate the perl script. This can
+be accomplished by using the perl utility s2p.
+
+s2p is no longer part of the perl core, so it might not be on your system,
+but it is available on CPAN and also in many package systems. e.g.
+on Fedora it can be installed using `cpan App::s2p` or
+`dnf install perl-App-s2p`.
+
+The Makefile contains a recipe for regenerating Gen_dummy_probes.pl, so all
+you need to do is once you have s2p installed is `make Gen_dummy_probes.pl`
-- 
2.25.4



Re: cache lookup failed for statistics object 123

2021-05-07 Thread Tomas Vondra

I've pushed all three patches, with some better commit messages etc.

thanks!

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Andrew Dunstan


On 5/7/21 12:38 AM, Andres Freund wrote:
> Hi,
>
> On 2021-05-07 00:30:11 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>>> On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
 That I'm not sure about.  gdb is certainly installed, and thorntail is
 visibly running the current buildfarm client and is configured with the
 correct core_file_glob, and I can report that the crash did leave a 'core'
 file in the data directory (so it's not a case of systemd commandeering
 the core dump).  Seems like core-file collection should've worked
 ... unless maybe it's not covering TAP tests at all?
>>> I suspect that is it - there's not really a good way for the buildfarm
>>> client to even know where there could be data directories :(.
>> Does it need to?  I'm envisioning "find tmp_check -name '$core_file_glob'"
>> or something along that line.
> Yea, it'd be doable that way. It'd be a bit harder to associate the core
> files with specific tests though. But I now checked, and it indeed
> checks for core files in a specific subset of tests, and that that test
> only globs inside the passed-in datadir.
>

working on it ...


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Asynchronous Append on postgres_fdw nodes.

2021-05-07 Thread Etsuro Fujita
On Fri, May 7, 2021 at 2:12 AM Stephen Frost  wrote:
> I'd suggest the language point out that it's not actually possible to do
> otherwise, since they all need to be part of the same transaction.
>
> Without that, it looks like we're just missing a trick somewhere and
> someone might think that they could improve PG to open multiple
> connections to the same remote server to execute them in parallel.

Agreed.

> Maybe:
>
> In order to ensure that the data being returned from a foreign server
> is consistent, postgres_fdw will only open one connection for a given
> foreign server and will run all queries against that server sequentially
> even if there are multiple foreign tables involved.  In such a case, it
> may be more performant to disable this option to eliminate the overhead
> associated with running queries asynchronously.
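
(For anyone trying this out, the option discussed here can be set per
foreign server or per foreign table; a small sketch with made-up object
names:)

-- Enable asynchronous execution for everything on one foreign server ...
ALTER SERVER remote_srv OPTIONS (ADD async_capable 'true');
-- ... and switch it off again for a single foreign table where the
-- sequential behaviour described above makes it not worthwhile.
ALTER FOREIGN TABLE remote_tab OPTIONS (ADD async_capable 'false');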

Ok, I’ll merge this into the next version.

Thanks!

Best regards,
Etsuro Fujita




Re: Asynchronous Append on postgres_fdw nodes.

2021-05-07 Thread Etsuro Fujita
On Fri, May 7, 2021 at 7:35 PM Andrey Lepikhov
 wrote:
> On 6/5/21 14:11, Etsuro Fujita wrote:
> > On Tue, Apr 27, 2021 at 3:57 PM Andrey V. Lepikhov
> >  wrote:
> >> One more question. Append choose async plans at the stage of the Append
> >> plan creation.
> >> Later, the planner performs some optimizations, such as eliminating
> >> trivial Subquery nodes. So, AsyncAppend is impossible in some
> >> situations, for example:
> >>
> >> (SELECT * FROM f1 WHERE a < 10)
> >> UNION ALL
> >> (SELECT * FROM f2 WHERE a < 10);

> >> We can choose async
> >> subplans at the beginning of the execution stage.
> >> For a demo, I prepared the patch (see in attachment).
> >> It solves the problem and passes the regression tests.
> >
> > IIUC, another approach to this would be the
> > patch you proposed before [1].  Right?
> Yes. I think, new solution will be better.

Ok, will review.

I think it would be better to start a new thread for this, and add the
patch to the next CF so that it doesn’t get lost.

Best regards,
Etsuro Fujita




Inherited UPDATE/DELETE vs async execution

2021-05-07 Thread Etsuro Fujita
I noticed this while working on the
EXPLAIN-ANALYZE-for-async-capable-nodes issue:

EXPLAIN (VERBOSE, COSTS OFF)
DELETE FROM async_pt;
   QUERY PLAN

 Delete on public.async_pt
   Foreign Delete on public.async_p1 async_pt_1
   Foreign Delete on public.async_p2 async_pt_2
   Delete on public.async_p3 async_pt_3
   ->  Append
 ->  Async Foreign Delete on public.async_p1 async_pt_1
   Remote SQL: DELETE FROM public.base_tbl1
 ->  Async Foreign Delete on public.async_p2 async_pt_2
   Remote SQL: DELETE FROM public.base_tbl2
 ->  Seq Scan on public.async_p3 async_pt_3
   Output: async_pt_3.tableoid, async_pt_3.ctid
(11 rows)

DELETE FROM async_pt;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost

The cause for this would be that direct-update plans are mistakenly
treated as async-capable ones, as shown in the EXPLAIN output.  To
fix, I think we should modify postgresPlanDirectModify() so that it
clears the async-capable flag if it is set.  Attached is a patch for
that.  Maybe I am missing something, though.

Best regards,
Etsuro Fujita


fix-postgresPlanDirectModify.patch
Description: Binary data


Draft back-branch release notes are up

2021-05-07 Thread Tom Lane
See
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7f4bab7f4a0e42ee9fa14707f726017b7869386b

As usual, please send comments/corrections by Sunday.

regards, tom lane




Parallelize correlated subqueries that execute within each worker

2021-05-07 Thread James Coleman
In a bug report back in November [1] a subthread explored why parallel
query is excluded any time we have "Plan nodes which reference a
correlated SubPlan". Amit's understanding was that the reasoning had
to do with inability to easily pass (potentially variable length)
Param values between workers.

However a decent-sized subset of this kind of query doesn't actually
require that we communicate between workers. If the Subplan executes
per-tuple within the worker then there's no reason I can see why it
needs to be marked parallel unsafe. Amit concurred but noted that
identifying that subset of plans is the difficult part (as is usually
the case!)

At the time I'd started work on an approach to handle this case and
hoped to "post about it in a new thread later this week." That didn't
happen, but here we are now, and I finally have this patch cleaned up
enough to share.

The basic idea is that we need to track (both on nodes and relations)
not only whether that node or rel is parallel safe but also whether
it's parallel safe assuming params are rechecked in the using context.
That allows us to delay making a final decision until we have
sufficient context to conclude that a given usage of a Param is
actually parallel safe or unsafe.

The first patch in this series was previously posted in the thread
"Consider parallel for lateral subqueries with limit" [2] and is
required as a precursor for various test cases to work here.

The second patch implements the core of the series. It results in
parallel query being possible for subplans that execute entirely
within the context of a parallel worker for cases where that subplan
is in the target, a LATERAL JOIN, or the WHERE and ORDER BY clauses.

The final patch notes several places where we set e.g.
rel->consider_parallel but setting the corresponding new value
rel->consider_parallel_recheckng_params wasn't yet necessary. It shows
opportunities either for further improvement or for concluding that
certain cases can't benefit and should be left unchanged.
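
To illustrate the target shape (schema names invented): the correlated
subquery below runs entirely per outer tuple, so in principle each worker
could evaluate it locally without shipping Param values between processes.

SELECT o.id, s.line_count
FROM orders o,
     LATERAL (SELECT count(*) AS line_count
              FROM order_lines l
              WHERE l.order_id = o.id) s;   -- o.id is the correlated Param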

James

1: 
https://www.postgresql.org/message-id/CAAaqYe_vihKjc%2B8LuQa49EHW4%2BKfefb3wHqPYFnCuUqozo%2BLFg%40mail.gmail.com
2: 
https://www.postgresql.org/message-id/flat/CAAaqYe_HEkmLwf_1iEHxXwQOWiRyiFd%3DuOu6kwj3sWYdVd1-zA%40mail.gmail.com


v1-0003-Other-places-to-consider-for-completeness.patch
Description: Binary data


v1-0002-Parallel-query-support-for-basic-correlated-subqu.patch
Description: Binary data


v1-0001-Allow-parallel-LATERAL-subqueries-with-LIMIT-OFFS.patch
Description: Binary data


[PATCH] Add configure cache checkpoints before fatal checks

2021-05-07 Thread Joel Jacobson
Hi,

Enabling optional ./configure features/packages using --enable-* / --with-*
will and should cause fatal errors aborting the configure process
if some dependencies are missing, letting the user install such packages,
before proceeding and resuming ./configure.

However, ./configure currently only saves config.cache at the very end upon 
success.

This means the manual loop below can take quite a few iterations,
until the user has encountered all errors for their distribution
and installed all required packages:

loop
    ./configure --with-... --enable-...
    if error occurred then
        figure out which distro package should be installed, using apt-file or google
        install the package
    else
        break
    end
end loop

To speed up the resuming of ./configure, I propose adding AC_CACHE_SAVE entries
at a few positions *before* possibly executing some fatal checks (AC_MSG_ERROR).

It's important not to run AC_CACHE_SAVE in between the actual check and
AC_MSG_ERROR, as that would cache the "no" value, causing a package later
installed by the user to go undetected.

Attached is a patch adding AC_CACHE_SAVE at the places where I think it makes
the most sense. Since it's a macro that, when expanded, makes configure grow
quite a bit, I only added it at the 10 places where we get the most bang for
the buck in terms of speed-up.

Just for fun, I also created a little helper tool, magicmake [1], to automate
the pseudo-code loop above, and to verify that all packages could be found
after installation, so that no undesired "no" values were cached.
The gif animation [2] shows the building of PostgreSQL with these configure 
flags:

magicmake ./configure --config-cache --prefix="$HOME/pg-head" --enable-nls 
--with-perl --with-python --with-tcl --with-icu --with-llvm --with-ssl=openssl 
--with-gssapi --with-ldap --with-pam --with-systemd --with-libxml 
--with-libxslt --with-lz4 --with-pgport=54321 --enable-debug --enable-cassert 
--enable-tap-tests --enable-depend --enable-coverage --enable-profiling 
--enable-dtrace

The following packages were detected and installed by magicmake to make it 
possible to build PostgreSQL with all those options:

liblog-agent-perl lcov systemtap-sdt-dev llvm clang pkg-config libicu-dev 
libxml2-dev liblz4-dev libreadline-dev bison flex python-is-python3 zlib1g-dev 
libkrb5-dev libssl-dev libgss-dev libpam0g-dev libxslt1-dev libldap2-dev 
libsystemd-dev gettext tcl tcl-dev libperl-dev libpython3.8-dev libipc-run-perl 
dbtoepub fop libxml2-utils xsltproc libterm-table-perl libterm-readkey-perl 
libterm-size-any-perl

(I only attached the patch for configure.ac, you have to run autoconf to also 
update configure.)

Thoughts?

/Joel

[1] https://github.com/truthly/magicmake
[2] https://github.com/truthly/magicmake/raw/master/magicmake.gif




add-configure-cache-checkpoints.patch
Description: Binary data


Re: Draft back-branch release notes are up

2021-05-07 Thread Matthias van de Meent
On Fri, 7 May 2021 at 18:23, Tom Lane  wrote:
>
> See
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7f4bab7f4a0e42ee9fa14707f726017b7869386b
>
> As usual, please send comments/corrections by Sunday.

I noticed only one potential issue.

I see similar (if not duplicate) entries for a "COMMIT AND CHAIN"
issue, committed at nearly the same time, and both by Fujii Masao. Are
these the same / should they be contained in one entry?

> +Author: Fujii Masao 
> +Branch: master [8a55cb5ba] 2021-02-19 21:57:52 +0900

> +Author: Fujii Masao 
> +Branch: master [fe06819f1] 2021-02-19 22:01:25 +0900

Thanks in advance,

Matthias van de Meent




Re: Draft back-branch release notes are up

2021-05-07 Thread Tom Lane
Matthias van de Meent  writes:
> I see similar (if not duplicate) entries for a "COMMIT AND CHAIN"
> issue, committed at nearly the same time, and both by Fujii Masao. Are
> these the same / should they be contained in one entry?

>> +Author: Fujii Masao 
>> +Branch: master [8a55cb5ba] 2021-02-19 21:57:52 +0900

>> +Author: Fujii Masao 
>> +Branch: master [fe06819f1] 2021-02-19 22:01:25 +0900

No, the first is a server bug, the second is a psql bug.

Thanks for looking though!

regards, tom lane




Re: Why do we have perl and sed versions of Gen_dummy_probes?

2021-05-07 Thread Tom Lane
Andrew Dunstan  writes:
> Here's a patch that adds the README and also adds a Makefile recipe for
> regenerating Gen_dummy_probes.pl after the sed script is changed. On my
> system at least the recipe is idempotent.

I've not tested the Makefile recipe, but the README looks good.

regards, tom lane




Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the
>> tree" to fix this exact problem.

> Hah, so that maybe explains why thorntail has only shown this in
> the v12 branch.  Should we consider back-patching that?

Realizing that 9989d37d prevents the assertion failure, I went
to see if thorntail had shown EIO failures without assertions.
Looking back 180 days, I found these:

  sysname  |branch |  snapshot   |   stage| 
  l 
   
---+---+-++
 thorntail | HEAD  | 2021-03-19 21:28:15 | recoveryCheck  | 
2021-03-20 00:48:48.117 MSK [4089174:11] 008_fsm_truncation.pl PANIC:  could 
not fdatasync file "00010002": Input/output error
 thorntail | HEAD  | 2021-04-06 16:08:10 | recoveryCheck  | 
2021-04-06 19:30:54.103 MSK [3355008:11] 008_fsm_truncation.pl PANIC:  could 
not fdatasync file "00010002": Input/output error
 thorntail | REL9_6_STABLE | 2021-04-12 02:38:04 | pg_basebackupCheck | 
pg_basebackup: could not fsync file "00010013": Input/output 
error

So indeed the kernel-or-hardware problem is affecting other branches.
I suspect that the lack of reports in the pre-v12 branches is mostly
down to there having been many fewer runs on those branches within
the past couple months.

regards, tom lane




Re: Why do we have perl and sed versions of Gen_dummy_probes?

2021-05-07 Thread Andres Freund
Hi,

On 2021-05-07 11:19:02 -0400, Andrew Dunstan wrote:
> Here's a patch that adds the README and also adds a Makefile recipe for
> regenerating Gen_dummy_probes.pl after the sed script is changed. On my
> system at least the recipe is idempotent.

Nice! Thanks for this work.

Greetings,

Andres Freund




Re: Why do we have perl and sed versions of Gen_dummy_probes?

2021-05-07 Thread Andres Freund
Hi,

On 2021-05-06 11:13:28 +0100, Dagfinn Ilmari Mannsåker wrote:
> Andres Freund  writes:
> 
> > I tried to regenerate Gen_dummy_probes.pl using s2p - which doesn't seem
> > to exist for modern versions of perl anymore :(
> 
> It still exists, it's just not part of the core Perl distribution any
> more (since 5.22, released in 2015):
> 
>   https://metacpan.org/pod/perl5220delta#find2perl,-s2p-and-a2p-removal
>   https://metacpan.org/release/App-s2p.

Oh, I got confused because the cpan link at the top of
https://perldoc.perl.org/5.6.2/s2p is dead, and because I forgot all I
knew about perl a long time ago.

Greetings,

Andres Freund




Re: Draft back-branch release notes are up

2021-05-07 Thread Jonathan S. Katz
On 5/7/21 12:22 PM, Tom Lane wrote:
> See
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7f4bab7f4a0e42ee9fa14707f726017b7869386b

Thanks!

> As usual, please send comments/corrections by Sunday.

==snip=
A previous bug fix caused environment variables (such as PGPORT) to
override entries in the service file in this context. Previously, and in
other contexts, the priority is the other way around; so restore that
behavior.
==snip==

s/;/,/ per grammar check.

Otherwise on a quick read, looks good. I'll be reading it more
thoroughly as the day progresses.

Are there going to be any tzdata changes?

Thanks,

Jonathan



OpenPGP_signature
Description: OpenPGP digital signature


Re: Draft back-branch release notes are up

2021-05-07 Thread Tom Lane
"Jonathan S. Katz"  writes:
> Are there going to be any tzdata changes?

Nope, they're still on 2021a:
https://www.iana.org/time-zones

regards, tom lane




Re: [HACKERS] Comparing primary/HS standby in tests

2021-05-07 Thread Andres Freund
Hi,

On 2015-03-03 16:49:22 +0100, Andres Freund wrote:
> I every now and then run installcheck against a primary, verify that
> replay works without errors, and then compare pg_dumpall from both
> clusters. Unfortunately that currently requires hand inspection of
> dumps, there are differences like:
> -SELECT pg_catalog.setval('default_seq', 1, true);
> +SELECT pg_catalog.setval('default_seq', 33, true);
> 
> The reason these differences is that the primary increases the
> sequence's last_value by 1, but temporarily sets it to +SEQ_LOG_VALS
> before XLogInsert(). So the two differ.
> 
> Does anybody have a good idea how to get rid of that difference? One way
> to do that would be to log the value the standby is sure to have - but
> that's not entirely trivial.

I found a way that's actually fairly simple. On the primary call nextval
often enough to use up all the cached values. The below query does so:

DO $$
DECLARE
s regclass;
BEGIN
FOR s IN SELECT oid::regclass FROM pg_class WHERE relkind = 'S' LOOP
EXECUTE format($s$SELECT nextval(%s), generate_series(1, log_cnt) FROM 
%s;$s$, s::oid, s::text);
END LOOP;
END;$$;

After that dumps on master generate the same dump on primary / standby
for me, after running a regression test.

Greetings,

Andres Freund




Re: Processing btree walks as a batch to parallelize IO

2021-05-07 Thread James Coleman
On Fri, Apr 9, 2021 at 4:57 PM Tomas Vondra
 wrote:
>
>
>
> On 4/9/21 7:33 PM, James Coleman wrote:
> > $SUBJECT is still a very loosely formed idea, so forgive lack of detail
> > or things I've likely missed, but I wanted to get it out there to see if
> > it sounded at all intriguing to people.
> >
> > Background: One of the big problems with non-local storage such as AWS
> > EBS volumes or a SAN is that in a large database (really, working set,
> > where working set includes reads) exceeds the size of buffer cache (and
> > page cache) the cost of random page reads hitting the underlying disk
> > system dominates. This is because networked disks have an order of
> > magnitude higher latency than a bunch of RAIDed SSDs (even more so with
> > NVMe storage). In some of our experiments on Aurora I've seen a 10x
> > change versus pretty good physical hardware, and I'd assume RDS (since
> > it's EBS-backed) is similar.
> >
> > A specific area where this is particularly painful is btree index reads.
> > Walking the tree to leaf pages isn't naturally prefetchable, and so for
> > each level you pay the random page cost. Of course higher levels in the
> > tree will almost certainly exhibit emergent behavior such that they
> > (just by fact of the LRU caching) will be in the buffer cache, but for a
> > large index lower levels likely won't be.
> >
>
> What do you consider a large index level?

In general it's probably all levels but the leaves (though depends on
cache and index size etc.)

> Consider a 1TB table, with just a single UUID column - that's ~25B rows,
> give or take. Real tables will have more columns, so this seems like a
> reasonable model of the largest number of rows per relation. With ~32B
> per index tuple, that's about 100M leaf pages, and with ~256 branches
> per internal page, that's still only ~5 levels. I think it's quite rare
> to see indexes with more than 6 or 7 levels.
>
> And the internal pages are maybe 0.5% of the whole index (so ~4GB out of
> 750GB). I think the usual expectation is that most of that will fit into
> RAM, but of course there may be more indexes competing for that.
>
> I think the index level is not really the crucial bit - it's more about
> the total amount of indexes in the DB.

I suppose? If the tables/indexes/etc. are sufficiently large relative to
the cache size, the quantity won't matter.

> > If we squint a bit, insertions look a whole lot like reads as well since
> > we have to walk the tree to find the leaf insertion page for a new
> > tuple. This is particularly true for indexes where inserts are roughly
> > randomly distributed data, like a uuid.
> >
>
> Yep. We need to walk the index to the leaf pages in both cases, both for
> read and insert workloads.
>
> > The read-for-lookups problem is harder to solve, but the cost as it
> > relates to table inserts is possibly more tractable. Tables typically
> > have more than one index to update, so the obvious approach is "let's
> > just parallelize the index insertions". Of course we know that's
> > difficult given the multi-process approach Postgres uses for parallelism.
> >
>
> Hmm. Not sure if reads are harder to deal with, but I think you're right
> those two cases (reads and writes) may look similar at the level of a
> single index, but may need rather different approaches exactly because
> inserts have to deal with all indexes, while reads only really deal with
> a single index.

Right. In practice it's harder to deal with a single index scan
because you don't have multiple such scans to parallelize.

> FWIW I think there are a couple options for improving reads, at least in
> some cases.
>
> 1) I wonder if e.g. _bt_readnextpage could prefetch at least one page
> ahead. We can't look further ahead, but perhaps this would help.
>
> 2) In some cases (e.g. nested loop with inner indexes scan) we could
> collect an array of values and then look them up at once, which should
> allow us to do at least some fo the I/O in parallel, I think. That's
> similar to what you propose for writes, except that it works against the
> same index.

The "collect an array of values" approach isn't one I'd considered,
but seems likely interesting.
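
At the SQL level, the rough analogue of collecting an array of values and
looking them up at once is turning many single-key probes into one
ScalarArrayOp probe (table name made up); the proposal above is about doing
that batching inside the executor rather than rewriting queries:

-- one probe per key:
SELECT * FROM items WHERE id = 42;
-- batched keys, giving the index scan a chance to fetch them together:
SELECT * FROM items WHERE id = ANY (ARRAY[42, 7, 19, 101]);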

> > Another approach that at first glance seems like it fits better into
> > Postgres (I'm not claiming it's easy or a small patch) would be to
> > process a batch of indexes at once. For example, if the index access
> > methods were extended to allow being given a list of indexes that need
> > to be walked, then the btree code could process each layer in the walk
> > as a group -- issuing IO fetches for all of the first level blocks in
> > the tree, and then computing all of the next level blocks needed and
> > issuing those IO requests at a time, and so on.
> >
>
> Yeah, I agree having a way to say "prefetch all pages needed to insert
> these keys into these indexes" might be better than just parallelizing
> it in a "naive" way.
>
> Not sure how complex would it be - I think the API wo

plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread Pavel Stehule
Hi

I am testing the new features of Postgres 14, and now I am trying to check the
result cache. Unfortunately, on my test data the results are not too good;
the behaviour is very non-linear. Is that expected?

create table t1(a int, t2_id int);
insert into t1 select random() * 10, random() * 10 from
generate_series(1,100);
create table t2(b int, id int);
insert into t2 select random() * 10, random() * 10 from
generate_series(1,100);
create index on t2(id);

vacuum analyze t1, t2;

when work_mem is 40MB

                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=4.65..472639.79 rows=100 width=16) (actual time=0.041..1078.882 rows=100 loops=1)
   ->  Seq Scan on t1  (cost=0.00..14425.00 rows=100 width=8) (actual time=0.010..60.212 rows=100 loops=1)
   ->  Result Cache  (cost=4.65..4.67 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=100)
         Cache Key: t1.t2_id
         Hits: 96  Misses: 4  Evictions: 0  Overflows: 0  Memory Usage: 10547kB
         ->  Aggregate  (cost=4.64..4.65 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=4)
               ->  Index Only Scan using t2_id_idx on t2  (cost=0.42..4.62 rows=11 width=0) (actual time=0.002..0.003 rows=10 loops=4)
                     Index Cond: (id = t1.t2_id)
                     Heap Fetches: 0
 Planning Time: 0.091 ms
 Execution Time: 1120.177 ms

when work_mem is 10MB


postgres=# set work_mem to '10MB'; -- 11MB is ok
SET
postgres=# explain analyze select * from t1, lateral(select count(*) from
t2 where t1.t2_id = t2.id) s ;
 QUERY PLAN


 Nested Loop  (cost=4.65..472639.79 rows=100 width=16) (actual
time=0.040..56576.187 rows=100 loops=1)
   ->  Seq Scan on t1  (cost=0.00..14425.00 rows=100 width=8) (actual
time=0.010..76.753 rows=100 loops=1)
   ->  Result Cache  (cost=4.65..4.67 rows=1 width=8) (actual
time=0.056..0.056 rows=1 loops=100)
 Cache Key: t1.t2_id
 Hits: 884158  Misses: 115842  Evictions: 18752  Overflows: 0
 Memory Usage: 10241kB
 ->  Aggregate  (cost=4.64..4.65 rows=1 width=8) (actual
time=0.005..0.005 rows=1 loops=115842)
   ->  Index Only Scan using t2_id_idx on t2  (cost=0.42..4.62
rows=11 width=0) (actual time=0.003..0.004 rows=10 loops=115842)
 Index Cond: (id = t1.t2_id)
 Heap Fetches: 0
 Planning Time: 0.087 ms
 Execution Time: 56621.421 ms
(11 rows)

The query without result cache

postgres=# explain analyze select * from t1, lateral(select count(*) from
t2 where t1.t2_id = t2.id) s ;
  QUERY PLAN

---
 Nested Loop  (cost=4.64..4689425.00 rows=100 width=16) (actual
time=0.031..3260.858 rows=100 loops=1)
   ->  Seq Scan on t1  (cost=0.00..14425.00 rows=100 width=8) (actual
time=0.008..71.792 rows=100 loops=1)
   ->  Aggregate  (cost=4.64..4.65 rows=1 width=8) (actual
time=0.003..0.003 rows=1 loops=100)
 ->  Index Only Scan using t2_id_idx on t2  (cost=0.42..4.62
rows=11 width=0) (actual time=0.002..0.002 rows=10 loops=100)
   Index Cond: (id = t1.t2_id)
   Heap Fetches: 0
 Planning Time: 0.081 ms
 Execution Time: 3293.543 ms
(8 rows)



Samples: 119K of event 'cycles', 4000 Hz, Event count (approx.):
Overhead  Shared Object Symbol
  79.20%  postgres  [.] cache_reduce_memory
   1.94%  [kernel]  [k] native_write_msr_safe
   1.63%  [kernel]  [k] update_cfs_shares
   1.00%  [kernel]  [k] trigger_load_balance
   0.97%  [kernel]  [k] timerqueue_add
   0.51%  [kernel]  [k] task_tick_fair
   0.51%  [kernel]  [k] task_cputime
   0.50%  [kernel]  [k] perf_event_task_tick
   0.50%  [kernel]  [k] update_curr
   0.49%  [kernel]  [k] hrtimer_active

Regards

Pavel


Re: Why do we have perl and sed versions of Gen_dummy_probes?

2021-05-07 Thread Andrew Dunstan


On 5/7/21 1:20 PM, Andres Freund wrote:
> Hi,
>
> On 2021-05-07 11:19:02 -0400, Andrew Dunstan wrote:
>> Here's a patch that adds the README and also adds a Makefile recipe for
>> regenerating Gen_dummy_probes.pl after the sed script is changed. On my
>> system at least the recipe is idempotent.
> Nice! Thanks for this work.
>


de nada. pushed.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread Pavel Stehule
pá 7. 5. 2021 v 20:24 odesílatel Pavel Stehule 
napsal:

> Hi
>
> I am testing new features of Postgres 14, and now I am trying to check the
> result cache. Unfortunately on my test data, the result is not too good.
> the behaviour is very non linear. Is it expected?
>
> create table t1(a int, t2_id int);
> insert into t1 select random() * 10, random() * 10 from
> generate_series(1,100);
> create table t2(b int, id int);
> insert into t2 select random() * 10, random() * 10 from
> generate_series(1,100);
> create index on t2(id);
>
> vacuum analyze t1, t2;
>
> when work_mem is 40MB
>
> QUERY PLAN
>
>
> ---
>
>
>  Nested Loop  (cost=4.65..472639.79 rows=100 width=16) (actual
> time=0.041..1078.882 rows=100 loops=1)
>
>
>->  Seq Scan on t1  (cost=0.00..14425.00 rows=100 width=8) (actual
> time=0.010..60.212 rows=100 loops=1)
>
>->  Result Cache  (cost=4.65..4.67 rows=1 width=8) (actual
> time=0.001..0.001 rows=1 loops=100)
>
>
>  Cache Key: t1.t2_id
>
>
>  Hits: 96  Misses: 4  Evictions: 0  Overflows: 0  Memory
> Usage: 10547kB
>
>  ->  Aggregate  (cost=4.64..4.65 rows=1 width=8) (actual
> time=0.003..0.003 rows=1 loops=4)
>
>
>->  Index Only Scan using t2_id_idx on t2  (cost=0.42..4.62
> rows=11 width=0) (actual time=0.002..0.003 rows=10 loops=4)
>
>  Index Cond: (id = t1.t2_id)
>
>
>  Heap Fetches: 0
>
>
>  Planning Time: 0.091 ms
>
>
>  Execution Time: 1120.177 ms
>
> when work_mem is 10MB
>
>
> postgres=# set work_mem to '10MB'; -- 11MB is ok
> SET
> postgres=# explain analyze select * from t1, lateral(select count(*) from
> t2 where t1.t2_id = t2.id) s ;
>  QUERY
> PLAN
>
> 
>  Nested Loop  (cost=4.65..472639.79 rows=100 width=16) (actual
> time=0.040..56576.187 rows=100 loops=1)
>->  Seq Scan on t1  (cost=0.00..14425.00 rows=100 width=8) (actual
> time=0.010..76.753 rows=100 loops=1)
>->  Result Cache  (cost=4.65..4.67 rows=1 width=8) (actual
> time=0.056..0.056 rows=1 loops=100)
>  Cache Key: t1.t2_id
>  Hits: 884158  Misses: 115842  Evictions: 18752  Overflows: 0
>  Memory Usage: 10241kB
>  ->  Aggregate  (cost=4.64..4.65 rows=1 width=8) (actual
> time=0.005..0.005 rows=1 loops=115842)
>->  Index Only Scan using t2_id_idx on t2  (cost=0.42..4.62
> rows=11 width=0) (actual time=0.003..0.004 rows=10 loops=115842)
>  Index Cond: (id = t1.t2_id)
>  Heap Fetches: 0
>  Planning Time: 0.087 ms
>  Execution Time: 56621.421 ms
> (11 rows)
>
>
Could it be possible to disable caching when the number of evictions crosses
some limit?

Some average ratio of cache hits to evictions could be calculated, and when
that ratio crosses some threshold, the cache could be bypassed.




> The query without result cache
>
> postgres=# explain analyze select * from t1, lateral(select count(*) from
> t2 where t1.t2_id = t2.id) s ;
>   QUERY PLAN
>
>
> ---
>  Nested Loop  (cost=4.64..4689425.00 rows=100 width=16) (actual
> time=0.031..3260.858 rows=100 loops=1)
>->  Seq Scan on t1  (cost=0.00..14425.00 rows=100 width=8) (actual
> time=0.008..71.792 rows=100 loops=1)
>->  Aggregate  (cost=4.64..4.65 rows=1 width=8) (actual
> time=0.003..0.003 rows=1 loops=100)
>  ->  Index Only Scan using t2_id_idx on t2  (cost=0.42..4.62
> rows=11 width=0) (actual time=0.002..0.002 rows=10 loops=100)
>Index Cond: (id = t1.t2_id)
>Heap Fetches: 0
>  Planning Time: 0.081 ms
>  Execution Time: 3293.543 ms
> (8 rows)
>
>
>
> Samples: 119K of event 'cycles', 4000 Hz, Event count (approx.):
> Overhead  Shared Object Symbol
>   79.20%  postgres  [.] cache_reduce_memory
>1.94%  [kernel]  [k] native_write_msr_safe
>1.63%  [kernel]  [k] update_cfs_shares
>1.00%  [kernel]  [k] trigger_load_balance
>0.97%  [kernel]  [k] timerqueue_add
>0.51%  [kernel]  [k] task_tick_fair
>0.51%  [kernel]  [k] task_cputime
>0.50%  [kernel]  [k] perf_event_task_tick
>0.50%  [kernel]  [k] update_curr
>0.49%  [kernel]  [k] hrtimer_active
>
> Regards
>
> Pavel
>


Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread Yura Sokolov

Pavel Stehule писал 2021-05-07 21:45:


Samples: 119K of event 'cycles', 4000 Hz, Event count (approx.):
Overhead  Shared Object Symbol
79.20%  postgres  [.] cache_reduce_memory
1.94%  [kernel]  [k] native_write_msr_safe
1.63%  [kernel]  [k] update_cfs_shares
1.00%  [kernel]  [k] trigger_load_balance
0.97%  [kernel]  [k] timerqueue_add
0.51%  [kernel]  [k] task_tick_fair
0.51%  [kernel]  [k] task_cputime
0.50%  [kernel]  [k] perf_event_task_tick
0.50%  [kernel]  [k] update_curr
0.49%  [kernel]  [k] hrtimer_active

Regards

Pavel


It is strange to see cache_reduce_memory itself consumes a lot of CPU.
It doesn't contain CPU hungry code.
It calls prepare_probe_slot, that calls some tuple forming. Then
it calls resultcache_lookup that may call to ResultCacheHash_hash
and ResultCacheHash_equal. And finally it calls remove_cache_entry.
I suppose remove_cache_entry should consume most of CPU time since
it does deallocations.
And if you compile with --enable-cassert, then remove_cache_entry
iterates through whole cache hashtable, therefore it reaches
quadratic complexity easily (or more correct M*N, where M - size
of a table, N - eviction count).

regards,
Yura Sokolov




Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread Tomas Vondra




On 5/7/21 9:06 PM, Yura Sokolov wrote:

Pavel Stehule писал 2021-05-07 21:45:


Samples: 119K of event 'cycles', 4000 Hz, Event count (approx.):
Overhead  Shared Object Symbol
79.20%  postgres  [.] cache_reduce_memory
1.94%  [kernel]  [k] native_write_msr_safe
1.63%  [kernel]  [k] update_cfs_shares
1.00%  [kernel]  [k] trigger_load_balance
0.97%  [kernel]  [k] timerqueue_add
0.51%  [kernel]  [k] task_tick_fair
0.51%  [kernel]  [k] task_cputime
0.50%  [kernel]  [k] perf_event_task_tick
0.50%  [kernel]  [k] update_curr
0.49%  [kernel]  [k] hrtimer_active

Regards

Pavel


It is strange to see cache_reduce_memory itself consumes a lot of CPU.
It doesn't contain CPU hungry code.
It calls prepare_probe_slot, that calls some tuple forming. Then
it calls resultcache_lookup that may call to ResultCacheHash_hash
and ResultCacheHash_equal. And finally it calls remove_cache_entry.
I suppose remove_cache_entry should consume most of CPU time since
it does deallocations.
And if you compile with --enable-cassert, then remove_cache_entry
iterates through whole cache hashtable, therefore it reaches
quadratic complexity easily (or more correct M*N, where M - size
of a table, N - eviction count).



Yeah. I tried reproducing the issue, but without success ...

Not sure what's wrong, but --enable-cassert is one option. Or maybe 
there's some funny behavior due to collecting timing info?


FWIW the timings on my laptop look like this:

work_mem=40MB      5065ms
work_mem=10MB      5104ms
resultcache=off   13453ms

So a very different behavior from what Pavel reported. But if I rebuild 
with casserts, I get the same massive slowdown, so I guess that's it.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread Pavel Stehule
pá 7. 5. 2021 v 21:06 odesílatel Yura Sokolov 
napsal:

> Pavel Stehule писал 2021-05-07 21:45:
> >>
> >> Samples: 119K of event 'cycles', 4000 Hz, Event count (approx.):
> >> Overhead  Shared Object Symbol
> >> 79.20%  postgres  [.] cache_reduce_memory
> >> 1.94%  [kernel]  [k] native_write_msr_safe
> >> 1.63%  [kernel]  [k] update_cfs_shares
> >> 1.00%  [kernel]  [k] trigger_load_balance
> >> 0.97%  [kernel]  [k] timerqueue_add
> >> 0.51%  [kernel]  [k] task_tick_fair
> >> 0.51%  [kernel]  [k] task_cputime
> >> 0.50%  [kernel]  [k] perf_event_task_tick
> >> 0.50%  [kernel]  [k] update_curr
> >> 0.49%  [kernel]  [k] hrtimer_active
> >>
> >> Regards
> >>
> >> Pavel
>
> It is strange to see cache_reduce_memory itself consumes a lot of CPU.
> It doesn't contain CPU hungry code.
> It calls prepare_probe_slot, that calls some tuple forming. Then
> it calls resultcache_lookup that may call to ResultCacheHash_hash
> and ResultCacheHash_equal. And finally it calls remove_cache_entry.
> I suppose remove_cache_entry should consume most of CPU time since
> it does deallocations.
> And if you compile with --enable-cassert, then remove_cache_entry
> iterates through whole cache hashtable, therefore it reaches
> quadratic complexity easily (or more correct M*N, where M - size
> of a table, N - eviction count).
>

yes, the slowdown is related to debug assertions

Pavel


> regards,
> Yura Sokolov
>


Re: Draft back-branch release notes are up

2021-05-07 Thread Alvaro Herrera
On 2021-May-07, Tom Lane wrote:

> See
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7f4bab7f4a0e42ee9fa14707f726017b7869386b

I suppose you're aware of this, so I just want to get it on record that
this entry

+
+ 
+  Fix use-after-free bug in saving tuples for AFTER
+  triggers (Amit Langote)
+ 

only goes back to 12; the commit to 11 was just to add the test case.
This is obvious if you look at the commit, but if you just look at the
release note entry, that detail might be missed.

The notes look good.

Thanks,

-- 
Álvaro Herrera   Valdivia, Chile
"No renuncies a nada. No te aferres a nada."




Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Andrew Dunstan


On 5/7/21 11:27 AM, Andrew Dunstan wrote:
> On 5/7/21 12:38 AM, Andres Freund wrote:
>> Hi,
>>
>> On 2021-05-07 00:30:11 -0400, Tom Lane wrote:
>>> Andres Freund  writes:
 On 2021-05-06 21:43:32 -0400, Tom Lane wrote:
> That I'm not sure about.  gdb is certainly installed, and thorntail is
> visibly running the current buildfarm client and is configured with the
> correct core_file_glob, and I can report that the crash did leave a 'core'
> file in the data directory (so it's not a case of systemd commandeering
> the core dump).  Seems like core-file collection should've worked
> ... unless maybe it's not covering TAP tests at all?
 I suspect that is it - there's not really a good way for the buildfarm
 client to even know where there could be data directories :(.
>>> Does it need to?  I'm envisioning "find tmp_check -name '$core_file_glob'"
>>> or something along that line.
>> Yea, it'd be doable that way. It'd be a bit harder to associate the core
>> files with specific tests though. But I now checked, and it indeed
>> checks for core files in a specific subset of tests, and that that test
>> only globs inside the passed-in datadir.
>>
> working on it ...
>
>
> cheers
>
>


see



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Andres Freund
Hi,

On 2021-05-07 10:29:58 -0400, Tom Lane wrote:
> I wrote:
> > 1. No wonder we could not reproduce it anywhere else.  I've warned
> > the cfarm admins that their machine may be having hardware issues.
> 
> I heard back from the machine's admin.  The time of the crash I observed
> matches exactly to these events in the kernel log:
> 
> May 07 03:31:39 gcc202 kernel: dm-0: writeback error on inode 2148294407, 
> offset 0, sector 159239256
> May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
> May 07 03:31:39 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, 
> sector 157618896 op 0x1:(WRITE) flags 0x4800 phys_seg 16 prio class 0
> 
> So it's not a mirage.  The admin seems to think it might be a kernel
> bug though.

Isn't this a good reason to have at least some tests run with fsync=on?

It makes a ton of sense for buildfarm animals to disable fsync to
achieve acceptable performance. Having something in there that
nevertheless does some light exercise of the fsync code doesn't seem
bad?

Greetings,

Andres Freund




Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread David Rowley
On Sat, 8 May 2021 at 07:18, Pavel Stehule  wrote:
> yes, the slowdown is related to debug assertions

With USE_ASSERT_CHECKING builds, I did add some code that verifies the
memory tracking is set correctly when evicting from the cache. This
code is pretty expensive as it loops over the entire cache to check
the memory accounting every time we evict something from the cache.
Originally, I had this code only run when some other constant was
defined, but I ended up changing it to compile that code in for all
assert enabled builds.

I considered that it might be too expensive as you can see from the
comment in [1].  I just wanted to get a few machines other than my own
to verify that the memory accounting code was working as expected.
There have been no complaints of any Assert failures yet, so maybe
it's safe to consider either removing the code entirely or just having
it run when some other more specific to purpose constant is defined.
If we did the latter, then I'd have concerns that nothing would ever
run the code to check the memory accounting, that's why I ended up
changing it to run with USE_ASSERT_CHECKING builds.

David

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/executor/nodeResultCache.c#L305




Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread Pavel Stehule
pá 7. 5. 2021 v 21:56 odesílatel David Rowley  napsal:

> On Sat, 8 May 2021 at 07:18, Pavel Stehule 
> wrote:
> > yes, the slowdown is related to debug assertions
>
> With USE_ASSERT_CHECKING builds, I did add some code that verifies the
> memory tracking is set correctly when evicting from the cache. This
> code is pretty expensive as it loops over the entire cache to check
> the memory accounting every time we evict something from the cache.
> Originally, I had this code only run when some other constant was
> defined, but I ended up changing it to compile that code in for all
> assert enabled builds.
>
> I considered that it might be too expensive as you can see from the
> comment in [1].  I just wanted to get a few machines other than my own
> to verify that the memory accounting code was working as expected.
> There have been no complaints of any Assert failures yet, so maybe
> it's safe to consider either removing the code entirely or just having
> it run when some other more specific to purpose constant is defined.
> If we did the latter, then I'd have concerns that nothing would ever
> run the code to check the memory accounting, that's why I ended up
> changing it to run with USE_ASSERT_CHECKING builds.
>

I understand. I think this is too slow for generic assertions, because the
overhead is about 50x.

But I understand that it may be necessary to keep this code active for some
time.

Regards

Pavel


> David
>
> [1]
> https://github.com/postgres/postgres/blob/master/src/backend/executor/nodeResultCache.c#L305
>


JSON doc example (matchiness)

2021-05-07 Thread Erik Rijkers


The JSON doc has this example (to show the need for double backslash):

$ ? (@ like_regex "^\\d+$")


The example is not exactly wrong, and it can be cast to jsonpath, but as-is
it can never match anything.


I think it'd be more helpful to provide an example that actually matches
when the user gives it a quick trial.


Let's change it to something like:

$.* ? (@ like_regex "^\\d+$")
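
A quick way to see the difference (the object literal is just an example):

-- Against a typical object root, like_regex on a non-string is unknown,
-- so the original form returns nothing:
SELECT jsonb_path_query('{"a": "123", "b": "abc"}', '$ ? (@ like_regex "^\\d+$")');
-- With $.* the filter is applied to each member value, so "123" matches:
SELECT jsonb_path_query('{"a": "123", "b": "abc"}', '$.* ? (@ like_regex "^\\d+$")');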


Patch attached.

thanks,

Erik Rijkers

--- ./doc/src/sgml/func.sgml.orig	2021-05-07 21:52:53.577952054 +0200
+++ ./doc/src/sgml/func.sgml	2021-05-07 21:53:33.078578522 +0200
@@ -17132,7 +17132,7 @@
  backslashes you want to use in the regular expression must be doubled.
  For example, to match strings that contain only digits:
 
-$ ? (@ like_regex "^\\d+$")
+$.* ? (@ like_regex "^\\d+$")
 
 



Re: Draft back-branch release notes are up

2021-05-07 Thread Tom Lane
Alvaro Herrera  writes:
> I suppose you're aware of this, so I just want to get it on record that
> this entry

> +
> + 
> +  Fix use-after-free bug in saving tuples for AFTER
> +  triggers (Amit Langote)
> + 

> only goes back to 12; the commit to 11 was just to add the test case.
> This is obvious if you look at the commit, but if you just look at the
> release note entry, that detail might be missed.

Good point.  I'll make sure this doesn't get into the v11 notes
(which I probably would have done if you didn't point it out,
so thanks).

regards, tom lane




Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Tom Lane
Andres Freund  writes:
> Isn't this a good reason to have at least some tests run with fsync=on?

Why?

I can certainly see an argument for running some buildfarm animals
with fsync on (for all tests).  I don't see a reason for forcing
them all to run some tests that way; and if I were going to do that,
I doubt that 008_fsm_truncation.pl would be the one I would pick.
I think it's nothing but sloppiness that that one is out of step with
all the rest.

IMO, if a buildfarm owner sets fsync = off, they mean off.
They don't mean "maybe".

regards, tom lane




Re: batch fdw insert bug (Postgres 14)

2021-05-07 Thread Tomas Vondra



On 5/7/21 2:46 PM, houzj.f...@fujitsu.com wrote:



  I am testing new features in Postgres 14, and I found bug
EXPLAIN ANALYZE VERBOSE  for insert to FDW table with batch_size 1000 returns
---
  Insert on public.vzdalena_tabulka2  (cost=0.00..175.00 rows=0 width=0) 
(actual time=30.269..30.270 rows=0 loops=1)
    Remote SQL: 
\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F
    Batch Size: 1000
    ->  Function Scan on pg_catalog.generate_series g  (cost=0.00..175.00 
rows=1 width=36) (actual time=0.453..1.988 rows=10
          Output: g.i, ('AHOJ'::text || (g.i)::text)
          Function Call: generate_series(1, 1)
  Planning Time: 0.075 ms
  Execution Time: 31.032 ms
(8 rows)
reproducer


I can reproduce the issue and did some basic analysis on it.

The "Remote SQL" is built from the following code:


char   *sql = strVal(list_nth(fdw_private,

  FdwModifyPrivateUpdateSql));

ExplainPropertyText("Remote SQL", sql, es);
---

It uses the query string stored in the fdw_private list.
However, "fmstate->query" will also point to the string in fdw_private,
via postgresBeginForeignModify --> create_foreign_modify --> "fmstate->query = 
query;"

And in execute_foreign_modify(), "fmstate->query" will be freed when rebuilding 
the query string to do the batch insert, like the following:


if (operation == CMD_INSERT && fmstate->num_slots != *numSlots)
{
...
 /* Build INSERT string with numSlots records in its VALUES clause. */
 initStringInfo(&sql);
 rebuildInsertSql(&sql, fmstate->orig_query, fmstate->values_end,
  fmstate->p_nums, *numSlots - 1)
**  pfree(fmstate->query);
 fmstate->query = sql.data;


So it outputs the freed pointer as "Remote SQL".

As for the fix: since the query string can be rebuilt depending on numSlots,
which query string should be output?
Should we just output the original query string, as in the attached patch?
Or should we output the last one?



Yeah. The problem is we build fdw_private list once (which references 
the SQL string), and during execution we may pfree() it. But then 
EXPLAIN ANALYZE gets the same fdw_private list and tries to use the SQL 
string which we pfreed() already.


I think the simplest fix is simply to pstrdup() the query when building 
fmstate in create_foreign_modify. We've already been doing that for 
orig_query anyway. That seems cleaner than printing the last query we 
build (which would be confusing I think).


I've pushed a fix doing that. We only need that for INSERT queries, and 
we might even restrict that to cases with batching if needed.
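
In case it's useful to anyone exercising this code path later, a loopback setup 
along these lines reproduces the original symptom on an unpatched build (all 
object names here are mine, not the reporter's, and the loopback server is 
assumed to point at the current database):

CREATE EXTENSION postgres_fdw;
CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
CREATE TABLE batch_target (a int, b text);
CREATE FOREIGN TABLE batch_target_ft (a int, b text)
  SERVER loopback OPTIONS (table_name 'batch_target', batch_size '1000');
EXPLAIN (ANALYZE, VERBOSE)
  INSERT INTO batch_target_ft SELECT i, 'AHOJ' || i FROM generate_series(1, 10000) AS g(i);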



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: batch fdw insert bug (Postgres 14)

2021-05-07 Thread Pavel Stehule
pá 7. 5. 2021 v 22:43 odesílatel Tomas Vondra 
napsal:

>
> On 5/7/21 2:46 PM, houzj.f...@fujitsu.com wrote:
> >
> >>   I am testing new features in Postgres 14, and I found bug
> >> EXPLAIN ANALYZE VERBOSE  for insert to FDW table with batch_size 1000
> returns
> >>
> ---
> >>   Insert on public.vzdalena_tabulka2  (cost=0.00..175.00 rows=0
> width=0) (actual time=30.269..30.270 rows=0 loops=1)
> >> Remote SQL:
> \x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F
> >> Batch Size: 1000
> >> ->  Function Scan on pg_catalog.generate_series g
>  (cost=0.00..175.00 rows=1 width=36) (actual time=0.453..1.988 rows=10
> >>   Output: g.i, ('AHOJ'::text || (g.i)::text)
> >>   Function Call: generate_series(1, 1)
> >>   Planning Time: 0.075 ms
> >>   Execution Time: 31.032 ms
> >> (8 rows)
> >> reproducer
> >
> > I can reproduce the issue and did some basic analysis on it.
> >
> > The "Remote SQL" is built from the following code:
> >
> > 
> >   char   *sql = strVal(list_nth(fdw_private,
> >
>FdwModifyPrivateUpdateSql));
> >
> >   ExplainPropertyText("Remote SQL", sql, es);
> > ---
> >
> > It use the query string stored in list fdw_private.
> > However, the "fmstate->query" will also point to the string in
> fdw_private,
> > by  postgresBeginForeignModify --> create_foreign_modify -->
> "fmstate->query = query;"
> >
> > And in execute_foreign_modify(), " fmstate->query "  will be freed when
> rebuild the query
> > string to do the batch insert like the following:
> >
> > 
> > if (operation == CMD_INSERT && fmstate->num_slots != *numSlots)
> > {
> > ...
> >  /* Build INSERT string with numSlots records in its VALUES
> clause. */
> >  initStringInfo(&sql);
> >  rebuildInsertSql(&sql, fmstate->orig_query, fmstate->values_end,
> >   fmstate->p_nums, *numSlots - 1)
> > **  pfree(fmstate->query);
> >  fmstate->query = sql.data;
> > 
> >
> > So, it output the freed pointer as "Remote SQL".
> >
> > For the fix.
> > The query string could be rebuilt depending on the numSlots,
> > which query string should be output ?
> > should we just output the original query string like the attached patch ?
> > Or should we output the last one?
> >
>
> Yeah. The problem is we build fdw_private list once (which references
> the SQL string), and during execution we may pfree() it. But then
> EXPLAIN ANALYZE gets the same fdw_private list and tries to use the SQL
> string which we pfreed() already.
>
> I think the simplest fix is simply to pstrdup() the query when building
> fmstate in create_foreign_modify. We've already been doing that for
> orig_query anyway. That seems cleaner than printing the last query we
> build (which would be confusing I think).
>
> I've pushed a fix doing that. We only need that for INSERT queries, and
> we might even restrict that to cases with batching if needed.
>

Great

Thank you

Pavel


>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread David Rowley
On Sat, 8 May 2021 at 08:18, Pavel Stehule  wrote:
>
> pá 7. 5. 2021 v 21:56 odesílatel David Rowley  napsal:
>> With USE_ASSERT_CHECKING builds, I did add some code that verifies the
>> memory tracking is set correctly when evicting from the cache. This
>> code is pretty expensive as it loops over the entire cache to check
>> the memory accounting every time we evict something from the cache.
>> Originally, I had this code only run when some other constant was
>> defined, but I ended up changing it to compile that code in for all
>> assert enabled builds.
>>
>> I considered that it might be too expensive as you can see from the
>> comment in [1].  I just wanted to get a few machines other than my own
>> to verify that the memory accounting code was working as expected.
>> There have been no complaints of any Assert failures yet, so maybe
>> it's safe to consider either removing the code entirely or just having
>> it run when some other more specific to purpose constant is defined.
>> If we did the latter, then I'd have concerns that nothing would ever
>> run the code to check the memory accounting, that's why I ended up
>> changing it to run with USE_ASSERT_CHECKING builds.
>
>
> I understand. I think this is too slow for generic assertions, because the 
> overhead is about 50x.

I didn't expect it would show up quite that much.  If you scaled the
test up a bit more and increased work_mem further, then it would be
even more than 50x.

At one point when I was developing the patch, I had two high water
marks for cache memory. When we reached the upper of the two marks,
I'd reduce the memory down to the lower of two marks.  The lower of
the two marks was set to 98% of the higher mark.  In the end, I got
rid of that as I didn't really see what extra overhead there was from
just running the eviction code every time we require another byte.
However, if we did have that again, then the memory checking could
just be done when we run the eviction code. We'd then need to consume
that 2% more memory before it would run again.

My current thinking is that I don't really want to add that complexity
just for some Assert code. I'd only want to do it if there was another
valid reason to.

Another thought I have is that maybe it would be ok just to move the
memory accounting debug code so it only runs once in
ExecEndResultCache.  I'm struggling to imagine that if the memory
tracking did go out of whack, that the problem would have accidentally
fixed itself by the time we got to ExecEndResultCache().  I guess even
if the accounting was counting far too much memory and we'd evicted
everything from the cache to try and get the memory usage down, we'd
still find the problem during ExecEndResultCache(), even if the cache
had become completely empty as a result.

David




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-05-07 Thread James Coleman
On Sat, Apr 24, 2021 at 6:25 AM David Rowley  wrote:
>
> On Wed, 14 Apr 2021 at 05:40, James Coleman  wrote:
> > ...and here's a draft patch. I can take this to a new thread if you'd
> > prefer; the one here already got committed, on the other hand this is
> > pretty strongly linked to this discussion, so I figured it made sense
> > to post it here.
>
> I only glanced at this when you sent it and I was confused about how
> it works. The patch didn't look like how I imagined it should and I
> couldn't see how the executor part worked without any changes.
>
> Anyway, I decided to clear up my confusion tonight and apply the patch
> to figure all this out...  unfortunately, I see why I was confused
> now. It actually does not work at all :-(
>
> You're still passing the <> operator to get_op_hash_functions(), which
> of course is not hashable, so we just never do hashing for NOT IN.
>
> All your tests pass just fine because the standard non-hashed code path is 
> used.

I was surprised when it "just worked" too; I should have stopped to
verify the path was being taken. Egg on my face for not doing so. :(

> My idea was that you'd not add any fields to ScalarArrayOpExpr and for
> soaps with useOr == false, check if the negator of the operator is
> hashable. If so set the opfuncid to the negator operator's function.
>
> I'm a bit undecided if it's safe to set the opfuncid to the negator
> function.  If anything were to set that again based on the opno then
> it would likely set it to the wrong thing. We can't go changing the
> opno either because EXPLAIN would display the wrong thing.

I don't personally see a reason why this is a problem. But I also
don't know that I have enough knowledge of the codebase to say that
definitively.

> Anyway, I've attached what I ended up with after spending a few hours
> looking at this.

Overall I like this approach.

One thing I think we could clean up:

+ booluseOr;  /* use OR or AND semantics? */
...
+ /* useOr == true means an IN clause, useOr == false is NOT IN */

I'm wondering if the intersection of these two lines implies that
useOr isn't quite the right name here. Perhaps something like
"negated"?

On the other hand (to make the counterargument) useOr would keep it
consistent with the other ones.

The other thing I got to thinking about was = ALL. It doesn't get
turned into a hash op because the negator of = isn't hashable. I think
it's correct that that's the determining factor, because I can't
imagine what it would mean to hash <>. But...I wanted to confirm I
wasn't missing something. We don't have explicit tests for that case,
but I'm not sure it's necessary either.
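
For context, = ALL over a constant array behaves like an AND chain; a few quick
checks of my own:

SELECT 1 = ALL('{1,1,1}'::int[]);   -- true
SELECT 1 = ALL('{1,2}'::int[]);     -- false
SELECT 1 = ALL('{1,NULL}'::int[]);  -- null, like 1 = 1 AND 1 = NULL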

> I pretty much used all your tests as is with the exception of removing
> one that looked duplicated.

Sounds good.

James




Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread Tomas Vondra

On 5/7/21 11:04 PM, David Rowley wrote:

On Sat, 8 May 2021 at 08:18, Pavel Stehule  wrote:


pá 7. 5. 2021 v 21:56 odesílatel David Rowley  napsal:

With USE_ASSERT_CHECKING builds, I did add some code that verifies the
memory tracking is set correctly when evicting from the cache. This
code is pretty expensive as it loops over the entire cache to check
the memory accounting every time we evict something from the cache.
Originally, I had this code only run when some other constant was
defined, but I ended up changing it to compile that code in for all
assert enabled builds.

I considered that it might be too expensive as you can see from the
comment in [1].  I just wanted to get a few machines other than my own
to verify that the memory accounting code was working as expected.
There have been no complaints of any Assert failures yet, so maybe
it's safe to consider either removing the code entirely or just having
it run when some other more specific to purpose constant is defined.
If we did the latter, then I'd have concerns that nothing would ever
run the code to check the memory accounting, that's why I ended up
changing it to run with USE_ASSERT_CHECKING builds.



I understand. I think this is too slow for generic assertions, because the 
overhead is about 50x.


I didn't expect it would show up quite that much.  If you scaled the
test up a bit more and increased work_mem further, then it would be
even more than 50x.

At one point when I was developing the patch, I had two high water
marks for cache memory. When we reached the upper of the two marks,
I'd reduce the memory down to the lower of two marks.  The lower of
the two marks was set to 98% of the higher mark.  In the end, I got
rid of that as I didn't really see what extra overhead there was from
just running the eviction code every time we require another byte.
However, if we did have that again, then the memory checking could
just be done when we run the eviction code. We'd then need to consume
that 2% more memory before it would run again.

My current thinking is that I don't really want to add that complexity
just for some Assert code. I'd only want to do it if there was another
valid reason to.



Agreed. I think this approach to eviction (i.e. evicting more than you 
need) would be useful if the actual eviction code was expensive, and 
doing it in a "batch" would make it significantly cheaper. But I don't 
think "asserts are expensive" is a good reason for it.



Another thought I have is that maybe it would be ok just to move
memory accounting debug code so it only runs once in
ExecEndResultCache.  I struggling to imagine that if the memory
tracking did go out of whack, that the problem would have accidentally
fixed itself by the time we got to ExecEndResultCache().  I guess even
if the accounting was counting far too much memory and we'd evicted
everything from the cache to try and get the memory usage down, we'd
still find the problem during ExecEndResultCache(), even if the cache
had become completely empty as a result.



I don't think postponing the debug code until much later is a great 
idea. When something goes wrong it's good to know ASAP, otherwise it's 
much more difficult to identify the issue.


Not sure we need to do something here - for regression tests this is not 
an issue, because those generally work with small data sets. And if you 
run with asserts on large amounts of data, I think this is acceptable.


I had the same dilemma with the new BRIN index opclasses, which also 
have some extensive and expensive assert checks - for the regression 
tests that's fine, and it proved very useful during development.


I have considered enabling those extra checks only on request somehow, 
but I'd bet no one would do that and I'd forget it exists pretty soon.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Processing btree walks as a batch to parallelize IO

2021-05-07 Thread Greg Stark
On Fri, 9 Apr 2021 at 16:58, Tomas Vondra  wrote:
>
>
>
> On 4/9/21 7:33 PM, James Coleman wrote:

> > A specific area where this is particularly painful is btree index reads.
> > Walking the tree to leaf pages isn't naturally prefetchable, and so for
> > each level you pay the random page cost. Of course higher levels in the
> > tree will almost certainly exhibit emergent behavior such that they
> > (just by fact of the LRU caching) will be in the buffer cache, but for a
> > large index lower levels likely won't be.

We've talked before about buffering inserts even just for disk-based
indexes. Much like how GIN buffers inserts and periodically flushes
them out. We talked about doing a local buffer in each session since
no other session even needs to see these buffered inserts until commit
anyways. And we can more efficiently merge in multiple keys at once
than doing them one by one.

But that was just for disk i/o. For something longer-latency it would
be an even bigger win. Buffer the inserted keys in local memory in
case you do lookups in this same session and start the i/o to insert
the rows into the index but handle that in the background or in a
separate process without blocking the transaction until commit.

> What do you consider a large index level?
>
> Consider a 1TB table, with just a single UUID column - that's ~25B rows,
> give or take. Real tables will have more columns, so this seems like a
> reasonable model of the largest number of rows per relation. With ~32B
> per index tuple, that's about 100M leaf pages, and with ~256 branches
> per internal page, that's still only ~5 levels. I think it's quite rare
> to see indexes with more than 6 or 7 levels.

That's a good model for a well-designed schema with an efficient
index. There are plenty of less-than-optimal schemas with indexes on
longer column lists or fairly large text fields.

-- 
greg




Re: Processing btree walks as a batch to parallelize IO

2021-05-07 Thread Peter Geoghegan
On Fri, May 7, 2021 at 3:34 PM Greg Stark  wrote:
> We've talked before about buffering inserts even just for disk-based
> indexes. Much like how GIN buffers inserts and periodically flushes
> them out. We talked about doing a local buffer in each session since
> no other session even needs to see these buffered inserts until commit
> anyways. And we can more efficiently merge in multiple keys at once
> than doing them one by one.

Mark Callaghan's high level analysis of the trade-offs here is worth a
read, too.

> That's a good model for a well-designed schema with an efficient
> index. There are plenty of less-than-optimal schemas with indexes on
> longer column lists or fairly large text fields

Suffix truncation can take care of this -- all you really need is a
minimally distinguishing separator key to delineate which values
belong on which page one level down. It is almost always possible for
leaf page splits to find a way to make the new high key (also the key
to be inserted in the parent level) much smaller than your typical
key. Granted, we don't have what I've called "classic" suffix
truncation (within text column truncation) yet, so this analysis isn't
going to work with long text keys (we only truncate at the attribute
granularity currently).

Even if we're pessimistic about suffix truncation, the logarithmic
rate of growth still wins -- Tomas' analysis is sound. You cannot
realistically make a Postgres B-Tree have more than about 1% of all
pages as internal pages, unless you make the indexed keys ludicrously
large -- as in several hundred bytes each (~0.5% is typical in
practice). I think that 6 levels is very pessimistic, even with a
massive B-Tree with weirdly large keys. My mental model for internal
pages is that they are practically guaranteed to be in shared_buffers
at all times, which is about as accurate as any generalization like
that ever can be.

I once wrote a test harness that deliberately created a B-Tree that
was as tall as possible -- something with the largest possible index
tuples on the leaf level (had to disable TOAST for this). I think that
it was about 7 or 8 levels deep. The CPU overhead of the test case
made it excruciatingly slow, but it wasn't I/O bound at all (pretty
sure it all fitted in shared_buffers).

-- 
Peter Geoghegan




Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Noah Misch
On Fri, May 07, 2021 at 01:18:19PM -0400, Tom Lane wrote:
> Realizing that 9989d37d prevents the assertion failure, I went
> to see if thorntail had shown EIO failures without assertions.
> Looking back 180 days, I found these:
> 
>   sysname  |    branch     |      snapshot       |       stage        | l
> -----------+---------------+---------------------+--------------------+---
>  thorntail | HEAD          | 2021-03-19 21:28:15 | recoveryCheck      | 2021-03-20 00:48:48.117 MSK [4089174:11] 008_fsm_truncation.pl PANIC:  could not fdatasync file "00010002": Input/output error
>  thorntail | HEAD          | 2021-04-06 16:08:10 | recoveryCheck      | 2021-04-06 19:30:54.103 MSK [3355008:11] 008_fsm_truncation.pl PANIC:  could not fdatasync file "00010002": Input/output error
>  thorntail | REL9_6_STABLE | 2021-04-12 02:38:04 | pg_basebackupCheck | pg_basebackup: could not fsync file "00010013": Input/output error
> 
> So indeed the kernel-or-hardware problem is affecting other branches.

Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached to
prevent fsync from reaching the kernel.  Hopefully, that will make the
hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl
wouldn't avoid this, because fsync=off doesn't affect syncs outside the
backend.)
/* gcc -fPIC -shared never_sync.c -o never_sync.so */

int
fsync(int fd)
{
return 0;
}
int
fdatasync(int fd)
{
return fsync(fd);
}


Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-05-07 Thread David Rowley
On Sat, 8 May 2021 at 09:15, James Coleman  wrote:
>
> On Sat, Apr 24, 2021 at 6:25 AM David Rowley  wrote:
> > I'm a bit undecided if it's safe to set the opfuncid to the negator
> > function.  If anything were to set that again based on the opno then
> > it would likely set it to the wrong thing. We can't go changing the
> > opno either because EXPLAIN would display the wrong thing.
>
> I don't personally see a reason why this is a problem. But I also
> don't know that I have enough knowledge of the codebase to say that
> definitively.

The reason for my concern is that if the opfuncid is set to
InvalidOid, set_sa_opfuncid() always sets the ScalarArrayOpExpr's
opfuncid to get_opcode(opexpr->opno).  I'm effectively setting the
opfuncid to get_opcode(get_negator(opexpr->opno)), so if anything were to
reset the ScalarArrayOpExpr's opfuncid to InvalidOid, then
set_sa_opfuncid() would repopulate it with the wrong value.

Maybe the solution there is to teach set_sa_opfuncid() about our
hashing NOT IN trick and have it check if (!opexpr->useOr &&
OidIsValid(opexpr->hashfuncid)) and if that's true then do
opexpr->opfuncid = get_opcode(get_negator(opexpr->opno)).  Then we
could just not bother setting opfuncid in
convert_saop_to_hashed_saop_walker().


> > Anyway, I've attached what I ended up with after spending a few hours
> > looking at this.
>
> Overall I like this approach.
>
> One thing I think we could clean up:
>
> + booluseOr;  /* use OR or AND semantics? */
> ...
> + /* useOr == true means an IN clause, useOr == false is NOT IN */
>
> I'm wondering if the intersection of these two lines implies that
> useOr isn't quite the right name here. Perhaps something like
> "negated"?

I'm not sure I want to go changing that.  The whole IN() / NOT IN()
behaviour regarding NULLs all seems pretty weird until you mentally
replace a IN (1,2,3) with a = 1 OR a = 2 OR a = 3.  And for the a NOT
IN(1,2,3) case, a <> 1 AND a <> 2 AND a <> 3.  People can make a bit
more sense of the weirdness of NULLs with NOT IN when they mentally
convert their expression like that.  I think having that in code is
useful too. Any optimisations that are added must match those
semantics.
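
For anyone reading along later, a tiny illustration of that NULL weirdness (my
own examples, not part of the patch):

SELECT 2 IN (1, NULL);      -- null, i.e. 2 = 1 OR 2 = NULL
SELECT 2 NOT IN (1, NULL);  -- null, i.e. 2 <> 1 AND 2 <> NULL
SELECT 1 NOT IN (1, NULL);  -- false, since 1 <> 1 is already false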

> The other thing I got to thinking about was = ALL. It doesn't get
> turned into a hash op because the negator of = isn't hashable. I think
> it's correct that that's the determining factor, because I can't
> imagine what it would mean to hash <>. But...I wanted to confirm I
> wasn't missing something. We don't have explicit tests for that case,
> but I'm not sure it's necessary either.

It's important to think of other cases, I just don't think there's any
need to do anything for that one.  Remember that we have the
restriction of requiring a set of Consts, so for that case to be met,
someone would have to write something like: col =
ALL('{1,1,1,1,1,1,1,1}'::int[]);  I think if anyone comes along
complaining that a query containing that is not as fast as they'd like
then we might tell them that they should just use: col = 1. A sanity
checkup might not go amiss either.

David




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-05-07 Thread Tom Lane
David Rowley  writes:
> On Sat, 8 May 2021 at 09:15, James Coleman  wrote:
>> On Sat, Apr 24, 2021 at 6:25 AM David Rowley  wrote:
>>> I'm a bit undecided if it's safe to set the opfuncid to the negator
>>> function.  If anything were to set that again based on the opno then
>>> it would likely set it to the wrong thing. We can't go changing the
>>> opno either because EXPLAIN would display the wrong thing.

>> I don't personally see a reason why this is a problem. But I also
>> don't know that I have enough knowledge of the codebase to say that
>> definitively.

> The reason for my concern is that if the opfuncid is set to
> InvalidOid, set_sa_opfuncid() always sets the ScalarArrayOpExpr's
> opfuncid to get_opcode(opexpr->opno).

I will personally veto any design that involves setting opfuncid to
something that doesn't match the opno.  That's just horrid, and it
will break something somewhere, either immediately or down the road.

I don't immediately see why you can't add an "invert" boolean flag to
ScalarArrayOpExpr and let the executor machinery deal with this.  That'd
have the advantage of not having to depend on there being a negator.

regards, tom lane




Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Michael Paquier
On Fri, May 07, 2021 at 04:30:00PM -0400, Tom Lane wrote:
> I can certainly see an argument for running some buildfarm animals
> with fsync on (for all tests).  I don't see a reason for forcing
> them all to run some tests that way; and if I were going to do that,
> I doubt that 008_fsm_truncation.pl would be the one I would pick.
> I think it's nothing but sloppiness that that one is out of step with
> all the rest.

My take on this point is that using the configuration that can be
enforced for each animal would be enough.  I manage a small animal and
this stuff can take a while to flush some data.

Worth noting that using fsync=on has not been discussed on the
original thread, and I don't see why that's necessary:
https://www.postgresql.org/message-id/flat/CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn%3D_9A%40mail.gmail.com
So I would vote for removing it in this case.
--
Michael


signature.asc
Description: PGP signature


Re: Small issues with CREATE TABLE COMPRESSION

2021-05-07 Thread Michael Paquier
On Thu, May 06, 2021 at 09:04:57PM +0900, Michael Paquier wrote:
> Yeah, I agree that this is an improvement, so let's fix this.

Just noticed that this was not applied yet, so done while I was
looking at this thread again.
--
Michael


signature.asc
Description: PGP signature


Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-05-07 Thread James Coleman
On Fri, May 7, 2021 at 9:16 PM Tom Lane  wrote:
>
> David Rowley  writes:
> > On Sat, 8 May 2021 at 09:15, James Coleman  wrote:
> >> On Sat, Apr 24, 2021 at 6:25 AM David Rowley  wrote:
> >>> I'm a bit undecided if it's safe to set the opfuncid to the negator
> >>> function.  If anything were to set that again based on the opno then
> >>> it would likely set it to the wrong thing. We can't go changing the
> >>> opno either because EXPLAIN would display the wrong thing.
>
> >> I don't personally see a reason why this is a problem. But I also
> >> don't know that I have enough knowledge of the codebase to say that
> >> definitively.
>
> > The reason for my concern is that if the opfuncid is set to
> > InvalidOid, set_sa_opfuncid() always sets the ScalarArrayOpExpr's
> > opfuncid to get_opcode(opexpr->opno).
>
> I will personally veto any design that involves setting opfuncid to
> something that doesn't match the opno.  That's just horrid, and it
> will break something somewhere, either immediately or down the road.

This is the "project design" style/policy I don't have. Thanks.

> I don't immediately see why you can't add an "invert" boolean flag to
> ScalarArrayOpExpr and let the executor machinery deal with this.  That'd
> have the advantage of not having to depend on there being a negator.

Don't we need to have a negator to be able to look up the proper hash
function? At least somewhere in the process you'd have to convert from
looking up the <> op to looking up the = op and then setting the
"invert" flag.

James




Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Michael Paquier
On Fri, May 07, 2021 at 04:42:46PM +1200, Thomas Munro wrote:
> Oh, and I see that 13 has 9989d37d "Remove XLogFileNameP() from the
> tree" to fix this exact problem.

I don't see that we'd be able to get a redesign of this area safe
enough for a backpatch, but perhaps we (I?) had better put some extra
effort in back-patching this commit while keeping XLogFileNameP()
around for compatibility?  How do people feel about that?
--
Michael


signature.asc
Description: PGP signature


Re: JSON doc example (matchiness)

2021-05-07 Thread Michael Paquier
On Fri, May 07, 2021 at 10:18:44PM +0200, Erik Rijkers wrote:
> 
> The JSON doc has this example (to show the need for double backslash):
> 
> $ ? (@ like_regex "^\\d+$")
> 
> 
> The example is not wrong exactly, and can be cast to jsonpath, but as-is can
> never match anything.
> 
> I think it'd be helpful to provide that example so that it more probably
> matches when the user does a quick trial.
> 
> Llet's change it to something like:
> 
> $.* ? (@ like_regex "^\\d+$")

Ah, I see.  What you are saying here is that we match the regex against
the whole JSON document, which is pretty useless, and you are suggesting
to change things so that we match against the values under the first-level
keys.  Makes sense.

This paragraph of the docs says:
"For example, to match strings that contain only digits"
Could we be more precise here?  "strings" looks too generic to
me in this context when it actually refers to a set of key paths in
a JSON blob.
--
Michael


signature.asc
Description: PGP signature


Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-05-07 Thread James Coleman
On Fri, May 7, 2021 at 8:38 PM David Rowley  wrote:
>
> It's important to think of other cases, I just don't think there's any
> need to do anything for that one.  Remember that we have the
> restriction of requiring a set of Consts, so for that case to be met,
> someone would have to write something like: col =
> ALL('{1,1,1,1,1,1,1,1}'::int[]);  I think if anyone comes along
> complaining that a query containing that is not as fast as they'd like
> then we might tell them that they should just use: col = 1. A sanity
> checkup might not go amiss either.

I wasn't concerned with trying to optimize this case (I don't think we
can anyway, at least not without adding new work, like de-duplicating
the array first). Though I do hope that someday I'll/we'll get around
to getting the stable subexpressions caching patch finished, and then
this will be able to work for more than constant arrays.

I just wanted to confirm we'd thought through the cases we can't
handle to ensure we're not accidentally covering them.

James




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-05-07 Thread David Rowley
On Sat, 8 May 2021 at 13:37, James Coleman  wrote:
>
> On Fri, May 7, 2021 at 9:16 PM Tom Lane  wrote:
> > I don't immediately see why you can't add an "invert" boolean flag to
> > ScalarArrayOpExpr and let the executor machinery deal with this.  That'd
> > have the advantage of not having to depend on there being a negator.
>
> Don't we need to have a negator to be able to look up the proper has
> function? At least somewhere in the process you'd have to convert from
> looking up the <> op to looking up the = op and then setting the
> "invert" flag.

Yeah, we *do* need to ensure there's a negator in the planner as we
need to use it during hash probes.  It's no good checking the hash
bucket we landed on does match with the <> operator's function.  We
won't find many matches that way!

I'm not opposed to adding some new field if that's what it takes.  I'd
imagine the new field will be something like negfuncid which will be
InvalidOid unless the hash function is set and useOr == false

David




Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Andres Freund
Hi,

On 2021-05-07 17:14:18 -0700, Noah Misch wrote:
> Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached to
> prevent fsync from reaching the kernel.  Hopefully, that will make the
> hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl
> wouldn't avoid this, because fsync=off doesn't affect syncs outside the
> backend.)

Not sure how reliable that is - there's other paths that could return an
error, I think. If the root cause is the disk responding weirdly to
write cache flushes, you could tell the kernel that that the disk has no
write cache (e.g. echo write through > /sys/block/sda/queue/write_cache).

Greetings,

Andres Freund




Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Tom Lane
Andres Freund  writes:
> On 2021-05-07 17:14:18 -0700, Noah Misch wrote:
>> Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached to
>> prevent fsync from reaching the kernel.  Hopefully, that will make the
>> hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl
>> wouldn't avoid this, because fsync=off doesn't affect syncs outside the
>> backend.)

> Not sure how reliable that is - there's other paths that could return an
> error, I think. If the root cause is the disk responding weirdly to
> write cache flushes, you could tell the kernel that that the disk has no
> write cache (e.g. echo write through > /sys/block/sda/queue/write_cache).

I seriously doubt Noah has root on that machine.

More to the point, the admin told me it's a VM (or LDOM, whatever that is)
under a Solaris host, so there's no direct hardware access going on
anyway.  He didn't say in so many words, but I suspect the reason he's
suspecting kernel bugs is that there's nothing going wrong so far as the
host OS is concerned.

regards, tom lane




Re: Have I found an interval arithmetic bug?

2021-05-07 Thread Zhihong Yu
On Tue, Apr 13, 2021 at 10:55 AM Bryn Llewellyn  wrote:

> On 12-Apr-2021, at 17:25, Bruce Momjian  wrote:
>
> On Mon, Apr 12, 2021 at 05:20:43PM -0700, Bryn Llewellyn wrote:
>
> I’d argue that the fact that this:
>
> ('0.3 months'::interval) + ('0.7 months'::interval)
>
> Is reported as '30 days' and not '1 month' is yet another
> bug—precisely because of what I said in my previous email (sorry
> that I forked the thread) where I referred to the fact that, in the
> right test, adding 1 month gets a different answer than adding 30
> days.
>
>
> Flowing _up_ is what these functions do:
>
> \df *justify*
>   List of functions
>   Schema   |   Name   | Result data type | Argument data types |
> Type
>
> +--+--+-+--
> pg_catalog | justify_days | interval | interval|
> func
> pg_catalog | justify_hours| interval | interval|
> func
> pg_catalog | justify_interval | interval | interval|
> func
>
> Yet another convincing reason to get rid of this flow down
> business altogether.
>
>
> We can certainly get rid of all downflow, which in the current patch is
> only when fractional internal units are specified.
>
> If some application wants to model flow-down, then it can do so with
> trivial programming and full control over its own definition of the
> rules.
>
>
> *“Yes please!” re Bruce’s “We can certainly get rid of all downflow, which
> in the current patch is only when fractional internal units are specified.”*
>
> Notice that a user who creates interval values explicitly only by using
> “make_interval()” will see no behavior change.
>
> There’s another case of up-flow. When you subtract one timestamp value
> from another, and when they’re far enough apart, then the (internal
> representation of the) resulting interval value has both a seconds
> component and a days component. (But never, in my tests, a months
> component.) I assume that the implementation first gets the difference
> between the two timestamp values in seconds using (the moral equivalent of)
> “extract epoch”. And then, if this is greater than 24*60*60, it implements
> up-flow using the “rule-of-24”—never mind that this means that if you add
> the answer back to the timestamp value that you subtracted, then you might
> not get the timestamp value from which you subtracted. (This happens around
> a DST change and has been discussed earlier in the thread.)
>
> The purpose of these three “justify” functions is dubious. I think that
> it’s best to think of the 3-component interval vector like an [x, y, z]
> vector in 3d geometric space, where the three coordinates are not mutually
> convertible because each has unique semantics. This point has been
> rehearsed many times in this long thread.
>
> Having said this, it isn’t hard to understand the rules that the functions
> implement. And, most importantly, their use is voluntary. They are, though,
> no more than shipped and documented wrappers for a few special cases. A
> user could so easily write their own function like this:
>
> 1. Compute the values of the three components of the internal
> representation of the passed-in interval value using the “extract” feature
> and some simple arithmetic.
>
> 2. Derive the [minutes, days, seconds] values of a new representation
> using whatever rules you feel for.
>
> 3. Use these new values to create the return interval value.
>
> For example, I might follow a discipline to use interval values that have
> only one of the three components of the internal representation non-zero.
> It’s easy to use the “domain” feature for this. (I can use these in any
> context where I can use the shipped interval.) I could then write a
> function to convert a “pure seconds” value of my_interval to a “pure years”
> value. And I could implement my own rules:
>
> — Makes sense only for a large number of seconds that comes out to at
> least five years (else assert failure).
>
> — Converts seconds to years using the rule that 1 year is, on
> average, 365.25*24*60*60 seconds, and then truncates it.
>
> There’s no shipped function that does this, and this makes me suspect that
> I’d prefer to roll my own for any serious purpose.
>
> The existence of the three “justify” functions is, therefore, harmless.
>
>
Bruce / Tom:
Can we revisit this topic ?

Cheers


Re: Have I found an interval arithmetic bug?

2021-05-07 Thread Bruce Momjian
On Fri, May  7, 2021 at 07:23:42PM -0700, Zhihong Yu wrote:
> On Tue, Apr 13, 2021 at 10:55 AM Bryn Llewellyn  wrote:
> There’s no shipped function that does this, and this makes me suspect that
> I’d prefer to roll my own for any serious purpose.
> 
> The existence of the three “justify” functions is, therefore, harmless.
> 
> Bruce / Tom:
> Can we revisit this topic ?

I thought we agreed that the attached patch will be applied to PG 15.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 0e8ef958a9..0830128013 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -2770,15 +2770,13 @@ P  years-months-
 
 
 
- In the verbose input format, and in some fields of the more compact
- input formats, field values can have fractional parts; for example
- '1.5 week' or '01:02:03.45'.  Such input is
- converted to the appropriate number of months, days, and seconds
- for storage.  When this would result in a fractional number of
- months or days, the fraction is added to the lower-order fields
- using the conversion factors 1 month = 30 days and 1 day = 24 hours.
- For example, '1.5 month' becomes 1 month and 15 days.
- Only seconds will ever be shown as fractional on output.
+ Field values can have fractional parts;  for example, '1.5
+ weeks' or '01:02:03.45'.  The fractional
+ parts are used to compute appropriate values for the next lower-order
+ internal fields (months, days, seconds).  For example, '1.5
+ months' becomes 1 month 15 days.
+ The fractional conversion factors used are 1 month = 30 days and 1 day
+ = 24 hours.  Only seconds will ever be shown as fractional on output.
 
 
 
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 54ae632de2..fb9a753223 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -526,7 +526,6 @@ AdjustFractDays(double frac, struct pg_tm *tm, fsec_t *fsec, int scale)
 	extra_days = (int) frac;
 	tm->tm_mday += extra_days;
 	frac -= extra_days;
-	AdjustFractSeconds(frac, tm, fsec, SECS_PER_DAY);
 }
 
 /* Fetch a fractional-second value with suitable error checking */
@@ -3307,28 +3306,28 @@ DecodeInterval(char **field, int *ftype, int nf, int range,
 	case DTK_YEAR:
 		tm->tm_year += val;
 		if (fval != 0)
-			tm->tm_mon += fval * MONTHS_PER_YEAR;
+			tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
 		tmask = DTK_M(YEAR);
 		break;
 
 	case DTK_DECADE:
 		tm->tm_year += val * 10;
 		if (fval != 0)
-			tm->tm_mon += fval * MONTHS_PER_YEAR * 10;
+			tm->tm_mon += rint(fval * MONTHS_PER_YEAR * 10);
 		tmask = DTK_M(DECADE);
 		break;
 
 	case DTK_CENTURY:
 		tm->tm_year += val * 100;
 		if (fval != 0)
-			tm->tm_mon += fval * MONTHS_PER_YEAR * 100;
+			tm->tm_mon += rint(fval * MONTHS_PER_YEAR * 100);
 		tmask = DTK_M(CENTURY);
 		break;
 
 	case DTK_MILLENNIUM:
 		tm->tm_year += val * 1000;
 		if (fval != 0)
-			tm->tm_mon += fval * MONTHS_PER_YEAR * 1000;
+			tm->tm_mon += rint(fval * MONTHS_PER_YEAR * 1000);
 		tmask = DTK_M(MILLENNIUM);
 		break;
 
@@ -3565,7 +3564,7 @@ DecodeISO8601Interval(char *str,
 			{
 case 'Y':
 	tm->tm_year += val;
-	tm->tm_mon += (fval * MONTHS_PER_YEAR);
+	tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
 	break;
 case 'M':
 	tm->tm_mon += val;
@@ -3601,7 +3600,7 @@ DecodeISO8601Interval(char *str,
 		return DTERR_BAD_FORMAT;
 
 	tm->tm_year += val;
-	tm->tm_mon += (fval * MONTHS_PER_YEAR);
+	tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
 	if (unit == '\0')
 		return 0;
 	if (unit == 'T')
diff --git a/src/interfaces/ecpg/pgtypeslib/interval.c b/src/interfaces/ecpg/pgtypeslib/interval.c
index 02b3c47223..0b2fccbbe6 100644
--- a/src/interfaces/ecpg/pgtypeslib/interval.c
+++ b/src/interfaces/ecpg/pgtypeslib/interval.c
@@ -155,7 +155,7 @@ DecodeISO8601Interval(char *str,
 			{
 case 'Y':
 	tm->tm_year += val;
-	tm->tm_mon += (fval * MONTHS_PER_YEAR);
+	tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
 	break;
 case 'M':
 	tm->tm_mon += val;
@@ -191,7 +191,7 @@ DecodeISO8601Interval(char *str,
 		return DTERR_BAD_FORMAT;
 
 	tm->tm_year += val;
-	tm->tm_mon += (fval * MONTHS_PER_YEAR);
+	tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
 	if (unit == '\0')
 		return 0;
 	if (unit == 'T')
@@ -529,28 +529,28 @@ DecodeInterval(char **field, int *ftype, int nf,	/* int range, */
 	case DTK_YEAR:
 		tm->tm_year += val;
 		if (fval != 0)
-			tm->tm_mon += fval * MONTHS_PER_YEAR;
+			tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
 		tmask = (fmask & DTK_M(YEAR)) 

Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread David Rowley
On Sat, 8 May 2021 at 09:16, Tomas Vondra  wrote:
>
> On 5/7/21 11:04 PM, David Rowley wrote:
> > Another thought I have is that maybe it would be ok just to move
> > memory accounting debug code so it only runs once in
> > ExecEndResultCache.  I struggling to imagine that if the memory
> > tracking did go out of whack, that the problem would have accidentally
> > fixed itself by the time we got to ExecEndResultCache().  I guess even
> > if the accounting was counting far too much memory and we'd evicted
> > everything from the cache to try and get the memory usage down, we'd
> > still find the problem during ExecEndResultCache(), even if the cache
> > had become completely empty as a result.
> >
>
> I don't think postponing the debug code until much later is a great
> idea. When something goes wrong it's good to know ASAP, otherwise it's
> much more difficult to identify the issue.

I thought about this a bit and I was about to agree, but then I changed my mind.

The biggest concern I have on this topic is that we end up with zero
validation done for the memory accounting. If we can find a cheaper
place to put the Asserts that will at least bring our attention to the
fact that some problem exists, then more investigation can ensue. I
don't personally expect that every assert failure will show us the
exact location of the bug.

Additionally, I don't really think there is a huge amount of room for
bugs creeping in here as there's not all that many places that the
'mem_used' field gets updated, so there are not many places to forget
to do it.

Another way to look at this is that, where the Asserts are today,
there are zero memory accounting checks done in all cases that don't
evict any tuples. I feel by moving the tests to ExecEndResultCache()
we'll do memory validation for all plans using a Result Cache in
Assert builds.  Yes, we might just need to do a bit more work to find
out exactly where the problem is, but some investigation would need to
happen anyway. I think if anyone changes anything which breaks the
memory accounting then they'll be aware of it quite quickly and they
can just look at what they did wrong.

David


only_validate_resultcache_mem_on_plan_shutdown.patch
Description: Binary data


Re: Have I found an interval arithmetic bug?

2021-05-07 Thread Zhihong Yu
On Fri, May 7, 2021 at 7:23 PM Bruce Momjian  wrote:

> On Fri, May  7, 2021 at 07:23:42PM -0700, Zhihong Yu wrote:
> > On Tue, Apr 13, 2021 at 10:55 AM Bryn Llewellyn 
> wrote:
> > There’s no shipped function that does this, and this makes me
> suspect that
> > I’d prefer to roll my own for any serious purpose.
> >
> > The existence of the three “justify” functions is, therefore,
> harmless.
> >
> > Bruce / Tom:
> > Can we revisit this topic ?
>
> I thought we agreed that the attached patch will be applied to PG 15.
>

Good to know.

Hopefully it lands soon.


> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
>   If only the physical world exists, free will is an illusion.
>
>


Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread Justin Pryzby
On Sat, May 08, 2021 at 02:26:44PM +1200, David Rowley wrote:
> On Sat, 8 May 2021 at 09:16, Tomas Vondra  
> wrote:
> > On 5/7/21 11:04 PM, David Rowley wrote:
> > > Another thought I have is that maybe it would be ok just to move
> > > memory accounting debug code so it only runs once in
> > > ExecEndResultCache.  I struggling to imagine that if the memory
> > > tracking did go out of whack, that the problem would have accidentally
> > > fixed itself by the time we got to ExecEndResultCache().  I guess even
> > > if the accounting was counting far too much memory and we'd evicted
> > > everything from the cache to try and get the memory usage down, we'd
> > > still find the problem during ExecEndResultCache(), even if the cache
> > > had become completely empty as a result.
> >
> > I don't think postponing the debug code until much later is a great
> > idea. When something goes wrong it's good to know ASAP, otherwise it's
> > much more difficult to identify the issue.
> 
> I thought about this a bit and I was about to agree, but then I changed my 
> mind.

> Yes, we might just need to do a bit more work to find
> out exactly where the problem is, but some investigation would need to
> happen anyway. I think if anyone changes anything which breaks the
> memory accounting then they'll be aware of it quite quickly and they
> can just look at what they did wrong.

You could put this into a separate function called by ExecEndResultCache().
Then anyone that breaks the memory accounting can also call the function in the
places they changed to help figure out what they broke.

-* Validate the memory accounting code is correct in assert builds. XXX 
is
-* this too expensive for USE_ASSERT_CHECKING?

-- 
Justin




Re: Have I found an interval arithmetic bug?

2021-05-07 Thread Bruce Momjian
On Fri, May  7, 2021 at 07:39:31PM -0700, Zhihong Yu wrote:
> 
> 
> On Fri, May 7, 2021 at 7:23 PM Bruce Momjian  wrote:
> 
> On Fri, May  7, 2021 at 07:23:42PM -0700, Zhihong Yu wrote:
> > On Tue, Apr 13, 2021 at 10:55 AM Bryn Llewellyn 
> wrote:
> >     There’s no shipped function that does this, and this makes me 
> suspect
> that
> >     I’d prefer to roll my own for any serious purpose.
> >
> >     The existence of the three “justify” functions is, therefore,
> harmless.
> >
> > Bruce / Tom:
> > Can we revisit this topic ?
> 
> I thought we agreed that the attached patch will be applied to PG 15.
> 
> 
> Good to know. 
> 
> Hopefully it lands soon.

It will be applied in June/July, but will not appear in a release until
Sept/Oct, 2022.  Sorry.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2021-05-07 Thread Japin Li


On Fri, 07 May 2021 at 19:50, ahsan hadi  wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:not tested
>
> I have also seen this error with logical replication using pglogical 
> extension, will this patch also address similar problem with pglogical?

Is there a test case to reproduce this problem (using pglogical)?
I have encountered this, but I cannot find a case to reproduce it.

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: plan with result cache is very slow when work_mem is not enough

2021-05-07 Thread David Rowley
On Sat, 8 May 2021 at 14:43, Justin Pryzby  wrote:
> You could put this into a separate function called by ExecEndResultCache().
> Then anyone that breaks the memory accounting can also call the function in 
> the
> places they changed to help figure out what they broke.

I almost did it that way and left a call to it in remove_cache_entry()
#ifdef'd out.  However, as mentioned, I'm more concerned about the
accounting being broken and left broken than I am with making it take
a little less time to find the exact place to fix the breakage.  If
the breakage was to occur when adding a new entry to the cache then it
might not narrow it down much if we just give users an easy way to
check the memory accounting during only evictions. The only way to
highlight the problem as soon as it occurs would be to validate the
memory tracking every time the mem_used field is changed. I think that
would be overkill.

I also find it hard to imagine what other reasons we'll have in the
future to adjust 'mem_used'.  At the moment there are 4 places. Two
that add bytes and two that subtract bytes. They're all hidden inside
reusable functions that are in charge of adding and removing entries
from the cache.

David




Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Noah Misch
On Fri, May 07, 2021 at 10:18:14PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2021-05-07 17:14:18 -0700, Noah Misch wrote:
> >> Having a flaky buildfarm member is bad news.  I'll LD_PRELOAD the attached 
> >> to
> >> prevent fsync from reaching the kernel.  Hopefully, that will make the
> >> hardware-or-kernel trouble unreachable.  (Changing 008_fsm_truncation.pl
> >> wouldn't avoid this, because fsync=off doesn't affect syncs outside the
> >> backend.)
> 
> > Not sure how reliable that is - there's other paths that could return an
> > error, I think.

Yep, one can imagine a failure at close() or something.  All the non-HEAD
buildfarm failures are at some *sync call, so I'm optimistic about getting
mileage from this.  (I didn't check the more-numerous HEAD failures.)  If it's
not enough, I may move the farm directory to tmpfs.

> > If the root cause is the disk responding weirdly to
> > write cache flushes, you could tell the kernel that that the disk has no
> > write cache (e.g. echo write through > /sys/block/sda/queue/write_cache).
> 
> I seriously doubt Noah has root on that machine.

If I can make the case for that setting being a good thing for the VM's users
generally, I probably can file a ticket and get it done.

> More to the point, the admin told me it's a VM (or LDOM, whatever that is)
> under a Solaris host, so there's no direct hardware access going on
> anyway.  He didn't say in so many words, but I suspect the reason he's
> suspecting kernel bugs is that there's nothing going wrong so far as the
> host OS is concerned.




Inaccurate error message when set fdw batch_size to 0

2021-05-07 Thread houzj.f...@fujitsu.com
Hi,

When testing the fdw batch insert, I found a possible issue.

If I set the batch_size to 0 , it will throw an error:

-
CREATE FOREIGN TABLE test(a int, b varchar)
  SERVER testserver
  OPTIONS (table_name 'testlocal', batch_size '0');
ERROR:  fetch_size requires a non-negative integer value
-

The error message here does not seem accurate, because
I can see from the code that batch_size should be positive (> 0).

So, would it be better to change the error message to “fetch_size requires a positive 
integer value”?
I also found that fetch_size has a similar issue; a patch to fix this is attached.

Best regards,
houzj



0001-fix-errmsg-when-set-batchsize-to-0.patch
Description: 0001-fix-errmsg-when-set-batchsize-to-0.patch


Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-05-07 Thread David Rowley
On Sat, 8 May 2021 at 14:04, David Rowley  wrote:
> I'm not opposed to adding some new field if that's what it takes.  I'd
> imagine the new field will be something like negfuncid which will be
> InvalidOid unless the hash function is set and useOr == false

Just while this is still swapped into main memory, I've attached a
patch that adds a new field to ScalarArrayOpExpr rather than
repurposing the existing field.

David


v3-0001-Speedup-NOT-IN-with-a-set-of-Consts.patch
Description: Binary data


Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

2021-05-07 Thread Thomas Munro
On Sat, May 8, 2021 at 2:30 AM Tom Lane  wrote:
> May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11

That's -EAGAIN (assuming errnos match x86) and I guess it indicates
that VDC_MAX_RETRIES is exceeded here:

https://github.com/torvalds/linux/blob/master/drivers/block/sunvdc.c#L451
https://github.com/torvalds/linux/blob/master/drivers/block/sunvdc.c#L526

One theory is that the hypervisor/host is occasionally too swamped to
service the request queue fast enough over a ~10ms period, given that
vio_ldc_send() itself retries 1000 times with a 1us sleep, the outer
loop tries ten times, and ldc.c's write_nonraw() reports -EAGAIN when
there is no space for the message.  (Alternatively, it's trying to
send a message that's too big for the channel, the channel is
corrupted by bugs, or my fly-by of this code I'd never heard of before
now is just way off...)




Re: Small issues with CREATE TABLE COMPRESSION

2021-05-07 Thread Dilip Kumar
On Sat, May 8, 2021 at 7:04 AM Michael Paquier  wrote:
>
> On Thu, May 06, 2021 at 09:04:57PM +0900, Michael Paquier wrote:
> > Yeah, I agree that this is an improvement, so let's fix this.
>
> Just noticed that this was not applied yet, so done while I was
> looking at this thread again.

Thanks!

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Inaccurate error message when set fdw batch_size to 0

2021-05-07 Thread Dilip Kumar
On Sat, May 8, 2021 at 9:09 AM houzj.f...@fujitsu.com
 wrote:
>
> The error message here seems not accurate, because
>
> I can see from the code batch_size should be positive ( > 0).
>

> So, is it better to change the error message to “fetch_size requires a 
> positive integer value” ?
>
> I also found fetch_size has the similar issue, attaching a patch to fix this.

Yes, it should be a positive integer, so your change makes sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: Enhanced error message to include hint messages for redundant options error

2021-05-07 Thread houzj.f...@fujitsu.com
> > > Thanks! The v5 patch looks good to me. Let's see if all agree on the
> > > goto duplicate_error approach which could reduce the LOC by ~80.
> >
> > I think the "goto duplicate_error" approach looks good, it avoids
> > duplicating the same error code multiple times.
> 
> Thanks. I will mark the v5 patch "ready for committer" if no one has comments.

Hi,

I looked into the patch and noticed a minor thing.

+   return; /* keep compiler quiet */
 }

I think we do not need the comment here.
The compiler does not seem to require a "return" at the end of a function
when the function's return type is void.

In addition, it seems better to remove these "return;" statements, like what
commit "3974c4" did.

Best regards,
houzj