speed up unicode normalization quick check
Hi,

Attached is a patch to use perfect hashing to speed up Unicode normalization quick check.

0001 changes the set of multipliers attempted when generating the hash function. The set in HEAD works for the current set of NFC codepoints, but not for the other types. Also, the updated multipliers now all compile to shift-and-add on most platform/compiler combinations available on godbolt.org (earlier experiments found in [1]). The existing keyword lists are fine with the new set, and don't seem to be very picky in general. As a test, it also successfully finds a function for the OS "words" file, the "D" sets of codepoints, and for sets of the first n built-in OIDs, where n > 5.

0002 builds on top of the existing normprops infrastructure to use a hash function for NFC quick check. Below are typical numbers in a non-assert build:

select count(*) from (select md5(i::text) as t from generate_series(1,10) as i) s where t is nfc normalized;

HEAD   411ms 413ms 409ms
patch  296ms 297ms 299ms

The addition of "const" was to silence a compiler warning. Also, I changed the formatting of the output file slightly to match pgindent.

0003 uses hashing for NFKC and removes binary search. This is split out for readability. I gather NFKC is a less common use case, so this could technically be left out. Since this set is larger, the performance gains are a bit larger as well, at the cost of 19kB of binary space:

HEAD   439ms 440ms 442ms
patch  299ms 301ms 301ms

I'll add this to the July commitfest.

[1] https://www.postgresql.org/message-id/cacpnzcuvtilhxazxp9ucehguyhma59h6_pmp+_w-szxg0uy...@mail.gmail.com

--
John Naylor                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

v1-0001-Tweak-the-set-of-candidate-multipliers-for-genera.patch
Description: Binary data

v1-0002-Use-perfect-hashing-for-NFC-Unicode-normalization.patch
Description: Binary data

v1-0003-Use-perfect-hashing-for-NFKC-Unicode-normalizatio.patch
Description: Binary data
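For readers who have not looked at the generated code, the lookup side of such a perfect hash is essentially "hash once, then verify": the function maps a codepoint to exactly one candidate slot, and a single comparison rejects codepoints that are not in the table, which is what replaces the binary search. The sketch below only illustrates that shape -- the type and function names are made up here and are not what the patch generates.

    #include <stdint.h>

    /* Illustrative names only -- not the structures generated by the patch. */
    typedef struct
    {
        uint32_t    codepoint;
        int8_t      quickcheck;     /* yes / no / maybe */
    } qc_entry;

    extern int  qc_hash(uint32_t cp);   /* generated perfect hash (not shown) */
    extern const qc_entry qc_table[];
    extern const int qc_table_size;

    /* The hash yields one candidate slot, so membership is a single compare. */
    static int8_t
    qc_lookup(uint32_t cp)
    {
        int         h = qc_hash(cp);

        if (h < 0 || h >= qc_table_size || qc_table[h].codepoint != cp)
            return 0;           /* not in the exception table, i.e. "yes" */
        return qc_table[h].quickcheck;
    }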
Re: Planning counters in pg_stat_statements (using pgss_store)
On Thu, May 21, 2020 at 08:49:53AM +0200, Julien Rouhaud wrote:
> On Tue, May 19, 2020 at 4:29 AM Andy Fan wrote:
>> Thanks for the excellent extension. I want to add 5 more fields to satisfy the
>> following requirements.
>>
>> int subplan;   /* No. of subplan in this query */
>> int subquery;  /* No. of subquery */
>> int joincnt;   /* How many relations are joined */
>> bool hasagg;   /* if we have agg function in this query */
>> bool hasgroup; /* has group clause */
>
> Most of those fields can be computed using the raw sql statements. Why
> not add functions like query_has_agg(querytext) to get the
> information from pgss stored query text instead?

Yeah, I personally find concepts related only to the query string itself not something that needs to be tied to pg_stat_statements. While reading about those five new fields, I am also wondering how this stuff would work with CTEs. Particularly, should the hasagg or hasgroup flags be set only if the most outer query satisfies a condition? What if an inner query satisfies a condition but not an outer query? Should joincnt just be the sum of all the joins done in all queries, including subqueries?
--
Michael

signature.asc
Description: PGP signature
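For what it's worth, the CTE question above can be made concrete with a statement as simple as the following, where the aggregate and GROUP BY appear only in the inner query; whether such a statement should set hasagg/hasgroup is exactly the open point (the example only illustrates the question, not any proposed behaviour):

    -- The aggregate and GROUP BY live only in the CTE, not in the outer query.
    WITH per_kind AS (
        SELECT relkind, count(*) AS n
        FROM pg_class
        GROUP BY relkind
    )
    SELECT relkind, n
    FROM per_kind
    ORDER BY n DESC;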
Re: explicit_bzero for sslpassword
On Wed, May 20, 2020 at 10:06:55AM +0200, Peter Eisentraut wrote: > Looks correct to me. Thanks for confirming, Peter. Got this one applied. -- Michael signature.asc Description: PGP signature
Re: Planning counters in pg_stat_statements (using pgss_store)
On Thu, May 21, 2020 at 09:17, Michael Paquier wrote:
> On Thu, May 21, 2020 at 08:49:53AM +0200, Julien Rouhaud wrote:
> > On Tue, May 19, 2020 at 4:29 AM Andy Fan wrote:
> >> Thanks for the excellent extension. I want to add 5 more fields to satisfy the
> >> following requirements.
> >>
> >> int subplan;   /* No. of subplan in this query */
> >> int subquery;  /* No. of subquery */
> >> int joincnt;   /* How many relations are joined */
> >> bool hasagg;   /* if we have agg function in this query */
> >> bool hasgroup; /* has group clause */
> >
> > Most of those fields can be computed using the raw sql statements. Why
> > not add functions like query_has_agg(querytext) to get the
> > information from pgss stored query text instead?
>
> Yeah, I personally find concepts related only to the query string
> itself not something that needs to be tied to pg_stat_statements.
> While reading about those five new fields, I am also wondering how
> this stuff would work with CTEs. Particularly, should the hasagg or
> hasgroup flags be set only if the most outer query satisfies a
> condition? What if an inner query satisfies a condition but not an
> outer query? Should joincnt just be the sum of all the joins done in
> all queries, including subqueries?

Indeed, CTEs will bring additional concerns about the fields' semantics. That's another good reason to go with external functions, so you can add extra parameters for that if needed.
Re: [Proposal] Page Compression for OLTP
Hello,

My 0.02€, some of which may just show some misunderstanding on my part:

- You have clearly given quite a bit of thought to the what and how… which makes your message an interesting read.

- Could this be proposed as some kind of extension, provided that enough hooks are available? ISTM that foreign tables and/or alternative storage engines (aka ACCESS METHOD) provide convenient APIs which could fit the need for these? Or are they not appropriate? You seem to suggest that they are not. If not, what could be done to improve the API to allow what you are seeking to do? Maybe you need a somewhat lower-level programmable API which does not exist already, or at least is not exported already, but could be specified and implemented with limited effort? Basically you would like to read/write pg pages to somewhere, and then there is the syncing issue to consider. Maybe such a "page storage" API could provide benefit for some specialized hardware, eg persistent memory stores, so there would be more reason to define it anyway? I think it might be valuable to give it some thought.

- Could you maybe elaborate on how your plan differs from [4] and [5]?

- Have you considered keeping page headers and compressing tuple data only?

- I'm not sure there is a point in going below the underlying file system blocksize, quite often 4 KiB? Or maybe yes? Or is there a benefit to aim at 1/4 even if most pages overflow?

- ISTM that your approach entails 3 "files". Could it be done with 2? I'd suggest that the possible overflow pointers (coa) could be part of the headers so that when reading the 3.1 page, the header would tell where to find the overflow 3.2, without requiring an additional independent structure with very small data in it, most of it zeros (a rough sketch of such a header follows below). Possibly this is not possible, because it would require some available space in standard headers when the page is not compressible, and there is not enough. Maybe creating a little room for that in existing headers (4 bytes could be enough?) would be a good compromise. Hmmm. Maybe the approach I suggest would only work for 1/2 compression, but not for other target ratios, but I think it could be made to work if the pointer can entail several blocks in the overflow table.

- If one page is split in 3 parts, could it create problems on syncing, if 1/3 or 2/3 pages get written? But maybe that is manageable with WAL, as it would note that the page was not synced and that is enough for replay.

- I'm unclear how you would manage the 2 representations of a page in memory. I'm afraid that an 8 KiB page compressed to 4 KiB would basically take 12 KiB, i.e. reduce the available memory for caching purposes. Hmmm. The current status is that a written page probably takes 16 KiB, once in shared buffers and once in the system caches, so it would be an improvement anyway.

- Maybe the compressed and overflow tables could become bloated somehow, which would require a vacuuming implementation and add to the complexity of the implementation?

- External tools should be available to allow page inspection, eg for debugging purposes.

--
Fabien.
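As a strawman for the "overflow pointer in the header" idea above, a per-page compression header along these lines needs only a few bytes, with the overflow reference itself being the 4 bytes mentioned. This is purely an illustrative sketch with made-up names, not something taken from the proposed patch:

    #include <stdint.h>

    #define INVALID_OVERFLOW_BLOCK  UINT32_MAX

    /*
     * Illustrative only: a small header stored with each compressed page so
     * that the overflow location lives with the page itself instead of in a
     * separate third file.
     */
    typedef struct CompressedPageHeader
    {
        uint16_t    compressed_len;     /* bytes of compressed payload here */
        uint8_t     algorithm;          /* 0 = none, 1 = pglz, 2 = lz4, ... */
        uint8_t     flags;
        uint32_t    overflow_block;     /* first overflow block, or INVALID_OVERFLOW_BLOCK */
    } CompressedPageHeader;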
Re: Is it useful to record whether plans are generic or custom?
At Thu, 21 May 2020 12:18:16 +0900, Fujii Masao wrote in
> On 2020/05/20 21:56, Atsushi Torikoshi wrote:
> > On Wed, May 20, 2020 at 1:32 PM Kyotaro Horiguchi wrote:
> > At Tue, 19 May 2020 22:56:17 +0900, Atsushi Torikoshi wrote in
> > > On Sat, May 16, 2020 at 6:01 PM legrand legrand wrote:
> > >
> > > BTW, I'd also appreciate other opinions about recording the number
> > > of generic and custom plans on pg_stat_statements.
> >
> > If you/we just want to know how a prepared statement is executed,
> > couldn't we show that information in pg_prepared_statements view?
> >
> > =# select * from pg_prepared_statements;
> > -[ RECORD 1 ]---+
> > name            | stmt1
> > statement       | prepare stmt1 as select * from t where b = $1;
> > prepare_time    | 2020-05-20 12:01:55.733469+09
> > parameter_types | {text}
> > from_sql        | t
> > exec_custom     | 5   <- existing num_custom_plans
> > exec_total      | 40  <- new member of CachedPlanSource
> >
> > Thanks, Horiguchi-san!
> > Adding counters to pg_prepared_statements seems useful when we want
> > to know the way prepared statements executed in the current session.
>
> I like the idea exposing more CachedPlanSource fields in
> pg_prepared_statements. I agree it's useful, e.g., for the debug
> purpose.
> This is why I implemented the similar feature in my extension.
> Please see [1] for details.

Thanks. I'm not sure plan_cache_mode should be a part of the view. Cost numbers would look better if it is cooked a bit. Is it worth being in core?

=# select * from pg_prepared_statements;
-[ RECORD 1 ]---+
name            | p1
statement       | prepare p1 as select a from t where a = $1;
prepare_time    | 2020-05-21 15:41:50.419578+09
parameter_types | {integer}
from_sql        | t
calls           | 7
custom_calls    | 5
plan_generation | 6
generic_cost    | 4.3105
custom_cost     | 9.31

Perhaps plan_generation is not needed there.

> > And I also feel adding counters to pg_stat_statements will be
> > convenient especially in production environments because it enables
> > us to get information about not only the current session but all
> > sessions of a PostgreSQL instance.
>
> +1

Agreed. It is global and persistent.

At Tue, 19 May 2020 22:56:17 +0900, Atsushi Torikoshi wrote in
> Instead, I'm now considering using a static hash for prepared queries
> (static HTAB *prepared_queries).

That might be complex and fragile considering nested query and SPI calls. I'm not sure, but could we use ActivePortal? ActivePortal->cplan is a CachedPlan, which can hold the generic/custom information.

> [1]
> https://github.com/MasaoFujii/pg_cheat_funcs#record-pg_cached_plan_sourcestmt-text

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 1c6a3dd41a59504244134ee44ddd4516191656da Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 21 May 2020 15:32:38 +0900
Subject: [PATCH] Expose counters of plancache to pg_prepared_statements

We didn't have an easy way to find how many times generic or custom plans of a prepared statement have been executed. This patch exposes such numbers and costs of both plans in pg_prepared_statements.
---
 doc/src/sgml/catalogs.sgml            | 45 +++
 src/backend/commands/prepare.c        | 30 --
 src/backend/utils/cache/plancache.c   | 13
 src/include/catalog/pg_proc.dat       |  6 ++--
 src/include/utils/plancache.h         |  5 +--
 src/test/regress/expected/prepare.out | 43 +
 src/test/regress/expected/rules.out   |  9 --
 src/test/regress/sql/prepare.sql      |  9 ++
 8 files changed, 144 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index b1b077c97f..950ed30694 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -10826,6 +10826,51 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
  frontend/backend protocol
+
+ calls bigint
+
+ Number of times the prepared statement was executed
+
+ custom_calls bigint
+
+ Number of times generic plan is executed for the prepared statement
+
+ plan_geneation bigint
+
+ Number of times execution plan was generated for the prepared statement
+
+ generic_cost double precision
+
+ Estimated cost of generic plan. NULL if not yet planned.
+
+ custom_cost double precision
+
+ Estimated cost of custom plans.
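Assuming the columns the patch above proposes (calls, custom_calls, generic_cost, custom_cost -- none of which exist in released versions), a session could then check how its prepared statements are being planned with an ordinary query, for example:

    SELECT name,
           calls,
           custom_calls,
           calls - custom_calls AS generic_calls,
           generic_cost,
           custom_cost
    FROM pg_prepared_statements
    WHERE calls > 0
    ORDER BY custom_calls DESC;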
Re: [bug] Table not have typarray when created by single user mode
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

I verified and found no problems.
Re: Is it useful to record whether plans are generic or custom?
Hi Torikoshi-san! On 2020/05/21 17:10, Kyotaro Horiguchi wrote: At Thu, 21 May 2020 12:18:16 +0900, Fujii Masao wrote in On 2020/05/20 21:56, Atsushi Torikoshi wrote: On Wed, May 20, 2020 at 1:32 PM Kyotaro Horiguchi mailto:horikyota@gmail.com>> wrote: At Tue, 19 May 2020 22:56:17 +0900, Atsushi Torikoshi mailto:ato...@gmail.com>> wrote in > On Sat, May 16, 2020 at 6:01 PM legrand legrand > mailto:legrand_legr...@hotmail.com>> > wrote: > > BTW, I'd also appreciate other opinions about recording the number > of generic and custom plans on pg_stat_statemtents. If you/we just want to know how a prepared statement is executed, couldn't we show that information in pg_prepared_statements view? =# select * from pg_prepared_statements; -[ RECORD 1 ]---+ name| stmt1 statement | prepare stmt1 as select * from t where b = $1; prepare_time| 2020-05-20 12:01:55.733469+09 parameter_types | {text} from_sql| t exec_custom | 5<- existing num_custom_plans exec_total | 40 <- new member of CachedPlanSource Thanks, Horiguchi-san! Adding counters to pg_prepared_statements seems useful when we want to know the way prepared statements executed in the current session. I like the idea exposing more CachedPlanSource fields in pg_prepared_statements. I agree it's useful, e.g., for the debug purpose. This is why I implemented the similar feature in my extension. Please see [1] for details. Thanks. I'm not sure plan_cache_mode should be a part of the view. Cost numbers would look better if it is cooked a bit. Is it worth being in core? =# select * from pg_prepared_statements; -[ RECORD 1 ]---+ name| p1 statement | prepare p1 as select a from t where a = $1; prepare_time| 2020-05-21 15:41:50.419578+09 parameter_types | {integer} from_sql| t calls | 7 custom_calls| 5 plan_generation | 6 generic_cost| 4.3105 custom_cost | 9.31 Perhaps plan_generation is not needed there. I tried to creating PoC patch too, so I share it. Please find attached file. # Test case prepare count as select count(*) from pg_class where oid >$1; execute count(1); select * from pg_prepared_statements; -[ RECORD 1 ]---+-- name| count statement | prepare count as select count(*) from pg_class where oid >$1; prepare_time| 2020-05-21 17:41:16.134362+09 parameter_types | {oid} from_sql| t is_generic_plan | f <= False You can see the following result, when you execute it 6 times. -[ RECORD 1 ]---+-- name| count statement | prepare count as select count(*) from pg_class where oid >$1; prepare_time| 2020-05-21 17:41:16.134362+09 parameter_types | {oid} from_sql| t is_generic_plan | t <= True Thanks, Tatsuro Yamada diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c index 80d6df8ac1..63de4fdb78 100644 --- a/src/backend/commands/prepare.c +++ b/src/backend/commands/prepare.c @@ -723,7 +723,7 @@ pg_prepared_statement(PG_FUNCTION_ARGS) * build tupdesc for result tuples. This must match the definition of the * pg_prepared_statements view in system_views.sql */ - tupdesc = CreateTemplateTupleDesc(5); + tupdesc = CreateTemplateTupleDesc(6); TupleDescInitEntry(tupdesc, (AttrNumber) 1, "name", TEXTOID, -1, 0); TupleDescInitEntry(tupdesc, (AttrNumber) 2, "statement", @@ -734,6 +734,8 @@ pg_prepared_statement(PG_FUNCTION_ARGS) REGTYPEARRAYOID, -1, 0); TupleDescInitEntry(tupdesc, (AttrNumber) 5, "from_sql", BOOLOID, -1, 0); + TupleDescInitEntry(tupdesc, (AttrNumber) 6, "is_generic_plan", + BOOLOID, -1, 0); /* * We put all the tuples into a tuplestore in one scan of the hashtable. 
@@ -755,8 +757,8 @@ pg_prepared_statement(PG_FUNCTION_ARGS) hash_seq_init(&hash_seq, prepared_queries); while ((prep_stmt = hash_seq_search(&hash_seq)) != NULL) { - Datum values[5]; - boolnulls[5]; + Datum values[6]; + boolnulls[6]; MemSet(nulls, 0, sizeof(nulls)); @@ -766,6 +768,7 @@ pg_prepared_statement(PG_FUNCTION_ARGS) values[3] = build_regtype_array(prep_stmt->plansource->param_types, prep_stmt
Re: some grammar refactoring
On Tue, May 19, 2020 at 12:13 PM Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote:
> Here is a series of patches to do some refactoring in the grammar around
> the commands COMMENT, DROP, SECURITY LABEL, and ALTER EXTENSION ...
> ADD/DROP. In the grammar, these commands (with some exceptions)
> basically just take a reference to an object and later look it up in C
> code. Some of that was already generalized individually for each
> command (drop_type_any_name, drop_type_name, etc.). This patch combines
> it into common lists for all these commands.
>
> Advantages:
>
> - Avoids having to list each object type at least four times.
>
> - Object types not supported by security labels or extensions are now
> explicitly listed and give a proper error message. Previously, this was
> just encoded in the grammar itself and specifying a non-supported object
> type would just give a parse error.
>
> - Reduces lines of code in gram.y.
>
> - Removes some old cruft.

I liked the idea. I had a quick glance through the patches and also did a quick review and testing. I haven't found any issues with the patches.

> --
> Peter Eisentraut              http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Rushabh Lathia
Re: pg13: xlogreader API adjust
At Fri, 15 May 2020 19:24:28 -0400, Alvaro Herrera wrote in > On 2020-May-15, Michael Paquier wrote: > > > On Thu, May 14, 2020 at 02:12:25PM +0900, Kyotaro Horiguchi wrote: > > > Good catch! That's not only for CreateDecodingContet. That happens > > > everywhere in the query loop in PostgresMain() until logreader is > > > initialized. So that also happens, for example, by starting logical > > > replication using invalidated slot. Checking xlogreader != NULL in > > > WalSndErrorCleanup is sufficient. It doesn't make actual difference, > > > but the attached explicitly initialize the pointer with NULL. > > > > Alvaro, are you planning to look at that? Should we have an open item > > for this matter? > > On it now. I'm trying to add a test for this (needs a small change to > PostgresNode->psql), but I'm probably doing something stupid in the Perl > side, because it doesn't detect things as well as I'd like. Still > trying, but I may be asked to evict the office soon ... FWIW, and I'm not sure which of the mail and the commit 1d3743023e was earlier, but I confirmed that the committed test in 006_logical_decoding.pl causes a crash, and the crash is fixed by the change of code. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: proposal: schema variables
Hi just rebase without any other changes Regards Pavel
Re: proposal: schema variables
On Thu, May 21, 2020 at 3:41 PM Pavel Stehule wrote: > > Hi > > just rebase without any other changes > You seem to forget attaching the rebased patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PostgreSQL 13 Beta 1 Release Announcement Draft
Hi John, On 5/21/20 12:12 AM, John Naylor wrote: > Hi Jon, > > I noticed a couple minor inconsistencies: > > ".datetime" -> elsewhere functions are formatted as `.datetime()` > > libpq -> `libpq` > > The link to the release notes on its own line is the same as the > inline link, if that makes sense. In other places with links on their > own line, the full URL is in the link text. > > Also, for "indexes that contain many repeat values", "repeated" might > sound better here. It's one of those things that jumped out at me at > first reading, but when trying both in my head, it seems ok. > > Regarding "streaming `pg_basebackup`s", I'm used to the general term > "base backups" in this usage, which seems a distinct concept from the > name of the invoked command. Thanks for the suggestions. I ended up incorporating all of them. Stay tuned for the release... Jonathan signature.asc Description: OpenPGP digital signature
Behaviour of failed Primary
Hi Forum,

If I have a cluster with synchronous replication enabled across three nodes, for example:

[primary] [hot standby 1] [hot standby 2]

And if, for some unforeseen reason, the primary fails, failover will kick in and hot standby 1 will become the new primary, so the cluster setup will look like this:

[new primary (hot standby 1)] [hot standby 2]

My question is: what will happen if the original primary that failed comes back? Will it become part of this highly available replica cluster automatically, or will it be stale and disconnected from the cluster? How can we automatically make the failed primary part of the cluster again in a hot standby role? It would be of great help if you could direct me to any reference material. Thank you upfront.

Regards,
Santhosh
Re: proposal: schema variables
čt 21. 5. 2020 v 13:34 odesílatel Amit Kapila napsal: > On Thu, May 21, 2020 at 3:41 PM Pavel Stehule > wrote: > > > > Hi > > > > just rebase without any other changes > > > > You seem to forget attaching the rebased patch. > I am sorry attached. Pavel > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > schema-variables-20200521.patch.gz Description: application/gzip
Re: factorial function/phase out postfix operators?
On Wed, May 20, 2020 at 2:24 PM Tom Lane wrote: > Right; I'd done the same arithmetic. Since we currently have a total > of 450 keywords of all flavors, that means we can make either 64% > of them or 74.6% of them be safe to use as bare column labels. While > that's surely better than today, it doesn't seem like it's going to > make for any sort of sea change in the extent of the problem. So I was > feeling a bit discouraged by these results. I don't think you should feel discouraged by these results. They assume that people are just as likely to have a problem with a reserved keyword as an unreserved keyword, and I don't think that's actually true. The 25.4% of keywords that aren't handled this way include, to take a particularly egregious example, "AS" itself. And I don't think many people are going to be sad if "select 1 as;" fails to treat "as" as a column label. Also, even if we only made 74.6% of these safe to use as bare column labels, or even 64%, I think that's actually pretty significant. If I could reduce my mortgage payment by 64%, I would be pretty happy. For many people, that would be a sufficiently large economic impact that it actually would be a sea change in terms of their quality of life. I don't see a reason to suppose that's not also true here.[1] I do like the idea of considering "can be a bare column label" as an independent dimension from the existing keyword classification. Presumably we would then have, in addition to the four existing keyword productions, but then also a separate bare_column_label_keyword: production that would include many of the same keywords. One nice thing about that approach is that we would then have a clear list of exactly which keywords can't be given that treatment, and if somebody wanted to go investigate possible improvements for any of those, they could do so. I think we'd want a cross-check: check_keywords.pl should contain the list of keywords that are expected to be excluded from this new production, so that any time someone adds a new keyword, they've either got to add it to the new production or add it to the exception list. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] On the other hand, if I had 64% fewer ants in my picnic basket, I would probably still be unhappy with the number of ants in my picnic basket, so it all depends on context and perspective.
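As a purely hypothetical illustration of the user-visible effect being discussed: with a separate bare_column_label_keyword production, a statement like the first one below could accept an unreserved keyword as an implicit column label, while keywords excluded from the production (AS being the obvious one) would still need quoting or an explicit AS. Which keywords end up on which side is exactly what the proposed check_keywords.pl cross-check would make explicit.

    -- With the proposed production, something like this could be accepted,
    -- labelling the column "day" without AS (not current behaviour):
    SELECT 1 day;

    -- Keywords left out of the production would still require AS or quoting:
    SELECT 1 AS "as";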
Re: WIP/PoC for parallel backup
On Thu, May 21, 2020 at 2:06 AM Rushabh Lathia wrote: > Yes. My colleague Suraj tried this and here are the pg_stat_activity output > files. > > Captured wait events after every 3 seconds during the backup for - > 1: parallel backup for 100GB data with 4 workers > (pg_stat_activity_normal_backup_100GB.txt) > 2: Normal backup (without parallel backup patch) for 100GB data > (pg_stat_activity_j4_100GB.txt) > > Here is the observation: > > The total number of events (pg_stat_activity) captured during above runs: > - 314 events for normal backups > - 316 events for parallel backups (-j 4) > > BaseBackupRead wait event numbers: (newly added) > 37 - in normal backups > 25 - in the parallel backup (-j 4) > > ClientWrite wait event numbers: > 175 - in normal backup > 1098 - in parallel backups > > ClientRead wait event numbers: > 0 - ClientRead in normal backup > 326 - ClientRead in parallel backups for diff processes. (all in idle state) So, basically, when we go from 1 process to 4, the additional processes spend all of their time waiting rather than doing any useful work, and that's why there is no performance benefit. Presumably, the reason they spend all their time waiting for ClientRead/ClientWrite is because the network between the two machines is saturated, so adding more processes that are trying to use it at maximum speed just leads to spending more time waiting for it to be available. Do we have the same results for the local backup case, where the patch helped? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
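For reference, the sampling described above can be done with a plain query against pg_stat_activity, run every few seconds while the backup is in progress; something along these lines (the BaseBackupRead wait event itself only exists with the patch applied):

    SELECT backend_type,
           state,
           wait_event_type,
           wait_event,
           count(*)
    FROM pg_stat_activity
    WHERE backend_type = 'walsender'
    GROUP BY 1, 2, 3, 4
    ORDER BY count(*) DESC;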
Re: Trouble with hashagg spill I/O pattern and costing
On Tue, May 19, 2020 at 05:12:02PM +0200, Tomas Vondra wrote: ... The problem is that the hashagg plan runs in ~1400 seconds, while the groupagg only takes ~360. And per explain analyze, the difference really is in the aggregation - if we subtract the seqscan, the sort+groupagg takes about 310s: -> GroupAggregate (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1) Group Key: lineitem_1.l_partkey -> Sort (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1) Sort Key: lineitem_1.l_partkey Sort Method: external merge Disk: 3569544kB -> Seq Scan on lineitem lineitem_1 (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1) while the hashagg takes ~1330s: -> HashAggregate (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=640 loops=1) Group Key: lineitem_1.l_partkey Planned Partitions: 128 Peak Memory Usage: 4249 kB Disk Usage: 26321840 kB HashAgg Batches: 16512 -> Seq Scan on lineitem lineitem_1 (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1) And that's while only writing 26GB, compared to 35GB in the sorted plan, and with cost being ~16M vs. ~43M (so roughly inverse). I've noticed I've actually made a mistake here - it's not 26GB vs. 35GB in hash vs. sort, it's 26GB vs. 3.5GB. That is, the sort-based plan writes out *way less* data to the temp file. The reason is revealed by explain verbose: -> GroupAggregate Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity)) Group Key: lineitem_1.l_partkey -> Sort Output: lineitem_1.l_partkey, lineitem_1.l_quantity Sort Key: lineitem_1.l_partkey -> Seq Scan on public.lineitem lineitem_1 Output: lineitem_1.l_partkey, lineitem_1.l_quantity -> HashAggregate Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity)) Group Key: lineitem_1.l_partkey -> Seq Scan on public.lineitem lineitem_1 Output: lineitem_1.l_orderkey, lineitem_1.l_partkey, lineitem_1.l_suppkey, lineitem_1.l_linenumber, lineitem_1.l_quantity, lineitem_1.l_extendedprice, lineitem_1.l_discount, lineitem_1.l_tax, lineitem_1.l_returnflag, lineitem_1.l_linestatus, lineitem_1.l_shipdate, lineitem_1.l_commitdate, lineitem_1.l_receiptdate, lineitem_1.l_shipinstruct, lineitem_1.l_shipmode, lineitem_1.l_comment It seems that in the hashagg case we're not applying projection in the seqscan, forcing us to serialize way much data (the whole lineitem table, essentially). It's probably still worth tweaking the I/O pattern, I think. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Schedule of commit fests for PG14
Michael Paquier writes: > Normally $subject would have been discussed at the developer meeting > in Ottawa, but that's not going to happen per the current situation. > For the last couple of years, we have been using the same timeline for > for commit fests in a development cycle, so why not going with the > same flow this year? This would mean 5 CFs: > - 2020-07-01~2020-07-31 > - 2020-09-01~2020-09-30 > - 2020-11-01~2020-11-30 > - 2021-01-01~2021-01-31 > - 2021-03-01~2021-03-31 Yeah, nobody's expressed any great unhappiness with the schedule recently, so let's just do the same thing again this year. regards, tom lane
Re: Schedule of commit fests for PG14
On 5/21/20 2:35 AM, Michael Paquier wrote: Hi all, Normally $subject would have been discussed at the developer meeting in Ottawa, but that's not going to happen per the current situation. For the last couple of years, we have been using the same timeline for for commit fests in a development cycle, so why not going with the same flow this year? This would mean 5 CFs: - 2020-07-01~2020-07-31 - 2020-09-01~2020-09-30 - 2020-11-01~2020-11-30 - 2021-01-01~2021-01-31 - 2021-03-01~2021-03-31 Any thoughts or opinions? +1. This schedule seems to have worked fine the last two years. Regards, -- -David da...@pgmasters.net
Re: PostgreSQL 13 Beta 1 Release Announcement Draft
Congrats to all for the release of a new major version! Two questions: - Why is VACUUM together with FETCH FIRST WITH TIES, CREATE TABLE LIKE, ALTER VIEW, ALTER TABLE, etc in Utility Commands section? Shouldn't there be a separate section for SQL changes? (or keep one section but rename the Utility to include all?) > Add FOREIGN to ALTER statements, if appropriate (Luis Carril) > WHAT IS THIS ABOUT? - The "WHAT IS THIS ABOUT?" should be removed, in my opinion. Again, congrats for another release of the best database in the world. Pantelis Theodosiou On Thu, May 21, 2020 at 12:44 PM Jonathan S. Katz wrote: > Hi John, > > On 5/21/20 12:12 AM, John Naylor wrote: > > Hi Jon, > > > > I noticed a couple minor inconsistencies: > > > > ".datetime" -> elsewhere functions are formatted as `.datetime()` > > > > libpq -> `libpq` > > > > The link to the release notes on its own line is the same as the > > inline link, if that makes sense. In other places with links on their > > own line, the full URL is in the link text. > > > > Also, for "indexes that contain many repeat values", "repeated" might > > sound better here. It's one of those things that jumped out at me at > > first reading, but when trying both in my head, it seems ok. > > > > Regarding "streaming `pg_basebackup`s", I'm used to the general term > > "base backups" in this usage, which seems a distinct concept from the > > name of the invoked command. > > Thanks for the suggestions. I ended up incorporating all of them. > > Stay tuned for the release... > > Jonathan > >
Re: [PATCH] fix GIN index search sometimes losing results
Hi All! 1. Generally the difference of my patch in comparison to Tom's patch 0001 is that I tried to move previous logic of GIN's own TS_execute_ternary() to the general logic of TS_execute_recurse and in case we have index without positions to avoid diving into phrase operator replacing (only in this case) in by an AND operator. The reason for this I suppose is speed and I've done testing of some corner cases like phrase operator with big number of OR comparisons inside it. - BEFORE ANY PATCH: Bitmap Heap Scan on pglist (cost=1715.72..160233.31 rows=114545 width=1234) (actual time=652.294..2719.961 rows=4904 loops=1) Recheck Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Rows Removed by Index Recheck: 108191 Heap Blocks: exact=73789 -> Bitmap Index Scan on pglist_fts_idx (cost=0.00..1687.09 rows=114545 width=0) (actual time=636.883..636.883 rows=113095 loops=1) Index Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Planning Time: 3.016 ms Execution Time: *2721.002 ms* --- AFTER TOM's PATCH (0001) Bitmap Heap Scan on pglist (cost=1715.72..160233.31 rows=114545 width=1234) (actual time=916.640..2960.571 rows=4904 loops=1) Recheck Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Rows Removed by Index Recheck: 108191 Heap Blocks: exact=73789 -> Bitmap Index Scan on pglist_fts_idx (cost=0.00..1687.09 rows=114545 width=0) (actual time=900.472..900.472 rows=113095 loops=1) Index Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Planning Time: 2.688 ms Execution Time: *2961.704 ms* AFTER MY PATCH (gin-gist-weight-patch-v3) Bitmap Heap Scan on pglist (cost=1715.72..160233.31 rows=114545 width=1234) (actual time=616.982..2710.571 rows=4904 loops=1) Recheck Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | 
''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Rows Removed by Index Recheck: 108191 Heap Blocks: exact=73789 -> Bitmap Index Scan on pglist_fts_idx (cost=0.00..1687.09 rows=114545 width=0) (actual time=601.586..601.586 rows=113095 loops=1) Index Cond: (fts @@ '( ''worth'' | ''good'' | ''result'' | ''index'' | ''anoth'' | ''know'' | ''like'' | ''tool'' | ''job'' | ''think'' | ''slow'' | ''articl'' | ''knowledg'' | ''join'' | ''need'' | ''experi'' | ''understand'' | ''free'' | ''say'' | ''comment'' | ''littl'' | ''move'' | ''function'' | ''new'' | ''never'' | ''general'' | ''get'' | ''java'' | ''postgresql'' | ''notic'' | ''recent'' | ''serious'' ) <-> ''start'''::tsquery) Planning Time: 3.115 ms Execution Time: *2711.533 ms* I've done the test several times and seems that difference is real effect, though not very big (around 7%). So maybe there is some reason to save PHRASE_AS_AND behavior for GIN-GIST indexes despite migration from GIN's own TS_execute_ternary() to general TS_execute_recurse. 2. As for shifting from bool to ternary callback I am not quite sure whether it can be useful to save bool callbacks via bool-ternary wrapper. We can include this for compatibility with old callers and can drop. Any id
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 03:41:22PM +0200, Tomas Vondra wrote: On Tue, May 19, 2020 at 05:12:02PM +0200, Tomas Vondra wrote: ... The problem is that the hashagg plan runs in ~1400 seconds, while the groupagg only takes ~360. And per explain analyze, the difference really is in the aggregation - if we subtract the seqscan, the sort+groupagg takes about 310s: -> GroupAggregate (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1) Group Key: lineitem_1.l_partkey -> Sort (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1) Sort Key: lineitem_1.l_partkey Sort Method: external merge Disk: 3569544kB -> Seq Scan on lineitem lineitem_1 (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1) while the hashagg takes ~1330s: -> HashAggregate (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=640 loops=1) Group Key: lineitem_1.l_partkey Planned Partitions: 128 Peak Memory Usage: 4249 kB Disk Usage: 26321840 kB HashAgg Batches: 16512 -> Seq Scan on lineitem lineitem_1 (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1) And that's while only writing 26GB, compared to 35GB in the sorted plan, and with cost being ~16M vs. ~43M (so roughly inverse). I've noticed I've actually made a mistake here - it's not 26GB vs. 35GB in hash vs. sort, it's 26GB vs. 3.5GB. That is, the sort-based plan writes out *way less* data to the temp file. The reason is revealed by explain verbose: -> GroupAggregate Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity)) Group Key: lineitem_1.l_partkey -> Sort Output: lineitem_1.l_partkey, lineitem_1.l_quantity Sort Key: lineitem_1.l_partkey -> Seq Scan on public.lineitem lineitem_1 Output: lineitem_1.l_partkey, lineitem_1.l_quantity -> HashAggregate Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity)) Group Key: lineitem_1.l_partkey -> Seq Scan on public.lineitem lineitem_1 Output: lineitem_1.l_orderkey, lineitem_1.l_partkey, lineitem_1.l_suppkey, lineitem_1.l_linenumber, lineitem_1.l_quantity, lineitem_1.l_extendedprice, lineitem_1.l_discount, lineitem_1.l_tax, lineitem_1.l_returnflag, lineitem_1.l_linestatus, lineitem_1.l_shipdate, lineitem_1.l_commitdate, lineitem_1.l_receiptdate, lineitem_1.l_shipinstruct, lineitem_1.l_shipmode, lineitem_1.l_comment It seems that in the hashagg case we're not applying projection in the seqscan, forcing us to serialize way much data (the whole lineitem table, essentially). It's probably still worth tweaking the I/O pattern, I think. OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST to CP_SMALL_TLIST) addresses this for me. I've only tried it on the patched version that pre-allocates 128 blocks, and the results seem pretty nice: sort hash hash+tlist -- 4MB 331 478188 128MB 222 434210 which I guess is what we wanted ... I'll give it a try on the other machine (temp on SATA), but I don't see why would it not behave similarly nicely. 
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9941dfe65e..08d43c270e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2118,7 +2118,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 	 * Agg can project, so no need to be terribly picky about child tlist, but
 	 * we do need grouping columns to be available
 	 */
-	subplan = create_plan_recurse(root, best_path->subpath, CP_LABEL_TLIST);
+	subplan = create_plan_recurse(root, best_path->subpath, CP_SMALL_TLIST);
 	tlist = build_path_tlist(root, &best_path->path);
Re: Behaviour of failed Primary
On Thu, May 21, 2020 at 5:38 PM Santhosh Kumar wrote: > > Hi Forum, > If I have a cluster with Synchronous replication enabled with three nodes, > for eg: > > [primary] [hot stand by 1] [host stand by 2] > > And for some unforeseen reasons, if primary fails, the failover will kick in > and hot stand by 1 will become new primary and cluster setup will look like > this > > [new primary (hot stand by1)] [host stand by 2] > > My question here is, what will happen if the original primary which has > failed comes back. Will it become part of this high available replica cluster > automatically or it will be stale and disconnected from the cluster? > It won't become standby automatically as it would have diverged from the new master. > How can we automatically make the failed primary to be part of the cluster > with hot standby role? It would be of great help, if you can direct me to any > references details. Thank you, upfront. > I think pg_rewind can help in such situations. See the docs of pg_rewind [1]. [1] - https://www.postgresql.org/docs/devel/app-pgrewind.html -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
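In practice, resynchronizing the old primary usually means shutting it down cleanly, running pg_rewind against the new primary, and then starting it up again with a standby configuration (primary_conninfo pointing at the new primary). Note that pg_rewind needs wal_log_hints = on or data checksums enabled on the old primary. A rough sketch of the rewind step, with placeholder paths and connection details:

    pg_rewind \
        --target-pgdata=/path/to/old_primary/data \
        --source-server='host=new_primary port=5432 user=postgres dbname=postgres'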
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 02:12:55AM +0200, Tomas Vondra wrote:
> ...
> I agree that's pretty nice. I wonder how far would we need to go before
> reaching a plateau. I'll try this on the other machine with temporary
> tablespace on SATA, but that'll take longer.

OK, I've managed to get some numbers from the other machine, with a 75GB data set and temp tablespace on SATA RAID. I haven't collected I/O data using iosnoop this time, because we already know how that changes from the other machine. I've also only done this with 128MB work_mem, because of how long a single run takes, and with 128 blocks pre-allocation. "patched+tlist" means both pre-allocation and the tlist tweak I've posted to this thread a couple minutes ago (times in seconds):

           master    patched    patched+tlist
  -------------------------------------------
  sort        485        472              462
  hash      24686       3060              559

So the pre-allocation makes it 10x faster, and the tlist tweak makes it 5x faster. Not bad, I guess.

Note: I've slightly tweaked read-ahead on the RAID device(s) on those patched runs, but the effect was pretty negligible (compared to other patched runs with the old read-ahead setting).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PostgreSQL 13 Beta 1 Release Announcement Draft
On Thu, May 21, 2020 at 3:20 PM Pantelis Theodosiou wrote: > Congrats to all for the release of a new major version! > > Two questions: > - Why is VACUUM together with FETCH FIRST WITH TIES, CREATE TABLE LIKE, > ALTER VIEW, ALTER TABLE, etc in Utility Commands section? > Shouldn't there be a separate section for SQL changes? (or keep one > section but rename the Utility to include all?) > > > Add FOREIGN to ALTER statements, if appropriate (Luis Carril) > > > WHAT IS THIS ABOUT? > - The "WHAT IS THIS ABOUT?" should be removed, in my opinion. > > Again, congrats for another release of the best database in the world. > > Pantelis Theodosiou > > On Thu, May 21, 2020 at 12:44 PM Jonathan S. Katz > wrote: > >> >> Thanks for the suggestions. I ended up incorporating all of them. >> >> Stay tuned for the release... >> >> Jonathan >> >> Apologies, I realized a minute too late that my comments are about the Release Notes and not the Announcement. However, since the link to Notes makes them no visible to more eyes, they could be checked again. Pantelis Theodosiou
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 10:45 AM Tomas Vondra wrote: > So the pre-allocation makes it 10x faster, and the tlist tweak makes it > 5x faster. Not bad, I guess. That is pretty great stuff, Tomas. FWIW, I agree that CP_SMALL_TLIST seems like the right thing here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_bsd_indent and -Wimplicit-fallthrough
On 18/05/2020 11.22, Julien Rouhaud wrote: On Sun, May 17, 2020 at 2:32 AM Michael Paquier wrote: On Sat, May 16, 2020 at 11:56:28AM -0400, Tom Lane wrote: In the meantime, I went ahead and pushed this to our pg_bsd_indent repo. Thanks, Tom. +1, thanks a lot! Committed upstream, thank you.
Re: POC: rational number type (fractions)
On Mon, May 18, 2020 at 6:15 PM Tom Lane wrote: > There surely are use-cases for true rational arithmetic, but I'm > dubious that it belongs in core Postgres. I don't think that enough > of our users would want it to justify expending core-project maintenance > effort on it. So I'd be happier to see this as an out-of-core extension. As is often the case, I'm a little more positive about including this than Tom, but as is also often the case, I'm somewhat cautious, too. On the one hand, I think it would be cool to have and people would like it. But, On the other hand, I also think we'd only want it if we're convinced that it's a really good implementation and that there's not a competing design which is better, or even equally good. Those things don't seem too clear at this point, so I hope Jeff and Joe keep chatting about it ... and maybe some other people who are knowledgeable about this will chime in, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, 2020-05-21 at 16:30 +0200, Tomas Vondra wrote: > OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST > to > CP_SMALL_TLIST) addresses this for me. Great! There were a couple plan changes where it introduced a Subquery Scan. I'm not sure that I understand why it's doing that, can you verify that it is a reasonable thing to do? Aside from that, feel free to commit it. Regards, Jeff Davis
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 11:19:01AM -0700, Jeff Davis wrote: On Thu, 2020-05-21 at 16:30 +0200, Tomas Vondra wrote: OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST to CP_SMALL_TLIST) addresses this for me. Great! There were a couple plan changes where it introduced a Subquery Scan. I'm not sure that I understand why it's doing that, can you verify that it is a reasonable thing to do? Aside from that, feel free to commit it. It's doing that because we're doing projection everywhere, even in cases when it may not be necessary - but I think that's actually OK. At first I thought we might only do it conditionally when we expect to spill to disk, but that'd not work for cases when we only realize we need to spill to disk during execution. So I think the plan changes are correct and expected. I think we should do the pre-allocation patch too. I haven't tried yet but I believe the tlist fix alone won't do nearly as good. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 08:34:05PM +0200, Tomas Vondra wrote:
> On Thu, May 21, 2020 at 11:19:01AM -0700, Jeff Davis wrote:
> ...
> I think we should do the pre-allocation patch too. I haven't tried yet
> but I believe the tlist fix alone won't do nearly as good.

I've done some measurements on the smaller (SSD) machine, and the comparison looks like this (times in seconds):

           sort    hash    hash+prealloc+tlist    hash+tlist
  4MB       331     478                    188           330
  128MB     222     434                    210           350

The last column is master with the tlist tweak alone - it's better than hashagg on master alone, but it's not nearly as good as with both tlist and prealloc patches.

I can't test this on the larger box with SATA temporary tablespace at the moment (other tests are running), but I believe the difference will be even more pronounced there.

I don't think we're under a lot of pressure - beta1 is out anyway, so we have time to do proper testing first.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, 2020-05-21 at 20:54 +0200, Tomas Vondra wrote:
> The last column is master with the tlist tweak alone - it's better
> than hashagg on master alone, but it's not nearly as good as with
> both tlist and prealloc patches.

Right, I certainly think we should do the prealloc change as well. I'm tweaking the patch to be a bit more flexible. I'm thinking we should start the preallocation list size at ~8 and then double it up to ~128 (depending on your results). That would reduce the waste in case we have a large number of small partitions.

Regards,
Jeff Davis
Re: Trouble with hashagg spill I/O pattern and costing
On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:
> On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
> > And is there a way to pre-allocate larger chunks? Presumably we could
> > assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x 8kB)
> > instead of just a single block. I haven't seen anything like that in
> > tape.c, though ...
>
> It turned out to be simple (at least a POC) so I threw together a patch.
> I just added a 32-element array of block numbers to each tape. When we
> need a new block, we retrieve a block number from that array; or if it's
> empty, we fill it by calling ltsGetFreeBlock() 32 times.

I think the PoC patch goes in the right direction. I have two ideas how to improve it a bit:

1) Instead of assigning the pages one by one, we can easily extend the API to allow getting a range of blocks, so that we don't need to call ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange with the requested number of blocks. And we could keep just a min/max of free blocks, not an array with a fixed number of elements.

2) We could make it self-tuning, by increasing the number of blocks we pre-allocate. So every time we exhaust the range, we double the number of blocks (with a reasonable maximum, like 1024 or so). Or we might just increment it by 32, or something.

IIUC the danger of pre-allocating blocks is that we might not fill them, resulting in a temp file much larger than necessary. It might be harmless on some (most?) current filesystems that don't actually allocate space for blocks that are never written, but it also confuses our accounting of temporary file sizes. So we should try to limit that, and growing the number of pre-allocated blocks over time seems reasonable.

Both (1) and (2) seem fairly simple, not much more complex than the current PoC patch.

I also wonder if we could collect / report useful statistics about I/O on the temporary file, not just the size. I mean, how many pages we've written/read, how sequential it was, etc. But some of that is probably only visible at the OS level (e.g. we have no insight into how the kernel combines writes in the page cache, etc.). This is clearly a matter for v14, though.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 12:04:19PM -0700, Jeff Davis wrote: On Thu, 2020-05-21 at 20:54 +0200, Tomas Vondra wrote: The last column is master with the tlist tweak alone - it's better than hashagg on master alone, but it's not nearly as good as with both tlist and prealloc patches. Right, I certainly think we should do the prealloc change, as well. I'm tweaking the patch to be a bit more flexible. I'm thinking we should start the preallocation list size ~8 and then double it up to ~128 (depending on your results). That would reduce the waste in case we have a large number of small partitions. You're reading my mind ;-) I don't think 128 is necessarily the maximum we should use - it's just that I haven't tested higher values. I wouldn't be surprised if higher values made it a bit faster. But we can test and tune that, I agree with growing the number of pre-allocted blocks over time. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote: > 1) Instead of assigning the pages one by one, we can easily extend > the > API to allow getting a range of blocks, so that we don't need to call > ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange > with the requested number of blocks. ltsGetFreeBlock() just draws one element from a minheap. Is there some more efficient way to get many elements from a minheap at once? > And we could keep just a min/max > of free blocks, not an array with fixed number of elements. I don't quite know what you mean. Can you elaborate? Regards, Jeff Davis
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
> On Tue, Apr 14, 2020 at 09:09:31PM +1200, David Rowley wrote:
>
> The infrastructure (knowing the unique properties of a RelOptInfo), as
> provided by the patch Andy has been working on, which is based on my
> rough prototype version, I believe should be used for the skip scans
> patch as well.

Hi,

Following our agreement about making the skip scan patch use the UniqueKeys implementation from this thread, I've rebased index skip scan on the first two patches from the v8 series [1] (if I understand correctly, those two introduce the concept, and the others are just using it). I would like to clarify a couple of things:

* It seems populate_baserel_uniquekeys, which actually sets uniquekeys, is called after create_index_paths, where index skip scan already needs to use them. Is it possible to call it earlier?

* Do I understand correctly that a UniqueKey would be created, in the simplest case, only when an index is unique? This makes it different from what was implemented for index skip scan, since there UniqueKeys also represent the potential to use a non-unique index to facilitate searching for unique values via skipping.

[1]: https://www.postgresql.org/message-id/CAKU4AWpOM3_J-B%3DwQtCeU1TGr89MhpJBBkv2he1tAeQz6i4XNw%40mail.gmail.com
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, 22 May 2020 at 07:49, Dmitry Dolgov <9erthali...@gmail.com> wrote: > * It seems populate_baserel_uniquekeys, which actually sets uniquekeys, > is called after create_index_paths, where index skip scan already > needs to use them. Is it possible to call it earlier? Seems reasonable. I originally put it after check_index_predicates(). We need to wait until at least then so we can properly build UniqueKeys for partial indexes. > * Do I understand correctly, that a UniqueKey would be created in a > simplest case only when an index is unique? This makes it different > from what was implemented for index skip scan, since there UniqueKeys > also represents potential to use non-unique index to facilitate search > for unique values via skipping. The way I picture the two working together is that the core UniqueKey patch adds UniqueKeys to RelOptInfos to allow us to have knowledge about what they're unique on based on the base relation's unique indexes. For Skipscans, that patch must expand on UniqueKeys to teach Paths about them. I imagine we'll set some required UniqueKeys during standard_qp_callback() and then we'll try to create some Skip Scan paths (which are marked with UniqueKeys) if the base relation does not already have UniqueKeys that satisfy the required UniqueKeys that were set during standard_qp_callback(). In the future, there may be other reasons to create Skip Scan paths for certain rels, e.g if they're on the inner side of a SEMI/ANTI join, it might be useful to try those when planning joins. David
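For context, the sort of query both patches are ultimately aimed at looks like the one below: with a multi-column index, the distinct values of the leading column can be found by skipping from one value to the next in the index instead of reading every entry, and UniqueKeys are what let the planner reason about when such a path (or dropping a DISTINCT entirely) is legal. The table here is just an illustration:

    CREATE TABLE measurements (sensor_id int, recorded_at timestamptz, reading numeric);
    CREATE INDEX ON measurements (sensor_id, recorded_at);

    -- A skip scan path would target queries like this, jumping from one
    -- sensor_id to the next in the index rather than scanning all entries:
    SELECT DISTINCT sensor_id FROM measurements;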
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Thu, 14 May 2020 at 14:39, Andy Fan wrote: > > On Thu, May 14, 2020 at 6:20 AM David Rowley wrote: >> Having the "onerow" flag was not how I intended it to work. >> > Thanks for the detailed explanation. So I think we do need to handle onerow > specially, (It means more things than adding each column as a UniqueKey). > but we don't need the onerow flag since we can tell it by ukey->exprs == NIL. > > During the developer of this feature, I added some Asserts as double checking > for onerow and exprs. it helps me to find some special cases. like > SELECT FROM multirows union SELECT FROM multirows; where targetlist is NIL. > (I find the above case returns onerow as well just now). so onerow flag > allows us > check this special things with more attention. Even this is not the original > intention > but looks it is the one purpose now. But surely that special case should just go in populate_unionrel_uniquekeys(). If the targetlist is empty, then add a UniqueKey with an empty set of exprs. > However I am feeling that removing onerow flag doesn't require much of code > changes. Almost all the special cases which are needed before are still needed > after that and all the functions based on that like relation_is_onerow > /add_uniquekey_onerow is still valid, we just need change the implementation. > so do you think we need to remove onerow flag totally? Well, at the moment I'm not quite understanding why it's needed. If it's not needed then we should remove it. If it turns out there is some valid reason for it, then we should keep it. David
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 12:40:23PM -0700, Jeff Davis wrote: On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote: 1) Instead of assigning the pages one by one, we can easily extend the API to allow getting a range of blocks, so that we don't need to call ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange with the requested number of blocks. ltsGetFreeBlock() just draws one element from a minheap. Is there some more efficient way to get many elements from a minheap at once? And we could keep just a min/max of free blocks, not an array with fixed number of elements. I don't quite know what you mean. Can you elaborate?

Ah, I forgot there's an internal minheap thing - I thought we were just incrementing some internal counter or something like that, but with the minheap we can't just get a range of blocks. So just disregard that, you're right we need the array.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote: > 2) We could make it self-tuning, by increasing the number of blocks > we pre-allocate. So every time we exhaust the range, we double the > number of blocks (with a reasonable maximum, like 1024 or so). Or we > might just increment it by 32, or something. Attached a new version that uses the doubling behavior, and cleans it up a bit. It also returns the unused prealloc blocks back to lts- >freeBlocks when the tape is rewound for reading. > IIUC the danger of pre-allocating blocks is that we might not fill > them, > resulting in temp file much larger than necessary. It might be > harmless > on some (most?) current filesystems that don't actually allocate > space > for blocks that are never written, but it also confuses our > accounting > of temporary file sizes. So we should try to limit that, and growing > the > number of pre-allocated blocks over time seems reasonable. There's another danger here: it doesn't matter how well the filesystem deals with sparse writes, because ltsWriteBlock fills in the holes with zeros anyway. That's potentially a significant amount of wasted IO effort if we aren't careful. Regards, Jeff Davis diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c index bc8d56807e2..84f9d6d2358 100644 --- a/src/backend/utils/sort/logtape.c +++ b/src/backend/utils/sort/logtape.c @@ -110,6 +110,8 @@ typedef struct TapeBlockTrailer #define TapeBlockSetNBytes(buf, nbytes) \ (TapeBlockGetTrailer(buf)->next = -(nbytes)) +#define TAPE_WRITE_PREALLOC_MIN 8 +#define TAPE_WRITE_PREALLOC_MAX 128 /* * This data structure represents a single "logical tape" within the set @@ -151,6 +153,21 @@ typedef struct LogicalTape int max_size; /* highest useful, safe buffer_size */ int pos; /* next read/write position in buffer */ int nbytes; /* total # of valid bytes in buffer */ + + /* + * When multiple tapes are being written to concurrently (as in HashAgg), + * avoid excessive fragmentation by preallocating block numbers to + * individual tapes. The preallocated block numbers are held in an array + * sorted in descending order; blocks are consumed from the end of the + * array (lowest block numbers first). + * + * No filesystem operations are performed for preallocation; only the + * block numbers are reserved. This may lead to sparse writes, which will + * cause ltsWriteBlock() to fill in holes with zeros. + */ + long *prealloc; + int nprealloc; /* number of elements in list */ + int prealloc_size; /* number of elements list can hold */ } LogicalTape; /* @@ -198,6 +215,7 @@ struct LogicalTapeSet static void ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer); static void ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer); static long ltsGetFreeBlock(LogicalTapeSet *lts); +static long ltsGetPreallocBlock(LogicalTapeSet *lts, LogicalTape *lt); static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum); static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared, SharedFileSet *fileset); @@ -397,6 +415,51 @@ ltsGetFreeBlock(LogicalTapeSet *lts) return blocknum; } +/* + * Return the lowest free block number from the tape's preallocation list. + * Refill the preallocation list if necessary. 
+ */ +static long +ltsGetPreallocBlock(LogicalTapeSet *lts, LogicalTape *lt) +{ + /* sorted in descending order, so return the last element */ + if (lt->nprealloc > 0) + return lt->prealloc[--lt->nprealloc]; + + if (lt->prealloc == NULL) + { + lt->prealloc_size = TAPE_WRITE_PREALLOC_MIN; + lt->prealloc = (long *) palloc(sizeof(long) * lt->prealloc_size); + } + else + { + /* when the preallocation list runs out, double the size */ + if (lt->prealloc_size * 2 <= TAPE_WRITE_PREALLOC_MAX) + lt->prealloc_size *= 2; + lt->prealloc = (long *) repalloc(lt->prealloc, + sizeof(long) * lt->prealloc_size); + } + + /* refill preallocation list */ + lt->nprealloc = lt->prealloc_size; + for (int i = lt->nprealloc; i > 0; i--) + lt->prealloc[i - 1] = ltsGetFreeBlock(lts); + +#ifdef USE_ASSERT_CHECKING + { + /* verify that list is in reverse sorted order */ + long last_block_number = -1; + for (int i = lt->nprealloc; i > 0; i--) + { + Assert(lt->prealloc[i - 1] > last_block_number); + last_block_number = lt->prealloc[i - 1]; + } + } +#endif + + return lt->prealloc[--lt->nprealloc]; +} + /* * Return a block# to the freelist. */ @@ -557,6 +620,9 @@ ltsInitTape(LogicalTape *lt) lt->max_size = MaxAllocSize; lt->pos = 0; lt->nbytes = 0; + lt->prealloc = NULL; + lt->nprealloc = 0; + lt->prealloc_size = 0; } /* @@ -709,7 +775,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum, Assert(lt->firstBlockNumber == -1); Assert(lt->pos == 0); - lt->curBlockNumber = ltsGetFreeBlock(lts); + lt->curBlockNumber = ltsGetPreallocBlock(lts, lt); lt->firstBlockNumber = lt->curBlockNumber; TapeBlockGetTrail
Re: pgindent && weirdness
On 16/01/2020 03.59, Thomas Munro wrote: On Wed, Jan 15, 2020 at 11:30 AM Tom Lane wrote: Alvaro Herrera writes: I just ran pgindent over some patch, and noticed that this hunk ended up in my working tree: - if (IsA(leftop, Var) && IsA(rightop, Const)) + if (IsA(leftop, Var) &&IsA(rightop, Const)) Yeah, it's been doing that for decades. I think the triggering factor is the typedef name (Var, here) preceding the &&. It'd be nice to fix properly, but I've tended to take the path of least resistance by breaking such lines to avoid the ugliness: if (IsA(leftop, Var) && IsA(rightop, Const)) I am on vacation away from the Internet this week but somehow saw this on my phone and couldn't stop myself from peeking at pg_bsd_ident again. Yeah, "(Var)" (where Var is a known typename) causes it to think that any following operator must be unary. One way to fix that in the cases Alvaro is referring to is to tell override the setting so that && (and likewise ||) are never considered to be unary, though I haven't tested this much and there are surely other ways to achieve this: diff --git a/lexi.c b/lexi.c index d43723c..6de3227 100644 --- a/lexi.c +++ b/lexi.c @@ -655,6 +655,12 @@ stop_lit: unary_delim = state->last_u_d; break; } + + /* && and || are never unary */ + if ((token[0] == '&' && *buf_ptr == '&') || + (token[0] == '|' && *buf_ptr == '|')) + state->last_u_d = false; + while (*(e_token - 1) == *buf_ptr || *buf_ptr == '=') { /* * handle ||, &&, etc, and also things as in int *i The problem with that is that && sometimes *should* be formatted like a unary operator: when it's part of the nonstandard GCC computed goto syntax. These comments are made in the context of pushing this change or equivalent to FreeBSD repository. I think this is a better approach then the one committed to pg_bsd_indent. It's ubiquitous that the operators are binary, except - as you mentioned - in a nonstandard GCC syntax. The alternative given has more disadvantages, with potential impact on FreeBSD code formatting, which it should support as well as everything else -- to a reasonable extent. sys/kern/ is usually a decent litmus test, but I don't claim it should show anything interesting in this particular case. This change may seem hacky, but it would be far from the worst hack in this program's history or even in its present form. It's actually very much in indent's spirit, which is an attribute I neither support nor condemn. In any case, this change, or equivalent, should be committed to FreeBSD repository together with a test case or two.
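For reference, a minimal example of the nonstandard GCC computed-goto syntax mentioned above, where && genuinely acts as a unary operator (taking the address of a label), so a rule that unconditionally treats && as binary would misformat it:

/* GNU C only: &&label yields the address of a label */
void
computed_goto_example(int i)
{
	static void *dispatch[] = {&&handle_zero, &&handle_other};

	goto *dispatch[i != 0];

handle_zero:
	return;

handle_other:
	return;
}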
Re: Trouble with hashagg spill I/O pattern and costing
On Thu, May 21, 2020 at 02:16:37PM -0700, Jeff Davis wrote: On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote: 2) We could make it self-tuning, by increasing the number of blocks we pre-allocate. So every time we exhaust the range, we double the number of blocks (with a reasonable maximum, like 1024 or so). Or we might just increment it by 32, or something. Attached a new version that uses the doubling behavior, and cleans it up a bit. It also returns the unused prealloc blocks back to lts- freeBlocks when the tape is rewound for reading. Ah, the returning is a nice idea, that should limit the overhead quite a bit, I think. IIUC the danger of pre-allocating blocks is that we might not fill them, resulting in temp file much larger than necessary. It might be harmless on some (most?) current filesystems that don't actually allocate space for blocks that are never written, but it also confuses our accounting of temporary file sizes. So we should try to limit that, and growing the number of pre-allocated blocks over time seems reasonable. There's another danger here: it doesn't matter how well the filesystem deals with sparse writes, because ltsWriteBlock fills in the holes with zeros anyway. That's potentially a significant amount of wasted IO effort if we aren't careful. True. I'll give it a try on both machines and report some numbers. Might take a couple of days. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
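To illustrate the "return unused blocks on rewind" part described above -- a hedged sketch only, the actual hunk in the attached patch may look different, and ltsReturnPreallocBlocks is an invented name -- the idea is just to hand any block numbers still sitting in the tape's preallocation list back to the set's freelist, since nothing has been written to them yet:

static void
ltsReturnPreallocBlocks(LogicalTapeSet *lts, LogicalTape *lt)
{
	/* give never-written, still-reserved block numbers back to the freelist */
	while (lt->nprealloc > 0)
		ltsReleaseBlock(lts, lt->prealloc[--lt->nprealloc]);

	if (lt->prealloc != NULL)
	{
		pfree(lt->prealloc);
		lt->prealloc = NULL;
		lt->prealloc_size = 0;
	}
}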
Re: Parallel Seq Scan vs kernel read ahead
On Thu, 21 May 2020 at 17:06, David Rowley wrote: > For the patch. I know you just put it together quickly, but I don't > think you can do that ramp up the way you have. It looks like there's > a risk of torn reads and torn writes and I'm unsure how much that > could affect the test results here. Oops. On closer inspection, I see that memory is per worker, not global to the scan.
Re: Parallel Seq Scan vs kernel read ahead
On Fri, May 22, 2020 at 10:00 AM David Rowley wrote: > On Thu, 21 May 2020 at 17:06, David Rowley wrote: > > For the patch. I know you just put it together quickly, but I don't > > think you can do that ramp up the way you have. It looks like there's > > a risk of torn reads and torn writes and I'm unsure how much that > > could affect the test results here. > > Oops. On closer inspection, I see that memory is per worker, not > global to the scan.

Right, I think it's safe. I think you were probably right that ramp-up isn't actually useful though, it's only the end of the scan that requires special treatment so we don't get unfair allocation as the work runs out, due to the coarse grain. I suppose that even if you have a scheme that falls back to fine grained allocation for the final N pages, it's still possible that a highly distracted process (most likely the leader given its double duties) can finish up sitting on a large range of pages and eventually have to process them all at the end after the other workers have already knocked off and gone for a pint.
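A sketch of the sort of end-of-scan fallback being discussed -- entirely illustrative: the 64-block chunk, the 4x threshold and choose_chunk_size() are assumptions, not the test patch. Hand out coarse chunks for most of the scan, then drop to single blocks near the end so no worker is left holding a long tail:

static BlockNumber
choose_chunk_size(ParallelBlockTableScanDesc pbscan, BlockNumber allocated)
{
	BlockNumber remaining = pbscan->phs_nblocks - allocated;

	/* coarse-grained allocation while plenty of blocks remain */
	if (remaining > 4 * 64)
		return 64;

	/* fine-grained allocation for the final stretch of the relation */
	return 1;
}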
Re: pgindent && weirdness
Piotr Stefaniak writes: > On 16/01/2020 03.59, Thomas Munro wrote: >> One way to fix that in the cases Alvaro is referring to is to tell >> override the setting so that && (and likewise ||) are never considered >> to be unary, though I haven't tested this much and there are surely >> other ways to achieve this: > I think this is a better approach then the one committed to > pg_bsd_indent. It's ubiquitous that the operators are binary, except - > as you mentioned - in a nonstandard GCC syntax. The alternative given > has more disadvantages, with potential impact on FreeBSD code > formatting, which it should support as well as everything else -- to a > reasonable extent. sys/kern/ is usually a decent litmus test, but I > don't claim it should show anything interesting in this particular case. I think that the fix we chose is better, at least for our purposes. You can see its effects on our source tree at https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fa27dd40d5c5f56a1ee837a75c97549e992e32a4 and while certainly most of the diffs are around && or || operators, there are a fair number that are not, such as - dummy_lex.input = unconstify(char *, str) +1; + dummy_lex.input = unconstify(char *, str) + 1; or more interestingly - strncmp(text, "pg_", 3) !=0) + strncmp(text, "pg_", 3) != 0) where the previous misformatting is because "text" is a known typedef name. So it appears to me that this change reduces the misformatting cost of typedef names that chance to match field or variable names, and that's actually quite a big issue for us. We have, ATM, 3574 known typedefs in typedefs.list, a fair number of which are not under our control (since they come from system headers on various platforms). So it's inevitable that there are going to be collisions. In short, I'm inclined to stick with the hack we've got. I'm sad that it will result in further divergence from FreeBSD indent; but it does useful stuff for us, and I don't want to give it up. (That said, I don't see any penalty to carrying both changes; so we'll probably also absorb the &&/|| change at some convenient time.) regards, tom lane
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, May 22, 2020 at 4:40 AM David Rowley wrote: > On Fri, 22 May 2020 at 07:49, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > * It seems populate_baserel_uniquekeys, which actually sets uniquekeys, > > is called after create_index_paths, where index skip scan already > > needs to use them. Is it possible to call it earlier? > > Seems reasonable. I originally put it after check_index_predicates(). > We need to wait until at least then so we can properly build > UniqueKeys for partial indexes. > > Looks a very valid reason, I will add this in the next version. > > * Do I understand correctly, that a UniqueKey would be created in a > > simplest case only when an index is unique? This makes it different > > from what was implemented for index skip scan, since there UniqueKeys > > also represents potential to use non-unique index to facilitate search > > for unique values via skipping. > > The way I picture the two working together is that the core UniqueKey > patch adds UniqueKeys to RelOptInfos to allow us to have knowledge > about what they're unique on based on the base relation's unique > indexes. > For Skipscans, that patch must expand on UniqueKeys to teach Paths > about them. I imagine we'll set some required UniqueKeys during > standard_qp_callback() and then we'll try to create some Skip Scan > paths (which are marked with UniqueKeys) if the base relation does not > already have UniqueKeys that satisfy the required UniqueKeys that were > set during standard_qp_callback(). In the future, there may be other > reasons to create Skip Scan paths for certain rels, e.g if they're on > the inner side of a SEMI/ANTI join, it might be useful to try those > when planning joins. > > Yes, In current implementation, we also add UniqueKey during create_xxx_paths, xxx may be grouping/union. after the index skipscan patch, we can do the similar things in create_indexskip_path. -- Best Regards Andy Fan
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
> if I understand correctly those two are introducing the concept, and others are just using it

You understand it correctly.

-- Best Regards Andy Fan
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, May 22, 2020 at 4:52 AM David Rowley wrote: > On Thu, 14 May 2020 at 14:39, Andy Fan wrote: > > > > On Thu, May 14, 2020 at 6:20 AM David Rowley > wrote: > >> Having the "onerow" flag was not how I intended it to work. > >> > > Thanks for the detailed explanation. So I think we do need to handle > onerow > > specially, (It means more things than adding each column as a UniqueKey). > > but we don't need the onerow flag since we can tell it by ukey->exprs == > NIL. > > > > During the developer of this feature, I added some Asserts as double > checking > > for onerow and exprs. it helps me to find some special cases. like > > SELECT FROM multirows union SELECT FROM multirows; where targetlist is > NIL. > > (I find the above case returns onerow as well just now). so onerow flag > allows us > > check this special things with more attention. Even this is not the > original intention > > but looks it is the one purpose now. > > But surely that special case should just go in > populate_unionrel_uniquekeys(). If the targetlist is empty, then add a > UniqueKey with an empty set of exprs. > > This is correct on this special case. > However I am feeling that removing onerow flag doesn't require much of > code > > changes. Almost all the special cases which are needed before are still > needed > > after that and all the functions based on that like relation_is_onerow > > /add_uniquekey_onerow is still valid, we just need change the > implementation. > > so do you think we need to remove onerow flag totally? > > Well, at the moment I'm not quite understanding why it's needed. If > it's not needed then we should remove it. If it turns out there is > some valid reason for it, then we should keep it. > Currently I uses it to detect more special case which we can't image at first, we can understand it as it used to debug/Assert purpose only. After the mainly code is reviewed, that can be removed (based on the change is tiny). -- Best Regards Andy Fan
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Thu, 21 May 2020 at 00:56, Simon Riggs wrote: > I thought the main reason to do this was the case when the nested loop > subplan was significantly underestimated and we realize during execution that > we should have built a hash table. So including this based on cost alone > seems to miss a trick.

Isn't that mostly because the planner tends to choose a non-parameterized nested loop when it thinks the outer side of the join has just 1 row? If so, I'd say that's a separate problem, as Result Cache only deals with parameterized nested loops.

Perhaps the problem you mention could be fixed by adding some "uncertainty degree" to the selectivity estimate function and having it return that along with the selectivity. We'd likely not want to choose an unparameterized nested loop when the uncertainty level is high. Multiplying the selectivity of different selectivity estimates could raise the uncertainty level by a magnitude.

For plans where the planner chooses to use a non-parameterized nested loop due to having just 1 row on the outer side of the loop, it's taking a huge risk. The cost of putting the 1 row on the inner side of a hash join would barely cost anything extra during execution. Hashing 1 row is pretty cheap and performing a lookup on that hashed row is not much more expensive than evaluating the qual of the nested loop. Really, it just requires the additional hash function calls. Having the uncertainty degree I mentioned above would allow us to only have the planner do that when the uncertainty degree indicates it's not worth the risk.

David
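Purely to illustrate the "uncertainty degree" idea above -- nothing like this exists in the planner today; the struct and function are invented for the sketch -- an estimate could carry a risk measure that grows as estimates are combined:

typedef struct SelectivityEstimate
{
	Selectivity sel;			/* estimated fraction of rows */
	double		uncertainty;	/* 0 = exact, larger = riskier */
} SelectivityEstimate;

static SelectivityEstimate
combine_estimates(SelectivityEstimate a, SelectivityEstimate b)
{
	SelectivityEstimate result;

	result.sel = a.sel * b.sel;
	/* multiplying independent guesses compounds the risk of being badly wrong */
	result.uncertainty = a.uncertainty + b.uncertainty + 1.0;
	return result;
}

Join planning could then refuse an unparameterized nested loop whenever the outer side's uncertainty exceeds some threshold, even if the row estimate is 1.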
Re: [Proposal] Page Compression for OLTP
At 2020-05-21 15:04:55, "Fabien COELHO" wrote: > >Hello, > >My 0.02, some of which may just show some misunderstanding on my part: > > - Could this be proposed as some kind of extension, provided that enough > hooks are available? ISTM that foreign tables and/or alternative > storage engine (aka ACCESS METHOD) provide convenient APIs which could > fit the need for these? Or are they not appropriate? You seem to > suggest that there are not. > > If not, what could be done to improve API to allow what you are seeking > to do? Maybe you need a somehow lower-level programmable API which does > not exist already, or at least is not exported already, but could be > specified and implemented with limited effort? Basically you would like > to read/write pg pages to somewhere, and then there is the syncing > issue to consider. Maybe such a "page storage" API could provide > benefit for some specialized hardware, eg persistent memory stores, > so there would be more reason to define it anyway? I think it might > be valuable to give it some thoughts. Thank you for giving so many comments. In my opinion, developing a foreign table or a new storage engine, in addition to compression, also needs to do a lot of extra things. A similar explanation was mentioned in Nikolay P's email. The "page storage" API may be a good choice, and I will consider it, but I have not yet figured out how to implement it. > - Could you maybe elaborate on how your plan differs from [4] and [5]? My solution is similar to CFS, and it is also embedded in the file access layer (fd.c, md.c) to realize the mapping from block number to the corresponding file and location where compressed data is stored. However, the most important difference is that I hope to avoid the need for GC through the design of the page layout. https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net >> The most difficult thing in CFS development is certainly >> defragmentation. In CFS it is done using background garbage collection, >> by one or one >> GC worker processes. The main challenges were to minimize its >> interaction with normal work of the system, make it fault tolerant and >> prevent unlimited growth of data segments. >> CFS is not introducing its own storage manager, it is mostly embedded in >> existed Postgres file access layer (fd.c, md.c). It allows to reused >> code responsible for mapping relations and file descriptors cache. As it >> was recently discussed in hackers, it may be good idea to separate the >> questions "how to map blocks to filenames and offsets" and "how to >> actually perform IO". In this it will be easier to implement compressed >> storage manager. > - Have you consider keeping page headers and compressing tuple data > only? In that case, we must add some additional information in the page header to identify whether this is a compressed page or an uncompressed page. When a compressed page becomes an uncompressed page, or vice versa, an uncompressed page becomes a compressed page, the original page header must be modified. This is unacceptable because it requires modifying the shared buffer and recalculating the checksum. However, it should be feasible to put this flag in the compressed address file. The problem with this is that even if a page only occupies the size of one compressed block, the address file needs to be read, that is, from 1 IO to 2 IO. Since the address file is very small, it is basically a memory access, this cost may not be as large as I had imagined. 
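As a hedged sketch of what such a per-page entry in the compressed address file might look like -- the struct name, field names and the four-chunk limit are illustrative assumptions, not the proposed on-disk format:

typedef struct CompressAddrEntry
{
	uint8		nchunks;		/* 0 means the page is stored uncompressed */
	uint32		chunkno[4];		/* chunk numbers inside the compressed data file */
} CompressAddrEntry;

Reading block N would then mean reading entry N from the address file (usually cached in memory) and then the listed chunks, which is the 1 IO vs. 2 IO trade-off mentioned above.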
> - I'm not sure there is a point in going below the underlying file > system blocksize, quite often 4 KiB? Or maybe yes? Or is there > a benefit to aim at 1/4 even if most pages overflow? My solution is mainly optimized for scenarios where the original page can be compressed to only require one compressed block of storage. The scene where the original page is stored in multiple compressed blocks is suitable for scenarios that are not particularly sensitive to performance, but are more concerned about the compression rate, such as cold data. In addition, users can also choose to compile PostgreSQL with 16KB or 32KB BLOCKSZ. > - ISTM that your approach entails 3 "files". Could it be done with 2? > I'd suggest that the possible overflow pointers (coa) could be part of > the headers so that when reading the 3.1 page, then the header would > tell where to find the overflow 3.2, without requiring an additional > independent structure with very small data in it, most of it zeros. > Possibly this is not possible, because it would require some available > space in standard headers when the is page is not compressible, and > there is not enough. Maybe creating a little room for that in > existing headers (4 bytes could be enough?) would be a good compromise. > Hmmm. Maybe the approach I suggest would only work for 1/2 compression, > but not for other target ratios, but I thin
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
> My question is whether it should be added as an optional facility of a > parameterised sub plan, rather than an always-needed full-strength node. > That way the choice of whether to use it can happen at execution time once > we notice that we've been called too many times.

Actually, I am not sure what "parameterized sub plan" means here (I read it as a SubPlan node), so please correct me if I misunderstand you :) The inner plan in a nested loop is not actually a SubPlan node, so if we bind the facility to the SubPlan node, we may lose the chance to use it for nested loops. And when we consider the usage for nested loops, the example below shows where this feature will be more powerful.

select j1o.i, j2_v.sum_5 from j1 j1o inner join lateral (select im100, sum(im5) as sum_5 from j2 where j1o.im100 = im100 and j1o.i = 1 group by im100) j2_v on true where j1o.i = 1;

-- Best Regards Andy Fan
Re: Parallel Seq Scan vs kernel read ahead
Hi Thomas, Some more data points:

create table t_heap as select generate_series(1, 1) i;

Query: select count(*) from t_heap;

shared_buffers=32MB (so that I don't have to clear buffers, OS page cache)
OS: FreeBSD 12.1 with UFS on GCP
4 vCPUs, 4GB RAM Intel Skylake
22G Google PersistentDisk

Time is measured with \timing on.

Without your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  33.88s
1                                  57.62s
2                                  62.01s
6                                  222.94s

With your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  29.04s
1                                  29.17s
2                                  28.78s
6                                  291.27s

I checked with explain analyze to ensure that the number of workers planned = max_parallel_workers_per_gather

Apart from the last result (max_parallel_workers_per_gather=6), all the other results seem favorable. Could the last result be down to the fact that the number of workers planned exceeded the number of vCPUs?

I also wanted to evaluate Zedstore with your patch. I used the same setup as above. No discernible difference though, maybe I'm missing something:

Without your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  25.86s
1                                  15.70s
2                                  12.60s
6                                  12.41s

With your patch:

max_parallel_workers_per_gather    Time (seconds)
0                                  26.96s
1                                  15.73s
2                                  12.46s
6                                  12.10s

-- Soumyadeep

On Thu, May 21, 2020 at 3:28 PM Thomas Munro wrote: > On Fri, May 22, 2020 at 10:00 AM David Rowley > wrote: > > On Thu, 21 May 2020 at 17:06, David Rowley wrote: > > > For the patch. I know you just put it together quickly, but I don't > > > think you can do that ramp up the way you have. It looks like there's > > > a risk of torn reads and torn writes and I'm unsure how much that > > > could affect the test results here. > > > > Oops. On closer inspection, I see that memory is per worker, not > > global to the scan. > > Right, I think it's safe. I think you were probably right that > ramp-up isn't actually useful though, it's only the end of the scan > that requires special treatment so we don't get unfair allocation as > the work runs out, due to the coarse grain. I suppose that even if you > have a scheme that falls back to fine grained allocation for the final > N pages, it's still possible that a highly distracted process (most > likely the leader given its double duties) can finish up sitting on a > large range of pages and eventually have to process them all at the > end after the other workers have already knocked off and gone for a > pint. > > >
Re: Parallel Seq Scan vs kernel read ahead
On Fri, May 22, 2020 at 1:14 PM Soumyadeep Chakraborty wrote: > Some more data points: Thanks! > max_parallel_workers_per_gatherTime(seconds) > 0 29.04s > 1 29.17s > 2 28.78s > 6 291.27s > > I checked with explain analyze to ensure that the number of workers > planned = max_parallel_workers_per_gather > > Apart from the last result (max_parallel_workers_per_gather=6), all > the other results seem favorable. > Could the last result be down to the fact that the number of workers > planned exceeded the number of vCPUs? Interesting. I guess it has to do with patterns emerging from various parameters like that magic number 64 I hard coded into the test patch, and other unknowns in your storage stack. I see a small drop off that I can't explain yet, but not that. > I also wanted to evaluate Zedstore with your patch. > I used the same setup as above. > No discernible difference though, maybe I'm missing something: It doesn't look like it's using table_block_parallelscan_nextpage() as a block allocator so it's not affected by the patch. It has its own thing zs_parallelscan_nextrange(), which does pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtids, ZS_PARALLEL_CHUNK_SIZE), and that macro is 0x10.
More tests with USING INDEX replident and dropped indexes
Hi all, While working on some other logical decoding patch recently, I bumped into the fact that we have special handling for the case of REPLICA IDENTITY USING INDEX when the dependent index is dropped, where the code handles that case as an equivalent of NOTHING. Attached is a patch to add more coverage for that. Any thoughts? -- Michael diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index d79cd316b7..fb4a7d7689 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -565,6 +565,36 @@ UPDATE table_with_unique_not_null SET data = 3 WHERE data = 2; UPDATE table_with_unique_not_null SET id = -id; UPDATE table_with_unique_not_null SET id = -id; DELETE FROM table_with_unique_not_null WHERE data = 3; +-- check tables with dropped indexes used in REPLICA IDENTITY +-- with primary key +CREATE TABLE table_dropped_index_with_pk (a int PRIMARY KEY, b int, c int); +CREATE UNIQUE INDEX table_dropped_index_with_pk_idx + ON table_dropped_index_with_pk(a); +ALTER TABLE table_dropped_index_with_pk REPLICA IDENTITY + USING INDEX table_dropped_index_with_pk_idx; +DROP INDEX table_dropped_index_with_pk_idx; +INSERT INTO table_dropped_index_with_pk VALUES (1,1,1), (2,2,2), (3,3,3); +UPDATE table_dropped_index_with_pk SET a = 4 WHERE a = 1; +UPDATE table_dropped_index_with_pk SET b = 5 WHERE a = 2; +UPDATE table_dropped_index_with_pk SET b = 6, c = 7 WHERE a = 3; +DELETE FROM table_dropped_index_with_pk WHERE b = 1; +DELETE FROM table_dropped_index_with_pk WHERE a = 3; +DROP TABLE table_dropped_index_with_pk; +-- without primary key +CREATE TABLE table_dropped_index_no_pk (a int NOT NULL, b int, c int); +CREATE UNIQUE INDEX table_dropped_index_no_pk_idx + ON table_dropped_index_no_pk(a); +ALTER TABLE table_dropped_index_no_pk REPLICA IDENTITY + USING INDEX table_dropped_index_no_pk_idx; +INSERT INTO table_dropped_index_no_pk VALUES (1,1,1), (2,2,2), (3,3,3); +DROP INDEX table_dropped_index_no_pk_idx; +INSERT INTO table_dropped_index_no_pk VALUES (1,1,1), (2,2,2), (3,3,3); +UPDATE table_dropped_index_no_pk SET a = 4 WHERE a = 1; +UPDATE table_dropped_index_no_pk SET b = 5 WHERE a = 2; +UPDATE table_dropped_index_no_pk SET b = 6, c = 7 WHERE a = 3; +DELETE FROM table_dropped_index_no_pk WHERE b = 1; +DELETE FROM table_dropped_index_no_pk WHERE a = 3; +DROP TABLE table_dropped_index_no_pk; -- check toast support BEGIN; CREATE SEQUENCE toasttable_rand_seq START 79 INCREMENT 1499; -- portable "random" @@ -682,6 +712,56 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc table public.table_with_unique_not_null: DELETE: id[integer]:4 COMMIT BEGIN + table public.table_dropped_index_with_pk: INSERT: a[integer]:1 b[integer]:1 c[integer]:1 + table public.table_dropped_index_with_pk: INSERT: a[integer]:2 b[integer]:2 c[integer]:2 + table public.table_dropped_index_with_pk: INSERT: a[integer]:3 b[integer]:3 c[integer]:3 + COMMIT + BEGIN + table public.table_dropped_index_with_pk: UPDATE: a[integer]:4 b[integer]:1 c[integer]:1 + COMMIT + BEGIN + table public.table_dropped_index_with_pk: UPDATE: a[integer]:2 b[integer]:5 c[integer]:2 + COMMIT + BEGIN + table public.table_dropped_index_with_pk: UPDATE: a[integer]:3 b[integer]:6 c[integer]:7 + COMMIT + BEGIN + table public.table_dropped_index_with_pk: DELETE: (no-tuple-data) + COMMIT + BEGIN + table public.table_dropped_index_with_pk: DELETE: (no-tuple-data) + COMMIT + BEGIN + table public.table_dropped_index_no_pk: INSERT: a[integer]:1 b[integer]:1 
c[integer]:1 + table public.table_dropped_index_no_pk: INSERT: a[integer]:2 b[integer]:2 c[integer]:2 + table public.table_dropped_index_no_pk: INSERT: a[integer]:3 b[integer]:3 c[integer]:3 + COMMIT + BEGIN + table public.table_dropped_index_no_pk: INSERT: a[integer]:1 b[integer]:1 c[integer]:1 + table public.table_dropped_index_no_pk: INSERT: a[integer]:2 b[integer]:2 c[integer]:2 + table public.table_dropped_index_no_pk: INSERT: a[integer]:3 b[integer]:3 c[integer]:3 + COMMIT + BEGIN + table public.table_dropped_index_no_pk: UPDATE: a[integer]:4 b[integer]:1 c[integer]:1 + table public.table_dropped_index_no_pk: UPDATE: a[integer]:4 b[integer]:1 c[integer]:1 + COMMIT + BEGIN + table public.table_dropped_index_no_pk: UPDATE: a[integer]:2 b[integer]:5 c[integer]:2 + table public.table_dropped_index_no_pk: UPDATE: a[integer]:2 b[integer]:5 c[integer]:2 + COMMIT + BEGIN + table public.table_dropped_index_no_pk: UPDATE: a[integer]:3 b[integer]:6 c[integer]:7 + table public.table_dropped_index_no_pk: UPDATE: a[integer]:3 b[integer]:6 c[integer]:7 + COMMIT + BEGIN + table public.table_dropped_index_no_pk: DELETE: (no-tuple-data) + table public.table_dropped_index_no_pk: DELETE: (no-tuple-data) + COMMIT + BEGIN + table public.table_dropped_index_no_pk: DELETE: (no-tuple-data) + table public.table_dropped_index_no_pk: DELETE: (no-tuple-data) + COMMIT + BEGIN table public.toasttable: INSERT: id[integer]:1 toasted_
Re: race condition when writing pg_control
On Tue, May 5, 2020 at 9:51 AM Thomas Munro wrote: > On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan wrote: > > I believe I've discovered a race condition between the startup and > > checkpointer processes that can cause a CRC mismatch in the pg_control > > file. If a cluster crashes at the right time, the following error > > appears when you attempt to restart it: > > > > FATAL: incorrect checksum in control file > > > > This appears to be caused by some code paths in xlog_redo() that > > update ControlFile without taking the ControlFileLock. The attached > > patch seems to be sufficient to prevent the CRC mismatch in the > > control file, but perhaps this is a symptom of a bigger problem with > > concurrent modifications of ControlFile->checkPointCopy.nextFullXid. > > This does indeed look pretty dodgy. CreateRestartPoint() running in > the checkpointer does UpdateControlFile() to compute a checksum and > write it out, but xlog_redo() processing > XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without > interlocking. It looks like the ancestors of that line were there > since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran > UpdateControLFile() directly in the startup process (immediately after > that update), so no interlocking problem. Then in cdd46c76548 (2009), > RecoveryRestartPoint() was split up so that CreateRestartPoint() ran > in another process. Here's a version with a commit message added. I'll push this to all releases in a day or two if there are no objections. From 28ee04b537085148c75f575d2bc755a1fffc19c7 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 22 May 2020 16:23:59 +1200 Subject: [PATCH] Fix race condition that could corrupt pg_control. The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire ControlFileLock before modifying ControlFile->checkPointCopy, or the checkpointer could write out a control file with a bad checksum. Back-patch to all supported release. Author: Nathan Bossart Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com --- src/backend/access/transam/xlog.c | 4 1 file changed, 4 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ca09d81b08..274b808476 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9976,7 +9976,9 @@ xlog_redo(XLogReaderState *record) } /* ControlFile->checkPointCopy always tracks the latest ckpt XID */ + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid; + LWLockRelease(ControlFileLock); /* Update shared-memory copy of checkpoint XID/epoch */ SpinLockAcquire(&XLogCtl->info_lck); @@ -10033,7 +10035,9 @@ xlog_redo(XLogReaderState *record) SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB); /* ControlFile->checkPointCopy always tracks the latest ckpt XID */ + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid; + LWLockRelease(ControlFileLock); /* Update shared-memory copy of checkpoint XID/epoch */ SpinLockAcquire(&XLogCtl->info_lck); -- 2.20.1
Re: POC: rational number type (fractions)
On Thu, May 21, 2020 at 01:40:10PM -0400, Robert Haas wrote: > On Mon, May 18, 2020 at 6:15 PM Tom Lane wrote: > > There surely are use-cases for true rational arithmetic, but I'm > > dubious that it belongs in core Postgres. I don't think that enough > > of our users would want it to justify expending core-project maintenance > > effort on it. So I'd be happier to see this as an out-of-core extension. > > As is often the case, I'm a little more positive about including this > than Tom, but as is also often the case, I'm somewhat cautious, too. > On the one hand, I think it would be cool to have and people would > like it. But, On the other hand, I also think we'd only want it if > we're convinced that it's a really good implementation and that > there's not a competing design which is better, or even equally good. I vote for keeping it out of core, mostly because writing rational numeric code is so different from writing DBMS core code. (Many of our existing types, like numeric and the geometric types, have the same problem. Let's not invite more of that.) The optimal reviewer pools won't have much overlap, so patches may sit awhile and/or settle for a cursory review. More language standard libraries provide "numeric"-style big decimals[1] than provide big rationals[2], suggesting we're in good company. [1] https://en.wikipedia.org/wiki/List_of_arbitrary-precision_arithmetic_software#Languages [2] https://en.wikipedia.org/wiki/Rational_data_type#Language_support
Re: Planning counters in pg_stat_statements (using pgss_store)
On Thu, May 21, 2020 at 3:17 PM Michael Paquier wrote: > On Thu, May 21, 2020 at 08:49:53AM +0200, Julien Rouhaud wrote: > > On Tue, May 19, 2020 at 4:29 AM Andy Fan > wrote: > >> Thanks for the excellent extension. I want to add 5 more fields to > satisfy the > >> following requirements. > >> > >> int subplan; /* No. of subplan in this query */ > >> int subquery; /* No. of subquery */ > >> int joincnt; /* How many relations are joined */ > >> bool hasagg; /* if we have agg function in this query */ > >> bool hasgroup; /* has group clause */ > > > > Most of those fields can be computed using the raw sql satements. Why > > not adding functions like query_has_agg(querytext) to get the > > information from pgss stored query text instead? > > Yeah I personally find concepts related only to the query string > itself not something that needs to be tied to pg_stat_statements. > While reading about those five new fields, I am also wondering how > this stuff would work with CTEs. Particularly, should the hasagg or > hasgroup flags be set only if the most outer query satisfies a > condition? What if an inner query satisfies a condition but not an outer query? Should joincnt just be the sum of all the joins done in > all queries, including subqueries? >

The semantics are for the overall query, not only the outermost query; see code like this, for example:

query_characters.hasagg |= parse->hasAggs;
query_characters.hasgroup |= parse->groupClause != NIL;

> Most of those fields can be computed using the raw sql satements. Why > not adding functions like query_has_agg(querytext) to get the > information from pgss stored query text instead?

That is mainly because I don't want to reparse the query again.

-- Best Regards Andy Fan
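To make the "overall query" semantics concrete with respect to Michael's CTE/subquery question, here is a rough sketch: collect_query_characters() and the QueryCharacters struct are invented for illustration, and only the |= pattern above comes from the actual proposal. It simply ORs the flags across all query levels.

typedef struct QueryCharacters
{
	bool		hasagg;
	bool		hasgroup;
} QueryCharacters;

static void
collect_query_characters(Query *parse, QueryCharacters *qc)
{
	ListCell   *lc;

	qc->hasagg |= parse->hasAggs;
	qc->hasgroup |= (parse->groupClause != NIL);

	/* recurse so that inner queries count too; CTEs would need the same via cteList */
	foreach(lc, parse->rtable)
	{
		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);

		if (rte->rtekind == RTE_SUBQUERY)
			collect_query_characters(rte->subquery, qc);
	}
}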
Re: Planning counters in pg_stat_statements (using pgss_store)
On Thu, May 21, 2020 at 3:49 PM Julien Rouhaud wrote: > Le jeu. 21 mai 2020 à 09:17, Michael Paquier a > écrit : > >> On Thu, May 21, 2020 at 08:49:53AM +0200, Julien Rouhaud wrote: >> > On Tue, May 19, 2020 at 4:29 AM Andy Fan >> wrote: >> >> Thanks for the excellent extension. I want to add 5 more fields to >> satisfy the >> >> following requirements. >> >> >> >> int subplan; /* No. of subplan in this query */ >> >> int subquery; /* No. of subquery */ >> >> int joincnt; /* How many relations are joined */ >> >> bool hasagg; /* if we have agg function in this query */ >> >> bool hasgroup; /* has group clause */ >> > >> > Most of those fields can be computed using the raw sql satements. Why >> > not adding functions like query_has_agg(querytext) to get the >> > information from pgss stored query text instead? >> >> Yeah I personally find concepts related only to the query string >> itself not something that needs to be tied to pg_stat_statements. >> ... >> > > Indeed cte will bring additional concerns about the fields semantics. > That's another good reason to go with external functions so you can add > extra parameters for that if needed. > >>

There are also some things we can't easily get from the query string, such as:

1. which views are involved;
2. subqueries that are pulled up, so there is in fact no subquery left;
3. sublinks that are pulled up or become an InitPlan rather than a SubPlan;
4. joins that are removed by remove_useless_joins.

-- Best Regards Andy Fan
Re: [Proposal] Page Compression for OLTP
Sorry, There may be a problem with the display format of the previous mail. So resend it At 2020-05-21 15:04:55, "Fabien COELHO" wrote: > >Hello, > >My 0.02, some of which may just show some misunderstanding on my part: > > - Could this be proposed as some kind of extension, provided that enough >hooks are available? ISTM that foreign tables and/or alternative >storage engine (aka ACCESS METHOD) provide convenient APIs which could >fit the need for these? Or are they not appropriate? You seem to >suggest that there are not. > >If not, what could be done to improve API to allow what you are seeking >to do? Maybe you need a somehow lower-level programmable API which does >not exist already, or at least is not exported already, but could be >specified and implemented with limited effort? Basically you would like >to read/write pg pages to somewhere, and then there is the syncing >issue to consider. Maybe such a "page storage" API could provide >benefit for some specialized hardware, eg persistent memory stores, >so there would be more reason to define it anyway? I think it might >be valuable to give it some thoughts. Thank you for giving so many comments. In my opinion, developing a foreign table or a new storage engine, in addition to compression, also needs to do a lot of extra things. A similar explanation was mentioned in Nikolay P's email. The "page storage" API may be a good choice, and I will consider it, but I have not yet figured out how to implement it. > - Could you maybe elaborate on how your plan differs from [4] and [5]? My solution is similar to CFS, and it is also embedded in the file access layer (fd.c, md.c) to realize the mapping from block number to the corresponding file and location where compressed data is stored. However, the most important difference is that I hope to avoid the need for GC through the design of the page layout. https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net >> The most difficult thing in CFS development is certainly >> defragmentation. In CFS it is done using background garbage collection, >> by one or one >> GC worker processes. The main challenges were to minimize its >> interaction with normal work of the system, make it fault tolerant and >> prevent unlimited growth of data segments. >> CFS is not introducing its own storage manager, it is mostly embedded in >> existed Postgres file access layer (fd.c, md.c). It allows to reused >> code responsible for mapping relations and file descriptors cache. As it >> was recently discussed in hackers, it may be good idea to separate the >> questions "how to map blocks to filenames and offsets" and "how to >> actually perform IO". In this it will be easier to implement compressed >> storage manager. > - Have you consider keeping page headers and compressing tuple data >only? In that case, we must add some additional information in the page header to identify whether this is a compressed page or an uncompressed page. When a compressed page becomes an uncompressed page, or vice versa, an uncompressed page becomes a compressed page, the original page header must be modified. This is unacceptable because it requires modifying the shared buffer and recalculating the checksum. However, it should be feasible to put this flag in the compressed address file. The problem with this is that even if a page only occupies the size of one compressed block, the address file needs to be read, that is, from 1 IO to 2 IO. 
Since the address file is very small, it is basically a memory access, this cost may not be as large as I had imagined. > - I'm not sure there is a point in going below the underlying file >system blocksize, quite often 4 KiB? Or maybe yes? Or is there >a benefit to aim at 1/4 even if most pages overflow? My solution is mainly optimized for scenarios where the original page can be compressed to only require one compressed block of storage. The scene where the original page is stored in multiple compressed blocks is suitable for scenarios that are not particularly sensitive to performance, but are more concerned about the compression rate, such as cold data. In addition, users can also choose to compile PostgreSQL with 16KB or 32KB BLOCKSZ. > - ISTM that your approach entails 3 "files". Could it be done with 2? >I'd suggest that the possible overflow pointers (coa) could be part of >the headers so that when reading the 3.1 page, then the header would >tell where to find the overflow 3.2, without requiring an additional >independent structure with very small data in it, most of it zeros. >Possibly this is not possible, because it would require some available >space in standard headers when the is page is not compressible, and >there is not enough.
Re: Problem with pg_atomic_compare_exchange_u64 at 32-bit platforms
On Wed, May 20, 2020 at 10:59:44AM +0300, Konstantin Knizhnik wrote: > On 20.05.2020 10:36, Noah Misch wrote: > >On Wed, May 20, 2020 at 10:23:37AM +0300, Konstantin Knizhnik wrote: > >>On 20.05.2020 06:05, Noah Misch wrote: > >>>On Tue, May 19, 2020 at 04:07:29PM +0300, Konstantin Knizhnik wrote: > Definition of pg_atomic_compare_exchange_u64 requires alignment of > expected > pointer on 8-byte boundary. > > pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr, > uint64 *expected, uint64 newval) > { > #ifndef PG_HAVE_ATOMIC_U64_SIMULATION > AssertPointerAlignment(ptr, 8); > AssertPointerAlignment(expected, 8); > #endif > > > I wonder if there are platforms where such restriction is actually > needed. > >>>In general, sparc Linux does SIGBUS on unaligned access. Other platforms > >>>function but suffer performance penalties. > >>Well, if platform enforces strict alignment, then addressed value should be > >>properly aligned in any case, shouldn't it? > >No. One can always cast a char* to a uint64* and get a misaligned read when > >dereferencing the resulting pointer. > > Yes, certainly we can "fool" compiler using type casts: > > char buf[8]; > *(int64_t*)buf = 1; PostgreSQL does things like that, so the assertions aren't frivolous. > But I am speaking about normal (safe) access to variables: > > long long x; > > In this case "x" compiler enforces proper alignment of "x" for the target > platform. > We are not adding AssertPointerAlignment to any function which has pointer > arguments, aren' we? Most functions don't have such assertions. That doesn't make it wrong for this function to have them. > I understand we do we require struct alignment pointer to atomic variables > even at the platforms which do not require it > (as Andreas explained, if value cross cacheline, it will cause significant > slowdown). > But my question was whether we need string alignment of expected value? I expect at least some platforms need strict alignment, though I haven't tried to prove it. > >>void f() { > >> int x; > >> long long y; > >> printf("%p\n", &y); > >>} > >> > >>then its address must not be aligned on 8 at 32-bit platform. > >>This is why "expected" in test_atomic_uint64 may not be aligned on 8-byte > >>boundary and we can get assertion failure. > >Can you construct a patch that adds some automatic variables to a regress.c > >function and causes an assertion inside pg_atomic_compare_exchange_u64() to > >fail on some machine you have? I don't think win32 behaves as you say. If > >you can make a test actually fail using the technique you describe, that > >would > >remove all doubt. > I do not have access to Win32. > But I think that if you just add some 4-byte variable before "expected" > definition, then you will get this assertion failure (proposed patch is > attached). > Please notice that PG_HAVE_ATOMIC_U64_SIMULATION should not be defined and > Postgres is build with --enable-cassert and CLAGS=-O0 > > Also please notice that my report is not caused just by hypothetical problem > which I found out looking at Postgres code. > We actually get this assertion failure in pg_atomic_compare_exchange_u64 at > Win32 (not in regress.c). Given https://postgr.es/m/flat/20150108204635.GK6299%40alap3.anarazel.de was necessary, that is plausible. Again, if you can provide a test case that you have confirmed reproduces it, that will remove all doubt. You refer to a "we" that has access to a system that reproduces it.
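For anyone wanting to try this, here is a sketch of the kind of regress.c test being proposed -- the function and variable names are illustrative, not the attached patch. Declaring a 4-byte local ahead of "expected" means that, on a 32-bit ABI that only guarantees 4-byte stack alignment for 8-byte integers, the pointer passed in may trip the alignment assertion.

static void
test_atomic_uint64_alignment(pg_atomic_uint64 *var)
{
	int32		filler = 0;		/* 4-byte local deliberately declared first */
	uint64		expected = 0;

	(void) filler;

	/*
	 * On i386-style ABIs, "expected" may end up only 4-byte aligned, so this
	 * can fail AssertPointerAlignment(expected, 8) in an assert-enabled
	 * build (assuming PG_HAVE_ATOMIC_U64_SIMULATION is not defined).
	 */
	(void) pg_atomic_compare_exchange_u64(var, &expected, 1);
}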
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, May 19, 2020 at 6:01 PM Amit Kapila wrote: > > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar wrote: > > I have further reviewed v22 and below are my comments: v22-0005-Implement-streaming-mode-in-ReorderBuffer -- 1. + * Note: We never do both stream and serialize a transaction (we only spill + * to disk when streaming is not supported by the plugin), so only one of + * those two flags may be set at any given time. + */ +#define rbtxn_is_streamed(txn) \ +( \ + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \ +) The above 'Note' is not correct as per the latest implementation. v22-0006-Add-support-for-streaming-to-built-in-replicatio 2. --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -14,7 +14,6 @@ * *- */ - #include "postgres.h" Spurious line removal. 3. +void +logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn, +XLogRecPtr commit_lsn) +{ + uint8 flags = 0; + + pq_sendbyte(out, 'c'); /* action STREAM COMMIT */ + + Assert(TransactionIdIsValid(txn->xid)); + + /* transaction ID (we're starting to stream, so must be valid) */ + pq_sendint32(out, txn->xid); The part of the comment "we're starting to stream, so must be valid" is not correct as we are not at the start of the stream here. The patch has used the same incorrect sentence at few places, kindly fix those as well. 4. + * XXX Do we need to allocate it in TopMemoryContext? + */ +static void +subxact_info_add(TransactionId xid) { .. For this and other places in a patch like in function stream_open_file(), instead of using TopMemoryContext, can we consider using a new memory context LogicalStreamingContext or something like that. We can create LogicalStreamingContext under TopMemoryContext. I don't see any need of using TopMemoryContext here. 5. +static void +subxact_info_add(TransactionId xid) This function has assumed a valid value for global variables like stream_fd and stream_xid. I think it is better to have Assert for those in this function before using them. The Assert for those are present in handle_streamed_transaction but I feel they should be in subxact_info_add. 6. +subxact_info_add(TransactionId xid) /* + * In most cases we're checking the same subxact as we've already seen in + * the last call, so make ure just ignore it (this change comes later). + */ + if (subxact_last == xid) + return; Typo and minor correction, /ure just/sure to 7. +subxact_info_write(Oid subid, TransactionId xid) { .. + /* + * But we free the memory allocated for subxact info. There might be one + * exceptional transaction with many subxacts, and we don't want to keep + * the memory allocated forewer. + * + */ a. Typo, /forewer/forever b. The extra line at the end of the comment is not required. 8. + * XXX Maybe we should only include the checksum when the cluster is + * initialized with checksums? + */ +static void +subxact_info_write(Oid subid, TransactionId xid) Do we really need to have the checksum for temporary files? I have checked a few other similar cases like SharedFileSet stuff for parallel hash join but didn't find them using checksums. Can you also once see other usages of temporary files and then let us decide if we see any reason to have checksums for this? Another point is we don't seem to be doing this for 'changes' file, see stream_write_change. So, not sure, there is any sense to write checksum for subxact file. Tomas, do you see any reason for the same? 9. 
+subxact_filename(char *path, Oid subid, TransactionId xid) +{ + char tempdirpath[MAXPGPATH]; + + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID); + + /* + * We might need to create the tablespace's tempfile directory, if no + * one has yet done so. + */ + if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + tempdirpath))); + + snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts", + tempdirpath, subid, xid); +} Temporary files created in PGDATA/base/pgsql_tmp follow a certain naming convention (see docs[1]) which is not followed here. You can also refer SharedFileSetPath and OpenTemporaryFile. I think we can just try to follow that convention and then additionally append subid, xid and .subxacts. Also, a similar change is required for changes_filename. I would like to know if there is a reason why we want to use different naming convention here? 10. + * This can only be called at the beginning of a "streaming" block, i.e. + * between stream_start/stream_stop messages from the upstream. + */ +static void +stream_close_file(void) The comment seems to be wrong. I think this can be only called at stream end, so it should be "This can only be called at the end
Re: Optimizer docs typos
On Wed, May 20, 2020 at 7:17 PM Etsuro Fujita wrote: > On Tue, May 19, 2020 at 7:35 PM Etsuro Fujita wrote: > > On Mon, May 18, 2020 at 7:45 PM Richard Guo wrote: > > > In this same README doc, another suspicious typo to me, which happens in > > > section "Optimizer Functions", is in the prefix to query_planner(), > > > we should have three dashes, rather than two, since query_planner() is > > > called within grouping_planner(). > > > > > > diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README > > > index 7dcab9a..bace081 100644 > > > --- a/src/backend/optimizer/README > > > +++ b/src/backend/optimizer/README > > > @@ -315,7 +315,7 @@ set up for recursive handling of subqueries > > >preprocess target list for non-SELECT queries > > >handle UNION/INTERSECT/EXCEPT, GROUP BY, HAVING, aggregates, > > > ORDER BY, DISTINCT, LIMIT > > > ---query_planner() > > > +---query_planner() > > > make list of base relations used in query > > > split up the qual into restrictions (a=1) and joins (b=c) > > > find qual clauses that enable merge and hash joins > > > > Yeah, you are right. Another one would be in the prefix to > > standard_join_search(); I think it might be better to have six dashes, > > rather than five, because standard_join_search() is called within > > make_rel_from_joinlist(). > > Here is a patch including the change I proposed. (Yet another thing I > noticed is the indent spaces for join_search_one_level(): that > function is called within standard_join_search(), so it would be > better to have one extra space, for consistency with others (eg, > set_base_rel_pathlists() called from make_one_rel()), but that would > be too nitpicking.) This is more like an improvement, so I'll apply > the patch to HEAD only, if no objestions. Done. Best regards, Etsuro Fujita