date:20190614

On Fri, 14 Jun 2019 at 04:32, Jesper Pedersen
 wrote:
> Here is a rebased version.

Hi Jesper,

I read over this thread a few weeks ago while travelling back from
PGCon. (I wish I'd read it on the outward trip instead since it would
have been good to talk about it in person.)

First off. I think this is a pretty great feature. It certainly seems
worthwhile working on it.

I've looked over the patch just to get a feel for how the planner part
works and I have a few ideas to share.

The code in create_distinct_paths() I think should work a different
way. I think it would be much better to add a new field to Path and
allow a path to know what keys it is distinct for.  This sort of goes
back to an idea I thought about when developing unique joins
(9c7f5229ad) about an easier way to detect fields that a relation is
unique for. I've been calling these "UniqueKeys" in a few emails [1].
The idea was to tag these onto RelOptInfo to mention which columns or
exprs a relation is unique by so that we didn't continuously need to
look at unique indexes in all the places that call
relation_has_unique_index_for().  The idea there was that unique joins
would know when a join was unable to duplicate rows. If the outer side
of a join didn't duplicate the inner side, then the join RelOptInfo
could keep the UniqueKeys from the inner rel, and vice-versa. If both
didn't duplicate then the join rel would obtain the UniqueKeys from
both sides of the join.  The idea here is that this would be a better
way to detect unique joins, and also when it came to the grouping
planner we'd also know if the distinct or group by should be a no-op.
DISTINCT could be skipped, and GROUP BY could do a group aggregate
without any sort.

I think these UniqueKeys ties into this work, perhaps not adding
UniqueKeys to RelOptInfo, but just to Path so that we create paths
that have UniqueKeys during create_index_paths() based on some
uniquekeys that are stored in PlannerInfo, similar to how we create
index paths in build_index_paths() by checking if the index
has_useful_pathkeys().  Doing it this way would open up more
opportunities to use skip scans. For example, semi-joins and
anti-joins could make use of them if the uniquekeys covered the entire
join condition. With this idea, the code you've added in
create_distinct_paths() can just search for the cheapest path that has
the correct uniquekeys, or if none exist then just do the normal
sort->unique or hash agg.  I'm not entirely certain how we'd instruct
a semi/anti joined relation to build such paths, but that seems like a
problem that could be dealt with when someone does the work to allow
skip scans to be used for those.

Also, I'm not entirely sure that these UniqueKeys should make use of
PathKey since there's no real need to know about pk_opfamily,
pk_strategy, pk_nulls_first as those all just describe how the keys
are ordered. We just need to know if they're distinct or not. All
that's left after removing those fields is pk_eclass, so could
UniqueKeys just be a list of EquivalenceClass? or perhaps even a
Bitmapset with indexes into PlannerInfo->ec_classes (that might be
premature for not since we've not yet got
https://commitfest.postgresql.org/23/1984/ or
https://commitfest.postgresql.org/23/2019/ )  However, if we did use
PathKey, that does allow us to quickly check if the UniqueKeys are
contained within the PathKeys, since pathkeys are canonical which
allows us just to compare their memory address to know if two are
equal, however, if you're storing eclasses we could probably get the
same just by comparing the address of the eclass to the pathkey's
pk_eclass.

Otherwise, I think how you're making use of paths in
create_distinct_paths() and create_skipscan_unique_path() kind of
contradicts how they're meant to be used.

I also agree with James that this should not be limited to Index Only
Scans. From testing the patch, the following seems pretty strange to
me:

# create table abc (a int, b int, c int);
CREATE TABLE
# insert into abc select a,b,1 from generate_Series(1,1000) a,
generate_Series(1,1000) b;
INSERT 0 100
# create index on abc(a,b);
CREATE INDEX
# explain analyze select distinct on (a) a,b from abc order by a,b; --
this is fast.
 QUERY PLAN
-
 Index Only Scan using abc_a_b_idx on abc  (cost=0.42..85.00 rows=200
width=8) (actual time=0.260..20.518 rows=1000 loops=1)
   Scan mode: Skip scan
   Heap Fetches: 1000
 Planning Time: 5.616 ms
 Execution Time: 21.791 ms
(5 rows)


# explain analyze select distinct on (a) a,b,c from abc order by a,b;
-- Add one more column and it's slow.
QUERY PLAN
--
 Unique  (cost=0.

Re: upgrades in row-level locks can deadlock

2019-06-14 Thread Oleksii Kliukin

Alvaro Herrera  wrote:

> On 2019-Jun-13, Alvaro Herrera wrote:
> 
>> On 2019-Jun-13, Oleksii Kliukin wrote:
>> 
>>> Makes sense. For the symmetry I have included those that perform lock
>>> upgrades in one session and those that doesn’t, while the other sessions
>>> acquire locks, do updates or deletes. For those that don’t upgrade locks the
>>> test checks that the locks are acquired in the correct order.
>> 
>> Thanks for the updated patch!  I'm about to push to branches 9.6-master.
>> It applies semi-cleanly (only pgindent-maturity whitespace conflicts).
> 
> Done, thanks for the report and patch!
> 
> I tried hard to find a scenario that this patch breaks, but couldn't
> find anything.

Thank you very much for reviewing and committing it!

Cheers,
Oleksii

Re: Converting NOT IN to anti-joins during planning

2019-06-14 Thread Simon Riggs

On Wed, 6 Mar 2019 at 04:11, David Rowley 
wrote:

> Hi Jim,
>
> Thanks for replying here.
>
> On Wed, 6 Mar 2019 at 16:37, Jim Finnerty  wrote:
> >
> > Actually, we're working hard to integrate the two approaches.  I haven't
> had
> > time since I returned to review your patch, but I understand that you
> were
> > checking for strict predicates as part of the nullness checking criteria,
> > and we definitely must have that.  Zheng tells me that he has combined
> your
> > patch with ours, but before we put out a new patch, we're trying to
> figure
> > out how to preserve the existing NOT IN execution plan in the case where
> the
> > materialized subplan fits in memory.  This (good) plan is effectively an
> > in-memory hash anti-join.
> >
> > This is tricky to do because the NOT IN Subplan to anti-join
> transformation
> > currently happens early in the planning process, whereas the decision to
> > materialize is made much later, when the best path is being converted
> into a
> > Plan.
>
> I guess you're still going with the OR ... IS NULL in your patch then?
> otherwise, you'd likely find that the transformation (when NULLs are
> not possible) is always a win since it'll allow hash anti-joins. (see
> #2 in the original email on this thread)  FWIW I mentioned in [1] and
> Tom confirmed in [2] that we both think hacking the join condition to
> add an OR .. IS NULL is a bad idea. I guess you're not deterred by
> that?
>

Surely we want both?

1. Transform when we can
2. Else apply some other approach if the cost can be reduced by doing it

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Solutions for the Enterprise

Re: pg_upgrade: Improve invalid option handling

2019-06-14 Thread Peter Eisentraut

On 2019-06-13 14:30, Masahiko Sawada wrote:
> Why do we need to change pg_fatal() to fprintf() & exit()? It seems to
> me that we can still use pg_fatal() here since we write the message to
> stderr.

It just makes the output more consistent with other tools, e.g.,

old:

pg_upgrade: unrecognized option `--foo'

Try "pg_upgrade --help" for more information.
Failure, exiting

new:

pg_upgrade: unrecognized option `--foo'
Try "pg_upgrade --help" for more information.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Update list of combining characters

2019-06-14 Thread Peter Eisentraut

On 2019-06-13 15:52, Alvaro Herrera wrote:
> I think there's an off-by-one bug in your script.

Indeed.  Here is an updated script and patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
use strict;
use warnings;

my $range_start = undef;
my $codepoint;
my $prev_codepoint;
my $count = 0;

print "\tstatic const struct mbinterval combining[] = {";

foreach my $line ()
{
chomp $line;
my @fields = split ';', $line;
$codepoint = hex $fields[0];

next if $codepoint > 0x;

if ($fields[2] eq 'Me' || $fields[2] eq 'Mn')
{
# combining character, save for start of range
if (!defined($range_start))
{
$range_start = $codepoint;
}
}
else
{
# not a combining character, print out previous range if any
if (defined($range_start))
{
if ($count++ % 3 == 0)
{
print "\n\t\t";
}
else
{
print " ";
}
printf "{0x%04X, 0x%04X},", $range_start, $prev_codepoint;
$range_start = undef;
}
}
}
continue
{
$prev_codepoint = $codepoint;
}

print "\n\t};\n";
From da90031113908ee9869ae87a5edbf52992d16a96 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 14 Jun 2019 11:30:44 +0200
Subject: [PATCH v2] Update list of combining characters

The list of combining characters to ignore for calculating the display
width of a string (used for example by psql) was wildly outdated and
incorrect.

https://www.postgresql.org/message-id/flat/bbb19114-af1e-513b-08a9-61272794bd5c%402ndquadrant.com
---
 src/backend/utils/mb/wchar.c | 94 
 1 file changed, 62 insertions(+), 32 deletions(-)

diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index 8e5116dfc1..1b5ce1740c 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -645,40 +645,70 @@ ucs_wcwidth(pg_wchar ucs)
 {
/* sorted list of non-overlapping intervals of non-spacing characters */
static const struct mbinterval combining[] = {
-   {0x0300, 0x034E}, {0x0360, 0x0362}, {0x0483, 0x0486},
-   {0x0488, 0x0489}, {0x0591, 0x05A1}, {0x05A3, 0x05B9},
-   {0x05BB, 0x05BD}, {0x05BF, 0x05BF}, {0x05C1, 0x05C2},
-   {0x05C4, 0x05C4}, {0x064B, 0x0655}, {0x0670, 0x0670},
-   {0x06D6, 0x06E4}, {0x06E7, 0x06E8}, {0x06EA, 0x06ED},
-   {0x070F, 0x070F}, {0x0711, 0x0711}, {0x0730, 0x074A},
-   {0x07A6, 0x07B0}, {0x0901, 0x0902}, {0x093C, 0x093C},
-   {0x0941, 0x0948}, {0x094D, 0x094D}, {0x0951, 0x0954},
-   {0x0962, 0x0963}, {0x0981, 0x0981}, {0x09BC, 0x09BC},
-   {0x09C1, 0x09C4}, {0x09CD, 0x09CD}, {0x09E2, 0x09E3},
-   {0x0A02, 0x0A02}, {0x0A3C, 0x0A3C}, {0x0A41, 0x0A42},
-   {0x0A47, 0x0A48}, {0x0A4B, 0x0A4D}, {0x0A70, 0x0A71},
-   {0x0A81, 0x0A82}, {0x0ABC, 0x0ABC}, {0x0AC1, 0x0AC5},
-   {0x0AC7, 0x0AC8}, {0x0ACD, 0x0ACD}, {0x0B01, 0x0B01},
-   {0x0B3C, 0x0B3C}, {0x0B3F, 0x0B3F}, {0x0B41, 0x0B43},
-   {0x0B4D, 0x0B4D}, {0x0B56, 0x0B56}, {0x0B82, 0x0B82},
-   {0x0BC0, 0x0BC0}, {0x0BCD, 0x0BCD}, {0x0C3E, 0x0C40},
-   {0x0C46, 0x0C48}, {0x0C4A, 0x0C4D}, {0x0C55, 0x0C56},
+   {0x0300, 0x036F}, {0x0483, 0x0489}, {0x0591, 0x05BD},
+   {0x05BF, 0x05BF}, {0x05C1, 0x05C2}, {0x05C4, 0x05C5},
+   {0x05C7, 0x05C7}, {0x0610, 0x061A}, {0x064B, 0x065F},
+   {0x0670, 0x0670}, {0x06D6, 0x06DC}, {0x06DF, 0x06E4},
+   {0x06E7, 0x06E8}, {0x06EA, 0x06ED}, {0x0711, 0x0711},
+   {0x0730, 0x074A}, {0x07A6, 0x07B0}, {0x07EB, 0x07F3},
+   {0x07FD, 0x07FD}, {0x0816, 0x0819}, {0x081B, 0x0823},
+   {0x0825, 0x0827}, {0x0829, 0x082D}, {0x0859, 0x085B},
+   {0x08D3, 0x08E1}, {0x08E3, 0x0902}, {0x093A, 0x093A},
+   {0x093C, 0x093C}, {0x0941, 0x0948}, {0x094D, 0x094D},
+   {0x0951, 0x0957}, {0x0962, 0x0963}, {0x0981, 0x0981},
+   {0x09BC, 0x09BC}, {0x09C1, 0x09C4}, {0x09CD, 0x09CD},
+   {0x09E2, 0x09E3}, {0x09FE, 0x0A02}, {0x0A3C, 0x0A3C},
+   {0x0A41, 0x0A51}, {0x0A70, 0x0A71}, {0x0A75, 0x0A75},
+   {0x0A81, 0x0A82}, {0x0ABC, 0x0ABC}, {0x0AC1, 0x0AC8},
+   {0x0ACD, 0x0ACD}, {0x0AE2, 0x0AE3}, {0x0AFA, 0x0B01},
+   {0x0B3C, 0x0B3C}, {0x0B3F, 0x0B3F}, {0x0B41, 0x0B44},
+   {0x0B4D, 0x0B56}, {0x0B62, 0x0B63}, {0x0B82, 0x0B82},
+   {0x0BC0, 0x0BC0}, {0x0BCD, 0x0BCD}, {0x0C00, 0x0C00},
+   {0x0C04, 0x0C04}, {0x0C3E, 0x0C40}, {0x0C46, 0x0C56},
+   {0x0C62, 0x0C63}, {0x0C81, 0x0C81}, {0x0CBC, 0x0CBC},
{0x0CBF, 0x0CBF}, {0x0CC6, 0x0CC6}, {0x0CCC

Re: Converting NOT IN to anti-joins during planning

On Fri, 14 Jun 2019 at 20:41, Simon Riggs  wrote:
>
> On Wed, 6 Mar 2019 at 04:11, David Rowley  
> wrote:
>>
>> Hi Jim,
>>
>> Thanks for replying here.
>>
>> On Wed, 6 Mar 2019 at 16:37, Jim Finnerty  wrote:
>> >
>> > Actually, we're working hard to integrate the two approaches.  I haven't 
>> > had
>> > time since I returned to review your patch, but I understand that you were
>> > checking for strict predicates as part of the nullness checking criteria,
>> > and we definitely must have that.  Zheng tells me that he has combined your
>> > patch with ours, but before we put out a new patch, we're trying to figure
>> > out how to preserve the existing NOT IN execution plan in the case where 
>> > the
>> > materialized subplan fits in memory.  This (good) plan is effectively an
>> > in-memory hash anti-join.
>> >
>> > This is tricky to do because the NOT IN Subplan to anti-join transformation
>> > currently happens early in the planning process, whereas the decision to
>> > materialize is made much later, when the best path is being converted into 
>> > a
>> > Plan.
>>
>> I guess you're still going with the OR ... IS NULL in your patch then?
>> otherwise, you'd likely find that the transformation (when NULLs are
>> not possible) is always a win since it'll allow hash anti-joins. (see
>> #2 in the original email on this thread)  FWIW I mentioned in [1] and
>> Tom confirmed in [2] that we both think hacking the join condition to
>> add an OR .. IS NULL is a bad idea. I guess you're not deterred by
>> that?
>
>
> Surely we want both?
>
> 1. Transform when we can
> 2. Else apply some other approach if the cost can be reduced by doing it

Maybe.  If the scope for the conversion is reduced to only add the OR
.. IS NULL join clause when the subplan could not be hashed then it's
maybe less likely to cause performance regressions. Remember that this
forces the planner to use a nested loop join since no other join
algorithms support OR clauses. I think Jim and Zheng have now changed
their patch to do that.  If we can perform a parameterised nested loop
join then that has a good chance of being better than scanning the
subquery multiple times, however, if there's no index to do a
parameterized nested loop, then we need to do a normal nested loop
which will perform poorly, but so will the non-hashed subplan...

# create table t1 (a int);
CREATE TABLE
# create table t2 (a int);
CREATE TABLE
# set work_mem = '64kB';
SET
# insert into t1 select generate_Series(1,1);
INSERT 0 1
# insert into t2 select generate_Series(1,1);
INSERT 0 1
# explain analyze select count(*) from t1 where a not in(select a from t2);
QUERY PLAN
--
 Aggregate  (cost=1668739.50..1668739.51 rows=1 width=8) (actual
time=7079.077..7079.077 rows=1 loops=1)
   ->  Seq Scan on t1  (cost=0.00..1668725.16 rows=5738 width=0)
(actual time=7079.072..7079.072 rows=0 loops=1)
 Filter: (NOT (SubPlan 1))
 Rows Removed by Filter: 1
 SubPlan 1
   ->  Materialize  (cost=0.00..262.12 rows=11475 width=4)
(actual time=0.004..0.397 rows=5000 loops=1)
 ->  Seq Scan on t2  (cost=0.00..159.75 rows=11475
width=4) (actual time=0.019..4.921 rows=1 loops=1)
 Planning Time: 0.348 ms
 Execution Time: 7079.309 ms
(9 rows)

# explain analyze select count(*) from t1 where not exists(select 1
from t2 where t1.a = t2.a or t2.a is null);
   QUERY PLAN

 Aggregate  (cost=1250873.25..1250873.26 rows=1 width=8) (actual
time=7263.980..7263.980 rows=1 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..1250858.97 rows=5709
width=0) (actual time=7263.976..7263.976 rows=0 loops=1)
 Join Filter: ((t1.a = t2.a) OR (t2.a IS NULL))
 Rows Removed by Join Filter: 49995000
 ->  Seq Scan on t1  (cost=0.00..159.75 rows=11475 width=4)
(actual time=0.013..2.350 rows=1 loops=1)
 ->  Materialize  (cost=0.00..262.12 rows=11475 width=4)
(actual time=0.004..0.396 rows=5000 loops=1)
   ->  Seq Scan on t2  (cost=0.00..159.75 rows=11475
width=4) (actual time=0.007..4.075 rows=1 loops=1)
 Planning Time: 0.086 ms
 Execution Time: 7264.141 ms
(9 rows)

When the index exists the transformation is certainly much better.

# create index on t2(a);
CREATE INDEX
# explain analyze select count(*) from t1 where not exists(select 1
from t2 where t1.a = t2.a or t2.a is null);
  QUERY
PLAN
---
 Aggregate  (cost=111342.50..111342.51 rows=1 width=8) (actual
time=29.580.

Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-14 Thread Christoph Berg

Re: Tom Lane 2019-06-11 <24452.1560285...@sss.pgh.pa.us>
> The only way I can get it to pick "Etc/UCT" is if that's what I put
> into /etc/localtime.  (In which case I maintain that that's not a bug,
> or at least not our bug.)

Did you try a symlink or a plain file for /etc/localtime?

> So I'm still mystified by Christoph's report, and am forced to suspect
> pilot error -- specifically, /etc/localtime not containing what he said.

On Debian unstable, deleting /etc/timezone, $TZ not set, and with this symlink:
lrwxrwxrwx 1 root root 27 Mär 28 14:49 /etc/localtime -> 
/usr/share/zoneinfo/Etc/UTC

/usr/lib/postgresql/11/bin/initdb -D pgdata
$ grep timezone pgdata/postgresql.conf
log_timezone = 'UCT'
timezone = 'UCT'

/usr/lib/postgresql/12/bin/initdb -D pgdata
$ grep timezone pgdata/postgresql.conf
log_timezone = 'Etc/UTC'
timezone = 'Etc/UTC'

Same behavior on Debian Stretch (stable):
lrwxrwxrwx 1 root root 27 Mai  7 11:14 /etc/localtime -> 
/usr/share/zoneinfo/Etc/UTC

$ grep timezone pgdata/postgresql.conf
log_timezone = 'UCT'
timezone = 'UCT'

$ grep timezone pgdata/postgresql.conf
log_timezone = 'Etc/UTC'
timezone = 'Etc/UTC'

> Anyway, moving on to the question of what should we do about this,
> I don't really have anything better to offer than back-patching 23bd3cec6.

The PG12 behavior seems sane, so +1.

Christoph

Re: Add CREATE DATABASE LOCALE option

2019-06-14 Thread Heikki Linnakangas


On 05/06/2019 23:17, Peter Eisentraut wrote:

I propose this patch to add a LOCALE option to CREATE DATABASE.  This
sets both LC_COLLATE and LC_CTYPE with one option.  Similar behavior is
already supported in initdb, CREATE COLLATION, and createdb.

With collation providers other than libc, having separate lc_collate and
lc_ctype settings is not necessarily applicable, so this is also
preparation for such future functionality.


One objection is that the proposed LOCALE option would only affect 
LC_COLLATE and LC_CTYPE. What about lc_messages, lc_monetary, lc_numeric 
and lc_time? initdb's --locale option sets those, too. Should CREATE 
DATABASE LOCALE set those as well?


On the whole, +1 on adding the option. In practice, you always want to 
set LC_COLLATE and LC_CTYPE to the same value, so we should make that 
easy. But let's consider those other variables too, at least we've got 
to document it carefully.



PS. There was some discussion on doing this when the LC_COLLATE and 
LC_CTYPE options were added: 
https://www.postgresql.org/message-id/491862F7.1060501%40enterprisedb.com. 
My reading of that is that there was no strong consensus, so we just let 
it be.


- Heikki

Re: pg_upgrade: Improve invalid option handling

2019-06-14 Thread Daniel Gustafsson

> On 13 Jun 2019, at 10:19, Peter Eisentraut  
> wrote:

> Currently, calling pg_upgrade with an invalid command-line option aborts
> pg_upgrade but leaves a pg_upgrade_internal.log file lying around.  This
> patch reorder things a bit so that that file is not created until all
> the options have been parsed.

+1 on doing this. 

+   if ((log_opts.internal = fopen_priv(INTERNAL_LOG_FILE, "a")) == NULL)
+   pg_fatal("could not write to log file \"%s\"\n", 
INTERNAL_LOG_FILE);

While we’re at it, should we change this to “could not open log file” to make
the messaging more consistent across the utilities (pg_basebackup and psql both
use “could not open”)?

cheers ./daniel

Re: POC: Cleaning up orphaned files using undo logs

2019-06-14 Thread Amit Kapila

On Fri, Jun 14, 2019 at 8:26 AM Thomas Munro  wrote:
>
> * current versions of the record and worker code discussed upthread by
> Amit and others
>

Thanks for posting the complete patchset.

Last time, I mentioned the remaining work in undo-processing patchset,
the status of which is as follows:
1. Enhance uur_progress so that it updates undo action apply progress
at regular intervals.  This has been done.  The idea is that we update
the transaction's undo apply progress at regular intervals so that
after a crash we can skip already applied undo.   The undo apply
progress is updated in terms of the number of blocks processed.
I think it is better to change the name of uur_progress to something
like uur_apply_progress.  Any suggestions?

2. Enhance to support oldestXidHavingUnappliedUndo, more on that
later.  This has been done.  The idea here is that we register all the
undo apply (transaction abort) requests in the hash table (referred to
as Rollback Hash Table in the patch) and we have a hard limit (after
that we won't allow new transactions to write undo) on how many such
requests can be pending.  So scanning this table gives us the value of
oldestXidHavingUnappliedUndo (actually the value for this will be
smallest of 'xid having pending undo' and 'oldestXmin').  As this
rollback hash table is not persistent, after start, we need to take a
pass over undo logs to register all the pending abort requests in the
rollback hash table.  There are two main purposes which this value
serves (a) Any Xid below this is all-visible, so it can help in
visibility checks, (b) it can help us implementing the rule that "No
aborted XID with an age >2^31 can have unapplied undo.".  This part
helps us to decide to truncate the clog because we can't truncate the
clog for transactions having undo.

3. Split the patch.  The patch is split into five patches.  I will
give a brief description of each patch which to a good extent is
mentioned in the commit message for each patch as well:
0010-Extend-binary-heap-functionality -  This patch adds the routines
to allocate binary heap in shared memory and to remove nth element
from binary heap.  These routines will be used by a later patch that
will allow an efficient way to process the pending rollback requests.

0011-Infrastructure-to-register-and-fetch-undo-action-req - This patch
provides an infrastructure to register and fetch undo action requests.
This infrastructure provides a way to allow execution of undo actions.
One might think that we can always execute undo actions on error or
explicit rollback by the user, however, there are cases when that is
not possible.  For example, (a) if the system crash while doing the
operation, then after startup, we need a way to perform undo actions;
(b) If we get an error while
performing undo actions.

Apart from this, when there are large rollback requests, then it is
quite inefficient to perform all the undo actions and then return
control to the user.

0012-Infrastructure-to-execute-pending-undo-actions - This provides an
infrastructure to execute pending undo actions.  To apply the undo
actions, we collect the undo records in bulk and try to process them
together.  We ensure to update the transaction's progress at regular
intervals so that after a crash we can skip already applied undo.
This needs some more work to generalize the processing of undo records
so that this infrastructure can be used by other AM's as well.

0013-Allow-foreground-transactions-to-perform-undo-action - This patch
allows foreground transactions to perform undo actions on abort.  We
always perform rollback actions after cleaning up the current
(sub)transaction.  This will ensure that we perform the actions
immediately after an error (and release the locks) rather than when
the user issues Rollback command at some later point of time.  We are
releasing the locks after the undo actions are applied.  The reason to
delay lock release is that if we release locks before applying undo
actions, then the parallel session can acquire the lock before us
which can lead to deadlock.

0014-Allow-execution-and-discard-of-undo-by-background-wo-  - This
patch allows execution and discard of undo by background workers.
Undo launcher is responsible for launching the workers iff there is
some work available in one of the work queues and there are more
workers available.  The worker is launched to handle requests for a
particular database.

The discard worker is responsible for discarding the undo log of
transactions that are committed and all-visible or are rolled-back.
It also registers the request for aborted transactions in the work
queues.  It iterates through all the active logs one-by-one and tries
to discard the transactions that are old enough to matter.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Check the number of potential synchronous standbys

2019-06-14 Thread ??????

Hi all,


When the number of potential synchronous standbys is smaller than num_sync, 
such as 'FIRST 3 (1,2)', 'ANY 4 (1,2,3)' in the synchronous_standby_names, the 
processes will wait for synchronous replication forever. 
 

 
Obviously, it's not expected. I think return false and a error message may be 
better. And attached is a patch that implements the simple check. 




 

 

What do you think about this?




--
Zhang Wenjie

check_synchronous_standby_names.patch
Description: Binary data

Re: Minimal logical decoding on standbys

2019-06-14 Thread Amit Khandekar

On Wed, 22 May 2019 at 15:05, Amit Khandekar  wrote:
>
> On Tue, 9 Apr 2019 at 22:23, Amit Khandekar  wrote:
> >
> > On Sat, 6 Apr 2019 at 04:45, Andres Freund  wrote:
> > > > diff --git a/src/backend/replication/slot.c 
> > > > b/src/backend/replication/slot.c
> > > > index 006446b..5785d2f 100644
> > > > --- a/src/backend/replication/slot.c
> > > > +++ b/src/backend/replication/slot.c
> > > > @@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
> > > >   }
> > > >  }
> > > >
> > > > +void
> > > > +ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
> > > > +{
> > > > + int i;
> > > > + boolfound_conflict = false;
> > > > +
> > > > + if (max_replication_slots <= 0)
> > > > + return;
> > > > +
> > > > +restart:
> > > > + if (found_conflict)
> > > > + {
> > > > + CHECK_FOR_INTERRUPTS();
> > > > + /*
> > > > +  * Wait awhile for them to die so that we avoid flooding 
> > > > an
> > > > +  * unresponsive backend when system is heavily loaded.
> > > > +  */
> > > > + pg_usleep(10);
> > > > + found_conflict = false;
> > > > + }
> > > > +
> > > > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> > > > + for (i = 0; i < max_replication_slots; i++)
> > > > + {
> > > > + ReplicationSlot *s;
> > > > + NameDataslotname;
> > > > + TransactionId slot_xmin;
> > > > + TransactionId slot_catalog_xmin;
> > > > +
> > > > + s = &ReplicationSlotCtl->replication_slots[i];
> > > > +
> > > > + /* cannot change while ReplicationSlotCtlLock is held */
> > > > + if (!s->in_use)
> > > > + continue;
> > > > +
> > > > + /* not our database, skip */
> > > > + if (s->data.database != InvalidOid && s->data.database != 
> > > > dboid)
> > > > + continue;
> > > > +
> > > > + SpinLockAcquire(&s->mutex);
> > > > + slotname = s->data.name;
> > > > + slot_xmin = s->data.xmin;
> > > > + slot_catalog_xmin = s->data.catalog_xmin;
> > > > + SpinLockRelease(&s->mutex);
> > > > +
> > > > + if (TransactionIdIsValid(slot_xmin) && 
> > > > TransactionIdPrecedesOrEquals(slot_xmin, xid))
> > > > + {
> > > > + found_conflict = true;
> > > > +
> > > > + ereport(WARNING,
> > > > + (errmsg("slot %s w/ xmin %u 
> > > > conflicts with removed xid %u",
> > > > + 
> > > > NameStr(slotname), slot_xmin, xid)));
> > > > + }
> > > > +
> > > > + if (TransactionIdIsValid(slot_catalog_xmin) && 
> > > > TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
> > > > + {
> > > > + found_conflict = true;
> > > > +
> > > > + ereport(WARNING,
> > > > + (errmsg("slot %s w/ catalog xmin 
> > > > %u conflicts with removed xid %u",
> > > > + 
> > > > NameStr(slotname), slot_catalog_xmin, xid)));
> > > > + }
> > > > +
> > > > +
> > > > + if (found_conflict)
> > > > + {
> > > > + elog(WARNING, "Dropping conflicting slot %s", 
> > > > s->data.name.data);
> > > > + LWLockRelease(ReplicationSlotControlLock);  
> > > > /* avoid deadlock */
> > > > + ReplicationSlotDropPtr(s);
> > > > +
> > > > + /* We released the lock above; so re-scan the 
> > > > slots. */
> > > > + goto restart;
> > > > + }
> > > > + }
> > > >
> > > I think this should be refactored so that the two found_conflict cases
> > > set a 'reason' variable (perhaps an enum?) to the particular reason, and
> > > then only one warning should be emitted.  I also think that LOG might be
> > > more appropriate than WARNING - as confusing as that is, LOG is more
> > > severe than WARNING (see docs about log_min_messages).
> >
> > What I have in mind is :
> >
> > ereport(LOG,
> > (errcode(ERRCODE_INTERNAL_ERROR),
> > errmsg("Dropping conflicting slot %s", s->data.name.data),
> > errdetail("%s, removed xid %d.", conflict_str, xid)));
> > where conflict_str is a dynamically generated string containing
> > something like : "slot xmin : 1234, slot catalog_xmin: 5678"
> > So for the user, the errdetail will look like :
> > "slot xmin: 1234, catalog_xmin: 5678, removed xid : 9012"
> > I think the user can figure out whether it was xmin or catalog_xmin or
> > both that conflicted with removed xid.
> > If we don't do this way, we may not be able to show in a single
> > message if both xmin and catalog_xmin are conflicting at the same
> > time.
> >
> > Does this message look

Re: Minimal logical decoding on standbys

2019-06-14 Thread Amit Khandekar

On Wed, 12 Jun 2019 at 00:06, Alvaro Herrera  wrote:
>
> On 2019-May-23, Andres Freund wrote:
>
> > On 2019-05-23 09:37:50 -0400, Robert Haas wrote:
> > > On Thu, May 23, 2019 at 9:30 AM Sergei Kornilov  wrote:
> > > > > wal_level is PGC_POSTMASTER.
> > > >
> > > > But primary can be restarted without restart on standby. We require 
> > > > wal_level replica or highter (currently only logical) on standby. So 
> > > > online change from logical to replica wal_level is possible on 
> > > > standby's controlfile.
> > >
> > > That's true, but Amit's scenario involved a change in wal_level during
> > > the execution of pg_create_logical_replication_slot(), which I think
> > > can't happen.
> >
> > I don't see why not - we're talking about the wal_level in the WAL
> > stream, not the setting on the standby. And that can change during the
> > execution of pg_create_logical_replication_slot(), if a PARAMTER_CHANGE
> > record is replayed. I don't think it's actually a problem, as I
> > outlined in my response to Amit, though.
>
> I don't know if this is directly relevant, but in commit_ts.c we go to
> great lengths to ensure that things continue to work across restarts and
> changes of the GUC in the primary, by decoupling activation and
> deactivation of the module from start-time initialization.  Maybe that
> idea is applicable for this too?

We do kind of handle change in wal_level differently at run-time
versus at initialization. E.g. we drop the existing slots if the
wal_level becomes less than logical. But I think we don't have to do a
significant work unlike how it seems to have been done in
ActivateCommitTs when commit_ts is activated.

>
> --
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Re: postgres_fdw: oddity in costing presorted foreign scans with local stats

2019-06-14 Thread Etsuro Fujita

On Mon, Jun 10, 2019 at 5:37 PM Etsuro Fujita  wrote:
> On Thu, Jun 6, 2019 at 5:58 PM Etsuro Fujita  wrote:
> > I made stricter an assertion test I added on retrieved_rows.  Also, I
> > did some editorialization further and added the commit message.
> > Attached is an updated version of the patch.  If there are no
> > objections, I'll commit the patch.
>
> I noticed that the previous patch was an old version; it didn't update
> the assertion test at all.  Attached is a new version updating that
> test.  I think I had been under the weather last week due to a long
> flight.

Pushed.

Best regards,
Etsuro Fujita

Re: "WIP: Data at rest encryption" patch and, PostgreSQL 11-beta3

2019-06-14 Thread Bruce Momjian

On Thu, Jun 13, 2019 at 09:14:13PM -0400, Bruce Momjian wrote:
> On Mon, Jun  3, 2019 at 12:04:54PM -0400, Robert Haas wrote:
> > 
> > What I'm talking about here is that it also has to be reasonably
> > possible to write an encryption key command that does something
> > useful.  I don't have a really clear vision for how that's going to
> > work.  Nobody wants the server, which is probably being launched by
> > pg_ctl or systemd or both, to prompt using its own stdin/stderr, but
> > the we need to think about how the prompting is actually going to
> > work.  One idea is to do it via the FEBE protocol: connect to the
> > server using libpq, issue a new command that causes the server to
> > enter COPY mode, and then send the encryption key as COPY data.
> > However, that idea doesn't really work, because we've got to be able
> > to get the key before we run recovery or reach consistency, so the
> > server won't be listening for connections at that point.  Also, we've
> > got to have a way for this to work in single-user mode, which also
> > can't listen for connections.  It's probably all fine if your
> > encryption key command is something like 'curl
> > https://my.secret.keyserver.me/sekcret.key' because then there's an
> > external server which you can just go access - but I don't quite
> > understand how you'd do interactive prompting from here.  Sorry to get
> > hung up on what may seem like a minor point, but I think it's actually
> > fairly fundamental: we've got to have a clear vision of how end-users
> > will really be able to make use of this.
> 
> pgcryptoey has an example of prompting from /dev/tty:
> 
>   http://momjian.us/download/pgcryptokey/
> 
> Look at pgcryptokey_default.sample.

Also, look at pgcryptokey_default.c and its use with
shared_preload_libraries for a full example of prompting on server boot.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

Christoph Berg  writes:
> Re: Tom Lane 2019-06-11 <24452.1560285...@sss.pgh.pa.us>
>> The only way I can get it to pick "Etc/UCT" is if that's what I put
>> into /etc/localtime.  (In which case I maintain that that's not a bug,
>> or at least not our bug.)

> Did you try a symlink or a plain file for /etc/localtime?

Symlink --- if it's a plain file, our code can't learn anything from it.

> On Debian unstable, deleting /etc/timezone, $TZ not set, and with this 
> symlink:
> lrwxrwxrwx 1 root root 27 Mär 28 14:49 /etc/localtime -> 
> /usr/share/zoneinfo/Etc/UTC

> /usr/lib/postgresql/11/bin/initdb -D pgdata
> $ grep timezone pgdata/postgresql.conf
> log_timezone = 'UCT'
> timezone = 'UCT'

> /usr/lib/postgresql/12/bin/initdb -D pgdata
> $ grep timezone pgdata/postgresql.conf
> log_timezone = 'Etc/UTC'
> timezone = 'Etc/UTC'

That's what I'd expect.  Do you think your upthread report of HEAD
picking "Etc/UCT" was a typo?  Or maybe you actually had /etc/localtime
set that way?

>> Anyway, moving on to the question of what should we do about this,
>> I don't really have anything better to offer than back-patching 23bd3cec6.

> The PG12 behavior seems sane, so +1.

OK, I'll make that happen.

regards, tom lane

Re: pg_dump multi VALUES INSERT

2019-06-14 Thread Peter Eisentraut

Shouldn't the --rows-per-insert option also be available via pg_dumpall?
 All the other options for switching between COPY and INSERT are
settable in pg_dumpall.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: pg_dump multi VALUES INSERT

On 2019-Jun-14, Peter Eisentraut wrote:

> Shouldn't the --rows-per-insert option also be available via pg_dumpall?
>  All the other options for switching between COPY and INSERT are
> settable in pg_dumpall.

Uh, yeah, absolutely.

Surafel, are you in a position to provide a patch for that quickly?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Allow table AM's to cache stuff in relcache

2019-06-14 Thread Heikki Linnakangas

Index AM's can cache stuff in RelationData->rd_amcache. In the zedstore 
table AM we've been hacking on, I'd like to also use rd_amcache to cache 
some information, but that's not currently possible, because rd_amcache 
can only be used by index AMs, not table AMs.


Attached patch allows rd_amcache to also be used by table AMs.

While working on this, I noticed that the memory management of relcache 
entries is quite complicated. Most stuff that's part of a relcache entry 
is allocated in CacheMemoryContext. But some fields have a dedicated 
memory context to hold them, like rd_rulescxt for rules and rd_pdcxt for 
partition information. And indexes have rd_indexcxt to hold all kinds of 
index support info.


In the patch, I documented that rd_amcache must be allocated in 
CacheMemoryContext, or in rd_indexcxt if it's an index. It works, but 
it's a bit weird. It would nice to have one memory context in every 
relcache entry, to hold all the stuff related to it, including 
rd_amcache. In other words, it would be nice if we had "rd_indexcxt" for 
tables, too, not just indexes. That would allow tracking memory usage 
more accurately, if you're debugging an out of memory situation for example.


However, the special contexts like rd_rulescxt and rd_pdcxt would still 
be needed, because of the way RelationClearRelation preserves them, when 
rebuilding the relcache entry for an open relation. So I'm not sure how 
much it would really simplify things. Also, there's some overhead for 
having extra memory contexts, and some people already complain that the 
relcache uses too much memory.


Alternatively, we could document that rd_amcache should always be 
allocated in CacheMemoryContext, even for indexes. That would make the 
rule for pg_amcache straightforward. There's no particular reason why 
rd_amcache has to be allocated in rd_indexcxt, except for how it's 
accounted for in memory context dumps.


- Heikki
>From 0b061f3c0ec8f8dedaf30b4e3cefc0cb24630ced Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 14 Jun 2019 18:09:19 +0300
Subject: [PATCH 1/1] Allow table AM's to use rd_amcache, too.

The rd_amcache allows an index AM to cache arbitrary information in
a relcache entry. This commit moves the cleanup of rd_amcache so that
it's also available for table AMs.
---
 src/backend/utils/cache/relcache.c | 11 +--
 src/include/utils/rel.h| 19 +++
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2b992d78327..204e1eb7aad 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2358,6 +2358,10 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
 		pfree(relation->rd_options);
 	if (relation->rd_indextuple)
 		pfree(relation->rd_indextuple);
+	if (relation->rd_amcache)
+		pfree(relation->rd_amcache);
+	if (relation->rd_fdwroutine)
+		pfree(relation->rd_fdwroutine);
 	if (relation->rd_indexcxt)
 		MemoryContextDelete(relation->rd_indexcxt);
 	if (relation->rd_rulescxt)
@@ -2370,8 +2374,6 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
 		MemoryContextDelete(relation->rd_pdcxt);
 	if (relation->rd_partcheckcxt)
 		MemoryContextDelete(relation->rd_partcheckcxt);
-	if (relation->rd_fdwroutine)
-		pfree(relation->rd_fdwroutine);
 	pfree(relation);
 }
 
@@ -2416,6 +2418,11 @@ RelationClearRelation(Relation relation, bool rebuild)
 	 */
 	RelationCloseSmgr(relation);
 
+	/* Free AM cached data, if any */
+	if (relation->rd_amcache)
+		pfree(relation->rd_amcache);
+	relation->rd_amcache = NULL;
+
 	/*
 	 * Treat nailed-in system relations separately, they always need to be
 	 * accessible, so we can't blow them away.
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33abce3f..9080a3c1609 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -152,13 +152,6 @@ typedef struct RelationData
 	 * those with lefttype and righttype equal to the opclass's opcintype. The
 	 * arrays are indexed by support function number, which is a sufficient
 	 * identifier given that restriction.
-	 *
-	 * Note: rd_amcache is available for index AMs to cache private data about
-	 * an index.  This must be just a cache since it may get reset at any time
-	 * (in particular, it will get reset by a relcache inval message for the
-	 * index).  If used, it must point to a single memory chunk palloc'd in
-	 * rd_indexcxt.  A relcache reset will include freeing that chunk and
-	 * setting rd_amcache = NULL.
 	 */
 	MemoryContext rd_indexcxt;	/* private memory cxt for this stuff */
 	/* use "struct" here to avoid needing to include amapi.h: */
@@ -173,9 +166,19 @@ typedef struct RelationData
 	Oid		   *rd_exclops;		/* OIDs of exclusion operators, if any */
 	Oid		   *rd_exclprocs;	/* OIDs of exclusion ops' procs, if any */
 	uint16	   *rd_exclstrats;	/* exclusion ops' strategy numbers, if any */
-	void	   *rd

Re: Allow table AM's to cache stuff in relcache

Heikki Linnakangas  writes:
> Index AM's can cache stuff in RelationData->rd_amcache. In the zedstore 
> table AM we've been hacking on, I'd like to also use rd_amcache to cache 
> some information, but that's not currently possible, because rd_amcache 
> can only be used by index AMs, not table AMs.
> Attached patch allows rd_amcache to also be used by table AMs.

Seems reasonable.

> In the patch, I documented that rd_amcache must be allocated in 
> CacheMemoryContext, or in rd_indexcxt if it's an index. It works, but 
> it's a bit weird.

Given the way the patch is implemented, it doesn't really matter which
context it's in, does it?  The retail pfree is inessential but also
harmless, if rd_amcache is in rd_indexcxt.  So we could take out the
"must".  I think it's slightly preferable to use rd_indexcxt if available,
to reduce the amount of loose junk in CacheMemoryContext.

> It would nice to have one memory context in every 
> relcache entry, to hold all the stuff related to it, including 
> rd_amcache. In other words, it would be nice if we had "rd_indexcxt" for 
> tables, too, not just indexes. That would allow tracking memory usage 
> more accurately, if you're debugging an out of memory situation for example.

We had some discussion related to that in the "hyrax
vs. RelationBuildPartitionDesc" thread.  I'm not quite sure where
we'll settle on that, but some redesign seems inevitable.

regards, tom lane

Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-14 Thread Stephen Frost

Greetings,

* Ian Barwick (ian.barw...@2ndquadrant.com) wrote:
> Consider the following cascading standby setup with PostgreSQL 12:
> 
> - there exists a running primary "A"
> - standby "B" is cloned from primary "A" using "pg_basebackup 
> --write-recovery-conf"
> - standby "C" is cloned from standby "B" using "pg_basebackup 
> --write-recovery-conf"
> 
> So far, so good, everything works as expected.
> 
> Now, for whatever reason, the user wishes standby "C" to follow another 
> upstream
> node (which one is not important here), so the user, in the comfort of their 
> own psql
> command line (no more pesky recovery.conf editing!) issues the following:
> 
> ALTER SYSTEM SET primary_conninfo = 'host=someothernode';
> 
> and restarts the instance, and... it stays connected to the original upstream 
> node.
> 
> Which is unexpected.
> 
> Examining the the restarted instance, "SHOW primary_conninfo" still displays
> the original value for "primary_conninfo". Mysteriouser and mysteriouser.
> 
> What has happened here is that with the option "--write-recovery-conf", Pg12's
> pg_basebackup (correctly IMHO) appends the recovery settings to 
> "postgresql.auto.conf".
> 
> However, on standby "C", pg_basebackup has dutifully copied over standby "B"'s
> existing "postgresql.auto.conf", which already contains standby "B"'s
> replication configuration, and appended standby "C"'s replication 
> configuration
> to that, which (before "ALTER SYSTEM" was invoked) looked something like this:
> 
>   # Do not edit this file manually!
>   # It will be overwritten by the ALTER SYSTEM command.
>   primary_conninfo = 'host=node_A'
>   primary_conninfo = 'host=node_B'
> 
> which is expected, and works because the last entry in the file is evaluated, 
> so
> on startup, standby "C" follows standby "B".
> 
> However, executing "ALTER SYSTEM SET primary_conninfo = 'host=someothernode'" 
> left
> standby "C"'s "postgresql.auto.conf" file looking like this:
> 
>   # Do not edit this file manually!
>   # It will be overwritten by the ALTER SYSTEM command.
>   primary_conninfo = 'host=someothernode'
>   primary_conninfo = 'host=node_B'
> 
> which seems somewhat broken, to say the least.

Yes, it's completely broken, which I've complained about at least twice
on this list to no avail.

Thanks for putting together an example case pointing out why it's a
serious issue.  The right thing to do here it so create an open item for
PG12 around this.

> As-is, the user will either need to repeatedly issue "ALTER SYSTEM RESET 
> primary_conninfo"
> until the duplicates are cleared (which means "ALTER SYSTEM RESET ..." 
> doesn't work as
> advertised, and is not an obvious solution anyway), or ignore the "Do not 
> edit this file manually!"
> warning and remove the offending entry/entries (which, if done safely, should 
> involve
> shutting down the instance first).
> 
> Note this issue is not specific to pg_basebackup, primary_conninfo (or any 
> other settings
> formerly in recovery.conf), it has just manifested itself as the built-in 
> toolset as of now
> provides a handy way of getting into this situation without too much effort 
> (and any
> utilities which clone standbys and append the replication configuration to
> "postgresql.auto.conf" in lieu of creating "recovery.conf" will be prone to 
> running into
> the same situation).

This is absolutely the fault of the system for putting in multiple
entries into the auto.conf, which it wasn't ever written to handle.

> I had previously always assumed that ALTER SYSTEM  would change the *last* 
> occurrence for
> the parameter in "postgresql.auto.conf", in the same way you'd need to be 
> sure to change
> the last occurrence in the normal configuration files, however this actually 
> not the case -
> as per replace_auto_config_value() ( src/backend/utils/misc/guc.c ):
> 
> /* Search the list for an existing match (we assume there's only one) */
> 
> the *first* match is replaced.
> 
> Attached patch attempts to rectify this situation by having 
> replace_auto_config_value()
> deleting any duplicate entries first, before making any changes to the last 
> entry.

While this might be a good belt-and-suspenders kind of change to
include, I don't think pg_basebackup should be causing us to have
multiple entries in the file in the first place..

> Arguably it might be sufficient (and simpler) to just scan the list for the 
> last
> entry, but removing preceding duplicates seems cleaner, as it's pointless
> (and a potential source of confusion) keeping entries around which will never 
> be used.

I don't think we should only change the last entry, that seems like a
really bad idea.  I agree that we should clean up the file if we come
across it being invalid.

> Also attached is a set of TAP tests to check ALTER SYSTEM works as expected 
> (or
> at least as seems correct to me).

In my view, at least, we should have a similar test for pg_basebackup to
make sure tha

Re: Converting NOT IN to anti-joins during planning

2019-06-14 Thread Li, Zheng

- 
   The big "IF" here is if we can calculate the size of the subplan to
know if it'll be hashed or not at the point in planning where this
conversion is done. I personally can't quite see how that'll work
reliably without actually planning the subquery, which I really doubt
is something we'd consider doing just for a cost estimate. Remember
the subquery may not just be a single relation scan, it could be a
complex query containing many joins and UNION / GROUP BY / DISTINCT /
HAVING clauses etc.
-

In our latest patch, we plan the subquery right before conversion, we only
proceed with the ANTI JOIN conversion if subplan_is_hashable(subplan) is
false. To avoid re-planning the subquery again in a later phase, I think we can
keep a pointer to the subplan in SubLink.

---
Zheng Li
AWS, Amazon Aurora PostgreSQL
 

On 6/14/19, 5:51 AM, "David Rowley"  wrote:

On Fri, 14 Jun 2019 at 20:41, Simon Riggs  wrote:
>
> On Wed, 6 Mar 2019 at 04:11, David Rowley  
wrote:
>>
>> Hi Jim,
>>
>> Thanks for replying here.
>>
>> On Wed, 6 Mar 2019 at 16:37, Jim Finnerty  wrote:
>> >
>> > Actually, we're working hard to integrate the two approaches.  I 
haven't had
>> > time since I returned to review your patch, but I understand that you 
were
>> > checking for strict predicates as part of the nullness checking 
criteria,
>> > and we definitely must have that.  Zheng tells me that he has combined 
your
>> > patch with ours, but before we put out a new patch, we're trying to 
figure
>> > out how to preserve the existing NOT IN execution plan in the case 
where the
>> > materialized subplan fits in memory.  This (good) plan is effectively 
an
>> > in-memory hash anti-join.
>> >
>> > This is tricky to do because the NOT IN Subplan to anti-join 
transformation
>> > currently happens early in the planning process, whereas the decision 
to
>> > materialize is made much later, when the best path is being converted 
into a
>> > Plan.
>>
>> I guess you're still going with the OR ... IS NULL in your patch then?
>> otherwise, you'd likely find that the transformation (when NULLs are
>> not possible) is always a win since it'll allow hash anti-joins. (see
>> #2 in the original email on this thread)  FWIW I mentioned in [1] and
>> Tom confirmed in [2] that we both think hacking the join condition to
>> add an OR .. IS NULL is a bad idea. I guess you're not deterred by
>> that?
>
>
> Surely we want both?
>
> 1. Transform when we can
> 2. Else apply some other approach if the cost can be reduced by doing it

Maybe.  If the scope for the conversion is reduced to only add the OR
.. IS NULL join clause when the subplan could not be hashed then it's
maybe less likely to cause performance regressions. Remember that this
forces the planner to use a nested loop join since no other join
algorithms support OR clauses. I think Jim and Zheng have now changed
their patch to do that.  If we can perform a parameterised nested loop
join then that has a good chance of being better than scanning the
subquery multiple times, however, if there's no index to do a
parameterized nested loop, then we need to do a normal nested loop
which will perform poorly, but so will the non-hashed subplan...

# create table t1 (a int);
CREATE TABLE
# create table t2 (a int);
CREATE TABLE
# set work_mem = '64kB';
SET
# insert into t1 select generate_Series(1,1);
INSERT 0 1
# insert into t2 select generate_Series(1,1);
INSERT 0 1
# explain analyze select count(*) from t1 where a not in(select a from t2);
QUERY PLAN

--
 Aggregate  (cost=1668739.50..1668739.51 rows=1 width=8) (actual
time=7079.077..7079.077 rows=1 loops=1)
   ->  Seq Scan on t1  (cost=0.00..1668725.16 rows=5738 width=0)
(actual time=7079.072..7079.072 rows=0 loops=1)
 Filter: (NOT (SubPlan 1))
 Rows Removed by Filter: 1
 SubPlan 1
   ->  Materialize  (cost=0.00..262.12 rows=11475 width=4)
(actual time=0.004..0.397 rows=5000 loops=1)
 ->  Seq Scan on t2  (cost=0.00..159.75 rows=11475
width=4) (actual time=0.019..4.921 rows=1 loops=1)
 Planning Time: 0.348 ms
 Execution Time: 7079.309 ms
(9 rows)

# explain analyze select count(*) from t1 where not exists(select 1
from t2 where t1.a = t2.a or t2.a is null);
   QUERY PLAN


 Aggregate

Re: pg_dump multi VALUES INSERT

2019-06-14 Thread Fabien COELHO



Hello Alvaro,


Shouldn't the --rows-per-insert option also be available via pg_dumpall?
 All the other options for switching between COPY and INSERT are
settable in pg_dumpall.


Uh, yeah, absolutely.

Surafel, are you in a position to provide a patch for that quickly?


End of the week, more time, easy enough and I should have seen the issue 
while reviewing. Patch attached.


BTW, is the libpq hostaddr fix ok?

--
Fabien.diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml
index b35c702f99..ac8d039bf4 100644
--- a/doc/src/sgml/ref/pg_dumpall.sgml
+++ b/doc/src/sgml/ref/pg_dumpall.sgml
@@ -514,6 +514,20 @@ PostgreSQL documentation
   
  
 
+ 
+  --rows-per-insert=nrows
+  
+   
+Dump data as INSERT commands (rather than
+COPY).  Controls the maximum number of rows per
+INSERT command. The value specified must be a
+number greater than zero.  Any error during reloading will cause only
+rows that are part of the problematic INSERT to be
+lost, rather than the entire table contents.
+   
+  
+ 
+
  
-?
--help
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index ea4ac91c00..fd1244fa5d 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -146,6 +146,7 @@ main(int argc, char *argv[])
 		{"no-sync", no_argument, NULL, 4},
 		{"no-unlogged-table-data", no_argument, &no_unlogged_table_data, 1},
 		{"on-conflict-do-nothing", no_argument, &on_conflict_do_nothing, 1},
+		{"rows-per-insert", required_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -329,6 +330,11 @@ main(int argc, char *argv[])
 simple_string_list_append(&database_exclude_patterns, optarg);
 break;
 
+			case 7:
+appendPQExpBufferStr(pgdumpopts, " --rows-per-insert ");
+appendShellString(pgdumpopts, optarg);
+break;
+
 			default:
 fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 exit_nicely(1);
@@ -651,6 +657,7 @@ help(void)
 	printf(_("  --use-set-session-authorization\n"
 			 "   use SET SESSION AUTHORIZATION commands instead of\n"
 			 "   ALTER OWNER commands to set ownership\n"));
+	printf(_("  --rows-per-insert=NROWS  number of rows per INSERT; implies --inserts\n"));
 
 	printf(_("\nConnection options:\n"));
 	printf(_("  -d, --dbname=CONNSTR connect using connection string\n"));

Re: Index Skip Scan

2019-06-14 Thread Jesper Pedersen


Hi James,

On 6/13/19 11:40 PM, James Coleman wrote:

I've previously noted upthread (along with several others), that I
don't see a good reason to limit this new capability to index only
scans. In addition to the reasons upthread, this also prevents using
the new feature on physical replicas since index only scans require
visibility map (IIRC) information that isn't safe to make assumptions
about on a replica.

That being said, it strikes me that this likely indicates an existing
architecture issue. I was discussing the problem at PGCon with Andres
and Heiki with respect to an index scan variation I've been working on
myself. In short, it's not clear to me why we want index only scans
and index scans to be entirely separate nodes, rather than optional
variations within a broader index scan node. The problem becomes even
more clear as we continue to add additional variants that lie on
different axis, since we end up with an ever multiplying number of
combinations.

In that discussion no one could remember why it'd been done that way,
but I'm planning to try to find the relevant threads in the archives
to see if there's anything in particular blocking combining them.

I generally dislike gating improvements like this on seemingly
tangentially related refactors, but I will make the observation that
adding the skip scan on top of such a refactored index scan node would
make this a much more obvious and complete win.



Thanks for your feedback !


As I noted to Jesper at PGCon I'm happy to review the code in detail
also, but likely won't get to it until later this week or next week at
the earliest.

Jesper: Is there anything still on your list of things to change about
the patch? Or would now be a good time to look hard at the code?



It would be valuable to have test cases for your use-cases which works 
now, or should work.


I revived Thomas' patch because it covered our use-cases and saw it as a 
much needed feature.


Thanks again !

Best regards,
 Jesper

Re: Index Skip Scan

2019-06-14 Thread Jesper Pedersen


Hi David,

On 6/14/19 3:19 AM, David Rowley wrote:

I read over this thread a few weeks ago while travelling back from
PGCon. (I wish I'd read it on the outward trip instead since it would
have been good to talk about it in person.)

First off. I think this is a pretty great feature. It certainly seems
worthwhile working on it.

I've looked over the patch just to get a feel for how the planner part
works and I have a few ideas to share.

The code in create_distinct_paths() I think should work a different
way. I think it would be much better to add a new field to Path and
allow a path to know what keys it is distinct for.  This sort of goes
back to an idea I thought about when developing unique joins
(9c7f5229ad) about an easier way to detect fields that a relation is
unique for. I've been calling these "UniqueKeys" in a few emails [1].
The idea was to tag these onto RelOptInfo to mention which columns or
exprs a relation is unique by so that we didn't continuously need to
look at unique indexes in all the places that call
relation_has_unique_index_for().  The idea there was that unique joins
would know when a join was unable to duplicate rows. If the outer side
of a join didn't duplicate the inner side, then the join RelOptInfo
could keep the UniqueKeys from the inner rel, and vice-versa. If both
didn't duplicate then the join rel would obtain the UniqueKeys from
both sides of the join.  The idea here is that this would be a better
way to detect unique joins, and also when it came to the grouping
planner we'd also know if the distinct or group by should be a no-op.
DISTINCT could be skipped, and GROUP BY could do a group aggregate
without any sort.

I think these UniqueKeys ties into this work, perhaps not adding
UniqueKeys to RelOptInfo, but just to Path so that we create paths
that have UniqueKeys during create_index_paths() based on some
uniquekeys that are stored in PlannerInfo, similar to how we create
index paths in build_index_paths() by checking if the index
has_useful_pathkeys().  Doing it this way would open up more
opportunities to use skip scans. For example, semi-joins and
anti-joins could make use of them if the uniquekeys covered the entire
join condition. With this idea, the code you've added in
create_distinct_paths() can just search for the cheapest path that has
the correct uniquekeys, or if none exist then just do the normal
sort->unique or hash agg.  I'm not entirely certain how we'd instruct
a semi/anti joined relation to build such paths, but that seems like a
problem that could be dealt with when someone does the work to allow
skip scans to be used for those.

Also, I'm not entirely sure that these UniqueKeys should make use of
PathKey since there's no real need to know about pk_opfamily,
pk_strategy, pk_nulls_first as those all just describe how the keys
are ordered. We just need to know if they're distinct or not. All
that's left after removing those fields is pk_eclass, so could
UniqueKeys just be a list of EquivalenceClass? or perhaps even a
Bitmapset with indexes into PlannerInfo->ec_classes (that might be
premature for not since we've not yet got
https://commitfest.postgresql.org/23/1984/ or
https://commitfest.postgresql.org/23/2019/ )  However, if we did use
PathKey, that does allow us to quickly check if the UniqueKeys are
contained within the PathKeys, since pathkeys are canonical which
allows us just to compare their memory address to know if two are
equal, however, if you're storing eclasses we could probably get the
same just by comparing the address of the eclass to the pathkey's
pk_eclass.

Otherwise, I think how you're making use of paths in
create_distinct_paths() and create_skipscan_unique_path() kind of
contradicts how they're meant to be used.



Thank you very much for this feedback ! Will need to revise the patch 
based on this.



I also agree with James that this should not be limited to Index Only
Scans. From testing the patch, the following seems pretty strange to
me:

# create table abc (a int, b int, c int);
CREATE TABLE
# insert into abc select a,b,1 from generate_Series(1,1000) a,
generate_Series(1,1000) b;
INSERT 0 100
# create index on abc(a,b);
CREATE INDEX
# explain analyze select distinct on (a) a,b from abc order by a,b; --
this is fast.
  QUERY PLAN
-
  Index Only Scan using abc_a_b_idx on abc  (cost=0.42..85.00 rows=200
width=8) (actual time=0.260..20.518 rows=1000 loops=1)
Scan mode: Skip scan
Heap Fetches: 1000
  Planning Time: 5.616 ms
  Execution Time: 21.791 ms
(5 rows)


# explain analyze select distinct on (a) a,b,c from abc order by a,b;
-- Add one more column and it's slow.
 QUERY PLAN
-

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-14 Thread Tomas Vondra

On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:

On Wed, Jun 5, 2019 at 12:14 PM Rafia Sabih  wrote:

> > 2) Provide some fallback at execution time. For example, we might watch
> > the size of the group, and if we run into an unexpectedly large one we
> > might abandon the incremental sort and switch to a "full sort" mode.
>
> Are there good examples of our doing this in other types of nodes
> (whether the fallback is an entirely different algorithm/node type)? I
> like this idea in theory, but I also think it's likely it would add a
> very significant amount of complexity. The other problem is knowing
> where to draw the line: you end up creating these kinds of cliffs
> where pulling one more tuple through the incremental sort would give
> you your batch and result in not having to pull many more tuples in a
> regular sort node, but the fallback logic kicks in anyway.
>
What about having some simple mechanism for this, like if we encounter
the group with more tuples than the one estimated then simply switch
to normal sort for the remaining tuples, as the estimates does not
hold true anyway. Atleast this will not give issues of having
regressions of incremental sort being too bad than the normal sort.
I mean having something like this, populate the tuplesortstate and
keep on counting the number of tuples in a group, if they are within
the budget call tuplesort_performsort, otherwise put all the tuples in
the tuplesort and then call tuplesort_performsort. We may have an
additional field in IncrementalSortState to save the estimated size of
each group. I am assuming that we use MCV lists to approximate better
the group sizes as suggested above by Tomas.

I think the first thing to do is get some concrete numbers on performance if we:

1. Only sort one group at a time.
2. Update the costing to prefer traditional sort unless we have very
high confidence we'll win with incremental sort.

It'd be nice not to have to add additional complexity if at all possible.

+1 to that

> Unrelated to all of the above: if I read the patch properly it
> intentionally excludes backwards scanning. I don't see any particular
> reason why that ought to be the case, and it seems like an odd
> limitation for the feature should it be merged. Should that be a
> blocker to merging?

Regarding this, I came across this,
/*
  * Incremental sort can't be used with either EXEC_FLAG_REWIND,
  * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
  * bucket in tuplesortstate.
  */
I think that is quite understandable. How are you planning to support
backwards scan for this? In other words, when will incremental sort be
useful for backward scan.

For some reason I was thinking we'd need it to support backwards scans
to be able to handle DESC sort on the index, but I've tested and
confirmed that already works. I suppose that's because the index scan
provides that ordering and the sort node doesn't need to reverse the
direction of what's provided to it. That's not particularly obvious to
someone newer to the codebase; I'm not sure if that's documented
anywhere.

Yeah, backward scans are not about ASC/DESC, it's about being able to walk
back through the result. And we can't do that with incremental sorts
without materialization.

On a different note, I can't stop imagining this operator on the lines
similar to parallel-append, wherein multiple workers can sort the
different groups independently at the same time.

That is an interesting idea. I suppose it'd be particularly valuable
if somehow there a node that was generating each batch in parallel
already, though I'm not sure at first though what kind of query or
node would result in that. I also wonder if (assuming that weren't the
case) it would be much of an improvement since a single thread would
have to generate each batch anyway; I'm not sure if the overhead of
farming each batch out to a worker would actually be useful or if the
real blocker is the base scan.

At the very least it's an interesting idea.

I kinda doubt that'd be very valuable. Or more precisely, we kinda already
have that capability because we can do things like this:

  -> Gather Merge
 -> Sort
-> ... scan ...

so I imagine we'd just do an Incremental Sort here. Granted, it's not
distributed by prefix groups (I assume that's what you mean by batches
here), but I don't think that's a big problem.

---

I've been writing down notes here, and I realized that my test case
from far upthread is actually a useful setup to see how much overhead
is involved in sorting each batch individually, since it sets up data
with each batch only containing 1 tuple. That particular case is one
we could easily optimize anyway in the code and skip sorting
altogether -- might be a useful enhancement.

I hope to do some more testing and then report back with the results.

James Coleman

OK.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Su

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-14 Thread Tomas Vondra

On Wed, Jun 05, 2019 at 06:14:14PM +0200, Rafia Sabih wrote:

On Mon, 3 Jun 2019 at 15:39, James Coleman  wrote:

On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra
 wrote:
> Currently, the costing logic (cost_incremental_sort) assumes all groups
> have the same size, which is fine for uniform distributions. But for
> skewed distributions, that may be an issue.
>
> Consider for example a table with 1M rows, two columns, 100 groups in each
> column, and index on the first one.
>
> CREATE table t (a INT, b INT);
>
> INSERT INTO t SELECT 100*random(), 100*random()
>   FROM generate_series(1,100);
>
> Now, let's do a simple limit query to find the first row:
>
> SELECT * FROM t ORDER BU a, b LIMIT 1;
>
> In this case the current costing logic is fine - the groups are close to
> average size, and we only need to sort the first group, i.e. 1% of data.
>
> Now, let's say the first group is much larger:
>
>
> INSERT INTO t SELECT 0, 100*random()
>   FROM generate_series(1,90) s(i);
>
> INSERT INTO t SELECT 100*random(), 100*random()
>   FROM generate_series(1,10);
>
> That is, the first group is roughly 90% of data, but the number of groups
> is still the same. But we need to scan 90% of data. But the average group
> size is still the same, so the current cost model is oblivious to this.

Thinking out loud here: the current implementation doesn't guarantee
that sort groups always have the same prefix column values because
(from code comments) "Sorting many small groups with tuplesort is
inefficient). While this seems a reasonable optimization, I think it's
possible that thinking steered away from an optimization in the
opposite direction. Perhaps we should always track whether or not all
prefix tuples are the same (currently we only do that after reaching
DEFAULT_MIN_GROUP_SIZE tuples) and use that information to be able to
have tuplesort only care about the non-prefix columns (where currently
it has to sort on all pathkey columns even though for a large group
the prefix columns are guaranteed to be equal).

+1 for passing only the non-prefix columns to tuplesort.

Essentially I'm trying to think of ways that would get us to
comparable performance with regular sort in the case of large batch
sizes.

One other thing about the DEFAULT_MIN_GROUP_SIZE logic is that in the
case where you have a very small group and then a very large batch,
we'd lose the ability to optimize in the above way. That makes me
wonder if we shouldn't intentionally optimize for the possibility of
large batch sizes, since a little extra expense per group/tuple is
more likely to be a non-concern with small groups anyway when there
are large numbers of input tuples but a relatively small limit.

What about using the knowledge of MCV here, if we know the next value
is in MCV list then take the overhead of sorting this small group
alone and then leverage the optimization for the larger group, by
passing only the non-prefix columns.

Not sure. It very much depends on how expensive the comparisons are for
that particular data type. If the comparisons are cheap, then I'm not sure
it matters very much whether the group is small or large. For expensive
comparison it may not be a win either, because we need to search the MCV
lists whenever the group changes.

I guess we'll need to make some benchmarks and see if it's a win or not.

Thoughts?

> So I think we should look at the MCV list, and use that information when
> computing the startup/total cost. I think using the first/largest group to
> compute the startup cost, and the average group size for total cost would
> do the trick.

+1

I think this sounds very reasonable.

> I don't think we can do much better than this during planning. There will
> inevitably be cases where the costing model will push us to do the wrong
> thing, in either direction. The question is how serious issue that is, and
> whether we could remedy that during execution somehow.
>
> When we "incorrectly" pick full sort (when the incremental sort would be
> faster), that's obviously sad but I think it's mostly fine because it's
> not a regression.
>
> The opposite direction (picking incremental sort, while full sort being
> faster) is probably more serious, because it's a regression between
> releases.
>
> I don't think we can fully fix that by refining the cost model. We have
> two basic options:
>
> 1) Argue/show that this is not an issue in practice, because (a) such
> cases are very rare, and/or (b) the regression is limited. In short, the
> benefits of the patch outweight the losses.

My comments above go in this direction. If we can improve performance
in the worst case, I think it's plausible this concern becomes a
non-issue.

> 2) Provide some fallback at execution time. For example, we might watch
> the size of the group, and if we run into an unexpectedly large one we
> might abandon the incremental sort and switch to a "full sort" mode.

Are there good examples of our doing th

Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

> "Tom" == Tom Lane  writes:

 >>> Anyway, moving on to the question of what should we do about this,
 >>> I don't really have anything better to offer than back-patching
 >>> 23bd3cec6.

 >> The PG12 behavior seems sane, so +1.

 Tom> OK, I'll make that happen.

This isn't good enough, because it still picks "UCT" on a system with no
/etc/localtime and no TZ variable. Testing on HEAD as of 3da73d683 (on
FreeBSD, but it'll be the same anywhere else):

% ls -l /etc/*time*
ls: /etc/*time*: No such file or directory

% env -u TZ bin/initdb -D data -E UTF8 --no-locale
[...]
selecting default timezone ... UCT

We need to absolutely prefer UTC over UCT if both match.

-- 
Andrew (irc:RhodiumToad)

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-06-14 Thread Joe Conway

On 6/13/19 11:07 AM, Bruce Momjian wrote:
> On Thu, Jun 13, 2019 at 04:26:47PM +0900, Masahiko Sawada wrote:
>> Yeah, in principle since data key of 2 tier key architecture should
>> not go outside database I think we should not tell data keys to
>> utility commands. So the rearranging WAL format seems to be a better
>> solution but is there any reason why the main data is placed at end of
>> WAL record? I wonder if we can assemble WAL records as following order
>> and encrypt only 3 and 4.
>> 
>> 1. Header data (XLogRecord and other headers)
>> 2. Main data (xl_heap_insert, xl_heap_update etc + related data)
>> 3. Block data (Tuple data, FPI)
>> 4. Sub data (e.g tuple data for logical decoding)
> 
> Yes, that does sound like a reasonable idea.  It is similar to us not
> encrypting the clog --- there is little value.  However, if we only
> encrypt the cluster, we don't need to expose the relfilenode and we can
> just encrypt the entire WAL --- I like that simplicity.  We might find
> that the complexity of encrypting only certain tablespaces makes the
> system slower than just encrypting the entire cluster.


There are reasons other than performance to want more granular than
entire cluster encryption. Limiting the volume of encrypted data with
any one key for example. And not encrypting #1 & 2 above would help
avoid known-plaintext attacks I would think.


>> > > Also, for system catalog encryption, it could be a hard part. System
>> > > catalogs are initially created at initdb time and created by copying
>> > > from template1 when CREATE DATABASE. Therefore we would need to either
>> > > modify initdb so that it's aware of encryption keys and KMS or modify
>> > > database creation so that it copies database file while encrypting
>> > > them.
>> >
>> > I assume initdb will use the same API that you would use to start the
>> > server itself, e.g., type in a password, or contact a key server.
>> 
>> I realized that in XTS encryption mode since we craft the tweak using
>> relfilenode we will need to have the different tweaks for system
>> catalogs in new database would change. So we might need to re-encrypt
>> system catalogs when CREATE DATABASE after all. I suspect that even
>> the cluster-wide encryption has the same problem.
> 
> Yes, this is why I want to just do cluster-wide encryption at this
> stage.
> 
> In addition, while the 8k blocks would use a block cipher, the WAL would
> likely use a stream cipher, and it will be very hard to use multiple
> stream ciphers in a single WAL file.


I don't understand why we would not just use AES for both.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature

Re: Binary support for pgoutput plugin

2019-06-14 Thread Tomas Vondra

On Wed, Jun 12, 2019 at 10:35:48AM -0400, Dave Cramer wrote:

On Mon, 10 Jun 2019 at 07:49, Petr Jelinek 
wrote:

Hi,

On 10/06/2019 13:27, Dave Cramer wrote:
> So back to binary output.
>
> From what I can tell the place to specify binary options would be in the
> create publication and or in replication slots?
>
> The challenge as I see it is that the subscriber would have to be able
> to decode binary output.
>
> Any thoughts on how to handle this? At the moment I'm assuming that this
> would only work for subscribers that knew how to handle binary.
>

Given that we don't need to write anything extra to WAL on upstream to
support binary output, why not just have the request for binary data as
an option for the pgoutput and have it chosen dynamically? Then it's the
subscriber who asks for binary output via option(s) to START_REPLICATION.

If I understand this correctly this would add something to the CREATE/ALTER
SUBSCRIPTION commands in the WITH clause.
Additionally another column would be required for pg_subscription for the
binary option.
Does it make sense to add an options column which would just be a comma
separated string?
Not that I have future options in mind but seems like something that might
come up in the future.

I'd just add a boolean column to the catalog. That's what I did in the
patch adding support for streaming in-progress transactions. I don't think
we expect many additional parameters, so it makes little sense to optimize
for that case.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

Andrew Gierth  writes:
> This isn't good enough, because it still picks "UCT" on a system with no
> /etc/localtime and no TZ variable. Testing on HEAD as of 3da73d683 (on
> FreeBSD, but it'll be the same anywhere else):

[ shrug... ]  Too bad.  I doubt that that's a common situation anyway.

> We need to absolutely prefer UTC over UCT if both match.

I don't see a reason why that's a hard requirement.  There are at least
two ways for a user to override initdb's decision (/etc/localtime or TZ),
or she could just change the GUC setting after the fact, and for that
matter it's not obvious that it matters to most people how TimeZone
is spelled as long as it delivers the right external behavior.  We had
the business with "Navajo" being preferred for US Mountain time for
quite a few years, with not very many complaints.

I don't see any way that we could "fix" this except with a hardwired
special case to prefer UTC over other spellings, and I definitely do
not want to go there.  If we start putting in magic special cases to make
particular zone names be preferred over other ones, where will we stop?
(I've been lurking on the tzdb mailing list for long enough now to know
that that's a fine recipe for opening ourselves up to politically-
motivated demands that name X be preferred over name Y.)

A possibly better idea is to push back on tzdb's choice to unify
these zones.   Don't know if they'd listen, but we could try.  The
UCT symlink hasn't been out there so long that it's got much inertia.

regards, tom lane

CREATE STATISTICS documentation bug

2019-06-14 Thread Robert Haas

https://www.postgresql.org/docs/12/sql-createstatistics.html contains
this example command:

CREATE STATISTICS s2 (mcv) ON (a, b) FROM t2;

But that produces:

psql: ERROR:  only simple column references are allowed in CREATE STATISTICS

I think the parentheses around (a, b) just need to be removed.

P.S. I think the fact that we print "psql: " before the ERROR here is
useless clutter.  We didn't do that in v11 and prior and I think we
should kill it with fire.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: CREATE STATISTICS documentation bug

Robert Haas  writes:
> P.S. I think the fact that we print "psql: " before the ERROR here is
> useless clutter.  We didn't do that in v11 and prior and I think we
> should kill it with fire.

Agreed, particularly seeing that the error is not originating with
psql; it's just passing it on.

regards, tom lane

Re: Binary support for pgoutput plugin

2019-06-14 Thread Dave Cramer

Dave Cramer


On Fri, 14 Jun 2019 at 14:36, Tomas Vondra 
wrote:

> On Wed, Jun 12, 2019 at 10:35:48AM -0400, Dave Cramer wrote:
> >On Mon, 10 Jun 2019 at 07:49, Petr Jelinek 
> >wrote:
> >
> >> Hi,
> >>
> >> On 10/06/2019 13:27, Dave Cramer wrote:
> >> > So back to binary output.
> >> >
> >> > From what I can tell the place to specify binary options would be in
> the
> >> > create publication and or in replication slots?
> >> >
> >> > The challenge as I see it is that the subscriber would have to be able
> >> > to decode binary output.
> >> >
> >> > Any thoughts on how to handle this? At the moment I'm assuming that
> this
> >> > would only work for subscribers that knew how to handle binary.
> >> >
> >>
> >> Given that we don't need to write anything extra to WAL on upstream to
> >> support binary output, why not just have the request for binary data as
> >> an option for the pgoutput and have it chosen dynamically? Then it's the
> >> subscriber who asks for binary output via option(s) to
> START_REPLICATION.
> >>
> >
> >If I understand this correctly this would add something to the
> CREATE/ALTER
> >SUBSCRIPTION commands in the WITH clause.
> >Additionally another column would be required for pg_subscription for the
> >binary option.
> >Does it make sense to add an options column which would just be a comma
> >separated string?
> >Not that I have future options in mind but seems like something that might
> >come up in the future.
> >
>
> I'd just add a boolean column to the catalog. That's what I did in the
> patch adding support for streaming in-progress transactions. I don't think
> we expect many additional parameters, so it makes little sense to optimize
> for that case.
>

Which is what I have done. Thanks

I've attached both patches for comments.
I still have to add documentation.

Regards,

Dave


0002-add-binary-column-to-pg_subscription.patch
Description: Binary data


0001-First-pass-at-working-code-without-subscription-opti.patch
Description: Binary data

Re: SQL/JSON path issues/questions

2019-06-14 Thread Alexander Korotkov

Hi!

On Fri, Jun 14, 2019 at 10:16 AM Kyotaro Horiguchi
 wrote:
> At Thu, 13 Jun 2019 14:59:51 +0100, Thom Brown  wrote
> in 
> > Hi,
> >
> > I've been reading through the documentation regarding jsonpath and
> > jsonb_path_query etc., and I have found it lacking explanation for
> > some functionality, and I've also had some confusion when using the
> > feature.
> >
> > ? operator
> > ==
> > The first mention of '?' is in section 9.15, where it says:
> >
> > "Suppose you would like to retrieve all heart rate values higher than
> > 130. You can achieve this using the following expression:
> > '$.track.segments[*].HR ? (@ > 130)'"
> >
> > So what is the ? operator doing here?  Sure, there's the regular ?
>
> It is described just above as:
>
> | Each filter expression must be enclosed in parentheses and
> | preceded by a question mark.

+1

> > operator, which is given as an example further down the page:
> >
> > '{"a":1, "b":2}'::jsonb ? 'b'
> >
> > But this doesn't appear to have the same purpose.
>
> The section is mentioning path expressions and the '?' is a jsonb
> operator. It's somewhat confusing but not so much comparing with
> around..

+1

> > like_regex
> > ==
> > Then there's like_regex, which shows an example that uses the keyword
> > "flag", but that is the only instance of that keyword being mentioned,
> > and the flags available to this expression aren't anywhere to be seen.
>
> It is described as POSIX regular expressions. So '9.7.3 POSIX
> Regular Expressions' is that. But linking it would
> helpful. (attached 0001)

Actually, standard requires supporting the same regex flags as
XQuery/XPath does [1].  Perhaps, we found that we miss support for 'q'
flag, while it's trivial.  Attached patch fixes that.  Documentation
should contain description of flags.  That will be posted as separate
patch.

> > is unknown
> > ==
> > "is unknown" suggests a boolean output, but the example shows an
> > output of "infinity".  While I understand what it does, this appears
> > inconsistent with all other "is..." functions (e.g. is_valid(lsn),
> > pg_is_other_temp_schema(oid), pg_opclass_is_visible(opclass_oid),
> > pg_is_in_backup() etc.).
>
> It's the right behavior. Among them, only "infinity" gives
> "unknown' for the test (@ > 0). -1 gives false, 2 and 3 true.

+1
We follow here SQL standard for jsonpath language.  There is no direct
analogy with our SQL-level functions.

>
> > $varname
> > ==
> > The jsonpath variable, $varname, has an incomplete description: "A
> > named variable. Its value must be set in the PASSING clause of an
> > SQL/JSON query function. for details."
>
> Yeah, it is apparently chopped amid. In the sgml source, the
> missing part is "", and the PASSING clause is
> not implemented yet. On the other hand a similar stuff is
> currently implemented as vas parameter in some jsonb
> functions. Linking it to there might be helpful (Attached 0002).
>
> > Binary operation error
> > ==
> > I get an error when I run this query:
> >
> > postgres=# SELECT jsonb_path_query('[2]', '2 + $[1]');
> > psql: ERROR:  right operand of jsonpath operator + is not a single numeric 
> > value
> >
> > While I know it's correct to get an error in this scenario as there is
> > no element beyond 0, the message I get is confusing.  I'd expect this
> > if it encountered another array in that position, but not for
> > exceeding the upper bound of the array.
>
> Something like attached makes it clerer? (Attached 0003)

Thank you.  Will review these two and commit.

> | ERROR:  right operand of jsonpath operator + is not a single numeric value
> | DETAIL:  It was an array with 0 elements.
>
> > Cryptic error
> > ==
> > postgres=# SELECT jsonb_path_query('[1, "2",
> > {},[{"a":2}],2.3,null,"2019-06-05T13:25:43.511Z"]','$[**].type()');
> > psql: ERROR:  syntax error, unexpected ANY_P at or near "**" of jsonpath 
> > input
> > LINE 1: ...},[{"a":2}],2.3,null,"2019-06-05T13:25:43.511Z"]','$[**].typ...
> >  ^
> > Again, I expect an error, but the message produced doesn't help me.
> > I'll remove the ANY_P if I can find it.
>
> Yeah, I had a similar error:
>
> =# select jsonb_path_query('[-1,2,7, "infinity"]', '$[*] ? (($hoge) is
> unknown)', '{"hoge": (@ > 0)}');
> ERROR:  syntax error, unexpected IS_P at or near " " of jsonpath input
>
> When the errors are issued, the caller side is commented as:
>
> jsonpath_scan.l:481
> > jsonpath_yyerror(NULL, "bogus input"); /* shouldn't happen */
>
> The error message is reasonable if it were really shouldn't
> happen, but it quite easily happen. I don't have an idea of how
> to fix it for the present..

I'm also not sure.  Need further thinking about it.

> > Can't use nested arrays with jsonpath
> > ==
> >
> > I encounter an error in this scenario:
> >
> > postgres=# select jsonb_path_query('[1, 2, 1, [1,2], 3]','$[*] ? (@ == 
> > [1,2])');
> > psql: ERROR:  syntax er

Re: CREATE STATISTICS documentation bug

On 2019-Jun-14, Tom Lane wrote:

> Robert Haas  writes:
> > P.S. I think the fact that we print "psql: " before the ERROR here is
> > useless clutter.  We didn't do that in v11 and prior and I think we
> > should kill it with fire.
> 
> Agreed, particularly seeing that the error is not originating with
> psql; it's just passing it on.

+1

Proposal: each program declares at startup whether it wants the program
name prefix or not.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

> "Tom" == Tom Lane  writes:

 >> This isn't good enough, because it still picks "UCT" on a system with no
 >> /etc/localtime and no TZ variable. Testing on HEAD as of 3da73d683 (on
 >> FreeBSD, but it'll be the same anywhere else):

 Tom> [ shrug... ]  Too bad.  I doubt that that's a common situation anyway.

Literally every server I have set up is like this...

 >> We need to absolutely prefer UTC over UCT if both match.

 Tom> I don't see a reason why that's a hard requirement.

Because the reverse is clearly insane.

-- 
Andrew (irc:RhodiumToad)

Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-14 Thread Christopher Browne

On Fri, Jun 14, 2019, 3:12 PM Tom Lane  wrote:

> A possibly better idea is to push back on tzdb's choice to unify
> these zones.   Don't know if they'd listen, but we could try.  The
> UCT symlink hasn't been out there so long that it's got much inertia.

One oddity; AIX had a preference for CUT with fallbacks to CUT0 and UCT
back when we had AIX boxes (5.2 or 5.3, if my memory still works on this).

We wound up setting PGTZ explicitly to UTC to overrule any such fighting
between time zones.

There may therefore be some older history (and some sort of inertia) in AIX
land than meets the eye elsewhere.

That doesn't prevent it from being a good idea to talk to tzdb maintainers,
of course.

Re: SQL/JSON path issues/questions

2019-06-14 Thread Alexander Korotkov

On Thu, Jun 13, 2019 at 5:00 PM Thom Brown  wrote:
> I've been reading through the documentation regarding jsonpath and
> jsonb_path_query etc., and I have found it lacking explanation for
> some functionality, and I've also had some confusion when using the
> feature.

BTW, I've some general idea about jsonpath documentation structure.
Right now definition of jsonpath language is spread between sections
"JSON Types" [1] and "JSON Functions, Operators, and Expressions" [2].
Thank might be confusing.  I think it would be more readable if whole
jsonpath language definition would be given in a single place.  I
propose to move whole definition of jsonpath to section [1] leaving
section [2] just with SQL-level functions.  Any thoughts?

Links.

1. https://www.postgresql.org/docs/devel/datatype-json.html#DATATYPE-JSONPATH
2. 
https://www.postgresql.org/docs/devel/functions-json.html#FUNCTIONS-SQLJSON-PATH

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: CREATE STATISTICS documentation bug

Alvaro Herrera  writes:
> On 2019-Jun-14, Tom Lane wrote:
>> Robert Haas  writes:
>>> P.S. I think the fact that we print "psql: " before the ERROR here is
>>> useless clutter.  We didn't do that in v11 and prior and I think we
>>> should kill it with fire.

>> Agreed, particularly seeing that the error is not originating with
>> psql; it's just passing it on.

> +1

> Proposal: each program declares at startup whether it wants the program
> name prefix or not.

Well, to clarify: I think it's reasonable to include "psql: " if the
message is originating in psql.  So I don't think your idea quite
does what we want.

regards, tom lane

Draft back-branch release notes are up for review

I've committed first-draft release notes for next week's
releases at

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=0995cefa74510ee0e38d1bf095b2eef2c1ea37c4

Please send any review comments by Sunday.

regards, tom lane

Re: CREATE STATISTICS documentation bug

On 2019-Jun-14, Tom Lane wrote:

> Alvaro Herrera  writes:

> > Proposal: each program declares at startup whether it wants the program
> > name prefix or not.
> 
> Well, to clarify: I think it's reasonable to include "psql: " if the
> message is originating in psql.  So I don't think your idea quite
> does what we want.

Hmm, it doesn't.

Maybe the error reporting API needs a bit of a refinement to suppress
the prefix for specific error callsites, then?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: fix psql \conninfo & \connect when using hostaddr

On 2019-May-27, Noah Misch wrote:

> Other than that, the \connect behavior change makes sense to me.  However,
> nothing updated \connect documentation.  (Even the commit log message said
> nothing about \connect.)

I added this blurb to the paragraph that documents the reusing behavior:

If hostaddr was specified in the original
connection's conninfo, that address is reused
for the new connection (disregarding any other host specification).

Thanks for reporting these problems.  I'm going to push this shortly.
We can revise over the weekend, if there's need.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

> "Tom" == Tom Lane  writes:

 >> This isn't good enough, because it still picks "UCT" on a system
 >> with no /etc/localtime and no TZ variable. Testing on HEAD as of
 >> 3da73d683 (on FreeBSD, but it'll be the same anywhere else):

 Tom> [ shrug... ]  Too bad.  I doubt that that's a common situation anyway.

I'm also reminded that this applies also if the /etc/localtime file is a
_copy_ of the UTC zonefile rather than a symlink, which is possibly even
more common.

-- 
Andrew (irc:RhodiumToad)

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-06-14 Thread Bruce Momjian

On Fri, Jun 14, 2019 at 02:27:17PM -0400, Joe Conway wrote:
> On 6/13/19 11:07 AM, Bruce Momjian wrote:
> > On Thu, Jun 13, 2019 at 04:26:47PM +0900, Masahiko Sawada wrote:
> >> Yeah, in principle since data key of 2 tier key architecture should
> >> not go outside database I think we should not tell data keys to
> >> utility commands. So the rearranging WAL format seems to be a better
> >> solution but is there any reason why the main data is placed at end of
> >> WAL record? I wonder if we can assemble WAL records as following order
> >> and encrypt only 3 and 4.
> >> 
> >> 1. Header data (XLogRecord and other headers)
> >> 2. Main data (xl_heap_insert, xl_heap_update etc + related data)
> >> 3. Block data (Tuple data, FPI)
> >> 4. Sub data (e.g tuple data for logical decoding)
> > 
> > Yes, that does sound like a reasonable idea.  It is similar to us not
> > encrypting the clog --- there is little value.  However, if we only
> > encrypt the cluster, we don't need to expose the relfilenode and we can
> > just encrypt the entire WAL --- I like that simplicity.  We might find
> > that the complexity of encrypting only certain tablespaces makes the
> > system slower than just encrypting the entire cluster.
> 
> 
> There are reasons other than performance to want more granular than
> entire cluster encryption. Limiting the volume of encrypted data with
> any one key for example. And not encrypting #1 & 2 above would help
> avoid known-plaintext attacks I would think.
> 
> 
> >> > > Also, for system catalog encryption, it could be a hard part. System
> >> > > catalogs are initially created at initdb time and created by copying
> >> > > from template1 when CREATE DATABASE. Therefore we would need to either
> >> > > modify initdb so that it's aware of encryption keys and KMS or modify
> >> > > database creation so that it copies database file while encrypting
> >> > > them.
> >> >
> >> > I assume initdb will use the same API that you would use to start the
> >> > server itself, e.g., type in a password, or contact a key server.
> >> 
> >> I realized that in XTS encryption mode since we craft the tweak using
> >> relfilenode we will need to have the different tweaks for system
> >> catalogs in new database would change. So we might need to re-encrypt
> >> system catalogs when CREATE DATABASE after all. I suspect that even
> >> the cluster-wide encryption has the same problem.
> > 
> > Yes, this is why I want to just do cluster-wide encryption at this
> > stage.
> > 
> > In addition, while the 8k blocks would use a block cipher, the WAL would
> > likely use a stream cipher, and it will be very hard to use multiple
> > stream ciphers in a single WAL file.
> 
> 
> I don't understand why we would not just use AES for both.

Uh, AES is an encryption cipher.  You can use it with block mode, CBC,
or stream mode, CTR, GCM;  see:

http://momjian.us/main/writings/crypto.pdf#page=7

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

> "Andrew" == Andrew Gierth  writes:
> "Tom" == Tom Lane  writes:

 >>> This isn't good enough, because it still picks "UCT" on a system
 >>> with no /etc/localtime and no TZ variable. Testing on HEAD as of
 >>> 3da73d683 (on FreeBSD, but it'll be the same anywhere else):

 Tom> [ shrug... ]  Too bad.  I doubt that that's a common situation anyway.

 Andrew> I'm also reminded that this applies also if the /etc/localtime
 Andrew> file is a _copy_ of the UTC zonefile rather than a symlink,
 Andrew> which is possibly even more common.

And testing shows that if you select "UTC" when installing FreeBSD, you
indeed get /etc/localtime as a copy not a symlink, and I've confirmed
that initdb picks "UCT" in that case.

So here is my current proposed fix.

-- 
Andrew (irc:RhodiumToad)

diff --git a/src/bin/initdb/findtimezone.c b/src/bin/initdb/findtimezone.c
index 3477a08efd..f7c199a006 100644
--- a/src/bin/initdb/findtimezone.c
+++ b/src/bin/initdb/findtimezone.c
@@ -128,8 +128,11 @@ pg_load_tz(const char *name)
  * the C library's localtime() function.  The database zone that matches
  * furthest into the past is the one to use.  Often there will be several
  * zones with identical rankings (since the IANA database assigns multiple
- * names to many zones).  We break ties arbitrarily by preferring shorter,
- * then alphabetically earlier zone names.
+ * names to many zones).  We break ties by first checking for "preferred"
+ * names (such as "UTC"), and then arbitrarily by preferring shorter, then
+ * alphabetically earlier zone names.  (If we did not explicitly prefer
+ * "UTC", we would get the alias name "UCT" instead due to alphabetic
+ * ordering.)
  *
  * Many modern systems use the IANA database, so if we can determine the
  * system's idea of which zone it is using and its behavior matches our zone
@@ -602,6 +605,28 @@ check_system_link_file(const char *linkname, struct tztry *tt,
 #endif
 }
 
+/*
+ * Given a timezone name, determine whether it should be preferred over other
+ * names which are equally good matches. The output is arbitrary but we will
+ * use 0 for "neutral" default preference.
+ *
+ * Ideally we'd prefer the zone.tab/zone1970.tab names, since in general those
+ * are the ones offered to the user to select from. But for the moment, to
+ * minimize changes in behaviour, simply prefer UTC over alternative spellings
+ * such as UCT that otherwise cause confusion. The existing "shortest first"
+ * rule would prefer "UTC" over "Etc/UTC" so keep that the same way (while
+ * still preferring Etc/UTC over Etc/UCT).
+ */
+static int
+zone_name_pref(const char *zonename)
+{
+	if (strcmp(zonename, "UTC") == 0)
+		return 50;
+	if (strcmp(zonename, "Etc/UTC") == 0)
+		return 40;
+	return 0;
+}
+
 /*
  * Recursively scan the timezone database looking for the best match to
  * the system timezone behavior.
@@ -674,7 +699,8 @@ scan_available_timezones(char *tzdir, char *tzdirsub, struct tztry *tt,
 			else if (score == *bestscore)
 			{
 /* Consider how to break a tie */
-if (strlen(tzdirsub) < strlen(bestzonename) ||
+if (zone_name_pref(tzdirsub) > zone_name_pref(bestzonename) ||
+	strlen(tzdirsub) < strlen(bestzonename) ||
 	(strlen(tzdirsub) == strlen(bestzonename) &&
 	 strcmp(tzdirsub, bestzonename) < 0))
 	strlcpy(bestzonename, tzdirsub, TZ_STRLEN_MAX + 1);

Re: pg_dump multi VALUES INSERT

On 2019-Jun-14, Fabien COELHO wrote:

> 
> Hello Alvaro,
> 
> > > Shouldn't the --rows-per-insert option also be available via pg_dumpall?
> > >  All the other options for switching between COPY and INSERT are
> > > settable in pg_dumpall.
> > 
> > Uh, yeah, absolutely.
> > 
> > Surafel, are you in a position to provide a patch for that quickly?
> 
> End of the week, more time, easy enough and I should have seen the issue
> while reviewing. Patch attached.

Hello Fabien

Thanks for producing a fix so quickly.  David Rowley mentioned a couple
of issues OOB (stanzas were put in the wrong places, when considering
alphabetical ordering and such).  It's pushed now.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: fix psql \conninfo & \connect when using hostaddr

On 2019-Jun-13, Fabien COELHO wrote:

> Thanks for the pointer! I did not notice this one. At least the API looks
> better than the one I was suggesting:-)
> 
> ISTM that this function could be used to set other parameters, fixing some
> other issues such as ignoring special parameters on reconnections.
> 
> Anyway, attached an attempt at implementing the desired behavior wrt
> hostaddr.

BTW I tested this manually and it seemed to work as intended, ie., if I
change /etc/hosts for the hostname I am using between one connection and
the next, we do keep the hostaddr if it was specified, or we resolve the
name again if it's not.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Custom table AMs need to include heapam.h because of BulkInsertState

On Fri, 14 Jun 2019 at 07:53, Andres Freund  wrote:
>
> On June 12, 2019 6:42:11 PM PDT, David Rowley  
> wrote:
> >Do you see any issue with calling table_finish_bulk_insert() when the
> >partition's CopyMultiInsertBuffer is evicted from the
> >CopyMultiInsertInfo rather than at the end of the copy? It can mean
> >that we call the function multiple times per partition. I assume the
> >function is only really intended to flush bulk inserted tuple to the
> >storage, so calling it more than once would just mean an inefficiency
> >rather than a bug.
> >
> >Let me know your thoughts.
>
> I'm out on vacation until Monday (very needed, pretty damn exhausted). So I 
> can't really give you a in depth answer right now.
>
> Off the cuff, I'd say it's worthwhile to try somewhat hard to avoid 
> superfluous finish calls. They can be quite expensive (fsync), and we ought 
> to have nearly all the state for doing it only as much as necessary. Possibly 
> we need one bool per partition to track whether any rows where inserted, but 
> thats peanuts in comparison to all the other state.

No worries. I'll just park this patch here until you're ready to give it a look.

With the attached version I'm just calling table_finish_bulk_insert()
once per partition at the end of CopyFrom().  We've got an array with
all the ResultRelInfos we touched in the proute, so it's mostly a
matter of looping over that array and calling the function on each
ResultRelInfo's ri_RelationDesc.  However, to make it more complex,
PartitionTupleRouting is private to execPartition.c so we can't do
this directly... After staring at my screen for a while, I decided to
write a function that calls a callback function on each ResultRelInfo
in the PartitionTupleRouting.

The three alternative ways I thought of were:

1) Put PartitionTupleRouting back into execPartition.h and write the
loop over each ResultRelInfo in copy.c.
2) Write a specific function in execPartition.c that calls
table_finish_bulk_insert()
3) Modify ExecCleanupTupleRouting to pass in the ti_options and a bool
to say if it should call table_finish_bulk_insert() or not.

I didn't really like either of those. For #1, I'd rather keep it
private. For #2, it just seems a bit too specific a function to go
into execPartition.c. For #3 I really don't want to slow down
ExecCleanupTupleRouting() any. I designed those to be as fast as
possible since they're called for single-row INSERTs into partitioned
tables. Quite a bit of work went into PG12 to make those fast.

Of course, someone might see one of the alternatives as better than
what the patch does, so comments welcome.

The other thing I noticed is that we call
table_finish_bulk_insert(cstate->rel, ti_options); in copy.c
regardless of if we've done any bulk inserts or not. Perhaps that
should be under an if (insertMethod != CIM_SINGLE)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

fix_copy_table_finish_build_insert_v2.patch
Description: Binary data

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-06-14 Thread Joe Conway

On 6/14/19 6:09 PM, Bruce Momjian wrote:
> On Fri, Jun 14, 2019 at 02:27:17PM -0400, Joe Conway wrote:
>> On 6/13/19 11:07 AM, Bruce Momjian wrote:
>> > In addition, while the 8k blocks would use a block cipher, the WAL would
>> > likely use a stream cipher, and it will be very hard to use multiple
>> > stream ciphers in a single WAL file.
>> 
>> I don't understand why we would not just use AES for both.
> 
> Uh, AES is an encryption cipher.  You can use it with block mode, CBC,
> or stream mode, CTR, GCM;  see:


AES is a block cipher, not a stream cipher. Yes you can use it in
different modes, including chained modes (and CBC is what I would pick),
but I assumed you were talking about an actual stream cipher algorithm.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature

Extracting only the columns needed for a query

2019-06-14 Thread Melanie Plageman

While hacking on zedstore, we needed to get a list of the columns to
be projected--basically all of the columns needed to satisfy the
query. The two use cases we have for this is
1) to pass this column list down to the AM layer for the AM to leverage it
2) for use during planning to improving costing
In other threads, such as [1], there has been discussion about the
possible benefits for all table types of having access to this set of
columns.

Focusing on how to get this used cols list (as opposed to how to pass
it down to the AM), we have tried a few approaches to constructing it
and wanted to get some ideas on how best to do it.

We are trying to determine which phase to get the columns -- after
parsing, after planning, or during execution right before calling the
AM.

Approach A: right before calling AM

Leverage expression_tree_walker() right before calling beginscan()
and collecting the columns into a needed columns context. This
approach is what is currently in the zedstore patch mentioned in
this thread [2].

The benefit of this approach is that it walks the tree right
before the used column set will be used--which makes it easy to
skip this walk for queries or AMs that don't benefit from this
used columns list.

Approach B: after parsing and/or after planning

Add a new member 'used_cols' to PlannedStmt which contains the
attributes for each relation present in the query. Construct
'used_cols' at the end of planning using the PathTargets in the
RelOptInfos in the PlannerInfo->simple_rel_array and the
RangeTblEntries in PlannerInfo->simple_rte_array.

The nice thing about this is that it does not require a full walk
of the plan tree. Approach A could be more expensive if the tree
is quite large. I'm not sure, however, if just getting the
PathTargets from the RelOptInfos is sufficient for obtaining the
whole set of columns used in the query.

Approach B, however, does not work for utility statements which do
not go through planning.

One potential solution to this that we tried was getting the
columns from the query tree after parse analyze and then in
exec_simple_query() adding the column list to the PlannedStmt.

This turned out to be as messy or more than Approach A because
each kind of utility statement has its own data structure that is
composed of elements taken from the Query tree but does not
directly include the original PlannedStmt created for the query
(the PlannedStmt doesn't contain anything except the query tree
for utility statements since they do not go through planning). So,
for each type of utility statement, we would have to separately
copy over the column list from the PlannedStmt in its own way.

It is worth noting that Approach A also requires special handling
for each type of utility statement.

We are wondering about specific benefits of Approach B--that is, is
there some use case (maybe plan re-use) for having the column set
accessible in the PlannedStmt itself?

One potential benefit of Approach B could be for scans of partition
tables. Collecting the used column list could be done once for the
query instead of once for each partition.

Both approaches, however, do not address our second use case, as we
would not have the column list during planning for non-utility
statements. To satisfy this, we would likely have to extract the
columns from the query tree after parse analyze for non-utility
statements as well.

An approach which extracted this list before planning and saved it
somewhere would help avoid having to do the same walk during planning
and then again during execution. Though, using the list constructed
after parsing may not be ideal when some columns were able to be
eliminated during planning.

Melanie & Ashwin

[1]
https://www.postgresql.org/message-id/20190409010440.bqdikgpslh42pqit%40alap3.anarazel.de
[2]
https://www.postgresql.org/message-id/CALfoeiuuLe12PuQ%3DzvM_L7B5qegBqGHYENHUGbLOsjAnG%3Dmi4A%40mail.gmail.com

Re: Speed up transaction completion faster after many relations are accessed in a transaction

On Mon, 8 Apr 2019 at 04:09, Tom Lane  wrote:
> Also, I would not define "significantly bloated" as "the table has
> grown at all".  I think the threshold ought to be at least ~100
> buckets, if we're starting at 16.

I've revised the patch to add a new constant named
LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the
hash table grows over that size we shrink it back down to
LOCKMETHODLOCALHASH_INIT_SIZE, which I've kept at 16.

I'm not opposed to setting it to 128. For this particular benchmark,
it won't make any difference as it's only going to affect something
that does not quite use 128 locks and has to work with a slightly
bloated local lock table. I think hitting 64 locks in a transaction is
a good indication that it's not a simple transaction so users are
probably unlikely to notice the small slowdown from the hash table
reinitialisation.

Since quite a bit has changed around partition planning lately, I've
taken a fresh set of benchmarks on today's master. I'm using something
very close to Amit's benchmark from upthread. I just changed the query
so we hit the same partition each time instead of a random one.

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

master @ a193cbec119 + shrink_bloated_locallocktable_v4.patch:

plan_cache_mode = 'auto';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 14101.226982 (excluding connections establishing)
tps = 14034.250962 (excluding connections establishing)
tps = 14107.937755 (excluding connections establishing)

plan_cache_mode = 'force_custom_plan';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 14240.366770 (excluding connections establishing)
tps = 14272.244886 (excluding connections establishing)
tps = 14130.684315 (excluding connections establishing)

master @ a193cbec119:

plan_cache_mode = 'auto';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 10467.027666 (excluding connections establishing)
tps = 10333.700917 (excluding connections establishing)
tps = 10633.084426 (excluding connections establishing)

plan_cache_mode = 'force_custom_plan';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 13938.083272 (excluding connections establishing)
tps = 14143.241802 (excluding connections establishing)
tps = 14097.406758 (excluding connections establishing)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

shrink_bloated_locallocktable_v4.patch
Description: Binary data

Re: pgsql: Avoid spurious deadlocks when upgrading a tuple lock

On 2019-Jun-14, Tom Lane wrote:

> I wrote:
> >> Hm, I don't get that warning.  Does this patch silence it, please?
> 
> > Uh, no patch attached?  But initializing the variable where it's
> > declared would certainly silence it.
> 
> BTW, after looking around a bit I wonder if this complaint isn't
> exposing an actual logic bug.  Shouldn't skip_tuple_lock have
> a lifetime similar to first_time?

I think there are worse problems here.  I tried the attached isolation
spec.  Note that the only difference in the two permutations is that s0
finishes earlier in one than the other; yet the first one works fine and
the second one hangs until killed by the 180s timeout.  (s3 isn't
released for a reason I'm not sure I understand.)

I don't think I'm going to have time to investigate this deeply over the
weekend, so I think the safest course of action is to revert this for
next week's set.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
setup
{
drop table if exists tlu_job;
create table tlu_job (id integer primary key, name text);

insert into tlu_job values(1, 'a');
}


teardown
{
drop table tlu_job;
}

session "s0"
setup { begin; }
step "s0_forupdate" { select id from tlu_job where id = 1 for update; }
step "s0_commit" { commit; }

session "s1"
setup { begin; }
step "s1_keyshare" { select id from tlu_job where id = 1 for key share;}
step "s1_share" { select id from tlu_job where id = 1 for share; }
step "s1_update" { update tlu_job set name = 'b' where id = 1;  }
step "s1_delete" { delete from tlu_job where id = 1; }
step "s1_rollback" { rollback; }
step "s1_commit" { commit; }

session "s2"
setup { begin; }
step "s2_for_update" { select id from tlu_job where id = 1 for update; }
step "s2_update" { update tlu_job set name = 'b' where id = 1; }
step "s2_delete" { delete from tlu_job where id = 1; }
step "s2_rollback" { rollback; }
step "s2_commit" { commit; }

session "s3"
setup { begin; }
step "s3_keyshare" { select id from tlu_job where id = 1 for key share; }
step "s3_share" { select id from tlu_job where id = 1 for share; }
step "s3_for_update" { select id from tlu_job where id = 1 for update; }
step "s3_update" { update tlu_job set name = 'c' where id = 1; }
step "s3_delete" { delete from tlu_job where id = 1; }
step "s3_rollback" { rollback; }
step "s3_commit" { commit; }

permutation "s0_forupdate" "s1_share" "s2_for_update" "s0_commit" "s3_share" 
"s3_for_update" "s1_rollback" "s3_rollback" "s2_rollback"
permutation "s0_forupdate" "s1_share" "s2_for_update" "s3_share" "s0_commit" 
"s3_for_update" "s1_rollback" "s3_rollback" "s2_rollback"

Re: Extracting only the columns needed for a query