Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 2:44 PM Amit Langote wrote: > Hi, > > Maybe I am missing something obvious, but is it intentional that > enable_indexscan is checked by cost_index(), that is, *after* creating > an index path? I was expecting that if enable_indexscan is off, then > no index paths would be generated to begin with, because I thought > they are optional. > I think the cost estimate of index paths is the same as for other paths, in that setting enable_xxx to off only adds a penalty factor (disable_cost) to the path's cost. The path would still be generated and compete with other paths in add_path(). Thanks, Richard
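[Editor's note: a minimal, self-contained C model of the mechanism Richard describes, not PostgreSQL source. The path names and cost numbers are invented, and DISABLE_COST only mirrors the idea of disable_cost in costsize.c. The point it illustrates: a disabled path is still built and still costed, it merely carries a huge penalty, so it loses in add_path() whenever a non-penalized alternative exists.]

    #include <stdio.h>
    #include <stdbool.h>

    #define DISABLE_COST 1.0e10     /* same idea as disable_cost in costsize.c */

    typedef struct Path
    {
        const char *name;
        double      cost;
    } Path;

    /*
     * Mimics cost_index()/cost_seqscan(): the path is always built; if its
     * scan type is disabled it merely gets a large penalty added to its cost.
     */
    static Path
    make_path(const char *name, double base_cost, bool enabled)
    {
        Path        p = {name, base_cost};

        if (!enabled)
            p.cost += DISABLE_COST;
        return p;
    }

    /* Mimics add_path(): the cheaper path wins. */
    static Path
    cheaper(Path a, Path b)
    {
        return (a.cost <= b.cost) ? a : b;
    }

    int
    main(void)
    {
        /* Invented base costs; enable_seqscan = on, enable_indexscan = off. */
        Path        seq = make_path("seqscan", 450.0, true);
        Path        idx = make_path("indexscan", 8.5, false);
        Path        best = cheaper(seq, idx);

        /* The penalized index path loses to the unpenalized seqscan. */
        printf("chosen: %s (cost %.1f)\n", best.name, best.cost);
        return 0;
    }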
Re: pgsql: Improve handling of parameter differences in physical replication
On 2020-03-30 20:10, Andres Freund wrote: Also, shouldn't dynahash be adjusted as well? There's e.g. the following HASH_ENTER path:

    /* report a generic message */
    if (hashp->isshared)
        ereport(ERROR,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("out of shared memory")));

Could you explain further what you mean by this? I don't understand how this is related. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PG compilation error with Visual Studio 2015/2017/2019
On Fri, Apr 10, 2020 at 5:33 PM Amit Kapila wrote: > > It seems the right direction to use GetLocaleInfoEx as we have already > used it to handle a similar problem (lc_codepage is missing in > _locale_t in higher versions of MSVC (cf commit 0fb54de9aa)) in > chklocale.c. However, I see that we have added a manual parsing there > if GetLocaleInfoEx doesn't parse it. I think that addresses your > concern for _create_locale handling bigger sets. Don't we need > something equivalent here for the cases which GetLocaleInfoEx doesn't > support? I am investigating along similar lines; I think the following explanation can help. From the Microsoft documentation, the locale argument to setlocale, _wsetlocale, _create_locale, and _wcreate_locale is

    locale :: "locale-name"
            | "language[_country-region[.code-page]]"
            | ".code-page"
            | "C"
            | ""
            | NULL

whereas for GetLocaleInfoEx the locale argument is a locale name of the form "<language>-<country/region>".
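[Editor's note: a minimal Windows-only C sketch of the difference being discussed; the locale name "en-US" is just an example value. GetLocaleInfoEx() accepts the "<language>-<region>" style of name, not the setlocale()/_create_locale() "language_country.codepage" form, and can return the ANSI code page, which is the kind of information at issue in this thread.]

    #include <windows.h>
    #include <stdio.h>
    #include <wchar.h>

    int
    main(void)
    {
        wchar_t     codepage[16];

        /*
         * GetLocaleInfoEx() wants a locale *name* such as "en-US", not the
         * "English_United States.1252" form accepted by setlocale() and
         * _create_locale().
         */
        if (GetLocaleInfoEx(L"en-US", LOCALE_IDEFAULTANSICODEPAGE,
                            codepage, sizeof(codepage) / sizeof(codepage[0])) > 0)
            wprintf(L"ANSI code page for en-US: %ls\n", codepage);
        else
            fprintf(stderr, "GetLocaleInfoEx() failed: %lu\n",
                    (unsigned long) GetLastError());
        return 0;
    }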
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 4:13 PM Richard Guo wrote: > On Tue, Apr 14, 2020 at 2:44 PM Amit Langote wrote: >> Maybe I am missing something obvious, but is it intentional that >> enable_indexscan is checked by cost_index(), that is, *after* creating >> an index path? I was expecting that if enable_indexscan is off, then >> no index paths would be generated to begin with, because I thought >> they are optional. > > > I think the cost estimate of index paths is the same as other paths on > that setting enable_xxx to off only adds a penalty factor (disable_cost) > to the path's cost. The path would be still generated and compete with > other paths in add_path(). Yeah, but I am asking why build the path to begin with, as there will always be seq scan path for base rels. Turning enable_hashjoin off, for example, means that no hash join paths will be built at all. Looking into the archives, I see that the idea of "not generating disabled paths to begin with" was discussed quite recently: https://www.postgresql.org/message-id/29821.1572706653%40sss.pgh.pa.us -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Display of buffers for planning time show nothing for second run
Hi,

I am testing some features from Postgres 13, and I am not sure I understand the behavior of EXPLAIN (ANALYZE, BUFFERS) correctly.

When I run the following statement for the first time in a session I get

postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = 'CZ0201';
                                                         QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Index Scan using obce_okres_id_idx on obce  (cost=0.28..14.49 rows=114 width=41) (actual time=0.072..0.168 rows=114 loops=1)
   Index Cond: ((okres_id)::text = 'CZ0201'::text)
   Buffers: shared hit=4
 Planning Time: 0.539 ms
   Buffers: shared hit=13
 Execution Time: 0.287 ms
(6 rows)

and I see shared hit=13 under planning time.

For the second run I get

postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = 'CZ0201';
                                                         QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Index Scan using obce_okres_id_idx on obce  (cost=0.28..14.49 rows=114 width=41) (actual time=0.044..0.101 rows=114 loops=1)
   Index Cond: ((okres_id)::text = 'CZ0201'::text)
   Buffers: shared hit=4
 Planning Time: 0.159 ms
 Execution Time: 0.155 ms
(5 rows)

Now there are no buffer hits under planning time at all. Does this mean that all these data are cached somewhere in session memory?

Regards
Pavel
Re: Display of buffers for planning time show nothing for second run
Hi, On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule wrote: > > Hi > > I am testing some features from Postgres 13, and I am not sure if I > understand well to behave of EXPLAIN(ANALYZE, BUFFERS) > > When I run following statement first time in session I get > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = > 'CZ0201'; > ┌──┐ > │ QUERY PLAN > │ > ╞══╡ > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > width=41) (actual time=0.072..0.168 rows=114 loops=1) │ > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > │ Buffers: shared hit=4 > │ > │ Planning Time: 0.539 ms > │ > │ Buffers: shared hit=13 > │ > │ Execution Time: 0.287 ms > │ > └──┘ > (6 rows) > > And I see share hit 13 in planning time. > > For second run I get > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = > 'CZ0201'; > ┌──┐ > │ QUERY PLAN > │ > ╞══╡ > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > │ Buffers: shared hit=4 > │ > │ Planning Time: 0.159 ms > │ > │ Execution Time: 0.155 ms > │ > └──┘ > (5 rows) > > Now, there is not any touch in planning time. Does it mean so this all these > data are cached somewhere in session memory? The planning time is definitely shorter the 2nd time. And yes, what you see are all the catcache accesses that are initially performed on a fresh new backend.
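[Editor's note: a toy, self-contained C model of what Julien describes, not PostgreSQL code; names and sizes are invented. The backend-local catalog cache is populated from shared buffers on the first lookup, so only the first planning run in a session reports those buffer hits.]

    #include <stdio.h>
    #include <string.h>

    #define CACHE_SIZE 16

    static const char *cache_keys[CACHE_SIZE];
    static int  cache_vals[CACHE_SIZE];
    static int  ncached = 0;
    static int  shared_buffer_hits = 0;   /* what EXPLAIN (BUFFERS) would report */

    static int
    fetch_from_shared_buffers(const char *key)
    {
        shared_buffer_hits++;             /* simulated shared-memory access */
        return (int) strlen(key);         /* pretend this is the catalog row */
    }

    /* Toy "catcache": first lookup misses and reads shared buffers,
     * later lookups for the same key are served from local memory. */
    static int
    catcache_lookup(const char *key)
    {
        int         val;

        for (int i = 0; i < ncached; i++)
            if (strcmp(cache_keys[i], key) == 0)
                return cache_vals[i];     /* local hit: no buffer access */

        val = fetch_from_shared_buffers(key);
        if (ncached < CACHE_SIZE)         /* remember it for later lookups */
        {
            cache_keys[ncached] = key;
            cache_vals[ncached] = val;
            ncached++;
        }
        return val;
    }

    int
    main(void)
    {
        catcache_lookup("pg_class:obce"); /* first planning: buffer hit */
        printf("after 1st lookup: %d shared hits\n", shared_buffer_hits);
        catcache_lookup("pg_class:obce"); /* second planning: served locally */
        printf("after 2nd lookup: %d shared hits\n", shared_buffer_hits);
        return 0;
    }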
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 3:40 PM Amit Langote wrote: > On Tue, Apr 14, 2020 at 4:13 PM Richard Guo > wrote: > > On Tue, Apr 14, 2020 at 2:44 PM Amit Langote > wrote: > >> Maybe I am missing something obvious, but is it intentional that > >> enable_indexscan is checked by cost_index(), that is, *after* creating > >> an index path? I was expecting that if enable_indexscan is off, then > >> no index paths would be generated to begin with, because I thought > >> they are optional. > > > > > > I think the cost estimate of index paths is the same as other paths on > > that setting enable_xxx to off only adds a penalty factor (disable_cost) > > to the path's cost. The path would be still generated and compete with > > other paths in add_path(). > > Yeah, but I am asking why build the path to begin with, as there will > always be seq scan path for base rels. I guess that is because the user may disable seqscan as well. If so, we still need a formula to decide which one to use, which requires the index path to be costed. But since disabling the two at the same time is rare, we could skip building the index path whenever the user allows seqscan. > Turning enable_hashjoin off, > for example, means that no hash join paths will be built at all. > > As for joins, the difference is that even if the user allows a join method in the settings, the planner may still not be able to use it, so the disabled path still needs to be generated. Consider the query "select * from t1, t2 where f(t1.a, t2.a) = 3" with enable_nestloop = off and enable_hashjoin = on. But I think it is still possible to skip generating the path after some extra checking. > Looking into the archives, I see that the idea of "not generating > disabled paths to begin with" was discussed quite recently: > https://www.postgresql.org/message-id/29821.1572706653%40sss.pgh.pa.us > > -- > > Amit Langote > EnterpriseDB: http://www.enterprisedb.com > > >
Re: Display of buffers for planning time show nothing for second run
út 14. 4. 2020 v 10:27 odesílatel Julien Rouhaud napsal: > Hi, > > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule > wrote: > > > > Hi > > > > I am testing some features from Postgres 13, and I am not sure if I > understand well to behave of EXPLAIN(ANALYZE, BUFFERS) > > > > When I run following statement first time in session I get > > > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id > = 'CZ0201'; > > > ┌──┐ > > │ QUERY PLAN > │ > > > ╞══╡ > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > width=41) (actual time=0.072..0.168 rows=114 loops=1) │ > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > > │ Buffers: shared hit=4 > │ > > │ Planning Time: 0.539 ms > │ > > │ Buffers: shared hit=13 >│ > > │ Execution Time: 0.287 ms >│ > > > └──┘ > > (6 rows) > > > > And I see share hit 13 in planning time. > > > > For second run I get > > > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id > = 'CZ0201'; > > > ┌──┐ > > │ QUERY PLAN > │ > > > ╞══╡ > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > > │ Buffers: shared hit=4 > │ > > │ Planning Time: 0.159 ms > │ > > │ Execution Time: 0.155 ms >│ > > > └──┘ > > (5 rows) > > > > Now, there is not any touch in planning time. Does it mean so this all > these data are cached somewhere in session memory? > > The planning time is definitely shorter the 2nd time. And yes, what > you see are all the catcache accesses that are initially performed on > a fresh new backend. > One time Tom Lane mentioned using index in planning time for getting minimum and maximum. I expected so these values are not cached. But I cannot to reproduce it, and then I am little bit surprised so I don't see any hit in second, and other executions. Regards Pavel
Re: Display of buffers for planning time show nothing for second run
On Tue, Apr 14, 2020 at 5:27 PM Julien Rouhaud wrote: > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule > wrote: > > For second run I get > > > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = > > 'CZ0201'; > > ┌──┐ > > │ QUERY PLAN > >│ > > ╞══╡ > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > >│ > > │ Buffers: shared hit=4 > >│ > > │ Planning Time: 0.159 ms > >│ > > │ Execution Time: 0.155 ms > >│ > > └──┘ > > (5 rows) > > > > Now, there is not any touch in planning time. Does it mean so this all > > these data are cached somewhere in session memory? > > The planning time is definitely shorter the 2nd time. And yes, what > you see are all the catcache accesses that are initially performed on > a fresh new backend. By the way, even with all catcaches served from local memory, one may still see shared buffers being hit during planning. For example: explain (buffers, analyze) select * from foo where a = 1; QUERY PLAN --- Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 width=4) (actual time=0.010..0.011 rows=0 loops=1) Index Cond: (a = 1) Heap Fetches: 0 Buffers: shared hit=2 Planning Time: 0.775 ms Buffers: shared hit=72 Execution Time: 0.086 ms (7 rows) Time: 2.477 ms postgres=# explain (buffers, analyze) select * from foo where a = 1; QUERY PLAN --- Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 width=4) (actual time=0.012..0.012 rows=0 loops=1) Index Cond: (a = 1) Heap Fetches: 0 Buffers: shared hit=2 Planning Time: 0.102 ms Buffers: shared hit=1 Execution Time: 0.047 ms (7 rows) It seems that 1 Buffer hit comes from get_relation_info() doing _bt_getrootheight() for that index on foo. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: Display of buffers for planning time show nothing for second run
On Tue, Apr 14, 2020 at 10:36 AM Pavel Stehule wrote: > > út 14. 4. 2020 v 10:27 odesílatel Julien Rouhaud napsal: >> >> Hi, >> >> On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule >> wrote: >> > >> > Hi >> > >> > I am testing some features from Postgres 13, and I am not sure if I >> > understand well to behave of EXPLAIN(ANALYZE, BUFFERS) >> > >> > When I run following statement first time in session I get >> > >> > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = >> > 'CZ0201'; >> > ┌──┐ >> > │ QUERY PLAN >> > │ >> > ╞══╡ >> > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 >> > width=41) (actual time=0.072..0.168 rows=114 loops=1) │ >> > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) >> > │ >> > │ Buffers: shared hit=4 >> > │ >> > │ Planning Time: 0.539 ms >> > │ >> > │ Buffers: shared hit=13 >> > │ >> > │ Execution Time: 0.287 ms >> > │ >> > └──┘ >> > (6 rows) >> > >> > And I see share hit 13 in planning time. >> > >> > For second run I get >> > >> > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = >> > 'CZ0201'; >> > ┌──┐ >> > │ QUERY PLAN >> > │ >> > ╞══╡ >> > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 >> > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ >> > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) >> > │ >> > │ Buffers: shared hit=4 >> > │ >> > │ Planning Time: 0.159 ms >> > │ >> > │ Execution Time: 0.155 ms >> > │ >> > └──┘ >> > (5 rows) >> > >> > Now, there is not any touch in planning time. Does it mean so this all >> > these data are cached somewhere in session memory? >> >> The planning time is definitely shorter the 2nd time. And yes, what >> you see are all the catcache accesses that are initially performed on >> a fresh new backend. > > > One time Tom Lane mentioned using index in planning time for getting minimum > and maximum. I expected so these values are not cached. But I cannot to > reproduce it, and then I am little bit surprised so I don't see any hit in > second, and other executions. Isn't that get_actual_variable_range() purpose? If you use a plan that hit this function you'll definitely see consistent buffer usage during planning: rjuju=# explain (buffers, analyze) select * from pg_class c join pg_attribute a on a.attrelid = c.oid; QUERY PLAN --- Hash Join (cost=21.68..110.91 rows=2863 width=504) (actual time=0.393..5.989 rows=2863 loops=1) Hash Cond: (a.attrelid = c.oid) Buffers: shared hit=40 read=29 -> Seq Scan on pg_attribute a (cost=0.00..81.63 rows=2863 width=239) (actual time=0.010..0.773 rows=2863 loops=1) Buffers: shared hit=28 read=25 -> Hash (cost=16.86..16.86 rows=386 width=265) (actual time=0.333..0.334 rows=386 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 85kB Buffers: shared hit=9 read=4 -> Seq Scan on pg_class c (cost=0.00..16.86 rows=386 width=265) (actual time=0.004..0.123 rows=386 loop
Re: Display of buffers for planning time show nothing for second run
On Tue, Apr 14, 2020 at 10:40 AM Amit Langote wrote: > > On Tue, Apr 14, 2020 at 5:27 PM Julien Rouhaud wrote: > > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule > > wrote: > > > For second run I get > > > > > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id = > > > 'CZ0201'; > > > ┌──┐ > > > │ QUERY PLAN > > > │ > > > ╞══╡ > > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 > > > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > > > │ > > > │ Buffers: shared hit=4 > > > │ > > > │ Planning Time: 0.159 ms > > > │ > > > │ Execution Time: 0.155 ms > > > │ > > > └──┘ > > > (5 rows) > > > > > > Now, there is not any touch in planning time. Does it mean so this all > > > these data are cached somewhere in session memory? > > > > The planning time is definitely shorter the 2nd time. And yes, what > > you see are all the catcache accesses that are initially performed on > > a fresh new backend. > > By the way, even with all catcaches served from local memory, one may > still see shared buffers being hit during planning. For example: > > explain (buffers, analyze) select * from foo where a = 1; > QUERY PLAN > --- > Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > width=4) (actual time=0.010..0.011 rows=0 loops=1) >Index Cond: (a = 1) >Heap Fetches: 0 >Buffers: shared hit=2 > Planning Time: 0.775 ms >Buffers: shared hit=72 > Execution Time: 0.086 ms > (7 rows) > > Time: 2.477 ms > postgres=# explain (buffers, analyze) select * from foo where a = 1; > QUERY PLAN > --- > Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > width=4) (actual time=0.012..0.012 rows=0 loops=1) >Index Cond: (a = 1) >Heap Fetches: 0 >Buffers: shared hit=2 > Planning Time: 0.102 ms >Buffers: shared hit=1 > Execution Time: 0.047 ms > (7 rows) > > It seems that 1 Buffer hit comes from get_relation_info() doing > _bt_getrootheight() for that index on foo. Indeed. Having received some relcache invalidation should also lead to similar effect. Having those counters can help to quantify all of those interactions.
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 5:29 PM Andy Fan wrote: > On Tue, Apr 14, 2020 at 3:40 PM Amit Langote wrote: >> On Tue, Apr 14, 2020 at 4:13 PM Richard Guo wrote: >> > On Tue, Apr 14, 2020 at 2:44 PM Amit Langote >> > wrote: >> >> Maybe I am missing something obvious, but is it intentional that >> >> enable_indexscan is checked by cost_index(), that is, *after* creating >> >> an index path? I was expecting that if enable_indexscan is off, then >> >> no index paths would be generated to begin with, because I thought >> >> they are optional. >> > >> > I think the cost estimate of index paths is the same as other paths on >> > that setting enable_xxx to off only adds a penalty factor (disable_cost) >> > to the path's cost. The path would be still generated and compete with >> > other paths in add_path(). >> >> Yeah, but I am asking why build the path to begin with, as there will >> always be seq scan path for base rels. > > I guess that is because user may disable seqscan as well. If so, we > still need formula to decide with one to use, which requires index path > has to be calculated. but since disabling the two at the same time is rare, > we can ignore the index path build if user allow seqscan I am saying that instead of building the index path with disabled cost, just don't build it at all. A base rel will always have a sequential path, even if with disabled cost when enable_seqscan = off. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: Lexer issues
Hello, On Mon, Apr 13, 2020 at 4:04 PM Patrick REED wrote: > > I am experimenting with postgres and am wondering if there is any tutorial on > how to properly add a new command to postgres. > > I want to add a new constraint on "CREATE ROLE" that requires an integer, it > has an identifier that is not a known (reserved or unreserved keyword) in > postgres, say we call it TestPatrick. In other words, I want to do this > "CREATE ROLE X TestPatrick=10". I am having an issue with having postgres > recognize my new syntax. > > I have seen this video: https://www.youtube.com/watch?v=uSEXTcEiXGQ and was > able to add have my postgres compile with my added word (modified gram.y, > kwlist.h, gram.cpp etc based on the video). However, when I use my syntax on > a client session, it still doesn't recognize my syntax... Are there any > specific lexer changes I need to make? I followed the example of CONNECTION > LIMIT and tried to mimic it for Create ROLE. I'd think that if you can get a successful compilation with a modified gram.y (and any kwlist change needed) the new syntax should be accepted (at least up to the parser, whether the utility command is properly handled is another thing), since there's a single version of the CreateRoleStmt. Is there any chance that you're somehow connecting to something else than the freshly make-install-ed binary, or that the error is coming from later stage than parsing?
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Fri, 3 Apr 2020 at 15:18, Andy Fan wrote: > All above 4 item Done. Just to explain my view on this going forward for PG14. I do plan to do a more thorough review of this soon. I wasn't so keen on pursuing this for PG13 as the skip scans patch [1] needs to use the same infrastructure this patch has added and it does not, yet. The infrastructure (knowing the unique properties of a RelOptInfo), as provided by the patch Andy has been working on, which is based on my rough prototype version, I believe should be used for the skip scans patch as well. I understand that as skip scans currently stands, Jasper has done quite a bit of work to add the UniqueKeys, however, this was unfortunately based on some early description of UniqueKeys where I had thought that we could just store EquivalenceClasses. I no longer think that's the case, and I believe the implementation that we require is something more along the lines of Andy's latest version of the patch. However, I've not quite stared at it long enough to be highly confident in that. I'd like to strike up a bit of a plan to move both Andy's work and the Skip scans work forward for PG14. Here are my thoughts so far: 1. Revise v4 of remove DISTINCT patch to split the patch into two pieces. 0001 should add the UniqueKey code but not any additional planner smarts to use them (i.e remove GROUP BY / DISTINCT) elimination parts. Join removals and Unique joins should use UniqueKeys in this patch. 0002 should add back the GROUP BY / DISTINCT smarts and add whatever tests should be added for that and include updating existing expected results and modifying any tests which no longer properly test what they're meant to be testing. I've done this with the attached patch. 2. David / Jesper to look at 0001 and build or align the existing skip scans 0001 patch to make use of Andy's 0001 patch. This will require tagging UniqueKeys onto Paths, not just RelOptInfos, plus a bunch of other work. Clearly UniqueKeys must suit both needs and since we have two different implementations each providing some subset of the features, then clearly we're not yet ready to move both skip scans and this patch forward together. We need to align that and move both patches forward together. Hopefully, the attached 0001 patch helps move that along. While I'm here, a quick review of Andy's v4 patch. I didn't address any of this in the attached v5. These are only based on what I saw when shuffling some code around. It's not an in-depth review. 1. Out of date comment in join.sql -- join removal is not possible when the GROUP BY contains a column that is -- not in the join condition. (Note: as of 9.6, we notice that b.id is a -- primary key and so drop b.c_id from the GROUP BY of the resulting plan; -- but this happens too late for join removal in the outer plan level.) explain (costs off) select d.* from d left join (select d, c_id from b group by b.d, b.c_id) s on d.a = s.d; You've changed the GROUP BY clause so it does not include b.id, so the Note in the comment is now misleading. 2. I think 0002 is overly restrictive in its demands that parse->hasAggs must be false. We should be able to just use a Group Aggregate with unsorted input when the input_rel is unique on the GROUP BY clause. This will save on hashing and sorting. Basically similar to what we do for when a query contains aggregates without any GROUP BY. 3. 
I don't quite understand why you changed this to a right join: -- Test case where t1 can be optimized but not t2 explain (costs off) select t1.*,t2.x,t2.z -from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y +from t1 right join t2 on t1.a = t2.x and t1.b = t2.y Perhaps this change is left over from some previous version of the patch? [1] https://commitfest.postgresql.org/27/1741/ v5-0001-Introduce-UniqueKeys-to-determine-RelOptInfo-uniq.patch Description: Binary data v5-0002-Skip-DISTINCT-GROUP-BY-if-input-is-already-unique.patch Description: Binary data
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 4:58 PM Amit Langote wrote: > On Tue, Apr 14, 2020 at 5:29 PM Andy Fan wrote: > > On Tue, Apr 14, 2020 at 3:40 PM Amit Langote > wrote: > >> On Tue, Apr 14, 2020 at 4:13 PM Richard Guo > wrote: > >> > On Tue, Apr 14, 2020 at 2:44 PM Amit Langote > wrote: > >> >> Maybe I am missing something obvious, but is it intentional that > >> >> enable_indexscan is checked by cost_index(), that is, *after* > creating > >> >> an index path? I was expecting that if enable_indexscan is off, then > >> >> no index paths would be generated to begin with, because I thought > >> >> they are optional. > >> > > >> > I think the cost estimate of index paths is the same as other paths on > >> > that setting enable_xxx to off only adds a penalty factor > (disable_cost) > >> > to the path's cost. The path would be still generated and compete with > >> > other paths in add_path(). > >> > >> Yeah, but I am asking why build the path to begin with, as there will > >> always be seq scan path for base rels. > > > > I guess that is because user may disable seqscan as well. If so, we > > still need formula to decide with one to use, which requires index path > > has to be calculated. but since disabling the two at the same time is > rare, > > we can ignore the index path build if user allow seqscan > > I am saying that instead of building index path with disabled cost, > just don't build it at all. A base rel will always have a sequetial > path, even though with disabled cost if enable_seqscan = off. > Let's say user set enable_seqscan=off and set enable_indexscan=off; will you expect user to get seqscan at last? If so, why is seqscan (rather than index scan) since both are disabled by user equally? > Amit Langote > EnterpriseDB: http://www.enterprisedb.com >
Re: Index Skip Scan
On Wed, 8 Apr 2020 at 07:40, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Other than that to summarize current open points for future readers > (this thread somehow became quite big): > > * Making UniqueKeys usage more generic to allow using skip scan for more > use cases (hopefully it was covered by the v33, but I still need a > confirmation from David, like blinking twice or something). I've not yet looked at the latest patch, but I did put some thoughts into an email on the other thread that's been discussing UniqueKeys [1]. I'm keen to hear thoughts on the plan I mentioned over there. Likely it would be best to discuss the specifics of what additional features we need to add to UniqueKeys for skip scans over here, but discuss any changes which affect both patches over there. We certainly can't have two separate implementations of UniqueKeys, so I believe the skip scans UniqueKeys patch should most likely be based on the one in [1] or some descendant of it. [1] https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=ohatcmd7ptukkdq0b72_08nbr...@mail.gmail.com
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 5:12 PM Andy Fan wrote: > > > On Tue, Apr 14, 2020 at 4:58 PM Amit Langote > wrote: > >> On Tue, Apr 14, 2020 at 5:29 PM Andy Fan >> wrote: >> > On Tue, Apr 14, 2020 at 3:40 PM Amit Langote >> wrote: >> >> On Tue, Apr 14, 2020 at 4:13 PM Richard Guo >> wrote: >> >> > On Tue, Apr 14, 2020 at 2:44 PM Amit Langote < >> amitlangot...@gmail.com> wrote: >> >> >> Maybe I am missing something obvious, but is it intentional that >> >> >> enable_indexscan is checked by cost_index(), that is, *after* >> creating >> >> >> an index path? I was expecting that if enable_indexscan is off, >> then >> >> >> no index paths would be generated to begin with, because I thought >> >> >> they are optional. >> >> > >> >> > I think the cost estimate of index paths is the same as other paths >> on >> >> > that setting enable_xxx to off only adds a penalty factor >> (disable_cost) >> >> > to the path's cost. The path would be still generated and compete >> with >> >> > other paths in add_path(). >> >> >> >> Yeah, but I am asking why build the path to begin with, as there will >> >> always be seq scan path for base rels. >> > >> > I guess that is because user may disable seqscan as well. If so, we >> > still need formula to decide with one to use, which requires index path >> > has to be calculated. but since disabling the two at the same time is >> rare, >> > we can ignore the index path build if user allow seqscan >> >> I am saying that instead of building index path with disabled cost, >> just don't build it at all. A base rel will always have a sequetial >> path, even though with disabled cost if enable_seqscan = off. >> > > Let's say user set enable_seqscan=off and set enable_indexscan=off; > will you expect user to get seqscan at last? If so, why is seqscan > (rather than index scan) since both are disabled by user equally? > > The following test should demonstrate what I think. demo=# create table t(a int); CREATE TABLE demo=# insert into t select generate_series(1, 1000); INSERT 0 1000 demo=# create index t_a on t(a); CREATE INDEX demo=# analyze t; ANALYZE demo=# set enable_seqscan to off; SET demo=# set enable_indexscan to off; SET demo=# set enable_bitmapscan to off; SET demo=# set enable_indexonlyscan to off; SET demo=# explain select * from t where a = 1; QUERY PLAN - Index Scan using t_a on t (cost=100.43..108.45 rows=1 width=4) Index Cond: (a = 1) (2 rows) If we just disable index path, we will get seqscan at last. Regards Andy Fan >> Amit Langote >> EnterpriseDB: http://www.enterprisedb.com >> >
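[Editor's note: a small standalone C sketch, with invented costs, of the behavioral difference Andy's demo shows. Under today's penalty approach, disabling both GUCs still yields the intrinsically cheaper index scan; skipping index-path generation entirely would instead leave only the (penalized) seqscan.]

    #include <stdio.h>
    #include <stdbool.h>

    #define DISABLE_COST 1.0e10

    /* Invented per-path base costs for a selective predicate. */
    #define SEQSCAN_COST   450.0
    #define INDEXSCAN_COST 8.5

    /* Returns the name of the winning path under the given policy. */
    static const char *
    plan(bool enable_seqscan, bool enable_indexscan, bool skip_disabled_paths)
    {
        double      seq_cost = SEQSCAN_COST + (enable_seqscan ? 0 : DISABLE_COST);
        double      idx_cost = INDEXSCAN_COST + (enable_indexscan ? 0 : DISABLE_COST);
        bool        have_idx = enable_indexscan || !skip_disabled_paths;

        /* The seqscan path always exists for a base rel. */
        if (have_idx && idx_cost < seq_cost)
            return "indexscan";
        return "seqscan";
    }

    int
    main(void)
    {
        /* Both GUCs off, current behavior: index path built but penalized. */
        printf("penalize: %s\n", plan(false, false, false)); /* indexscan */
        /* Both GUCs off, proposed behavior: index path never built. */
        printf("skip    : %s\n", plan(false, false, true));  /* seqscan */
        return 0;
    }

In other words, the design question in this subthread is whether the "both disabled" tie-break on intrinsic cost is behavior worth preserving, or whether skipping generation (and its planning overhead) is acceptable even though it changes the chosen plan in that corner case.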
Re: Display of buffers for planning time show nothing for second run
út 14. 4. 2020 v 10:40 odesílatel Amit Langote napsal: > On Tue, Apr 14, 2020 at 5:27 PM Julien Rouhaud wrote: > > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule > wrote: > > > For second run I get > > > > > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE > okres_id = 'CZ0201'; > > > > ┌──┐ > > > │ QUERY PLAN > │ > > > > ╞══╡ > > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 > rows=114 width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > > > │ Buffers: shared hit=4 > │ > > > │ Planning Time: 0.159 ms > │ > > > │ Execution Time: 0.155 ms > │ > > > > └──┘ > > > (5 rows) > > > > > > Now, there is not any touch in planning time. Does it mean so this all > these data are cached somewhere in session memory? > > > > The planning time is definitely shorter the 2nd time. And yes, what > > you see are all the catcache accesses that are initially performed on > > a fresh new backend. > > By the way, even with all catcaches served from local memory, one may > still see shared buffers being hit during planning. For example: > > explain (buffers, analyze) select * from foo where a = 1; > QUERY PLAN > > --- > Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > width=4) (actual time=0.010..0.011 rows=0 loops=1) >Index Cond: (a = 1) >Heap Fetches: 0 >Buffers: shared hit=2 > Planning Time: 0.775 ms >Buffers: shared hit=72 > Execution Time: 0.086 ms > (7 rows) > > Time: 2.477 ms > postgres=# explain (buffers, analyze) select * from foo where a = 1; > QUERY PLAN > > --- > Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > width=4) (actual time=0.012..0.012 rows=0 loops=1) >Index Cond: (a = 1) >Heap Fetches: 0 >Buffers: shared hit=2 > Planning Time: 0.102 ms >Buffers: shared hit=1 > Execution Time: 0.047 ms > (7 rows) > > It seems that 1 Buffer hit comes from get_relation_info() doing > _bt_getrootheight() for that index on foo. > unfortunatelly, I cannot to repeat it. create table foo(a int); create index on foo(a); insert into foo values(1); analyze foo; for this case any second EXPLAIN is without buffer on my comp > -- > > Amit Langote > EnterpriseDB: http://www.enterprisedb.com >
Re: Display of buffers for planning time show nothing for second run
út 14. 4. 2020 v 10:49 odesílatel Julien Rouhaud napsal: > On Tue, Apr 14, 2020 at 10:36 AM Pavel Stehule > wrote: > > > > út 14. 4. 2020 v 10:27 odesílatel Julien Rouhaud > napsal: > >> > >> Hi, > >> > >> On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule > wrote: > >> > > >> > Hi > >> > > >> > I am testing some features from Postgres 13, and I am not sure if I > understand well to behave of EXPLAIN(ANALYZE, BUFFERS) > >> > > >> > When I run following statement first time in session I get > >> > > >> > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE > okres_id = 'CZ0201'; > >> > > ┌──┐ > >> > │ QUERY > PLAN │ > >> > > ╞══╡ > >> > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 > rows=114 width=41) (actual time=0.072..0.168 rows=114 loops=1) │ > >> > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > >> > │ Buffers: shared hit=4 > │ > >> > │ Planning Time: 0.539 ms > │ > >> > │ Buffers: shared hit=13 > │ > >> > │ Execution Time: 0.287 ms > │ > >> > > └──┘ > >> > (6 rows) > >> > > >> > And I see share hit 13 in planning time. > >> > > >> > For second run I get > >> > > >> > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE > okres_id = 'CZ0201'; > >> > > ┌──┐ > >> > │ QUERY > PLAN │ > >> > > ╞══╡ > >> > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 > rows=114 width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > >> > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > >> > │ Buffers: shared hit=4 > │ > >> > │ Planning Time: 0.159 ms > │ > >> > │ Execution Time: 0.155 ms > │ > >> > > └──┘ > >> > (5 rows) > >> > > >> > Now, there is not any touch in planning time. Does it mean so this > all these data are cached somewhere in session memory? > >> > >> The planning time is definitely shorter the 2nd time. And yes, what > >> you see are all the catcache accesses that are initially performed on > >> a fresh new backend. > > > > > > One time Tom Lane mentioned using index in planning time for getting > minimum and maximum. I expected so these values are not cached. But I > cannot to reproduce it, and then I am little bit surprised so I don't see > any hit in second, and other executions. > > Isn't that get_actual_variable_range() purpose? If you use a plan > that hit this function you'll definitely see consistent buffer usage > during planning: > > rjuju=# explain (buffers, analyze) select * from pg_class c join > pg_attribute a on a.attrelid = c.oid; > QUERY PLAN > > --- > Hash Join (cost=21.68..110.91 rows=2863 width=504) (actual > time=0.393..5.989 rows=2863 loops=1) >Hash Cond: (a.attrelid = c.oid) >Buffers: shared hit=40 read=29 >-> Seq Scan on pg_attribute a (cost=0.00..81.63 rows=2863 > width=239) (actual time=0.010..0.773 rows=2863 loops=1) > Buffers: shared hit=28 read=25 >-> Hash (cost=16.86..16.86 rows=386 width=265) (actual > time=0.333..0.334 rows=386 loops=1) > Buckets: 1024 Batches: 1 Memory Usage: 85kB > Buffers: shared hit=9 read=4 > -> Seq Scan on pg_class c (cost=0.00..16.86 rows=386 > width=265) (actual time=0.004..0.123 rows=386 loops=1) >Buffers: shared hit=9 read=4 > Planning Time: 2.709 ms >Buffers: shared hit=225 read=33 > Execution Time: 6.529 ms > (13 rows) > > rjuju=# explain (buffers, ana
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh wrote: > > On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar wrote: > > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh > > wrote: > > > > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > This looks wrong. We should change the name of this Macro or we can > > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. > > > > I think this is in sync with below code (SizeOfXlogOrigin), SO doen't > > make much sense to add different terminology no? > > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > In that case, we can rename this, for example, SizeOfXLogTransactionId. > > Some review comments from 0002-Issue-individual-*.path, > > +void > +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid, > + XLogRecPtr lsn, int nmsgs, > + SharedInvalidationMessage *msgs) > +{ > + MemoryContext oldcontext; > + ReorderBufferChange *change; > + > + /* XXX Should we even write invalidations without valid XID? */ > + if (xid == InvalidTransactionId) > + return; > + > + Assert(xid != InvalidTransactionId); > > It seems we don't call the function if xid is not valid. In fact, > You have a valid point. Also, it is not clear if we are first checking (xid == InvalidTransactionId) and returning from the function, how can even Assert hit. > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf) > } > case XLOG_XACT_ASSIGNMENT: > break; > + case XLOG_XACT_INVALIDATIONS: > + { > + TransactionId xid; > + xl_xact_invalidations *invals; > + > + xid = XLogRecGetXid(r); > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > + > + if (!TransactionIdIsValid(xid)) > + break; > + > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > + invals->nmsgs, invals->msgs); > > Why should we insert an WAL record for such cases? > Right, if there is any such case, we should avoid it. One more point about this patch, the commit message needs to be updated: > The new invalidations are written to WAL immediately, without any such caching. Perhaps it would be possible to add similar caching, > e.g. at the command level, or something like that? I think the above part of commit message is not right as the patch already does such a caching now at the command level. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Display of buffers for planning time show nothing for second run
On Tue, Apr 14, 2020 at 11:25 AM Pavel Stehule wrote: > > út 14. 4. 2020 v 10:40 odesílatel Amit Langote > napsal: >> >> On Tue, Apr 14, 2020 at 5:27 PM Julien Rouhaud wrote: >> > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule >> > wrote: >> > > For second run I get >> > > >> > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE okres_id >> > > = 'CZ0201'; >> > > ┌──┐ >> > > │ QUERY PLAN >> > > │ >> > > ╞══╡ >> > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 rows=114 >> > > width=41) (actual time=0.044..0.101 rows=114 loops=1) │ >> > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) >> > > │ >> > > │ Buffers: shared hit=4 >> > > │ >> > > │ Planning Time: 0.159 ms >> > > │ >> > > │ Execution Time: 0.155 ms >> > > │ >> > > └──┘ >> > > (5 rows) >> > > >> > > Now, there is not any touch in planning time. Does it mean so this all >> > > these data are cached somewhere in session memory? >> > >> > The planning time is definitely shorter the 2nd time. And yes, what >> > you see are all the catcache accesses that are initially performed on >> > a fresh new backend. >> >> By the way, even with all catcaches served from local memory, one may >> still see shared buffers being hit during planning. For example: >> >> explain (buffers, analyze) select * from foo where a = 1; >> QUERY PLAN >> --- >> Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 >> width=4) (actual time=0.010..0.011 rows=0 loops=1) >>Index Cond: (a = 1) >>Heap Fetches: 0 >>Buffers: shared hit=2 >> Planning Time: 0.775 ms >>Buffers: shared hit=72 >> Execution Time: 0.086 ms >> (7 rows) >> >> Time: 2.477 ms >> postgres=# explain (buffers, analyze) select * from foo where a = 1; >> QUERY PLAN >> --- >> Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 >> width=4) (actual time=0.012..0.012 rows=0 loops=1) >>Index Cond: (a = 1) >>Heap Fetches: 0 >>Buffers: shared hit=2 >> Planning Time: 0.102 ms >>Buffers: shared hit=1 >> Execution Time: 0.047 ms >> (7 rows) >> >> It seems that 1 Buffer hit comes from get_relation_info() doing >> _bt_getrootheight() for that index on foo. > > > unfortunatelly, I cannot to repeat it. > > create table foo(a int); > create index on foo(a); > insert into foo values(1); > analyze foo; > > for this case any second EXPLAIN is without buffer on my comp _bt_getrootheight() won't cache any value if the index is totally empty. Removing the INSERT in your example should lead to Amit's behavior.
Re: Race condition in SyncRepGetSyncStandbysPriority
At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada wrote in > On Tue, 14 Apr 2020 at 10:34, Tom Lane wrote: > > > > Kyotaro Horiguchi writes: > > > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane wrote in > > >> What I think we should do about this is, essentially, to get rid of > > >> SyncRepGetSyncStandbys. Instead, let's have each walsender advertise > > >> whether *it* believes that it is a sync standby, based on its last > > >> evaluation of the relevant GUCs. This would be a bool that it'd > > >> compute and set alongside sync_standby_priority. (Hm, maybe we'd not > > > > > Mmm.. SyncRepGetStandbyPriority returns the "priority" that a > > > walsender thinks it is at, among synchronous_standby_names. Then to > > > decide "I am a sync standby" we need to know how many walsenders with > > > higher priority are alive now. SyncRepGetSyncStandbyPriority does the > > > judgment now and suffers from the inconsistency of priority values. > > > > Yeah. After looking a bit closer, I think that the current definition > > of sync_standby_priority (that is, as the result of local evaluation > > of SyncRepGetStandbyPriority()) is OK. The problem is what we're doing > > with it. I suggest that what we should do in SyncRepGetSyncRecPtr() > > is make one sweep across the WalSnd array, collecting PID, > > sync_standby_priority, *and* the WAL pointers from each valid entry. > > Then examine that data and decide which WAL value we need, without assuming > > that the sync_standby_priority values are necessarily totally consistent. > > But in any case we must examine each entry just once while holding its > > mutex, not go back to it later expecting it to still be the same. SyncRepGetSyncStandbysPriority() runs while holding SyncRepLock, so the sync_standby_priority of any walsender cannot be changed while the function is scanning the walsenders. The issue is that we already have inconsistent walsender information before we enter the function. Thus how many times we scan the array doesn't make any difference. I think we need to do one of the following. A) prevent SyncRepGetSyncStandbysPriority from being entered while walsender priority is inconsistent. B) make SyncRepGetSyncStandbysPriority tolerant of priority inconsistency. C) protect the walsender priority array from being inconsistent. (B) is a band-aid. To achieve (A) we need a central controller of priority config handling. (C) is: > Can we have a similar approach of sync_standby_defined for > sync_standby_priority? That is, checkpionter is responsible for > changing sync_standby_priority of all walsenders when SIGHUP. That > way, all walsenders can see a consistent view of > sync_standby_priority. And when a walsender starts, it sets > sync_standby_priority by itself. The logic to decide who's a sync > standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders > having higher priority along with their WAL positions. Yeah, it works if we do that, but the problem with that way is that to determine the priority of walsenders, we need to know which walsenders are running. That is, when a new walsender arrives, the process needs to be aware of the arrival (or departure) right away and reassign the priorities of all walsenders again. If we accept sharing variable-length information among processes, sharing sync_standby_names or the parsed SyncRepConfigData among processes would work.
So how come we didn't detect this problem > > long ago? I searched the buildfarm logs for assertion failures in > > syncrep.c, looking back one year, and here's what I found: ... > > The line numbers vary in the back branches, but all of these crashes are > > at that same Assert. So (a) yes, this does happen in the back branches, > > but (b) some fairly recent change has made it a whole lot more probable. > > Neither syncrep.c nor 007_sync_rep.pl have changed much in some time, > > so whatever the change was was indirect. Curious. Is it just timing? > > Interesting. It's happening on certain animals, not all. Especially > tests with HEAD on sidewinder and curculio, which are NetBSD 7 and > OpenBSD 5.9 respectively, started to fail at a high rate since a > couple of days ago. Coundn't this align the timing of config reloading? (I didn't checked anything yet.) | commit 421685812290406daea58b78dfab0346eb683bbb | Author: Noah Misch | Date: Sat Apr 11 10:30:00 2020 -0700 | | When WalSndCaughtUp, sleep only in WalSndWaitForWal(). |Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write |< sentPtr. That is important in logical replication. When the latest |physical LSN yields no logical replication messages (a common case), |that keepalive elicits a reply, and processing the reply updates |pg_stat_replication.replay_lsn. WalSndLoop() lacks that; wh
Re: Display of buffers for planning time show nothing for second run
út 14. 4. 2020 v 11:35 odesílatel Julien Rouhaud napsal: > On Tue, Apr 14, 2020 at 11:25 AM Pavel Stehule > wrote: > > > > út 14. 4. 2020 v 10:40 odesílatel Amit Langote > napsal: > >> > >> On Tue, Apr 14, 2020 at 5:27 PM Julien Rouhaud > wrote: > >> > On Tue, Apr 14, 2020 at 10:18 AM Pavel Stehule < > pavel.steh...@gmail.com> wrote: > >> > > For second run I get > >> > > > >> > > postgres=# EXPLAIN (BUFFERS, ANALYZE) SELECT * FROM obce WHERE > okres_id = 'CZ0201'; > >> > > > ┌──┐ > >> > > │ QUERY > PLAN │ > >> > > > ╞══╡ > >> > > │ Index Scan using obce_okres_id_idx on obce (cost=0.28..14.49 > rows=114 width=41) (actual time=0.044..0.101 rows=114 loops=1) │ > >> > > │ Index Cond: ((okres_id)::text = 'CZ0201'::text) > │ > >> > > │ Buffers: shared hit=4 > │ > >> > > │ Planning Time: 0.159 ms > │ > >> > > │ Execution Time: 0.155 ms >│ > >> > > > └──┘ > >> > > (5 rows) > >> > > > >> > > Now, there is not any touch in planning time. Does it mean so this > all these data are cached somewhere in session memory? > >> > > >> > The planning time is definitely shorter the 2nd time. And yes, what > >> > you see are all the catcache accesses that are initially performed on > >> > a fresh new backend. > >> > >> By the way, even with all catcaches served from local memory, one may > >> still see shared buffers being hit during planning. For example: > >> > >> explain (buffers, analyze) select * from foo where a = 1; > >> QUERY PLAN > >> > --- > >> Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > >> width=4) (actual time=0.010..0.011 rows=0 loops=1) > >>Index Cond: (a = 1) > >>Heap Fetches: 0 > >>Buffers: shared hit=2 > >> Planning Time: 0.775 ms > >>Buffers: shared hit=72 > >> Execution Time: 0.086 ms > >> (7 rows) > >> > >> Time: 2.477 ms > >> postgres=# explain (buffers, analyze) select * from foo where a = 1; > >> QUERY PLAN > >> > --- > >> Index Only Scan using foo_pkey on foo (cost=0.15..8.17 rows=1 > >> width=4) (actual time=0.012..0.012 rows=0 loops=1) > >>Index Cond: (a = 1) > >>Heap Fetches: 0 > >>Buffers: shared hit=2 > >> Planning Time: 0.102 ms > >>Buffers: shared hit=1 > >> Execution Time: 0.047 ms > >> (7 rows) > >> > >> It seems that 1 Buffer hit comes from get_relation_info() doing > >> _bt_getrootheight() for that index on foo. > > > > > > unfortunatelly, I cannot to repeat it. > > > > create table foo(a int); > > create index on foo(a); > > insert into foo values(1); > > analyze foo; > > > > for this case any second EXPLAIN is without buffer on my comp > > _bt_getrootheight() won't cache any value if the index is totally > empty. Removing the INSERT in your example should lead to Amit's > behavior. > aha. good to know it. Thank you Pavel
Re: [HACKERS] make async slave to wait for lsn to be replayed
On 2020-04-11 00:44, Andres Freund wrote: I think there's also some advantages of having it in a single statement for poolers. If a pooler analyzes BEGIN AFTER 'xx/xx' it could e.g. redirect the transaction to a node that's caught up far enough, instead of blocking. But that can't work even close to as easily if it's something that has to be executed before transaction begin. I think that's a good point. Also, I'm not sure how we'd expect a wait-for-LSN procedure to work inside a single-snapshot transaction. Would it throw an error inside a RR transaction block? Would it give a warning? -- Anna Akenteva Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Apr 14, 2020 at 2:57 PM Amit Kapila wrote: > > On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh > wrote: > > > > On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar wrote: > > > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh > > > wrote: > > > > > > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > > This looks wrong. We should change the name of this Macro or we can > > > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. > > > > > > I think this is in sync with below code (SizeOfXlogOrigin), SO doen't > > > make much sense to add different terminology no? > > > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > > > In that case, we can rename this, for example, SizeOfXLogTransactionId. > > > > Some review comments from 0002-Issue-individual-*.path, > > > > +void > > +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid, > > + XLogRecPtr lsn, int nmsgs, > > + SharedInvalidationMessage *msgs) > > +{ > > + MemoryContext oldcontext; > > + ReorderBufferChange *change; > > + > > + /* XXX Should we even write invalidations without valid XID? */ > > + if (xid == InvalidTransactionId) > > + return; > > + > > + Assert(xid != InvalidTransactionId); > > > > It seems we don't call the function if xid is not valid. In fact, > > > > You have a valid point. Also, it is not clear if we are first > checking (xid == InvalidTransactionId) and returning from the > function, how can even Assert hit. I have changed to code, now we only have an assert. > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > > XLogRecordBuffer *buf) > > } > > case XLOG_XACT_ASSIGNMENT: > > break; > > + case XLOG_XACT_INVALIDATIONS: > > + { > > + TransactionId xid; > > + xl_xact_invalidations *invals; > > + > > + xid = XLogRecGetXid(r); > > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > > + > > + if (!TransactionIdIsValid(xid)) > > + break; > > + > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > > + invals->nmsgs, invals->msgs); > > > > Why should we insert an WAL record for such cases? > > > > Right, if there is any such case, we should avoid it. I think we don't have any such case because we are logging at the command end. So I have created an assert instead of the check. > One more point about this patch, the commit message needs to be updated: > > > The new invalidations are written to WAL immediately, without any > such caching. Perhaps it would be possible to add similar caching, > > e.g. at the command level, or something like that? > > I think the above part of commit message is not right as the patch > already does such a caching now at the command level. Right, I have removed that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar wrote: > > > > > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > > > XLogRecordBuffer *buf) > > > } > > > case XLOG_XACT_ASSIGNMENT: > > > break; > > > + case XLOG_XACT_INVALIDATIONS: > > > + { > > > + TransactionId xid; > > > + xl_xact_invalidations *invals; > > > + > > > + xid = XLogRecGetXid(r); > > > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > > > + > > > + if (!TransactionIdIsValid(xid)) > > > + break; > > > + > > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > > > + invals->nmsgs, invals->msgs); > > > > > > Why should we insert an WAL record for such cases? > > > > > > > Right, if there is any such case, we should avoid it. > > I think we don't have any such case because we are logging at the > command end. So I have created an assert instead of the check. > Have you tried to ensure this in some way? One idea could be to add an Assert (to check if transaction id is assigned) in the new code where you are writing WAL for this action and then run make check-world and or make installcheck-world. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Apr 14, 2020 at 3:57 PM Amit Kapila wrote: > > On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar wrote: > > > > > > > > > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > > > > XLogRecordBuffer *buf) > > > > } > > > > case XLOG_XACT_ASSIGNMENT: > > > > break; > > > > + case XLOG_XACT_INVALIDATIONS: > > > > + { > > > > + TransactionId xid; > > > > + xl_xact_invalidations *invals; > > > > + > > > > + xid = XLogRecGetXid(r); > > > > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > > > > + > > > > + if (!TransactionIdIsValid(xid)) > > > > + break; > > > > + > > > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > > > > + invals->nmsgs, invals->msgs); > > > > > > > > Why should we insert an WAL record for such cases? > > > > > > > > > > Right, if there is any such case, we should avoid it. > > > > I think we don't have any such case because we are logging at the > > command end. So I have created an assert instead of the check. > > > > Have you tried to ensure this in some way? One idea could be to add > an Assert (to check if transaction id is assigned) in the new code > where you are writing WAL for this action and then run make > check-world and or make installcheck-world. Yeah, I had already tested that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
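[Editor's note: a tiny standalone C sketch of the convention settled on above, using the standard assert() as a stand-in for the backend's Assert() and an invented stand-in TransactionId type. Rather than silently returning on an invalid xid (which made the old assertion unreachable), the function treats it as a caller bug, which is what running make check-world with the assertion in place is meant to validate.]

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;   /* stand-in for the PostgreSQL type */
    #define InvalidTransactionId ((TransactionId) 0)
    #define TransactionIdIsValid(xid) ((xid) != InvalidTransactionId)

    /* Stand-in for ReorderBufferAddInvalidation(): an unassigned xid is
     * treated as a programming error, not silently ignored. */
    static void
    add_invalidation(TransactionId xid, int nmsgs)
    {
        assert(TransactionIdIsValid(xid));    /* Assert() in backend code */
        printf("queue %d invalidation message(s) for xid %u\n", nmsgs, xid);
    }

    int
    main(void)
    {
        add_invalidation(128, 3);             /* fine */
        /* add_invalidation(InvalidTransactionId, 1);  would trip the assert */
        return 0;
    }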
Re: Poll: are people okay with function/operator table redesign?
On 4/13/20 7:55 PM, Tom Lane wrote: > "Jonathan S. Katz" writes: >> On 4/13/20 7:02 PM, Jonathan S. Katz wrote: >>> Perhaps a counterproposal: We eliminate the content in the leftmost >>> "function column, but leave that there to allow the function name / >>> signature to span the full 3 columns. Then the rest of the info goes >>> below. This will also compress the table height down a bit. >> An attempt at a "POC" of what I'm describing (attached image). > Hmm ... what is determining the width of the left-hand column? > It doesn't seem to have any content, since the function entries > are being spanned across the whole table. > > I think the main practical problem though is that it wouldn't > work nicely for operators, since the key "name" you'd be looking > for would not be at the left of the signature line. I suppose we > don't necessarily have to have the same layout for operators as > for functions, but it feels like it'd be jarringly inconsistent. > > Maybe highlight the item by bolding or colour? cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WIP/PoC for parallel backup
Hi Asif Getting the following error on Parallel backup when --no-manifest option is used. [edb@localhost bin]$ [edb@localhost bin]$ [edb@localhost bin]$ ./pg_basebackup -v -j 5 -D /home/edb/Desktop/backup/ --no-manifest pg_basebackup: initiating base backup, waiting for checkpoint to complete pg_basebackup: checkpoint completed pg_basebackup: write-ahead log start point: 0/228 on timeline 1 pg_basebackup: starting background WAL receiver pg_basebackup: created temporary replication slot "pg_basebackup_10223" pg_basebackup: backup worker (0) created pg_basebackup: backup worker (1) created pg_basebackup: backup worker (2) created pg_basebackup: backup worker (3) created pg_basebackup: backup worker (4) created pg_basebackup: write-ahead log end point: 0/2000100 pg_basebackup: error: could not get data for 'BUILD_MANIFEST': ERROR: could not open file "base/pgsql_tmp/pgsql_tmp_b4ef5ac0fd150b2a28caf626bbb1bef2.1": No such file or directory pg_basebackup: removing contents of data directory "/home/edb/Desktop/backup/" [edb@localhost bin]$ Thanks On Tue, Apr 14, 2020 at 5:33 PM Asif Rehman wrote: > > > On Wed, Apr 8, 2020 at 6:53 PM Kashif Zeeshan < > kashif.zees...@enterprisedb.com> wrote: > >> >> >> On Tue, Apr 7, 2020 at 9:44 PM Asif Rehman >> wrote: >> >>> Hi, >>> >>> Thanks, Kashif and Rajkumar. I have fixed the reported issues. >>> >>> I have added the shared state as previously described. The new grammar >>> changes >>> are as follows: >>> >>> START_BACKUP [LABEL ''] [FAST] [MAX_RATE %d] >>> - This will generate a unique backupid using pg_strong_random(16) >>> and hex-encoded >>> it. which is then returned as the result set. >>> - It will also create a shared state and add it to the hashtable. >>> The hash table size is set >>> to BACKUP_HASH_SIZE=10, but since hashtable can expand >>> dynamically, I think it's >>> sufficient initial size. max_wal_senders is not used, because it >>> can be set to quite a >>> large values. >>> >>> JOIN_BACKUP 'backup_id' >>> - finds 'backup_id' in hashtable and attaches it to server process. >>> >>> >>> SEND_FILE '(' 'FILE' ')' [NOVERIFY_CHECKSUMS] >>> - renamed SEND_FILES to SEND_FILE >>> - removed START_WAL_LOCATION from this because 'startptr' is now >>> accessible through >>> shared state. >>> >>> There is no change in other commands: >>> STOP_BACKUP [NOWAIT] >>> LIST_TABLESPACES [PROGRESS] >>> LIST_FILES [TABLESPACE] >>> LIST_WAL_FILES [START_WAL_LOCATION 'X/X'] [END_WAL_LOCATION 'X/X'] >>> >>> The current patches (v11) have been rebased to the latest master. The >>> backup manifest is enabled >>> by default, so I have disabled it for parallel backup mode and have >>> generated a warning so that >>> user is aware of it and not expect it in the backup. >>> >>> Hi Asif >> >> I have verified the bug fixes, one bug is fixed and working now as >> expected >> >> For the verification of the other bug fixes faced following issues, >> please have a look. >> >> >> 1) Following bug fixes mentioned below are generating segmentation fault. >> >> Please note for reference I have added a description only as steps were >> given in previous emails of each bug I tried to verify the fix. Backtrace >> is also added with each case which points to one bug for both the cases. >> >> a) The backup failed with errors "error: could not connect to server: >> could not look up local user ID 1000: Too many open files" when the >> max_wal_senders was set to 2000. 
>> >> >> [edb@localhost bin]$ ./pg_basebackup -v -j 1990 -D >> /home/edb/Desktop/backup/ >> pg_basebackup: warning: backup manifest is disabled in parallel backup >> mode >> pg_basebackup: initiating base backup, waiting for checkpoint to complete >> pg_basebackup: checkpoint completed >> pg_basebackup: write-ahead log start point: 0/228 on timeline 1 >> pg_basebackup: starting background WAL receiver >> pg_basebackup: created temporary replication slot "pg_basebackup_9925" >> pg_basebackup: backup worker (0) created >> pg_basebackup: backup worker (1) created >> pg_basebackup: backup worker (2) created >> pg_basebackup: backup worker (3) created >> …. >> …. >> pg_basebackup: backup worker (1014) created >> pg_basebackup: backup worker (1015) created >> pg_basebackup: backup worker (1016) created >> pg_basebackup: backup worker (1017) created >> pg_basebackup: error: could not connect to server: could not look up >> local user ID 1000: Too many open files >> Segmentation fault >> [edb@localhost bin]$ >> >> >> [edb@localhost bin]$ >> [edb@localhost bin]$ gdb pg_basebackup >> /tmp/cores/core.pg_basebackup.13219.localhost.localdomain.1586349551 >> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7 >> Copyright (C) 2013 Free Software Foundation, Inc. >> License GPLv3+: GNU GPL version 3 or later < >> http://gnu.org/licenses/gpl.html> >> This is free software: you are free to change and redistribute it. >> There is NO WARRANTY, to the extent permitted by law. Type "show co
While restoring - getting error if dump contains sql statements generated from generated.sql file
Hi , We have a sql file called 'generated.sql' under src/test/regress/sql folder . if we run this file on psql , take the dump and try to restore it on another db we are getting error like - psql:/tmp/x:434: ERROR: column "b" of relation "gtest1_1" is a generated column psql:/tmp/x:441: ERROR: cannot use column reference in DEFAULT expression These sql statements , i copied from the dump file postgres=# CREATE TABLE public.gtest30 ( postgres(# a integer, postgres(# b integer postgres(# ); CREATE TABLE postgres=# postgres=# CREATE TABLE public.gtest30_1 ( postgres(# ) postgres-# INHERITS (public.gtest30); CREATE TABLE postgres=# ALTER TABLE ONLY public.gtest30_1 ALTER COLUMN b SET DEFAULT (a * 2); ERROR: cannot use column reference in DEFAULT expression postgres=# Steps to reproduce - connect to psql - ( ./psql postgres) create database ( create database x;) connect to database x (\c x ) execute generated.sql file (\i ../../src/test/regress/sql/generated.sql) take the dump of x db (./pg_dump -Fp x > /tmp/t.dump) create another database (create database y;) Connect to y db (\c y) execute plain dump sql file (\i /tmp/t.dump) -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: Race condition in SyncRepGetSyncStandbysPriority
Kyotaro Horiguchi writes: > SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so > sync_standby_priority of any walsender can be changed while the > function is scanning welsenders. The issue is we already have > inconsistent walsender information before we enter the function. Thus > how many times we scan on the array doesn't make any difference. *Yes it does*. The existing code can deliver entirely broken results if some walsender exits between where we examine the priorities and where we fetch the WAL pointers. While that doesn't seem to be the exact issue we're seeing in the buildfarm, it's still another obvious bug in this code. I will not accept a "fix" that doesn't fix that. > I think we need to do one of the followings. > A) prevent SyncRepGetSyncStandbysPriority from being entered while > walsender priority is inconsistent. > B) make SyncRepGetSyncStandbysPriority be tolerant of priority > inconsistency. > C) protect walsender priority array from beinig inconsistent. (B) seems like the only practical solution from here. We could probably arrange for synchronous update of the priorities when they change in response to a GUC change, but it doesn't seem to me to be practical to do that in response to walsender exit. You'd end up finding that an unexpected walsender exit results in panic'ing the system, which is no better than where we are now. It doesn't seem to me to be that hard to implement the desired semantics for synchronous_standby_names with inconsistent info. In FIRST mode you basically just need to take the N smallest priorities you see in the array, but without assuming there are no duplicates or holes. It might be a good idea to include ties at the end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync standbys, include the first three of them in the calculation until the inconsistency is resolved. In ANY mode I don't see that inconsistent priorities matter at all. > If we accept to share variable-length information among processes, > sharing sync_standby_names or parsed SyncRepConfigData among processes > would work. Not sure that we really need more than what's being shared now, ie each process's last-known index in the sync_standby_names list. regards, tom lane
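To make the FIRST-mode rule Tom describes concrete, here is a small standalone C sketch (illustrative only, not the walsender code; the array of priorities and the output labels are hypothetical) that picks the num_sync smallest priorities seen in a possibly inconsistent snapshot, tolerating duplicates and holes and keeping ties at the cutoff:

#include <stdbool.h>
#include <stdio.h>

/*
 * priorities[] is a hypothetical snapshot of sync_standby_priority values;
 * 0 means "not a sync candidate".  chosen[] is set true for every walsender
 * whose priority falls at or below the computed cutoff.
 */
static void
pick_sync_standbys(const int *priorities, int nwalsenders,
                   int num_sync, bool *chosen)
{
    int     cutoff = 0;
    int     taken = 0;

    /* raise the cutoff until at least num_sync candidates are covered */
    while (taken < num_sync)
    {
        int     best = 0;
        int     nbest = 0;

        for (int i = 0; i < nwalsenders; i++)
        {
            int     p = priorities[i];

            if (p > cutoff)
            {
                if (best == 0 || p < best)
                {
                    best = p;
                    nbest = 1;
                }
                else if (p == best)
                    nbest++;
            }
        }
        if (best == 0)
            break;              /* fewer candidates than num_sync: take all */
        cutoff = best;
        taken += nbest;
    }

    /* include everything at or below the cutoff, ties included */
    for (int i = 0; i < nwalsenders; i++)
        chosen[i] = (priorities[i] > 0 && priorities[i] <= cutoff);
}

int
main(void)
{
    int     prio[] = {1, 2, 2, 4};  /* the 1,2,2,4 example with num_sync = 2 */
    bool    chosen[4];

    pick_sync_standbys(prio, 4, 2, chosen);
    for (int i = 0; i < 4; i++)
        printf("walsender %d: %s\n", i, chosen[i] ? "sync" : "potential");
    return 0;
}

With {1,2,2,4} and num_sync = 2 this selects the first three walsenders, matching the "include ties at the end" behavior described above; in ANY mode no such cutoff logic is needed at all.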
Re: where should I stick that backup?
On Sun, Apr 12, 2020 at 8:27 PM Andres Freund wrote: > > That's quite appealing. One downside - IMHO significant - is that you > > have to have a separate process to do *anything*. If you want to add a > > filter that just logs everything it's asked to do, for example, you've > > gotta have a whole process for that, which likely adds a lot of > > overhead even if you can somehow avoid passing all the data through an > > extra set of pipes. The interface I proposed would allow you to inject > > very lightweight filters at very low cost. This design really doesn't. > > Well, in what you described it'd still be all done inside pg_basebackup, > or did I misunderstand? Once you fetched it from the server, I can't > imagine the overhead of filtering it a bit differently would matter. > > But even if, the "target" could just reply with "skip" or such, instead > of providing an fd. > > What kind of filtering are you thinking of where this is a problem? > Besides just logging the filenames? I just can't imagine how that's a > relevant overhead compared to having to do things like > 'shell ssh rhaas@depository pgfile create-exclusive - %f.lz4' Anything you want to do in the same process. I mean, right now we have basically one target (filesystem) and one filter (compression). Neither of those things spawn a process. It seems logical to imagine that there might be other things that are similar in the future. It seems to me that there are definitely things where you will want to spawn a process; that's why I like having shell commands as one option. But I don't think we should require that you can't have a filter or a target unless you also spawn a process for it. > I really think we want the option to eventually do this server-side. And > I don't quite see it as viable to go for an API that allows to specify > shell fragments that are going to be executed server side. The server-side thing is a good point, but I think it adds quite a bit of complexity, too. I'm worried that this is ballooning to an unworkable amount of complexity - and not just code complexity, but bikeshedding complexity, too. Like, I started with a command-line option that could probably have been implemented in a few hundred lines of code. Now, we're up to something where you have to build custom processes that speak a novel protocol and work on both the client and the server side. That's at least several thousand lines of code, maybe over ten thousand if the sample binaries that use the new protocol are more than just simple demonstrations of how to code to the interface. More importantly, it means agreeing on the nature of this custom protocol, which seems like something where I could put in a ton of effort to create something and then have somebody complain because it's not JSON, or because it is JSON, or because the capability negotiation system isn't right, or whatever. I'm not exactly saying that we shouldn't do it; I think it has some appeal. But I'd sure like to find some way of getting started that doesn't involve having to do everything in one patch, and then getting told to change it all again - possibly with different people wanting contradictory things. > > Note that you could build this on top of what I proposed, but not the > > other way around. > > Why should it not be possible the other way round? Because a C function call API lets you decide to spawn a process, but if the framework inherently spawns a process, you can't decide not to do so in a particular case. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Poll: are people okay with function/operator table redesign?
On 4/13/20 7:13 PM, Tom Lane wrote: As discussed in the thread at [1], I've been working on redesigning the tables we use to present SQL functions and operators. The first installment of that is now up; see tables 9.30 and 9.31 at https://www.postgresql.org/docs/devel/functions-datetime.html and table 9.33 at https://www.postgresql.org/docs/devel/functions-enum.html Before I spend more time on this, I want to make sure that people are happy with this line of attack. Comparing these tables to the way they look in v12, they clearly take more vertical space; but at least to my eye they're less cluttered and more readable. They definitely scale a lot better for cases where a long function description is needed, or where we'd like to have more than one example. Does anyone prefer the old way, or have a better idea? I know that the table headings are a bit weirdly laid out; hopefully that can be resolved [2]. I prefer the old way since I find it very hard to see which fields belong to which function in the new way. I think what confuses my eyes is how some rows are split in half while others are not, especially for those functions where there is only one example output. I do not have any issue reading those with many example outputs. For the old tables I can at least just make the browser window ridiculously wide ro read them. Andreas
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 6:12 PM Andy Fan wrote: > On Tue, Apr 14, 2020 at 4:58 PM Amit Langote wrote: >> I am saying that instead of building index path with disabled cost, >> just don't build it at all. A base rel will always have a sequetial >> path, even though with disabled cost if enable_seqscan = off. > > Let's say user set enable_seqscan=off and set enable_indexscan=off; > will you expect user to get seqscan at last? If so, why is seqscan > (rather than index scan) since both are disabled by user equally? I was really thinking of this in terms of planner effort, which for creating an index path is more than creating sequential path, although sure the payoff can be great. That is, I want the planner to avoid creating index paths *to save cycles*, but see no way of making that happen. I was thinking disabling enable_indexscan would do the trick. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: index paths and enable_indexscan
Amit Langote writes: > I am saying that instead of building index path with disabled cost, > just don't build it at all. A base rel will always have a sequetial > path, even though with disabled cost if enable_seqscan = off. Awhile back I'd looked into getting rid of disable_cost altogether by dint of not generating disabled paths. It's harder than it sounds. We could perhaps change this particular case, but it's not clear that there's any real benefit of making this one change in isolation. Note that you can't just put a big OFF switch at the start of indxpath.c, because enable_indexscan and enable_bitmapscan are distinct switches, but the code to generate those path types is inextricably intertwined. Skipping individual paths further down on the basis of the appropriate switch would be fairly subtle and perhaps bug-prone. The existing implementation of those switches has the advantages of being trivially simple and clearly correct (for some value of "correct"). regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
Andreas Karlsson writes: > For the old tables I can at least just make the browser window > ridiculously wide ro read them. A large part of the point here is to make the tables usable when you don't have that option, as for example in PDF output. Even with a wide window, though, some of our function tables are monstrously ugly. regards, tom lane
Re: index paths and enable_indexscan
Amit Langote writes: > I was really thinking of this in terms of planner effort, which for > creating an index path is more than creating sequential path, although > sure the payoff can be great. That is, I want the planner to avoid > creating index paths *to save cycles*, but see no way of making that > happen. I was thinking disabling enable_indexscan would do the trick. I think that's completely misguided, because in point of fact nobody is going to care about the planner's performance with enable_indexscan turned off. It's not an interesting production case. All of these enable_xxx switches exist just for debug purposes, and so the right way to think about them is "what's the simplest, least bug-prone, lowest-maintenance way to get the effect?". Likewise, I don't actually much care what results you get if you turn off *all* of them. It's not a useful case to spend our effort on. regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
On 4/14/20 4:29 PM, Tom Lane wrote: Andreas Karlsson writes: For the old tables I can at least just make the browser window ridiculously wide ro read them. A large part of the point here is to make the tables usable when you don't have that option, as for example in PDF output. Even with a wide window, though, some of our function tables are monstrously ugly. Sure, but I wager the number of people using the HTML version of our documentation on laptops and desktop computers are the biggest group of users. That said, I agree with that quite many of our tables right now are ugly, but I prefer ugly to hard to read. For me the mix of having every third row split into two fields makes the tables very hard to read. I have a hard time seeing which rows belong to which function. Andreas
Re: Poll: are people okay with function/operator table redesign?
Andreas Karlsson writes: > That said, I agree with that quite many of our tables right now are > ugly, but I prefer ugly to hard to read. For me the mix of having every > third row split into two fields makes the tables very hard to read. I > have a hard time seeing which rows belong to which function. Did you look at the variants without that discussed downthread? regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
On Mon, Apr 13, 2020 at 4:29 PM Tom Lane wrote: > I wouldn't be averse to dropping the text descriptions for operators > in places where they seem obvious ... but who decides what is obvious? Well, we do. We're smart, right? I don't think it's a good idea to add clutter to table A just because table B needs more details. What matters is whether table A needs more details. The v12 version of the "Table 9.30. Date/Time Operators" is not that wide, and is really quite clear. The new version takes 3 lines per operator where the old one took one. That's because you've added (1) a description of the fact that + does addition and - does subtraction, repeated for each operator, and (2) explicit information about the input and result types. I don't think either adds much, in this case. The former doesn't really need to be explained, and the latter was clear enough from the way the examples were presented - everything had explicit types. For more complicated cases, one thing we could do is ditch the table and use a <variablelist> with a separate <varlistentry> for each operator. So you could have something like: date + date → timestamp Lengthy elocution, including an example. But I would only advocate for this style in cases where there is substantial explaining to be done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: where should I stick that backup?
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Sun, Apr 12, 2020 at 8:27 PM Andres Freund wrote: > > I really think we want the option to eventually do this server-side. And > > I don't quite see it as viable to go for an API that allows to specify > > shell fragments that are going to be executed server side. > > The server-side thing is a good point, but I think it adds quite a bit > of complexity, too. I'm worried that this is ballooning to an > unworkable amount of complexity - and not just code complexity, but > bikeshedding complexity, too. Like, I started with a command-line > option that could probably have been implemented in a few hundred > lines of code. Now, we're up to something where you have to build > custom processes that speak a novel protocol and work on both the > client and the server side. That's at least several thousand lines of > code, maybe over ten thousand if the sample binaries that use the new > protocol are more than just simple demonstrations of how to code to > the interface. More importantly, it means agreeing on the nature of > this custom protocol, which seems like something where I could put in > a ton of effort to create something and then have somebody complain > because it's not JSON, or because it is JSON, or because the > capability negotiation system isn't right, or whatever. I'm not > exactly saying that we shouldn't do it; I think it has some appeal. > But I'd sure like to find some way of getting started that doesn't > involve having to do everything in one patch, and then getting told to > change it all again - possibly with different people wanting > contradictory things. Doing things incrementally and not all in one patch absolutely makes a lot of sense and is a good idea. Wouldn't it make sense to, given that we have some idea of what we want it to eventually look like, to make progress in that direction though? That is- I tend to agree with Andres that having this supported server-side eventually is what we should be thinking about as an end-goal (what is the point of pg_basebackup in all of this, after all, if the goal is to get a backup of PG from the PG server to s3..? why go through some other program or through the replication protocol?) and having the server exec'ing out to run shell script fragments to make that happen looks like it would be really awkward and full of potential risks and issues and agreement that it wouldn't be a good fit. If, instead, we worked on a C-based interface which includes filters and storage drivers, and was implemented through libpgcommon, we could start with that being all done through pg_basebackup and work to hammer out the complications and issues that we run into there and, once it seems reasonably stable and works well, we could potentially pull that into the backend to be run directly without having to have pg_basebackup involved in the process. There's been good progress in the direction of having more done by the backend already, and that's thanks to you and it's good work- specifically that the backend now has the ability to generate a manifest, with checksums included as the backup is being run, which is definitely an important piece. Thanks, Stephen signature.asc Description: PGP signature
Re: Poll: are people okay with function/operator table redesign?
Robert Haas writes: > The v12 version of the "Table 9.30. Date/Time Operators" is not that > wide, and is really quite clear. Well, no it isn't. The main nit I would pick in that claim is that it's far from obvious that the three examples of float8 * interval are all talking about the same operator; in fact, a reader would be very likely to draw the false conclusion that there is an integer * interval operator. This is an aspect of the general problem that we don't have a nice way to deal with multiple examples in the tables. Somebody kluged their way around it here in this particular way, but I'd really like a clearer way, because we need more examples. I would also point out that this table is quite out of step with the rest of the docs in its practice of showing the results as though they were typed literals. Most places that show results just show what you'd expect to see in a psql output column, making it necessary to show the result data type somewhere else. > The new version takes 3 lines per > operator where the old one took one. That's because you've added (1) a > description of the fact that + does addition and - does subtraction, > repeated for each operator, and (2) explicit information about the > input and result types. I don't think either add much, in this case. As I already said, I agree about the text descriptions being of marginal value in this case. I disagree about the explicit datatypes, because the float8 * interval cases already show a hole in that argument, and surely we don't want to require every example to use explicitly-typed literals and nothing but. Besides, what will you do for operators that take anyarray or the like? > For more complicated cases, one thing we could do is ditch the table > and use a with a separate for each > operator. So you could have something like: > ... > But I would only advocate for this style in cases where there is > substantial explaining to be done. I'd like to have more consistency, not less. I do not think it helps readers to have each page in Chapter 9 have its own idiosyncratic way of presenting operators/functions. The operator tables are actually almost that bad, right now --- compare section 9.1 (hasn't even bothered with a formal ) with tables 9.1, 9.4, 9.9, 9.12, 9.14, 9.30, 9.34, 9.37, 9.41, 9.44. The variation in level of detail and precision is striking, and not in a good way. regards, tom lane
Re: Properly mark NULL returns in numeric aggregates
Hi David, On Mon, Apr 13, 2020 at 10:46 PM David Rowley wrote: > > On Tue, 14 Apr 2020 at 06:14, Tom Lane wrote: > > Yeah, they're relying exactly on the assumption that nodeAgg is not > > going to try to copy a value declared "internal", and therefore they > > can be loosey-goosey about whether the value pointer is null or not. > > However, if you want to claim that that's wrong, you have to explain > > why it's okay for some other code to be accessing a value that's > > declared "internal". I'd say that the meaning of that is precisely > > "keepa u hands off". > > > > In the case at hand, the current situation is that we only expect the > > values returned by these combine functions to be read by the associated > > final functions, which are on board with the null-pointer representation > > of an empty result. Your argument is essentially that it should be > > possible to feed the values to the aggregate's associated serialization > > function as well. But the core code never does that, so I'm not convinced > > that we should add it to the requirements; we'd be unable to test it. > > Casting my mind back to when I originally wrote that code, I attempted > to do so in such a way so that it could one day be used for a 3-stage > aggregation. e.g Parallel Partial Aggregate -> Gather -> Combine > Serial Aggregate on one node, then on some master node a Deserial > Combine Finalize Aggregate. You're very right that we can't craft > such a plan with today's master (We didn't even add a supporting enum > for it in AggSplit). However, it does appear that there are > extensions or forks out there which attempt to use the code in this > way, so it would be good to not leave those people out in the cold > regarding this. Greenplum plans split-aggregation quite similarly from Postgres: while it doesn't pass partial results through a intra-cluster "Gather" -- using a reshuffle-by-hash type operation instead -- Greenplum _does_ split an aggregate into final and partial halves, running them on different nodes. In short, the relation ship among the combine, serial, and deserial functions are similar to how they are in Postgres today (serial->deserial->combine), in the context of splitting aggregates. The current problem arises because Greenplum spills the hash table in hash aggregation (a diff we're working actively to upstream), a process in which we have to touch (read: serialize and copy) the internal trans values. However, we are definitely eyeing what you described as something to move towards. As a fork, we'd like to carry as thin a diff as possible. So the current situation is pretty much forcing us to diverge in the functions mentioned up-thread. In hindsight, "sloppy" might not have been a wise choice of words, apologies for the possible offense, David! > > For testing, can't we just have an Assert() in > advance_transition_function that verifies isnull matches the > nullability of the return value for INTERNAL returning transfns? i.e, > the attached > > I don't have a test case to hand that could cause this to fail, but it > sounds like Jesse might. One easy way to cause this is "sum(x) FILTER (WHERE false)" which will for sure make the partial results NULL. Is that what you're looking for? I'll be happy to send in the SQL. Cheers, Jesse
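For readers following along, a minimal standalone illustration of the invariant David's proposed Assert would enforce (this is not the attached patch; the Datum stand-in and function name are hypothetical):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

/*
 * For a transition function whose result type is INTERNAL (a pass-by-
 * reference pointer), the proposal is that the isnull flag and the
 * pointer-ness of the returned Datum must agree.
 */
static void
check_internal_transvalue(Datum newval, bool isnull)
{
    assert(isnull == ((void *) newval == NULL));
}

int
main(void)
{
    check_internal_transvalue((Datum) 0, true);     /* consistent: passes */
    /* check_internal_transvalue((Datum) 0, false); -- would trip the Assert */
    return 0;
}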
Re: where should I stick that backup?
On Tue, Apr 14, 2020 at 11:08 AM Stephen Frost wrote: > Wouldn't it make sense to, given that we have some idea of what we want > it to eventually look like, to make progress in that direction though? Well, yes. :-) > That is- I tend to agree with Andres that having this supported > server-side eventually is what we should be thinking about as an > end-goal (what is the point of pg_basebackup in all of this, after all, > if the goal is to get a backup of PG from the PG server to s3..? why > go through some other program or through the replication protocol?) and > having the server exec'ing out to run shell script fragments to make > that happen looks like it would be really awkward and full of potential > risks and issues and agreement that it wouldn't be a good fit. I'm fairly deeply uncomfortable with what Andres is proposing. I see that it's very powerful, and can do a lot of things, and that if you're building something that does sophisticated things with storage, you probably want an API like that. It does a great job making complicated things possible. However, I feel that it does a lousy job making simple things simple. Suppose you want to compress using your favorite compression program. Well, you can't. Your favorite compression program doesn't speak the bespoke PostgreSQL protocol required for backup plugins. Neither does your favorite encryption program. Either would be perfectly happy to accept a tarfile on stdin and dump out a compressed or encrypted version, as the case may be, on stdout, but sorry, no such luck. You need a special program that speaks the magic PostgreSQL protocol but otherwise does pretty much the exact same thing as the standard one. It's possibly not the exact same thing. A special might, for example, use multiple threads for parallel compression rather than multiple processes, perhaps gaining a bit of efficiency. But it's doubtful whether all users care about such marginal improvements. All they're going to see is that they can use gzip and maybe lz4 because we provide the necessary special magic tools to integrate with those, but for some reason we don't have a special magic tool that they can use with their own favorite compressor, and so they can't use it. I think people are going to find that fairly unhelpful. Now, it's a problem we can work around. We could have a "shell gateway" program which acts as a plugin, speaks the backup plugin protocol, and internally does fork-and-exec stuff to spin up copies of any binary you want to act as a filter. I don't see any real problem with that. I do think it's very significantly more complicated than just what Andres called an FFI. It's gonna be way easier to just write something that spawns shell processes directly than it is to write something that spawns a process and talks to it using this protocol and passes around file descriptors using the various different mechanisms that different platforms use for that, and then that process turns around and spawns some other processes and passes along the file descriptors to them. Now you've added a whole bunch of platform-specific code and a whole bunch of code to generate and parse protocol messages to achieve exactly the same thing that you could've done far more simply with a C API. Even accepting as a given the need to make the C API work separately on both the client and server side, you've probably at least doubled, and I suspect more like quadrupled, the amount of infrastructure that has to be built. So... 
> If, instead, we worked on a C-based interface which includes filters and > storage drivers, and was implemented through libpgcommon, we could start > with that being all done through pg_basebackup and work to hammer out > the complications and issues that we run into there and, once it seems > reasonably stable and works well, we could potentially pull that into > the backend to be run directly without having to have pg_basebackup > involved in the process. ...let's do this. Actually, I don't really mind if we target something that can work on both the client and server side initially, but based on C, not a new wire protocol with file descriptor passing. That new wire protocol, and the file descriptor passing infrastructure that goes with it, are things that I *really* think should be pushed off to version 2, because I think they're going to generate a lot of additional work and complexity, and I don't want to deal with all of it at once. Also, I don't really see what's wrong with the server forking processes that exec("/usr/bin/lz4") or whatever. We do similar things in other places and, while it won't work for cases where you want to compress a shazillion files, that's not really a problem here anyway. At least at the moment, the server-side format is *always* tar, so the problem of needing a separate subprocess for every file in the data directory does not arise. > There's been good progress in the direction of having more done by the > backend already, and
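For readers trying to picture the "C-based interface which includes filters and storage drivers" idea, here is a purely hypothetical sketch -- none of these type or function names exist in PostgreSQL; they only illustrate how a filter and a target could share one function-pointer interface, with process-spawning left as a per-implementation choice rather than something the framework forces:

#include <stddef.h>

typedef struct BackupSink BackupSink;

typedef struct BackupSinkOps
{
    /* called once per backup, e.g. to open an output file or spawn a helper */
    void        (*begin_backup) (BackupSink *sink, const char *label);
    /* called for each chunk of (possibly already filtered) archive data */
    void        (*write_data) (BackupSink *sink, const char *data, size_t len);
    /* called once at the end: flush, close, reap any child process */
    void        (*end_backup) (BackupSink *sink);
} BackupSinkOps;

struct BackupSink
{
    const BackupSinkOps *ops;   /* behavior of this filter or target */
    BackupSink *next;           /* a filter forwards here; a final target has NULL */
    void       *private_data;   /* compression state, file handle, child pid, ... */
};

/* example: a filter that just logs can stay lightweight -- no extra process */
static void
logging_write(BackupSink *sink, const char *data, size_t len)
{
    /* record that len bytes passed through, then forward them unchanged */
    if (sink->next)
        sink->next->ops->write_data(sink->next, data, len);
}

Under such an interface a shell-command target would simply be one more BackupSinkOps implementation that forks the user's program, rather than a special case wired into the copy loop.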
Re: Properly mark NULL returns in numeric aggregates
David Rowley writes: > For testing, can't we just have an Assert() in > advance_transition_function that verifies isnull matches the > nullability of the return value for INTERNAL returning transfns? i.e, > the attached FTR, I do not like this Assert one bit. nodeAgg.c has NO business inquiring into the contents of internal-type Datums. It has even less business enforcing a particular Datum value for a SQL null --- we have always, throughout the system, considered that if isnull is true then the contents of the Datum are unspecified. I think this is much more likely to cause problems than solve any. regards, tom lane
PG compilation error with Visual Studio 2015/2017/2019
>>I'm still working on testing this patch. If anyone has an idea please suggest. I still see problems with this patch. 1. Variable loct has a redundant initialization; it would be enough to declare it as: _locale_t loct; 2. Style: whitespace in the rc variable declaration. 3. Style: the scope of variable cp_index can be reduced. if (tmp != NULL) { size_t cp_index; cp_index = (size_t)(tmp - winlocname); strncpy(loc_name, winlocname, cp_index); loc_name[cp_index] = '\0'; 4. Memory leak: if _WIN32_WINNT >= 0x0600 is true, _free_locale(loct); is not called. 5. Why call _create_locale at all if _WIN32_WINNT >= 0x0600 is true and loct is not used? regards, Ranier Vilela
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On 2020-04-14 12:10, Dilip Kumar wrote: v14-0001-Immediately-WAL-log-assignments.patch + v14-0002-Issue-individual-invalidations-with.patch + v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+ v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+ v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch + v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+ v14-0007-Track-statistics-for-streaming.patch + v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch + v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch + v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch applied on top of 8128b0c (a few hours ago) Hi, I haven't followed this thread and maybe this instabilty is known/expected; just thought I'd let you know. When doing running a pgbench run over logical replication (cascading down two replicas), I get this segmentation fault. 2020-04-14 17:27:28.135 CEST [8118] DETAIL: Streaming transactions committing after 0/5FA2A38, reading WAL from 0/5FA2A00. 2020-04-14 17:27:28.135 CEST [8118] LOG: logical decoding found consistent point at 0/5FA2A00 2020-04-14 17:27:28.135 CEST [8118] DETAIL: There are no running transactions. 2020-04-14 17:27:28.138 CEST [8006] LOG: server process (PID 8118) was terminated by signal 11: Segmentation fault 2020-04-14 17:27:28.138 CEST [8006] DETAIL: Failed process was running: COMMIT 2020-04-14 17:27:28.138 CEST [8006] LOG: terminating any other active server processes 2020-04-14 17:27:28.138 CEST [8163] WARNING: terminating connection because of crash of another server process 2020-04-14 17:27:28.138 CEST [8163] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2020-04-14 17:27:28.138 CEST [8163] HINT: In a moment you should be able to reconnect to the database and repeat your command. This error happens somewhat buried away in my test-stuff; I can dig it out and make it into a repeatable test if you need it. (debian stretch/gcc 9.3.0) Erik Rijkers
Re: Poll: are people okay with function/operator table redesign?
On Tue, Apr 14, 2020 at 11:26 AM Tom Lane wrote: > Well, no it isn't. The main nit I would pick in that claim is that > it's far from obvious that the three examples of float8 * interval > are all talking about the same operator; in fact, a reader would > be very likely to draw the false conclusion that there is an > integer * interval operator. I agree that's not great. I think that could possibly be fixed by showing all three examples in the same cell, and maybe by revising the choice of examples. > I'd like to have more consistency, not less. I do not think it helps > readers to have each page in Chapter 9 have its own idiosyncratic way of > presenting operators/functions. The operator tables are actually almost > that bad, right now --- compare section 9.1 (hasn't even bothered with > a formal ) with tables 9.1, 9.4, 9.9, 9.12, 9.14, 9.30, 9.34, > 9.37, 9.41, 9.44. The variation in level of detail and precision is > striking, and not in a good way. Well, I don't know. Having two or even three formats is not the same as having infinitely many formats, and may be justified if the needs are sufficiently different from each other. At any rate, if the price of more clarity and more examples is that the tables become three times as long and harder to read, I am somewhat inclined to think that the cure is worse than the disease. I can readily see how something like table 9.10 (Other String Functions) might be a mess on a narrow screen or in PDF format, but it's an extremely useful table on a normal-size screen in HTML format, and part of what makes it useful is that it's compact. Almost anything we do is going to remove some of that compactness to save horizontal space. Maybe that's OK, but it's sure not great. It's nice to be able to see more on one screen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: cleaning perl code
On 4/13/20 12:47 PM, Andrew Dunstan wrote: > > OK, I've committed all that stuff. I think that takes care of the > non-controversial part of what I proposed :-) > > OK, it seems there is a majority of people commenting in this thread in favor of not doing more except to reverse the policy of requiring subroutine returns. I'll do that shortly. In the spirit of David Steele's contribution, here is a snippet that when added to the perlcriticrc would allow us to pass at the "brutal" setting (severity 1). But I'm not proposing to add this, it's just here so anyone interested can see what's involved. One of the things that's a bit sad is that perlcritic doesn't generally let you apply policies to a given set of files or files matching some pattern. It would be nice, for instance, to be able to apply some additional standards to strategic library files like PostgresNode.pm, TestLib.pm and Catalog.pm. There are good reasons as suggested upthread to apply higher standards to library files than to, say, a TAP test script. The only easy way I can see to do that would be to have two different perlcriticrc files and adjust pgperlcritic to make two runs. If people think that's worth it I'll put a little work into it. If not, I'll just leave things here. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services # severity 4 [-BuiltinFunctions::RequireBlockGrep] # 9 occurrences [-BuiltinFunctions::RequireBlockMap] # 1 occurrences [-InputOutput::ProhibitReadlineInForLoop] # 2 occurrences [-InputOutput::RequireBriefOpen] # 83 occurrences [-Modules::ProhibitAutomaticExportation] # 7 occurrences [-Modules::ProhibitMultiplePackages] # 9 occurrences [-Objects::ProhibitIndirectSyntax] # 12 occurrences [-Subroutines::RequireArgUnpacking] # 39 occurrences [-TestingAndDebugging::ProhibitNoWarnings] # 6 occurrences [-TestingAndDebugging::ProhibitProlongedStrictureOverride] # 2 occurrences [-ValuesAndExpressions::ProhibitCommaSeparatedStatements] # 4 occurrences [-ValuesAndExpressions::ProhibitConstantPragma] # 2 occurrences [-ValuesAndExpressions::ProhibitMixedBooleanOperators] # 2 occurrences [-Variables::RequireLocalizedPunctuationVars] # 72 occurrences # severity 3 [-BuiltinFunctions::ProhibitComplexMappings] # 2 occurrences [-BuiltinFunctions::ProhibitLvalueSubstr] # 1 occurrences [-BuiltinFunctions::ProhibitVoidMap] # 1 occurrences [-BuiltinFunctions::RequireSimpleSortBlock] # 1 occurrences [-ClassHierarchies::ProhibitExplicitISA] # 10 occurrences [-CodeLayout::ProhibitHardTabs] # 172 occurrences [-ControlStructures::ProhibitCascadingIfElse] # 15 occurrences [-ControlStructures::ProhibitDeepNests] # 1 occurrences [-ControlStructures::ProhibitNegativeExpressionsInUnlessAndUntilConditions] # 5 occurrences [-ErrorHandling::RequireCarping] # 284 occurrences [-ErrorHandling::RequireCheckingReturnValueOfEval] # 11 occurrences [-InputOutput::ProhibitBacktickOperators] # 13 occurrences [-InputOutput::ProhibitJoinedReadline] # 1 occurrences [-InputOutput::RequireCheckedOpen] # 24 occurrences [-Miscellanea::ProhibitUnrestrictedNoCritic] # 12 occurrences [-Modules::ProhibitConditionalUseStatements] # 1 occurrences [-Modules::ProhibitExcessMainComplexity] # 15 occurrences [-NamingConventions::ProhibitAmbiguousNames] # 3 occurrences [-RegularExpressions::ProhibitCaptureWithoutTest] # 30 occurrences [-RegularExpressions::ProhibitComplexRegexes] # 267 occurrences [-RegularExpressions::ProhibitUnusedCapture] # 11 occurrences 
[-RegularExpressions::RequireExtendedFormatting] # 1048 occurrences [-Subroutines::ProhibitExcessComplexity] # 13 occurrences [-Subroutines::ProhibitManyArgs] # 9 occurrences [-Subroutines::ProhibitUnusedPrivateSubroutines] # 3 occurrences [-TestingAndDebugging::RequireTestLabels] # 4 occurrences [-ValuesAndExpressions::ProhibitImplicitNewlines] # 312 occurrences [-ValuesAndExpressions::ProhibitMismatchedOperators] # 11 occurrences [-ValuesAndExpressions::ProhibitQuotesAsQuotelikeOperatorDelimiters] # 2 occurrences [-ValuesAndExpressions::ProhibitVersionStrings] # 1 occurrences [-ValuesAndExpressions::RequireQuotedHeredocTerminator] # 98 occurrences [-Variables::ProhibitPackageVars] # 79 occurrences [-Variables::ProhibitReusedNames] # 14 occurrences [-Variables::ProhibitUnusedVariables] # 8 occurrences [-Variables::RequireInitializationForLocalVars] # 3 occurrences # severity 2 [-BuiltinFunctions::ProhibitBooleanGrep] # 14 occurrences [-BuiltinFunctions::ProhibitStringySplit] # 8 occurrences [-BuiltinFunctions::ProhibitUselessTopic] # 4 occurrences [-CodeLayout::ProhibitQuotedWordLists] # 13 occurrences [-ControlStructures::ProhibitCStyleForLoops] # 26 occurrences [-ControlStructures::ProhibitPostfixControls] # 325 occurrences [-ControlStructures::ProhibitUnlessBlocks] # 12 occurrences [-Documentation::RequirePodSections] # 51 occurrences [-InputOutput::RequireCheckedClose] # 140 occurrences [-Miscellanea::ProhibitTies] #
Re: Poll: are people okay with function/operator table redesign?
Robert Haas writes: > At any rate, if the price of more clarity and more examples is that > the tables become three times as long and harder to read, I am > somewhat inclined to think that the cure is worse than the disease. I > can readily see how something like table 9.10 (Other String Functions) > might be a mess on a narrow screen or in PDF format, but it's an > extremely useful table on a normal-size screen in HTML format, and > part of what makes it useful is that it's compact. Almost anything we > do is going to remove some of that compactness to save horizontal > space. Maybe that's OK, but it's sure not great. It's nice to be able > to see more on one screen. I dunno, it doesn't look to me like 9.10 is some paragon of efficient use of screen space, even with a wide window. (And my goodness it looks bad if I try a window about half my usual web-browsing width.) Maybe I should go convert that one to see what it looks like in one of the other layouts being discussed. regards, tom lane
Re: [BUG] non archived WAL removed during production crash recovery
On Fri, 10 Apr 2020 11:00:31 +0900 (JST) Kyotaro Horiguchi wrote: [...] > > > but the mistake here is it thinks that inRecovery represents whether it is > > > running as a standby or not, but actually it is true on primary during > > > crash recovery. > > > > Indeed. > > > > > On the other hand, with the patch, standby with archive_mode=on > > > wrongly archives WAL files during crash recovery. > > > > "without the patch" you mean? You are talking about 78ea8b5daab, right? > > No. I menat the v4 patch in [1]. > > [1] https://www.postgresql.org/message-id/20200407171736.61906608%40firost > > Prior to appllying the patch (that is the commit 78ea..), > XLogArchiveCheckDone() correctly returns true (= allow to remove it) > for the same condition. > > The proposed patch does the folloing thing. > > if (!XLogArchivingActive() || > recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways()) > return true; > > It doesn't return for the condition "recoveryState=CRASH_RECOVERING > and archive_mode = on". Then the WAL files is mitakenly marked as > ".ready" if not marked yet. Indeed. But .ready files are then deleted during the first restartpoint. I'm not sure how to fix this behavior without making the code too complex. This is discussed in my last answer to Michael few minutes ago as well. > By the way, the code seems not following the convention a bit > here. Let the inserting code be in the same style to the existing code > around. > > + if ( ! XLogArchivingActive() ) > > I think we don't put the spaces within the parentheses above. Indeed, This is fixed in patch v5 sent few minutes ago. > | ARCHIVE_RECOVERING/CRASH_RECOVERING/NOT_IN_RECOVERY > > The first two and the last one are in different style. *I* prefer them > (if we use it) be "IN_ARCHIVE_RECOVERY/IN_CRASH_RECOVERY/NOT_IN_RECOVERY". I like Michael's proposal. See v5 of the patch. Thank you for your review! Regards,
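For clarity, here is a standalone sketch of the decision being debated (hypothetical enum and function names, and a reading of the v4 patch's logic rather than any committed code):

#include <stdbool.h>

typedef enum RecoveryState
{
    NOT_IN_RECOVERY,
    IN_CRASH_RECOVERY,
    IN_ARCHIVE_RECOVERY
} RecoveryState;

typedef enum ArchiveMode
{
    ARCHIVE_MODE_OFF,
    ARCHIVE_MODE_ON,            /* archive on a primary only */
    ARCHIVE_MODE_ALWAYS         /* archive on primary and standby */
} ArchiveMode;

/*
 * Returns true when a completed WAL segment needs no .ready marker and can
 * simply be recycled or removed.
 */
static bool
segment_needs_no_archiving(RecoveryState state, ArchiveMode mode)
{
    if (mode == ARCHIVE_MODE_OFF)
        return true;            /* archiving disabled entirely */

    if (state == IN_ARCHIVE_RECOVERY && mode != ARCHIVE_MODE_ALWAYS)
        return true;            /* a standby archives only with "always" */

    /*
     * The remaining case under discussion: IN_CRASH_RECOVERY with
     * archive_mode=on falls through to "keep it", so segments replayed
     * during crash recovery can get marked .ready; as noted above, those
     * markers are then cleaned up at the first restartpoint.
     */
    return false;
}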
ERROR: could not determine which collation to use for string comparison
Guys; This errors out with: ERROR: could not determine which collation to use for string comparison HINT: Use the COLLATE clause to set the collation explicitly. The database is init'ed with: initdb -D $PGDATA -E utf8 --locale=nb_NO.UTF-8 13-dev HEAD as of 8128b0c152a67917535f50738ac26da4f984ddd9 Works fine in <= 12 === create table person( id serial primary key, firstname varchar, lastname varchar ); insert into person(firstname, lastname) values ('Andreas', 'Krogh'); CREATE OR REPLACE FUNCTION concat_lower(varchar, varchar) RETURNS varchar AS $$ SELECT nullif(lower(coalesce($1, '')) || lower(coalesce($2, '')), '') $$ LANGUAGE SQL IMMUTABLE; select * from person pers ORDER BY concat_lower(pers.firstname, pers.lastname) ASC; === -- Andreas Joseph Krogh
Re: documenting the backup manifest file format
On Mon, Apr 13, 2020 at 5:43 PM Alvaro Herrera wrote: > Yeah, I guess I'm just saying that it feels brittle to have a file > format that's supposed to be good for data exchange and then make it > itself depend on representation details such as the order that fields > appear in, the letter case, or the format of newlines. Maybe this isn't > really of concern, but it seemed strange. I didn't want to use JSON for this at all, but I got outvoted. When I raised this issue, it was suggested that I deal with it in this way, so I did. I can't really defend it too far beyond that, although I do think that one nice thing about this is that you can verify the checksum using shell commands if you want. Just figure out the number of lines in the file, minus one, and do head -n$LINES backup_manifest | shasum -a256 and boom. If there were some whitespace-skipping thing figuring out how to reproduce the checksum calculation would be hard. > I think strict ISO 8601 might be preferable (with the T in the middle > and ending in Z instead of " GMT"). Hmm, did David suggest that before? I don't recall for sure. I think he had some suggestion, but I'm not sure if it was the same one. > > > Why is the top-level checksum only allowed to be SHA-256, if the > > > files can use up to SHA-512? > > Thanks for the discussion. I think you mostly want to make sure that > the manifest is sensible (not corrupt) rather than defend against > somebody maliciously giving you an attacking manifest (??). I incline > to agree that any SHA-2 hash is going to serve that purpose and have no > further comment to make. The code has other sanity checks against the manifest failing to parse properly, so you can't (I hope) crash it or anything even if you falsify the checksum. But suppose that there is a gremlin running around your system flipping occasional bits. If said gremlin flips a bit in a "0" that appears in a file's checksum string, it could become a "1", a "3", or a "7", all of which are still valid characters for a hex string. When you then tried to verify the backup, verification for that file would fail, but you'd think it was a problem with the file, rather than a problem with the manifest. The manifest checksum prevents that: you'll get a complaint about the manifest checksum being wrong rather than a complaint about the file not matching the manifest checksum. A sufficiently smart gremlin could figure out the expected checksum for the revised manifest and flip bits to make the actual value match the expected one, but I think we're worried about "chaotic neutral" gremlins, not "lawful evil" ones. That having been said, there was some discussion on the original thread about keeping your backup on regular storage and your manifest checksum in a concrete bunker at the bottom of the ocean; in that scenario, it should be possible to detect tampering in either the manifest itself or in non-WAL data files, as long as the adversary can't break SHA-256. But I'm not sure how much we should really worry about that. For me, the design center for this feature is a user who untars base.tar and forgets about 43965.tar. If that person runs pg_verifybackup, it's gonna tell them that things are broken, and that's good enough for me. It may not be good enough for everybody, but it's good enough for me. I think I'm going to go ahed and push this now, maybe with a small wording tweak as discussed upthread with Andrew. 
The rest of this discussion is really about whether the patch needs any design changes rather than about whether the documentation describes what the patch does, so it makes sense to me to commit this first and then if somebody wants to argue for a change they certainly can. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: documenting the backup manifest file format
On 4/14/20 12:56 PM, Robert Haas wrote: On Mon, Apr 13, 2020 at 5:43 PM Alvaro Herrera wrote: Yeah, I guess I'm just saying that it feels brittle to have a file format that's supposed to be good for data exchange and then make it itself depend on representation details such as the order that fields appear in, the letter case, or the format of newlines. Maybe this isn't really of concern, but it seemed strange. I didn't want to use JSON for this at all, but I got outvoted. When I raised this issue, it was suggested that I deal with it in this way, so I did. I can't really defend it too far beyond that, although I do think that one nice thing about this is that you can verify the checksum using shell commands if you want. Just figure out the number of lines in the file, minus one, and do head -n$LINES backup_manifest | shasum -a256 and boom. If there were some whitespace-skipping thing figuring out how to reproduce the checksum calculation would be hard. I think strict ISO 8601 might be preferable (with the T in the middle and ending in Z instead of " GMT"). Hmm, did David suggest that before? I don't recall for sure. I think he had some suggestion, but I'm not sure if it was the same one. "I'm also partial to using epoch time in the manifest because it is generally easier for programs to work with. But, human-readable doesn't suck, either." Also you don't need to worry about time-zone conversion errors -- even if the source time is UTC this can easily happen if you are not careful. It also saves a parsing step. The downside is it is not human-readable but this is intended to be a machine-readable format so I don't think it's a big deal (encoded filenames will be just as opaque). If a user really needs to know what time some file is (rare, I think) they can paste it with a web tool to find out. Regards, -- -David da...@pgmasters.net
Re: sqlsmith crash incremental sort
On Sun, Apr 12, 2020 at 8:09 PM Tomas Vondra wrote: > > On Sun, Apr 12, 2020 at 12:44:45AM +0200, Tomas Vondra wrote: > >Hi, > > > >I've looked into this a bit, and at first I thought that maybe the > >issue is in how cost_incremental_sort picks the EC members. It simply > >does this: > > > >EquivalenceMember *member = (EquivalenceMember *) > >linitial(key->pk_eclass->ec_members); > > > >so I was speculating that maybe there are multiple EC members and the > >one we need is not the first one. That would have been easy to fix. > > > >But that doesn't seem to be the case - in this example the EC ony has a > >single EC member anyway. > > > >(gdb) p key->pk_eclass->ec_members > >$14 = (List *) 0x12eb958 > >(gdb) p *key->pk_eclass->ec_members > >$15 = {type = T_List, length = 1, max_length = 5, elements = 0x12eb970, > > initial_elements = 0x12eb970} > > > >and the member is a Var with varno=0 (with a RelabelType on top, but > >that's irrelevant). > > > >(gdb) p *(Var*)((RelabelType*)member->em_expr)->arg > >$12 = {xpr = {type = T_Var}, varno = 0, varattno = 1, vartype = 12445, > > vartypmod = -1, varcollid = 950, varlevelsup = 0, varnosyn = 0, varattnosyn > > = 1, location = -1} > > > >which then triggers the assert in find_base_rel. When looking for other > >places calling estimate_num_groups I found this in prepunion.c: > > > > * XXX you don't really want to know about this: we do the estimation > > * using the subquery's original targetlist expressions, not the > > * subroot->processed_tlist which might seem more appropriate. The > > * reason is that if the subquery is itself a setop, it may return a > > * processed_tlist containing "varno 0" Vars generated by > > * generate_append_tlist, and those would confuse estimate_num_groups > > * mightily. We ought to get rid of the "varno 0" hack, but that > > * requires a redesign of the parsetree representation of setops, so > > * that there can be an RTE corresponding to each setop's output. > > > >which seems pretty similar to the issue at hand, because the subpath is > >T_UpperUniquePath (not sure if that passes as setop, but the symptoms > >match nicely). > > > >Not sure what to do about it in cost_incremental_sort, though :-( > > > > I've been messing with this the whole day, without much progress :-( > > I'm 99.% sure it's the same issue described by the quoted comment, > because the plan looks like this: > > Nested Loop Left Join > -> Sample Scan on pg_namespace > Sampling: system ('7.2'::real) > -> Incremental Sort > Sort Key: ... > Presorted Key: ... > -> Unique > -> Sort > Sort Key: ... > -> Append > -> Nested Loop > ... > -> Nested Loop > ... > > so yeah, the plan does have set operations, and generate_append_tlist > does generate Vars with varno == 0, causing this issue. This is a bit of an oddly shaped plan anyway, right? In an ideal world the sort for the unique would have knowledge about what would be useful for the parent node, and we wouldn't need the incremental sort at all. I'm not sure that that kind of thing is really a new problem, though, and it might not even be entirely possible to fix directly by trying to push down knowledge about useful sort keys to whatever created that sort path; it might only be fixable by having the incremental sort (or even regular sort) path creation know to "subsume" a sort underneath it. Anyway, I think that's a bit off topic, but it stood out to me. > But I'm not entirely sure what to do about it in cost_incremental_sort. 
> The comment (introduced by 89deca582a in 2017) suggests a proper fix > would require redesigning the parsetree representation of setops, and > it's a bit too late for that. > > So I wonder what a possible solution might look like. I was hoping we > might grab the original target list and use that, similarly to > recurse_set_operations, but I'm not sure how/where to get it. This is also not an area I'm familiar with. Reading through the prepunion.c code alongside cost_incremental_sort, it seems that we don't have access to the same level of information as the prepunion code (i.e., we're only looking at the result of the union, not the components of it), and trying descend down into it seems even more gross, so, see below... > Another option is to use something as simple as checking for Vars with > varno==0 in cost_incremental_sort() and ignoring them somehow. We could > simply use some arbitrary estimate - by assuming the rows are unique or > something like that. Yes, I agree it's pretty ugly and I'd much rather > find a way to generate something sensible, but I'm not even sure we can > generate good estimate when doing UNION of data from different relations > and so on. Th
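A minimal sketch of the "check for Vars with varno==0 and ignore them" idea quoted above (a hypothetical helper written in the usual expression-walker idiom, not a proposed patch; because the walker recurses it also sees Vars hidden under a RelabelType):

#include "postgres.h"

#include "nodes/nodeFuncs.h"
#include "nodes/primnodes.h"

/*
 * Return true if the expression references the dummy varno 0 that
 * generate_append_tlist() produces for setop output columns.  A caller
 * such as cost_incremental_sort() could skip estimate_num_groups() for
 * such a pathkey and fall back to a default group-size estimate (e.g.
 * treat the presorted keys as unique) instead of reaching find_base_rel()
 * with varno 0.
 */
static bool
contains_varno_zero_walker(Node *node, void *context)
{
    if (node == NULL)
        return false;
    if (IsA(node, Var) && ((Var *) node)->varno == 0)
        return true;
    return expression_tree_walker(node, contains_varno_zero_walker, context);
}

static bool
contains_varno_zero(Node *expr)
{
    return contains_varno_zero_walker(expr, NULL);
}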
Re: documenting the backup manifest file format
On 2020-Apr-14, David Steele wrote: > On 4/14/20 12:56 PM, Robert Haas wrote: > > > Hmm, did David suggest that before? I don't recall for sure. I think > > he had some suggestion, but I'm not sure if it was the same one. > > "I'm also partial to using epoch time in the manifest because it is > generally easier for programs to work with. But, human-readable doesn't > suck, either." Ugh. If you go down that road, why write human-readable contents at all? You may as well just use a binary format. But that's a very slippery slope and you won't like to be in the bottom -- I don't see what that gains you. It's not like it's a lot of work to parse a timestamp in a non-internationalized well-defined human-readable format. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: documenting the backup manifest file format
On 4/14/20 1:27 PM, Alvaro Herrera wrote: On 2020-Apr-14, David Steele wrote: On 4/14/20 12:56 PM, Robert Haas wrote: Hmm, did David suggest that before? I don't recall for sure. I think he had some suggestion, but I'm not sure if it was the same one. "I'm also partial to using epoch time in the manifest because it is generally easier for programs to work with. But, human-readable doesn't suck, either." Ugh. If you go down that road, why write human-readable contents at all? You may as well just use a binary format. But that's a very slippery slope and you won't like to be in the bottom -- I don't see what that gains you. It's not like it's a lot of work to parse a timestamp in a non-internationalized well-defined human-readable format. Well, times are a special case because they are so easy to mess up. Try converting ISO-8601 to epoch time using the standard C functions on a system where TZ != UTC. Fun times. Regards, -- -David da...@pgmasters.net
Re: index paths and enable_indexscan
On Tue, Apr 14, 2020 at 10:21 AM Tom Lane wrote: > Awhile back I'd looked into getting rid of disable_cost altogether > by dint of not generating disabled paths. It's harder than it > sounds. We could perhaps change this particular case, but it's > not clear that there's any real benefit of making this one change > in isolation. I like the idea and have had the same thought before. I wondered whether we could arrange to generate paths for a rel and then if we end up with no paths, do it again ignoring the disable flags. It didn't seem entirely easy to rearrange things to work that way, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Do we need to handle orphaned prepared transactions in the server?
Thank you so much Bruce and Sawada. Bruce: I'll update the documentation with more details for max_age_prepared_xacts Sawada: You have raised 4 very valid points. Here are my thoughts. (1) I think your concern is that if we can reduce the need of 2 GUCs to one. The purpose of having 2 different GUCs was serving two different purposes. - max_age_prepared_xacts; this defines the maximum age of a prepared transaction after which it may be considered an orphan. - prepared_xacts_vacuum_warn_timeout; since we are throwing warnings in the log, we need a way of controlling the behaviour to prevent from flooding the log file with our messages. This timeout defines that. May be there is a better way of handling this, but may be not. (2) Your point is that when there are more than one prepared transactions (and even if the first one is removed), timeout prepared_xacts_vacuum_warn_timeout isn't always accurate. Yes, I agree. However, for us to hit the exact timeout for each prepared transaction, we need setup timers and callback functions. That's too much of an overhead IMHO. The alternative design that I took (the current design) is based on the assumption that we don't need a precise timeout for these transactions or for vacuum to report these issues to log. So, a decent enough way of setting a timeout should be good enough considering that it doesn't add any real overhead to the vacuum process. This does mean that an orphaned prepared transaction may be notified after prepared_xacts_vacuum_warn_timeout * 2. This, IMHO should be acceptable behavior. (3) Message is too detailed. Yes, I agree. I'll review this an update the patch. (4) GUCs should be renamed. Yes, I agree. The names you have suggested make more sense. I'll send an updated version of the patch with the new names and other suggested changes. On Wed, Mar 11, 2020 at 10:43 AM Masahiko Sawada < masahiko.saw...@2ndquadrant.com> wrote: > On Mon, 2 Mar 2020 at 21:42, Hamid Akhtar wrote: > > > > Here is the v2 of the same patch after rebasing it and running it > through pgindent. There are no other code changes. > > > > Thank you for working on this. I think what this patch tries to > achieve would be helpful to inform orphaned prepared transactions that > can be cause of inefficient vacuum to users. > > As far as I read the patch, the setting this feature using newly added > parameters seems to be complicated to me. IIUC, even if a prepared > transactions is enough older than max_age_prepared_xacts, we don't > warn if it doesn't elapsed prepared_xacts_vacuum_warn_timeout since > when the "first" prepared transaction is created. And the first > prepared transaction means that the first entry for > TwoPhaseStateData->prepXacts. Therefore, if there is always more than > one prepared transaction, we don't warn for at least > prepared_xacts_vacuum_warn_timeout seconds even if the first added > prepared transaction is already removed. So I'm not sure how we can > think the setting value of prepared_xacts_vacuum_warn_timeout. > > Regarding the warning message, I wonder if the current message is too > detailed. If we want to inform that there is orphaned prepared > transactions to users, it seems to me that it's enough to report the > existence (and possibly the number of orphaned prepared transactions), > rather than individual details. 
> > Given that the above things, we can simply think this feature; we can > have only max_age_prepared_xacts, and autovacuum checks the minimum of > prepared_at of prepared transactions, and compare it to > max_age_prepared_xacts. We can warn if (CurrentTimestamp - > min(prepared_at)) > max_age_prepared_xacts. In addition, if we also > want to control this behavior by the age of xid, we can have another > GUC parameter for comparing the age(min(xid of prepared transactions)) > to that value. > > Finally, regarding the name of parameters, when we mention the age of > transaction it means the age of xid of the transaction, not the time. > Please refer to other GUC parameter having "age" in its name such as > autovacuum_freeze_max_age, vacuum_freeze_min_age. The patch adds > max_age_prepared_xacts but I think it should be renamed. For example, > prepared_xact_warn_min_duration is for time and > prepared_xact_warn_max_age for age. > > Regards, > > -- > Masahiko Sawadahttp://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > -- Highgo Software (Canada/China/Pakistan) URL : www.highgo.ca ADDR: 10318 WHALLEY BLVD, Surrey, BC CELL:+923335449950 EMAIL: mailto:hamid.akh...@highgo.ca SKYPE: engineeredvirus
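For what it's worth, the simplified rule sketched in the quoted mail, i.e. warn when (CurrentTimestamp - min(prepared_at)) exceeds the threshold, is only a few lines; a rough illustration follows (the function, its parameters, and the message wording are hypothetical, not taken from the patch):

#include "postgres.h"
#include "utils/timestamp.h"

/*
 * Emit a warning if the oldest prepared transaction has been around
 * longer than the configured threshold (in seconds).  A caller such as
 * autovacuum would pass min(prepared_at) across the prepared-transaction
 * list plus the number of transactions exceeding the threshold.
 */
static void
warn_old_prepared_xacts(TimestampTz oldest_prepared_at, int n_old,
						int warn_after_secs)
{
	if (n_old > 0 &&
		TimestampDifferenceExceeds(oldest_prepared_at,
								   GetCurrentTimestamp(),
								   warn_after_secs * 1000))
		ereport(WARNING,
				(errmsg("%d prepared transactions are older than %d seconds",
						n_old, warn_after_secs)));
}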
Re: documenting the backup manifest file format
On 4/14/20 1:33 PM, David Steele wrote: > On 4/14/20 1:27 PM, Alvaro Herrera wrote: >> On 2020-Apr-14, David Steele wrote: >> >>> On 4/14/20 12:56 PM, Robert Haas wrote: >>> Hmm, did David suggest that before? I don't recall for sure. I think he had some suggestion, but I'm not sure if it was the same one. >>> >>> "I'm also partial to using epoch time in the manifest because it is >>> generally easier for programs to work with. But, human-readable >>> doesn't >>> suck, either." >> >> Ugh. If you go down that road, why write human-readable contents at >> all? You may as well just use a binary format. But that's a very >> slippery slope and you won't like to be in the bottom -- I don't see >> what that gains you. It's not like it's a lot of work to parse a >> timestamp in a non-internationalized well-defined human-readable format. > > Well, times are a special case because they are so easy to mess up. > Try converting ISO-8601 to epoch time using the standard C functions > on a system where TZ != UTC. Fun times. > > Even if it's a zulu time? That would be pretty damn sad. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Corruption during WAL replay
Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical-section around truncate, but I just want to emphasize the need for reversing the order of the dropping the buffers and the truncation. Repro details (when full page write = off) 1) Page on disk has empty LP 1, Insert into page LP 1 2) checkpoint START (Recovery REDO eventually starts here) 3) Delete all rows on the page (page is empty now) 4) Autovacuum kicks in and truncates the pages DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty 5) Checkpoint completes 6) Crash 7) smgrtruncate - Not reached (this is where we do the physical truncate) Now the crash-recovery starts Delete-log-replay (above step-3) reads page with empty LP 1 and the delete fails with PANIC (old page on disk with no insert) Doing recovery, truncate is even not reached, a WAL replay of the truncation will happen in the future but the recovery fails (repeatedly) even before reaching that point. Best regards, Teja From: Kyotaro Horiguchi Sent: Monday, April 13, 2020 7:35 PM To: masahiko.saw...@2ndquadrant.com Cc: and...@anarazel.de ; tejesw...@hotmail.com ; pgsql-hack...@postgresql.org ; hexexp...@comcast.net Subject: Re: Corruption during WAL replay At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada wrote in > On Mon, 13 Apr 2020 at 17:40, Andres Freund wrote: > > > > Hi, > > > > On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote: > > > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti wrote: > > > /* > > > * We WAL-log the truncation before actually truncating, which means > > > * trouble if the truncation fails. If we then crash, the WAL replay > > > * likely isn't going to succeed in the truncation either, and cause a > > > * PANIC. It's tempting to put a critical section here, but that cure > > > * would be worse than the disease. It would turn a usually harmless > > > * failure to truncate, that might spell trouble at WAL replay, into a > > > * certain PANIC. > > > */ > > > > Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL > > log something and then not perform the action. This leads to to primary > > / replica getting out of sync, crash recovery potentially not completing > > (because of records referencing the should-be-truncated pages), ... It is introduced in 2008 by 3396000684, for 8.4. So it can be said as an overlook when introducing log-shipping. The reason other operations like INSERTs (that extends the underlying file) are "safe" after an extension failure is the following operations are performed in shared buffers as if the new page exists, then tries to extend the file again. So if we continue working after truncation failure, we need to disguise on shared buffers as if the truncated pages are gone. But we don't have a room for another flag in buffer header. For example, BM_DIRTY && !BM_VALID might be able to be used as the state that the page should have been truncated but not succeeded yet, but I'm not sure. Anyway, I think the prognosis of a truncation failure is far hopeless than extension failure in most cases and I doubt that it's good to introduce such a complex feature only to overcome such a hopeless situation. In short, I think we should PANIC in that case. > > > As a second idea, I wonder if we can defer truncation until commit > > > time like smgrDoPendingDeletes mechanism. The sequence would be: > > > > This is mostly an issue during [auto]vacuum partially truncating the end > > of the file. 
We intentionally release the AEL regularly to allow other > accesses to continue. > > For transactional truncations we don't go down this path (as we create a > new relfilenode). > > > > > At RelationTruncate(), > > > 1. WAL logging. > > > 2. Remember buffers to be dropped. > > > > You definitely cannot do that, as explained above. > > Ah yes, you're right. > > So it seems to me currently what we can do for this issue would be to > enclose the truncation operation in a critical section. IIUC it's not > enough just to reverse the order of dropping buffers and physical file > truncation because it cannot solve the problem of inconsistency on the > standby. And as Horiguchi-san mentioned, there is no need to reverse > that order if we envelop the truncation operation by a critical > section because we can recover page changes during crash recovery. The > strategy of writing out all dirty buffers before dropping buffers, > proposed as (a) in [1], also seems not enough. Agreed. Since neither WAL-logging->not-performing nor performing->WAL-logging is acceptable, there's no other way than working as if the truncation succeeded (and trying again) even if it actually failed. But that would be too complex. Just making it a critical section seems the right thing here. > [1] > https://www.postgresql.org/message-id/201912070012
Re: pg_basebackup, manifests and backends older than ~12
On Mon, Apr 13, 2020 at 8:23 PM Michael Paquier wrote: > On Mon, Apr 13, 2020 at 07:55:07PM -0400, Robert Haas wrote: > > Oh, hmm. Maybe I'm getting confused with a previous version of the > > patch that behaved differently. > > No problem. If you prefer keeping this part of the code, that's fine > by me. If you think that the patch is suited as-is, including > silencing the error forcing to use --no-manifest on server versions > older than v13, I am fine to help out and apply it myself, but I am > also fine if you wish to take care of it by yourself. Feel free to go ahead. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Do we need to handle orphaned prepared transactions in the server?
On Wed, Feb 19, 2020 at 10:05 AM Hamid Akhtar wrote: > Attached is version 1 of POC patch for notifying of orphaned > prepared transactions via warnings emitted to a client > application and/or log file. It applies to PostgreSQL branch > "master" on top of "e2e02191" commit. I think this is a bad idea and that we should reject the patch. It's true that forgotten prepared transactions are a problem, but it's also true that you can monitor for that yourself using the pg_prepared_xacts view. If you do, you will have a lot more flexibility than this patch gives you, or than any similar patch ever can give you. Generally, people don't pay attention to warnings in logs, so they're just clutter. Moreover, there are tons of different things for which you should probably monitor (wraparound perils, slow checkpoints, bloated tables, etc.) and so the real solution is to run some monitoring software. So even if you do pay attention to your logs, and even if the GUCs this provides give you sufficient flexibility for your needs in this one area, you still need to run some monitoring software. At which point, you don't also need this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: documenting the backup manifest file format
On 4/14/20 3:03 PM, Andrew Dunstan wrote: On 4/14/20 1:33 PM, David Steele wrote: On 4/14/20 1:27 PM, Alvaro Herrera wrote: On 2020-Apr-14, David Steele wrote: On 4/14/20 12:56 PM, Robert Haas wrote: Hmm, did David suggest that before? I don't recall for sure. I think he had some suggestion, but I'm not sure if it was the same one. "I'm also partial to using epoch time in the manifest because it is generally easier for programs to work with. But, human-readable doesn't suck, either." Ugh. If you go down that road, why write human-readable contents at all? You may as well just use a binary format. But that's a very slippery slope and you won't like to be in the bottom -- I don't see what that gains you. It's not like it's a lot of work to parse a timestamp in a non-internationalized well-defined human-readable format. Well, times are a special case because they are so easy to mess up. Try converting ISO-8601 to epoch time using the standard C functions on a system where TZ != UTC. Fun times. Even if it's a zulu time? That would be pretty damn sad. ZULU/GMT/UTC are all fine. But if the server timezone is EDT for example (not that I recommend this) you are likely to get the wrong result. Results vary based on your platform. For instance, we found MacOS was more likely to work the way you would expect and Linux was hopeless. There are all kinds of fun tricks to get around this (sort of). One is to temporarily set TZ=UTC which sucks if an error happens before it gets set back. There are some hacks to try to determine your offset which have inherent race conditions around DST changes. After some experimentation we just used the Posix definition for epoch time and used that to do our conversions: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_16 Regards, -- -David da...@pgmasters.net
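For reference, the POSIX definition linked above boils down to plain arithmetic, so the conversion can be done without ever consulting TZ; a minimal sketch (not pgBackRest's actual code) of turning a broken-down UTC time into epoch seconds:

#include <time.h>

/*
 * Convert a struct tm already expressed in UTC to Unix epoch seconds
 * using the POSIX "Seconds Since the Epoch" formula, avoiding
 * mktime()'s dependence on the TZ environment variable.
 */
static long long
tm_to_epoch_utc(const struct tm *tm)
{
	return tm->tm_sec + tm->tm_min * 60 + tm->tm_hour * 3600 +
		tm->tm_yday * 86400LL + (tm->tm_year - 70) * 31536000LL +
		((tm->tm_year - 69) / 4) * 86400LL -
		((tm->tm_year - 1) / 100) * 86400LL +
		((tm->tm_year + 299) / 400) * 86400LL;
}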
Re: Parallel copy
Hello, I was going through some literatures on parsing CSV files in a fully parallelized way and found (from [1]) an interesting approach implemented in the open-source project ParaText[2]. The algorithm follows a two-phase approach: the first pass identifies the adjusted chunks in parallel by exploiting the simplicity of CSV formats and the second phase processes complete records within each adjusted chunk by one of the available workers. Here is the sketch: 1. Each worker scans a distinct fixed sized chunk of the CSV file and collects the following three stats from the chunk: a) number of quotes b) position of the first new line after even number of quotes c) position of the first new line after odd number of quotes 2. Once stats from all the chunks are collected, the leader identifies the adjusted chunk boundaries by iterating over the stats linearly: - For the k-th chunk, the leader adds the number of quotes in k-1 chunks. - If the number is even, then the k-th chunk does not start in the middle of a quoted field, and the first newline after an even number of quotes (the second collected information) is the first record delimiter in this chunk. - Otherwise, if the number is odd, the first newline after an odd number of quotes (the third collected information) is the first record delimiter. - The end position of the adjusted chunk is obtained based on the starting position of the next adjusted chunk. 3. Once the boundaries of the chunks are determined (forming adjusted chunks), individual worker may take up one adjusted chunk and process the tuples independently. Although this approach parses the CSV in parallel, it requires two scan on the CSV file. So, given a system with spinning hard-disk and small RAM, as per my understanding, the algorithm will perform very poorly. But, if we use this algorithm to parse a CSV file on a multi-core system with a large RAM, the performance might be improved significantly [1]. Hence, I was trying to think whether we can leverage this idea for implementing parallel COPY in PG. We can design an algorithm similar to parallel hash-join where the workers pass through different phases. 1. Phase 1 - Read fixed size chunks in parallel, store the chunks and the small stats about each chunk in the shared memory. If the shared memory is full, go to phase 2. 2. Phase 2 - Allow a single worker to process the stats and decide the actual chunk boundaries so that no tuple spans across two different chunks. Go to phase 3. 3. Phase 3 - Each worker picks one adjusted chunk, parse and process tuples from the same. Once done with one chunk, it picks the next one and so on. 4. If there are still some unread contents, go back to phase 1. We can probably use separate workers for phase 1 and phase 3 so that they can work concurrently. Advantages: 1. Each worker spends some significant time in each phase. Gets benefit of the instruction cache - at least in phase 1. 2. It also has the same advantage of parallel hash join - fast workers get to work more. 3. We can extend this solution for reading data from STDIN. Of course, the phase 1 and phase 2 must be performed by the leader process who can read from the socket. Disadvantages: 1. Surely doesn't work if we don't have enough shared memory. 2. Probably, this approach is just impractical for PG due to certain limitations. Thoughts? [1] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/chunker-sigmod19.pdf [2] ParaText. https://github.com/wiseio/paratext. 
-- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
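To make phases 1 and 2 of the chunking approach above concrete, here is an illustrative sketch (all names invented; it also ignores configurable quote/escape characters, embedded carriage returns, and similar details):

/* Per-chunk statistics collected in phase 1. */
typedef struct ChunkStats
{
	long		nquotes;		/* (a) number of quote characters */
	long		first_nl_even;	/* (b) offset of first newline after an
								 * even number of quotes, or -1 */
	long		first_nl_odd;	/* (c) offset of first newline after an
								 * odd number of quotes, or -1 */
} ChunkStats;

/* Phase 1: each worker scans its fixed-size chunk independently. */
static void
scan_chunk(const char *buf, long len, ChunkStats *stats)
{
	long		nquotes = 0;
	long		i;

	stats->first_nl_even = -1;
	stats->first_nl_odd = -1;

	for (i = 0; i < len; i++)
	{
		if (buf[i] == '"')
			nquotes++;
		else if (buf[i] == '\n')
		{
			if (nquotes % 2 == 0 && stats->first_nl_even < 0)
				stats->first_nl_even = i;
			else if (nquotes % 2 == 1 && stats->first_nl_odd < 0)
				stats->first_nl_odd = i;
		}
	}
	stats->nquotes = nquotes;
}

/*
 * Phase 2: the leader walks the stats once.  Chunk k starts inside a
 * quoted field iff the total number of quotes in chunks 0..k-1 is odd,
 * which decides which newline is its first record delimiter; the
 * adjusted chunk then runs until the next chunk's adjusted start.
 */
static void
compute_adjusted_starts(const ChunkStats *stats, int nchunks,
						long chunk_size, long *adjusted_start)
{
	long		quotes_before = 0;
	int			k;

	for (k = 0; k < nchunks; k++)
	{
		long		nl = (quotes_before % 2 == 0) ?
			stats[k].first_nl_even : stats[k].first_nl_odd;

		if (k == 0)
			adjusted_start[k] = 0;	/* the file's first record starts here */
		else
			adjusted_start[k] = (nl >= 0) ? k * chunk_size + nl + 1 : -1;

		quotes_before += stats[k].nquotes;
	}
}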
Re: documenting the backup manifest file format
On 4/14/20 3:19 PM, David Steele wrote: > On 4/14/20 3:03 PM, Andrew Dunstan wrote: >> >> On 4/14/20 1:33 PM, David Steele wrote: >>> On 4/14/20 1:27 PM, Alvaro Herrera wrote: On 2020-Apr-14, David Steele wrote: > On 4/14/20 12:56 PM, Robert Haas wrote: > >> Hmm, did David suggest that before? I don't recall for sure. I think >> he had some suggestion, but I'm not sure if it was the same one. > > "I'm also partial to using epoch time in the manifest because it is > generally easier for programs to work with. But, human-readable > doesn't > suck, either." Ugh. If you go down that road, why write human-readable contents at all? You may as well just use a binary format. But that's a very slippery slope and you won't like to be in the bottom -- I don't see what that gains you. It's not like it's a lot of work to parse a timestamp in a non-internationalized well-defined human-readable format. >>> >>> Well, times are a special case because they are so easy to mess up. >>> Try converting ISO-8601 to epoch time using the standard C functions >>> on a system where TZ != UTC. Fun times. >> >> Even if it's a zulu time? That would be pretty damn sad. > ZULU/GMT/UTC are all fine. But if the server timezone is EDT for > example (not that I recommend this) you are likely to get the wrong > result. Results vary based on your platform. For instance, we found > MacOS was more likely to work the way you would expect and Linux was > hopeless. > > There are all kinds of fun tricks to get around this (sort of). One is > to temporarily set TZ=UTC which sucks if an error happens before it > gets set back. There are some hacks to try to determine your offset > which have inherent race conditions around DST changes. > > After some experimentation we just used the Posix definition for epoch > time and used that to do our conversions: > > https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_16 > > > OK, but I think if we're putting a timestamp string in ISO-8601 format in the manifest it should be in UTC / Zulu time, precisely to avoid these issues. If that's too much trouble then yes an epoch time will probably do. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: documenting the backup manifest file format
On 4/14/20 3:55 PM, Andrew Dunstan wrote: OK, but I think if we're putting a timestamp string in ISO-8601 format in the manifest it should be in UTC / Zulu time, precisely to avoid these issues. If that's too much trouble then yes an epoch time will probably do. Happily ISO-8601 is always UTC. The problem I'm referring to is the timezone setting on the host system when doing conversions in C. To be fair most languages handle this well and C is C so I'm not sure we need to make a big deal of it. In JSON/XML it's pretty common to use ISO-8601 so that seems like a rational choice. Regards, -- -David da...@pgmasters.net
Re: documenting the backup manifest file format
On 2020-Apr-14, Andrew Dunstan wrote: > OK, but I think if we're putting a timestamp string in ISO-8601 format > in the manifest it should be in UTC / Zulu time, precisely to avoid > these issues. If that's too much trouble then yes an epoch time will > probably do. The timestamp is always specified and always UTC (except the code calls it GMT). + /* +* Convert last modification time to a string and append it to the +* manifest. Since it's not clear what time zone to use and since time +* zone definitions can change, possibly causing confusion, use GMT +* always. +*/ + appendStringInfoString(&buf, "\"Last-Modified\": \""); + enlargeStringInfo(&buf, 128); + buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%d %H:%M:%S %Z", + pg_gmtime(&mtime)); + appendStringInfoString(&buf, "\""); I was merely saying that it's trivial to make this iso-8601 compliant as buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%dT%H:%M:%SZ", ie. omit the "GMT" string and replace it with a literal Z, and remove the space and replace it with a T. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: documenting the backup manifest file format
On 2020-Apr-14, David Steele wrote: > Happily ISO-8601 is always UTC. Uh, it is not -- https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Race condition in SyncRepGetSyncStandbysPriority
I wrote: > It doesn't seem to me to be that hard to implement the desired > semantics for synchronous_standby_names with inconsistent info. > In FIRST mode you basically just need to take the N smallest > priorities you see in the array, but without assuming there are no > duplicates or holes. It might be a good idea to include ties at the > end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync > standbys, include the first three of them in the calculation until > the inconsistency is resolved. In ANY mode I don't see that > inconsistent priorities matter at all. Concretely, I think we ought to do the attached, or something pretty close to it. I'm not really happy about breaking ties based on walsnd_index, but I see that there are several TAP test cases that fail if we do something else. I'm inclined to think those tests are bogus ... but I won't argue to change them right now. regards, tom lane diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c index ffd5b31..c66c371 100644 --- a/src/backend/replication/syncrep.c +++ b/src/backend/replication/syncrep.c @@ -108,14 +108,17 @@ static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr, XLogRecPtr *applyPtr, - List *sync_standbys); + SyncRepStandbyData *sync_standbys, + int num_standbys); static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr, XLogRecPtr *applyPtr, - List *sync_standbys, uint8 nth); + SyncRepStandbyData *sync_standbys, + int num_standbys, + uint8 nth); static int SyncRepGetStandbyPriority(void); -static List *SyncRepGetSyncStandbysPriority(bool *am_sync); -static List *SyncRepGetSyncStandbysQuorum(bool *am_sync); +static void SyncRepGetSyncStandbysPriority(SyncRepStandbyData *standbys, int n); +static int standby_priority_comparator(const void *a, const void *b); static int cmp_lsn(const void *a, const void *b); #ifdef USE_ASSERT_CHECKING @@ -406,9 +409,10 @@ SyncRepInitConfig(void) priority = SyncRepGetStandbyPriority(); if (MyWalSnd->sync_standby_priority != priority) { - LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); + SpinLockAcquire(&MyWalSnd->mutex); MyWalSnd->sync_standby_priority = priority; - LWLockRelease(SyncRepLock); + SpinLockRelease(&MyWalSnd->mutex); + ereport(DEBUG1, (errmsg("standby \"%s\" now has synchronous standby priority %u", application_name, priority))); @@ -523,8 +527,6 @@ SyncRepReleaseWaiters(void) /* * Calculate the synced Write, Flush and Apply positions among sync standbys. * - * The caller must hold SyncRepLock. - * * Return false if the number of sync standbys is less than * synchronous_standby_names specifies. Otherwise return true and * store the positions into *writePtr, *flushPtr and *applyPtr. @@ -536,27 +538,43 @@ static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr, XLogRecPtr *applyPtr, bool *am_sync) { - List *sync_standbys; - - Assert(LWLockHeldByMe(SyncRepLock)); + SyncRepStandbyData *sync_standbys; + int num_standbys; + int i; + /* Initialize default results */ *writePtr = InvalidXLogRecPtr; *flushPtr = InvalidXLogRecPtr; *applyPtr = InvalidXLogRecPtr; *am_sync = false; + /* Quick out if not even configured to be synchronous */ + if (SyncRepConfig == NULL) + return false; + /* Get standbys that are considered as synchronous at this moment */ - sync_standbys = SyncRepGetSyncStandbys(am_sync); + num_standbys = SyncRepGetSyncStandbys(&sync_standbys); + + /* Am I among the candidate sync standbys? 
*/ + for (i = 0; i < num_standbys; i++) + { + if (sync_standbys[i].is_me) + { + *am_sync = sync_standbys[i].is_sync_standby; + break; + } + } /* - * Quick exit if we are not managing a sync standby or there are not - * enough synchronous standbys. + * Nothing more to do if we are not managing a sync standby or there are + * not enough synchronous standbys. (Note: if there are least num_sync + * candidates, then at least num_sync of them will be marked as + * is_sync_standby; we don't need to count them here.) */ if (!(*am_sync) || - SyncRepConfig == NULL || - list_length(sync_standbys) < SyncRepConfig->num_sync) + num_standbys < SyncRepConfig->num_sync) { - list_free(sync_standbys); + pfree(sync_standbys); return false; } @@ -576,15 +594,16 @@ SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr, if (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY) { SyncRepGetOldestSyncRecPtr(writePtr, flushPtr, applyPtr, - sync_standbys); + sync_standbys, num_standbys); } else { SyncRepGetNthLatestSyncRecPtr(writePtr, flushPtr, applyPtr, - sync_standbys, SyncRepConfig->num_sync); + sync_standbys, num_standbys, + S
Re: documenting the backup manifest file format
On 4/14/20 4:09 PM, Alvaro Herrera wrote: > On 2020-Apr-14, Andrew Dunstan wrote: > >> OK, but I think if we're putting a timestamp string in ISO-8601 format >> in the manifest it should be in UTC / Zulu time, precisely to avoid >> these issues. If that's too much trouble then yes an epoch time will >> probably do. > The timestamp is always specified and always UTC (except the code calls > it GMT). > > + /* > +* Convert last modification time to a string and append it to the > +* manifest. Since it's not clear what time zone to use and since time > +* zone definitions can change, possibly causing confusion, use GMT > +* always. > +*/ > + appendStringInfoString(&buf, "\"Last-Modified\": \""); > + enlargeStringInfo(&buf, 128); > + buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%d %H:%M:%S %Z", > + pg_gmtime(&mtime)); > + appendStringInfoString(&buf, "\""); > > I was merely saying that it's trivial to make this iso-8601 compliant as > > buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%dT%H:%M:%SZ", > > ie. omit the "GMT" string and replace it with a literal Z, and remove > the space and replace it with a T. > +1 cheers andre -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: documenting the backup manifest file format
On 4/14/20 4:11 PM, Alvaro Herrera wrote: On 2020-Apr-14, David Steele wrote: Happily ISO-8601 is always UTC. Uh, it is not -- https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators Whoops, you are correct. I've just never seen non-UTC in the wild yet. -- -David da...@pgmasters.net
Re: cleaning perl code
On 2020-Apr-14, Andrew Dunstan wrote: > One of the things that's a bit sad is that perlcritic doesn't generally > let you apply policies to a given set of files or files matching some > pattern. It would be nice, for instance, to be able to apply some > additional standards to strategic library files like PostgresNode.pm, > TestLib.pm and Catalog.pm. There are good reasons as suggested upthread > to apply higher standards to library files than to, say, a TAP test > script. The only easy way I can see to do that would be to have two > different perlcriticrc files and adjust pgperlcritic to make two runs. > If people think that's worth it I'll put a little work into it. If not, > I'll just leave things here. I think being more strict about it in strategic files (I'd say that's Catalog.pm plus src/test/perl/*.pm) might be a good idea. Maybe give it a try and see what comes up. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WIP/PoC for parallel backup
On Tue, Apr 14, 2020 at 10:37 AM Asif Rehman wrote: > I forgot to make a check for no-manifest. Fixed. Attached is the updated > patch. +typedef struct +{ ... +} BackupFile; + +typedef struct +{ ... +} BackupState; These structures need comments. +list_wal_files_opt_list: + SCONST SCONST { - $$ = makeDefElem("manifest_checksums", - (Node *)makeString($2), -1); + $$ = list_make2( + makeDefElem("start_wal_location", + (Node *)makeString($2), -1), + makeDefElem("end_wal_location", + (Node *)makeString($2), -1)); + } This seems like an unnecessarily complicated parse representation. The DefElems seem to be completely unnecessary here. @@ -998,7 +1110,37 @@ SendBaseBackup(BaseBackupCmd *cmd) set_ps_display(activitymsg); } - perform_base_backup(&opt); + switch (cmd->cmdtag) So the design here is that SendBaseBackup() is now going to do a bunch of things that are NOT sending a base backup? With no updates to the comments of that function and no change to the process title it sets? - return (manifest->buffile != NULL); + return (manifest && manifest->buffile != NULL); Heck no. It appears that you didn't even bother reading the function header comment. + * Send a single resultset containing XLogRecPtr record (in text format) + * TimelineID and backup label. */ static void -SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli) +SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli, +StringInfo label, char *backupid) This just casually breaks wire protocol compatibility, which seems completely unacceptable. + if (strlen(opt->tablespace) > 0) + sendTablespace(opt->tablespace, NULL, true, NULL, &files); + else + sendDir(".", 1, true, NIL, true, NULL, NULL, &files); + + SendFilesHeader(files); So I guess the idea here is that we buffer the entire list of files in memory, regardless of size, and then we send it out afterwards. That doesn't seem like a good idea. The list of files might be very large. We probably need some code refactoring here rather than just piling more and more different responsibilities onto sendTablespace() and sendDir(). + if (state->parallel_mode) + SpinLockAcquire(&state->lock); + + state->throttling_counter += increment; + + if (state->parallel_mode) + SpinLockRelease(&state->lock); I don't like this much. It seems to me that we would do better to use atomics here all the time, instead of conditional spinlocks. +static void +send_file(basebackup_options *opt, char *file, bool missing_ok) ... + if (file == NULL) + return; That seems totally inappropriate. + sendFile(file, file + basepathlen, &statbuf, true, InvalidOid, NULL, NULL); Maybe I'm misunderstanding, but this looks like it's going to write a tar header, even though we're not writing a tarfile. + else + ereport(WARNING, + (errmsg("skipping special file or directory \"%s\"", file))); So, if the user asks for a directory or symlink, what's going to happen is that they're going to receive an empty file, and get a warning. That sounds like terrible behavior. + /* +* Check for checksum failures. If there are failures across multiple +* processes it may not report total checksum count, but it will error +* out,terminating the backup. +*/ In other words, the patch breaks the feature. Not that the feature in question works particularly well as things stand, but this makes it worse. I think this patch (0003) is in really bad shape. 
I'm having second thoughts about the design, but it's kind of hard to even have a discussion about the design when the patch is riddled with minor problems like inadequate comments, failure to update existing comments, and breaking a bunch of things. I understand that sometimes things get missed, but this is version 14 of a patch that's been kicking around since last August. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
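As one illustration of the atomics point above (struct and field names invented, not from the patch), the throttling counter could be a pg_atomic_uint64 that is updated identically in serial and parallel mode, with no conditional spinlock:

#include "postgres.h"
#include "port/atomics.h"

typedef struct ThrottlingState
{
	pg_atomic_uint64 throttling_counter;
} ThrottlingState;

/* Call once while setting up the (possibly shared) state. */
static void
throttling_state_init(ThrottlingState *state)
{
	pg_atomic_init_u64(&state->throttling_counter, 0);
}

/* Safe whether or not the backup is running in parallel mode. */
static void
throttling_counter_add(ThrottlingState *state, uint64 increment)
{
	pg_atomic_fetch_add_u64(&state->throttling_counter, increment);
}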
Re: ERROR: could not determine which collation to use for string comparison
Andreas Joseph Krogh writes: > Guys; This errors out with: > ERROR: could not determine which collation to use for string comparison > HINT: Use the COLLATE clause to set the collation explicitly. Fixed, thanks for the report. regards, tom lane
Re: Poll: are people okay with function/operator table redesign?
On Mon, Apr 13, 2020 at 10:13 AM Tom Lane wrote: > As discussed in the thread at [1], I've been working on redesigning > the tables we use to present SQL functions and operators. The > first installment of that is now up; see tables 9.30 and 9.31 at > > https://www.postgresql.org/docs/devel/functions-datetime.html > > and table 9.33 at > > https://www.postgresql.org/docs/devel/functions-enum.html > > As I write this the enum headers are centered horizontally while the datetime ones are left aligned. The centering doesn't do it for me. Too much gap, and the data itself is not centered, so there is a large disconnect between the header and the value. The run-on aspect of the left-aligned setup is of some concern but maybe just adding some left padding to the second column - and right padding to the first - can provide the desired negative space without adding so much as to break usability. (gonna use embedded images here...) [image: image.png] [image: image.png] David J.
Re: wrong relkind error messages
On 2020-Apr-14, Michael Paquier wrote: > On Mon, Apr 13, 2020 at 11:13:15AM -0400, Tom Lane wrote: > > ERROR: cannot define statistics for relation "ti" > > DETAIL: This operation is not supported for indexes. > > > > which still leaves implicit that "ti" is an index, but probably that's > > something the user can figure out. > > > > Maybe someone else can do better? > > "This operation is not supported for put_relkind_here \"%s\"."? I > think that it is better to provide a relation name in the error > message (even optionally a namespace). That's less to guess for the > user. But the relation name is already in the ERROR line -- why do you care so much about also having it in the DETAIL? Besides, I think part of the point Tom was making is that if you say "not supported for the index foo" is that the user is left wondering whether the operation is not supported for that particular index only or for any index. Tom's other proposal > > DETAIL: "ti" is an index, and this operation is not supported for that > > kind of relation. addresses that problem, but seems excessively verbose. Also, elsewhere Peter said[1] that we should not try to list the things that would be allowed, so it's pointless to try to list the relkinds for which the operation is permissible. So I +1 this idea: ERROR: cannot define statistics for relation "ti" DETAIL: This operation is not supported for indexes. [1] https://www.postgresql.org/message-id/1293803569.19789.6.camel%40fsopti579.F-Secure.com -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_basebackup, manifests and backends older than ~12
On Tue, Apr 14, 2020 at 03:13:39PM -0400, Robert Haas wrote: > Feel free to go ahead. Thanks, let's do it then. If you have any objections about any parts of the patch, of course please feel free. -- Michael
Re: pg_basebackup, manifests and backends older than ~12
On Mon, Apr 13, 2020 at 07:04:20PM -0400, Alvaro Herrera wrote: > I agree, I think forcing users to specify --no-manifest when run on old > servers will cause users to write bad scripts; I vote for silently > disabling checksums. Okay, thanks. Are there any other opinions? -- Michael
Re: wrong relkind error messages
On 2020-Apr-13, Robert Haas wrote: > + ereport(ERROR, > + (errcode(ERRCODE_WRONG_OBJECT_TYPE), > + errmsg("action cannot be performed on relation \"%s\"", > + RelationGetRelationName(rel)), > > Super-vague. Maybe, but note that the patch proposed to replace this current error message: ERROR: foo is not an index or foreign table with ERROR: action cannot be performed on "foo" DETAIL: "foo" is a materialized view. or, if we're to adopt Tom's proposed wording, ERROR: cannot perform action on relation "ti" DETAIL: This operation is not supported for materialized views. so it's not like this is making things any worse; the error was already super-vague. This could be improved if we had stringification of ALTER TABLE subcommand types: ERROR: ALTER TABLE ... ADD COLUMN cannot be performed on "foo" DETAIL: "foo" is a gummy bear. or ERROR: ALTER TABLE ... ADD COLUMN cannot be performed on foo DETAIL: This action cannot be performed on gummy bears. but that seems material for a different patch. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_basebackup, manifests and backends older than ~12
Michael Paquier writes: > On Mon, Apr 13, 2020 at 07:04:20PM -0400, Alvaro Herrera wrote: >> I agree, I think forcing users to specify --no-manifest when run on old >> servers will cause users to write bad scripts; I vote for silently >> disabling checksums. > Okay, thanks. Are there any other opinions? FWIW, I concur with silently disabling the feature if the source server can't support it. regards, tom lane
Re: wrong relkind error messages
Alvaro Herrera writes: > On 2020-Apr-13, Robert Haas wrote: >> + ereport(ERROR, >> + (errcode(ERRCODE_WRONG_OBJECT_TYPE), >> + errmsg("action cannot be performed on relation \"%s\"", >> + RelationGetRelationName(rel)), >> >> Super-vague. > Maybe, but note that the patch proposed to replace this current error > message: > ERROR: foo is not an index or foreign table > ... > so it's not like this is making things any worse; the error was already > super-vague. Yeah. I share Robert's feeling that "action" is not really desirable here, but I have to concur that this is an improvement on the existing text, which also fails to mention what command is being rejected. > This could be improved if we had stringification of ALTER TABLE > subcommand types: > ERROR: ALTER TABLE ... ADD COLUMN cannot be performed on "foo" In the meantime could we at least say "ALTER TABLE action cannot be performed"? The worst aspect of the existing text is that if an error comes out of a script with a lot of different commands, it doesn't give you any hint at all about which command failed. regards, tom lane
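Pulling the subthread together, the shape being converged on would look roughly like this (illustrative only; relkind_phrase() is a hypothetical helper mapping a relkind to wording such as "indexes" or "materialized views", and no such helper exists yet):

#include "postgres.h"
#include "utils/rel.h"

/* hypothetical: returns e.g. "indexes" for RELKIND_INDEX */
extern const char *relkind_phrase(char relkind);

static void
reject_wrong_relkind(Relation rel)
{
	ereport(ERROR,
			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
			 errmsg("ALTER TABLE action cannot be performed on relation \"%s\"",
					RelationGetRelationName(rel)),
			 errdetail("This operation is not supported for %s.",
					   relkind_phrase(rel->rd_rel->relkind))));
}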
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Hi David: Thanks for your time. > 1. Out of date comment in join.sql > > -- join removal is not possible when the GROUP BY contains a column that is > -- not in the join condition. (Note: as of 9.6, we notice that b.id is a > -- primary key and so drop b.c_id from the GROUP BY of the resulting plan; > -- but this happens too late for join removal in the outer plan level.) > explain (costs off) > select d.* from d left join (select d, c_id from b group by b.d, b.c_id) s > on d.a = s.d; > > You've changed the GROUP BY clause so it does not include b.id, so the > Note in the comment is now misleading. > Thanks, I will fix this one in the following patch. > > 2. I think 0002 is overly restrictive in its demands that > parse->hasAggs must be false. We should be able to just use a Group > Aggregate with unsorted input when the input_rel is unique on the > GROUP BY clause. This will save on hashing and sorting. Basically > similar to what we do for when a query contains aggregates without any > GROUP BY. > > Yes, This will be a perfect result, the difficult is the current aggregation function execution is highly coupled with Agg node(ExecInitAgg) which is removed in the unique case. I ever make the sum (w/o finalfn) and avg(with finalfn) works in a hack way, but still many stuffs is not handled. Let me prepare the code for this purpose in 1~2 days to see if I'm going with the right direction. Ashutosh also has an idea[1] that if the relation underlying an Agg node is known to be unique for given groupByClause, we could safely use AGG_SORTED strategy. Though the input is not ordered, it's sorted thus for every row Agg node will combine/finalize the aggregate result. I will target the perfect result first and see how many effort do we need, if not, I will try Ashutosh's suggestion. > 3. I don't quite understand why you changed this to a right join: > > -- Test case where t1 can be optimized but not t2 > explain (costs off) select t1.*,t2.x,t2.z > -from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y > +from t1 right join t2 on t1.a = t2.x and t1.b = t2.y > > Perhaps this change is left over from some previous version of the patch? > This is on purpose. the original test case is used to test we can short the group key for t1 but not t2 for aggregation, but if I keep the inner join, the aggnode will be removed totally, so I have to change it to right join in order to keep the aggnode. The full test case is: -- Test case where t1 can be optimized but not t2 explain (costs off) select t1.*,t2.x,t2.z from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z; where (a, b) is the primary key of t1. [1] https://www.postgresql.org/message-id/CAExHW5sY%2BL6iZ%3DrwnL7n3jET7aNLCNQimvfcS7C%2B5wmdjmdPiw%40mail.gmail.com
Re: Perl modules for testing/viewing/corrupting/repairing your heap files
Not having received any feedback on this, I've dusted the modules off for submission as-is. v1-0001-Adding-HeapFile-related-perl-modules.patch Description: Binary data — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Perl modules for testing/viewing/corrupting/repairing your heap files
On Wed, Apr 8, 2020 at 3:51 PM Mark Dilger wrote: > Recently, as part of testing something else, I had need of a tool to create > surgically precise corruption within heap pages. I wanted to make the > corruption from within TAP tests, so I wrote the tool as a set of perl > modules. There is also pg_hexedit: https://github.com/petergeoghegan/pg_hexedit -- Peter Geoghegan
Re: Race condition in SyncRepGetSyncStandbysPriority
At Tue, 14 Apr 2020 09:52:42 -0400, Tom Lane wrote in > Kyotaro Horiguchi writes: > > SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so > > sync_standby_priority of any walsender can be changed while the > > function is scanning welsenders. The issue is we already have > > inconsistent walsender information before we enter the function. Thus > > how many times we scan on the array doesn't make any difference. > > *Yes it does*. The existing code can deliver entirely broken results > if some walsender exits between where we examine the priorities and > where we fetch the WAL pointers. While that doesn't seem to be the Ah. I didn't take that as inconsistency. Actually walsender exit inactivate the corresponding slot by setting pid = 0. In a bad case (as you mentioned upthread) the entry can be occupied by another wal sender. However, sync_standby_priority cannot be updated until the whole work is finished. > exact issue we're seeing in the buildfarm, it's still another obvious > bug in this code. I will not accept a "fix" that doesn't fix that. I think that the "inconsistency" that can be observed in a process is disagreement between SyncRepConfig->nmembers and ->sync_standby_priority. If any one of walsenders regards its priority as lower (larger in value) than nmembers in the "current" process, the assertion fires. If that is the issue, the issue is not dynamic inconsistency. # It's the assumption of my band-aid. > > I think we need to do one of the followings. > > > A) prevent SyncRepGetSyncStandbysPriority from being entered while > > walsender priority is inconsistent. > > B) make SyncRepGetSyncStandbysPriority be tolerant of priority > > inconsistency. > > C) protect walsender priority array from beinig inconsistent. > > (B) seems like the only practical solution from here. We could > probably arrange for synchronous update of the priorities when > they change in response to a GUC change, but it doesn't seem to > me to be practical to do that in response to walsender exit. > You'd end up finding that an unexpected walsender exit results > in panic'ing the system, which is no better than where we are now. I agree to you as a whole. I thought of several ways to keep the consistency of the array but all of them looked too much. > It doesn't seem to me to be that hard to implement the desired > semantics for synchronous_standby_names with inconsistent info. > In FIRST mode you basically just need to take the N smallest > priorities you see in the array, but without assuming there are no > duplicates or holes. It might be a good idea to include ties at the > end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync > standbys, include the first three of them in the calculation until > the inconsistency is resolved. In ANY mode I don't see that > inconsistent priorities matter at all. Mmm, the priority lists like 1,2,2,4 are not thought as inconsistency at all in the context of walsender priority. That happenes stablly if any two or more walreceivers reported the same application_name. I believe the existing code is already taking that case into consideration. > > If we accept to share variable-length information among processes, > > sharing sync_standby_names or parsed SyncRepConfigData among processes > > would work. > > Not sure that we really need more than what's being shared now, > ie each process's last-known index in the sync_standby_names list. If we take the (B), we don't need anymore. (A) and (C) would need more. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center
remove_useless_groupby_columns does not need to record constraint dependencies
Hi, Over in [1], Tom and I had a discussion in response to some confusion about why remove_useless_groupby_columns() goes to the trouble of recording a dependency on the PRIMARY KEY constraint when removing surplus columns from the GROUP BY clause. The outcome was that we don't need to do this since remove_useless_groupby_columns() is used only as a plan-time optimisation, we don't need to record any dependency. Unlike check_functional_grouping(), where we must record the dependency as we may end up with a VIEW with columns, e.g, in the select list which are functionally dependant on a pkey constraint. In that case, we must ensure the view is also removed, or that the constraint removal is blocked. There's no such requirement for planner smarts, such as the one in remove_useless_groupby_columns() as in that case we'll trigger a relcache invalidation during ALTER TABLE DROP CONSTRAINT, which cached plans will notice when they obtain their locks just before execution begins. To prevent future confusion, I'd like to remove dependency recording code from remove_useless_groupby_columns() and update the misleading comment. Likely this should also be backpatched to 9.6. Does anyone think differently? A patch to do this is attached. [1] https://www.postgresql.org/message-id/caaphdvr4ow_oud_rxp0d1hrgz+a4mm8+8ur7qom2vqkfx08...@mail.gmail.com fix_remove_useless_groupby_columns.patch Description: Binary data
Re: where should I stick that backup?
Hi, On 2020-04-14 11:38:03 -0400, Robert Haas wrote: > I'm fairly deeply uncomfortable with what Andres is proposing. I see > that it's very powerful, and can do a lot of things, and that if > you're building something that does sophisticated things with storage, > you probably want an API like that. It does a great job making > complicated things possible. However, I feel that it does a lousy job > making simple things simple. I think it's pretty much exactly the opposite. Your approach seems to move all the complexity to the user, having to build entire combination of commands themselves. Instead of having one or two default commands that do backups in common situations, everyone has to assemble them from pieces. Moved from later in your email, since it seems to make more sense to have it here: > All they're going to see is that they can use gzip and maybe lz4 > because we provide the necessary special magic tools to integrate with > those, but for some reason we don't have a special magic tool that > they can use with their own favorite compressor, and so they can't use > it. I think people are going to find that fairly unhelpful. I have no problem with providing people with the opportunity to use their personal favorite compressor, but forcing them to have to do that, and to ensure it's installed etc, strikes me as a spectacurly bad default situation. Most people don't have the time to research which compression algorithms work the best for which precise situation. How do you imagine a default scripted invocation of the new backup stuff to look like? Having to specify multiple commandline "fragments" for compression, storing files, ... can't be what we want the common case should look like. It'll just again lead to everyone copy & pasting examples that all are wrong in different ways. They'll not at all work across platforms (or often not across OS versions). In general, I think it's good to give expert users the ability to customize things like backups and archiving. But defaulting to every non-expert user having to all that expert work (or coyping it from bad examples) is one of the most user hostile things in postgres. > Also, I don't really see what's wrong with the server forking > processes that exec("/usr/bin/lz4") or whatever. We do similar things > in other places and, while it won't work for cases where you want to > compress a shazillion files, that's not really a problem here anyway. > At least at the moment, the server-side format is *always* tar, so the > problem of needing a separate subprocess for every file in the data > directory does not arise. I really really don't understand this. Are you suggesting that for server side compression etc we're going to add the ability to specify shell commands as argument to the base backup command? That seems so obviously a non-starter? A good default for backup configurations should be that the PG user that the backup is done under is only allowed to do that, and not that it directly has arbitrary remote command execution. > Suppose you want to compress using your favorite compression > program. Well, you can't. Your favorite compression program doesn't > speak the bespoke PostgreSQL protocol required for backup > plugins. Neither does your favorite encryption program. Either would > be perfectly happy to accept a tarfile on stdin and dump out a > compressed or encrypted version, as the case may be, on stdout, but > sorry, no such luck. 
You need a special program that speaks the magic > PostgreSQL protocol but otherwise does pretty much the exact same > thing as the standard one. But the tool speaking the protocol can just allow piping through whatever tool? Given that there likely are benefits to either doing things on the client side or on the server side, it seems inevitable that there are multiple places where it would make sense to have the capability? > It's possibly not the exact same thing. A special tool might, for example, > use multiple threads for parallel compression rather than multiple > processes, perhaps gaining a bit of efficiency. But it's doubtful > whether all users care about such marginal improvements. Marginal improvements? Compression scales decently well with the number of cores. pg_basebackup's compression is useless because it's so slow (and because it's client-side, but that's IME the lesser issue). I feel I must be misunderstanding what you mean here. gzip - vs pigz -p $numcores on my machine: 180MB/s vs 2.5GB/s. The latter will still sometimes be a bottleneck (it's a bottleneck in pigz, not available compression cycles), but a lot less commonly than at 180MB/s. Greetings, Andres Freund
Re: backup manifests
On 2020/04/14 0:15, Robert Haas wrote: On Sun, Apr 12, 2020 at 10:09 PM Fujii Masao wrote: I found other minor issues. I think these are all correct fixes. Thanks for the post-commit review, and sorry for this mistakes. Thanks for the review, Michael and Robert. Pushed the patches! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: snapshot too old issues, first around wraparound and then more.
On Mon, Apr 13, 2020 at 2:58 PM Thomas Munro wrote: > On Fri, Apr 3, 2020 at 2:22 PM Peter Geoghegan wrote: > > I think that it's worth considering whether or not there are a > > significant number of "snapshot too old" users that rarely or never > > rely on old snapshots used by new queries. Kevin said that this > > happens "in some cases", but how many cases? Might it be that many > > "snapshot too old" users could get by with a version of the feature > > that makes the most conservative possible assumptions, totally giving > > up on the idea of differentiating which blocks are truly safe to > > access with an "old" snapshot? (In other words, one that assumes that > > they're *all* unsafe for an "old" snapshot.) > > > > I'm thinking of a version of "snapshot too old" that amounts to a > > statement timeout that gets applied for xmin horizon type purposes in > > the conventional way, while only showing an error to the client if and > > when they access literally any buffer (though not when the relation is > > a system catalog). Is it possible that something along those lines is > > appreciably better than nothing to users? If it is, and if we can find > > a way to manage the transition, then maybe we could tolerate > > supporting this greatly simplified implementation of "snapshot too > > old". > > Interesting idea. I'm keen to try prototyping it to see how well it > works out it practice. Let me know soon if you already have designs > on that and I'll get out of your way, otherwise I'll give it a try and > share what I come up with. Here's a quick and dirty test patch of that idea (or my understanding of it), just for experiments. It introduces snapshot->expire_time and a new timer SNAPSHOT_TIMEOUT to cause the next CHECK_FOR_INTERRUPTS() to set snapshot->too_old on any active or registered snapshots whose time has come, and then try to advance MyPgXact->xmin, without considering the ones marked too old. That gets rid of the concept of "early pruning". You can use just regular pruning, because the snapshot is no longer holding the regular xmin back. Then TestForOldSnapshot() becomes simply if (snapshot->too_old) ereport(...). There are certainly some rough edges, missed details and bugs in here, not least the fact (pointed out to me by Andres in an off-list chat) that we sometimes use short-lived snapshots without registering them; we'd have to fix that. It also does nothing to ensure that TestForOldSnapshot() is actually called at all the right places, which is still required for correct results. If those problems can be fixed, you'd have a situation where snapshot-too-old is a coarse grained, blunt instrument that effectively aborts your transaction even if the whole cluster is read-only. I am not sure if that's really truly useful to anyone (ie if these ODBC cursor users would be satisfied; I'm not sure I understand that use case). Hmm. I suppose it must be possible to put the LSN check back: if (snapshot->too_old && PageGetLSN(page) > snapshot->lsn) ereport(...). Then the granularity would be the same as today -- block level -- but the complexity is transferred from the pruning side (has to deal with xid time map) to the snapshot-owning side (has to deal with timers, CFI() and make sure all snapshots are registered). Maybe not a great deal, and maybe not easier than fixing the existing bugs. One problem is all the new setitimer() syscalls. 
I feel like that could be improved, as could statement_timeout, by letting existing timers run rather than repeatedly rescheduling eagerly, so that e.g. a 1 minute timeout never gets rescheduled more than once per minute. I haven't looked into that, but I guess it's no worse than the existing implementation's overheads anyway. PS: in the patch the GUC is interpreted as milliseconds, which is more fun for testing, but it should really be minutes like before.

From 8a5455c7e376bd6ceddf956f789cfdede0396f3f Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 15 Apr 2020 12:35:51 +1200
Subject: [PATCH] Implement snapshot_too_old using a timer.

Remove the previous implementation of "snapshot too old", and replace it with a very conservative implementation based only on time passing. Use a timer to mark active and registered snapshots as too old and potentially advance the backend's xmin. Snapshots that reach this state can no longer be used to read buffers, because data they need to see may be vacuumed away.

XXX Incomplete experimental code!
---
 src/backend/access/common/toast_internals.c | 12 +-
 src/backend/access/heap/pruneheap.c         |  4 +-
 src/backend/catalog/index.c                 | 15 +-
 src/backend/commands/vacuum.c               |  3 +-
 src/backend/storage/buffer/bufmgr.c         | 17 -
 src/backend/storage/ipc/ipci.c              |  2 -
 src/backend/storage/ipc/procarray.c         | 23 +-
 src/backend/tcop/postgres.c                 |  6 +
 src/backend/utils/init/globals.c            |  1 +
 src/backend/utils/init/postinit.c
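For illustration only (not part of the patch above): a minimal sketch of what the per-buffer check could reduce to under this timer-based design, with the optional LSN test put back in. It assumes the experimental snapshot->too_old field described in the mail, set by the SNAPSHOT_TIMEOUT handler via CHECK_FOR_INTERRUPTS():

    /*
     * Hypothetical sketch, not committed code: reject use of a snapshot that
     * the timer has marked too old, but only if the page has actually been
     * modified since the snapshot was taken and it isn't a catalog page.
     */
    static inline void
    TestForOldSnapshot(Snapshot snapshot, Relation relation, Page page)
    {
        if (old_snapshot_threshold >= 0 &&
            snapshot != NULL &&
            snapshot->too_old &&                /* set by SNAPSHOT_TIMEOUT */
            PageGetLSN(page) > snapshot->lsn &&
            !IsCatalogRelation(relation))
            ereport(ERROR,
                    (errcode(ERRCODE_SNAPSHOT_TOO_OLD),
                     errmsg("snapshot too old")));
    }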
Re: Race condition in SyncRepGetSyncStandbysPriority
At Tue, 14 Apr 2020 16:32:40 -0400, Tom Lane wrote in > I wrote: > > It doesn't seem to me to be that hard to implement the desired > > semantics for synchronous_standby_names with inconsistent info. > > In FIRST mode you basically just need to take the N smallest > > priorities you see in the array, but without assuming there are no > > duplicates or holes. It might be a good idea to include ties at the > > end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync > > standbys, include the first three of them in the calculation until > > the inconsistency is resolved. In ANY mode I don't see that > > inconsistent priorities matter at all. > > Concretely, I think we ought to do the attached, or something pretty > close to it.

Looking at SyncRepGetSyncStandbys, I agree that it's good not to assume lowest_priority, which I thought was the culprit of the assertion failure. The current code intends to use less memory. I don't think there is a case where only 3 out of 1000 standbys are required to be sync standbys, so collecting all walsenders and then sorting them seems a reasonable strategy. The new code looks clearer.

+ stby->is_sync_standby = true; /* might change below */

I'm uneasy with that. In quorum mode all running standbys are marked as "sync" and that's bogus. The only users of the flag seem to be:

SyncRepGetSyncRecPtr:
+ *am_sync = sync_standbys[i].is_sync_standby;

and SyncRepGetOldestSyncRecPtr:
+ /* Ignore candidates that aren't considered synchronous */
+ if (!sync_standbys[i].is_sync_standby)
+     continue;

On the other hand, sync_standbys is already sorted in priority order, so I think we can get rid of the member by setting *am_sync as follows.

SyncRepGetSyncRecPtr:
    if (sync_standbys[i].is_me)
    {
        *am_sync = (i < SyncRepConfig->num_sync);
        break;
    }

And the second user can be as follows.

SyncRepGetOldestSyncRecPtr:
    /* Ignore candidates that aren't considered synchronous */
    if (i >= SyncRepConfig->num_sync)
        break;

> I'm not really happy about breaking ties based on walsnd_index, > but I see that there are several TAP test cases that fail if we > do something else. I'm inclined to think those tests are bogus ... > but I won't argue to change them right now.

Agreed about the tie-breaker. I'm looking at this more closely. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Wed, 15 Apr 2020 at 12:19, Andy Fan wrote: > >> 2. I think 0002 is overly restrictive in its demands that >> parse->hasAggs must be false. We should be able to just use a Group >> Aggregate with unsorted input when the input_rel is unique on the >> GROUP BY clause. This will save on hashing and sorting. Basically >> similar to what we do for when a query contains aggregates without any >> GROUP BY. >> > > Yes, This will be a perfect result, the difficult is the current > aggregation function > execution is highly coupled with Agg node(ExecInitAgg) which is removed in the > unique case.

This case here would be slightly different. It would be handled by still creating a Group Aggregate path, but just not considering Hash Aggregate and not sorting the input to the Group Aggregate path. Perhaps that's best done by creating a new flag bit and using it in create_grouping_paths() in the location where we set the flags variable. If you determine that the input_rel is unique for the GROUP BY clause, and that there are aggregate functions, then set a flag, e.g. GROUPING_INPUT_UNIQUE. Likely there will be a few other flags that you can skip setting in that function; for example, there's no need to check if the input can sort, so no need for GROUPING_CAN_USE_SORT, since you won't need to sort, and likewise for GROUPING_CAN_USE_HASH. I'd say there is also no need for checking if we can set GROUPING_CAN_PARTIAL_AGG (what would be the point in doing partial aggregation when there's 1 row per group?).

Then down in add_paths_to_grouping_rel(), just add a special case before doing any other code, such as:

    if ((extra->flags & GROUPING_INPUT_UNIQUE) != 0 &&
        parse->groupClause != NIL)
    {
        Path   *path = input_rel->cheapest_total_path;

        add_path(grouped_rel, (Path *)
                 create_agg_path(root, grouped_rel, path,
                                 grouped_rel->reltarget,
                                 AGG_SORTED, AGGSPLIT_SIMPLE,
                                 parse->groupClause, havingQual,
                                 agg_costs, dNumGroups));
        return;
    }

You may also want to consider the cheapest startup path there too, so that the LIMIT processing can do something smarter later in planning (assuming cheapest_total_path != cheapest_startup_path, which you'd need to check for). Perhaps it would be better to only set GROUPING_INPUT_UNIQUE if there is a groupClause, then just Assert(parse->groupClause != NIL) inside that if. David
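For illustration only, here is a minimal sketch of the flag-setting side described above. The flag's bit value and the helper input_rel_is_unique_for_groupclause() are hypothetical stand-ins for whatever test the UniqueKey infrastructure provides; they are not existing core symbols:

    /* Hypothetical sketch: inside create_grouping_paths(), where 'flags' is built. */
    #define GROUPING_INPUT_UNIQUE   0x0010      /* illustrative bit value only */

    if (parse->groupClause != NIL && parse->hasAggs &&
        input_rel_is_unique_for_groupclause(root, input_rel, parse->groupClause))
    {
        /* One row per group: no sorting, hashing or partial aggregation needed. */
        flags |= GROUPING_INPUT_UNIQUE;
    }
    else
    {
        /* Existing logic: decide GROUPING_CAN_USE_SORT / _HASH / _PARTIAL_AGG. */
        if (grouping_is_sortable(parse->groupClause))
            flags |= GROUPING_CAN_USE_SORT;
        if (grouping_is_hashable(parse->groupClause))
            flags |= GROUPING_CAN_USE_HASH;
    }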
Re: Race condition in SyncRepGetSyncStandbysPriority
On Tue, Apr 14, 2020 at 04:32:40PM -0400, Tom Lane wrote: > I wrote: > > It doesn't seem to me to be that hard to implement the desired > > semantics for synchronous_standby_names with inconsistent info. > > In FIRST mode you basically just need to take the N smallest > > priorities you see in the array, but without assuming there are no > > duplicates or holes. It might be a good idea to include ties at the > > end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync > > standbys, include the first three of them in the calculation until > > the inconsistency is resolved. In ANY mode I don't see that > > inconsistent priorities matter at all. > > Concretely, I think we ought to do the attached, or something pretty > close to it. > > I'm not really happy about breaking ties based on walsnd_index, > but I see that there are several TAP test cases that fail if we > do something else. I'm inclined to think those tests are bogus ... > but I won't argue to change them right now. This passes the test battery I wrote in preparation for the 2020-02 thread.
Re: Vacuum o/p with (full 1, parallel 0) option throwing an error
On Tue, Apr 14, 2020 at 8:55 AM Amit Kapila wrote: > > On Tue, Apr 14, 2020 at 7:52 AM Michael Paquier wrote: > > > > > -VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option > > +VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp tables > > WARNING: disabling parallel option of vacuum on "tmp" --- cannot vacuum > > temporary tables in parallel > > +VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel disabled (even > > though that's implied by FULL) > > > > To fully close the gap in the tests, I would also add a test for > > (PARALLEL 1, FULL false) where FULL directly specified, even if that > > sounds like a nit. That's fine to test even on a temporary table. > > > > Okay, I will do this once we agree on the error message stuff. > I have changed one of the existing tests to test the option suggested by you. Additionally, I have changed the docs as per the suggestion from Sawada-san. I haven't changed the error message. Let me know if you have any more comments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

v5-0001-Fix-the-usage-of-parallel-and-full-options-of-vac.patch
Re: Vacuum o/p with (full 1, parallel 0) option throwing an error
On Wed, Apr 15, 2020 at 08:54:17AM +0530, Amit Kapila wrote: > On Tue, Apr 14, 2020 at 8:55 AM Amit Kapila wrote: > > > > On Tue, Apr 14, 2020 at 7:52 AM Michael Paquier wrote: > > > > > > > > -VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option > > > +VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp tables > > > WARNING: disabling parallel option of vacuum on "tmp" --- cannot > > > vacuum temporary tables in parallel > > > +VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel disabled > > > (even though that's implied by FULL) > > > > > > To fully close the gap in the tests, I would also add a test for > > > (PARALLEL 1, FULL false) where FULL directly specified, even if that > > > sounds like a nit. That's fine to test even on a temporary table. > > > > > > > Okay, I will do this once we agree on the error message stuff. > > > > I have changed one of the existing tests to test the option suggested > by you. Additionally, I have changed the docs as per suggestion from > Sawada-san. I haven't changed the error message. Let me know if you > have any more comments? You did: |...then the number of workers is determined based on the number of |indexes that support parallel vacuum operation on the [-relation,-]{+relation+} and is further |limited by . I'd suggest to say instead: |...then the number of workers is determined based on the number of |indexes ON THE RELATION that support parallel vacuum operation, and is further |limited by . -- Justin
Re: Vacuum o/p with (full 1, parallel 0) option throwing an error
On Wed, Apr 15, 2020 at 9:03 AM Justin Pryzby wrote: > > On Wed, Apr 15, 2020 at 08:54:17AM +0530, Amit Kapila wrote: > > On Tue, Apr 14, 2020 at 8:55 AM Amit Kapila wrote: > > > > > > On Tue, Apr 14, 2020 at 7:52 AM Michael Paquier > > > wrote: > > > > > > > > > > > -VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option > > > > +VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp tables > > > > WARNING: disabling parallel option of vacuum on "tmp" --- cannot > > > > vacuum temporary tables in parallel > > > > +VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel disabled > > > > (even though that's implied by FULL) > > > > > > > > To fully close the gap in the tests, I would also add a test for > > > > (PARALLEL 1, FULL false) where FULL directly specified, even if that > > > > sounds like a nit. That's fine to test even on a temporary table. > > > > > > > > > > Okay, I will do this once we agree on the error message stuff. > > > > > > > I have changed one of the existing tests to test the option suggested > > by you. Additionally, I have changed the docs as per suggestion from > > Sawada-san. I haven't changed the error message. Let me know if you > > have any more comments? > > You did: > |...then the number of workers is determined based on the number of > |indexes that support parallel vacuum operation on the > [-relation,-]{+relation+} and is further > |limited by . > > I'd suggest to say instead: > |...then the number of workers is determined based on the number of > |indexes ON THE RELATION that support parallel vacuum operation, and is > further > |limited by . > I have not changed this now, but I find your suggestion better than the existing wording. I'll change this before committing the patch unless there are more comments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Properly mark NULL returns in numeric aggregates
On Wed, 15 Apr 2020 at 03:41, Tom Lane wrote: > > David Rowley writes: > > For testing, can't we just have an Assert() in > > advance_transition_function that verifies isnull matches the > > nullability of the return value for INTERNAL returning transfns? i.e, > > the attached > > FTR, I do not like this Assert one bit. nodeAgg.c has NO business > inquiring into the contents of internal-type Datums. It has even > less business enforcing a particular Datum value for a SQL null --- > we have always, throughout the system, considered that if isnull > is true then the contents of the Datum are unspecified. I think > this is much more likely to cause problems than solve any.

OK. The latter case could be ignored by adding an OR condition to the Assert to allow isnull == false cases to pass without any consideration of the Datum value, but it sounds like you don't want to insist that isnull == true returns a NULL pointer.

FWIW, I agree with Jesse that having numeric_combine() return a NULL pointer without properly setting the isnull flag is pretty bad and it should be fixed regardless. Not fixing it, even in the absence of having a good way to test it, just seems like we're leaving something around that we're going to trip up on in the future. Serialization functions crashing after receiving input from a combine function seems pretty busted to me, regardless of whether there is a pathway for the functions to be called in that order in core or not. I'm not a fan of leaving it in just because testing for it might not be easy.

One problem with coming up with a way of testing this at the SQL level will be that we'll need to pick some aggregate functions that currently have this issue and ensure they don't regress. There's not much we can do to ensure any new aggregates we might create in the future don't go and break this rule. That's why I thought that the Assert might be more useful.

I don't think it would be impossible to test this using an extension and the create_upper_paths_hook. I see that test_rls_hooks, which runs during make check-world, does hook into the RLS hooks to test some behaviour. I don't think it would be too tricky to have a hook implement a 3-stage aggregate plan with the middle stage doing a deserial/combine/serial before passing to the Finalize Aggregate node. That would allow us to ensure serial functions can accept the results from combine functions, which nothing in core currently does. David
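For reference, a rough sketch of the kind of Assert being discussed, as it might sit just after the transition function call in nodeAgg.c's advance_transition_function(); this was never committed, and the field names here are approximate:

    newVal = FunctionCallInvoke(fcinfo);

    /*
     * Hypothetical check, not committed code: for a transition function
     * declared to return "internal", a NULL-pointer result Datum and
     * fcinfo->isnull must agree with each other.
     */
    Assert(pertrans->aggtranstype != INTERNALOID ||
           fcinfo->isnull == (DatumGetPointer(newVal) == NULL));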
Re: documenting the backup manifest file format
On 2020/04/14 2:40, Robert Haas wrote: On Fri, Mar 27, 2020 at 4:32 PM Andres Freund wrote: I don't like having a file format that's intended to be used by external tools too that's undocumented except for code that assembles it in a piecemeal fashion. Do you mean in a follow-on patch this release, or later? I don't have a problem with the former. Here is a patch for that.

While reading the document that you pushed, I thought that it's better to define an index term for the backup manifest, so that we can easily reach this document from the index page. Thought? Patch attached. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION

diff --git a/doc/src/sgml/backup-manifest.sgml b/doc/src/sgml/backup-manifest.sgml
index 376aff0d6d..b9634f2706 100644
--- a/doc/src/sgml/backup-manifest.sgml
+++ b/doc/src/sgml/backup-manifest.sgml
@@ -3,6 +3,10 @@
 Backup Manifest Format
+
+  Backup Manifest
+
+
 The backup manifest generated by is primarily intended to permit the backup to be verified using
Re: Race condition in SyncRepGetSyncStandbysPriority
On Tue, 14 Apr 2020 at 18:35, Kyotaro Horiguchi wrote: > > At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada > wrote in > > On Tue, 14 Apr 2020 at 10:34, Tom Lane wrote: > > > > > > Kyotaro Horiguchi writes: > > > > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane wrote > > > > in > > > >> What I think we should do about this is, essentially, to get rid of > > > >> SyncRepGetSyncStandbys. Instead, let's have each walsender advertise > > > >> whether *it* believes that it is a sync standby, based on its last > > > >> evaluation of the relevant GUCs. This would be a bool that it'd > > > >> compute and set alongside sync_standby_priority. (Hm, maybe we'd not > > > > > > > Mmm.. SyncRepGetStandbyPriority returns the "priority" that a > > > > walsender thinks it is at, among synchronous_standby_names. Then to > > > > decide "I am a sync standby" we need to know how many walsenders with > > > > higher priority are alive now. SyncRepGetSyncStandbyPriority does the > > > > judgment now and suffers from the inconsistency of priority values. > > > > > > Yeah. After looking a bit closer, I think that the current definition > > > of sync_standby_priority (that is, as the result of local evaluation > > > of SyncRepGetStandbyPriority()) is OK. The problem is what we're doing > > > with it. I suggest that what we should do in SyncRepGetSyncRecPtr() > > > is make one sweep across the WalSnd array, collecting PID, > > > sync_standby_priority, *and* the WAL pointers from each valid entry. > > > Then examine that data and decide which WAL value we need, without > > > assuming > > > that the sync_standby_priority values are necessarily totally consistent. > > > But in any case we must examine each entry just once while holding its > > > mutex, not go back to it later expecting it to still be the same. SyncRepGetSyncStandbysPriority() is running while holding SyncRepLock, so sync_standby_priority of any walsender can be changed while the function is scanning walsenders. The issue is we already have inconsistent walsender information before we enter the function. Thus how many times we scan the array doesn't make any difference. > > I think we need to do one of the following. > > A) prevent SyncRepGetSyncStandbysPriority from being entered while > walsender priority is inconsistent. > > B) make SyncRepGetSyncStandbysPriority tolerant of priority > inconsistency. > > C) protect the walsender priority array from being inconsistent. > > (B) is a band-aid. To achieve (A) we need a central controller > of priority config handling. (C) is: > > > Can we have a similar approach of sync_standby_defined for > > sync_standby_priority? That is, checkpointer is responsible for > > changing sync_standby_priority of all walsenders when SIGHUP. That > > way, all walsenders can see a consistent view of > > sync_standby_priority. And when a walsender starts, it sets > > sync_standby_priority by itself. The logic to decide who's a sync > > standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders > > having higher priority along with their WAL positions. > > Yeah, it works if we do that, but the problem with that way is that to > determine the priority of walsenders, we need to know which walsenders are > running. That is, when a new walsender comes, the process needs to be aware > of the arrival (or leaving) right away and reassign the priority of > every walsender again. I think we don't need to reassign the priority when a new walsender comes or leaves. IIUC the priority is calculated based only on synchronous_standby_names.
A walsender coming or leaving doesn't affect others' priorities. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
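For context, an abridged sketch of SyncRepGetStandbyPriority() (paraphrasing syncrep.c, details trimmed) showing why this holds: a walsender derives its priority purely from its own application_name and synchronous_standby_names, so which other walsenders happen to be running never enters into the calculation:

    /* Abridged sketch of syncrep.c's SyncRepGetStandbyPriority(); not verbatim. */
    static int
    SyncRepGetStandbyPriority(void)
    {
        const char *standby_name;
        int         priority;

        if (am_cascading_walsender || SyncRepConfig == NULL)
            return 0;           /* never a synchronous standby */

        /* Walk the names listed in synchronous_standby_names. */
        standby_name = SyncRepConfig->member_names;
        for (priority = 1; priority <= SyncRepConfig->nmembers; priority++)
        {
            if (pg_strcasecmp(standby_name, application_name) == 0 ||
                strcmp(standby_name, "*") == 0)
                return (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY) ?
                    priority : 1;
            standby_name += strlen(standby_name) + 1;
        }

        return 0;               /* not listed: not a sync standby candidate */
    }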
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers wrote: > > On 2020-04-14 12:10, Dilip Kumar wrote: > > > v14-0001-Immediately-WAL-log-assignments.patch + > > v14-0002-Issue-individual-invalidations-with.patch + > > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+ > > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+ > > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch + > > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+ > > v14-0007-Track-statistics-for-streaming.patch + > > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch + > > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch + > > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch > > applied on top of 8128b0c (a few hours ago) > > Hi, > > I haven't followed this thread and maybe this instability is > known/expected; just thought I'd let you know. > > When doing a pgbench run over logical replication (cascading > down two replicas), I get this segmentation fault. Thanks for the testing. Is it possible to share the call stack? > > 2020-04-14 17:27:28.135 CEST [8118] DETAIL: Streaming transactions > committing after 0/5FA2A38, reading WAL from 0/5FA2A00. > 2020-04-14 17:27:28.135 CEST [8118] LOG: logical decoding found > consistent point at 0/5FA2A00 > 2020-04-14 17:27:28.135 CEST [8118] DETAIL: There are no running > transactions. > 2020-04-14 17:27:28.138 CEST [8006] LOG: server process (PID 8118) was > terminated by signal 11: Segmentation fault > 2020-04-14 17:27:28.138 CEST [8006] DETAIL: Failed process was running: > COMMIT > 2020-04-14 17:27:28.138 CEST [8006] LOG: terminating any other active > server processes > 2020-04-14 17:27:28.138 CEST [8163] WARNING: terminating connection > because of crash of another server process > 2020-04-14 17:27:28.138 CEST [8163] DETAIL: The postmaster has > commanded this server process to roll back the current transaction and > exit, because another server process exited abnormally and possibly > corrupted shared memory. > 2020-04-14 17:27:28.138 CEST [8163] HINT: In a moment you should be > able to reconnect to the database and repeat your command. > > > This error happens somewhat buried away in my test-stuff; I can dig it > out and make it into a repeatable test if you need it. (debian > stretch/gcc 9.3.0) Yeah, that will be great. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com