Re: Aggregate error message

2019-05-24 Thread David Rowley
On Fri, 24 May 2019 at 18:17, Vik Fearing  wrote:
>
> With a sample query such as
>
> SELECT x, avg(x)
> FROM (VALUES (1), (2), (3)) AS v (x);
>
> We give the error message "column "v.x" must appear in the GROUP BY
> clause or be used in an aggregate function".
>
> This is correct but incomplete.  Attached is a trivial patch to also
> suggest that the user might have been trying to use a window function.

I think you might have misunderstood this one. If there's an aggregate
function in the SELECT or HAVING clause, then anything else in the
SELECT clause is going to have to be either in the GROUP BY clause, be
functionally dependent on the GROUP BY clause, or be in an aggregate
function. Putting it into a window function won't help the situation.

postgres=# select sum(x) over(),avg(x) FROM (VALUES (1), (2), (3)) AS v (x);
psql: ERROR:  column "v.x" must appear in the GROUP BY clause or be
used in an aggregate function
LINE 1: select sum(x) over(),avg(x) FROM (VALUES (1), (2), (3)) AS v...
   ^

If there's any change to make to the error message then it would be to
add the functional dependency part, but since we're pretty bad at
detecting that, I don't think we should.
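For reference, a hedged sketch of how the original sample query can be made
valid: adding the column to GROUP BY satisfies the rule above, and a window
function then also becomes legal on top of the grouped rows.

```sql
-- Sketch only: grouping by x makes both the plain aggregate and a
-- window function valid, since x now appears in the GROUP BY clause.
SELECT x, avg(x), sum(avg(x)) OVER ()
FROM (VALUES (1), (2), (3)) AS v (x)
GROUP BY x;
```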

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Small fix: avoid passing null pointers to memcpy()

2019-05-24 Thread didier
Hi
A smaller version, removing the memset in the print_aligned_text function.
The line is redundant: header_done isn't used yet at that point, and it is
either allocated with pg_malloc0 or is NULL.

Without this patch, make check fails 3 tests if PostgreSQL is compiled with
-fsanitize=address,undefined
diff --git a/src/fe_utils/print.c b/src/fe_utils/print.c
index cb9a9a0613..b13167859c 100644
--- a/src/fe_utils/print.c
+++ b/src/fe_utils/print.c
@@ -913,7 +913,6 @@ print_aligned_text(const printTableContent *cont, FILE *fout, bool is_pager)
 
 			more_col_wrapping = col_count;
 			curr_nl_line = 0;
-			memset(header_done, false, col_count * sizeof(bool));
 			while (more_col_wrapping)
 			{
 if (opt_border == 2)


Why does not subquery pruning conditions inherit to parent query?

2019-05-24 Thread Kato, Sho
Hello

I executed the following query against a partitioned table, but the plan is
different from what I expected, so could you tell me the reason?

postgres=# explain select * from jta, (select a, max(b)  from jtb where a = 1 
group by a ) c1 where jta.a = c1.a;
   QUERY PLAN   

 Hash Join  (cost=38.66..589.52 rows=1402 width=12)
   Hash Cond: (jta0.a = jtb0.a)
   ->  Append  (cost=0.00..482.50 rows=25500 width=4)
 ->  Seq Scan on jta0  (cost=0.00..35.50 rows=2550 width=4)
 ->  Seq Scan on jta1  (cost=0.00..35.50 rows=2550 width=4)
 ->  Seq Scan on jta2  (cost=0.00..35.50 rows=2550 width=4)
 ->  Seq Scan on jta3  (cost=0.00..35.50 rows=2550 width=4)
 ->  Seq Scan on jta4  (cost=0.00..35.50 rows=2550 width=4)
 ->  Seq Scan on jta5  (cost=0.00..35.50 rows=2550 width=4)
 ->  Seq Scan on jta6  (cost=0.00..35.50 rows=2550 width=4)
 ->  Seq Scan on jta7  (cost=0.00..35.50 rows=2550 width=4)
 ->  Seq Scan on jta8  (cost=0.00..35.50 rows=2550 width=4)
 ->  Seq Scan on jta9  (cost=0.00..35.50 rows=2550 width=4)
   ->  Hash  (cost=38.53..38.53 rows=11 width=8)
 ->  GroupAggregate  (cost=0.00..38.42 rows=11 width=8)
   Group Key: jtb0.a
   ->  Seq Scan on jtb0  (cost=0.00..38.25 rows=11 width=8)
 Filter: (a = 1)
(18 rows)

I assumed that the subquery would aggregate only the pruned partition, and
that the parent query would join the pruned partition with the subquery's
result.
However, the parent query scans all partitions before joining.
From my investigation, the parent query does not prune because
is_simple_subquery() returns false when the subquery contains GROUP BY.
Is it possible to improve this?
If the subquery has only a WHERE clause, the parent query does not scan all
partitions.

postgres=# explain select * from jta, (select a from jtb where a = 1) c1 where 
jta.a = c1.a;
QUERY PLAN
--
 Nested Loop  (cost=0.00..81.94 rows=143 width=8)
   ->  Seq Scan on jta0  (cost=0.00..41.88 rows=13 width=4)
 Filter: (a = 1)
   ->  Materialize  (cost=0.00..38.30 rows=11 width=4)
 ->  Seq Scan on jtb0  (cost=0.00..38.25 rows=11 width=4)
   Filter: (a = 1)
(6 rows)

regards,

Sho Kato





Re: Why does not subquery pruning conditions inherit to parent query?

2019-05-24 Thread David Rowley
On Fri, 24 May 2019 at 19:44, Kato, Sho  wrote:
> I executed the following query against a partitioned table, but the plan is 
> different from what I expected, so could you tell me the reason?
>
> postgres=# explain select * from jta, (select a, max(b)  from jtb where a = 1 
> group by a ) c1 where jta.a = c1.a;
>QUERY PLAN
> 
>  Hash Join  (cost=38.66..589.52 rows=1402 width=12)
>Hash Cond: (jta0.a = jtb0.a)
>->  Append  (cost=0.00..482.50 rows=25500 width=4)
>  ->  Seq Scan on jta0  (cost=0.00..35.50 rows=2550 width=4)
>  ->  Seq Scan on jta1  (cost=0.00..35.50 rows=2550 width=4)
>  ->  Seq Scan on jta2  (cost=0.00..35.50 rows=2550 width=4)
>  ->  Seq Scan on jta3  (cost=0.00..35.50 rows=2550 width=4)
>  ->  Seq Scan on jta4  (cost=0.00..35.50 rows=2550 width=4)
>  ->  Seq Scan on jta5  (cost=0.00..35.50 rows=2550 width=4)
>  ->  Seq Scan on jta6  (cost=0.00..35.50 rows=2550 width=4)
>  ->  Seq Scan on jta7  (cost=0.00..35.50 rows=2550 width=4)
>  ->  Seq Scan on jta8  (cost=0.00..35.50 rows=2550 width=4)
>  ->  Seq Scan on jta9  (cost=0.00..35.50 rows=2550 width=4)
>->  Hash  (cost=38.53..38.53 rows=11 width=8)
>  ->  GroupAggregate  (cost=0.00..38.42 rows=11 width=8)
>Group Key: jtb0.a
>->  Seq Scan on jtb0  (cost=0.00..38.25 rows=11 width=8)
>  Filter: (a = 1)
> (18 rows)
>
> I assumed that the subquery would aggregate only the pruned partition, and 
> that the parent query would join the pruned partition with the subquery's 
> result. However, the parent query scans all partitions before joining.
> From my investigation, the parent query does not prune because 
> is_simple_subquery() returns false when the subquery contains GROUP BY.
> Is it possible to improve this?

The planner can only push quals down into a subquery, it cannot pull
quals from a subquery into the outer query.

If you write the query like:

explain select * from jta, (select a, max(b) from jtb group by a ) c1
where jta.a = c1.a and c1.a = 1;

you should get the plan that you want.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Read-only access to temp tables for 2PC transactions

2019-05-24 Thread Simon Riggs
On Thu, 23 May 2019 at 16:55, Andres Freund  wrote:

> Hi,
>
> On 2019-05-23 12:36:09 +0100, Simon Riggs wrote:
> > The ONLY case where this matters is if someone does a PREPARE and then
> > starts doing other work on the session. Which makes no sense in the
> normal
> > workflow of a session. I'm sure there are tests that do that, but those
> > tests are unrepresentative of sensible usage.
>
> That's extremely common.
>

Not at all.


> There's no way we can forbid using session after 2PC unconditionally,
> it'd break most users of 2PC.
>

Since we disagree, can you provide more information about this usage
pattern?

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Read-only access to temp tables for 2PC transactions

2019-05-24 Thread Simon Riggs
On Fri, 24 May 2019 at 01:39, Michael Paquier  wrote:

> On Thu, May 23, 2019 at 08:54:59AM -0700, Andres Freund wrote:
> > On 2019-05-23 12:36:09 +0100, Simon Riggs wrote:
> >> The ONLY case where this matters is if someone does a PREPARE and then
> >> starts doing other work on the session. Which makes no sense in the
> normal
> >> workflow of a session. I'm sure there are tests that do that, but those
> >> tests are unrepresentative of sensible usage.
> >
> > That's extremely common.
> >
> > There's no way we can forbid using session after 2PC unconditionally,
> > it'd break most users of 2PC.
>
> This does not break Postgres-XC or XL as their inner parts with a
> COMMIT involving multiple write nodes issue a set of PREPARE
> TRANSACTION followed by an immediate COMMIT PREPARED which are
> transparent for the user, so the point of Simon looks sensible from
> this angle.


Maybe, but I am not discussing other products since they can be changed
without discussion here.


> Howewer, I much agree with Andres that it is very common
> to have PREPARE and COMMIT PREPARED issued with different sessions.  I
> am not much into the details of XA-compliant drivers, but I think that
> having us lose this property would be the source of many complaints.
>

Yes, it is *very* common to have PREPARE and COMMIT PREPARED issued from
different sessions. That is the main usage in a session pool and not the
point I made.

There are two usage patterns, with a correlation between the way 2PC and
temp tables are used:

Transaction-mode session-pool: (Most common usage mode)
* No usage of session-level temp tables (because that wouldn't work)
* 2PC with PREPARE and COMMIT PREPARED on different sessions
* No reason at all to hold locks on temp table after PREPARE

Session-mode (Less frequent usage mode)
* Usage of session-level temp tables
* 2PC on same session only, i.e. no usage of session between PREPARE and
COMMIT PREPARED (Simon's observation)
* No reason at all to hold locks on temp table after PREPARE (Simon's
conclusion)
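For concreteness, a minimal sketch of the transaction-mode pattern (table
name hypothetical): the PREPARE happens on one pooled session, and COMMIT
PREPARED can be issued from any other session, identified by the GID.

```sql
-- Session A (from the pool): do the work and prepare.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- hypothetical table
PREPARE TRANSACTION 'gid_42';

-- Session B (possibly a different pooled connection): finish it.
COMMIT PREPARED 'gid_42';
```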

I'd like to hear from anyone who thinks my observation is incorrect, and to
have them explain their usage pattern, so we can understand why they would
execute further SQL between PREPARE and COMMIT PREPARED when using a single
session, while at the same time using temp tables.

If there really is a usage pattern there we should take note of, then I
suggest we introduce a parameter that allows temp table locks to be dropped
at PREPARE, so that we can use 2PC and Temp Tables with ease, for those
that want it.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Indexing - comparison of tree structures

2019-05-24 Thread Sascha Kuhl
Hi,

I compared two data structures by measured runtime, after estimating their
big-O complexity. For a medium-sized table, the T-tree outperforms the
B-tree, which is the structure commonly used. Lehmann and Carey showed the
same result earlier.

Can you improve indexing by this?

Understandably

Sascha Kuhl


Just another bit faster.pdf
Description: Adobe PDF document


Re: Should we warn against using too many partitions?

2019-05-24 Thread David Rowley
On Fri, 24 May 2019 at 17:58, Amit Langote
 wrote:
> +  Whether using table inheritance or native partitioning, hierarchies
>
> Maybe, it would better to use the word "declarative" instead of "native",
> if only to be consistent; neighboring paragraphs use "declarative".

Thanks for having a look.

I've attached the pg10 and pg11 patches with that updated... and also
the master one (unchanged) with the hopes that the CF bot picks that
one.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


docs_partitioning_warning_master_v2.patch
Description: Binary data


docs_partitioning_warning_pg11_v3.patch
Description: Binary data


docs_partitioning_warning_pg10_v3.patch
Description: Binary data


Re: Converting NOT IN to anti-joins during planning

2019-05-24 Thread David Rowley
On Wed, 6 Mar 2019 at 12:54, David Rowley  wrote:
> The latest patch is attached.

Rebased version after pgindent run.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


not_in_anti_join_v1.4.patch
Description: Binary data


Re: initdb recommendations

2019-05-24 Thread Joe Conway
On 5/23/19 10:30 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> "Jonathan S. Katz"  writes:
>> > For now I have left in the password based method to be scram-sha-256 as
>> > I am optimistic about the support across client drivers[1] (and FWIW I
>> > have an implementation for crystal-pg ~60% done).
>> 
>> > However, this probably means we would need to set the default password
>> > encryption guc to "scram-sha-256" which we're not ready to do yet, so it
>> > may be moot to leave it in.
>> 
>> > So, thinking out loud about that, we should probably use "md5" and once
>> > we decide to make the encryption method "scram-sha-256" by default, then
>> > we update the recommendation?
>> 
>> Meh.  If we're going to break things, let's break them.  Set it to
>> scram by default and let people who need to cope with old clients
>> change the default.  I'm tired of explaining that MD5 isn't actually
>> insecure in our usage ...
> 
> +many.

many++

Are we doing this for pg12? In any case, I would think we better loudly
point out this change somewhere.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: initdb recommendations

2019-05-24 Thread Dave Cramer
On Fri, 24 May 2019 at 07:48, Joe Conway  wrote:

> On 5/23/19 10:30 PM, Stephen Frost wrote:
> > Greetings,
> >
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> "Jonathan S. Katz"  writes:
> >> > For now I have left in the password based method to be scram-sha-256
> as
> >> > I am optimistic about the support across client drivers[1] (and FWIW I
> >> > have an implementation for crystal-pg ~60% done).
> >>
> >> > However, this probably means we would need to set the default password
> >> > encryption guc to "scram-sha-256" which we're not ready to do yet, so
> it
> >> > may be moot to leave it in.
> >>
> >> > So, thinking out loud about that, we should probably use "md5" and
> once
> >> > we decide to make the encryption method "scram-sha-256" by default,
> then
> >> > we update the recommendation?
> >>
> >> Meh.  If we're going to break things, let's break them.  Set it to
> >> scram by default and let people who need to cope with old clients
> >> change the default.  I'm tired of explaining that MD5 isn't actually
> >> insecure in our usage ...
> >
> > +many.
>
> many++
>
> Are we doing this for pg12? In any case, I would think we better loudly
> point out this change somewhere.
>
>
+many as well, given the presumption that we are going to break existing
behaviour.

Dave


Re: initdb recommendations

2019-05-24 Thread Peter Eisentraut
On 2019-05-24 13:48, Joe Conway wrote:
> Are we doing this for pg12?

no

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: POC: GROUP BY optimization

2019-05-24 Thread Dmitry Dolgov
> On Fri, May 3, 2019 at 11:55 PM Tomas Vondra  
> wrote:
>
> I don't recall the exact details of the discussion, since most of it
> happened almost a year ago, but the main concern about the original
> costing approach is that it very much assumes uniform distribution.
>
> For example if you have N tuples with M groups (which are not computed
> using estimate_num_groups IIRC, and we can hardly do much better), then
> the patch assumed each groups is ~N/M rows and used that for computing
> the number of comparisons for a particular sequence of ORDER BY clauses.
>
> But that may result in pretty significant errors in causes with a couple
> of very large groups.

Yes, you're right: the current version of the patch assumes a uniform
distribution of values between these M groups. After some thinking, I've got
an idea: maybe it's better not to try to find a costing that is acceptable
for both uniform and non-uniform distributions, but simply not to reorder at
all if there are any significant deviations from what seems to be the "best
case", namely when values are distributed relatively uniformly among the
groups and there is no doubt about how expensive they are to compare.

Saying that, it's possible on top of the current implementation to check:

* dependency coefficient between columns (if I understand correctly, a
  non-uniform distribution of values between the groups, i.e. a few large
  groups, should show up in extended statistics as a significant dependency)

* the largest/smallest group in the MCV list doesn't differ too much from
  the expected "average" group

* average width and comparison procost are similar

If everything fits (which I hope will be most of the time), apply the
reordering; otherwise use whatever order was specified (which leaves room for
manual reordering in relatively rare situations). Does that make sense?
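As an illustration of the first check (object names hypothetical), the
dependency information is what extended statistics already collect:

```sql
-- Sketch: functional-dependency statistics on the candidate columns;
-- a strong dependency would hint at a few large groups.
CREATE STATISTICS dep_stats (dependencies) ON a, b FROM t;
ANALYZE t;
SELECT stxname, stxdependencies
FROM pg_statistic_ext
WHERE stxname = 'dep_stats';
```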

> But what I think we could do is using largest possible group instead of
> the average one. So for example when estimating number of comparisons
> for columns (c1,..,cN), you could look at MCV lists for these columns
> and compute
>
> m(i) = Max(largest group in MCV list for i-th column)
>
> and then use Min(m(1), ..., m(k)) when estimating the number of
> comparisons.

I see the idea, but I'm a bit confused about how to get the largest group
from an MCV list. I mean, there is a list of most common values per column
with frequencies, and similar data for multi-column statistics, but how do I
get the size of those groups?
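One hedged way to read a group size off the per-column statistics (table name
hypothetical): the largest group for a column is roughly the top MCV
frequency times the estimated row count.

```sql
-- Sketch: approximate the largest per-column group from pg_stats;
-- most_common_freqs is ordered, so element 1 is the top frequency.
SELECT s.attname,
       s.most_common_freqs[1] * c.reltuples AS est_largest_group
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename = 't';  -- hypothetical table
```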




Re: initdb recommendations

2019-05-24 Thread Stephen Frost
Greetings,

* Joe Conway (m...@joeconway.com) wrote:
> On 5/23/19 10:30 PM, Stephen Frost wrote:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> "Jonathan S. Katz"  writes:
> >> > For now I have left in the password based method to be scram-sha-256 as
> >> > I am optimistic about the support across client drivers[1] (and FWIW I
> >> > have an implementation for crystal-pg ~60% done).
> >> 
> >> > However, this probably means we would need to set the default password
> >> > encryption guc to "scram-sha-256" which we're not ready to do yet, so it
> >> > may be moot to leave it in.
> >> 
> >> > So, thinking out loud about that, we should probably use "md5" and once
> >> > we decide to make the encryption method "scram-sha-256" by default, then
> >> > we update the recommendation?
> >> 
> >> Meh.  If we're going to break things, let's break them.  Set it to
> >> scram by default and let people who need to cope with old clients
> >> change the default.  I'm tired of explaining that MD5 isn't actually
> >> insecure in our usage ...
> > 
> > +many.
> 
> many++
> 
> Are we doing this for pg12? In any case, I would think we better loudly
> point out this change somewhere.

Sure, we should point it out, but I don't know that it needs to be
screamed from the rooftops considering the packagers have already been
largely ignoring our defaults here anyway...

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: initdb recommendations

2019-05-24 Thread Joe Conway
On 5/24/19 8:13 AM, Stephen Frost wrote:
> Greetings,
> 
> * Joe Conway (m...@joeconway.com) wrote:
>> On 5/23/19 10:30 PM, Stephen Frost wrote:
>> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> >> "Jonathan S. Katz"  writes:
>> >> > For now I have left in the password based method to be scram-sha-256 as
>> >> > I am optimistic about the support across client drivers[1] (and FWIW I
>> >> > have an implementation for crystal-pg ~60% done).
>> >> 
>> >> > However, this probably means we would need to set the default password
>> >> > encryption guc to "scram-sha-256" which we're not ready to do yet, so it
>> >> > may be moot to leave it in.
>> >> 
>> >> > So, thinking out loud about that, we should probably use "md5" and once
>> >> > we decide to make the encryption method "scram-sha-256" by default, then
>> >> > we update the recommendation?
>> >> 
>> >> Meh.  If we're going to break things, let's break them.  Set it to
>> >> scram by default and let people who need to cope with old clients
>> >> change the default.  I'm tired of explaining that MD5 isn't actually
>> >> insecure in our usage ...
>> > 
>> > +many.
>> 
>> many++
>> 
>> Are we doing this for pg12? In any case, I would think we better loudly
>> point out this change somewhere.
> 
> Sure, we should point it out, but I don't know that it needs to be
> screamed from the rooftops considering the packagers have already been
> largely ignoring our defaults here anyway...

Yeah, I thought about that, but anyone not using those packages will be
in for a big surprise. Don't get me wrong, I wholeheartedly endorse the
change, but I predict many related questions on the lists, and anything
we can do to mitigate that should be done.

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-05-24 Thread Joe Conway
On 5/24/19 1:47 AM, Amit Langote wrote:
> On 2019/05/23 4:15, Andreas Seltenreich wrote:
>> …but when doing it on the parent relation, even 100 statements are
>> enough to exceed the limit:
>> 
>> ,
>> | $ psql -c "$(yes update t set c=c where c=6 \; | head -n 100)"
>> | FEHLER:  Speicher aufgebraucht
>> | DETAIL:  Failed on request of size 200 in memory context "MessageContext".
>> `
>> 
>> The memory context dump shows plausible values except for the MessageContext:
>> 
>> TopMemoryContext: 124336 total in 8 blocks; 18456 free (11 chunks); 105880 
>> used
>>   [...]
>>   MessageContext: 264241152 total in 42 blocks; 264 free (0 chunks); 
>> 264240888 used
>>   [...]
> 
> As David Rowley said, planning that query hundreds of times under a single
> MessageContext is not something that will end well on 11.3, because even a
> single instance takes up tons of memory that's only released when
> MessageContext is reset.
> 
>> Maybe some tactically placed pfrees or avoiding putting redundant stuff
>> into MessageContext can relax the situation?
> 
> I too have had similar thoughts on the matter.  If the planner had built
> all its subsidiary data structures in its own private context (or tree of
> contexts) which is reset once a plan for a given query is built and passed
> on, then there wouldn't be an issue of all of that subsidiary memory
> leaking into MessageContext.  However, the problem may really be that
> we're subjecting the planner to use cases that it wasn't perhaps designed
> to perform equally well under -- running it many times while handling the
> same message.  It is worsened by the fact that the query in question is
> something that ought to have been documented as not well supported by the
> planner; David has posted a documentation patch for that [1].  PG 12 has
> alleviated the situation to a large degree, so you won't see the OOM
> occurring for this query, but not for all queries unfortunately.


I admittedly haven't followed this thread too closely, but if having 100
partitions causes an out-of-memory failure on pg11, that sounds like a
massive regression to me.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: initdb recommendations

2019-05-24 Thread Stephen Frost
Greetings,

* Joe Conway (m...@joeconway.com) wrote:
> On 5/24/19 8:13 AM, Stephen Frost wrote:
> > * Joe Conway (m...@joeconway.com) wrote:
> >> On 5/23/19 10:30 PM, Stephen Frost wrote:
> >> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> >> "Jonathan S. Katz"  writes:
> >> >> > For now I have left in the password based method to be scram-sha-256 
> >> >> > as
> >> >> > I am optimistic about the support across client drivers[1] (and FWIW I
> >> >> > have an implementation for crystal-pg ~60% done).
> >> >> 
> >> >> > However, this probably means we would need to set the default password
> >> >> > encryption guc to "scram-sha-256" which we're not ready to do yet, so 
> >> >> > it
> >> >> > may be moot to leave it in.
> >> >> 
> >> >> > So, thinking out loud about that, we should probably use "md5" and 
> >> >> > once
> >> >> > we decide to make the encryption method "scram-sha-256" by default, 
> >> >> > then
> >> >> > we update the recommendation?
> >> >> 
> >> >> Meh.  If we're going to break things, let's break them.  Set it to
> >> >> scram by default and let people who need to cope with old clients
> >> >> change the default.  I'm tired of explaining that MD5 isn't actually
> >> >> insecure in our usage ...
> >> > 
> >> > +many.
> >> 
> >> many++
> >> 
> >> Are we doing this for pg12? In any case, I would think we better loudly
> >> point out this change somewhere.
> > 
> > Sure, we should point it out, but I don't know that it needs to be
> > screamed from the rooftops considering the packagers have already been
> > largely ignoring our defaults here anyway...
> 
> Yeah, I thought about that, but anyone not using those packages will be
> in for a big surprise. Don't get me wrong, I wholeheartedly endorse the
> change, but I predict many related questions on the lists, and anything
> we can do to mitigate that should be done.

You think there's someone who builds from source and just trusts
what we have put in for the defaults in pg_hba.conf...?

I've got a really hard time with that idea...

I'm all for making people aware of it, but I don't think it justifies
being the top item of the release notes or some such.  Frankly, anything
that starts with "If you build from source, then..." is already going to
be pretty low impact and therefore low on the list of things we need to
cover in the release notes, et al.

Thanks,

Stephen




Re: initdb recommendations

2019-05-24 Thread Magnus Hagander
On Fri, May 24, 2019 at 2:19 PM Stephen Frost  wrote:

> Greetings,
>
> * Joe Conway (m...@joeconway.com) wrote:
> > On 5/24/19 8:13 AM, Stephen Frost wrote:
> > > * Joe Conway (m...@joeconway.com) wrote:
> > >> On 5/23/19 10:30 PM, Stephen Frost wrote:
> > >> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > >> >> "Jonathan S. Katz"  writes:
> > >> >> > For now I have left in the password based method to be
> scram-sha-256 as
> > >> >> > I am optimistic about the support across client drivers[1] (and
> FWIW I
> > >> >> > have an implementation for crystal-pg ~60% done).
> > >> >>
> > >> >> > However, this probably means we would need to set the default
> password
> > >> >> > encryption guc to "scram-sha-256" which we're not ready to do
> yet, so it
> > >> >> > may be moot to leave it in.
> > >> >>
> > >> >> > So, thinking out loud about that, we should probably use "md5"
> and once
> > >> >> > we decide to make the encryption method "scram-sha-256" by
> default, then
> > >> >> > we update the recommendation?
> > >> >>
> > >> >> Meh.  If we're going to break things, let's break them.  Set it to
> > >> >> scram by default and let people who need to cope with old clients
> > >> >> change the default.  I'm tired of explaining that MD5 isn't
> actually
> > >> >> insecure in our usage ...
> > >> >
> > >> > +many.
> > >>
> > >> many++
> > >>
> > >> Are we doing this for pg12? In any case, I would think we better
> loudly
> > >> point out this change somewhere.
> > >
> > > Sure, we should point it out, but I don't know that it needs to be
> > > screamed from the rooftops considering the packagers have already been
> > > largely ignoring our defaults here anyway...
> >
> > Yeah, I thought about that, but anyone not using those packages will be
> > in for a big surprise. Don't get me wrong, I wholeheartedly endorse the
> > change, but I predict many related questions on the lists, and anything
> > we can do to mitigate that should be done.
>
> You think there's someone who builds from the source and just trusts
> what we have put in for the defaults in pg_hba.conf..?
>
> I've got a really hard time with that idea...
>
> I'm all for making people aware of it, but I don't think it justifies
> being the top item of the release notes or some such.  Frankly, anything
> that starts with "If you build from source, then..." is already going to
> be pretty low impact and therefore low on the list of things we need to
> cover in the release notes, et al.
>

I think changing away from "trust" is going to be a much smaller change
than people seem to worry about.

It will hit people *in the developer community*.

The thing that will potentially hit *end users* is when the RPMs, DEBs or
Windows Installers switch to SCRAM (because of clients with older drivers).
But they have *already* stopped using trust many many years ago.

Making the default change away from trust in the source distro will affect
few people.

Making the default change of password_encryption -> scram will affect a
*lot* of people. That one needs to be more carefully coordinated.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: initdb recommendations

2019-05-24 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> The thing that will potentially hit *end users* is when the RPMs, DEBs or
> Windows Installers switch to SCRAM (because of clients with older drivers).

Agreed.  I'm not sure that our change to SCRAM as default would actually
make them change...  It might, but I'm not sure and it's really a bit of
a different discussion in any case because we need to provide info about
how to go about making the migration.

> Making the default change away from trust in the source distro will affect
> few people.

Agreed.

> Making the default change of password_encryption -> scram will affect a
> *lot* of people. That one needs to be more carefully coordinated.

We need to provide better documentation about how to get from md5 to
SCRAM, in my view.  I'm not sure where that should live, exactly.
I really wish we had put more effort into making the migration easy to
do over a period of time, and we might actually have to do that before
the packagers would be willing to make that change.

Thanks,

Stephen




Re: initdb recommendations

2019-05-24 Thread Jonathan S. Katz
On 5/24/19 8:33 AM, Stephen Frost wrote:
> Greetings,
> 
> * Magnus Hagander (mag...@hagander.net) wrote:
>> The thing that will potentially hit *end users* is when the RPMs, DEBs or
>> Windows Installers switch to SCRAM (because of clients with older drivers).
> 
> Agreed.  I'm not sure that our change to SCRAM as default would actually
> make them change...  It might, but I'm not sure and it's really a bit of
> a different discussion in any case because we need to provide info about
> how to go about making the migration.

Yeah, that's the key piece. Even with (almost) all the drivers now
supporting SCRAM, the re-hashing from md5 => scram-sha-256 does not come
automatically.

>> Making the default change away from trust in the source distro will affect
>> few people.
> 
> Agreed.

+1

>> Making the default change of password_encryption -> scram will affect a
>> *lot* of people. That one needs to be more carefully coordinated.

Per some of the upthread comments though, if we go down this path we should
at least make the packagers abundantly aware of the change to the default. I
think some of the work they do could help ease the upgrade pain.

> We need to provide better documentation about how to get from md5 to
> SCRAM, in my view.  I'm not sure where that should live, exactly.
> I really wish we had put more effort into making the migration easy to
> do over a period of time, and we might actually have to do that before
> the packagers would be willing to make that change.

+100...I think we should do this regardless, and I was already thinking
of writing something up around it. I would even suggest backpatching said
password upgrade documentation to 10.
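A minimal sketch of the per-role migration such documentation would cover
(role name and password hypothetical):

```sql
-- Sketch: re-hash one role's password as SCRAM; the old md5 hash
-- cannot be converted, so the password must be set again.
SET password_encryption = 'scram-sha-256';
ALTER ROLE alice PASSWORD 'changeme';  -- hypothetical role/password
-- Afterwards, switch the matching pg_hba.conf entries from
-- "md5" to "scram-sha-256" and reload the server configuration.
```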

Jonathan





Re: initdb recommendations

2019-05-24 Thread Stephen Frost
Greetings,

* Jonathan S. Katz (jk...@postgresql.org) wrote:
> On 5/24/19 8:33 AM, Stephen Frost wrote:
> > We need to provide better documentation about how to get from md5 to
> > SCRAM, in my view.  I'm not sure where that should live, exactly.
> > I really wish we had put more effort into making the migration easy to
> > do over a period of time, and we might actually have to do that before
> > the packagers would be willing to make that change.
> 
> +100...I think we should do this regardless, and I was already thinking
> of writing something up around it. I would even suggest that we have
> said password upgrade documentation backpatched to 10.

Not sure that backpatching is necessary, but I'm not actively against
it.

What I was really getting at though was the ability to have multiple
authenticator tokens active concurrently (eg: md5 AND SCRAM), with an
ability to use either one (idk, md5_or_scram auth method?), and then
automatically set both on password change until everything is using
SCRAM and then remove all MD5 stuff.

Or something along those lines.  In other words, I'm talking about new
development work to ease the migration (while also providing some oft
asked about features, like the ability to do rolling passwords...).

Thanks,

Stephen




Re: initdb recommendations

2019-05-24 Thread Joe Conway
On 5/24/19 8:56 AM, Jonathan S. Katz wrote:
> On 5/24/19 8:33 AM, Stephen Frost wrote:
>> * Magnus Hagander (mag...@hagander.net) wrote:
>>> Making the default change away from trust in the source distro will affect
>>> few people.
>> 
>> Agreed.
> 
> +1

Fewer people, but likely disproportionately high representation on pgsql
lists. Anyway, nuff said -- I guess the future will tell one way or the
other.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: initdb recommendations

2019-05-24 Thread Jonathan S. Katz
On 5/24/19 9:01 AM, Stephen Frost wrote:
> Greetings,
> 
> * Jonathan S. Katz (jk...@postgresql.org) wrote:
>> On 5/24/19 8:33 AM, Stephen Frost wrote:
>>> We need to provide better documentation about how to get from md5 to
>>> SCRAM, in my view.  I'm not sure where that should live, exactly.
>>> I really wish we had put more effort into making the migration easy to
>>> do over a period of time, and we might actually have to do that before
>>> the packagers would be willing to make that change.
>>
>> +100...I think we should do this regardless, and I was already thinking
>> of writing something up around it. I would even suggest that we have
>> said password upgrade documentation backpatched to 10.
> 
> Not sure that backpatching is necessary, but I'm not actively against
> it.

Well, for someone who wants to cut over and has to manually guide the
process, a guide will help in the absence of new development.

> 
> What I was really getting at though was the ability to have multiple
> authenticator tokens active concurrently (eg: md5 AND SCRAM), with an
> ability to use either one (idk, md5_or_scram auth method?), and then
> automatically set both on password change until everything is using
> SCRAM and then remove all MD5 stuff.
> 
> Or something along those lines.  In other words, I'm talking about new
> development work to ease the migration (while also providing some oft
> asked about features, like the ability to do rolling passwords...).

Cool, I have been thinking about a similar feature as well to help ease
the transition (and fwiw was going to suggest it in my previous email).

I think an interim step, at least, is to document how we can help ease
the transition.

Thanks,

Jonathan





Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-05-24 Thread David Rowley
On Sat, 25 May 2019 at 00:18, Joe Conway  wrote:
> I admittedly haven't followed this thread too closely, but if having 100
> partitions causes out of memory on pg11, that sounds like a massive
> regression to me.

For it to have regressed it would have had to once have been better,
but where was that mentioned?  The only thing I saw was
non-partitioned tables compared to partitioned tables, but you can't
really say it's a regression if you're comparing apples to oranges.

I think the only regression here is in the documents from bebc46931a1
having removed the warning about too many partitions in a partitioned
table at the end of ddl.sgml. As Amit mentions, we'd like to put
something back about that.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Aggregate error message

2019-05-24 Thread Tom Lane
David Rowley  writes:
> On Fri, 24 May 2019 at 18:17, Vik Fearing  wrote:
>> With a sample query such as
>> SELECT x, avg(x)
>> FROM (VALUES (1), (2), (3)) AS v (x);
>> We give the error message "column "v.x" must appear in the GROUP BY
>> clause or be used in an aggregate function".
>> This is correct but incomplete.  Attached is a trivial patch to also
>> suggest that the user might have been trying to use a window function.

> I think you might have misthought this one. If there's an aggregate
> function in the SELECT or HAVING clause, then anything else in the
> SELECT clause is going to have to be either in the GROUP BY clause, be
> functionally dependent on the GROUP BY clause, or be in an aggregate
> function. Putting it into a window function won't help the situation.

Yeah.  Also, even if the problem really is that avg(x) should have had
an OVER clause, the fact that the error cursor will not be pointing
at avg(x) means that Vik's wording is still not that helpful.

This is a bit outside our usual error-writing practice, but I wonder
if we could phrase it like "since this query uses aggregation, column
"v.x" must appear in the GROUP BY clause or be used in an aggregate
function".  With that, perhaps the user would realize "oh, I didn't
mean to aggregate" when faced with Vik's example.  But this phrasing
doesn't cover the GROUP-BY-without-aggregate case, and I'm not sure
how to do that without making the message even longer and more unwieldy.

> If there's any change to make to the error message then it would be to
> add the functional dependency part, but since we're pretty bad at
> detecting that, I don't think we should.

Yeah, that's another thing we're failing to cover in the message ...
but it seems unrelated to Vik's example.

regards, tom lane




Re: initdb recommendations

2019-05-24 Thread Heikki Linnakangas

On 24/05/2019 16:01, Stephen Frost wrote:
> What I was really getting at though was the ability to have multiple
> authenticator tokens active concurrently (eg: md5 AND SCRAM), with an
> ability to use either one (idk, md5_or_scram auth method?), and then
> automatically set both on password change until everything is using
> SCRAM and then remove all MD5 stuff.

Umm, that's what "md5" already does. Per documentation 
(https://www.postgresql.org/docs/current/auth-password.html):


> To ease transition from the md5 method to the newer SCRAM method, if
> md5 is specified as a method in pg_hba.conf but the user's password on
> the server is encrypted for SCRAM (see below), then SCRAM-based
> authentication will automatically be chosen instead.

The migration path is:

1. Use "md5" in pg_hba.conf, and put password_encryption='scram-sha-256' 
in postgresql.conf.

2. Wait until all users have reset their passwords, so that all users 
have a SCRAM-SHA-256 verifier.

3. Replace "md5" with "scram-sha-256" in pg_hba.conf.

Step 3 is kind of optional; once all users have a SCRAM verifier instead 
of an MD5 hash, they will all use SCRAM even without changing 
pg_hba.conf. It just prevents MD5 authentication in case a user forces a 
new MD5 hash into the system e.g. by changing password_encryption, or by 
setting an MD5 password explicitly with ALTER USER.


- Heikki




Re: Minimal logical decoding on standbys

2019-05-24 Thread Amit Khandekar
On Thu, 23 May 2019 at 23:18, Andres Freund  wrote:
>
> Hi,
>
> On 2019-05-23 23:08:55 +0530, Amit Khandekar wrote:
> > On Thu, 23 May 2019 at 21:29, Andres Freund  wrote:
> > > On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
> > > > On Tue, 21 May 2019 at 21:49, Andres Freund  wrote:
> > > > Yeah, I agree we should add such checks to minimize the possibility of
> > > > reading logical records from a master that has insufficient wal_level.
> > > > So to summarize :
> > > > a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
> > > > b. Call this function call in CreateInitDecodingContext() as well.
> > > > c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
> > > > error if there is an existing logical slot.
> > > >
> > > > This made me think more of the race conditions. For instance, in
> > > > pg_create_logical_replication_slot(), just after
> > > > CheckLogicalDecodingRequirements and before actually creating the
> > > > slot, suppose concurrently Controlfile->wal_level is changed from
> > > > logical to replica.  So suppose a new slot does get created. Later the
> > > > slot is read, so in pg_logical_slot_get_changes_guts(),
> > > > CheckLogicalDecodingRequirements() is called where it checks
> > > > ControlFile->wal_level value. But just before it does that,
> > > > ControlFile->wal_level concurrently changes back to logical, because
> > > > of replay of another param-change record. So this logical reader will
> > > > think that the wal_level is sufficient, and will proceed to read the
> > > > records, but those records are *before* the wal_level change, so these
> > > > records don't have logical data.
> > >
> > > I don't think that's an actual problem, because there's no decoding
> > > before the slot exists and CreateInitDecodingContext() has determined
> > > the start LSN. And by that point the slot exists, so
> > > XLOG_PARAMETER_CHANGE replay can error out.
> >
> > So between the start lsn and the lsn for
> > parameter-change(logical=>replica) record, there can be some records ,
> > and these don't have logical data. So the slot created will read from
> > the start lsn, and proceed to read these records, before reading the
> > parameter-change record.
>
> I don't think that's possible. By the time CreateInitDecodingContext()
> is called, the slot *already* exists (but in a state that'll cause it to
> be throw away on error). But the restart point has not yet been
> determined. Thus, if there is a XLOG_PARAMETER_CHANGE with a wal_level
> change it can error out. And to handle the race of wal_level changing
> between CheckLogicalDecodingRequirements() and the slot creation, we
> recheck in CreateInitDecodingContext().

ok, got it now. I was concerned that there might be some such cases
unhandled because we are not using locks to handle such concurrency
conditions. But as you have explained, the checks we are adding will
avoid this race condition.

>
> Think we might need to change ReplicationSlotReserveWal() to use the
> replay, rather than the redo pointer for logical slots though.

Not thought of this; will get back.

Working on the patch now 

> > Are you saying we want to error out when the postgres replays the
> > param change record and there is existing logical slot ? I thought you
> > were suggesting earlier that it's the decoder.c code which should
> > error out when reading the param-change record.
>
> Yes, that's what I'm saying. See this portion of my previous email on
> the topic:
Yeah, thanks for pointing that.
>
> On 2019-05-21 09:19:37 -0700, Andres Freund wrote:
> > So I suspect we need conflict handling in xlog_redo's
> > XLOG_PARAMETER_CHANGE case. If we there check against existing logical
> > slots, we ought to be safe.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: initdb recommendations

2019-05-24 Thread Stephen Frost
Greetings,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> On 24/05/2019 16:01, Stephen Frost wrote:
> >What I was really getting at though was the ability to have multiple
> >authenticator tokens active concurrently (eg: md5 AND SCRAM), with an
> >ability to use either one (idk, md5_or_scram auth method?), and then
> >automatically set both on password change until everything is using
> >SCRAM and then remove all MD5 stuff.
> 
> Umm, that's what "md5" already does. Per documentation
> (https://www.postgresql.org/docs/current/auth-password.html):

I remembered that we did something here but hadn't gone and looked at
it recently, so sorry for misremembering.  Perhaps all the more reason
for detailed migration documentation.

> > To ease transition from the md5 method to the newer SCRAM method, if
> > md5 is specified as a method in pg_hba.conf but the user's password on
> > the server is encrypted for SCRAM (see below), then SCRAM-based
> > authentication will automatically be chosen instead.
> 
> The migration path is:
> 
> 1. Use "md5" in pg_hba.conf, and put password_encryption='scram-sha-256' in
> postgresql.conf.
> 
> 2. Wait until all users have reset their passwords, so that all users have a
> SCRAM-SHA-256 verifier.

Wait though- once a password is changed then they *have* to use SCRAM
for auth from that point on, right?  That's great if you can be sure
that everything you're connecting from supports it, but that isn't going
to necessarily be the case. I think this is what I recall being unhappy
about and what I was trying to remember about what we did.

We also haven't got a way to tell very easily when a given md5 (or
scram, for that matter...) authenticator was last used, making it hard
to see if it's still actually being used or not.  Nor is there a very
nice way to see when all users have reset their passwords to scram
without inspecting the password hash itself...

> 3. Replace "md5" with "scram-sha-256" in pg_hba.conf.
> 
> Step 3 is kind of optional; once all users have a SCRAM verifier instead of
> an MD5 hash, they will all use SCRAM even without changing pg_hba.conf. It
> just prevents MD5 authentication in case a user forces a new MD5 hash into
> the system e.g. by changing password_encryption, or by setting an MD5
> password explicitly with ALTER USER.

Yes, which you'd certainly want to do, so I don't consider it to be
optional.  Further, we should really have a way for an admin to say
"never allow storing an md5 password again" which I don't think we do.

Thanks,

Stephen




Re: initdb recommendations

2019-05-24 Thread Jonathan S. Katz
On 5/24/19 9:49 AM, Heikki Linnakangas wrote:
> On 24/05/2019 16:01, Stephen Frost wrote:
>> What I was really getting at though was the ability to have multiple
>> authenticator tokens active concurrently (eg: md5 AND SCRAM), with an
>> ability to use either one (idk, md5_or_scram auth method?), and then
>> automatically set both on password change until everything is using
>> SCRAM and then remove all MD5 stuff.
> 
> Umm, that's what "md5" already does. Per documentation
> (https://www.postgresql.org/docs/current/auth-password.html):

Tested manually and verified in code, it does do that check:

/*
 * If 'md5' authentication is allowed, decide whether to perform 'md5' or
 * 'scram-sha-256' authentication based on the type of password the user
 * has.  If it's an MD5 hash, we must do MD5 authentication, and if it's a
 * SCRAM verifier, we must do SCRAM authentication.
 *
 * If MD5 authentication is not allowed, always use SCRAM.  If the user
 * had an MD5 password, CheckSCRAMAuth() will fail.
 */
	if (port->hba->auth_method == uaMD5 && pwtype == PASSWORD_TYPE_MD5)
		auth_result = CheckMD5Auth(port, shadow_pass, logdetail);
	else
		auth_result = CheckSCRAMAuth(port, shadow_pass, logdetail);


>> To ease transition from the md5 method to the newer SCRAM method, if
>> md5 is specified as a method in pg_hba.conf but the user's password on
>> the server is encrypted for SCRAM (see below), then SCRAM-based
>> authentication will automatically be chosen instead.
> 
> The migration path is:
> 
> 1. Use "md5" in pg_hba.conf, and put password_encryption='scram-sha-256'
> in postgresql.conf.
> 
> 2. Wait until all users have reset their passwords, so that all users
> have a SCRAM-SHA-256 verifier.

And "a superuser can verify this has occurred by inspecting the
pg_authid table (appropriate SQL)"

> 
> 3. Replace "md5" with "scram-sha-256" in pg_hba.conf.
> 
> Step 3 is kind of optional; once all users have a SCRAM verifier instead
> of an MD5 hash, they will all use SCRAM even without changing
> pg_hba.conf.

Verified this is true.

> It just prevents MD5 authentication in case a user forces a
> new MD5 hash into the system e.g. by changing password_encryption, or by
> setting an MD5 password explicitly with ALTER USER.

Cool. Thanks for the explanation.

I do think we should document said upgrade path, my best guess being
around here[1].

Jonathan

[1] https://www.postgresql.org/docs/current/auth-password.html





Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-05-24 Thread Joe Conway
On 5/24/19 9:33 AM, David Rowley wrote:
> On Sat, 25 May 2019 at 00:18, Joe Conway  wrote:
>> I admittedly haven't followed this thread too closely, but if having 100
>> partitions causes out of memory on pg11, that sounds like a massive
>> regression to me.
> 
> For it to have regressed it would have had to once have been better,
> but where was that mentioned?  The only thing I saw was
> non-partitioned tables compared to partitioned tables, but you can't
> really say it's a regression if you're comparing apples to oranges.


I have very successfully used multiple hundreds and even low thousands
of partitions without running out of memory under the older inheritance
based "partitioning", and declarative partitioning is supposed to be
(and we have advertised it to be) better, not worse, isn't it?

At least from my point of view if 100 partitions is unusable due to
memory leaks it is a regression. Perhaps not *technically* a regression
assuming it behaves this way in pg10 also, but I bet lots of users would
see it that way.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: initdb recommendations

2019-05-24 Thread Heikki Linnakangas

On 24/05/2019 17:02, Jonathan S. Katz wrote:
> On 5/24/19 9:49 AM, Heikki Linnakangas wrote:
>> It just prevents MD5 authentication in case a user forces a
>> new MD5 hash into the system e.g. by changing password_encryption, or by
>> setting an MD5 password explicitly with ALTER USER.
>
> Cool. Thanks for the explanation.
>
> I do think we should document said upgrade path, my best guess being
> around here[1].
>
> [1] https://www.postgresql.org/docs/current/auth-password.html

You mean, like this? From the bottom of that page :-)

> To upgrade an existing installation from md5 to scram-sha-256, after
> having ensured that all client libraries in use are new enough to
> support SCRAM, set password_encryption = 'scram-sha-256' in
> postgresql.conf, make all users set new passwords, and change the
> authentication method specifications in pg_hba.conf to scram-sha-256.

It would be nice to expand that a little bit, though:

* How do you verify if all client libraries support SCRAM? Would be good 
to mention the minimum libpq version here, at least. Can we give more 
explicit instructions? It would be nice if there was a way to write an 
entry to the log, whenever an older client connects. Not sure how you'd 
do that..


* How does one "make all users set new passwords"? Related to that, 
how do you check if all users have reset their password to SCRAM? Give 
the exact SQL needed to check that.


- Heikki




Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-05-24 Thread Tom Lane
Joe Conway  writes:
> On 5/24/19 9:33 AM, David Rowley wrote:
>> For it to have regressed it would have had to once have been better,
>> but where was that mentioned?  The only thing I saw was
>> non-partitioned tables compared to partitioned tables, but you can't
>> really say it's a regression if you're comparing apples to oranges.

> I have very successfully used multiple hundreds and even low thousands
> of partitions without running out of memory under the older inheritance
> based "partitioning", and declarative partitioning is supposed to be
> (and we have advertised it to be) better, not worse, isn't it?

Have you done the exact thing described in the test case?  I think
that's going to be quite unpleasantly memory-intensive in any version.

The real issue here is that we have designed around the assumption
that MessageContext will be used to parse and plan one single statement
before being reset.  The described usage breaks that assumption.
No matter how memory-efficient any one statement is or isn't,
if you throw enough of them at the backend without giving it a chance
to reset MessageContext, it won't end well.

So my thought, if we want to do something about this, is not "find
some things we can pfree at the end of planning" but "find a way
to use a separate context for each statement in the query string".
Maybe multi-query strings could be handled by setting up a child
context of MessageContext (after we've done the raw parsing there
and determined that indeed there are multiple queries), running
parse analysis and planning in that context, and resetting that
context after each query.

regards, tom lane




Re: BUG #15819: wrong expression in document of pgbench

2019-05-24 Thread Fabien COELHO



In example of random_zipfian,  the explanation is "which itself(2) is
produced (3/2)*2.5 = 2.76 times more frequently than 3".
"(3/2)*2.5 = 2.76" is wrong. The correct expression is "(3/2)**2.5 = 2.76".


Indeed. Attached patch to fix this typo.

--
Fabien

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ef22a484e7..e3b73a4cf5 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1561,7 +1561,7 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
   For example, random_zipfian(1, ..., 2.5) produces
   the value 1 about (2/1)**2.5 =
   5.66 times more frequently than 2, which
-  itself is produced (3/2)*2.5 = 2.76 times more
+  itself is produced (3/2)**2.5 = 2.76 times more
   frequently than 3, and so on.
  
  


Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Peter Eisentraut
On 2018-04-10 23:32, Alvaro Herrera wrote:
> To figure out, I used the attached patch (not intended for application)
> to add a backtrace to each log message, plus a couple of accusatory
> elog() calls in relation_open and ExecSetupPartitionPruneState.

What do people think about adding something like this errbacktrace()
from Álvaro's patch to core PostgreSQL?  If we could devise a way to
selectively enable it, it might be an easier way for users to provide
backtraces from errors.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: initdb recommendations

2019-05-24 Thread Jonathan S. Katz
On 5/24/19 10:26 AM, Heikki Linnakangas wrote:
> On 24/05/2019 17:02, Jonathan S. Katz wrote:
>> On 5/24/19 9:49 AM, Heikki Linnakangas wrote:
>>> It just prevents MD5 authentication in case a user forces a
>>> new MD5 hash into the system e.g. by changing password_encryption, or by
>>> setting an MD5 password explicitly with ALTER USER.
>>
>> Cool. Thanks for the explanation.
>>
>> I do think we should document said upgrade path, my best guess being
>> around here[1].
>>
>> [1] https://www.postgresql.org/docs/current/auth-password.html
> 
> You mean, like this? From the bottom of that page :-)

...yes ;) I think what I'm saying is that it should be its own section.

>> To upgrade an existing installation from md5 to scram-sha-256, after
>> having ensured that all client libraries in use are new enough to
>> support SCRAM, set password_encryption = 'scram-sha-256' in
>> postgresql.conf, make all users set new passwords, and change the
>> authentication method specifications in pg_hba.conf to scram-sha-256.
> 
> It would be nice to expand that a little bit, though:
> 
> * How do you verify if all client libraries support SCRAM? Would be good
> to mention the minimum libpq version here, at least. Can we give more
> explicit instructions? It would be nice if there was a way to write an
> entry to the log, whenever an older client connects. Not sure how you'd
> do that..

Yeah, this one is hard, because a lot of that depends on how the client
deals with not supporting SCRAM. Typically the server sends over
AuthenticationSASL and the client raises an error. All the server will
see is the connection closed, but it could be for any reason.

For example, I tested this with an unpatched asyncpg and noted similar
behavior. I'm not sure there's anything we can do given we don't know
that the client does not support SCRAM ahead of time.

I think the best we can do is mention minimums and, if we're ok with it,
link to the drivers wiki page so people can see which min. versions of
their preferred connection library support it.

> * How does one "make all users to set new passwords"? Related to that,
> how do you check if all users have reset their password to SCRAM? Give
> the exact SQL needed to check that.

Yeah this is a big one. I already hinted at the latter point, but
explaining how to change passwords is helpful too (and I feel it can
also cause quite a debate. Within psql it's a straightforward choice.
Outside of it, to do it safely you have to do a bit of extra work).

Jonathan





Re: [HACKERS] Small fix: avoid passing null pointers to memcpy()

2019-05-24 Thread Tom Lane
didier  writes:
> A smaller version removing memset in print_aligned_text function.
> The line is redundant, header_done isn't used yet, and it's either
> pg_malloc0 or null.

Hm, I see the theoretical problem ...

> Without this patch make check fails 3 tests if pg is compiled with
> -fsanitize=address,undefined

... but if that's the only evidence of an actual problem, I can't
get excited about it.  ASAN complains about many things in Postgres,
and most of them are pretty hypothetical.

regards, tom lane




Contribute - money

2019-05-24 Thread Sascha Kuhl
Hi,

Is it possible to obtain money for a contribution I give here? Or is
everything expected to be free?

Regards

Sascha Kuhl


Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Tom Lane
Peter Eisentraut  writes:
> On 2018-04-10 23:32, Alvaro Herrera wrote:
>> To figure out, I used the attached patch (not intended for application)
>> to add a backtrace to each log message, plus a couple of accusatory
>> elog() calls in relation_open and ExecSetupPartitionPruneState.

> What do people think about adding something like this errbacktrace()
> from Álvaro's patch to core PostgreSQL?  If we could devise a way to
> selectively enable it, it might be an easier way for users to provide
> backtraces from errors.

I think we did discuss it right after that, or somewhere nearby, and
concluded that the output is so imprecise that it's not really going
to be worth whatever portability issues we'd have to deal with.

I'd be all for a better version, but glibc's backtrace() just sucks,
at least given our coding style with lots of static functions.

regards, tom lane




Re: Teach pg_upgrade test to honor NO_TEMP_INSTALL

2019-05-24 Thread Andres Freund
Hi,

On 2019-05-22 13:08:41 -0700, Andres Freund wrote:
> On 2019-05-22 16:04:34 -0400, Andrew Dunstan wrote:
> > If I disable install, the buildfarm fails the upgrade check even when
> > not using NO_TEMP_INSTALL.
> > 
> > 
> > excerpts from the log:
> > sh: /home/pgl/npgl/pg_head/bfroot/HEAD/inst/bin/psql: No such file or
> > directory
> 
> That's the issue I was talking to Tom about above. Need to
> unconditionally have
> 

Andrew, after the latest set of changes, the reversed order should now
work reliably?

Greetings,

Andres Freund




Re: initdb recommendations

2019-05-24 Thread Noah Misch
On Thu, May 23, 2019 at 06:56:49PM +0200, Magnus Hagander wrote:
> On Thu, May 23, 2019, 18:54 Peter Eisentraut 
>  wrote:
> > To recap, the idea here was to change the default authentication methods
> > that initdb sets up, in place of "trust".
> >
> > I think the ideal scenario would be to use "peer" for local and some
> > appropriate password method (being discussed elsewhere) for host.
> >
> > Looking through the buildfarm, I gather that the only platforms that
> > don't support peer are Windows, AIX, and HP-UX.  I think we can probably
> > figure out some fallback or alternative default for the latter two
> > platforms without anyone noticing.  But what should the defaults be on
> > Windows?  It doesn't have local sockets, so the lack of peer wouldn't
> > matter.  But is it OK to default to a password method, or would that
> > upset people particularly?
> 
> I'm sure password would be fine there. It's what "everybody else" does
> (well sqlserver also does integrated security, but people are used to it).

Our sspi auth is a more-general version of peer auth, and it works over TCP.
It would be a simple matter of programming to support "peer" on Windows,
consisting of sspi auth with an implicit pg_ident map.  Nonetheless, I agree
password would be fine.




Re: Minimal logical decoding on standbys

2019-05-24 Thread Amit Khandekar
On Fri, 24 May 2019 at 19:26, Amit Khandekar  wrote:
> Working on the patch now 

Attached is an incremental WIP patch
handle_wal_level_changes_WIP.patch to be applied over the earlier main
patch logical-decoding-on-standby_v4_rebased.patch.

> >
> > On 2019-05-21 09:19:37 -0700, Andres Freund wrote:
> > > So I suspect we need conflict handling in xlog_redo's
> > > XLOG_PARAMETER_CHANGE case. If we there check against existing logical
> > > slots, we ought to be safe.

Yet to do this. Andres, how do you want to handle this scenario ? Just
drop all the existing logical slots like what we decided for conflict
recovery for conflicting xids ?


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


handle_wal_level_changes_WIP.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Tom Lane
I wrote:
> Peter Eisentraut  writes:
>> What do people think about adding something like this errbacktrace()
>> from Álvaro's patch to core PostgreSQL?

> I think we did discuss it right after that, or somewhere nearby, and
> concluded that the output is so imprecise that it's not really going
> to be worth whatever portability issues we'd have to deal with.

Hmm, after some digging in the archives, the closest thing I can find
is this thread:

https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com

where we discussed using libunwind instead, but people didn't like
the extra dependency.

However, I stand by the assertion that glibc's backtrace() is too
imprecise to be useful; I've experimented with it and despaired of
being able to tell where control had actually been.

regards, tom lane




Re: Teach pg_upgrade test to honor NO_TEMP_INSTALL

2019-05-24 Thread Tom Lane
Andres Freund  writes:
> Andrew, after the latest set of changes, the reversed order should now
> work reliably?

Also, Thomas should be able to revert his cfbot hack ...

regards, tom lane




Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Andres Freund
Hi,

On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
> I wrote:
> > Peter Eisentraut  writes:
> >> What do people think about adding something like this errbacktrace()
> >> from Álvaro's patch to core PostgreSQL?
> 
> > I think we did discuss it right after that, or somewhere nearby, and
> > concluded that the output is so imprecise that it's not really going
> > to be worth whatever portability issues we'd have to deal with.
> 
> Hmm, after some digging in the archives, the closest thing I can find
> is this thread:
> 
> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
> 
> where we discussed using libunwind instead, but people didn't like
> the extra dependency.

Hm, I didn't actually see that much concern about that. I still think we
should just go for libunwind. At least on debian it's likely to already
be installed:

andres@alap4:~$ apt rdepends libunwind8
libunwind8
Reverse Depends:
  Depends: libunwind-dev (= 1.2.1-9)
  Depends: linux-perf-4.16
  Depends: linux-perf-4.15
  Depends: linux-perf-4.14
  Depends: rspamd
  Depends: linux-perf-5.0
  Depends: libjulia1
  Depends: julia
  Depends: geary
  Depends: libunwind8-dbgsym (= 1.2.1-9)
  Depends: xwayland
  Depends: xvfb
  Depends: xserver-xorg-core
  Depends: xserver-xephyr
  Depends: xnest
  Depends: xdmx
  Depends: trafficserver
  Depends: tigervnc-standalone-server
  Depends: tarantool
  Depends: strace
  Depends: spring
  Depends: rspamd
  Depends: linux-perf-4.19
  Depends: libunwind-setjmp0
  Depends: libeina1a
  Depends: libjulia1
  Depends: julia
  Depends: intel-gpu-tools
  Depends: libheaptrack
  Depends: libgoogle-perftools4
  Depends: libgoogle-glog0v5
  Depends: gdnsd
  Depends: libevas1-engines-x
  Depends: libevas1

In particular strace, xserver-xorg-core, and perf are reasonably likely
to already be installed.

It's also not a large library. I'd bet if we made it an optional
build-time dependency it'd get included by just about every distro.

Greetings,

Andres Freund




Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
>> Hmm, after some digging in the archives, the closest thing I can find
>> is this thread:
>> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
>> where we discussed using libunwind instead, but people didn't like
>> the extra dependency.

> Hm, I didn't actually see that much concern about that. I still think we
> should just go for libunwind.

Is it actually better?  The basic problem with backtrace() is that it
only knows about global functions, and so reports call sites in static
functions as if they were in whatever global function physically precedes
the static one.  I think doing materially better requires depending on
debug symbols, which (at least in the Red Hat world) aren't going to
be there in a typical production situation.

regards, tom lane




Re: [HACKERS] Small fix: avoid passing null pointers to memcpy()

2019-05-24 Thread didier
Hi,

On Fri, May 24, 2019 at 5:10 PM Tom Lane  wrote:
>
> didier  writes:
> > A smaller version removing memset in print_aligned_text function.
> > The line is redundant , header_done isn't used yet and it's either
> > pg_malloc0 or null.
>
> Hm, I see the theoretical problem ...
>
> > Without this patch make check fails 3 tests if pg is compiled with
> > -fsanitize=address,undefined
>
> ... but if that's the only evidence of an actual problem, I can't
> get excited about it.  ASAN complains about many things in Postgres,
Not that much, make check has only this one.

> and most of them are pretty hypothetical.
Well, there's no point in this line even without ASAN; why call memset
on header_done but not on width_header, width_average, and so on?

ASAN complaining at runtime because memset is called with a null ptr
and a zero count is just the nudge for removing it.

Regards
Didier




Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Andres Freund
Hi,

On 2019-05-24 12:08:57 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
> >> Hmm, after some digging in the archives, the closest thing I can find
> >> is this thread:
> >> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
> >> where we discussed using libunwind instead, but people didn't like
> >> the extra dependency.
> 
> > Hm, I didn't actually see that much concern about that. I still think we
> > should just go for libunwind.
> 
> Is it actually better?

I've not looked in a while, but I think at some point it was.


> The basic problem with backtrace() is that it
> only knows about global functions, and so reports call sites in static
> functions as if they were in whatever global function physically precedes
> the static one.

Does that depend on whether the program was compiled with
-fno-omit-frame-pointer? At least some distros now compile with frame
pointers enabled, to get reasonably fast perf profiles (at a basically
immeasurable slowdown, on modern-ish CPUs).


> I think doing materially better requires depending on
> debug symbols, which (at least in the Red Hat world) aren't going to
> be there in a typical production situation.

It's still a lot easier to install debug symbols than to attach a
debugger and get a backtrace that way. Especially when the problem is
hard to reproduce.

Greetings,

Andres Freund




Re: [HACKERS] Small fix: avoid passing null pointers to memcpy()

2019-05-24 Thread Tom Lane
I wrote:
> didier  writes:
>> Without this patch make check fails 3 tests if pg is compiled with
>> -fsanitize=address,undefined

> ... but if that's the only evidence of an actual problem, I can't
> get excited about it.  ASAN complains about many things in Postgres,
> and most of them are pretty hypothetical.

BTW, I did try the described test case, so I suppose I might as well
summarize the results while I have them.  The postmaster log from
a regression run with today's HEAD shows ASAN errors at

clog.c:299:3:
indexcmds.c:1054:4:
relcache.c:5905:6:
snapmgr.c:597:2:
snapmgr.c:601:2:
xact.c:5204:3:

The above all seem to be the same ilk as the problem in print.c,
ie passing a NULL pointer with zero count to memcpy, memset, or
the like.  At least some of them are fairly old.

pg_crc32c_sse42.c:37:18:
pg_crc32c_sse42.c:44:9:

This is an intentional use of unaligned access:

 * NB: We do unaligned accesses here. The Intel architecture allows that,
 * and performance testing didn't show any performance gain from aligning
 * the begin address.

This comment is unclear about whether it would actually hurt to have a
preparatory loop to align the begin address.

... and a whole bunch of places in arrayaccess.h and arrayfuncs.c.

These seem to be down to use of AnyArrayType:

typedef union AnyArrayType
{
ArrayType   flt;
ExpandedArrayHeader xpn;
} AnyArrayType;

ASAN seems to believe that use of this union entitles the compiler to
assume 8-byte alignment even when touching fields of a short-header
array.  Maybe it's right, but this code has been like this for awhile
and we've not heard trouble reports.  I'm afraid that avoiding the
use of the union would be a notational mess, so I'm loath to change
it unless we're forced to.

This is just from the core tests, there could well be more issues in
contrib or PLs.

regards, tom lane




Re: Read-only access to temp tables for 2PC transactions

2019-05-24 Thread Konstantin Knizhnik



On 24.05.2019 11:52, Simon Riggs wrote:
> On Fri, 24 May 2019 at 01:39, Michael Paquier  wrote:
>> On Thu, May 23, 2019 at 08:54:59AM -0700, Andres Freund wrote:
>>> On 2019-05-23 12:36:09 +0100, Simon Riggs wrote:
>>>> The ONLY case where this matters is if someone does a PREPARE and then
>>>> starts doing other work on the session. Which makes no sense in the
>>>> normal workflow of a session. I'm sure there are tests that do that,
>>>> but those tests are unrepresentative of sensible usage.
>>>
>>> That's extremely common.
>>>
>>> There's no way we can forbid using a session after 2PC unconditionally,
>>> it'd break most users of 2PC.
>>
>> This does not break Postgres-XC or XL as their inner parts with a
>> COMMIT involving multiple write nodes issue a set of PREPARE
>> TRANSACTION followed by an immediate COMMIT PREPARED which are
>> transparent for the user, so the point of Simon looks sensible from
>> this angle.
>
> Maybe, but I am not discussing other products since they can be
> changed without discussion here.
>
>> However, I much agree with Andres that it is very common
>> to have PREPARE and COMMIT PREPARED issued with different sessions.  I
>> am not much into the details of XA-compliant drivers, but I think that
>> having us lose this property would be the source of many complaints.
>
> Yes, it is *very* common to have PREPARE and COMMIT PREPARED issued
> from different sessions. That is the main usage in a session pool and
> not the point I made.
>
> There are two usage patterns, with a correlation between the way 2PC
> and temp tables work:
>
> Transaction-mode session-pool: (Most common usage mode)
> * No usage of session-level temp tables (because that wouldn't work)
> * 2PC with PREPARE and COMMIT PREPARED on different sessions
> * No reason at all to hold locks on temp table after PREPARE
>
> Session-mode (Less frequent usage mode)
> * Usage of session-level temp tables
> * 2PC on same session only, i.e. no usage of session between PREPARE
>   and COMMIT PREPARED (Simon's observation)
> * No reason at all to hold locks on temp table after PREPARE (Simon's
>   conclusion)
>
> I'd like to hear from anyone that thinks my observation is incorrect
> and to explain their usage pattern so we can understand why they think
> they would execute further SQL between PREPARE and COMMIT PREPARED
> when using a single session, while at the same time using temp tables.
>
> If there really is a usage pattern there we should take note of, then
> I suggest we introduce a parameter that allows temp table locks to be
> dropped at PREPARE, so that we can use 2PC and Temp Tables with ease,
> for those that want it.
>
> --
> Simon Riggs http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From my point of view, releasing all temporary table locks after
preparing a 2PC transaction is not technically possible:
assume that this transaction has updated some tuples of a temporary
table; they are not visible to other transactions until the 2PC
transaction is committed, but since the lock is removed, other
transactions can update the same tuple.

Prohibiting a transaction from doing anything except COMMIT/ROLLBACK
PREPARED after preparing seems too arbitrary a restriction.
I do not think that "that's extremely common", but I am almost sure
that there are such cases.

The safe scenario is when the temporary table is created and dropped
inside the transaction (table created with ON COMMIT DROP). But there
is still one issue with this scenario: the first creation of a
temporary table causes creation of the pg_temp namespace, and that
cannot be undone. Another possible scenario is a temporary table
created outside the transaction with ON COMMIT DELETE ROWS. In this
case, truncation of the table on prepare will also release all locks.

Pure read-only access to temporary tables seems not very useful,
because before reading something from a temporary table, we have to
write something to it. And if reading of the temporary table is
wrapped in 2PC, then most likely writing to the temporary table also
has to be wrapped in 2PC, which is not possible with the proposed
solution.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Inconsistency between table am callback and table function names

2019-05-24 Thread Ashwin Agrawal
On Thu, May 23, 2019 at 4:32 PM Andres Freund  wrote:

> Hi,
>
> On 2019-05-14 12:11:46 -0700, Ashwin Agrawal wrote:
> > Thank you. Please find the patch to rename the agreed functions. It would
> > be good to make all consistent instead of applying exception to three
> > functions but seems no consensus on it. Given table_ api are new, we
> could
> > modify them leaving systable_ ones as is, but as objections left that as
> is.
>
> I've pushed a slightly modified version (rebase, some additional
> newlines due to the longer function names) now. Thanks for the patch!
>

Thanks a lot Andres. With pgindent run on master before the patch, I
imagine this possibly generated additional work for you.


Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-05-24 Thread Joe Conway
On 5/24/19 10:28 AM, Tom Lane wrote:
> Joe Conway  writes:
>> On 5/24/19 9:33 AM, David Rowley wrote:
>>> For it to have regressed it would have had to once have been better,
>>> but where was that mentioned?  The only thing I saw was
>>> non-partitioned tables compared to partitioned tables, but you can't
>>> really say it's a regression if you're comparing apples to oranges.
> 
>> I have very successfully used multiple hundreds and even low thousands
>> of partitions without running out of memory under the older inheritance
>> based "partitioning", and declarative partitioning is supposed to be
>> (and we have advertised it to be) better, not worse, isn't it?
> 
> Have you done the exact thing described in the test case?  I think
> that's going to be quite unpleasantly memory-intensive in any version.


Ok, fair point. Will test and report back.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: Read-only access to temp tables for 2PC transactions

2019-05-24 Thread Andres Freund
Hi,

On 2019-05-24 19:37:15 +0300, Konstantin Knizhnik wrote:
> From my point of view releasing all temporary table locks after preparing of
> 2PC transaction is not technically possible:
> assume that this transaction has updated some tuples of temporary table -
> they are not visible to other transactions until 2PC is committed,
> but since lock is removed, other transactions can update the same tuple.

I don't think tuple level actions are the problem? Those don't require
table level locks to be held.

Generally, I fail to see how locks themselves are the problem. The
problem are the catalog entries for the temp table, the relation forks,
and the fact that a session basically couldn't drop (and if created in
that transaction, use) etc the temp table after the PREPARE.

Greetings,

Andres Freund




Re: [HACKERS] proposal: schema variables

2019-05-24 Thread Pavel Stehule
Hi

čt 9. 5. 2019 v 6:34 odesílatel Pavel Stehule 
napsal:

> Hi
>
> rebased patch
>

rebase after pgindent

Regards

Pavel

>
> Regards
>
> Pavel
>
>
>


schema-variables-20190524.patch.gz
Description: application/gzip


Re: BUG #15781: subselect on foreign table (postgres_fdw) can crash (segfault)

2019-05-24 Thread Sean Johnston
Not sure if this is the right avenue to follow up on this. The patch works
fine. However, we're working on a modified version of the postgres_fdw in
which we're trying to push as much as possible to the remote nodes,
including ordering and limits. The patch causes the upper paths for the
ordering and limit to be rejected as they have no relids. I've had a quick
look at maybe how to pull in relids in the fdw private data but it's not
obvious. Obviously this isn't mainstream postgres but just wondering if
anyone has looked into issues with regards to pushing order/limit to remote
nodes for fdw.

On Sat, Apr 27, 2019 at 3:47 PM Tom Lane  wrote:

> Etsuro Fujita  writes:
> > On Sat, Apr 27, 2019 at 2:10 AM Tom Lane  wrote:
> >> If we don't want to change what the core code does with fdw_exprs,
> >> I think the only way to fix it is to hack postgres_fdw so that it
> >> won't generate plans involving the problematic case.
>
> > Seems reasonable.
>
> >> See attached.
>
> > I read the patch.  It looks good to me.  I didn't test it, though.
>
> Thanks for looking!  Have a good vacation ...
>
> regards, tom lane
>


Re: BUG #15781: subselect on foreign table (postgres_fdw) can crash (segfault)

2019-05-24 Thread Tom Lane
Sean Johnston  writes:
> Not sure if this is the right avenue to follow up on this. The patch works
> fine. However, we're working on a modified version of the postgres_fdw in
> which we're trying to push as much as possible to the remote nodes,
> including ordering and limits. The patch causes the upper paths for the
> ordering and limit to be rejected as they have no relids.

Uh, what?  If you're speaking of 8cad5adb9, the only case I'm aware of
where it might be a performance issue is if you have "GROUP BY
local-expr", which seems like a pretty weird thing to need to push
to the remote side, since the local expression would be effectively
a constant on the far end.

You could imagine working around it by discarding such GROUP BY
columns in what's to be sent to the far end, and if you end up with
an empty GROUP BY clause, sending "HAVING TRUE" instead to keep the
semantics the same.  But I'm uninterested in stacking yet more
klugery atop 8cad5adb9 so far as the community code is concerned.
The right way forward, as noted in the commit message, is to revert
that patch in favor of adding some API that will let the FDW control
how setrefs.c processes a ForeignScan's expressions.  We just can't
do that in released branches :-(.

It's possible that we should treat this issue as an open item for v12
instead of letting it slide to v13 or later.  But I think people would
only be amenable to that if you can point to a non-silly example where
failure to push the GROUP BY creates a performance issue.

regards, tom lane




Re: proposal: new polymorphic types - commontype and commontypearray

2019-05-24 Thread Pavel Stehule
Hi

út 5. 3. 2019 v 18:37 odesílatel Pavel Stehule 
napsal:

>
>
> út 5. 3. 2019 v 15:35 odesílatel Tom Lane  napsal:
>
>> David Steele  writes:
>> > This thread has been very quiet for a month.  I agree with Andres [1]
>> > that we should push this to PG13.
>>
>> I think the main thing it's blocked on is disagreement on what the
>> type name should be, which is kind of a silly thing to get blocked on,
>> but nonetheless it's important ...
>>
>
> I sent some others possible names, but probably this mail was forgotten
>
> What about "ctype" like shortcut for common type? carraytype,
> cnonarraytype?
>

rebase for PostgreSQL 13

Regards

Pavel


> Regards
>
> Pavel
>
>>
>> regards, tom lane
>>
>
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index fc300f605f..b010b2066e 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4771,6 +4771,14 @@ SELECT * FROM pg_attribute
 anyrange

 
+   
+commontype
+   
+
+   
+commontypearray
+   
+

 void

@@ -4880,6 +4888,34 @@ SELECT * FROM pg_attribute
 ).

 
+   
+commontype
+Indicates that a function accepts any data type. Values
+are converted to real common type.
+(see ).
+   
+
+   
+commontypearray
+Indicates that a function accepts any array data type. The
+elements of array are converted to common type of these values.
+(see ).
+   
+
+   
+commontypenonarray
+Indicates that a function accepts any non-array data type
+(see ).
+   
+
+   
+commontyperange
+Indicates that a function accepts any range data type
+(see  and
+). The subtype can be used for
+deduction of common type.
+   
+

 cstring
 Indicates that a function accepts or returns a null-terminated C string.
diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index b5e59d542a..4d2129d8b7 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -231,7 +231,7 @@
 
  Five pseudo-types of special interest are anyelement,
  anyarray, anynonarray, anyenum,
- and anyrange,
+ anyrange, commontype and commontypearray,
  which are collectively called polymorphic types.
  Any function declared using these types is said to be
  a polymorphic function.  A polymorphic function can
@@ -267,6 +267,15 @@
  be an enum type.
 
 
+
+ The second family of polymorphic types consists of commontype and
+ commontypearray. These types are similar to the types
+ anyelement and anyarray. Arguments declared
+ as anyelement require the same real type of passed values. For
+ commontype arguments a common type is selected, and later
+ all these arguments are cast to this common type.
+
+
 
  Thus, when more than one argument position is declared with a polymorphic
  type, the net effect is that only certain combinations of actual argument
diff --git a/src/backend/catalog/pg_aggregate.c b/src/backend/catalog/pg_aggregate.c
index 7cab039ded..b3e7eb391b 100644
--- a/src/backend/catalog/pg_aggregate.c
+++ b/src/backend/catalog/pg_aggregate.c
@@ -136,7 +136,7 @@ AggregateCreate(const char *aggName,
 	hasInternalArg = false;
 	for (i = 0; i < numArgs; i++)
 	{
-		if (IsPolymorphicType(aggArgTypes[i]))
+		if (IsPolymorphicTypeAny(aggArgTypes[i]))
 			hasPolyArg = true;
 		else if (aggArgTypes[i] == INTERNALOID)
 			hasInternalArg = true;
@@ -146,7 +146,7 @@ AggregateCreate(const char *aggName,
 	 * If transtype is polymorphic, must have polymorphic argument also; else
 	 * we will have no way to deduce the actual transtype.
 	 */
-	if (IsPolymorphicType(aggTransType) && !hasPolyArg)
+	if (IsPolymorphicTypeAny(aggTransType) && !hasPolyArg)
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
  errmsg("cannot determine transition data type"),
@@ -156,7 +156,7 @@ AggregateCreate(const char *aggName,
 	 * Likewise for moving-aggregate transtype, if any
 	 */
 	if (OidIsValid(aggmTransType) &&
-		IsPolymorphicType(aggmTransType) && !hasPolyArg)
+		IsPolymorphicTypeAny(aggmTransType) && !hasPolyArg)
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
  errmsg("cannot determine transition data type"),
@@ -492,7 +492,7 @@ AggregateCreate(const char *aggName,
 	 * that itself violates the rule against polymorphic result with no
 	 * polymorphic input.)
 	 */
-	if (IsPolymorphicType(finaltype) && !hasPolyArg)
+	if (IsPolymorphicTypeAny(finaltype) && !hasPolyArg)
 		ereport(ERROR,
 (errcode(ERRCODE_DATATYPE_MISMATCH),
  errmsg("cannot determine result data type"),
diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index 002584b941..fd4bfca9b5 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -103,6 +103,10 @@ ProcedureCreate(const char *procedureName,
 	b

Nitpick about assumptions in indexam and tableam

2019-05-24 Thread Mark Dilger
Hackers,

The return value of RegisterSnapshot is being ignored in a
few places in indexam.c and tableam.c, suggesting an
intimate knowledge of the inner workings of the snapshot
manager from these two files.  I don't think that is particularly
wise, and I don't see a performance justification for the way
it is presently coded.  There are no live bugs caused by this
that I can see, but I would still like it cleaned up.

Inside index_beginscan_parallel:

  snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
  RegisterSnapshot(snapshot);
  scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
  pscan, true);

It happens to be true in the current implementation of the
snapshot manager that restored snapshots will have their
'copied' field set to true, and that the RegisterSnapshot
function will in that case return the same snapshot that
it was handed, so the snapshot handed to index_beginscan_internal
turns out to be the right one.  But if RegisterSnapshot were
changed to return a different copy of the snapshot, this code
would break.

There is a similar level of knowledge in table_beginscan_parallel,
which for brevity I won't quote here.

The code in table_scan_update_snapshot appears even more
brittle to me.  The only function in the core code base that
calls table_scan_update_snapshot is ExecBitmapHeapInitializeWorker,
and it does so right after restoring the snapshot that it hands
to table_scan_update_snapshot, so the fact that
table_scan_update_snapshot then ignores the return value
of RegisterSnapshot on that snapshot happens to be ok.  If
some other code were changed to call this function, it is not
clear that it would work out so well.

I propose that attached patch.

mark


snapshot.patch.1
Description: Binary data


Generated columns and indexes

2019-05-24 Thread Florian Weimer
The documentation for generated columns should probably say whether
you can create indexes on them.




Re: BUG #15781: subselect on foreign table (postgres_fdw) can crash (segfault)

2019-05-24 Thread Etsuro Fujita
On Sat, May 25, 2019 at 2:19 AM Sean Johnston
 wrote:
> Obviously this isn't mainstream postgres but just wondering if anyone has 
> looked into issues with regards to pushing order/limit to remote nodes for 
> fdw.

In PostgreSQL 12 Beta 1 released this week [1], we can push down ORDER
BY/LIMIT to the remote PostgreSQL server.  Give it a try!

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/about/news/1943/




Re: Suppressing noise in successful check-world runs

2019-05-24 Thread Peter Geoghegan
On Wed, May 22, 2019 at 3:57 PM Tom Lane  wrote:
> I experimented with the attached quick-hack patch to make pg_regress
> suppress notices from its various initial DROP/CREATE IF [NOT] EXISTS
> commands.  I'm not entirely convinced whether suppressing them is
> a good idea though.  Perhaps some hack with effects confined to
> pg_upgrade's test would be better.  I don't have a good idea what
> that would look like, however.
>
> Or we could just say this isn't annoying enough to fix.

I think it's worth fixing.

-- 
Peter Geoghegan




Re: Suppressing noise in successful check-world runs

2019-05-24 Thread Peter Geoghegan
On Fri, May 24, 2019 at 12:31 PM Peter Geoghegan  wrote:
>
> On Wed, May 22, 2019 at 3:57 PM Tom Lane  wrote:
> > I experimented with the attached quick-hack patch to make pg_regress
> > suppress notices from its various initial DROP/CREATE IF [NOT] EXISTS
> > commands.  I'm not entirely convinced whether suppressing them is
> > a good idea though.  Perhaps some hack with effects confined to
> > pg_upgrade's test would be better.  I don't have a good idea what
> > that would look like, however.
> >
> > Or we could just say this isn't annoying enough to fix.
>
> I think it's worth fixing.

My development machine has 8 logical cores, and like you I only see
the NOTICE from pg_upgrade's tests with "-j10":

pg@bat:/code/postgresql/patch/build$ time make check-world -j10 >/dev/null
NOTICE:  database "regression" does not exist, skipping
make check-world -j10 > /dev/null  86.40s user 34.10s system 140% cpu
1:25.94 total

However, I see something else with "-j16", even after a precautionary
clean + rebuild:

pg@bat:/code/postgresql/patch/build$ time make check-world -j16 >/dev/null
NOTICE:  database "regression" does not exist, skipping
pg_regress: could not open file
"/code/postgresql/patch/build/src/test/regress/regression.diffs" for
reading: No such file or directory
make check-world -j16 > /dev/null  96.35s user 37.45s system 152% cpu
1:27.49 total

I suppose this might be because of a pg_regress/make file
"regression.diffs" race. This is also a problem for my current
workflow for running "make check-world" in parallel [1], though only
when there is definitely a regression.diffs file with actual
regressions -- there is no regression that I'm missing here, and as
far as I know this output about "regression.diffs" is just more noise.
I had intended to figure out a way of keeping "regression.diffs" with
my existing workflow, since losing the details of a test failure is a
real annoyance. Especially when there is a test that doesn't fail
reliably.

[1] https://wiki.postgresql.org/wiki/Committing_checklist#Basic_checks
-- 
Peter Geoghegan




Question about some changes in 11.3

2019-05-24 Thread Mat Arye
Hi,

11.3 included some change to partition table planning. Namely
commit 925f46f ("Fix handling of targetlist SRFs when scan/join relation is
known empty.") seems to redo all paths for partitioned tables
in apply_scanjoin_target_to_paths. It clears the paths in:

```
 if (rel_is_partitioned)
rel->pathlist = NIL
```

Then the code rebuilds the paths. However, the rebuilt append path
never gets the set_rel_pathlist_hook called. Thus, the work that hook
did previously gets thrown away, and the rebuilt append path can never
be influenced by this hook. Is this intended behavior? Am I missing
something?

Thanks,
Mat
TimescaleDB


Re: Zedstore - compressed in-core columnar storage

2019-05-24 Thread Ashwin Agrawal
On Thu, May 23, 2019 at 7:30 PM Ajin Cherian  wrote:

> Hi Ashwin,
>
> - how to pass the "column projection list" to table AM? (as stated in
>   initial email, currently we have modified table am API to pass the
>   projection to AM)
>
> We were working on a similar columnar storage using pluggable APIs; one
> idea that we thought of was to modify the scan slot based on the targetlist
> to have only the relevant columns in the scan descriptor. This way the
> table AMs are passed a slot with only relevant columns in the descriptor.
> Today we do something similar to the result slot using
> ExecInitResultTypeTL(), now do it to the scan tuple slot as well. So
> somewhere after creating the scan slot using ExecInitScanTupleSlot(), call
> a table am handler API to modify the scan tuple slot based on the
> targetlist, a probable name for the new table am handler would be:
> exec_init_scan_slot_tl(PlanState *planstate, TupleTableSlot *slot).
>

Interesting.

Though this reads as a hacky, not clean, approach to me. Reasons:

- The memory allocation and initialization for the slot descriptor was
  done in ExecInitScanTupleSlot().  exec_init_scan_slot_tl() would
  redo a lot of that work. ExecInitScanTupleSlot() ideally just points
  to the tupleDesc from the Relation object, but exec_init_scan_slot_tl()
  will free the existing tupleDesc and reallocate a fresh one. Plus, it
  can't point to the Relation's tuple desc but essentially needs to
  craft one.

- As discussed in thread [1], several places want to use different
  slots for the same scan, so that means will have to modify the
  descriptor every time on such occasions even if it remains the same
  throughout the scan. Some extra code can be added to keep around old
  tupledescriptor and then reuse for next slot, but that seems again
  added code complexity.

- AM needs to know the attnum in terms of relation's attribute number
  to scan. How would tupledesc convey that? Like TupleDescData's attrs
  currently carries info for attnum at attrs[attnum - 1]. If TupleDesc
  needs to convey random attributes to scan, seems this relationship
  has to be broken. attrs[offset] will provide info for some attribute
  in relation, means offset != (attrs->attnum + 1). Which I am not
  sure how many places in code rely on that logic to get information.

- The tupledesc provides lot of information not just attribute numbers
  to scan. Like it provides information in TupleConstr about default
  value for column. If AM layer has to modify existing slot's
  tupledesc, it would have to copy over such information as well. This
  information today is fetched using attnum as offset value in
  constr->missing array. If this information will be retained how will
  the constr array constructed? Will the array contain only values for
  columns to scan or will contain constr array as is from Relation's
  tuple descriptor as it does today. Seems will be overhead to
  construct the constr array fresh and if not constructing fresh seems
  will have mismatch between natt and array elements.

It seems that with the proposed exec_init_scan_slot_tl() API, we will
have to call it after beginscan and before calling getnextslot, to
provide the column projection list to the AM. The special dedicated
API we have for Zedstore to pass down the column projection list needs
the same calling convention, which is the reason I don't like it and
am trying to find an alternative. But at least the API we added for
Zedstore seems much simpler, more generic and flexible in comparison,
as it lets the AM decide what it wishes to do with the list. The AM
can fiddle with the slot's TupleDescriptor if it wishes, or can handle
the column projection some other way.

> So this way the scan am handlers like getnextslot is passed a slot only
> having the relevant columns in the scan descriptor. One issue though is
> that the beginscan is not passed the slot, so if some memory allocation
> needs to be done based on the column list, it can't be done in beginscan.
> Let me know what you think.
>

Yes, ideally we would like to have this information available in
beginscan if possible. But if it can't be, then it seems fine to delay
such allocations to the first calls to getnextslot and friends; that's
how we do it today for Zedstore.

1]
https://www.postgresql.org/message-id/20190508214627.hw7wuqwawunhynj6%40alap3.anarazel.de


Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Tomas Vondra

On Fri, May 24, 2019 at 09:24:28AM -0700, Andres Freund wrote:

Hi,

On 2019-05-24 12:08:57 -0400, Tom Lane wrote:

Andres Freund  writes:
> On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
>> Hmm, after some digging in the archives, the closest thing I can find
>> is this thread:
>> 
https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
>> where we discussed using libunwind instead, but people didn't like
>> the extra dependency.

> Hm, I didn't actually see that much concern about that. I still think we
> should just go for libunwind.

Is it actually better?


I've not looked in a while, but I think at some point it was.



The basic problem with backtrace() is that it
only knows about global functions, and so reports call sites in static
functions as if they were in whatever global function physically precedes
the static one.


Does that depend on whether the program was compiled with
-fno-omit-frame-pointer? At least some distros now compile with frame
pointers enabled, to get reasonably fast perf profiles (at a basically
immeasurable slowdown, on modern-ish CPUs).



I doubt that, because if that were the case we'd not be able to get
accurate profiles from perf, no? And AFAICS that's not the case,
irrespective of whether -fno-omit-frame-pointer is used.





I think doing materially better requires depending on
debug symbols, which (at least in the Red Hat world) aren't going to
be there in a typical production situation.


It's still a lot easier to install debug symbols than to attach a
debugger and get a backtrace that way. Especially when the problem is
hard to reproduce.



Right. Debugger requires interaction with a running process, while
having it integrated would make that unnecessary.

But I think Peter also suggested this might require the ability to
selectively enable the backtraces, and I think he's right. I doubt we
want to log them for every log message, right?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: POC: GROUP BY optimization

2019-05-24 Thread Tomas Vondra

On Fri, May 24, 2019 at 02:10:48PM +0200, Dmitry Dolgov wrote:

On Fri, May 3, 2019 at 11:55 PM Tomas Vondra wrote:

I don't recall the exact details of the discussion, since most of it
happened almost a year ago, but the main concern about the original
costing approach is that it very much assumes uniform distribution.

For example if you have N tuples with M groups (which are not computed
using estimate_num_groups IIRC, and we can hardly do much better), then
the patch assumed each groups is ~N/M rows and used that for computing
the number of comparisons for a particular sequence of ORDER BY clauses.

But that may result in pretty significant errors in cases with a couple
of very large groups.


Yes, you're right, the current version of the patch assumes a uniform
distribution of values between these M groups. After some thinking I've got an
idea: maybe it's better not to try to find out what would be acceptable for
both uniform and non-uniform distributions, but simply not reorder at all if
there are any significant deviations from what seems to be the "best case",
namely when values are distributed among groups relatively uniformly and there
are no doubts about how complicated it is to compare them.

Saying that, it's possible on top of the current implementation to check:

* dependency coefficient between columns (if I understand correctly, a non-
 uniform distribution of values between the groups, a.k.a. a few large groups,
 should be visible from extended statistics as a significant dependency)

* largest/smallest group in MCV doesn't differ too much from an expected
 "average" group

* average width and comparison procost are similar

If everything fits (which I hope would be most of the time) - apply the
reordering, otherwise use whatever order was specified (which leaves room for
manual reordering in relatively rare situations). Does it make sense?



I think those are interesting sources of information. I don't know if it
will be sufficient, but I think it's worth exploring.

One of the issues with this approach, however, will be selecting the
threshold values. That is, how do you decide whether a given
functional dependency is "too strong" or the largest/smallest MCV item
differs too much from the average one?

If you pick a particular threshold value, there'll be a sudden behavior
change at some point, and that's very annoying. For example, you might
pick 0.5 as a threshold for "strong" functional dependencies. There's
little reason why 0.4999 should be weak and 0.5001 should be "strong".

This may lead to sudden behavior changes after ANALYZE, for example.

So I think we need to find a way to make this part of the costing model,
which is how we deal with such cases elsewhere.


But what I think we could do is using largest possible group instead of
the average one. So for example when estimating number of comparisons
for columns (c1,..,cN), you could look at MCV lists for these columns
and compute

m(i) = Max(largest group in MCV list for i-th column)

and then use Min(m(1), ..., m(k)) when estimating the number of
comparisons.


I see the idea, but I'm a bit confused about how to get the largest group from
an MCV list. I mean, there is a list of most common values per column with
frequencies, and similar stuff for multi-column statistics, but how do I get
the size of those groups?


You can just multiply the frequency by the number of rows in the table,
that gives you the size of the group. Or an estimate of it, because
there might be a filter condition involved.
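To make that suggestion concrete, here is a small standalone sketch (plain C, not planner code; the MCV frequency arrays and the reltuples value are made-up inputs) of deriving a pessimistic group-size estimate from per-column MCV lists, i.e. m(i) = max MCV frequency of column i times the row count, then Min(m(1), ..., m(k)):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Largest plausible group for one column: the biggest MCV frequency
 * times the (estimated) number of rows in the table.
 */
static double
largest_group_for_column(const double *mcv_freqs, size_t n, double reltuples)
{
	double		max_freq = 0.0;

	for (size_t i = 0; i < n; i++)
		if (mcv_freqs[i] > max_freq)
			max_freq = mcv_freqs[i];
	return max_freq * reltuples;
}

/*
 * Min(m(1), ..., m(k)): sorting on any one column already splits the
 * input into groups no larger than that column's largest group, so the
 * smallest of the per-column maxima bounds the group size to use when
 * estimating the number of comparisons.
 */
static double
pessimistic_group_size(const double *const *mcv, const size_t *nmcv,
					   size_t ncols, double reltuples)
{
	double		est = reltuples;	/* no info: assume one giant group */

	for (size_t c = 0; c < ncols; c++)
	{
		double		m = largest_group_for_column(mcv[c], nmcv[c], reltuples);

		if (m < est)
			est = m;
	}
	return est;
}
```

A filter condition would simply scale reltuples down before calling this.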


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Suppressing noise in successful check-world runs

2019-05-24 Thread Tom Lane
Peter Geoghegan  writes:
> My development machine has 8 logical cores, and like you I only see
> the NOTICE from pg_upgrade's tests with "-j10":

> pg@bat:/code/postgresql/patch/build$ time make check-world -j10 >/dev/null
> NOTICE:  database "regression" does not exist, skipping
> make check-world -j10 > /dev/null  86.40s user 34.10s system 140% cpu
> 1:25.94 total

> However, I see something else with "-j16", even after a precautionary
> clean + rebuild:

> pg@bat:/code/postgresql/patch/build$ time make check-world -j16 >/dev/null
> NOTICE:  database "regression" does not exist, skipping
> pg_regress: could not open file
> "/code/postgresql/patch/build/src/test/regress/regression.diffs" for
> reading: No such file or directory
> make check-world -j16 > /dev/null  96.35s user 37.45s system 152% cpu
> 1:27.49 total

Yes, I see that too with sufficiently high -j.  I believe this is
what Noah was trying to fix in bd1592e85, but that patch evidently
needs a bit more work :-(

regards, tom lane




Re: Teach pg_upgrade test to honor NO_TEMP_INSTALL

2019-05-24 Thread Andrew Dunstan


On 5/24/19 11:22 AM, Andres Freund wrote:
> Hi,
>
> On 2019-05-22 13:08:41 -0700, Andres Freund wrote:
>> On 2019-05-22 16:04:34 -0400, Andrew Dunstan wrote:
>>> If I disable install, the buildfarm fails the upgrade check even when
>>> not using NO_TEMP_INSTALL.
>>>
>>>
>>> excerpts from the log:
>>> sh: /home/pgl/npgl/pg_head/bfroot/HEAD/inst/bin/psql: No such file or
>>> directory
>> That's the issue I was talking to Tom about above. Need to
>> unconditionally have
>> 
> Andrew, after the latest set of changes, the reversed order should now
> work reliably?
>


With the latest changes I don't get the above failure:


andrew@emma:pg_head (master)$ ~/bf/client-code/run_build.pl 
--skip-steps=install
master:HEAD  [19:21:45] creating vpath build dir
/home/pgl/npgl/pg_head/bfroot/HEAD/pgsql.build ...
master:HEAD  [19:21:45] running configure ...
master:HEAD  [19:21:52] running make ...
master:HEAD  [19:23:42] running make check ...
master:HEAD  [19:24:37] running make contrib ...
master:HEAD  [19:24:45] running make testmodules ...
master:HEAD  [19:24:45] checking pg_upgrade
master:HEAD  [19:26:41] checking test-decoding
master:HEAD  [19:26:59] running make ecpg check ...
master:HEAD  [19:27:25] OK


cheers


andrew


-- 
Andrew Dunstan   https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Suppressing noise in successful check-world runs

2019-05-24 Thread Peter Geoghegan
On Fri, May 24, 2019 at 4:18 PM Tom Lane  wrote:
> Yes, I see that too with sufficiently high -j.  I believe this is
> what Noah was trying to fix in bd1592e85, but that patch evidently
> needs a bit more work :-(

It would be nice if this was fixed, but I don't see a problem when I
use the optimum number of jobs, so I don't consider it to be urgent.

I'm happy with the new approach, since it avoids the problem of
regression.diffs files that get deleted before I have a chance to take
a look. I should thank Noah for his work on this.

-- 
Peter Geoghegan




Confusing error message for REINDEX TABLE CONCURRENTLY

2019-05-24 Thread Ashwin Agrawal
CREATE TABLE circles (c circle, EXCLUDE USING gist (c WITH &&));

REINDEX TABLE CONCURRENTLY circles;
WARNING:  cannot reindex exclusion constraint index "public.circles_c_excl"
concurrently, skipping
NOTICE:  table "circles" has no indexes
REINDEX

The message "table has no indexes" is confusing, as the warning above it
states that the table has an index; it was merely skipped by reindex.

So, whenever REINDEX TABLE CONCURRENTLY skips reindexing for any reason
(an exclusion constraint or an invalid index), it reports that the table
has no indexes. Looking at the behavior of non-concurrent REINDEX, it
emits the NOTICE only if the table really has no indexes (since it has
no skip cases).

We need to decide what we really wish to communicate here: that the
table has no indexes, or just that the reindex was *not* performed; or
keep it simple and avoid emitting anything at all. If we skip any
indexes we emit a WARNING anyway, so that should be sufficient and
nothing more needs to be conveyed.

In case we wish to communicate that no reindex was performed, what do we
notify for empty tables?

It seems best to just emit the NOTICE "table xxx has no indexes" when
there really is no index, for both the concurrent and non-concurrent
cases, to make it consistent and less confusing, and leave it at that.
Attaching a patch to do just that. Thoughts?
From dc8119abe180cc14b0720c4de79495de4c62bf12 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 24 May 2019 16:34:48 -0700
Subject: [PATCH v1] Avoid confusing error message for REINDEX CONCURRENTLY.

REINDEX CONCURRENTLY skips reindexing exclusion constraint or invalid
indexes. But in such cases it incorrectly reported that the table has
no indexes. Make the behavior consistent with the non-concurrent
REINDEX command by emitting the notice only if the table has no indexes.
---
 src/backend/commands/indexcmds.c   | 13 -
 src/test/regress/expected/create_index.out |  1 -
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 40ea629ffe7..402440ebad0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2630,7 +2630,6 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 	foreach(l, relids)
 	{
 		Oid			relid = lfirst_oid(l);
-		bool		result;
 
 		StartTransactionCommand();
 		/* functions in indexes may want a snapshot set */
@@ -2638,11 +2637,12 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 
 		if (concurrent)
 		{
-			result = ReindexRelationConcurrently(relid, options);
+			ReindexRelationConcurrently(relid, options);
 			/* ReindexRelationConcurrently() does the verbose output */
 		}
 		else
 		{
+			bool		result;
 			result = reindex_relation(relid,
 	  REINDEX_REL_PROCESS_TOAST |
 	  REINDEX_REL_CHECK_CONSTRAINTS,
@@ -2656,7 +2656,6 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 
 			PopActiveSnapshot();
 		}
-
 		CommitTransactionCommand();
 	}
 	StartTransactionCommand();
@@ -2693,6 +2692,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	char	   *relationName = NULL;
 	char	   *relationNamespace = NULL;
 	PGRUsage	ru0;
+	bool rel_has_index = false;
 
 	/*
 	 * Create a memory context that will survive forced transaction commits we
@@ -2759,6 +2759,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	Relation	indexRelation = index_open(cellOid,
 		   ShareUpdateExclusiveLock);
 
+	rel_has_index = true;
 	if (!indexRelation->rd_index->indisvalid)
 		ereport(WARNING,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -2805,6 +2806,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		Relation	indexRelation = index_open(cellOid,
 			   ShareUpdateExclusiveLock);
 
+		rel_has_index = true;
 		if (!indexRelation->rd_index->indisvalid)
 			ereport(WARNING,
 	(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -2843,6 +2845,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 			 errmsg("cannot reindex a system catalog concurrently")));
 
+rel_has_index = true;
 /* Save the list of relation OIDs in private context */
 oldcontext = MemoryContextSwitchTo(private_context);
 
@@ -2873,11 +2876,11 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 			break;
 	}
 
-	/* Definitely no indexes, so leave */
+	/* Definitely no indexes to rebuild, so leave */
 	if (indexIds == NIL)
 	{
 		PopActiveSnapshot();
-		return false;
+		return rel_has_index;
 	}
 
 	Assert(heapRelationIds != NIL);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c8bc6be0613..9a9401c280e 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2150,7 +2150,6 @@ DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
 -- The invalid index is not processed when running REINDEX TABLE.
 REINDEX TABLE CONCURRENTLY c

Re: Indexing - comparison of tree structures

2019-05-24 Thread Jonah H. Harris
T-tree (and variants) are index types commonly associated with in-memory
database management systems and rarely, if ever, used with on-disk
databases. There has been a lot of research into more modern
cache-conscious/oblivious B-trees that perform equally well or better
than the T-tree. What's the use-case?

On Fri, May 24, 2019 at 5:38 AM Sascha Kuhl  wrote:

> Hi,
>
> I compared two data structures realistically by time, after estimating big
> O. T-tree outperforms b-tree, which is commonly used, for a medium size
> table. Lehmann and Carey showed the same, earlier.
>
> Can you improve indexing by this?
>
> Understandably
>
> Sascha Kuhl
>
-- 
Jonah H. Harris


Re: POC: converting Lists into arrays

2019-05-24 Thread Tom Lane
Here's a new version of the Lists-as-arrays patch.  It's rebased up to
HEAD, and I also realized that I could fix the problem with multiple
evaluation of the List arguments of foreach etc. by using structure
assignment.  So that gets rid of a large chunk of the semantic gotchas
that were in the previous patch.  You still have to be careful about
code that deletes list entries within a foreach() over the list ---
but nearly all such code is using list_delete_cell, which means
you'll have to touch it anyway because of the API change for that
function.

Previously, the typical logic for deletion-within-a-loop involved
either advancing or not advancing a "prev" pointer that was used
with list_delete_cell.  The way I've recoded that here changes those
loops to use an integer list index that gets incremented or not.

Now, it turns out that the new formulation of foreach() is really
strictly equivalent to

for (int pos = 0; pos < list_length(list); pos++)
{
whatever-type item = list_nth(list, pos);
...
}

which means that it could cope fine with deletion of the current
list element if we were to provide some supported way of not
incrementing the list index counter.  That is, instead of
code that looks more or less like this:

for (int pos = 0; pos < list_length(list); pos++)
{
whatever-type item = list_nth(list, pos);
...
if (delete_cur)
{
list = list_delete_nth_cell(list, pos);
pos--;   /* keep loop in sync with deletion */
}
}

we could write, say:

foreach(lc, list)
{
whatever-type item = lfirst(lc);
...
if (delete_cur)
{
list = list_delete_cell(list, lc);
        foreach_backup(lc); /* keep loop in sync with deletion */
}
}

which is the same thing under the hood.  I'm not quite sure if that way
is better or not.  It's more magical than explicitly manipulating a list
index, but it's also shorter and therefore less subject to typos.

regards, tom lane



reimplement-List-as-array-4.patch.gz
Description: reimplement-List-as-array-4.patch.gz


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-05-24 Thread Noah Misch
On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> Following this direction, the attached PoC works *at least for*
> the wal_optimization TAP tests, but doing pending flush not in
> smgr but in relcache.

This task, syncing files created in the current transaction, is not the kind
of task normally assigned to a cache.  We already have a module, storage.c,
that maintains state about files created in the current transaction.  Why did
you use relcache instead of storage.c?

On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> This is a tidier version of the patch.

> - Move the substantial work to table/index AMs.
> 
>   Each AM can decide whether to support WAL skip or not.
>   Currently heap and nbtree support it.

Why would an AM find it important to disable WAL skip?




Re: Confusing error message for REINDEX TABLE CONCURRENTLY

2019-05-24 Thread David Rowley
On Sat, 25 May 2019 at 12:06, Ashwin Agrawal  wrote:
> Seems might be just emit the NOTICE "table xxx has no index", if really no 
> index for concurrent and non-concurrent case, make it consistent, less 
> confusing and leave it there. Attaching the patch to just do that. Thoughts?

Would it not be better just to change the error message for the
concurrent case so that it reads: "table \"%s\" has no indexes that
can be concurrently reindexed"

Otherwise, what you have now is still confusing for partitioned tables:

postgres=# create table listp (a int primary key) partition by list(a);
CREATE TABLE
postgres=# REINDEX TABLE CONCURRENTLY listp;
psql: WARNING:  REINDEX of partitioned tables is not yet implemented,
skipping "listp"
psql: NOTICE:  table "listp" has no indexes
REINDEX

Also, I think people probably will care more about the fact that
nothing was done for that table than whether the table happens to
have no indexes. For the non-concurrent case, that just happened to
be the same thing.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Fix link for v12

2019-05-24 Thread Amit Kapila
On Thu, May 23, 2019 at 10:56 PM Euler Taveira  wrote:
>
> Hi,
>
> I noticed that v12 release notes is referencing the wrong GUC. It
> should be recovery_target_timeline instead of recovery_target_time.
> Patch attached.
>

Your patch looks correct to me.  I will commit it in some time unless
someone does it before or sees any problem with me committing this.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: converting Lists into arrays

2019-05-24 Thread David Rowley
On Sat, 25 May 2019 at 12:53, Tom Lane  wrote:
> Now, it turns out that the new formulation of foreach() is really
> strictly equivalent to
>
> for (int pos = 0; pos < list_length(list); pos++)
> {
> whatever-type item = list_nth(list, pos);
> ...
> }
>
> which means that it could cope fine with deletion of the current
> list element if we were to provide some supported way of not
> incrementing the list index counter.  That is, instead of
> code that looks more or less like this:
>
> for (int pos = 0; pos < list_length(list); pos++)
> {
> whatever-type item = list_nth(list, pos);
> ...
> if (delete_cur)
> {
> list = list_delete_nth_cell(list, pos);
> pos--;   /* keep loop in sync with deletion */
> }
> }
>
> we could write, say:
>
> foreach(lc, list)
> {
> whatever-type item = lfirst(lc);
> ...
> if (delete_cur)
> {
> list = list_delete_cell(list, lc);
> foreach_backup(lc); /* keep loop in sync with deletion */
> }
> }
>
> which is the same thing under the hood.  I'm not quite sure if that way
> is better or not.  It's more magical than explicitly manipulating a list
> index, but it's also shorter and therefore less subject to typos.

If we're doing an API break for this, wouldn't it be better to come up
with something that didn't have to shuffle list elements around every
time one is deleted?

For example, we could have a foreach_delete() that instead of taking a
pointer to a ListCell, it took a ListDeleteIterator which contained a
ListCell pointer and a Bitmapset, then just have a macro that marks a
list item as deleted (list_delete_current(di)) and have a final
cleanup at the end of the loop.

The cleanup operation can still use memmove, but just only move up
until the next bms_next_member on the deleted set, something like
(handwritten and untested):

void
list_finalize_delete(List *list, ListDeleteIterator *di)
{
int srcpos, curr, tarpos;

/* Zero the source and target list position markers */
srcpos = tarpos = 0;
curr = -1;
    while ((curr = bms_next_member(di->deleted, curr)) >= 0)
{
int n = curr - srcpos;
if (n > 0)
{
memmove(&list->elements[tarpos], &list->elements[srcpos],
n * sizeof(ListCell));
tarpos += n;
}
srcpos = curr + 1;
    }
    /* move any elements remaining after the last deleted cell */
    if (srcpos < list->length)
    {
        memmove(&list->elements[tarpos], &list->elements[srcpos],
                (list->length - srcpos) * sizeof(ListCell));
        tarpos += list->length - srcpos;
    }
    list->length = tarpos;
}

Or maybe we should worry about having the list in an inconsistent
state during the loop?  e.g. if the list is getting passed into a
function call to do something.
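For what it's worth, the compaction idea can be exercised standalone. The sketch below (plain C; an int array plus a bool "deleted" mask stands in for the List and the Bitmapset) moves each surviving run with a single memmove, the same way the finalize step above would:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Compact 'a' in place, dropping elements whose 'deleted' flag is set.
 * Each run of surviving elements is moved with one memmove, mirroring
 * how list_finalize_delete() would walk the deleted Bitmapset.
 * Returns the new length.
 */
static int
compact(int *a, int n, const bool *deleted)
{
	int			src = 0,
				dst = 0;

	while (src < n)
	{
		if (deleted[src])
		{
			src++;
			continue;
		}
		/* find the end of this surviving run */
		int			run = src;

		while (run < n && !deleted[run])
			run++;
		memmove(&a[dst], &a[src], (run - src) * sizeof(int));
		dst += run - src;
		src = run;
	}
	return dst;
}
```

Deleting elements 2, 5 and 6 from a ten-element array yields {0, 1, 3, 4, 7, 8, 9}, using three memmoves rather than one per deleted cell.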


--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Performance issue in foreign-key-aware join estimation

2019-05-24 Thread David Rowley
On Fri, 22 Mar 2019 at 10:10, Tom Lane  wrote:
>
> David Rowley  writes:
> > [ eclass_indexes_v4.patch ]
>
> I still don't like this approach too much.  I think we can fairly easily
> construct the eclass_indexes at a higher level instead of trying to
> manage them in add_eq_member, and after some hacking I have the attached.

I've rebased this on top of the pgindent changes (attached) and looked over it.

The only problem I see is that you're not performing a bms_copy() of
the parent's eclass_indexes in add_child_rel_equivalences(). You've
written a comment that claims it's fine to just point to the parent's
in:

/*
* The child is now mentioned in all the same eclasses as its parent ---
* except for corner cases such as a volatile EC.  But it's okay if
* eclass_indexes lists too many rels, so just borrow the parent's index
* set rather than making a new one.
*/
child_rel->eclass_indexes = parent_rel->eclass_indexes;

but that's not true since in get_eclass_for_sort_expr() we perform:

rel->eclass_indexes = bms_add_member(rel->eclass_indexes,
ec_index);

which will work fine in about 63 out of 64 cases, but once
bms_add_member has to reallocate the set then it'll leave the child
rel's eclass_indexes pointing to freed memory.  I'm not saying I have
any reproducible test case that can crash the backend from this, but
it does seem like a bug waiting to happen.
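The hazard is easy to reproduce outside the planner. In this toy sketch (not real Bitmapset code; IntSet is a made-up flat-allocation set standing in for Bitmapset), growing the set can realloc and move it, so any struct that merely borrowed the pointer is left dangling, whereas a copy stays valid:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A grow-on-demand set stored as one flat allocation, like Bitmapset. */
typedef struct IntSet
{
	int			n;
	int			cap;
	int			vals[];			/* flexible array member */
} IntSet;

/* Add a value; may realloc (and therefore move) the whole struct. */
static IntSet *
intset_add(IntSet *s, int v)
{
	if (s == NULL)
	{
		s = malloc(sizeof(IntSet) + 4 * sizeof(int));
		s->n = 0;
		s->cap = 4;
	}
	else if (s->n == s->cap)
	{
		s->cap *= 2;
		s = realloc(s, sizeof(IntSet) + s->cap * sizeof(int));
	}
	s->vals[s->n++] = v;
	return s;
}

/* The bms_copy() analogue: a private copy is immune to the parent growing. */
static IntSet *
intset_copy(const IntSet *s)
{
	size_t		sz = sizeof(IntSet) + s->cap * sizeof(int);
	IntSet	   *c = malloc(sz);

	memcpy(c, s, sz);
	return c;
}
```

A child that does `child = parent;` keeps the pre-realloc pointer; after enough intset_add() calls on the parent, that pointer refers to freed memory. `child = intset_copy(parent);` is the safe equivalent of adding the bms_copy().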

Apart from that, as far as I can tell, the patch seems fine.

I didn't add the bms_copy(). I'll wait for your comments before doing so.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


eclass_indexes_v6.patch
Description: Binary data


Re: Zedstore - compressed in-core columnar storage

2019-05-24 Thread Mark Kirkwood



On 23/05/19 12:07 PM, Ashwin Agrawal wrote:


We (Heikki, Melanie, and I) are continuing to build Zedstore, and wish
to share the recent additions and modifications. Attaching a patch
with the latest code; see the GitHub branch [1] to follow
along. The approach we have been leaning towards is to build the
required functionality, get the tests passing, and then continue to
iterate to optimize it. It's still a work in progress.

Sharing the details now, as we have reached our next milestone for
Zedstore. All table AM APIs are implemented for Zedstore (except
compute_xid_horizon_for_tuples; it seems we need a test for it first).

Current State:

- A new type of item added to Zedstore "Array item", to boost
  compression and performance. Based on Konstantin's performance
  experiments [2] and inputs from Tomas Vondra [3], this was
  added. An array item holds multiple datums, with consecutive TIDs and
  the same visibility information. An array item saves space compared
  to multiple single items, by leaving out repetitive UNDO and TID
  fields. An array item cannot mix NULLs and non-NULLs. So, those
  experiments should result in improved performance now. Inserting
  data via COPY creates array items currently. Code for insert has not
  been modified from last time. Making singleton inserts or insert
  into select, performant is still on the todo list.

- Now we have a separate and dedicated meta-column btree alongside
  rest of the data column btrees. This special or first btree for
  meta-column is used to assign TIDs for tuples, track the UNDO
  location which provides visibility information. Also, this special
  btree, which always exists, helps to support zero-column tables
  (which can be a result of ADD COLUMN DROP COLUMN actions as
  well). Plus, having meta-data stored separately from data, helps to
  get better compression ratios. It also helps to further simplify
  the overall design/implementation, as deletes just need to edit
  the meta-column and avoid touching the actual data btrees. Index
  scans can just perform visibility checks based on this meta-column
  and fetch required datums only for visible tuples. For tuple locks
  also just need to access this meta-column only. Previously, every
  column btree used to carry the same undo pointer. Thus visibility
  check could be potentially performed, with the past layout, using
  any column. But considering overall simplification new layout
  provides it's fine to give up on that aspect. Having dedicated
  meta-column highly simplified handling for add columns with default
  and null values, as this column deterministically provides all the
  TIDs present in the table, which can't be said for any other data
  columns due to default or null values during add column.

- Free Page Map implemented. The Free Page Map keeps track of unused
  pages in the relation. The FPM is also a b-tree, indexed by physical
  block number. To be more compact, it stores "extents", i.e. block
  ranges, rather than just blocks, when possible. An interesting paper
  [4] on how modern filesystems manage space was a good source of ideas.

- Tuple locks implemented

- Serializable isolation handled

- With "default_table_access_method=zedstore"
  - 31 out of 194 failing regress tests
  - 10 out of 86 failing isolation tests
Many of the currently failing tests are due to plan differences, like
Index scans being selected for zedstore over Index-Only scans, as
zedstore doesn't yet have a visibility map. I have yet to give thought
to index-only scans. Other plan diffs are due to table size differences
between heap and zedstore.
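As a side note, the extent-based FPM bookkeeping described above can be sketched with a sorted array of (start, len) ranges (plain C; no relation to the actual Zedstore b-tree code, and it assumes no double-frees). Freeing a block extends or merges neighbouring extents so contiguous free space stays a single entry:

```c
#include <assert.h>
#include <string.h>

/* A free-space extent: 'len' consecutive free blocks starting at 'start'. */
typedef struct
{
	unsigned	start;
	unsigned	len;
} Extent;

/*
 * Record block 'blk' as free in the sorted extent array 'ext' (which
 * must have room for n + 1 entries).  Adjacent extents are extended or
 * merged so contiguous free space stays one entry.  Assumes 'blk' is
 * not already free (no double-free detection in this sketch).
 * Returns the new number of extents.
 */
static int
add_free_block(Extent *ext, int n, unsigned blk)
{
	int			i = 0;

	/* find the first extent ending at or after blk */
	while (i < n && ext[i].start + ext[i].len < blk)
		i++;

	if (i < n && ext[i].start + ext[i].len == blk)
	{
		/* extends the left neighbour upward */
		ext[i].len++;
		/* ... which may now touch the right neighbour: merge them */
		if (i + 1 < n && ext[i].start + ext[i].len == ext[i + 1].start)
		{
			ext[i].len += ext[i + 1].len;
			memmove(&ext[i + 1], &ext[i + 2], (n - i - 2) * sizeof(Extent));
			n--;
		}
		return n;
	}
	if (i < n && ext[i].start == blk + 1)
	{
		/* extends the right neighbour downward */
		ext[i].start = blk;
		ext[i].len++;
		return n;
	}
	/* isolated block: insert a fresh single-block extent */
	memmove(&ext[i + 1], &ext[i], (n - i) * sizeof(Extent));
	ext[i].start = blk;
	ext[i].len = 1;
	return n + 1;
}
```

Freeing blocks 3, 5, then 4 collapses to the single extent (3, 3), which is the compactness the FPM is after.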

Next few milestones we wish to hit for Zedstore:
- Make check regress green
- Make check isolation green
- Zedstore crash safe (means also replication safe). Implement WAL
  logs
- Performance profiling and optimizations for Insert, Selects, Index
  Scans, etc...
- Once UNDO framework lands in Upstream, Zedstore leverages it instead
  of its own version of UNDO

Open questions / discussion items:

- how best to get "column projection list" from planner? (currently,
  we walk plan and find the columns required for the query in
  the executor, refer GetNeededColumnsForNode())

- how to pass the "column projection list" to table AM? (as stated in
  initial email, currently we have modified table am API to pass the
  projection to AM)

- TID treated as (block, offset) in current indexing code

- Physical tlist optimization? (currently, we disabled it for
  zedstore)

Team:
Melanie joined Heikki and me to write code for zedstore. Majority of
the code continues to be contributed by Heikki. We are continuing to
have fun building column store implementation and iterate
aggressively.

References:
1] https://github.com/greenplum-db/postgres/tree/zedstore
2] https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
3] https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development

4] https://www.kernel.org/doc/ols/2010/ols2010-pages

Re: Runtime pruning problem

2019-05-24 Thread David Rowley
On Wed, 24 Apr 2019 at 11:42, David Rowley  wrote:
> I've added this to the July commitfest so that I don't forget about it.
>
> https://commitfest.postgresql.org/23/2102/

and an updated patch, rebased after the pgindent run.

Hopefully, this will make the CF bot happy again.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


improve_runtime_pruning_explain_output_v2.patch
Description: Binary data