Re: Index Skip Scan

2019-06-22 Thread Floris Van Nee
The following SQL statement seems to have incorrect results - some logic in the
backwards scan is currently not entirely right. (Note the duplicate a=5 row in
the output below, which DISTINCT ON (a) should never produce.)

-Floris


drop table if exists a;
create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 1) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a desc, b desc;
explain select distinct on (a) a,b from a order by a desc, b desc;

DROP TABLE
CREATE TABLE
INSERT 0 5
CREATE INDEX
ANALYZE
 a |   b   
---+---
 5 | 1
 5 | 1
 4 | 1
 3 | 1
 2 | 1
 1 | 1
(6 rows)

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Only Scan Backward using a_a_b_idx on a  (cost=0.29..1.45 rows=5 width=8)
   Scan mode: Skip scan
(2 rows)
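
For reference, given the five rows inserted above, the expected result -
derived by hand from the data, one row per distinct value of a, ordered
descending - would be five rows:

 a | b
---+---
 5 | 1
 4 | 1
 3 | 1
 2 | 1
 1 | 1
(5 rows)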





Re: Choosing values for multivariate MCV lists

2019-06-22 Thread Tomas Vondra

On Fri, Jun 21, 2019 at 08:50:33AM +0100, Dean Rasheed wrote:

On Thu, 20 Jun 2019 at 23:35, Tomas Vondra  wrote:


On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:

>I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>though, because that would seem likely to put too much weight on the
>least commonly occurring values.

But would that be an issue, or a good thing? I mean, as long as the item
is above mincount, we take the counts as reliable. As I explained, my
motivation for proposing that was that both

   ... (cost=... rows=1 ...) (actual=... rows=101 ...)

and

   ... (cost=... rows=100 ...) (actual=... rows=200 ...)

have exactly the same Abs(freq - base_freq), but I think we both agree
that the first misestimate is much more dangerous, because it's off by two
orders of magnitude.



Hmm, that's a good example. That definitely suggests that we should be
trying to minimise the relative error, but also perhaps that what we
should be looking at is actually just the ratio freq / base_freq,
rather than their difference.



Attached are patches that implement this (well, the first one optimizes
how the base frequency is computed, which helps for high statistics target
values). The 0002 part picks the items based on

  Max(freq/base_freq, base_freq/freq)

It did help with the addresses data set quite a bit, but I'm sure it needs
more testing. I've also tried using

  freq * abs(freq - base_freq)

but the estimates were not as good.
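
To make the earlier example concrete: both misestimates have the same
absolute error, Abs(101 - 1) = Abs(200 - 100) = 100 rows, but the ratio
metric gives Max(101/1, 1/101) = 101 for the first and Max(200/100, 100/200)
= 2 for the second, so ordering by the ratio ranks the dangerous misestimate
far higher.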

One annoying thing I noticed is that the base_frequency tends to end up
being 0, most likely because it gets too small. That's a bit strange, though,
because with the statistics target set to 10k the smallest frequency for a
single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
something float8 can represent).
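
(For the record, the arithmetic behind that expectation, assuming the usual
sample of 300 * statistics_target rows: a target of 10k samples 3M rows, the
smallest single-column frequency is then 1/3e6, and the two-column product
1/9e12 is about 1.1e-13 - comfortably above the smallest normal float8 of
roughly 2.2e-308, so plain underflow alone can't produce a zero.)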


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From 430ee6e6739d5f8fb2e25657f0196a5c413c2021 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sat, 22 Jun 2019 13:09:42 +0200
Subject: [PATCH 1/2] fix mcv build perf issue

---
 src/backend/statistics/mcv.c | 114 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 106 insertions(+), 8 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 5fe61ea0a4..04a4f17b01 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -78,6 +78,9 @@ static MultiSortSupport build_mss(VacAttrStats **stats, int numattrs);
 static SortItem *build_distinct_groups(int numrows, SortItem *items,
 									   MultiSortSupport mss, int *ndistinct);
 
+static SortItem **build_column_frequencies(SortItem *groups, int ngroups,
+										   MultiSortSupport mss, int *ncounts);
+
 static int count_distinct_groups(int numrows, SortItem *items,
 								 MultiSortSupport mss);
 
@@ -172,6 +175,8 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 	SortItem   *groups;
 	MCVList    *mcvlist = NULL;
 	MultiSortSupport mss;
+	SortItem  **freqs;
+	int		   *nfreqs;
 
 	attnums = build_attnums_array(attrs, &numattrs);
 
@@ -188,6 +193,10 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 	/* transform the sorted rows into groups (sorted by frequency) */
 	groups = build_distinct_groups(nitems, items, mss, &ngroups);
 
+	/* compute frequencies for values in each column */
+	nfreqs = (int *) palloc0(sizeof(int) * numattrs);
+	freqs = build_column_frequencies(groups, ngroups, mss, nfreqs);
+
 	/*
 	 * Maximum number of MCV items to store, based on the attribute with the
 	 * largest stats target (and the number of groups we have available).
@@ -242,6 +251,16 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 	if (nitems > 0)
 	{
 		int			j;
+		SortItem	key;
+		MultiSortSupport tmp;
+
+		/* used to search values */
+		tmp = (MultiSortSupport) palloc(offsetof(MultiSortSupportData, ssup)
+										+ sizeof(SortSupportData));
+
+		/* space for search key */
+		key.values = palloc(sizeof(Datum));
+		key.isnull = palloc(sizeof(bool));
 
 		/*
 		 * Allocate the MCV list structure, set the global parameters.
@@ -281,22 +300,28 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 			item->base_frequency = 1.0;
 			for (j = 0; j < numattrs; j++)
 			{
-				int

Re: Index Skip Scan

2019-06-22 Thread Dmitry Dolgov
> On Sat, Jun 22, 2019 at 12:17 PM Floris Van Nee wrote:
> The following SQL statement seems to have incorrect results - some logic
> in the backwards scan is currently not entirely right.

Thanks for testing! You're right - it looks like, in the current
implementation, a backwards scan takes one unnecessary extra step forward.
It seems this mistake crept in because I was concentrating only on backward
scans with a cursor, and used a not entirely correct approach to wrap up
after a scan was finished. Give me a moment, I'll tighten it up.


Re: \describe*

2019-06-22 Thread Corey Huinker
>
> > So what is the uptake on implementing this at the server side, ie.
> > DESCRIBE?
>
> I'm pretty skeptical of this idea, unless you are willing to throw
> away at least one and possibly both of the following goals:
>
> 1. Compatibility with psql's existing \d behavior.
>

I don't think *compatibility* with the behavior should be a goal in itself.
Coverage of the majority of the use-cases is.

> 2. Usability of DESCRIBE for any purpose whatsoever other than emitting
> something that looks just like what psql prints.
>
> We've migrated many of the \d displays so far away from "a single query
> result" that I don't believe there's a way for a server command to
> duplicate them, at least not without some seriously unholy in-bed-ness
> between the server command and some postprocessing logic in describe.c.
> (At which point you've lost whatever system architectural value there
> might be in the whole project, since having a more-arm's-length
> relationship there kinda seems like the point to me.)
>

I think there's a genuine use for regular printed output, and there's also
a use for a query-able output. Maybe that queryable output is just a JSONB
output that the outer query can pick apart as it sees fit, and that would
handle the fact that the data often doesn't fit into a single query's
output.
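
As a purely hypothetical sketch of that idea - there is no pg_describe()
function, and the JSONB shape here is invented solely for illustration -
the caller-side filtering might look like:

select col->>'name' as column_name,
       col->>'type' as data_type
  from jsonb_array_elements(pg_describe('my_table')->'columns') as col
 where (col->>'is_pk')::bool;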

Incidentally, I had need of this very functionality in Snowflake the other
day. The data dictionary there isn't capable of telling you which columns
are in a primary key, but that information is printed when you run
"DESCRIBE my_table".  The workaround is to run "DESCRIBE my_table" and then
make another query using a table function to recall the output of the last
query made in the session, and then filter that. Yeah, as a pattern it's
weird and sad, but it shows that there are uses for something
DESCRIBE-ish on the server side.
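
For reference, that Snowflake workaround looks roughly like this (quoted
from memory, so treat the column names as approximate):

describe table my_table;
-- result_scan() re-reads the output of a prior statement as a table
select "name"
  from table(result_scan(last_query_id()))
 where "primary key" = 'Y';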

So if we're going server-side on DESCRIBE, it should be its own entity,
not beholden to design decisions made in psql.


> There are a bunch of other little behavioral differences that you just
> can't replicate server-side, like the fact that localization of the
> results depends on psql's LC_MESSAGES not the server's.  Maybe people
> would be okay with changing that, but it's not a transparent
> reimplementation.
>

I think people would be OK with that. We're asking the server what it knows
about an object, not how psql feels about that same information.

> I think if we want to have server-side describe capability, we're better
> off just to implement a DESCRIBE command that's not intended to be exactly
> like \d anything, and not try to make it be the implementation for \d
> anything.  (This was, in fact, where David started IIUC.  Other people's
> sniping at that idea hasn't yielded any better idea.)
>

I'm very much in support of server-side DESCRIBE that's not beholden to \d
in any way. For instance, I'm totally fine with DESCRIBE not being able to
handle wildcard patterns.

My initial suggestion for client-side \describe was mostly born of it
being easy to implement a large subset of the \d commands to help users.
Not all users have psql access, so having a server-side command helps more
people.

It could be that we decide that DESCRIBE is set-returning, and we have to
break up \d functionality to suit. By this I mean that we might find it
simpler to require DESCRIBE TABLE foo to show only columns, with minimal
information about PKs, and follow-up commands like "DESCRIBE TABLE foo
INDEXES" or "DESCRIBE TABLE foo CONSTRAINTS" to keep output in tabular
format.
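
Purely as a sketch of that split - none of this syntax exists, it simply
restates the idea above:

DESCRIBE TABLE foo;              -- columns, with minimal PK information
DESCRIBE TABLE foo INDEXES;      -- one row per index, tabular
DESCRIBE TABLE foo CONSTRAINTS;  -- one row per constraint, tabular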


> In particular, I'm really strongly against having "\describe-foo-bar"
> invoke DESCRIBE, because (a) that will break compatibility with the
> existing \des command, and (b) it's not actually saving any typing,
> and (c) I think it'd confuse users no end.
>

+1. Having psql figure out which servers can give proper
server-side describes would boggle the mind.


> Of course, this line of thought does lead to the conclusion that we'd be
> maintaining psql/describe.c and server-side DESCRIBE in parallel forever,
>

Not fun, but what's our motivation for adding new \d functionality once
a viable DESCRIBE is in place? Wouldn't the \d commands essentially be
feature-frozen at that point?


> which doesn't sound like fun.  But we should be making DESCRIBE with an
> eye to more use-cases than psql.  If it allows jdbc to not also maintain
> a pile of equivalent code, that'd be a win.  If it allows pg_dump to toss
> a bunch of logic overboard (or at least stop incrementally adding new
> variants), that'd be a big win.
>

I don't know enough about JDBC internals to know what sort of non-set
results it can handle, but that seems key to showing us how to proceed.

As for pg_dump, that same goal was a motivation for a similar proposed
server-side command "SHOW CREATE <object>" (essentially, pg_dump of
<object>), which would have basically the same design issues as DESCRIBE,
though the result set would be a much simpler SETOF text.


Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-22 Thread Amit Kapila
On Fri, Jun 21, 2019 at 8:15 PM Robert Haas  wrote:
>
> On Mon, Jun 17, 2019 at 10:50 AM Ian Barwick
>  wrote:
> > In Pg12, the code in pg_basebackup implies the correct thing to do is 
> > append to .auto.conf,
> > but as demonstrated that can cause problems with duplicate entries.
>
> Yeah.
>
> To me, forcing every tools author to use postgresql.conf parsing tools
> rather than just appending to the file is a needless burden on tool
> authors.
>

OTOH, if we give all tools license to append to the .auto.conf file
whenever they want, then I think the contents of the file become
unpredictable.  Basically, as of now, we allow only one backend to write
to the file, and giving a free pass to everyone can create problems.
This won't be a problem for pg_basebackup, but it can be for other tools.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-22 Thread Amit Kapila
On Fri, Jun 21, 2019 at 10:31 PM Stephen Frost  wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>
> > If anybody does complain, my first reaction would be to make ALTER
> > SYSTEM strip all comment lines too.
>
> Uh, I believe it already does?
>

Yeah, I also think so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-22 Thread Stephen Frost
Greetings,

On Sat, Jun 22, 2019 at 17:07 Amit Kapila  wrote:

> On Fri, Jun 21, 2019 at 8:15 PM Robert Haas  wrote:
> >
> > On Mon, Jun 17, 2019 at 10:50 AM Ian Barwick
> >  wrote:
> > > In Pg12, the code in pg_basebackup implies the correct thing to do is
> append to .auto.conf,
> > > but as demonstrated that can cause problems with duplicate entries.
> >
> > Yeah.
> >
> > To me, forcing every tools author to use postgresql.conf parsing tools
> > rather than just appending to the file is a needless burden on tool
> > authors.
> >
>
> OTOH, if we give all tools license to append to the .auto.conf file
> whenever they want, then I think the contents of the file become
> unpredictable.  Basically, as of now, we allow only one backend to write
> to the file, and giving a free pass to everyone can create problems.
> This won't be a problem for pg_basebackup, but it can be for other tools.


I don’t think anyone was suggesting that tools be allowed to modify the
file while the server is running - if a change needs to be made while the
server is running, then it should be done through a call to ALTER SYSTEM.

There’s no shortage of tools that, particularly with the merging in of
recovery.conf, want to modify and manipulate the file when the server is
down though.

All that said, whatever code it is that we write for pg_basebackup to do
this properly should go into our client side library, so other tools can
leverage that and avoid having to write it themselves.

Thanks!

Stephen



Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-22 Thread Amit Kapila
On Sun, Jun 23, 2019 at 2:43 AM Stephen Frost  wrote:
>
> Greetings,
>
> On Sat, Jun 22, 2019 at 17:07 Amit Kapila  wrote:
>>
>> On Fri, Jun 21, 2019 at 8:15 PM Robert Haas  wrote:
>> >
>> > On Mon, Jun 17, 2019 at 10:50 AM Ian Barwick
>> >  wrote:
>> > > In Pg12, the code in pg_basebackup implies the correct thing to do is 
>> > > append to .auto.conf,
>> > > but as demonstrated that can cause problems with duplicate entries.
>> >
>> > Yeah.
>> >
>> > To me, forcing every tools author to use postgresql.conf parsing tools
>> > rather than just appending to the file is a needless burden on tool
>> > authors.
>> >
>>
>> OTOH, if we give all tools license to append to the .auto.conf file
>> whenever they want, then I think the contents of the file become
>> unpredictable.  Basically, as of now, we allow only one backend to write
>> to the file, and giving a free pass to everyone can create problems.
>> This won't be a problem for pg_basebackup, but it can be for other tools.
>
>
> I don’t think anyone was suggesting that tools be allowed to modify the file
> while the server is running - if a change needs to be made while the server is
> running, then it should be done through a call to ALTER SYSTEM.
>
> There’s no shortage of tools that, particularly with the merging in of
> recovery.conf, want to modify and manipulate the file when the server is down
> though.
>
> All that said, whatever code it is that we write for pg_basebackup to do this 
> properly should go into our client side library, so other tools can leverage 
> that and avoid having to write it themselves.
>

Fair enough.  In that case, don't we need some mechanism to ensure
that if the API fails, the old contents are retained?  ALTER SYSTEM
ensures that by first writing the contents to a temporary file, but I am
not sure whether whatever pg_basebackup does has that guarantee.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-22 Thread Stephen Frost
Greetings,

On Sat, Jun 22, 2019 at 17:43 Amit Kapila  wrote:

> On Sun, Jun 23, 2019 at 2:43 AM Stephen Frost  wrote:
> >
> > Greetings,
> >
> > On Sat, Jun 22, 2019 at 17:07 Amit Kapila 
> wrote:
> >>
> >> On Fri, Jun 21, 2019 at 8:15 PM Robert Haas 
> wrote:
> >> >
> >> > On Mon, Jun 17, 2019 at 10:50 AM Ian Barwick
> >> >  wrote:
> >> > > In Pg12, the code in pg_basebackup implies the correct thing to do
> is append to .auto.conf,
> >> > > but as demonstrated that can cause problems with duplicate entries.
> >> >
> >> > Yeah.
> >> >
> >> > To me, forcing every tools author to use postgresql.conf parsing tools
> >> > rather than just appending to the file is a needless burden on tool
> >> > authors.
> >> >
> >>
> >> OTOH, if we give all tools license to append to the .auto.conf file
> >> whenever they want, then I think the contents of the file become
> >> unpredictable.  Basically, as of now, we allow only one backend to write
> >> to the file, and giving a free pass to everyone can create problems.
> >> This won't be a problem for pg_basebackup, but it can be for other tools.
> >
> >
> > I don’t think anyone was suggesting that tools be allowed to modify the
> > file while the server is running - if a change needs to be made while the
> > server is running, then it should be done through a call to ALTER SYSTEM.
> >
> > There’s no shortage of tools that, particularly with the merging in of
> > recovery.conf, want to modify and manipulate the file when the server is
> > down though.
> >
> > All that said, whatever code it is that we write for pg_basebackup to do
> this properly should go into our client side library, so other tools can
> leverage that and avoid having to write it themselves.
> >
>
> Fair enough.  In that case, don't we need some mechanism to ensure
> that if the API fails, the old contents are retained?  ALTER SYSTEM
> ensures that by first writing the contents to a temporary file, but I am
> not sure whether whatever pg_basebackup does has that guarantee.


I’m not sure that’s really the same.  Certainly, pg_basebackup needs to
deal with a partial write, or failure of any kind, in a clean way that
indicates the backup isn’t good.  The important bit is that the resulting
file be one that ALTER SYSTEM and potentially other tools will be able to
work with.

Thanks,

Stephen



Re: Index Skip Scan

2019-06-22 Thread Floris Van Nee

> Thanks for testing! You're right - it looks like, in the current
> implementation, a backwards scan takes one unnecessary extra step forward.
> It seems this mistake crept in because I was concentrating only on backward
> scans with a cursor, and used a not entirely correct approach to wrap up
> after a scan was finished. Give me a moment, I'll tighten it up.

Thanks. Looking forward to it. I think I found some other strange behavior.
Given the same table as in my previous e-mail, the following queries also
return inconsistent results. I spent some time trying to debug it, but can't
easily pinpoint the cause. It looks like it also skips over one value too
many - my guess is it happens during the _bt_skippage call in _bt_skip?
Perhaps a question: when stepping through code in GDB, is there an easy way to
pretty print, for example, the contents of an IndexTuple? I saw there are some
tools out there that can pretty print plans, but viewing tuples is more
complicated, I guess.

-- this one is OK
postgres=# select distinct on (a) a,b from a where b>1;
 a | b
---+---
 1 | 2
 2 | 2
 3 | 2
 4 | 2
 5 | 2
(5 rows)

-- this one is not OK - it skips too much (a=2 and a=4 are missing; see below)
postgres=# select distinct on (a) a,b from a where b=2;
 a | b
---+---
 1 | 2
 3 | 2
 5 | 2
(3 rows)
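
For reference, deriving the expected answer from the b>1 output above (every
value of a evidently has a row with b=2), the b=2 query should return all
five groups:

 a | b
---+---
 1 | 2
 2 | 2
 3 | 2
 4 | 2
 5 | 2
(5 rows)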




Re: Index Skip Scan

2019-06-22 Thread Peter Geoghegan
On Sat, Jun 22, 2019 at 3:15 PM Floris Van Nee  wrote:
> Perhaps a question: when stepping through code in GDB, is there an easy way
> to pretty print, for example, the contents of an IndexTuple? I saw there are
> some tools out there that can pretty print plans, but viewing tuples is more
> complicated, I guess.

Try the attached patch -- it has DEBUG1 traces with the contents of
index tuples from key points during index scans, allowing you to see
what's going on both as a B-Tree is descended, and as a range scan is
performed. It also shows details of the insertion scankey that is set
up within _bt_first(). This hasn't been adapted to the patch at all,
so you'll probably need to do that.

The patch should be considered a very rough hack, for now. It leaks
memory like crazy. But I think that you'll find it helpful.

-- 
Peter Geoghegan


0012-Index-scan-positioning-DEBUG1-instrumentation.patch
Description: Binary data