Re: Index Skip Scan
The following sql statement seems to have incorrect results - some logic
in the backwards scan is currently not entirely right.

-Floris

drop table if exists a;
create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 1) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a desc, b desc;
explain select distinct on (a) a,b from a order by a desc, b desc;

DROP TABLE
CREATE TABLE
INSERT 0 5
CREATE INDEX
ANALYZE
 a | b
---+---
 5 | 1
 5 | 1
 4 | 1
 3 | 1
 2 | 1
 1 | 1
(6 rows)

                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Only Scan Backward using a_a_b_idx on a  (cost=0.29..1.45 rows=5 width=8)
   Scan mode: Skip scan
(2 rows)
Re: Choosing values for multivariate MCV lists
On Fri, Jun 21, 2019 at 08:50:33AM +0100, Dean Rasheed wrote:
>On Thu, 20 Jun 2019 at 23:35, Tomas Vondra wrote:
>>On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>>>I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>>>though, because that would seem likely to put too much weight on the
>>>least commonly occurring values.
>>
>>But would that be an issue, or a good thing? I mean, as long as the
>>item is above mincount, we take the counts as reliable. As I explained,
>>my motivation for proposing that was that both
>>
>>  ... (cost=... rows=1 ...) (actual=... rows=101 ...)
>>
>>and
>>
>>  ... (cost=... rows=100 ...) (actual=... rows=200 ...)
>>
>>have exactly the same Abs(freq - base_freq), but I think we both agree
>>that the first misestimate is much more dangerous, because it's off by
>>two orders of magnitude.
>
>Hmm, that's a good example. That definitely suggests that we should be
>trying to minimise the relative error, but also perhaps that what we
>should be looking at is actually just the ratio freq / base_freq,
>rather than their difference.

Attached are patches that implement this (well, the first one optimizes
how the base frequency is computed, which helps for high statistics
target values). The 0002 part picks the items based on

    Max(freq/base_freq, base_freq/freq)

It did help with the addresses data set quite a bit, but I'm sure it
needs more testing. I've also tried using

    freq * abs(freq - base_freq)

but the estimates were not as good.

One annoying thing I noticed is that the base_frequency tends to end up
being 0, most likely due to getting too small. It's a bit strange,
though, because with the statistics target set to 10k the smallest
frequency for a single column is 1/3e6, so for 2 columns it'd be
~1/9e12 (which I think is something float8 can represent).
regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

>From 430ee6e6739d5f8fb2e25657f0196a5c413c2021 Mon Sep 17 00:00:00 2001
From: Tomas Vondra
Date: Sat, 22 Jun 2019 13:09:42 +0200
Subject: [PATCH 1/2] fix mcv build perf issue

---
 src/backend/statistics/mcv.c | 114 ++++++++++++++++++++++++++++++++---
 1 file changed, 106 insertions(+), 8 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 5fe61ea0a4..04a4f17b01 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -78,6 +78,9 @@ static MultiSortSupport build_mss(VacAttrStats **stats, int numattrs);
 static SortItem *build_distinct_groups(int numrows, SortItem *items,
 					   MultiSortSupport mss, int *ndistinct);
 
+static SortItem **build_column_frequencies(SortItem *groups, int ngroups,
+					   MultiSortSupport mss, int *ncounts);
+
 static int	count_distinct_groups(int numrows, SortItem *items,
 						  MultiSortSupport mss);
 
@@ -172,6 +175,8 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 	SortItem   *groups;
 	MCVList	   *mcvlist = NULL;
 	MultiSortSupport mss;
+	SortItem  **freqs;
+	int		   *nfreqs;
 
 	attnums = build_attnums_array(attrs, &numattrs);
 
@@ -188,6 +193,10 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 	/* transform the sorted rows into groups (sorted by frequency) */
 	groups = build_distinct_groups(nitems, items, mss, &ngroups);
 
+	/* compute frequencies for values in each column */
+	nfreqs = (int *) palloc0(sizeof(int) * numattrs);
+	freqs = build_column_frequencies(groups, ngroups, mss, nfreqs);
+
 	/*
 	 * Maximum number of MCV items to store, based on the attribute with the
 	 * largest stats target (and the number of groups we have available).
@@ -242,6 +251,16 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 	if (nitems > 0)
 	{
 		int			j;
+		SortItem	key;
+		MultiSortSupport tmp;
+
+		/* used to search values */
+		tmp = (MultiSortSupport) palloc(offsetof(MultiSortSupportData, ssup) +
+										sizeof(SortSupportData));
+
+		/* space for search key */
+		key.values = palloc(sizeof(Datum));
+		key.isnull = palloc(sizeof(bool));
 
 		/*
 		 * Allocate the MCV list structure, set the global parameters.
@@ -281,22 +300,28 @@ statext_mcv_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
 		item->base_frequency = 1.0;
 		for (j = 0; j < numattrs; j++)
 		{
-			int
Re: Index Skip Scan
> On Sat, Jun 22, 2019 at 12:17 PM Floris Van Nee wrote:
> The following sql statement seems to have incorrect results - some
> logic in the backwards scan is currently not entirely right.

Thanks for testing! You're right: in the current implementation, a
backward scan takes one unnecessary extra step forward. It seems I made
this mistake because I was concentrating only on backward scans with a
cursor, and used a not entirely correct approach to wrap up after a
scan was finished. Give me a moment, I'll tighten it up.
Re: \describe*
> > > So what is the uptake on implementing this at the server side, ie.
> > > DESCRIBE?
> >
> > I'm pretty skeptical of this idea, unless you are willing to throw
> > away at least one and possibly both of the following goals:
> >
> > 1. Compatibility with psql's existing \d behavior.

I don't think *compatibility* with the behavior should be a goal in
itself. Coverage of the majority of the use-cases is.

> > 2. Usability of DESCRIBE for any purpose whatsoever other than
> > emitting something that looks just like what psql prints.
> >
> > We've migrated many of the \d displays so far away from "a single query
> > result" that I don't believe there's a way for a server command to
> > duplicate them, at least not without some seriously unholy in-bed-ness
> > between the server command and some postprocessing logic in describe.c.
> > (At which point you've lost whatever system architectural value there
> > might be in the whole project, since having a more-arm's-length
> > relationship there kinda seems like the point to me.)

I think there's a genuine use for regular printed output, and there's
also a use for a query-able output. Maybe that queryable output is just
a JSONB output that the outer query can pick apart as it sees fit,
which would handle the fact that the data often doesn't fit into a
single query's output.

Incidentally, I had need of this very functionality in Snowflake the
other day. The data dictionary there isn't capable of telling you which
columns are in a primary key, but that information is printed when you
run "DESCRIBE my_table". The workaround is to run "DESCRIBE my_table",
then make another query using a table function to recall the output of
the last query made in the session, and then filter that. Yeah, as a
pattern it's weird and sad, but it shows that there are uses for
something DESCRIBE-ish on the server side.

So if we're going server-side on DESCRIBE, it should be its own entity,
not beholden to design decisions made in psql.
> There are a bunch of other little behavioral differences that you just
> can't replicate server-side, like the fact that localization of the
> results depends on psql's LC_MESSAGES not the server's. Maybe people
> would be okay with changing that, but it's not a transparent
> reimplementation.

I think people would be OK with that. We're asking the server what it
knows about an object, not how psql feels about that same information.

> I think if we want to have server-side describe capability, we're better
> off just to implement a DESCRIBE command that's not intended to be exactly
> like \d anything, and not try to make it be the implementation for \d
> anything. (This was, in fact, where David started IIUC. Other people's
> sniping at that idea hasn't yielded any better idea.)

I'm very much in support of a server-side DESCRIBE that's not beholden
to \d in any way. For instance, I'm totally fine with DESCRIBE not
being able to handle wildcard patterns. My initial suggestion of a
client-side \describe was mostly borne of it being easy to implement a
large subset of the \d commands to help users. Not all users have psql
access, so having a server-side command helps more people.

It could be that we decide that DESCRIBE is set-returning, and we have
to break up \d functionality to suit. By this I mean that we might find
it simpler to have DESCRIBE TABLE foo show only columns, with minimal
information about PKs, and follow-up commands like "DESCRIBE TABLE foo
INDEXES" or "DESCRIBE TABLE foo CONSTRAINTS" to keep output in tabular
format.

> In particular, I'm really strongly against having "\describe-foo-bar"
> invoke DESCRIBE, because (a) that will break compatibility with the
> existing \des command, and (b) it's not actually saving any typing,
> and (c) I think it'd confuse users no end.

+1. Having psql figure out which servers can give proper server-side
describes would boggle the mind.
> Of course, this line of thought does lead to the conclusion that we'd be
> maintaining psql/describe.c and server-side DESCRIBE in parallel forever,

Not fun, but what's our motivation for adding new \d functionality once
a viable DESCRIBE is in place? Wouldn't the \d commands essentially be
feature-frozen at that point?

> which doesn't sound like fun. But we should be making DESCRIBE with an
> eye to more use-cases than psql. If it allows jdbc to not also maintain
> a pile of equivalent code, that'd be a win. If it allows pg_dump to toss
> a bunch of logic overboard (or at least stop incrementally adding new
> variants), that'd be a big win.

I don't know enough about JDBC internals to know what sort of non-set
results it can handle, but that seems key to showing us how to proceed.

As for pg_dump, that same goal was a motivation for a similar
server-side command "SHOW CREATE <object>" (essentially, pg_dump of
<object>), which would have basically the same design issues as
DESCRIBE would, though the result set would be a much simpler SETOF
text.
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Fri, Jun 21, 2019 at 8:15 PM Robert Haas wrote:
>
> On Mon, Jun 17, 2019 at 10:50 AM Ian Barwick wrote:
> > In Pg12, the code in pg_basebackup implies the correct thing to do
> > is append to .auto.conf, but as demonstrated that can cause problems
> > with duplicate entries.
>
> Yeah.
>
> To me, forcing every tools author to use postgresql.conf parsing tools
> rather than just appending to the file is a needless burden on tool
> authors.

OTOH, if we give all tools license to append to the .auto.conf file
whenever they want, then I think the contents of the file can become
unpredictable. Basically, as of now, we allow only one backend to write
to the file, but giving a free pass to everyone can create a problem.
This won't be a problem for pg_basebackup, but it can be for other
tools.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Fri, Jun 21, 2019 at 10:31 PM Stephen Frost wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > If anybody does complain, my first reaction would be to make ALTER
> > SYSTEM strip all comment lines too.
>
> Uh, I believe it already does?

Yeah, I also think so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings,

On Sat, Jun 22, 2019 at 17:07 Amit Kapila wrote:
> OTOH, if we give license to all the tools that they can append to the
> .auto.conf file whenever they want, then, I think the contents of the
> file can be unpredictable. Basically, as of now, we allow only one
> backend to write to the file, but giving a free pass to everyone can
> create a problem. This won't be a problem for pg_basebackup, but can
> be for other tools.

I don't think anyone was suggesting that tools be allowed to modify the
file while the server is running - if a change needs to be made while
the server is running, then it should be done through a call to ALTER
SYSTEM.

There's no shortage of tools that want to modify and manipulate the
file when the server is down, though, particularly now that
recovery.conf has been merged into the main configuration.

All that said, whatever code we write for pg_basebackup to do this
properly should go into our client-side library, so other tools can
leverage it and avoid having to write it themselves.

Thanks!

Stephen
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Sun, Jun 23, 2019 at 2:43 AM Stephen Frost wrote:
>
> I don't think anyone was suggesting that tools be allowed to modify
> the file while the server is running - if a change needs to be made
> while the server is running, then it should be done through a call to
> ALTER SYSTEM.
>
> There's no shortage of tools that, particularly with the merging in of
> recovery.conf, want to modify and manipulate the file when the server
> is down though.
>
> All that said, whatever code it is that we write for pg_basebackup to
> do this properly should go into our client side library, so other
> tools can leverage that and avoid having to write it themselves.

Fair enough. In that case, don't we need some mechanism to ensure that
if the API fails, then the old contents are retained? ALTER SYSTEM
ensures that by first writing the contents to a temporary file, but I
am not sure whether whatever pg_basebackup does has that guarantee.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings,

On Sat, Jun 22, 2019 at 17:43 Amit Kapila wrote:
> Fair enough. In that case, don't we need some mechanism to ensure
> that if the API fails, then the old contents are retained? Alter
> system ensures that by writing first the contents to a temporary
> file, but I am not sure if whatever is done by pg_basebackup has that
> guarantee.

I'm not sure that's really the same. Certainly, pg_basebackup needs to
deal with a partial write, or failure of any kind, in a clean way that
indicates the backup isn't good. The important bit is that the
resulting file be one that ALTER SYSTEM and potentially other tools
will be able to work with.

Thanks,

Stephen
Re: Index Skip Scan
> Thanks for testing! You're right, looks like in the current
> implementation in case of backwards scan there is one unnecessary
> extra step forward. It seems this mistake was made, since I was
> concentrating only on the backward scans with a cursor, and used not
> exactly correct approach to wrap up after a scan was finished. Give
> me a moment, I'll tighten it up.

Thanks. Looking forward to it.

I think I found some other strange behavior. Given the same table as in
my previous e-mail, the following queries also return inconsistent
results. I spent some time trying to debug it, but can't easily
pinpoint the cause. It looks like it also skips over one value too
much; my guess is during the _bt_skippage call in _bt_skip?

Perhaps a question: when stepping through code in GDB, is there an easy
way to pretty print, for example, the contents of an IndexTuple? I saw
there are some tools out there that can pretty print plans, but viewing
tuples is more complicated I guess.

-- this one is OK
postgres=# select distinct on (a) a,b from a where b>1;
 a | b
---+---
 1 | 2
 2 | 2
 3 | 2
 4 | 2
 5 | 2
(5 rows)

-- this one is not OK, it seems to skip too much
postgres=# select distinct on (a) a,b from a where b=2;
 a | b
---+---
 1 | 2
 3 | 2
 5 | 2
(3 rows)
Re: Index Skip Scan
On Sat, Jun 22, 2019 at 3:15 PM Floris Van Nee wrote:
> Perhaps a question: when stepping through code in GDB, is there an
> easy way to pretty print for example the contents of an IndexTuple? I
> saw there's some tools out there that can pretty print plans, but
> viewing tuples is more complicated I guess.

Try the attached patch -- it has DEBUG1 traces with the contents of
index tuples from key points during index scans, allowing you to see
what's going on both as a B-Tree is descended, and as a range scan is
performed. It also shows details of the insertion scankey that is set
up within _bt_first(). This hasn't been adapted to the patch at all, so
you'll probably need to do that.

The patch should be considered a very rough hack, for now. It leaks
memory like crazy. But I think that you'll find it helpful.

-- 
Peter Geoghegan

0012-Index-scan-positioning-DEBUG1-instrumentation.patch
Description: Binary data