Hi, Le ven. 30 oct. 2020 à 15:57, Philip Semanchuk <phi...@americanefficient.com> a écrit :
> > > > On Oct 29, 2020, at 6:48 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> > wrote: > > > > On Thu, Oct 29, 2020 at 11:25:48AM -0400, Philip Semanchuk wrote: > >> > >> > >>> On Oct 28, 2020, at 9:13 PM, Justin Pryzby <pry...@telsasoft.com> > >>> wrote: > >>> > >>> On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote: > >>>> On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk > >>>> <phi...@americanefficient.com> wrote: > >>>> > >>>>>>> The item I'm focused on is node 23. The estimate is for 7 rows, > >>>>>>> actual > >>>>> is 896 (multiplied by 1062 loops). I'm confused about two things in > >>>>> this node. > >>>>>>> > >>>>>>> The first is Postgres' estimate. The condition for this index > >>>>>>> scan > >>>>> contains three expressions -- > >>>>>>> > >>>>>>> (five_uniform = zulu_five.five_uniform) AND (whiskey_mike = > >>>>>>> juliet_india.whiskey_mike) AND (bravo = 'mike'::text) > >>>>> > >>>> > >>>> Are the columns correlated? > >>> > >>> I guess it shouldn't matter, since the FKs should remove all but one > >>> of the conditions. > >> > >> Yes, I had the same expectation. I thought Postgres would calculate the > >> selectivity as 1.0 * 1.0 * whatever estimate it has for the frequency > >> of ‘mike’, but since the frequency estimate is very accurate but the > >> planner’s estimate is not, there’s something else going on. > >> > > > > Well, this is quite a bit more complicated, I'm afraid :-( The clauses > > include parameters passed from the nodes above the index scan. So even > > if we had extended stats on the table, we couldn't use them as that > > requires (Var op Const) conditions. So this likely ends up with a > > product of estimates for each clause, and even then we can't use any > > particular value so we probably end up with something like 1/ndistinct > > or something like that. So if the values actually passed to the index > > scan are more common and/or if the columns are somehow correlated, it's > > not surprising we end up with an overestimate. > > I appreciate the insight. 1/ndistinct is exactly right. In pg_stats, > five_uniform’s ndistinct = 26326, and whiskey_mike’s ndistinct = 3. The > estimated frequency of bravo = ‘mike’ is .02228. There are 25156157 rows in > the source table, so we have: > > 25156157 * (1/26326.0) * (1/3.0) * .02228 = 7.0966494209 > > Hence the estimate of 7 rows returned. > > It's interesting that five_uniform’s estimated ndistinct is low by > 50% > (actual = 62958). Paradoxically, if I manually set ndistinct to the correct > value of 62958, the estimate gets worse (3 rows instead of 7). > > Suggestions for fixing this are of course welcome. :-) > > On a related topic, are there any in depth guides to the planner that I > could read? I can (and have) read the source code and it’s been > informative, but something higher level than the source code would help. > > You may already know this, but there's a bunch of documents up there: https://wiki.postgresql.org/wiki/Using_EXPLAIN I'm also working on a project to better document this. I'm just at the beginning, writing it all, in english (which isn't my native language), so it takes time. I already have most of it in french in various documents/formats, but it takes time to go through all of these, summarize them, and translate them. Anyway, work in progress as they say. You can have a look at it there: https://pgplanner.readthedocs.io/en/latest/index.html. Any comment/help is very welcome. -- Guillaume.