Hi,

On Mon, Nov 2, 2020 at 8:09 PM, Philip Semanchuk <phi...@americanefficient.com> wrote:
> On Oct 31, 2020, at 9:53 AM, Guillaume Lelarge <guilla...@lelarge.info> wrote:
> >
> > Hi,
> >
> > On Fri, Oct 30, 2020 at 3:57 PM, Philip Semanchuk <phi...@americanefficient.com> wrote:
> > >
> > > On Oct 29, 2020, at 6:48 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> > > >
> > > > On Thu, Oct 29, 2020 at 11:25:48AM -0400, Philip Semanchuk wrote:
> > > > >
> > > > > On Oct 28, 2020, at 9:13 PM, Justin Pryzby <pry...@telsasoft.com> wrote:
> > > > > >
> > > > > > On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote:
> > > > > > >
> > > > > > > On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk <phi...@americanefficient.com> wrote:
> > > > > > > >
> > > > > > > > The item I'm focused on is node 23. The estimate is for 7 rows,
> > > > > > > > actual is 896 (multiplied by 1062 loops). I'm confused about
> > > > > > > > two things in this node.
> > > > > > > >
> > > > > > > > The first is Postgres' estimate. The condition for this index
> > > > > > > > scan contains three expressions --
> > > > > > > >
> > > > > > > > (five_uniform = zulu_five.five_uniform) AND (whiskey_mike =
> > > > > > > > juliet_india.whiskey_mike) AND (bravo = 'mike'::text)
> > > > > > >
> > > > > > > Are the columns correlated?
> > > > > >
> > > > > > I guess it shouldn't matter, since the FKs should remove all but
> > > > > > one of the conditions.
> > > > >
> > > > > Yes, I had the same expectation. I thought Postgres would calculate
> > > > > the selectivity as 1.0 * 1.0 * whatever estimate it has for the
> > > > > frequency of 'mike', but since the frequency estimate is very
> > > > > accurate while the planner's estimate is not, there's something
> > > > > else going on.
> > > >
> > > > Well, this is quite a bit more complicated, I'm afraid :-( The
> > > > clauses include parameters passed from the nodes above the index
> > > > scan. So even if we had extended stats on the table, we couldn't use
> > > > them, as that requires (Var op Const) conditions. So this likely ends
> > > > up with a product of estimates for each clause, and even then we
> > > > can't use any particular value, so we probably end up with something
> > > > like 1/ndistinct. So if the values actually passed to the index scan
> > > > are more common and/or if the columns are somehow correlated, it's
> > > > not surprising we end up with an underestimate.
> > >
> > > I appreciate the insight. 1/ndistinct is exactly right. In pg_stats,
> > > five_uniform's ndistinct = 26326, and whiskey_mike's ndistinct = 3.
> > > The estimated frequency of bravo = 'mike' is .02228. There are
> > > 25156157 rows in the source table, so we have:
> > >
> > > 25156157 * (1/26326.0) * (1/3.0) * .02228 = 7.0966494209
> > >
> > > Hence the estimate of 7 rows returned.
> > >
> > > It's interesting that five_uniform's estimated ndistinct is low by
> > > more than 50% (actual = 62958). Paradoxically, if I manually set
> > > ndistinct to the correct value of 62958, the estimate gets worse
> > > (3 rows instead of 7).
> > >
> > > Suggestions for fixing this are of course welcome. :-)
> > >
> > > On a related topic, are there any in-depth guides to the planner that
> > > I could read? I can (and have) read the source code, and it's been
> > > informative, but something higher level than the source code would
> > > help.
> >
> > You may already know this, but there's a bunch of documents up there:
> > https://wiki.postgresql.org/wiki/Using_EXPLAIN
>
> Many thanks, yes, I've visited most of those links and learned an enormous
> amount from them. I've downloaded many of them for re-reading, including
> yours. :-) It's helpful to be reminded of them again.
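For anyone who wants to reproduce the arithmetic quoted above, here's a
minimal sketch; "source_table" is a placeholder (the identifiers in this
thread are anonymized), but pg_stats and the per-column n_distinct override
are standard PostgreSQL:

    -- Statistics the planner used for each column in the index condition.
    SELECT attname, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'source_table'   -- placeholder name
      AND attname IN ('five_uniform', 'whiskey_mike', 'bravo');

    -- Reproduce the row estimate: rows * (1/ndistinct) * (1/ndistinct) * freq
    SELECT 25156157 * (1/26326.0) * (1/3.0) * 0.02228;  -- ~7.10, hence rows=7

    -- The manual override mentioned above; it takes effect at the next ANALYZE.
    ALTER TABLE source_table ALTER COLUMN five_uniform SET (n_distinct = 62958);
    ANALYZE source_table;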
> EXPLAIN ANALYZE tells me what choices the planner made, but it doesn't
> tell me why the planner made those choices. For instance, Tomas Vondra's
> post enabled me to calculate how the planner arrived at its estimate of 7
> rows for one node of my query. I would prefer not to reverse engineer the
> planner's calculation, but instead have the planner just tell me.
>
> If I were able to combine that information with a summary of the planner's
> algorithm (a lot to ask for!), then I could understand how the planner
> chose its plan.
>
> The query I asked about in the original post of this thread has 13
> relations in it. IIUC, that's 13!, or more than 6 billion possible plans.
> How did the planner pick one plan out of 6 billion? I'm curious, both for
> practical purposes (I want my query to run well) and also because it's
> fascinating.

I understand, and I mostly agree, especially on the fascinating side of the
planner. Anyway, that's what I'm working on right now. It will take a lot
of time, and it will probably contain a lot of errors at the beginning, but
I'll be happy to fix them.

--
Guillaume.
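A footnote on the 13! figure: the planner doesn't actually enumerate every
ordering. As a rough sketch, the standard planner runs a dynamic-programming
search over join subsets, and a few settings cap how large a search it
attempts before it falls back to the genetic optimizer (GEQO); the values
in the comments are the defaults in recent PostgreSQL releases:

    -- 13 relations allow 13! join orders.
    SELECT factorial(13);      -- 6227020800

    -- Knobs that bound the planner's search (defaults shown):
    SHOW from_collapse_limit;  -- 8: larger FROM lists aren't fully flattened
    SHOW join_collapse_limit;  -- 8: explicit JOINs beyond this keep their order
    SHOW geqo_threshold;       -- 12: this many FROM items or more triggers GEQO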