Re: [HACKERS] slower merge join on sorted data chosen over nested loop

Tom Lane Mon, 10 Oct 2005 18:14:51 -0700

"Kevin Grittner" <[EMAIL PROTECTED]> writes:
> A couple questions occur to me, though.  I'm not clear on why
> ceil is called -- do we need to eliminate the fraction here?


Well, you don't get to read part of a page.  In particular, fetching 1.0
index tuples requires fetching 1.0 pages, not (say) 0.01 page.  We
consistently use ceil() in these sorts of estimates, and I think it's
correct.
 
> Also, to be clear, I thought someone said that only the leaf level
> was counted -- it looks like the other levels are being factored in,
> but not as heavily as they would be accessed with logical reads,
> because we're taking the fraction of tuples for the index multiplied
> by the total number of index pages except for the metadata page.
> This should bias the cost slightly higher for additional index levels,
> shouldn't it?

True, I overstated a little in claiming that upper pages weren't counted
at all.  However, the current formula drastically undercounts them.
Consider a simple int4 index: this will require 16 bytes per entry
(assuming MAXALIGN 4) so there could be up to 500-some index entries per
page.  At the traditional rule-of-thumb load factor of 2/3rds, that's
say 340 entries per page, which will also be the fanout factor for upper
index pages.  Assume a fully populated 3-level index:
        1 root page             340 entries
        340 2nd-level pages     340^2 = 115600 entries
        115600 leaf pages       340^3 = 39304000 entries
Total index size: 115942 pages, 39304000 leaf entries.  If we try to
calculate the cost to fetch 1000 index tuples, the current numIndexPages
calculation is
        ceil((1000 / 39304000) * (115942 - 1) + 1)
which is ceil(3.95) or 4 --- including the metapage.  So we're basically
just counting the three leaf pages that the tuples occupy; the root and
second level pages are too small a fraction of the index to affect the
result.  (The correct estimate of course would be six or seven pages
visited: metapage, root, one 2nd-level page, and 3 or 4 leaf pages
depending on boundary effects.)

And actually that example still understates the problem.  You have to
touch one page in each of the upper index levels to descend, regardless
of how many leaf pages you are going to touch in the scan.  Do the math
for the estimate of the cost to read just *one* index entry --- in this
example it comes out to 1.003 which ceil()'s up to 2, ie, the metapage
and the single leaf page.

With wider index entries, the fanout of upper pages decreases and so the
upper pages take a larger fraction of the index --- but the tree depth
increases, too, so more upper pages have to be visited to get to the
first leaf page.  You still end up with a big undercount.

I recall thinking about changing the formula to more accurately count
the number of pages touched; but I desisted when I realized that it
would drastically increase the cost estimates for index searches, and
that's surely the wrong direction to be going in.  We really can't do
that until we have some usable infrastructure to allow estimating the
probability that those pages are already in cache.  In the meantime,
the tweaks under discussion here amount to assuming that the metapage
and all upper pages are always in cache.

The current cost estimate to fetch a single tuple via indexscan is
basically random_page_cost + 2, plus some near-negligible cpu costs.
Not counting the metapage would take that down to random_page_cost + 1.
This would definitely move the goalposts, particularly for people
who run with smaller-than-default random_page_cost, but I'm not sure
if it's enough to solve the problem.

Also, all this is really just a sideshow; I think the main problem for
join estimation is that because we cost an inner-indexscan nestloop as
taking N times the cost of one execution of the inner scan, we fail to
account for cacheing effects in the table itself as well as the index.
That would cut into the random_page_cost part of the cost estimate as
well as the index cost.  For all the reasons I've cited, it's pretty
hard to justify reducing the estimate for an indexscan standing on its
own --- but in the context of a nestloop join, it's easier to make a
case.

                        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] slower merge join on sorted data chosen over nested loop

Reply via email to