"Kevin Grittner" <[EMAIL PROTECTED]> writes: > A couple questions occur to me, though. I'm not clear on why > ceil is called -- do we need to eliminate the fraction here?
Well, you don't get to read part of a page. In particular, fetching 1.0 index tuples requires fetching 1.0 pages, not (say) 0.01 page. We consistently use ceil() in these sorts of estimates, and I think it's correct. > Also, to be clear, I thought someone said that only the leaf level > was counted -- it looks like the other levels are being factored in, > but not as heavily as they would be accessed with logical reads, > because we're taking the fraction of tuples for the index multiplied > by the total number of index pages except for the metadata page. > This should bias the cost slightly higher for additional index levels, > shouldn't it? True, I overstated a little in claiming that upper pages weren't counted at all. However, the current formula drastically undercounts them. Consider a simple int4 index: this will require 16 bytes per entry (assuming MAXALIGN 4) so there could be up to 500-some index entries per page. At the traditional rule-of-thumb load factor of 2/3rds, that's say 340 entries per page, which will also be the fanout factor for upper index pages. Assume a fully populated 3-level index: 1 root page 340 entries 340 2nd-level pages 340^2 = 115600 entries 115600 leaf pages 340^3 = 39304000 entries Total index size: 115942 pages, 39304000 leaf entries. If we try to calculate the cost to fetch 1000 index tuples, the current numIndexPages calculation is ceil((1000 / 39304000) * (115942 - 1) + 1) which is ceil(3.95) or 4 --- including the metapage. So we're basically just counting the three leaf pages that the tuples occupy; the root and second level pages are too small a fraction of the index to affect the result. (The correct estimate of course would be six or seven pages visited: metapage, root, one 2nd-level page, and 3 or 4 leaf pages depending on boundary effects.) And actually that example still understates the problem. You have to touch one page in each of the upper index levels to descend, regardless of how many leaf pages you are going to touch in the scan. Do the math for the estimate of the cost to read just *one* index entry --- in this example it comes out to 1.003 which ceil()'s up to 2, ie, the metapage and the single leaf page. With wider index entries, the fanout of upper pages decreases and so the upper pages take a larger fraction of the index --- but the tree depth increases, too, so more upper pages have to be visited to get to the first leaf page. You still end up with a big undercount. I recall thinking about changing the formula to more accurately count the number of pages touched; but I desisted when I realized that it would drastically increase the cost estimates for index searches, and that's surely the wrong direction to be going in. We really can't do that until we have some usable infrastructure to allow estimating the probability that those pages are already in cache. In the meantime, the tweaks under discussion here amount to assuming that the metapage and all upper pages are always in cache. The current cost estimate to fetch a single tuple via indexscan is basically random_page_cost + 2, plus some near-negligible cpu costs. Not counting the metapage would take that down to random_page_cost + 1. This would definitely move the goalposts, particularly for people who run with smaller-than-default random_page_cost, but I'm not sure if it's enough to solve the problem. Also, all this is really just a sideshow; I think the main problem for join estimation is that because we cost an inner-indexscan nestloop as taking N times the cost of one execution of the inner scan, we fail to account for cacheing effects in the table itself as well as the index. That would cut into the random_page_cost part of the cost estimate as well as the index cost. For all the reasons I've cited, it's pretty hard to justify reducing the estimate for an indexscan standing on its own --- but in the context of a nestloop join, it's easier to make a case. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings