> > AFAICS (part of) the real problem is in costsize.c:cost_index(),
> > where IO_cost is calculated from min_IO_cost, pages_fetched,
> > random_page_cost, and indexCorrelation.  The current implementation
> > uses indexCorrelation^2 to interpolate between min_IO_cost and
> > max_IO_cost, which IMHO gives results that are too close to
> > max_IO_cost.
>
> The indexCorrelation^2 algorithm was only a quick hack with no theory
> behind it :-(.  I've wanted to find some better method to put in
> there, but have not had any time to research the problem.
Could we "quick hack" it to a geometric mean instead, since a mean
seemed to yield better results than indexCorrelation^2?

> > As nobody knows how each of these proposals performs in real life
> > under different conditions, I suggest to leave the current
> > implementation in, add all three algorithms, and supply a GUC
> > variable to select a cost function.
>
> I don't think it's really a good idea to expect users to pick among
> multiple cost functions that *all* have no guiding theory behind
> them.  I'd prefer to see us find a better cost function and use it.
> Has anyone trawled the database literature on the subject?

Hrm, after an hour of searching and reading, I think one of the better
papers on the subject can be found here:

http://www.cs.ust.hk/faculty/dimitris/PAPERS/TKDE-NNmodels.pdf

Page 13, figure 3-12 is the ticket you were looking for, Tom.  It's an
interesting read with a pretty good analysis and conclusion.  The
author notes that his formula begins to fall apart when the number of
dimensions reaches 10 and suggests using a high-dimension random cost
estimation algorithm instead, but those come at great expense to the
CPU (which is in line with a few other papers I read).  The idea of
precomputing values piqued my interest and struck me as very clever,
albeit space-intensive.  *shrug*

-sc

-- 
Sean Chittenden