You all know this FAQ: "Why does Postgres not use my index?" Half of the time this problem can easily be solved by casting a literal to the type of the respective column; this is not my topic here.
In many other cases it turns out that the planner over-estimates the cost of an index scan. Sometimes this can be worked around by lowering random_page_cost. "Of course, that's a hack that is quite unrelated to the real problem." I strongly agree ;-)

AFAICS (part of) the real problem is in costsize.c:cost_index(), where IO_cost is calculated from min_IO_cost, pages_fetched, random_page_cost, and indexCorrelation. The current implementation uses indexCorrelation^2 to interpolate between min_IO_cost and max_IO_cost, which IMHO gives results that are too close to max_IO_cost. This conjecture is supported by the fact that actual run times are often much lower than estimated when seqscans are disabled.

So we have to find a cost function such that

  . min_IO_cost <= cost <= max_IO_cost  for -1 <= indexCorrelation <= 1
  . cost --> min_IO_cost  for indexCorrelation --> +/- 1
  . cost --> max_IO_cost  for indexCorrelation --> 0
  . cost tends more towards min_IO_cost than with the current implementation

After playing around a bit I propose three functions satisfying the above conditions. All proposals use absC = abs(indexCorrelation).

Proposal 1: Use absC for linear interpolation.

    IO_cost = absC * min_IO_cost + (1 - absC) * max_IO_cost;

Proposal 2: First calculate estimates for the number of pages and the cost per page, then multiply the results.

    estPages = absC * minPages + (1 - absC) * maxPages;
    estPCost = absC * 1 + (1 - absC) * random_page_cost;
    /*                ^ sequential_page_cost */
    IO_cost  = estPages * estPCost;

Proposal 3: Interpolate geometrically, using absC.

    IO_cost = exp(absC * ln(min_IO_cost) + (1 - absC) * ln(max_IO_cost));

Here are some numbers for

    seq_page_cost    =    1  (constant)
    random_page_cost =    4  (GUC)
    minPages         =   61
    maxPages         = 1440

    corr     current        p1        p2        p3
    0        5760.00   5760.00   5760.00   5760.00
    0.1      5703.01   5190.10   4817.77   3655.22
    0.2      5532.04   4620.20   3958.28   2319.55
    0.3      5247.09   4050.30   3181.53   1471.96
    0.4      4848.16   3480.40   2487.52    934.08
    0.5      4335.25   2910.50   1876.25    592.76
    0.6      3708.36   2340.60   1347.72    376.16
    0.7      2967.49   1770.70    901.93    238.70
    0.8      2112.64   1200.80    538.88    151.48
    0.9      1143.81    630.90    258.57     96.13
    0.95      616.65    345.95    149.44     76.57
    0.99      174.41    117.99     77.03     63.84
    0.995     117.85     89.50     68.91     62.40
    0.999      72.39     66.70     62.57     61.28
    1          61.00     61.00     61.00     61.00

Another example for

    seq_page_cost    =    1  (constant)
    random_page_cost =   10  (GUC)
    minPages         =   20
    maxPages         = 938.58

    corr     current        p1        p2        p3
    0        9385.79   9385.79   9385.79   9385.79
    0.1      9292.14   8449.21   7705.17   5073.72
    0.2      9011.16   7512.64   6189.88   2742.73
    0.3      8542.87   6576.06   4839.94   1482.65
    0.4      7887.27   5639.48   3655.34    801.48
    0.5      7044.35   4702.90   2636.09    433.26
    0.6      6014.11   3766.32   1782.19    234.21
    0.7      4796.56   2829.74   1093.62    126.61
    0.8      3391.69   1893.16    570.40     68.44
    0.9      1799.50    956.58    212.53     37.00
    0.95      933.16    488.29     95.60     27.20
    0.99      206.38    113.66     31.81     21.27
    0.995     113.42     66.83     25.70     20.62
    0.999      38.72     29.37     21.11     20.12
    1          20.00     20.00     20.00     20.00

(If you want to play around with your own numbers, I can send my OOo spreadsheet privately or to the list.)

The second example shows that especially with proposal 3 we could afford to set random_page_cost to a *higher* value, which in contrast to previous recommendations seems to be appropriate; IIRC, benchmark results posted here showed values of up to 60.

As nobody knows how each of these proposals performs in real life under different conditions, I suggest leaving the current implementation in, adding all three algorithms, and supplying a GUC variable to select a cost function; a sketch of what that could look like follows below.
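For illustration, here is a minimal standalone sketch of such a selectable cost function. This is NOT the actual costsize.c code: the helper function, the enum, and their names are invented for this example, and seq_page_cost is assumed to be 1 as in the tables above, so that min_IO_cost = minPages and max_IO_cost = maxPages * random_page_cost.

/*
 * Standalone sketch -- not the real cost_index().  Names and structure
 * are hypothetical; only the four formulas match the text above.
 */
#include <math.h>
#include <stdio.h>

typedef enum CostMethod
{
    COST_CURRENT,               /* indexCorrelation^2 interpolation */
    COST_PROPOSAL1,             /* linear interpolation in absC */
    COST_PROPOSAL2,             /* estPages * estPCost */
    COST_PROPOSAL3              /* geometric interpolation */
} CostMethod;

static double
interpolated_IO_cost(double indexCorrelation,
                     double minPages,
                     double maxPages,
                     double random_page_cost,
                     CostMethod method)
{
    double      min_IO_cost = minPages;                    /* seq_page_cost = 1 */
    double      max_IO_cost = maxPages * random_page_cost;
    double      absC = fabs(indexCorrelation);

    switch (method)
    {
        case COST_CURRENT:
            {
                double      csq = indexCorrelation * indexCorrelation;

                return csq * min_IO_cost + (1.0 - csq) * max_IO_cost;
            }
        case COST_PROPOSAL1:
            return absC * min_IO_cost + (1.0 - absC) * max_IO_cost;
        case COST_PROPOSAL2:
            {
                double      estPages = absC * minPages + (1.0 - absC) * maxPages;
                double      estPCost = absC * 1.0 + (1.0 - absC) * random_page_cost;

                return estPages * estPCost;
            }
        case COST_PROPOSAL3:
            return exp(absC * log(min_IO_cost) +
                       (1.0 - absC) * log(max_IO_cost));
    }
    return max_IO_cost;         /* keep the compiler happy */
}

int
main(void)
{
    /* corr = 0.9, minPages = 61, maxPages = 1440, random_page_cost = 4 */
    printf("current %8.2f  p1 %8.2f  p2 %8.2f  p3 %8.2f\n",
           interpolated_IO_cost(0.9, 61, 1440, 4, COST_CURRENT),
           interpolated_IO_cost(0.9, 61, 1440, 4, COST_PROPOSAL1),
           interpolated_IO_cost(0.9, 61, 1440, 4, COST_PROPOSAL2),
           interpolated_IO_cost(0.9, 61, 1440, 4, COST_PROPOSAL3));
    return 0;
}

Compiled with cc -lm, this prints 1143.81, 630.90, 258.57, and 96.13, matching the corr = 0.9 row of the first table.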
Comments? Ideas? Objections?

Servus
Manfred