Manikandan Swaminathan <maniswam...@gmail.com> writes:
> 1) Why is the query currently picking the poorly performing index?

Because the planner thinks that one will be cheaper, as you can see by
comparing the cost estimates in EXPLAIN.  It's wrong, but this is a
hard problem to estimate well, especially when the behavior depends on
a correlation between columns that the planner knows nothing about.

> 2) Why would the index you suggested, (col_b, col_a), perform better
> than (col_a, col_b)? I would’ve expected the filter on col_a to come
> first, followed by the aggregate on col_b. In my mind, it needs to
> find rows matching the col_a condition before calculating the
> MIN(col_b), and I assumed it would traverse the B-tree accordingly.

The idea the planner is using is "scan the index in order (that is, in
col_b order) until you find the first row satisfying the other
constraints (that is, the col_a condition).  Then that row's col_b
value is the correct MIN(), and you can stop."  Since it knows nothing
of the cross-column correlation, its estimate of how many rows it'll
have to scan through is overly optimistic.  But it knows that the
other way involves scanning a whole lot of the index --- there's no
chance of stopping early --- so that's estimated as higher-cost.

The index I suggested on (col_b, col_a) is amenable to this same plan
shape, since col_b is still the major sort column.  The reason it wins
is that the col_a condition can be checked in the index without having
to visit the heap, thus eliminating a lot of random access to the heap.
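To make that concrete, here's a minimal sketch (I'm guessing at your
table name "t" and making up index names; the 4996 cutoff is the one
you mentioned):

    -- assumed setup: table "t" with integer columns col_a and col_b
    CREATE INDEX t_b_idx  ON t (col_b);         -- scan in col_b order, stop early
    CREATE INDEX t_ba_idx ON t (col_b, col_a);  -- same shape, col_a checked in-index
    CREATE INDEX t_ab_idx ON t (col_a, col_b);  -- col_a-major alternative

    -- the query under discussion
    EXPLAIN SELECT min(col_b) FROM t WHERE col_a > 4996;

Since DDL is transactional in Postgres, you can see what the planner
estimates for the losing alternatives by hiding an index inside a
transaction:

    BEGIN;
    DROP INDEX t_ba_idx;    -- planner can no longer pick this one
    EXPLAIN SELECT min(col_b) FROM t WHERE col_a > 4996;
    ROLLBACK;               -- index comes back, no rebuild needed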
> 3) Why does the planner choose the better-performing (col_a, col_b)
> index when the filter is col_a > 5000, but switch to the slower
> (col_b) index when the filter is not at the edge of the range, like
> col_a > 4996?

At some point, as less and less of the col_a-major index would need to
be scanned, there's a crossover in the cost estimates for the two ways
of doing this.  I would not have cared to predict where the crossover
is, but you evidently found it empirically.

> For reference, here’s the query plan when filtering for col_a > 5000.
> It uses the correct index on (col_a, col_b).

You would do a lot better to approach this without rigid notions of
which is the "correct" index.  All of the ones we've discussed are
potentially usable for this query, and they all have different cost
curves depending on how selective the col_a condition is.  Even the
index on col_b alone could potentially be the best, because it'll be
smaller than the two-column indexes.  So if the col_a condition is
very unselective then it's (at least in theory) possible that that
would be the best choice.

			regards, tom lane