Re: Use of additional index columns in rows filtering

Peter Geoghegan Thu, 03 Aug 2023 17:09:36 -0700

On Thu, Aug 3, 2023 at 3:04 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> When you say "index page accesses" do you mean accesses to index pages,
> or accesses to heap pages from the index scan?


Yes, that's exactly what I mean. Note that that's the dominant cost
for the original BitmapOr plan.

As I said upthread, the original BitmapOr plan has 7 buffer hits. The
breakdown is 1 single heap page access, 3 root page accesses, and 3
leaf page accesses. There is only 1 heap page access because only 1
out of the 3 index scans that feed into the BitmapOr actually end up
finding any matching rows in the index.

In short, the dominant cost here is index page accesses. It's a
particularly good case for my SAOP patch!

> Because my patch is all about reducing the heap pages, which are usually
> the expensive part of the index scan. But you're right the "index scan"
> with index filter may access more index pages, because it has fewer
> "access predicates".

The fact is that your patch correctly picks the cheapest plan, which
is kinda like a risky version of the plan that my SAOP patch would
pick -- it is cheaper for the very same reason. I understand that
that's not your intention at all, but this is clearly what happened.
That's what I meant by "weird second order effects".

To me, it really does kinda look like your patch accidentally
discovered a plan that's fairly similar to the plan that my SAOP patch
would have found by design! Perhaps I should have been clearer on this
point earlier. (If you're only now seeing this for yourself for the
first time, then...oops. No wonder you were confused about which
patch it was I was going on about!)

> I don't quite see that with the tenk1 query we've been discussing (the
> extra buffers were due to non-allvisible heap pages), but I guess that's
> possible.

The extra buffer hits occur because I made them occur by inserting new
tuples where thousand = 42. Obviously, I did it that way because I had
a point that I wanted to make. Obviously, there wouldn't have been any
notable regression from your patch at all if I had (say) inserted
tuples where thousand = 43 instead. (Not for the original "42" query,
at least.)

That's part of the problem, as I see it. Local changes like that can
have outsized impact on individual queries, even though there is no
inherent reason to expect it. How can statistics reliably guide the
planner here? Statistics are supposed to be a summary of the whole
attribute, that allow us to make various generalizations during
planning. But this plan leaves us sensitive to relatively small
changes in one particular "thousand" grouping, with potentially
outsized impact. And, this can happen very suddenly, because it's so
"local".

Making this plan perform robustly just doesn't seem to be one of the
things that statistics can be expected to help us with very much.

--
Peter Geoghegan

Re: Use of additional index columns in rows filtering

Reply via email to