On Wed, Aug 9, 2023 at 11:15 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> Cool. I'll try to build my own set of examples that I find interesting
> either because it's what the patch aims to help with, or because I
> expect it to be problematic for some reason. And then we can compare.

That would be great. I definitely want to make this a collaborative thing.

> Yup, I agree with that principle. The AM can evaluate the expression in
> a smarter way, without the visibility checks.

Attached text document (which I guess might be my first draft) is an
attempt to put the discussion up until this point on a more formal
footing.

The format here tries to reduce the principles to a handful of bullet
points. For example, one line reads:

+ Index quals are better than equivalent index filters because bitmap
index scans can only use index quals

I'm pretty sure that these are all points that you and I both agree on
already. But you should confirm that. And make your own revisions, as
you see fit.

It's definitely possible that I overlooked an interesting and durable
advantage that index filters have. If there is some other way in which
index filters are likely to remain the best and only viable approach,
then we should note that. I just went with the really obvious case of
an expression that definitely needs visibility checks to avoid ever
throwing a division-by-zero error, related to some invisible tuple.
It's an extreme case in that it focuses on requirements that seem just
about unavoidable in any future world. (Come to think of it, the
biggest and most durable advantage for index filters is probably just
how general they are, which I do mention.)

-- 
Peter Geoghegan
General principles for index quals
==================================

Tomas Vondra's "evaluate filters on the index tuple" patch works by making expressions that couldn't become index
quals into index filters (not table filters) where possible.  The general idea is that the planner should
recognize a kind of "qual hierarchy", where it is always preferable for each index path to use true index quals
where possible, but, failing that, it is still preferable to make the predicate into an index filter rather than into a
table filter.

In planner:
    Index Quals > Index Filter > Table Filter
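
As an illustrative sketch (the table, index, and column names here are hypothetical stand-ins, not from the patch),
each predicate in a query like the following would ideally land in a different tier of the hierarchy:

-- Assumed setup: a composite index covering two of the three referenced columns
CREATE INDEX t_a_b_idx ON t (a, b);

-- "a = 1" can be a true index qual (Index Cond), since it uses an indexable
-- operator on the leading index column.  "b <> 1" references an indexed
-- column but isn't indexable, so the patch can evaluate it against the index
-- tuple as an index filter.  "c = 2" references a column that isn't in the
-- index at all, so it can only ever be a table filter.
SELECT * FROM t WHERE a = 1 AND b <> 1 AND c = 2;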

Although this general design makes sense, it nevertheless can lead to 
suboptimal outcomes: situations where the planner
chooses a plan that is obviously not as good as some hypothetical other plan 
that manages to use true index quals
through some (currently unimplemented) other means.

This isn't really a problem with the patch itself; it is more a case of the 
planner lacking complementary optimizations
that make predicates into true index quals before the mechanism added by the 
patch is even reached.  The "catch-all"
nature of the patch isn't a problem to be fixed -- its very general nature is a 
key strength of the approach (it uses
full expression evaluation, which is much more general than any approach that 
ultimately reduces predicates to indexable
operators from an opclass).

This document lists certain specific scenarios where the patch doesn't quite do as well as one would hope.  That puts
things on a good footing for adding other complementary techniques later on.  Theoretically this will be independent
work that could have happened at any point in the past.  But the patch has brought these other related issues into
sharp relief for the first time -- so it doesn't necessarily feel all that independent.

A working "taxonomy" for all of these techniques seems useful, even one like this, which is generally acknowledged to
be incomplete and far from perfect.

Why index filters are better than equivalent table filters
----------------------------------------------------------

This part is very simple.

+ They're better because they have a decent chance of avoiding significant amounts of heap access for expression evaluation.

This extends some of the benefits that index-only scans have to plain index 
scans.
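
One way to observe the difference (a sketch -- exact plan shape and buffer counts will vary) is to compare
EXPLAIN (ANALYZE, BUFFERS) output for a qual on an indexed-but-not-indexable column, with and without the patch:

-- With the patch, "two <> 1" can be checked against the index tuple (subject
-- to a visibility map check), so heap pages containing only rows that fail
-- the filter need never be fetched.  Without it, the same qual becomes a
-- table filter, and every candidate row costs a heap access.
VACUUM tenk1;  -- set visibility map bits, so index-tuple evaluation is safe
EXPLAIN (ANALYZE, BUFFERS)
SELECT ctid, * FROM tenk1 WHERE four = 1 AND two <> 1;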

Why index quals are better than equivalent index filters
--------------------------------------------------------

The advantages that index quals have over index filters can be much less clear 
than one would hope because index filters
are confusingly similar to index quals.  But index quals are significantly 
better, overall.  This is guaranteed to be
true, assuming we're comparing "equivalent" index quals and index filters -- 
logically interchangeable forms of quals
with different physical/runtime characteristics.

(An index filter might be better than an index qual when it happens to be much more selective, because it filters tuples
in some fundamentally different way, but that's not relevant to this discussion.  We're focused on clear-cut cases,
where we can speak with the most confidence, since that's a useful basis for further improvements later on.)

We may even be missing out on the advantages of true index quals in cases where the patch actually offers big
improvements compared to what was possible in Postgres 16 -- since, of course, index filters can still be much better
than table filters.  To make matters more confusing, the consequences may only be apparent at certain times, for a given
query.  And minor variants of the same query can be very sensitive to whether true index quals or mere index filters
could be used.  We must keep all of this in perspective, which is hard.

Example 1: bitmap index scan, prefer index quals
------------------------------------------------

+ Index quals are better than equivalent index filters because bitmap index 
scans can only use index quals

If we have an index on "tenk1 (four, two)", then we miss out on the opportunity 
to eliminate heap accesses for a query
like this one:

select ctid, *
from tenk1
where four = 1 and two != 1;

This gives us a "bad" plan (at least in relative-to-ideal-case terms), since 
there is an excessive amount of heap access:

                                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on public.tenk1  (cost=25.35..407.85 rows=1250 width=250) (actual time=0.707..0.707 rows=0 loops=1)
   Output: ctid, unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
   Recheck Cond: (tenk1.four = 1)
   Filter: (tenk1.two <> 1)
   Rows Removed by Filter: 2500
   Heap Blocks: exact=345
   Buffers: shared hit=349
   ->  Bitmap Index Scan on tenk1_four_two_idx  (cost=0.00..25.04 rows=2500 width=0) (actual time=0.073..0.074 rows=2500 loops=1)
         Index Cond: (tenk1.four = 1)
         Buffers: shared hit=4
 Planning Time: 0.037 ms
 Execution Time: 0.719 ms

Here we see a bitmap index scan plan (that uses our composite index), which 
makes sense overall.  But the details beyond
that make no sense -- since we're using table filter quals for "two".  It turns 
out that the bitmap heap scan will
access every single heap page in the tenk1 table as a result, even though we 
could have literally avoided all heap
accesses had we been able to push down the != as an index qual.

As things stand, the patch gives an enormous advantage to index scans over 
bitmap index scans.  But the costing fails to
capture that difference.  That may be okay in many individual cases, but it 
seems questionable here.  Intuitively, there
is no particular reason why this particular case (involving a simple 
inequality) couldn't find a way to use index quals
instead of relying on index filters.  That would be in keeping with the 
traditional tendency of index scans to do
approximately the same thing as similar bitmap index scans (just in a slightly 
different order).

This example relies on an understanding that we really could make != into just another indexable operator, since it is
already a kind of honorary member of the btree opclass.  That is pretty clear-cut -- nbtree knows how to do a bunch
of fairly similar tricks already.
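
One way to see that <> is not currently an indexable operator (this catalog query is only an illustration) is to list
the btree integer_ops strategies -- just the five standard comparison operators appear, with no <> among them:

SELECT a.amopstrategy, a.amopopr::regoperator
FROM pg_amop a
JOIN pg_opfamily f ON f.oid = a.amopfamily
JOIN pg_am m ON m.oid = f.opfmethod
WHERE m.amname = 'btree'
  AND f.opfname = 'integer_ops'
  AND a.amoplefttype = 'int4'::regtype
  AND a.amoprighttype = 'int4'::regtype
ORDER BY a.amopstrategy;
-- btree strategies 1..5 map to <, <=, =, >= and >.  nbtree would have to
-- treat <> as a special case (an "honorary member"), much as it already
-- handles other special cases internally.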

Just for context, here is the patch's "good" index scan plan (or better plan, 
at least).  This plan does the trick that
we'd ideally see from both index scans and bitmap index scans alike (namely, it 
avoids all heap accesses):

                                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Index Scan using tenk1_four_two_idx on public.tenk1  (cost=0.29..709.59 rows=1250 width=250) (actual time=0.300..0.300 rows=0 loops=1)
   Output: ctid, unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
   Index Cond: (tenk1.four = 1)
   Index Filter: (tenk1.two <> 1)
   Rows Removed by Index Recheck: 2500
   Filter: (tenk1.two <> 1)
   Buffers: shared hit=5
 Planning Time: 0.035 ms
 Execution Time: 0.311 ms

(Though of course it should actually make everything into "Index Cond:" index 
quals -- need to teach nbtree how to do
that in some related patch that I have yet to start work on.)

Costing each variant becomes much less important if we can find a way to make 
all useful runtime behavior available with
both plan variants.

Example 2: using index filters is inevitable and natural
--------------------------------------------------------

We can imagine a somewhat similar query where the only possible approach will be the index filters put there by
the patch.  Here we use the same index as before, on "tenk1 (four, two)":

select ctid, *
from tenk1
where four = 1 and 2/two = 1;

This gives us a good plan:

                                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Index Scan using tenk1_four_two_idx on public.tenk1  (cost=0.29..59.14 rows=14 width=250) (actual time=0.321..0.321 rows=0 loops=1)
   Output: ctid, unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
   Index Cond: (tenk1.four = 1)
   Index Filter: ((2 / tenk1.two) = 1)
   Rows Removed by Index Recheck: 2500
   Filter: ((2 / tenk1.two) = 1)
   Buffers: shared hit=5
 Planning Time: 0.043 ms
 Execution Time: 0.334 ms

Here we must do a visibility check in order to know ahead of time that any 
division-by-zero errors are limited to those
index tuples actually visible to our MVCC snapshot.  That factor is what makes 
it very clear that the only sensible way
to do this is by using an index filter -- index AMs cannot be expected to do 
true expression evaluation.
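
A concrete schedule (hypothetical, but easy to reproduce by hand) shows the hazard that the visibility check avoids:

-- Session 1: leaves behind an index tuple with two = 0 that is invisible to
-- later snapshots (dead, but not yet cleaned up by VACUUM)
INSERT INTO tenk1 (four, two) VALUES (1, 0);
DELETE FROM tenk1 WHERE four = 1 AND two = 0;

-- Session 2: if "2/two = 1" were evaluated against every index tuple before
-- checking visibility, the dead tuple would raise a spurious division-by-zero
-- error.  Checking visibility first limits expression evaluation to tuples
-- that our MVCC snapshot can actually see.
SELECT ctid, * FROM tenk1 WHERE four = 1 AND 2/two = 1;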

One would hope that the optimizer would be able to give full credit to the 
index scan plan over a similar, potentially
much slower bitmap index scan plan (where the use of "table filters" is 
inevitable and has the potential to make things
far slower).  Whether or not that happens is ultimately a fairly standard 
question of costing, which makes it out of
scope for this document.

Our first and second examples can be placed into each of two different buckets in a fairly clear-cut, black-and-white
way.  There is also a huge amount of gray area that one can think of, but that's out of scope for this document.  This
document aims to be a starting point that will provide guidelines about how to categorize any of these gray
area cases.  Of course, our ability to implement various transformations as a practical matter (e.g., teaching nbtree to
deal with inequalities natively) is bound to play a very important role in practice -- maybe even the largest role.

Example 3: "risky" plan shifts away from a BitmapOr plan
--------------------------------------------------------

+ Index quals are better than index filters because they have the potential to 
give the index AM important context.

+ Index quals are better than index filters because they can reliably avoid 
heap access, no matter how the VM is set.

Currently, the patch has one notable change in regression tests output, for 
this query:

SELECT * FROM tenk1 WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR 
tenthous = 42);

Master branch plan, using regression test composite index on "tenk1(thousand, 
tenthous)":

                                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on public.tenk1  (cost=6.89..8.91 rows=1 width=244) (actual time=0.010..0.011 rows=1 loops=1)
   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
   Recheck Cond: (((tenk1.thousand = 42) AND (tenk1.tenthous = 1)) OR ((tenk1.thousand = 42) AND (tenk1.tenthous = 3)) OR ((tenk1.thousand = 42) AND (tenk1.tenthous = 42)))
   Heap Blocks: exact=1
   Buffers: shared hit=7
   ->  BitmapOr  (cost=6.89..6.89 rows=1 width=0) (actual time=0.008..0.009 rows=0 loops=1)
         Buffers: shared hit=6
         ->  Bitmap Index Scan on tenk1_thous_tenthous  (cost=0.00..2.29 rows=1 width=0) (actual time=0.005..0.005 rows=0 loops=1)
               Index Cond: ((tenk1.thousand = 42) AND (tenk1.tenthous = 1))
               Buffers: shared hit=2
         ->  Bitmap Index Scan on tenk1_thous_tenthous  (cost=0.00..2.29 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
               Index Cond: ((tenk1.thousand = 42) AND (tenk1.tenthous = 3))
               Buffers: shared hit=2
         ->  Bitmap Index Scan on tenk1_thous_tenthous  (cost=0.00..2.29 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)
               Index Cond: ((tenk1.thousand = 42) AND (tenk1.tenthous = 42))
               Buffers: shared hit=2
 Planning Time: 0.059 ms
 Execution Time: 0.028 ms

The patch uses this plan, which is the faster plan according to every 
traditional measure:

                                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Index Scan using tenk1_thous_tenthous on public.tenk1  (cost=0.29..4.38 rows=1 width=244) (actual time=0.010..0.012 rows=1 loops=1)
   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
   Index Cond: (tenk1.thousand = 42)
   Index Filter: ((tenk1.tenthous = 1) OR (tenk1.tenthous = 3) OR (tenk1.tenthous = 42))
   Rows Removed by Index Recheck: 9
   Filter: ((tenk1.tenthous = 1) OR (tenk1.tenthous = 3) OR (tenk1.tenthous = 42))
   Buffers: shared hit=4
 Planning Time: 0.051 ms
 Execution Time: 0.023 ms

Notably, the original BitmapOr plan only accesses one heap page buffer.  So the 
patch gets a faster plan by saving on
index page accesses, even though that wasn't really its intent.

Example 4: alternative approach, by SAOP patch
----------------------------------------------

Another patch (this one from Peter Geoghegan) improves SAOP execution.  If you 
run the same query with that patch
applied, rewritten to use an IN ( ... ), you get the following similar though 
slightly better plan:

                                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Index Scan using tenk1_thous_tenthous on public.tenk1  (cost=0.29..8.33 rows=1 width=244) (actual time=0.008..0.008 rows=1 loops=1)
   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
   Index Cond: ((tenk1.thousand = 42) AND (tenk1.tenthous = ANY ('{1,3,42}'::integer[])))
   Buffers: shared hit=3
 Planning Time: 0.043 ms
 Execution Time: 0.017 ms

This plan does the absolute bare minimum number of buffer accesses, since it never repeats an access to the same
buffer.  This is a key design goal for the SAOP patch.  That in itself isn't terribly interesting just yet --
what's interesting is that the plan is more _robust_, in the sense that it will _reliably_ perform a lot better than the
similar index filter plan, even if various closely related things should happen to shift.

Suppose, for example, we do this:

insert into tenk1 (thousand, tenthous) select 42, i from generate_series(43, 
1000) i;

The situation at execution time with the SAOP patch remains completely unchanged.  If we re-execute the query we once
again get 3 buffer hits (the plan still _reliably_ does the absolute minimum number of page accesses, in the way we
already described).

Meanwhile, this is what we see for the index filter patch -- much higher 
execution cost compared to our original example 3.
This is due to a change that doesn't affect any of the rows the query needs to 
return anyway:

                                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Index Scan using tenk1_thous_tenthous on public.tenk1  (cost=0.29..4.38 rows=1 width=244) (actual time=0.020..0.382 rows=1 loops=1)
   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
   Index Cond: (tenk1.thousand = 42)
   Index Filter: ((tenk1.tenthous = 1) OR (tenk1.tenthous = 3) OR (tenk1.tenthous = 42))
   Rows Removed by Index Recheck: 967
   Filter: ((tenk1.tenthous = 1) OR (tenk1.tenthous = 3) OR (tenk1.tenthous = 42))
   Buffers: shared hit=336
 Planning Time: 0.088 ms
 Execution Time: 0.402 ms

To be fair, the index filter patch wouldn't have done too badly had we consistently used SAOP syntax (it would have been
just like Postgres 16, which isn't "robust", but also isn't as "risky").  It is the same as the plan that the SAOP patch
gets, except that we see the same issue with excessive primitive index scans/repeated index page accesses:

                                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Index Scan using tenk1_thous_tenthous on public.tenk1  (cost=0.29..8.90 rows=1 width=244) (actual time=0.011..0.012 rows=1 loops=1)
   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
   Index Cond: ((tenk1.thousand = 42) AND (tenk1.tenthous = ANY ('{1,3,42}'::integer[])))
   Buffers: shared hit=7
 Planning Time: 0.047 ms
 Execution Time: 0.022 ms

The extra primitive index scans are of minimal concern here (though that might 
not be true in more elaborate cases with
hundreds of constants).

The real problem here seems to be a perverse interaction with OR expression evaluation.  It's already possible to
effectively transform the expression into index quals for the BitmapOr path, through other means.  Ideally, we'd
consistently "normalize to CNF" earlier, so that we'd at least get a "robust" SAOP-based plan in most simple cases
involving OR expressions (no less robust than in Postgres 16).  That would mean that the index filter patch would get a
plan that was expected to be more expensive, but also one that is relatively robust.  But once we had the main SAOP
patch, everything would be strictly better here.
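
The transformation in question is mechanical (a sketch -- the general rewrite applies whenever every OR arm compares
the same column for equality):

-- Original spelling: true index quals are only reachable via a BitmapOr path
SELECT * FROM tenk1
WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);

-- Normalized spelling: a single SAOP clause, which nbtree supports as a true
-- index qual -- one robust index path instead of three separate scans
SELECT * FROM tenk1
WHERE thousand = 42 AND tenthous = ANY (ARRAY[1, 3, 42]);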

In general, a SAOP (with nbtree, which has native support for SAOP clauses) 
passes down true index quals, that give the
index AM the context required to terminate scans early, and to avoid heap 
accesses.  While the SAOP patch pushes that
idea a lot further, it isn't really a new idea.

It should also be noted that the problem with the index filter patch is one of 
degree.  Even Postgres 16 will eventually
have the same problem, with a sufficiently elaborate OR-based expression -- 
since the cost of repeated index accesses
pushes things in that direction eventually.  The SAOP patch has the advantage 
of reliably avoiding the issue by reliably
avoiding repeat accesses to index pages.  This means that the plan we saw is 
always the cheapest plan, no matter how the
details are varied.  This happens because we simply don't have to ever make a 
trade-off between heap page accesses and
index page accesses -- the SAOP plan is literally guaranteed to win, no matter 
what happens at runtime.

For that reason, "risky OR plans" can be addressed within the scope of the main SAOP patch, and/or Alena Rybakina's
OR-to-SAOP transformation patch.  It seems unreasonable to make that the problem of the index filter patch.

(Actually, it's still possible that we ought to have chosen a sequential scan, and that we'll make the wrong choice --
but that's not something that we can just avoid.)

+ It's suboptimal (and weird) that we might sometimes rely on index filters to 
reduce index page accesses

+ Index quals can achieve the same desirable outcome "robustly", because the 
index AM has all required context

+ "Choice is confusion" -- we should always prefer one index path that "offers 
all of the advantages" over two or more hypothetical alternative index paths 
that can only do one thing well

+ More generally, we should try to avoid nonlinear cost differences between 
highly similar plans (e.g., bitmap index scan vs index scan of same index).

In nbtree (using Markus Winand's terminology, which we should arguably have 
exposed in EXPLAIN output):
    Access predicates > Index filter predicates

Within nbtree itself, we could say that SK_BT_REQFWD and SK_BT_REQBKWD scan 
keys are both types of "Access predicates".
It's a bit more complicated than that because with inequalities you'll often 
have scan keys that are required in one
scan direction (e.g., forward scan case), but not required in the other 
direction (should the direction change).
BTEqualStrategyNumber scan keys are both SK_BT_REQFWD and SK_BT_REQBKWD at the same time, regardless of other details.
But overall, SK_BT_REQFWD/SK_BT_REQBKWD scan keys indicate access predicates, while anything else is merely an index
filter predicate, which affects neither the scan's initial position nor when the scan can terminate.
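
A sketch of the distinction, using a simple range query against the regression test index on "tenk1 (thousand,
tenthous)" (plan shape is illustrative):

-- For a forward scan, "thousand >= 10" is the access predicate that
-- determines where the scan starts, while "thousand < 20" is required in the
-- forward direction to decide where the scan ends.  A non-indexable qual
-- like "tenthous <> 5" could only ever be an index filter predicate under
-- the patch: it never affects the scan's start position or termination, only
-- which tuples get returned.
EXPLAIN SELECT * FROM tenk1
WHERE thousand >= 10 AND thousand < 20 AND tenthous <> 5;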

Presumably we'd need to make != scan keys (discussed in example 1) into index 
filter predicates in this sense, as part
of any patch to teach index AMs (likely just nbtree) about simple inequalities.
