On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> Normal index scans are an even more interesting case but I'm not
> sure how hard it would be to get that information. It may only be
> convenient to get the blocks from the last leaf page we looked at,
> for example.
>
> So this suggests we simply started prefetching for the case where the
> information was readily available, and it'd be harder to do for index
> scans so that's it.
What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.

> Even if SAOP (probably) wasn't the reason, I think you're right it may
> be an issue for prefetching, causing regressions. It didn't occur to me
> before, because I'm not that familiar with the btree code and/or how it
> deals with SAOP (and didn't really intend to study it too deeply).

I'm pretty sure that you understand this already, but just in case:
ScalarArrayOpExpr doesn't even "get the blocks from the last leaf page"
in many important cases. Not really -- not in the sense that you'd hope
and expect. We're senselessly processing the same index leaf page
multiple times and treating it as a different, independent leaf page.
That makes heap prefetching of the kind you're working on utterly
hopeless, since it effectively throws away lots of useful context.
Obviously that's the fault of nbtree ScalarArrayOpExpr handling, not
the fault of your patch.

> So if you're planning to work on this for PG17, collaborating on it
> would be great.
>
> For now I plan to just ignore SAOP, or maybe just disabling prefetching
> for SAOP index scans if it proves to be prone to regressions. That's not
> great, but at least it won't make matters worse.

Makes sense, but I hope that it won't come to that. IMV it's actually
quite reasonable that you didn't expect to have to think about
ScalarArrayOpExpr at all -- it would make a lot of sense if that was
already true. But the fact is that it works in a way that's pretty
silly and naive right now, which will impact prefetching.

I wasn't really thinking about regressions, though. I was actually more
concerned about missing opportunities to get the most out of
prefetching. ScalarArrayOpExpr really matters here.

> I guess something like this might be a "nice" bad case:
>
> insert into btree_test mod(i,100000), md5(i::text)
> from generate_series(1, $ROWS) s(i)
>
> select * from btree_test where a in (999, 1000, 1001, 1002)
>
> The values are likely colocated on the same heap page, the bitmap scan
> is going to do a single prefetch. With index scan we'll prefetch them
> repeatedly. I'll give it a try.

This is the sort of thing that I was thinking of. What are the
conditions under which a bitmap index scan starts to make sense? Why is
the break-even point whatever it is in each case, roughly? And, is it
actually because of a laws-of-physics-level trade-off? Might it not be
due to implementation-level issues that are much less fundamental? In
other words, might it actually be that we're just doing something
stoopid in the case of plain index scans? Something that is just
papered-over by bitmap index scans right now?

I see that your patch has logic that avoids repeated prefetching of the
same block -- plus you have comments that wonder about going further by
adding a "small lru array" in your new index_prefetch() function. I
asked you about this during the unconference presentation. But I think
that my understanding of the situation was slightly different from
yours. That's relevant here. I wonder if you should go further than
this, by actually sorting the items that you need to fetch as part of
processing a given leaf page (I said this at the unconference, you may
recall).
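(For concreteness, here's roughly what I imagine a "small lru array"
scheme looking like -- a throwaway standalone sketch, not code from
your patch, and every name in it is invented by me. I'll sketch the
sorting alternative a bit further down.)

/*
 * Throwaway sketch of a small array of recently prefetched heap blocks,
 * consulted before issuing another prefetch. Not patch code; names like
 * PrefetchLRU/prefetch_lru_check are made up. Eviction is simple
 * round-robin, which seems close enough to a "small lru array" to
 * illustrate the idea.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

#define PREFETCH_LRU_SIZE 8

typedef struct PrefetchLRU
{
    BlockNumber blocks[PREFETCH_LRU_SIZE];
    int         next;           /* slot to overwrite next */
    int         nvalid;         /* slots filled so far */
} PrefetchLRU;

/* Returns true if the caller should actually prefetch this block */
static bool
prefetch_lru_check(PrefetchLRU *lru, BlockNumber blkno)
{
    for (int i = 0; i < lru->nvalid; i++)
    {
        if (lru->blocks[i] == blkno)
            return false;       /* prefetched it a moment ago -- skip */
    }

    /* remember the block, overwriting the oldest slot once full */
    lru->blocks[lru->next] = blkno;
    lru->next = (lru->next + 1) % PREFETCH_LRU_SIZE;
    if (lru->nvalid < PREFETCH_LRU_SIZE)
        lru->nvalid++;
    return true;
}

int
main(void)
{
    PrefetchLRU lru = {{0}, 0, 0};
    /* heap blocks in index/logical order, with some near-repeats */
    BlockNumber blocks[] = {25, 25, 18, 25, 18, 27, 27, 25, 27};

    for (size_t i = 0; i < sizeof(blocks) / sizeof(blocks[0]); i++)
    {
        if (prefetch_lru_check(&lru, blocks[i]))
            printf("would prefetch heap block %u\n",
                   (unsigned) blocks[i]);
    }
    return 0;
}

The obvious limitation (and the reason I keep going on about sorting)
is that a small array like this only helps when the duplicate heap
blocks happen to land close together in index order.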
Why should we *ever* pin/access the same heap page more than once per
leaf page processed per index scan? Nothing stops us from returning the
tuples to the executor in the original logical/index-wise order,
despite having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin traffic
that isn't truly necessary). We can sort the heap TIDs in scratch
memory, then do our actual prefetching + heap access, and then restore
the original order before returning anything.

This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and that it would help a lot
with some kinds of queries. It might even be simpler than avoiding
excessive prefetching using an LRU cache thing.

I'm talking about problems that exist today, without your patch. I'll
show a concrete example of the kind of index/index scan that might be
affected.

Attached is an extract of the server log from a run of the regression
tests against a server patched to show custom instrumentation. The log
output shows exactly what's going on with one particular nbtree
opportunistic deletion (my point has nothing to do with deletion, but
it happens to be convenient to make my point in this fashion). This
specific example involves deletion of tuples from the system catalog
index "pg_type_typname_nsp_index". There is nothing very atypical about
it; it just shows a certain kind of heap fragmentation that's probably
very common.

Imagine a plain index scan driven by a query along the lines of
"select * from pg_type where typname like 'part%'", run an instant
before the example LP_DEAD-bit-driven opportunistic deletion (a "simple
deletion" in nbtree parlance) took place. You'll be able to piece
together from the log output that there would only be about 4 heap
blocks involved with such a query. Ideally, our hypothetical index scan
would pin each buffer/heap page exactly once, for a total of 4
PinBuffer()/UnpinBuffer() calls. After all, we're talking about a
fairly selective query here, one that only needs to scan precisely one
leaf page (I verified this part too) -- so why wouldn't we expect
"index scan parity"?

While there is significant clustering on this example leaf page/key
space, heap TID is not *perfectly* correlated with the logical/keyspace
order of the index -- which can have outsized consequences. Notice that
some heap blocks are non-contiguous relative to logical/keyspace/index
scan/index page offset number order. We'll end up pinning each of the 4
or so heap pages more than once (sometimes several times each), when in
principle we could have pinned each heap page exactly once.

In other words, there is way too much of a difference between the case
where the tuples we scan are *almost* perfectly clustered (which is
what you see in my example) and the case where they're exactly
perfectly clustered. In other other words, there is way too much of a
difference between plain index scan and bitmap index scan.
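To be concrete about the "mini bitmap index scan" idea from earlier,
here's a throwaway standalone sketch of its general shape (obviously
not real nbtree code -- every name is invented, and the example TIDs
are just a handful lifted from the attached log output):

/*
 * Throwaway sketch of the "mini bitmap index scan" idea, for the TIDs
 * taken from a single leaf page: remember the original index order,
 * sort a scratch copy by heap block, pin/prefetch each distinct heap
 * block exactly once, then hand tuples back in the original order.
 * Not real nbtree code; every name here is invented.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

typedef struct TidRef
{
    BlockNumber  blkno;
    OffsetNumber offnum;
    int          idxorder;      /* position in index/logical order */
} TidRef;

static int
tidref_cmp(const void *a, const void *b)
{
    const TidRef *ta = (const TidRef *) a;
    const TidRef *tb = (const TidRef *) b;

    if (ta->blkno != tb->blkno)
        return (ta->blkno < tb->blkno) ? -1 : 1;
    if (ta->offnum != tb->offnum)
        return (ta->offnum < tb->offnum) ? -1 : 1;
    return 0;
}

int
main(void)
{
    /*
     * Heap TIDs in index order, e.g. from one pg_type_typname_nsp_index
     * leaf page: almost (but not quite) clustered by heap block.
     */
    TidRef      tids[] = {
        {25, 3}, {18, 7}, {18, 21}, {27, 34}, {25, 15}, {27, 36}, {18, 45}
    };
    int         ntids = (int) (sizeof(tids) / sizeof(tids[0]));
    TidRef      scratch[sizeof(tids) / sizeof(tids[0])];
    TidRef      fetched[sizeof(tids) / sizeof(tids[0])];
    BlockNumber lastblkno = (BlockNumber) 0xFFFFFFFF;   /* "invalid" */

    for (int i = 0; i < ntids; i++)
    {
        tids[i].idxorder = i;
        scratch[i] = tids[i];
    }

    /* Phase 1: visit heap blocks in physical order, each block once */
    qsort(scratch, ntids, sizeof(TidRef), tidref_cmp);
    for (int i = 0; i < ntids; i++)
    {
        if (scratch[i].blkno != lastblkno)
        {
            printf("pin/prefetch heap block %u exactly once\n",
                   (unsigned) scratch[i].blkno);
            lastblkno = scratch[i].blkno;
        }
        /* pretend we fetched the heap tuple; stash it by index order */
        fetched[scratch[i].idxorder] = scratch[i];
    }

    /* Phase 2: return tuples to the executor in the original order */
    for (int i = 0; i < ntids; i++)
        printf("return tuple %d: TID (%u,%u)\n", i,
               (unsigned) fetched[i].blkno, (unsigned) fetched[i].offnum);

    return 0;
}

Deduplicating the heap block accesses just falls out of the sort here,
and the executor never needs to know that the heap pages were touched
in a slightly different order.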
(What I'm saying about the pg_type_typname_nsp_index example is only
true because this is a composite index and our query uses "like",
returning rows that match a prefix -- if our index were on the column
"typname" alone and we used a simple equality condition in our query,
then the Postgres 12 nbtree work would be enough to avoid the extra
PinBuffer()/UnpinBuffer() calls. I suspect that there are still
relatively many important cases where we perform extra
PinBuffer()/UnpinBuffer() calls during plain index scans that only
touch one leaf page anyway.)

Obviously we should expect bitmap index scans to have a natural
advantage over plain index scans whenever there is little or no
correlation -- that's clear. But that's not what we see here -- we're
way too sensitive to minor imperfections in clustering that are
naturally present on some kinds of leaf pages. The potential difference
in pin/unpin traffic (relative to the bitmap index scan case) seems
pathological to me. Ideally, we wouldn't have these kinds of
differences at all. It's also going to disrupt usage_count on the
buffers.

> > It's important to carefully distinguish between cases where plain
> > index scans really are at an inherent disadvantage relative to bitmap
> > index scans (because there really is no getting around the need to
> > access the same heap page many times with an index scan) versus cases
> > that merely *appear* that way. Implementation restrictions that only
> > really affect the plain index scan case (e.g., the lack of a
> > reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
> > should be accounted for when assessing the viability of index scan +
> > prefetch over bitmap index scan + prefetch. This is very subtle, but
> > important.
>
> I do agree, but what do you mean by "assessing"?

I mean performance validation. There ought to be a theoretical model
that describes the relationship between index scans and bitmap index
scans, one that has actual predictive power in the real world, across a
variety of different cases. Something that isn't sensitive to the
current phase of the moon (e.g., heap fragmentation along the lines of
my pg_type_typname_nsp_index log output). I particularly want to avoid
nasty discontinuities that really make no sense.

> Wasn't the agreement at
> the unconference session was we'd not tweak costing? So ultimately, this
> does not really affect which scan type we pick. We'll keep doing the
> same planning decisions as today, no?

I'm not really talking about tweaking the costing. What I'm saying is
that we really should expect index scans to behave similarly to bitmap
index scans at runtime, for queries that really don't have much to gain
from using a bitmap heap scan (queries that may or may not also benefit
from prefetching).

There are several reasons why this makes sense to me. One reason is
that it makes tweaking the actual costing easier later on. Also, your
point about plan robustness was a good one. If we make the wrong choice
between an index scan and a bitmap index scan, and the consequences
aren't so bad, that's a very useful enhancement in itself. The most
important reason of all may just be to build confidence in the design.
I'm interested in understanding when and how prefetching stops helping.

> I'm all for building a more comprehensive set of test cases - the stuff
> presented at pgcon was good for demonstration, but it certainly is not
> enough for testing. The SAOP queries are a great addition, I also plan
> to run those queries on different (less random) data sets, etc. We'll
> probably discover more interesting cases as the patch improves.

Definitely.

> There are two aspects why I think AM is not the right place:
>
> - accessing table from index code seems backwards
>
> - we already do prefetching from the executor (nodeBitmapHeapscan.c)
>
> It feels kinda wrong in hindsight.

I'm willing to provisionally accept that we should do it the way you've
done it in the patch. It's complicated enough that it feels like I
should reserve the right to change my mind.

> >> I think this is acceptable limitation, certainly for v0. Prefetching
> >> across multiple leaf pages seems way more complex (particularly for the
> >> cases using pairing heap), so let's leave this for the future.
>
> Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
> to do that. But it seems like work for future someone.

Right. You probably noticed that this is another case where we'd be
making index scans behave more like bitmap index scans (perhaps even
including the downsides for kill_prior_tuple that accompany not
processing each leaf page inline). There is probably a point where that
ceases to be sensible, but I don't know what that point is. They're way
more similar than we seem to imagine.

--
Peter Geoghegan
2023-06-07 23:46:15.130 PDT [260646][client backend] [pg_regress/insert][6/681:2269] LOG: simple deletion of index "pg_type_typname_nsp_index", block 6: n_tup: 251, including high key
first: (typname, typnamespace)=(column_column_usage, 13858)
h_key: (typname, typnamespace)=(pg_largeobject)
finished up on heap block 28 (having accessed 4 blocks in total)
10. (typname, typnamespace)=(copydml_test, 2200) w TID (25,3) will be deleted freeing 36
28. (typname, typnamespace)=(enumtest, 2200) w TID (18,7) will be deleted freeing 28
29. (typname, typnamespace)=(enumtest_bogus_child, 2200) w TID (18,21) will be deleted freeing 44
30. (typname, typnamespace)=(enumtest_child, 2200) w TID (18,17) will be deleted freeing 36
31. (typname, typnamespace)=(enumtest_parent, 2200) w TID (18,13) will be deleted freeing 36
43. (typname, typnamespace)=(float8range_test, 2200) w TID (18,43) will be deleted freeing 36
54. (typname, typnamespace)=(hash_parted, 2200) w TID (27,34) will be deleted freeing 28
56. (typname, typnamespace)=(hpart0, 2200) w TID (27,36) will be deleted freeing 28
57. (typname, typnamespace)=(hpart1, 2200) w TID (27,38) will be deleted freeing 28
58. (typname, typnamespace)=(hpart2, 2200) w TID (27,40) will be deleted freeing 28
59. (typname, typnamespace)=(hpart3, 2200) w TID (27,42) will be deleted freeing 28
61. (typname, typnamespace)=(i8r_array, 2200) w TID (18,61) will be deleted freeing 28
68. (typname, typnamespace)=(insert_test_type, 2200) w TID (25,15) will be deleted freeing 36
73. (typname, typnamespace)=(inserttest, 2200) w TID (25,1) will be deleted freeing 28 (LP_DEAD bit set)
74. (typname, typnamespace)=(inserttest, 2200) w TID (25,17) will be deleted freeing 28
75. (typname, typnamespace)=(inserttest2, 2200) w TID (25,19) will be deleted freeing 28
104. (typname, typnamespace)=(large_tuple_test, 2200) w TID (25,9) will be deleted freeing 36
107. (typname, typnamespace)=(list_parted, 2200) w TID (25,31) will be deleted freeing 28 (LP_DEAD bit set)
108. (typname, typnamespace)=(list_parted, 2200) w TID (27,44) will be deleted freeing 28
109. (typname, typnamespace)=(lparted_nonullpart, 2200) w TID (27,54) will be deleted freeing 36
110. (typname, typnamespace)=(lparted_nonullpart_a, 2200) w TID (27,56) will be deleted freeing 44
124. (typname, typnamespace)=(mlparted5, 2200) w TID (27,66) will be deleted freeing 28 (LP_DEAD bit set)
128. (typname, typnamespace)=(mlparted5a, 2200) w TID (27,68) will be deleted freeing 28
136. (typname, typnamespace)=(mydomain, 2200) w TID (18,45) will be deleted freeing 28 (LP_DEAD bit set)
138. (typname, typnamespace)=(mydomainmultirange, 2200) w TID (18,48) will be deleted freeing 36 (LP_DEAD bit set)
140. (typname, typnamespace)=(mydomainrange, 2200) w TID (18,47) will be deleted freeing 36 (LP_DEAD bit set)
166. (typname, typnamespace)=(numrange_test2, 2200) w TID (18,9) will be deleted freeing 36
173. (typname, typnamespace)=(part1, 2200) w TID (25,23) will be deleted freeing 28
174. (typname, typnamespace)=(part2, 2200) w TID (25,25) will be deleted freeing 28
175. (typname, typnamespace)=(part3, 2200) w TID (25,27) will be deleted freeing 28
176. (typname, typnamespace)=(part4, 2200) w TID (25,29) will be deleted freeing 28
177. (typname, typnamespace)=(part_aa_bb, 2200) w TID (25,33) will be deleted freeing 28
178. (typname, typnamespace)=(part_cc_dd, 2200) w TID (25,35) will be deleted freeing 28
179. (typname, typnamespace)=(part_def, 2200) w TID (27,16) will be deleted freeing 28
180. (typname, typnamespace)=(part_default, 2200) w TID (27,2) will be deleted freeing 36 (LP_DEAD bit set)
181. (typname, typnamespace)=(part_default, 2200) w TID (27,10) will be deleted freeing 36 (LP_DEAD bit set)
182. (typname, typnamespace)=(part_default, 2200) w TID (27,46) will be deleted freeing 36
183. (typname, typnamespace)=(part_default_p1, 2200) w TID (27,12) will be deleted freeing 36
184. (typname, typnamespace)=(part_default_p2, 2200) w TID (27,14) will be deleted freeing 36
185. (typname, typnamespace)=(part_ee_ff, 2200) w TID (25,39) will be deleted freeing 28
186. (typname, typnamespace)=(part_ee_ff1, 2200) w TID (25,41) will be deleted freeing 28
187. (typname, typnamespace)=(part_ee_ff2, 2200) w TID (25,43) will be deleted freeing 28
188. (typname, typnamespace)=(part_ee_ff3, 2200) w TID (27,28) will be deleted freeing 28
189. (typname, typnamespace)=(part_ee_ff3_1, 2200) w TID (27,30) will be deleted freeing 36
190. (typname, typnamespace)=(part_ee_ff3_2, 2200) w TID (27,32) will be deleted freeing 36
191. (typname, typnamespace)=(part_gg, 2200) w TID (27,18) will be deleted freeing 28
192. (typname, typnamespace)=(part_gg1, 2200) w TID (27,20) will be deleted freeing 28
193. (typname, typnamespace)=(part_gg2, 2200) w TID (27,22) will be deleted freeing 28
194. (typname, typnamespace)=(part_gg2_1, 2200) w TID (27,24) will be deleted freeing 28
195. (typname, typnamespace)=(part_gg2_2, 2200) w TID (27,26) will be deleted freeing 28
196. (typname, typnamespace)=(part_null, 2200) w TID (25,37) will be deleted freeing 28
197. (typname, typnamespace)=(part_xx_yy, 2200) w TID (27,4) will be deleted freeing 28
198. (typname, typnamespace)=(part_xx_yy_defpart, 2200) w TID (27,8) will be deleted freeing 36
199. (typname, typnamespace)=(part_xx_yy_p1, 2200) w TID (27,6) will be deleted freeing 36
results: checked 70 tids out of 250 on page (28.00% of total on page), of which 54 (21.60% of total on page) deleted
results: deleted tids plain 54, deleted tids posting 0, pagelsn: 0/3303FD8
results: exact free space 1716, exact TIDs deleted 54/LP_DEAD tuples 8 (RR 6.75), LP_DEAD-related table blocks 4
2023-06-07 23:46:15.130 PDT [260646][client backend] [pg_regress/insert][6/681:2269] STATEMENT: create table mlparted5_a partition of mlparted5_ab for values in ('a');