On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> Normal index scans are an even more interesting case but I'm not
> sure how hard it would be to get that information. It may only be
> convenient to get the blocks from the last leaf page we looked at,
> for example.
>
> So this suggests we simply started prefetching for the case where the
> information was readily available, and it'd be harder to do for index
> scans so that's it.
What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.

> Even if SAOP (probably) wasn't the reason, I think you're right it may
> be an issue for prefetching, causing regressions. It didn't occur to me
> before, because I'm not that familiar with the btree code and/or how it
> deals with SAOP (and didn't really intend to study it too deeply).

I'm pretty sure that you understand this already, but just in case:
ScalarArrayOpExpr doesn't even "get the blocks from the last leaf page"
in many important cases. Not really -- not in the sense that you'd hope
and expect. We're senselessly processing the same index leaf page
multiple times and treating it as a different, independent leaf page.
That makes heap prefetching of the kind you're working on utterly
hopeless, since it effectively throws away lots of useful context.
Obviously that's the fault of nbtree ScalarArrayOpExpr handling, not
the fault of your patch.

> So if you're planning to work on this for PG17, collaborating on it
> would be great.
>
> For now I plan to just ignore SAOP, or maybe just disabling prefetching
> for SAOP index scans if it proves to be prone to regressions. That's not
> great, but at least it won't make matters worse.

Makes sense, but I hope that it won't come to that. IMV it's actually
quite reasonable that you didn't expect to have to think about
ScalarArrayOpExpr at all -- it would make a lot of sense if that was
already true. But the fact is that it works in a way that's pretty
silly and naive right now, which will impact prefetching.

I wasn't really thinking about regressions, though. I was actually more
concerned about missing opportunities to get the most out of
prefetching. ScalarArrayOpExpr really matters here.

> I guess something like this might be a "nice" bad case:
>
> insert into btree_test mod(i,100000), md5(i::text)
> from generate_series(1, $ROWS) s(i)
>
> select * from btree_test where a in (999, 1000, 1001, 1002)
>
> The values are likely colocated on the same heap page, the bitmap scan
> is going to do a single prefetch. With index scan we'll prefetch them
> repeatedly. I'll give it a try.

This is the sort of thing that I was thinking of. What are the
conditions under which a bitmap index scan starts to make sense? Why is
the break-even point whatever it is in each case, roughly? And, is it
actually because of a laws-of-physics-level trade-off? Might it not be
due to implementation-level issues that are much less fundamental? In
other words, might it actually be that we're just doing something
stoopid in the case of plain index scans? Something that is just
papered-over by bitmap index scans right now?

I see that your patch has logic that avoids repeated prefetching of the
same block -- plus you have comments that wonder about going further by
adding a "small lru array" in your new index_prefetch() function. I
asked you about this during the unconference presentation. But I think
that my understanding of the situation was slightly different from
yours. That's relevant here. I wonder if you should go further than
this, by actually sorting the items that you need to fetch as part of
processing a given leaf page (I said this at the unconference, you may
recall).
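(For concreteness, here's roughly what I imagine a "small lru array"
scheme looking like -- a throwaway standalone sketch, not code from
your patch, and every name in it is invented by me. I'll sketch the
sorting alternative a bit further down.)

/*
 * Throwaway sketch of a small array of recently prefetched heap blocks,
 * consulted before issuing another prefetch. Not patch code; names like
 * PrefetchLRU/prefetch_lru_check are made up. Eviction is simple
 * round-robin, which seems close enough to a "small lru array" to
 * illustrate the idea.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

#define PREFETCH_LRU_SIZE 8

typedef struct PrefetchLRU
{
    BlockNumber blocks[PREFETCH_LRU_SIZE];
    int         next;           /* slot to overwrite next */
    int         nvalid;         /* slots filled so far */
} PrefetchLRU;

/* Returns true if the caller should actually prefetch this block */
static bool
prefetch_lru_check(PrefetchLRU *lru, BlockNumber blkno)
{
    for (int i = 0; i < lru->nvalid; i++)
    {
        if (lru->blocks[i] == blkno)
            return false;       /* prefetched it a moment ago -- skip */
    }

    /* remember the block, overwriting the oldest slot once full */
    lru->blocks[lru->next] = blkno;
    lru->next = (lru->next + 1) % PREFETCH_LRU_SIZE;
    if (lru->nvalid < PREFETCH_LRU_SIZE)
        lru->nvalid++;
    return true;
}

int
main(void)
{
    PrefetchLRU lru = {{0}, 0, 0};
    /* heap blocks in index/logical order, with some near-repeats */
    BlockNumber blocks[] = {25, 25, 18, 25, 18, 27, 27, 25, 27};

    for (size_t i = 0; i < sizeof(blocks) / sizeof(blocks[0]); i++)
    {
        if (prefetch_lru_check(&lru, blocks[i]))
            printf("would prefetch heap block %u\n",
                   (unsigned) blocks[i]);
    }
    return 0;
}

The obvious limitation (and the reason I keep going on about sorting)
is that a small array like this only helps when the duplicate heap
blocks happen to land close together in index order.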
Why should we *ever* pin/access the same heap page more than once per
leaf page processed per index scan? Nothing stops us from returning the
tuples to the executor in the original logical/index-wise order,
despite having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin traffic
that isn't truly necessary). We can sort the heap TIDs in scratch
memory, then do our actual prefetching + heap access, and then restore
the original order before returning anything.

This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and that it would help a lot
with some kinds of queries. It might even be simpler than avoiding
excessive prefetching using an LRU cache thing.

I'm talking about problems that exist today, without your patch. I'll
show a concrete example of the kind of index/index scan that might be
affected.

Attached is an extract of the server log from a run of the regression
tests against a server patched to show custom instrumentation. The log
output shows exactly what's going on with one particular nbtree
opportunistic deletion (my point has nothing to do with deletion, but
it happens to be convenient to make my point in this fashion). This
specific example involves deletion of tuples from the system catalog
index "pg_type_typname_nsp_index". There is nothing very atypical about
it; it just shows a certain kind of heap fragmentation that's probably
very common.

Imagine a plain index scan driven by a query along the lines of
"select * from pg_type where typname like 'part%'", run an instant
before the example LP_DEAD-bit-driven opportunistic deletion (a "simple
deletion" in nbtree parlance) took place. You'll be able to piece
together from the log output that there would only be about 4 heap
blocks involved with such a query. Ideally, our hypothetical index scan
would pin each buffer/heap page exactly once, for a total of 4
PinBuffer()/UnpinBuffer() calls. After all, we're talking about a
fairly selective query here, one that only needs to scan precisely one
leaf page (I verified this part too) -- so why wouldn't we expect
"index scan parity"?

While there is significant clustering on this example leaf page/key
space, heap TID is not *perfectly* correlated with the logical/keyspace
order of the index -- which can have outsized consequences. Notice that
some heap blocks are non-contiguous relative to logical/keyspace/index
scan/index page offset number order. We'll end up pinning each of the 4
or so heap pages more than once (sometimes several times each), when in
principle we could have pinned each heap page exactly once.

In other words, there is way too much of a difference between the case
where the tuples we scan are *almost* perfectly clustered (which is
what you see in my example) and the case where they're exactly
perfectly clustered. In other other words, there is way too much of a
difference between plain index scan and bitmap index scan.
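To be concrete about the "mini bitmap index scan" idea from earlier,
here's a throwaway standalone sketch of its general shape (obviously
not real nbtree code -- every name is invented, and the example TIDs
are just a handful lifted from the attached log output):

/*
 * Throwaway sketch of the "mini bitmap index scan" idea, for the TIDs
 * taken from a single leaf page: remember the original index order,
 * sort a scratch copy by heap block, pin/prefetch each distinct heap
 * block exactly once, then hand tuples back in the original order.
 * Not real nbtree code; every name here is invented.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

typedef struct TidRef
{
    BlockNumber  blkno;
    OffsetNumber offnum;
    int          idxorder;      /* position in index/logical order */
} TidRef;

static int
tidref_cmp(const void *a, const void *b)
{
    const TidRef *ta = (const TidRef *) a;
    const TidRef *tb = (const TidRef *) b;

    if (ta->blkno != tb->blkno)
        return (ta->blkno < tb->blkno) ? -1 : 1;
    if (ta->offnum != tb->offnum)
        return (ta->offnum < tb->offnum) ? -1 : 1;
    return 0;
}

int
main(void)
{
    /*
     * Heap TIDs in index order, e.g. from one pg_type_typname_nsp_index
     * leaf page: almost (but not quite) clustered by heap block.
     */
    TidRef      tids[] = {
        {25, 3}, {18, 7}, {18, 21}, {27, 34}, {25, 15}, {27, 36}, {18, 45}
    };
    int         ntids = (int) (sizeof(tids) / sizeof(tids[0]));
    TidRef      scratch[sizeof(tids) / sizeof(tids[0])];
    TidRef      fetched[sizeof(tids) / sizeof(tids[0])];
    BlockNumber lastblkno = (BlockNumber) 0xFFFFFFFF;   /* "invalid" */

    for (int i = 0; i < ntids; i++)
    {
        tids[i].idxorder = i;
        scratch[i] = tids[i];
    }

    /* Phase 1: visit heap blocks in physical order, each block once */
    qsort(scratch, ntids, sizeof(TidRef), tidref_cmp);
    for (int i = 0; i < ntids; i++)
    {
        if (scratch[i].blkno != lastblkno)
        {
            printf("pin/prefetch heap block %u exactly once\n",
                   (unsigned) scratch[i].blkno);
            lastblkno = scratch[i].blkno;
        }
        /* pretend we fetched the heap tuple; stash it by index order */
        fetched[scratch[i].idxorder] = scratch[i];
    }

    /* Phase 2: return tuples to the executor in the original order */
    for (int i = 0; i < ntids; i++)
        printf("return tuple %d: TID (%u,%u)\n", i,
               (unsigned) fetched[i].blkno, (unsigned) fetched[i].offnum);

    return 0;
}

Deduplicating the heap block accesses just falls out of the sort here,
and the executor never needs to know that the heap pages were touched
in a slightly different order.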
(What I'm saying about the pg_type_typname_nsp_index example is only
true because this is a composite index and our query uses "like",
returning rows that match a prefix -- if our index were on the column
"typname" alone and we used a simple equality condition in our query,
then the Postgres 12 nbtree work would be enough to avoid the extra
PinBuffer()/UnpinBuffer() calls. I suspect that there are still
relatively many important cases where we perform extra
PinBuffer()/UnpinBuffer() calls during plain index scans that only
touch one leaf page anyway.)

Obviously we should expect bitmap index scans to have a natural
advantage over plain index scans whenever there is little or no
correlation -- that's clear. But that's not what we see here -- we're
way too sensitive to minor imperfections in clustering that are
naturally present on some kinds of leaf pages. The potential difference
in pin/unpin traffic (relative to the bitmap index scan case) seems
pathological to me. Ideally, we wouldn't have these kinds of
differences at all. It's also going to disrupt usage_count on the
buffers.

> > It's important to carefully distinguish between cases where plain
> > index scans really are at an inherent disadvantage relative to bitmap
> > index scans (because there really is no getting around the need to
> > access the same heap page many times with an index scan) versus cases
> > that merely *appear* that way. Implementation restrictions that only
> > really affect the plain index scan case (e.g., the lack of a
> > reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
> > should be accounted for when assessing the viability of index scan +
> > prefetch over bitmap index scan + prefetch. This is very subtle, but
> > important.
>
> I do agree, but what do you mean by "assessing"?

I mean performance validation. There ought to be a theoretical model
that describes the relationship between index scans and bitmap index
scans, one that has actual predictive power in the real world, across a
variety of different cases. Something that isn't sensitive to the
current phase of the moon (e.g., heap fragmentation along the lines of
my pg_type_typname_nsp_index log output). I particularly want to avoid
nasty discontinuities that really make no sense.

> Wasn't the agreement at
> the unconference session was we'd not tweak costing? So ultimately, this
> does not really affect which scan type we pick. We'll keep doing the
> same planning decisions as today, no?

I'm not really talking about tweaking the costing. What I'm saying is
that we really should expect index scans to behave similarly to bitmap
index scans at runtime, for queries that really don't have much to gain
from using a bitmap heap scan (queries that may or may not also benefit
from prefetching).

There are several reasons why this makes sense to me. One reason is
that it makes tweaking the actual costing easier later on. Also, your
point about plan robustness was a good one. If we make the wrong choice
between an index scan and a bitmap index scan, and the consequences
aren't so bad, that's a very useful enhancement in itself. The most
important reason of all may just be to build confidence in the design.
I'm interested in understanding when and how prefetching stops helping.

> I'm all for building a more comprehensive set of test cases - the stuff
> presented at pgcon was good for demonstration, but it certainly is not
> enough for testing. The SAOP queries are a great addition, I also plan
> to run those queries on different (less random) data sets, etc. We'll
> probably discover more interesting cases as the patch improves.

Definitely.

> There are two aspects why I think AM is not the right place:
>
> - accessing table from index code seems backwards
>
> - we already do prefetching from the executor (nodeBitmapHeapscan.c)
>
> It feels kinda wrong in hindsight.

I'm willing to provisionally accept that we should do it the way you've
done it in the patch. It's complicated enough that it feels like I
should reserve the right to change my mind.

> >> I think this is acceptable limitation, certainly for v0. Prefetching
> >> across multiple leaf pages seems way more complex (particularly for the
> >> cases using pairing heap), so let's leave this for the future.
>
> Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
> to do that. But it seems like work for future someone.

Right. You probably noticed that this is another case where we'd be
making index scans behave more like bitmap index scans (perhaps even
including the downsides for kill_prior_tuple that accompany not
processing each leaf page inline). There is probably a point where that
ceases to be sensible, but I don't know what that point is. They're way
more similar than we seem to imagine.

--
Peter Geoghegan
2023-06-07 23:46:15.130 PDT [260646][client backend] [pg_regress/insert][6/681:2269] LOG: simple deletion of index "pg_type_typname_nsp_index", block 6: n_tup: 251, including high key
first: (typname, typnamespace)=(column_column_usage, 13858)
h_key: (typname, typnamespace)=(pg_largeobject)
finished up on heap block 28 (having accessed 4 blocks in total)
10. (typname, typnamespace)=(copydml_test, 2200) w TID (25,3) will be deleted freeing 36
28. (typname, typnamespace)=(enumtest, 2200) w TID (18,7) will be deleted freeing 28
29. (typname, typnamespace)=(enumtest_bogus_child, 2200) w TID (18,21) will be deleted freeing 44
30. (typname, typnamespace)=(enumtest_child, 2200) w TID (18,17) will be deleted freeing 36
31. (typname, typnamespace)=(enumtest_parent, 2200) w TID (18,13) will be deleted freeing 36
43. (typname, typnamespace)=(float8range_test, 2200) w TID (18,43) will be deleted freeing 36
54. (typname, typnamespace)=(hash_parted, 2200) w TID (27,34) will be deleted freeing 28
56. (typname, typnamespace)=(hpart0, 2200) w TID (27,36) will be deleted freeing 28
57. (typname, typnamespace)=(hpart1, 2200) w TID (27,38) will be deleted freeing 28
58. (typname, typnamespace)=(hpart2, 2200) w TID (27,40) will be deleted freeing 28
59. (typname, typnamespace)=(hpart3, 2200) w TID (27,42) will be deleted freeing 28
61. (typname, typnamespace)=(i8r_array, 2200) w TID (18,61) will be deleted freeing 28
68. (typname, typnamespace)=(insert_test_type, 2200) w TID (25,15) will be deleted freeing 36
73. (typname, typnamespace)=(inserttest, 2200) w TID (25,1) will be deleted freeing 28 (LP_DEAD bit set)
74. (typname, typnamespace)=(inserttest, 2200) w TID (25,17) will be deleted freeing 28
75. (typname, typnamespace)=(inserttest2, 2200) w TID (25,19) will be deleted freeing 28
104. (typname, typnamespace)=(large_tuple_test, 2200) w TID (25,9) will be deleted freeing 36
107. (typname, typnamespace)=(list_parted, 2200) w TID (25,31) will be deleted freeing 28 (LP_DEAD bit set)
108. (typname, typnamespace)=(list_parted, 2200) w TID (27,44) will be deleted freeing 28
109. (typname, typnamespace)=(lparted_nonullpart, 2200) w TID (27,54) will be deleted freeing 36
110. (typname, typnamespace)=(lparted_nonullpart_a, 2200) w TID (27,56) will be deleted freeing 44
124. (typname, typnamespace)=(mlparted5, 2200) w TID (27,66) will be deleted freeing 28 (LP_DEAD bit set)
128. (typname, typnamespace)=(mlparted5a, 2200) w TID (27,68) will be deleted freeing 28
136. (typname, typnamespace)=(mydomain, 2200) w TID (18,45) will be deleted freeing 28 (LP_DEAD bit set)
138. (typname, typnamespace)=(mydomainmultirange, 2200) w TID (18,48) will be deleted freeing 36 (LP_DEAD bit set)
140. (typname, typnamespace)=(mydomainrange, 2200) w TID (18,47) will be deleted freeing 36 (LP_DEAD bit set)
166. (typname, typnamespace)=(numrange_test2, 2200) w TID (18,9) will be deleted freeing 36
173. (typname, typnamespace)=(part1, 2200) w TID (25,23) will be deleted freeing 28
174. (typname, typnamespace)=(part2, 2200) w TID (25,25) will be deleted freeing 28
175. (typname, typnamespace)=(part3, 2200) w TID (25,27) will be deleted freeing 28
176. (typname, typnamespace)=(part4, 2200) w TID (25,29) will be deleted freeing 28
177. (typname, typnamespace)=(part_aa_bb, 2200) w TID (25,33) will be deleted freeing 28
178. (typname, typnamespace)=(part_cc_dd, 2200) w TID (25,35) will be deleted freeing 28
179. (typname, typnamespace)=(part_def, 2200) w TID (27,16) will be deleted freeing 28
180. (typname, typnamespace)=(part_default, 2200) w TID (27,2) will be deleted freeing 36 (LP_DEAD bit set)
181. (typname, typnamespace)=(part_default, 2200) w TID (27,10) will be deleted freeing 36 (LP_DEAD bit set)
182. (typname, typnamespace)=(part_default, 2200) w TID (27,46) will be deleted freeing 36
183. (typname, typnamespace)=(part_default_p1, 2200) w TID (27,12) will be deleted freeing 36
184. (typname, typnamespace)=(part_default_p2, 2200) w TID (27,14) will be deleted freeing 36
185. (typname, typnamespace)=(part_ee_ff, 2200) w TID (25,39) will be deleted freeing 28
186. (typname, typnamespace)=(part_ee_ff1, 2200) w TID (25,41) will be deleted freeing 28
187. (typname, typnamespace)=(part_ee_ff2, 2200) w TID (25,43) will be deleted freeing 28
188. (typname, typnamespace)=(part_ee_ff3, 2200) w TID (27,28) will be deleted freeing 28
189. (typname, typnamespace)=(part_ee_ff3_1, 2200) w TID (27,30) will be deleted freeing 36
190. (typname, typnamespace)=(part_ee_ff3_2, 2200) w TID (27,32) will be deleted freeing 36
191. (typname, typnamespace)=(part_gg, 2200) w TID (27,18) will be deleted freeing 28
192. (typname, typnamespace)=(part_gg1, 2200) w TID (27,20) will be deleted freeing 28
193. (typname, typnamespace)=(part_gg2, 2200) w TID (27,22) will be deleted freeing 28
194. (typname, typnamespace)=(part_gg2_1, 2200) w TID (27,24) will be deleted freeing 28
195. (typname, typnamespace)=(part_gg2_2, 2200) w TID (27,26) will be deleted freeing 28
196. (typname, typnamespace)=(part_null, 2200) w TID (25,37) will be deleted freeing 28
197. (typname, typnamespace)=(part_xx_yy, 2200) w TID (27,4) will be deleted freeing 28
198. (typname, typnamespace)=(part_xx_yy_defpart, 2200) w TID (27,8) will be deleted freeing 36
199. (typname, typnamespace)=(part_xx_yy_p1, 2200) w TID (27,6) will be deleted freeing 36
results: checked 70 tids out of 250 on page (28.00% of total on page), of which 54 (21.60% of total on page) deleted
results: deleted tids plain 54, deleted tids posting 0, pagelsn: 0/3303FD8
results: exact free space 1716, exact TIDs deleted 54/LP_DEAD tuples 8 (RR 6.75), LP_DEAD-related table blocks 4
2023-06-07 23:46:15.130 PDT [260646][client backend] [pg_regress/insert][6/681:2269] STATEMENT: create table mlparted5_a partition of mlparted5_ab for values in ('a');