On Wed, Sep 18, 2024 at 7:36 AM Tomas Vondra <to...@vondra.me> wrote:
> Makes sense. I started with the testing before even looking at the
> code, so it's mostly a "black box" approach. I did read the 1995
> paper before that, and the script generates queries with clauses
> inspired by that paper, in particular:
I think that this approach with black box testing is helpful, but also
something to refine over time. Gray box testing might work best.

> OK, understood. If it's essentially an independent issue (perhaps even
> counts as a bug?) what about correcting it on master first? Doesn't
> sound like something we'd backpatch, I guess.

What about backpatching it to 17? As things stand, you can get quite
contradictory counts of the number of index scans due to irrelevant
implementation details from parallel index scan. It just looks wrong,
particularly on 17, where it is reasonable to expect near exact
consistency between parallel and serial scans of the same index.

> Seems like a bit of a mess. IMHO we should either divide everything by
> nloops (so that everything is "per loop"), or not divide anything. My
> vote would be to divide, but that's mostly my "learned assumption" from
> the other fields. But having a 50:50 split is confusing for everyone.

My idea was that it made most sense to follow the example of "Buffers:",
since both describe physical costs. Honestly, I'm more than ready to
take whatever the path of least resistance is. If dividing by nloops is
what people want, I have no objections.

> > It's worth having skip support (the idea comes from the MDAM paper),
> > but it's not essential. Whether or not an opclass has skip support
> > isn't accounted for by the cost model, but I doubt that it's worth
> > addressing (the cost model is already pessimistic).
>
> I admit I'm a bit confused. I probably need to reread the paper, but my
> impression was that the increment/decrement is required for skipscan to
> work. If we can't do that, how would it generate the intermediate
> values to search for? I imagine it would be possible to "step through"
> the index, but I thought the point of skip scan is to not do that.

I think that you're probably still a bit confused because the
terminology in this area is a little confusing.
There are two ways of explaining the situation with types like text and
numeric (types that lack skip support). The two explanations might seem
to be contradictory, but they're really not, if you think about it.

The first way of explaining it, which focuses on how the scan moves
through the index: for a text index column "a", and an int index column
"b", skip scan will work like this for a query with a qual
"WHERE b = 55":

1. Find the first/lowest sorting "a" value in the index. Let's say that
it's "Aardvark".

2. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
returning some matches.

3. Find the next value after "Aardvark" in the index using a probe like
the one we'd use for a qual "WHERE a > 'Aardvark'". Let's say that it
turns out to be "Abacus".

4. Look for matches "WHERE a = 'Abacus' and b = 55"...

...(repeat these steps until we've exhaustively processed every
existing "a" value in the index)...

The second way of explaining it, which focuses on how the skip arrays
advance. Same query (and really the same behavior) as in the first
explanation:

1. The skip array's initial value is the sentinel -inf, which cannot
possibly match any real index tuple, but can still guide the search. So
we search for tuples "WHERE a = -inf AND b = 55" (actually we don't
include the "b = 55" part, since it is unnecessary, but conceptually
it's a part of what we search for within _bt_first).

2. Find that the index has no "a" values matching -inf (it inevitably
cannot have any matches for -inf), but we do locate the next highest
match. The closest matching value is "Aardvark". The skip array on "a"
therefore advances from -inf to "Aardvark".

3. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
returning some matches.

4. Reach tuples after the last match for "WHERE a = 'Aardvark' and
b = 55", which will cause us to advance the array on "a" incrementally
inside _bt_advance_array_keys (just like it would if there was a
standard SAOP array on "a" instead).
The skip array on "a" therefore advances from "Aardvark" to
"Aardvark" +infinitesimal (we need to use sentinel values for this text
column, which lacks skip support).

5. Look for matches "WHERE a = 'Aardvark'+infinitesimal and b = 55",
which cannot possibly find matches, but, again, can reposition the scan
as needed. We can't find an exact match, of course, but we do locate
the next closest match -- which is "Abacus", again. So the skip array
now advances from "Aardvark" +infinitesimal to "Abacus". The sentinel
values are made-up values, but that doesn't change anything. (And,
again, we don't include the "b = 55" part here, for the same reason as
before.)

6. Look for matches "WHERE a = 'Abacus' and b = 55"...

...(repeat these steps as many times as required)...

In summary: even index columns that lack skip support get to
"increment" (or "decrement") their arrays, by using sentinel values
that represent -inf (or +inf for backwards scans), as well as sentinels
that represent concepts such as "Aardvark" +infinitesimal (or
"Zebra" -infinitesimal for backwards scans, say).

This scheme sounds contradictory, because in one sense it allows every
skip array to be incremented, but in another sense it makes it okay
that we don't have a type-specific way to increment values for many
individual types/opclasses. Inventing these sentinel values allows
_bt_advance_array_keys to reason about arrays without really having to
care about which kinds of arrays are involved, their order relative to
each other, etc. In a certain sense, we don't really need explicit
"next key" probes of the kind that the MDAM paper contemplates, though
we do still require the same index accesses as a design with explicit
probes.

Does that make sense?

Obviously, if we did add skip support for text, it would be very
unlikely to help performance. Sure, one can imagine incrementing from
"Aardvark" to "Aardvarl" using dedicated opclass infrastructure, but
that isn't very helpful.
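To make the stepping behavior concrete, here's a toy Python simulation
of it. To be clear, this is just an illustration of the idea sketched
above, under made-up example data -- it is not how the real nbtree code
(_bt_first, _bt_advance_array_keys) is structured:

```python
import bisect

# Made-up index on (a text, b int), kept as a sorted list of tuples
index = sorted([
    ("Aardvark", 12), ("Aardvark", 55), ("Abacus", 55),
    ("Abacus", 90), ("Zebra", 55),
])

def skip_scan(index, b_val):
    """Visit each distinct "a" value, probing for tuples with b = b_val."""
    results = []
    pos = 0  # conceptually, the skip array starts out holding -inf
    while pos < len(index):
        a_val = index[pos][0]  # skip array advances to the next real "a"
        # Probe "WHERE a = a_val AND b = b_val" via binary search
        lo = bisect.bisect_left(index, (a_val, b_val))
        while lo < len(index) and index[lo] == (a_val, b_val):
            results.append(index[lo])
            lo += 1
        # Advance past a_val: conceptually a_val +infinitesimal, i.e.
        # reposition to the first tuple "WHERE a > a_val"
        pos = bisect.bisect_right(index, (a_val, float("inf")))
    return results

print(skip_scan(index, 55))  # [('Aardvark', 55), ('Abacus', 55), ('Zebra', 55)]
```

The point being that each step is a repositioning binary search, so the
number of probes scales with the number of distinct "a" values rather
than with the total number of index tuples.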
With such a scheme, you're almost certain to end up accessing the same
pages anyway. What are the chances of an index with a leading text
column actually containing tuples matching (say)
"WHERE a = 'Aardvarl' and b = 55"? The chances are practically zero.
Whereas if the column "a" happens to use a discrete type such as
integer or date, then skip support is likely to help: there's a decent
chance that a value generated by incrementing the last value (and I
mean incrementing it for real) will find a real match when combined
with the user-supplied "b" predicate.

It might be possible to add skip support for text, but there wouldn't
be much point.

> Anyway, probably a good idea for extending the stress testing script.
> Right now it tests with "bigint" columns only.

Good idea.

> Hmmm, yeah. I think it'd be useful to explain this reasoning (assuming
> no correlation means pessimistic skipscan costing) in a comment before
> btcostestimate, or somewhere close.

Will do.

> Don't we do something similar elsewhere? For example, IIRC we do some
> adjustments when estimating grouping in estimate_num_groups(), and
> incremental sort had to deal with something similar too. Maybe we
> could learn something from those places ... (both from the good and
> bad experiences).

I'll make a note of that. Gonna focus on regressions for now.

> Right. I don't think I've been suggesting having a separate path, I
> 100% agree it's better to have this as an option for index scan paths.

Cool.

> Sure. With this kind of testing I don't know what I'm looking for, so
> I try to cover a very wide range of cases. Inevitably, some of the
> cases will not test the exact subject of the patch. I think it's fine.

I agree. Just wanted to make sure that we were on the same page.

> I think it'd help if I go through the results and try to prepare some
> reproducers, to make it easier for you. After all, it's my script and
> you'd have to reverse engineer some of it.

Yes, that would be helpful.
I'll probably memorialize the problem by writing my own minimal test
case for it. I'm using the same TDD approach for this project as was
used for the related Postgres 17 project.

> This is on exactly the same data, after dropping caches and restarting
> the instance. So there should be no caching effects. Yet, there's a
> pretty clear difference - the total number of buffers is the same, but
> the patched version has many more hits. Yet it's slower. Weird, right?

Yes, it's weird. It seems likely that you've found an unambiguous bug,
not just a "regular" performance regression. The regressions that I
already know about aren't nearly this bad. So it seems like you have
the right general idea about what to expect, and it seems like your
approach to testing the patch is effective.

--
Peter Geoghegan