On Fri, Jul 26, 2024 at 6:18 PM Yao Wang <yao-yw.w...@broadcom.com> wrote:
> 2. Should we update optimizer to take in account statistic
> (pg_statistic->stadistinct) and choose mksort/qsort accordingly?
> According to the result, the choosing mechanism in optimizer
> almost eliminated all regressions. Please note that even when mksort is
> disabled (i.e. qsort was performed twice actually), there are still
> "regressions" which are usually less than 5% and should be accounted
> to kinda error range.

This is kind of an understatement. To be clear, I see mostly ~1%, which
I take to be in the noise level. If it were commonly a 5% regression,
that would be cause for concern. If the actual noise were 5%, the
testing methodology would not be strict enough.

> I am still hesitating about putting the code to final version because
> there are a number of concerns:
>
> a. for sort keys without a real table, stadistinct is unavailable.
> b. stadistinct may not be accurate as mentioned.
> c. the formula I used may not be able to cover all possible cases.
>
> On the other hand, the worst result may be just 5% regression. So the
> side effects may not be so serious?

FWIW, I share these concerns as well. I don't believe this proves a
bound on the worst result, since the test is using ideal well-behaved
data with no filter. The worst result is when the estimates are wrong,
so real-world use could easily be back to a 10-20% regression, which is
not acceptable. I believe this is what Robert and Tomas were warning
about.

It'd be good to understand what causes the differences, whether better
or worse. Some initial thoughts:

- If the first key is unique, then I would hope multikey sort would be
no different from a standard sort.

- If the first key commonly ties, and the following keys tie also,
those later comparisons are a waste. In that case, it's not hard to
imagine that partitioning on only one key at a time might be faster.

- If the first key commonly ties, but the second key is closer to
unique, I'm not sure which way is better. Have we tested this case?

- If we actually only have one sort key, a multikey sort with a single
depth should ideally have no significant performance difference from a
standard sort. That seems like a good sanity check. Has this been
tried?

- For the biggest benefit/regression cases, it'd be good to know what
changed at the hardware level: # comparisons? # swaps? # cache misses?
# branch mispredicts? (For the first two, see the sketch below.)
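Cache misses and branch mispredicts would have to come from perf or
similar, but comparisons and swaps only need a couple of debug counters
bumped in the hot helpers. A minimal sketch of what I mean --
DEBUG_MKSORT_STATS and the helper names are invented for illustration,
not from the patch:

#include "postgres.h"
#include "utils/sortsupport.h"
#include "utils/tuplesort.h"

#ifdef DEBUG_MKSORT_STATS
static uint64 mksort_ncompares = 0;     /* datum comparisons performed */
static uint64 mksort_nswaps = 0;        /* tuple swaps performed */
#endif

/*
 * Count each datum comparison, then defer to the sort support
 * comparator, same as the normal path.
 */
static inline int
mksort_apply_comparator(Datum datum1, bool isNull1,
                        Datum datum2, bool isNull2,
                        SortSupport ssup)
{
#ifdef DEBUG_MKSORT_STATS
    mksort_ncompares++;
#endif
    return ApplySortComparator(datum1, isNull1, datum2, isNull2, ssup);
}

/* Count each tuple swap. */
static inline void
mksort_swap(SortTuple *a, SortTuple *b)
{
    SortTuple   tmp = *a;

#ifdef DEBUG_MKSORT_STATS
    mksort_nswaps++;
#endif
    *a = *b;
    *b = tmp;
}

Dumping the counters at the end of a sort (elog(LOG, ...) or similar)
for both implementations on the best and worst cases would tell us
whether the wins come from doing fewer comparisons or from doing
cheaper ones.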
Looking at the code a bit, I have some questions and see some
architectural issues. My thoughts below have a bunch of brainstorms so
should be taken with a large grain of salt:

1. The new single-use abstraction for the btree tid tiebreak seems
awkward. In standard sort, all that knowledge was confined to btree's
full comparetup function. Now it's spread out, and general code has to
worry about "duplicated" tuples. The passed-down isNull seems only
needed for this?

(Quick idea: It seems we could pass a start-depth and max-depth to some
"comparetup_mk", which would be very similar to the current "comparetup
+ comparetup_tiebreak". The btree function would have to know that a
depth greater than the last sortkey is a signal to do the tid
comparison. And if start-depth and max-depth are the same, that means
comparing at a single depth. That might simplify the code elsewhere
because there would be no need for a separate getDatum function.)

2. I don't understand why the pre-ordered check sometimes tolerates
duplicates and sometimes doesn't.

3. "tiebreak" paths and terminology are already somewhat awkward for
standard sort (my fault), but seem really out of place in multikey
sort. It already has a general concept of "depth", so that should be
used to its fullest generality.

3A. Random thought: I wonder if the shortcut (and abbreviated?)
comparisons could be thought of as having their own depth < 0. If it's
worth it to postpone later keys, maybe it's worth it to postpone the
full comparison for the first key as well? I could be wrong, though.

3B. Side note: I've long wanted to try separating all NULL first keys
into a separate array, so we can remove all those branches for NULL
ordering and reduce SortTuple to 16 bytes. That might be easier to code
if we could simply specify "start_depth = 1" at the top level for that.

4. Trying to stuff all our optimized comparators into the same path was
a heroic effort, but it's quite messy and seems pretty bad for the
instruction cache and branch predictor. I don't think we need yet
another template, but some of these branches should be resolved as we
recurse into a partition to keep them out of the hot path.

5.
+ /*
+ * When the count < 16 and no need to handle duplicated tuples, use
+ * bubble sort.
+ *
+ * Use 16 instead of 7 which is used in standard qsort, because mk qsort
+ * need more cost to maintain more complex state.

Note: 7 isn't ideal for standard sort either, and should probably be at
least 10 (at least for single-key sorts). If one implementation's
parameter is more ideal than the other's, it obscures what the true
trade-offs are. 16 happens to be a power of two -- how many different
values did you test? (And isn't this the same as our insertion sort,
not bubble sort?)
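To illustrate that last question, here's roughly the shape of the
small-n fallback as I read it. This is insertion sort (slide each new
element left past greater entries), not bubble sort (repeated
adjacent-swap passes over the whole range). mksort_compare_tuples() is
a made-up stand-in for the patch's depth-aware comparator:

/*
 * Sketch only: sort a small partition at the given depth.
 * mksort_compare_tuples() is hypothetical, standing in for whatever
 * the patch uses to compare two tuples starting at "depth".
 */
static void
mksort_small_sort(SortTuple *tuples, int n, int depth,
                  Tuplesortstate *state)
{
    for (int i = 1; i < n; i++)
    {
        SortTuple   tmp = tuples[i];
        int         j = i;

        /* Shift greater entries right until tmp's slot is found. */
        while (j > 0 &&
               mksort_compare_tuples(&tuples[j - 1], &tmp,
                                     depth, state) > 0)
        {
            tuples[j] = tuples[j - 1];
            j--;
        }
        tuples[j] = tmp;
    }
}

One nice property of this shape is that already-sorted input costs n-1
comparisons and no moves, which matters given the pre-ordered inputs we
care about.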