On Fri, Jul 26, 2024 at 6:18 PM Yao Wang <yao-yw.w...@broadcom.com> wrote:
> 2. Should we update optimizer to take in account statistic
> (pg_statistic->stadistinct) and choose mksort/qsort accordingly?
> According to the result, the choosing mechanism in optimizer
> almost eliminated all regressions. Please note that even when mksort is
> disabled (i.e. qsort was performed twice actually), there are still
> "regressions" which are usually less than 5% and should be accounted
> to kinda error range.

This is kind of an understatement. To be clear, I see mostly ~1%, which
I take to be in the noise level. If it were commonly a 5% regression,
that would be cause for concern. If the actual noise were 5%, the
testing methodology would not be strict enough.

> I am still hesitating about putting the code to final version because
> there are a number of concerns:
>
> a. for sort keys without a real table, stadistinct is unavailable.
> b. stadistinct may not be accurate as mentioned.
> c. the formula I used may not be able to cover all possible cases.
>
> On the other hand, the worst result may be just 5% regression. So the
> side effects may not be so serious?

FWIW, I share these concerns as well. I don't believe this proves a
bound on the worst result, since the test is using ideal well-behaved
data with no filter. The worst result is when the estimates are wrong,
so real-world use could easily be back to a 10-20% regression, which is
not acceptable. I believe this is what Robert and Tomas were warning
about.

It'd be good to understand what causes the differences, whether better
or worse. Some initial thoughts:

- If the first key is unique, then I would hope multikey sort would be
no different from a standard sort.

- If the first key commonly ties, and the following keys tie also,
those later comparisons are a waste. In that case, it's not hard to
imagine that partitioning on only one key at a time might be faster.

- If the first key commonly ties, but the second key is closer to
unique, I'm not sure which way is better. Have we tested this case?

- If we actually only have one sort key, a multikey sort with a single
depth should ideally have no significant performance difference from a
standard sort. That seems like a good sanity check. Has this been
tried?

- For the biggest benefit/regression cases, it'd be good to know what
changed at the hardware level: # comparisons? # swaps? # cache misses?
# branch mispredicts? (For the first two, see the sketch below.)
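Cache misses and branch mispredicts would have to come from perf or
similar, but comparisons and swaps only need a couple of debug counters
bumped in the hot helpers. A minimal sketch of what I mean --
DEBUG_MKSORT_STATS and the helper names are invented for illustration,
not from the patch:

#include "postgres.h"
#include "utils/sortsupport.h"
#include "utils/tuplesort.h"

#ifdef DEBUG_MKSORT_STATS
static uint64 mksort_ncompares = 0;     /* datum comparisons performed */
static uint64 mksort_nswaps = 0;        /* tuple swaps performed */
#endif

/*
 * Count each datum comparison, then defer to the sort support
 * comparator, same as the normal path.
 */
static inline int
mksort_apply_comparator(Datum datum1, bool isNull1,
                        Datum datum2, bool isNull2,
                        SortSupport ssup)
{
#ifdef DEBUG_MKSORT_STATS
    mksort_ncompares++;
#endif
    return ApplySortComparator(datum1, isNull1, datum2, isNull2, ssup);
}

/* Count each tuple swap. */
static inline void
mksort_swap(SortTuple *a, SortTuple *b)
{
    SortTuple   tmp = *a;

#ifdef DEBUG_MKSORT_STATS
    mksort_nswaps++;
#endif
    *a = *b;
    *b = tmp;
}

Dumping the counters at the end of a sort (elog(LOG, ...) or similar)
for both implementations on the best and worst cases would tell us
whether the wins come from doing fewer comparisons or from doing
cheaper ones.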
Looking at the code a bit, I have some questions and see some
architectural issues. My thoughts below have a bunch of brainstorms so
should be taken with a large grain of salt:

1. The new single-use abstraction for the btree tid tiebreak seems
awkward. In standard sort, all that knowledge was confined to btree's
full comparetup function. Now it's spread out, and general code has to
worry about "duplicated" tuples. The passed-down isNull seems only
needed for this?

(Quick idea: It seems we could pass a start-depth and max-depth to some
"comparetup_mk", which would be very similar to the current "comparetup
+ comparetup_tiebreak". The btree function would have to know that a
depth greater than the last sortkey is a signal to do the tid
comparison. And if start-depth and max-depth are the same, that means
comparing at a single depth. That might simplify the code elsewhere
because there would be no need for a separate getDatum function.)

2. I don't understand why the pre-ordered check sometimes tolerates
duplicates and sometimes doesn't.

3. "tiebreak" paths and terminology are already somewhat awkward for
standard sort (my fault), but seem really out of place in multikey
sort. It already has a general concept of "depth", so that should be
used to its fullest generality.

3A. Random thought: I wonder if the shortcut (and abbreviated?)
comparisons could be thought of as having their own depth < 0. If it's
worth it to postpone later keys, maybe it's worth it to postpone the
full comparison for the first key as well? I could be wrong, though.

3B. Side note: I've long wanted to try separating all NULL first keys
into a separate array, so we can remove all those branches for NULL
ordering and reduce SortTuple to 16 bytes. That might be easier to code
if we could simply specify "start_depth = 1" at the top level for that.

4. Trying to stuff all our optimized comparators into the same path was
a heroic effort, but it's quite messy and seems pretty bad for the
instruction cache and branch predictor. I don't think we need yet
another template, but some of these branches should be resolved as we
recurse into a partition to keep them out of the hot path.

5.
+ /*
+ * When the count < 16 and no need to handle duplicated tuples, use
+ * bubble sort.
+ *
+ * Use 16 instead of 7 which is used in standard qsort, because mk qsort
+ * need more cost to maintain more complex state.

Note: 7 isn't ideal for standard sort either, and should probably be at
least 10 (at least for single-key sorts). If one implementation's
parameter is more ideal than the other's, it obscures what the true
trade-offs are. 16 happens to be a power of two -- how many different
values did you test? (And isn't this the same as our insertion sort,
not bubble sort?)
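To illustrate that last question, here's roughly the shape of the
small-n fallback as I read it. This is insertion sort (slide each new
element left past greater entries), not bubble sort (repeated
adjacent-swap passes over the whole range). mksort_compare_tuples() is
a made-up stand-in for the patch's depth-aware comparator:

/*
 * Sketch only: sort a small partition at the given depth.
 * mksort_compare_tuples() is hypothetical, standing in for whatever
 * the patch uses to compare two tuples starting at "depth".
 */
static void
mksort_small_sort(SortTuple *tuples, int n, int depth,
                  Tuplesortstate *state)
{
    for (int i = 1; i < n; i++)
    {
        SortTuple   tmp = tuples[i];
        int         j = i;

        /* Shift greater entries right until tmp's slot is found. */
        while (j > 0 &&
               mksort_compare_tuples(&tuples[j - 1], &tmp,
                                     depth, state) > 0)
        {
            tuples[j] = tuples[j - 1];
            j--;
        }
        tuples[j] = tmp;
    }
}

One nice property of this shape is that already-sorted input costs n-1
comparisons and no moves, which matters given the pre-ordered inputs we
care about.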