Re: tuple radix sort

John Naylor Thu, 27 Nov 2025 03:45:59 -0800

On Thu, Nov 27, 2025 at 2:49 PM Chao Li <[email protected]> wrote:
> I did an initial test before, but I didn’t read the code at the time. Today, 
> I spent time reviewing 0001. Overall, I believe the implementation is solid. 
> I just got a few comments/suggestions.


Thanks for looking!

> > On Nov 26, 2025, at 21:11, John Naylor <[email protected]> wrote:

> > <v5-0001-Use-radix-sort-when-SortTuple-contains-a-pass-by-.patch><v5-0004-Detect-common-prefix-to-avoid-wasted-work-during-.patch><v5-0003-WIP-make-some-regression-tests-sort-order-more-de.patch><v5-0002-WIP-Adjust-regression-tests.patch>

> We recompute normalize_datum(tup->datum1, ssup) for every tuple in every 
> level, why don’t cache the result in SortTuple. As we have cached 
> current_byte in SortTuple, it shouldn’t be a big deal to add one more field 
> to it.

Actually it is a big deal, because the memtuples array counts against work_mem.

I saw two ways to store the full normalized without increasing the
size, and I've already rejected them:
- share space with isnull1/srctape -- v3 did this, and I already
explained the reason for changing when I shared v4.
- share space with datum1 -- that would require additional code to
restore the original datum and makes it more difficult to verify
correctness

There is also a proposal upthread to not store anything, and that's
still up in the air.

> 2 - 0001
> ```
> +       while (num_remaining > 1)
> +       {
> +               /* start the count over */
> +               num_remaining = num_partitions;
> ```
>
> The swap loop always start the count over, so that sorted partitions are 
> re-scanned as well. I think we can do an optimization like:
>
> ```
> num_active = num_partitions;
> while (num_active > 1)
> {
>     for (int i = 0; i < num_active; )
>     {
>         uint8 idx = remaining_partitions[i];
>         // do the swaps for the partition …
>
>         if (partitions[idx].offset == partitions[idx].next_offset)
>         {
>             remaining_partitions[i] = remaining_partitions[num_active - 1];
>             num_active--;
>         }
>         else
>             i++;
>     }
> }
> ```
>
> This way we move out sorted partitions, so they will not be re-scanned.

I don't think that's going to work without additional bookkeeping, so
I'm not sure it's worth it. See the recursion step.

> 3 - 0001
>
> In sort_byvalue_datum, we can add a fast-path for all NULL and all NOT NULL 
> cases, so that they won’t need to run the branchless cyclic permutation and 
> the two “for” loops of assertions. Something like:

This is too clever and yet doesn't go far enough.

There is already one fast path, which happens to cover the common ASC
+ all NOT NULL case. The right way to skip the permutation step would
be to add a second loop that starts from the end and stops at the
first tuple that needs to swap. That should work not just for all NULL
and all NOT NULL, but also where there is a mix of the two and some
(or all) are already in place. All these cases can be treated the same
and they will continue to the exact same paths.

I haven't yet bothered to try harder, but it may be necessary in order
to avoid introducing regressions, so I'll look into it.

--
John Naylor
Amazon Web Services

Re: tuple radix sort

Reply via email to