On 12/02/23 01:59, Andres Freund wrote:

However, tuplesort.c completely breaks that for > 1 column
sorts. While spatial locality for accesses to the ->memtuples array is decent
during sorting, due to qsort's subdividing of the problem, the locality for
access to the tuples is *awful*.

The memtuples array is reordered while sorting, but the tuples themselves
aren't. Unless the input data is vaguely presorted, the access pattern for the
tuples has practically zero locality.

This probably explains the issue.
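
To spell the indirection out (this is my simplified sketch, not the exact
definitions in the tuplesort sources; the comparator helpers are invented):
the first sort key is cached inline in the array element, so single-column
comparisons stay inside the memtuples array, but the tuples themselves stay
wherever they were originally allocated, so once memtuples has been reordered,
neighbouring array entries point at unrelated places in the tuple arena.

#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;            /* stand-in for PostgreSQL's Datum */

/* Simplified sketch of tuplesort's per-tuple bookkeeping; the real
 * SortTuple lives in the sources and differs in detail. */
typedef struct SortTupleSketch
{
    void   *tuple;                  /* MinimalTuple; allocated once, never moved */
    Datum   datum1;                 /* first sort key, cached inline in the array */
    bool    isnull1;
} SortTupleSketch;

/* Hypothetical comparators standing in for the real per-type ones. */
extern int compare_first_key(Datum a, Datum b);
extern int compare_remaining_columns(const void *ta, const void *tb);

/*
 * A first-key comparison touches only the memtuples array, which qsort keeps
 * reasonably cache-friendly.  On a tie we must chase ->tuple, and after the
 * array has been reordered those pointers lead all over the tuple arena, so
 * multi-column sorts on a low-cardinality leading key miss the cache constantly.
 */
static int
comparetup_sketch(const SortTupleSketch *a, const SortTupleSketch *b)
{
    int cmp = compare_first_key(a->datum1, b->datum1);

    if (cmp != 0)
        return cmp;
    return compare_remaining_columns(a->tuple, b->tuple);
}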


One idea is to keep track of the distinctness of the first column sorted and
to behave differently if it's significantly lower than the number of to be
sorted tuples.  E.g. by doing a first sort solely on the first column, then
reorder the MinimalTuples in memory, and then continue normally.

There are two main problems with that idea:
1) It's hard to re-order the tuples in memory, without needing substantial
  amounts of additional memory
2) If the second column also is not very distinct, it doesn't buy you much, if
  anything.

But it might provide sufficient benefits regardless. And a naive version,
requiring additional memory, should be quick to hack up.

I get the second point (reordering MinimalTuples by sorting on the first column), but
I am not sure how we can keep `track of the distinctness of the first column`?
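
Thinking out loud, one thing I could imagine (purely my own guess, not
necessarily what you had in mind; the counter names and thresholds below are
invented) is to count ties on the first key inside the comparator we run
anyway, and treat a high tie ratio as "first column is much less distinct than
the tuple count":

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters, bumped from inside the first-key comparator. */
static uint64_t firstkey_comparisons = 0;
static uint64_t firstkey_ties = 0;

static inline void
record_firstkey_comparison(int cmp_result)
{
    firstkey_comparisons++;
    if (cmp_result == 0)
        firstkey_ties++;
}

/*
 * If a large fraction of first-key comparisons are ties, the first column has
 * far fewer distinct values than there are tuples, and the "sort on the first
 * column, reorder the MinimalTuples, then finish the sort" strategy could be
 * worth triggering.  Both the 1000-comparison warm-up and the 50% threshold
 * are made-up numbers, purely for illustration.
 */
static inline bool
firstkey_looks_low_cardinality(void)
{
    return firstkey_comparisons > 1000 &&
           firstkey_ties > firstkey_comparisons / 2;
}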


Most sorting papers don't discuss
variable-width data, nor a substantial amount of cache-polluting work while
gathering the data that needs to be sorted.

I definitely agree with this.


My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
just those using the existing code, allocate new memory for all the
corresponding MinimalTuples in one allocation, and copy the MinimalTuples into
that, obviously in ->memtuples order.

This should be an easy hack, and we can easily profile the benefits from it.
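
Just to write down what I understand the suggestion to be (the Entry struct,
the helper names, and the use of plain malloc are mine, only for illustration):
sort one cache-sized batch with the existing code, then copy the batch's
MinimalTuples into a single fresh allocation in sorted order, so later passes
read the tuple data sequentially.

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the real tuplesort machinery. */
typedef struct { void *tuple; size_t tuplen; } Entry;
extern void sort_entries(Entry *entries, size_t n);     /* existing qsort path */

/*
 * Sort one roughly L3-sized batch of entries, then rewrite the tuples
 * themselves into one allocation in the new order, so that reading the batch
 * back (for the final merge) is sequential.
 */
static void
sort_and_compact_batch(Entry *entries, size_t n)
{
    size_t  total = 0;
    char   *arena, *dst;

    sort_entries(entries, n);

    for (size_t i = 0; i < n; i++)
        total += entries[i].tuplen;

    /* one allocation for the whole batch; in PostgreSQL this would come from
     * a memory context rather than malloc */
    arena = malloc(total);
    if (arena == NULL)
        return;                         /* real code would error out */

    dst = arena;
    for (size_t i = 0; i < n; i++)
    {
        memcpy(dst, entries[i].tuple, entries[i].tuplen);
        /* the old copies would be freed by resetting their context here */
        entries[i].tuple = dst;
        dst += entries[i].tuplen;
    }
}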

It's very likely we can do better than just doing a plain sort of everything
after that.
You effectively end up with a bounded number of pre-sorted blocks, so the most
obvious thing to try is to build a heap of those blocks and effectively do a heapsort between the presorted blocks.

This is very interesting. It is actually what a few papers have suggested: do the
sort in blocks and then merge (while sorting) the presorted blocks.
I am a bit fuzzy on the implementation of this (whether it is the same as what
I am thinking), but this is definitely what I was looking for.
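
To check that we are thinking of the same thing: each batch from the previous
step is already sorted, so the final phase is essentially a k-way merge with a
small binary heap of per-block cursors, roughly like this (integer keys only to
keep my sketch short; the real thing would compare SortTuples):

#include <stddef.h>

/* One cursor per presorted block; int keys keep the sketch short. */
typedef struct
{
    const int *vals;            /* the block's keys, already sorted ascending */
    size_t     pos;
    size_t     len;             /* assumed > 0 */
} BlockCursor;

/* Restore the min-heap property for the subtree rooted at slot i,
 * ordering cursors by their current (smallest unconsumed) key. */
static void
sift_down(BlockCursor *heap, size_t nblocks, size_t i)
{
    for (;;)
    {
        size_t l = 2 * i + 1, r = l + 1, s = i;

        if (l < nblocks && heap[l].vals[heap[l].pos] < heap[s].vals[heap[s].pos])
            s = l;
        if (r < nblocks && heap[r].vals[heap[r].pos] < heap[s].vals[heap[s].pos])
            s = r;
        if (s == i)
            break;
        BlockCursor tmp = heap[i]; heap[i] = heap[s]; heap[s] = tmp;
        i = s;
    }
}

/* Merge nblocks presorted, non-empty blocks into out[] in one pass. */
static void
merge_blocks(BlockCursor *heap, size_t nblocks, int *out)
{
    size_t n = 0;

    /* heapify on each block's first key */
    for (size_t i = nblocks / 2; i-- > 0; )
        sift_down(heap, nblocks, i);

    while (nblocks > 0)
    {
        BlockCursor *top = &heap[0];

        out[n++] = top->vals[top->pos++];
        if (top->pos == top->len)
            heap[0] = heap[--nblocks];      /* block exhausted: drop its cursor */
        sift_down(heap, nblocks, 0);
    }
}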


A related, but separate, improvement is to reduce / remove the memory
allocation overhead. The switch to GenerationContext helped some, but still
leaves a bunch of overhead. And it's not used for bounded sorts right now.
We don't palloc/pfree individual tuples during a normal sorts, but we do have
some, for bounded sorts.  I think with a reasonable amount of work we could
avoid that for all tuples in ->tuplecontext. And switch to a trivial bump
allocator, getting rid of all allocator overhead.
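
(Just to make sure I read "trivial bump allocator" correctly, here is my own
toy illustration of the idea; this is not the actual aset.c or GenerationContext
code. Allocation only advances a pointer, individual chunks can never be freed,
and the whole arena is released at once.)

#include <stddef.h>

/* Toy bump allocator: one block, allocations just advance a pointer,
 * nothing is freed individually, only the whole arena at once. */
typedef struct
{
    char  *base;
    size_t used;
    size_t capacity;
} BumpArena;

static void *
bump_alloc(BumpArena *a, size_t size)
{
    size = (size + 7) & ~(size_t) 7;        /* keep 8-byte alignment */
    if (a->used + size > a->capacity)
        return NULL;                        /* real code would chain a new block */

    void *p = a->base + a->used;
    a->used += size;
    return p;
}

static void
bump_reset(BumpArena *a)
{
    a->used = 0;                            /* "frees" every chunk at once */
}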

The biggest source of individual pfree()s in the bounded case is that we
unconditionally copy the tuple into base->tuplecontext during puttuple. Even
though quite likely we'll immediately free it in the "case TSS_BOUNDED" block.

We could instead pre-check that the tuple won't immediately be discarded,
before copying it into tuplecontext. Only in the TSS_BOUNDED case, of
course.

This looks doable; I will try this.
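
My rough reading of that pre-check, written out (the state layout and helpers
such as tuple_beats_heap_top are invented names, not actual tuplesort functions):

#include <stdbool.h>

/* Hypothetical helpers standing in for tuplesort internals. */
extern bool  bounded_heap_is_full(void *state);
extern bool  tuple_beats_heap_top(void *state, const void *incoming);
extern void *copy_into_tuplecontext(void *state, const void *incoming);
extern void  heap_insert(void *state, void *copied_tuple);
extern void  heap_replace_top(void *state, void *copied_tuple);

/*
 * Sketch of puttuple for the bounded case: only pay for the copy into
 * tuplecontext when the tuple will actually be kept, instead of copying
 * unconditionally and freeing it again a moment later.
 */
static void
bounded_puttuple_sketch(void *state, const void *incoming)
{
    if (!bounded_heap_is_full(state))
    {
        /* still filling the heap: the copy is needed anyway */
        heap_insert(state, copy_into_tuplecontext(state, incoming));
        return;
    }

    if (!tuple_beats_heap_top(state, incoming))
        return;                 /* would be discarded immediately: skip the copy */

    heap_replace_top(state, copy_into_tuplecontext(state, incoming));
}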

I think we also can replace the individual freeing of tuples in
tuplesort_heap_replace_top(), by allowing a number of dead tuples to
accumulate (up to work_mem/2 maybe), and then copying the still living tuples
into new memory context, freeing the old one.

While that doesn't sound cheap, in bounded sorts the number of tuples commonly
is quite limited, the pre-check before copying the tuple will prevent this
from occurring too often, the copy will result in higher locality, and, most
importantly, the reduced palloc() overhead (~25% or so with aset.c) will
result in a considerably higher cache hit ratio / lower memory usage.

I will try this as well.
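
The way I picture the deferred freeing (again my own sketch; the memory-context
helpers and the threshold are invented): instead of pfree()ing each evicted
tuple, only count the dead bytes, and once they cross something like work_mem/2,
copy the surviving tuples into a fresh context and drop the old one in one go.

#include <stddef.h>

/* Hypothetical stand-ins for memory-context and heap bookkeeping. */
typedef struct MemCtxSketch MemCtxSketch;
extern MemCtxSketch *memctx_create(void);
extern void         *memctx_copy(MemCtxSketch *ctx, const void *src, size_t len);
extern void          memctx_delete(MemCtxSketch *ctx);

typedef struct { void *tuple; size_t tuplen; } HeapSlot;

typedef struct
{
    MemCtxSketch *tuplecontext;
    HeapSlot     *slots;            /* the bounded heap's live tuples */
    size_t        nslots;
    size_t        dead_bytes;       /* evicted-but-not-freed tuple bytes */
    size_t        compact_threshold;/* e.g. work_mem / 2 */
} BoundedSortSketch;

/* Called instead of pfree() when a tuple is evicted from the bounded heap. */
static void
note_dead_tuple(BoundedSortSketch *s, size_t tuplen)
{
    s->dead_bytes += tuplen;
}

/* Once enough garbage has accumulated, copy the still-live tuples into a new
 * context (which also compacts them, improving locality) and free the old
 * context wholesale. */
static void
maybe_compact(BoundedSortSketch *s)
{
    if (s->dead_bytes < s->compact_threshold)
        return;

    MemCtxSketch *fresh = memctx_create();

    for (size_t i = 0; i < s->nslots; i++)
        s->slots[i].tuple = memctx_copy(fresh, s->slots[i].tuple,
                                        s->slots[i].tuplen);

    memctx_delete(s->tuplecontext);
    s->tuplecontext = fresh;
    s->dead_bytes = 0;
}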

There are things we could do to improve upon this that don't require swapping
out our sorting implementation wholesale.

Thanks a lot Andres, these are lots of pointers to work on (without a major
overhaul of the sort implementation and with a potentially good amount of
improvement). I will give these a try and see if I can get some performance gains.

Thanks,
Ankit


