On Wed, Dec 23, 2015 at 1:16 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Dec 23, 2015 at 3:31 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>> The point is, nobody can tell WHAT effects this is modeling.
>>> Increasing the tuple size makes the crossover go up. Or down.
>>
>> There are multiple, competing considerations.
>
> Please explain what they are and how they lead you to believe that the
> cost factors you have chosen are good ones.
Alright. I've gone on at length about how I'm blurring the distinction
between internal and external sorting, and about how modern hardware
characteristics allow that. There are several reasons for that.

We all know that main memory sizes have increased dramatically since the
1970s, that storage characteristics are very different, that CPU caching
effects have become very important, and that everyone has lots more
data. One thing hasn't really grown in all that time, though: the width
of tuples. As I go into in the comments within useselection(), that's
the main reason why avoiding I/O isn't all that impressive: at the high
end, it just isn't that big of a cost. Beyond that, as linear costs go,
palloc() is a much bigger concern to me at this point. I think we can
waste a lot less time by amortizing it more extensively (to say nothing
of the memory savings). This is really obvious from trace_sort output
with my patch applied when dealing with many runs and millions of
tuples: there just isn't much time spent on I/O at all, and it's well
hidden by CPU-bound foreground processing. With smaller work_mem sizes
and far fewer tuples -- a case much more common within sort nodes, as
opposed to utility statements -- this is less true. Sorting 1,000 or
10,000 tuples is an entirely different thing from sorting 1,000,000
tuples.

So, first of all, the main consideration is that saving I/O turns out
not to matter that much at the high end. That's why we get very
conservative past the fairly arbitrary MaxAllocSize memtuples threshold
(which has a linear relationship to the number of tuples -- *not* to the
amount of memory or disk space that may be used).

A second consideration is how much I/O we can save -- one would hope it
would be a lot, certainly the majority, to make up for the downside of
using a cache-inefficient technique. That is a different thing from the
number of memtuples. If you had really huge tuples, there would be a
really big saving in I/O, often without a corresponding degradation in
cache performance (since there still may not be that many memtuples,
which is more the problem for the heap than anything else). This
distinction is especially likely to matter for the CLUSTER case, where
wide heap tuples (including heap tuple headers and visibility info) are
kind of along for the ride, which is less true elsewhere, particularly
for the CREATE INDEX case.

The cache inefficiency of spilling incrementally from a heap isn't so
bad if we only end up sorting a small number of tuples that way. As the
number of tuples we actually sort that way increases, the cache
inefficiency gets worse, while at the same time we save less I/O. The
former is a bigger problem than the latter, by a wide margin, I believe.
This code is an attempt to credit cases with really wide tuples:

    /*
     * Starting from a threshold of 90%, refund 7.5% per 32 byte
     * average-size-increment.
     */
    increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
    crossover = 0.90 - (increments * 0.075);

Most cases won't get too many "increments" of credit (although CLUSTER
sorts will probably get relatively many).

A third consideration is that we should be stingy about giving too much
credit to wider tuples, because the cache inefficiency hurts more as we
achieve merely linear savings in I/O. So most of the savings off a
99.99% theoretical baseline threshold are fixed (you usually save 9.99%
off that up-front).
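To make the shape of that credit concrete, here is a minimal standalone
sketch of the arithmetic in the snippet above. This is not patch code:
the 8 byte MAXALIGN_DOWN approximation, the 0.50 floor, and the sample
widths are all assumptions made just for illustration, and the real
useselection() logic treats the extremes (including the MaxAllocSize
case) separately.

    #include <stdio.h>

    /*
     * Toy model of the crossover credit: start at 90% and refund 7.5%
     * per 32 byte average-size-increment, with an illustrative floor.
     */
    #define SKETCH_MAXALIGN_DOWN(len)  ((len) & ~7)    /* assume 8 byte MAXALIGN */

    static double
    crossover_for_width(int avgTupleSize)
    {
        int         increments;
        double      crossover;

        increments = SKETCH_MAXALIGN_DOWN(avgTupleSize) / 32;
        crossover = 0.90 - (increments * 0.075);

        if (crossover < 0.50)       /* clamp is illustrative only */
            crossover = 0.50;

        return crossover;
    }

    int
    main(void)
    {
        int         widths[] = {24, 40, 96, 200, 400};

        for (int i = 0; i < 5; i++)
            printf("avg tuple size %3d bytes -> crossover %.3f\n",
                   widths[i], crossover_for_width(widths[i]));

        return 0;
    }

With those numbers, a sort of 96 byte tuples gets three increments of
credit and a crossover of 0.675, while a narrow 24 byte tuple stays at
the 0.90 starting point.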
A fourth consideration is that the heap seems to do really badly past
1GB in general, due to cache characteristics. This is certainly not
something that I know how to model well.

I don't blame you for calling this voodoo, because to some extent it is.
But I remind you that the consequences of making the wrong decision here
are still better than the status quo today -- probably far better,
overall. I also remind you that voodoo code is something you'll find in
well-regarded code bases at times. Have you ever written networking
code? Packet switching is based on some handwavy observations about the
real world. Practical implementations often contain voodoo magic
numbers.

So, to answer your earlier question: yes, maybe it wouldn't be so bad,
all things considered, to let someone complain about this if they have a
real-world problem with it. The complexity of what we're talking about
makes me modest about my ability to get it exactly right. At the same
time, the consequences of getting it somewhat wrong are really not that
bad. This is basically the same tension that you get with more rigorous
cost models anyway (where greater rigor happens to be possible). I will
abandon this cost model at the first sign of a better alternative -- I'm
really not the least bit attached to it. I had hoped that we'd be able
to do a bit better than this through discussion on list, but not far
better. In any case, "quicksort with spillover" is of secondary
importance here (even though it just so happens that I started with it).

>>>> Another factor is that the heap could be useful for other stuff in
>>>> the future. As Simon Riggs pointed out, for deduplicating values as
>>>> they're read in by tuplesort. (Okay, that's really the only other
>>>> thing, but it's a good one).
>>>
>>> Not sure how that would work?
>>
>> Tuplesort would have license to discard tuples with matching existing
>> values, because the caller gave it permission to. This is something
>> that you can easily imagine occurring with ordered set aggregates,
>> for example. It would work in a way not unlike a top-N heapsort does
>> today. This would work well when it can substantially lower the use
>> of memory (initial heapification when the threshold is crossed would
>> probably measure the number of duplicates, and proceed only when it
>> looked like a promising strategy).
>
> It's not clear to me how having a heap helps with that.

The immediacy of detecting a duplicate could be valuable. We could avoid
allocating tuplesort-owned memory entirely much of the time. Basically,
this is another example (quicksort with spillover being the first) where
incrementalism helps rather than hurts. Another consideration is that if
we instead quicksorted and periodically deduplicated, we could thrash by
misjudging how often to eliminate duplicates. That is especially a
concern in the common case where there are big clusters of the same
value alongside big clusters of heterogeneous values. (A toy sketch of
the reject-at-the-root pattern I'm alluding to appears after my
signature.)

-- 
Peter Geoghegan
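PS: Here is the toy sketch of the "compare against the heap root and
reject up front" pattern that a bounded (top-N) heapsort relies on,
which is what the analogy above leans on for immediate duplicate
detection. Everything below -- the max-heap of plain ints, the function
names, the bound of 4 -- is invented purely for the sketch; it isn't
tuplesort code, and real duplicate elimination would clearly need more
than a single comparison with the root.

    #include <stdbool.h>
    #include <stdio.h>

    #define BOUND 4             /* keep only the BOUND smallest values */

    static int  heap[BOUND];    /* max-heap: root is largest kept value */
    static int  nmembers = 0;

    static void
    sift_up(int i)
    {
        while (i > 0 && heap[(i - 1) / 2] < heap[i])
        {
            int     parent = (i - 1) / 2;
            int     tmp = heap[parent];

            heap[parent] = heap[i];
            heap[i] = tmp;
            i = parent;
        }
    }

    static void
    sift_down(int i)
    {
        for (;;)
        {
            int     largest = i;
            int     left = 2 * i + 1;
            int     right = 2 * i + 2;
            int     tmp;

            if (left < nmembers && heap[left] > heap[largest])
                largest = left;
            if (right < nmembers && heap[right] > heap[largest])
                largest = right;
            if (largest == i)
                break;

            tmp = heap[i];
            heap[i] = heap[largest];
            heap[largest] = tmp;
            i = largest;
        }
    }

    /* Returns true if the value was kept, false if rejected up front */
    static bool
    offer(int value)
    {
        if (nmembers < BOUND)
        {
            heap[nmembers] = value;
            sift_up(nmembers);
            nmembers++;
            return true;
        }

        /*
         * One comparison against the root settles this value's fate
         * before any memory would otherwise be spent on it.  The dedup
         * idea leans on the same immediacy: a tuple recognized as a
         * duplicate could be dropped before tuplesort pallocs anything
         * for it.
         */
        if (value >= heap[0])
            return false;

        heap[0] = value;
        sift_down(0);
        return true;
    }

    int
    main(void)
    {
        int     input[] = {20, 5, 40, 10, 70, 20, 50, 8};

        for (int i = 0; i < 8; i++)
            printf("offer(%2d) -> %s\n", input[i],
                   offer(input[i]) ? "kept" : "rejected up front");

        return 0;
    }

Here, offer(70) and offer(50) are thrown away after nothing more than a
look at the root, which is the "immediacy" being appealed to above.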