On Wed, Dec 23, 2015 at 1:16 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Dec 23, 2015 at 3:31 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>> The point is, nobody can tell WHAT effects this is modeling.
>>> Increasing the tuple size makes the crossover go up. Or down.
>>
>> There are multiple, competing considerations.
>
> Please explain what they are and how they lead you to believe that the
> cost factors you have chosen are good ones.
Alright. I've gone on at length about how I'm blurring the distinction
between internal and external sorting, and about how modern hardware
characteristics allow that. There are several reasons for that.

We all know that main memory sizes have increased dramatically since the
1970s, that storage characteristics are very different, that CPU caching
effects have become very important, and that everyone has lots more
data. One thing hasn't really grown in all that time, though: the width
of tuples. As I go into in the comments within useselection(), that's
the main reason why avoiding I/O isn't all that impressive: at the high
end, it just isn't that big of a cost. Beyond that, as linear costs go,
palloc() is a much bigger concern to me at this point. I think we can
waste a lot less time by amortizing it more extensively (to say nothing
of the memory savings). This is really obvious from trace_sort output
with my patch applied when dealing with many runs and millions of
tuples: there just isn't much time spent on I/O at all, and it's well
hidden by CPU-bound foreground processing. With smaller work_mem sizes
and far fewer tuples -- a case much more common within sort nodes, as
opposed to utility statements -- this is less true. Sorting 1,000 or
10,000 tuples is an entirely different thing from sorting 1,000,000
tuples.

So, first of all, the main consideration is that saving I/O turns out
not to matter that much at the high end. That's why we get very
conservative past the fairly arbitrary MaxAllocSize memtuples threshold
(which has a linear relationship to the number of tuples -- *not* to the
amount of memory or disk space that may be used).

A second consideration is how much I/O we can save -- one would hope it
would be a lot, certainly the majority, to make up for the downside of
using a cache-inefficient technique. That is a different thing from the
number of memtuples. If you had really huge tuples, there would be a
really big saving in I/O, often without a corresponding degradation in
cache performance (since there still may not be that many memtuples,
which is more the problem for the heap than anything else). This
distinction is especially likely to matter for the CLUSTER case, where
wide heap tuples (including heap tuple headers and visibility info) are
kind of along for the ride, which is less true elsewhere, particularly
for the CREATE INDEX case.

The cache inefficiency of spilling incrementally from a heap isn't so
bad if we only end up sorting a small number of tuples that way. As the
number of tuples we actually sort that way increases, the cache
inefficiency gets worse, while at the same time we save less I/O. The
former is a bigger problem than the latter, by a wide margin, I believe.
This code is an attempt to credit cases with really wide tuples:

    /*
     * Starting from a threshold of 90%, refund 7.5% per 32 byte
     * average-size-increment.
     */
    increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
    crossover = 0.90 - (increments * 0.075);

Most cases won't get too many "increments" of credit (although CLUSTER
sorts will probably get relatively many).

A third consideration is that we should be stingy about giving too much
credit to wider tuples, because the cache inefficiency hurts more as we
achieve merely linear savings in I/O. So most of the savings off a
99.99% theoretical baseline threshold are fixed (you usually save 9.99%
off that up-front).
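To make the shape of that credit concrete, here is a minimal standalone
sketch of the arithmetic in the snippet above. This is not patch code:
the 8 byte MAXALIGN_DOWN approximation, the 0.50 floor, and the sample
widths are all assumptions made just for illustration, and the real
useselection() logic treats the extremes (including the MaxAllocSize
case) separately.

    #include <stdio.h>

    /*
     * Toy model of the crossover credit: start at 90% and refund 7.5%
     * per 32 byte average-size-increment, with an illustrative floor.
     */
    #define SKETCH_MAXALIGN_DOWN(len)  ((len) & ~7)    /* assume 8 byte MAXALIGN */

    static double
    crossover_for_width(int avgTupleSize)
    {
        int         increments;
        double      crossover;

        increments = SKETCH_MAXALIGN_DOWN(avgTupleSize) / 32;
        crossover = 0.90 - (increments * 0.075);

        if (crossover < 0.50)       /* clamp is illustrative only */
            crossover = 0.50;

        return crossover;
    }

    int
    main(void)
    {
        int         widths[] = {24, 40, 96, 200, 400};

        for (int i = 0; i < 5; i++)
            printf("avg tuple size %3d bytes -> crossover %.3f\n",
                   widths[i], crossover_for_width(widths[i]));

        return 0;
    }

With those numbers, a sort of 96 byte tuples gets three increments of
credit and a crossover of 0.675, while a narrow 24 byte tuple stays at
the 0.90 starting point.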
A fourth consideration is that the heap seems to do really badly past
1GB in general, due to cache characteristics. This is certainly not
something that I know how to model well.

I don't blame you for calling this voodoo, because to some extent it is.
But I remind you that the consequences of making the wrong decision here
are still better than the status quo today -- probably far better,
overall. I also remind you that voodoo code is something you'll find in
well-regarded code bases at times. Have you ever written networking
code? Packet switching is based on some handwavy observations about the
real world. Practical implementations often contain voodoo magic
numbers.

So, to answer your earlier question: yes, maybe it wouldn't be so bad,
all things considered, to let someone complain about this if they have a
real-world problem with it. The complexity of what we're talking about
makes me modest about my ability to get it exactly right. At the same
time, the consequences of getting it somewhat wrong are really not that
bad. This is basically the same tension that you get with more rigorous
cost models anyway (where greater rigor happens to be possible). I will
abandon this cost model at the first sign of a better alternative -- I'm
really not the least bit attached to it. I had hoped that we'd be able
to do a bit better than this through discussion on list, but not far
better. In any case, "quicksort with spillover" is of secondary
importance here (even though it just so happens that I started with it).

>>>> Another factor is that the heap could be useful for other stuff in
>>>> the future. As Simon Riggs pointed out, for deduplicating values as
>>>> they're read in by tuplesort. (Okay, that's really the only other
>>>> thing, but it's a good one).
>>>
>>> Not sure how that would work?
>>
>> Tuplesort would have license to discard tuples with matching existing
>> values, because the caller gave it permission to. This is something
>> that you can easily imagine occurring with ordered set aggregates,
>> for example. It would work in a way not unlike a top-N heapsort does
>> today. This would work well when it can substantially lower the use
>> of memory (initial heapification when the threshold is crossed would
>> probably measure the number of duplicates, and proceed only when it
>> looked like a promising strategy).
>
> It's not clear to me how having a heap helps with that.

The immediacy of detecting a duplicate could be valuable. We could avoid
allocating tuplesort-owned memory entirely much of the time. Basically,
this is another example (quicksort with spillover being the first) where
incrementalism helps rather than hurts. Another consideration is that if
we instead quicksorted and periodically deduplicated, we could thrash by
misjudging how often to eliminate duplicates. That is especially a
concern in the common case where there are big clusters of the same
value alongside big clusters of heterogeneous values. (A toy sketch of
the reject-at-the-root pattern I'm alluding to appears after my
signature.)

-- 
Peter Geoghegan
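PS: Here is the toy sketch of the "compare against the heap root and
reject up front" pattern that a bounded (top-N) heapsort relies on,
which is what the analogy above leans on for immediate duplicate
detection. Everything below -- the max-heap of plain ints, the function
names, the bound of 4 -- is invented purely for the sketch; it isn't
tuplesort code, and real duplicate elimination would clearly need more
than a single comparison with the root.

    #include <stdbool.h>
    #include <stdio.h>

    #define BOUND 4             /* keep only the BOUND smallest values */

    static int  heap[BOUND];    /* max-heap: root is largest kept value */
    static int  nmembers = 0;

    static void
    sift_up(int i)
    {
        while (i > 0 && heap[(i - 1) / 2] < heap[i])
        {
            int     parent = (i - 1) / 2;
            int     tmp = heap[parent];

            heap[parent] = heap[i];
            heap[i] = tmp;
            i = parent;
        }
    }

    static void
    sift_down(int i)
    {
        for (;;)
        {
            int     largest = i;
            int     left = 2 * i + 1;
            int     right = 2 * i + 2;
            int     tmp;

            if (left < nmembers && heap[left] > heap[largest])
                largest = left;
            if (right < nmembers && heap[right] > heap[largest])
                largest = right;
            if (largest == i)
                break;

            tmp = heap[i];
            heap[i] = heap[largest];
            heap[largest] = tmp;
            i = largest;
        }
    }

    /* Returns true if the value was kept, false if rejected up front */
    static bool
    offer(int value)
    {
        if (nmembers < BOUND)
        {
            heap[nmembers] = value;
            sift_up(nmembers);
            nmembers++;
            return true;
        }

        /*
         * One comparison against the root settles this value's fate
         * before any memory would otherwise be spent on it.  The dedup
         * idea leans on the same immediacy: a tuple recognized as a
         * duplicate could be dropped before tuplesort pallocs anything
         * for it.
         */
        if (value >= heap[0])
            return false;

        heap[0] = value;
        sift_down(0);
        return true;
    }

    int
    main(void)
    {
        int     input[] = {20, 5, 40, 10, 70, 20, 50, 8};

        for (int i = 0; i < 8; i++)
            printf("offer(%2d) -> %s\n", input[i],
                   offer(input[i]) ? "kept" : "rejected up front");

        return 0;
    }

Here, offer(70) and offer(50) are thrown away after nothing more than a
look at the root, which is the "immediacy" being appealed to above.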