On Thu, Feb 4, 2016 at 3:14 AM, Peter Geoghegan <p...@heroku.com> wrote: > Nyberg et al may have said it best in 1994, in the Alphasort Paper [1]:
This paper is available from http://www.vldb.org/journal/VLDBJ4/P603.pdf (previously link is now dead) > The paper also has very good analysis of the economics of sorting: > > "Even for surprisingly large sorts, it is economical to perform the > sort in one pass." I suggest taking a look at "Figure 2. Replacement-selection sort vs. QuickSort" in the paper. It confirms what I said recently about cache size. The diagram is annotated: "The tournament tree of replacement-selection sort at left has bad cache behavior, unless the entire tournament fits in cache". I think we're well justified in giving no weight at all to cases where the *entire* tournament tree (heap) fits in cache, because it's not economical to use a cpu-cache-sized work_mem setting. It simply makes no sense. I understand the reluctance to give up on replacement selection. The authors of this paper were themselves reluctant to do so. As they put it: """ We were reluctant to abandon replacement-selection sort, because it has stability and it generates long runs. Our first approach was to improve replacement-selection sort's cache locality. Standard replacement-selection sort has terrible cache behavior, unless the tournament fits in cache. The cache thrashes on the bottom levels of the tournament. If you think of the tournament as a tree, each replacement-selection step traverses a path from a pseudo-random leaf of the tree to the root. The upper parts of the tree may be cache resident, but the bulk of the tree is not. We investigated a replacement-selection sort that clusters tournament nodes so that most parent-child node pairs are contained in the same cache line. This technique reduces cache misses by a factor of two or three. Nevertheless, replacement-selection sort is still less attractive than QuickSort because: 1. The cache behavior demonstrates less locality than QuickSorts. Even when QuickSort runs did not fit entirely in cache, the average compare-exchange time did not increase significantly. 2. Tournament sort is more CPU-intensive than QuickSort. Knuth calculated a 2:1 ratio for the programs he wrote. We observed a 2.5:1 speed advantage for QuickSort over the best tournament sort we wrote. The key to achieving high execution speeds on fast processors is to minimize the number of references that cannot be serviced by the on-board cache (4MB in the case of the DEC 7000 AXP). As mentioned before, QuickSort's memory access patterns are sequential and, thus, have good cache behavior """ This paper is co-authored by Jim Gray, a Turing award laureate, as well as some other very notable researchers. The paper appeared in "Readings in Database Systems, 4th edition", which was edited by by Joseph Hellerstein and Michael Stonebraker. These days, the cheapest consumer level CPUs have 4MB caches (in 1994, that was exceptional), so if this analysis wasn't totally justified in 1994, when the paper was written, it is today. I've spent a lot of time analyzing this problem. I've been looking at external sorting in detail for almost a year now. I've done my best to avoid any low-end regressions. I am very confident that I cannot do any better than I already have there, though. If various very influential figures in the database research community could not do better, then I have doubts that we can. I started with the intuition that we should still use replacement selection myself, but that just isn't well supported by benchmarking cases with sensible work_mem:cache size ratios. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers