On Sat, Mar 17, 2018 at 5:34 PM, Kefan Yang <starord...@gmail.com> wrote:
> I am Kefan Yang, a third-year Computing Science student from Simon Fraser
> University, Canada. I am very interested in the sorting algorithm
> benchmarking and implementation issue you mentioned on the ideas list of
> Google Summer of Code 2018.
> I've attached a proposal draft to this email. It's a brief one, but I
> guess it's enough to show my current ideas. :)

The proposal reads:

"""
Industrial implementation of selected sorting algorithm:
The industrial version is basically an optimization based on the benchmark
implementation. I plan to use optimizations like checking if the input
array is already sorted, or applying insertion sort directly for short
arrays, to see if the performance can be improved. I am still looking for
other common optimization methods.
"""

But our qsort implementation, which is the Bentley & McIlroy implementation
with a couple of customizations, already does everything you mention (I
refer to the presorted-check algorithmic customization, and the way that we
check which side of the pivot to use real recursion on, recursing only into
the smaller partition so that stack overflows are avoided). I suggest that
you read the paper [1] -- the code that we use is lifted almost directly
from it. The opaque-sounding variable names are the same, and our C code is
structured in exactly the same way as the paper's.

I think that this won't be a particularly easy project to get committed. If
you go forward with it, I suggest that you specifically look into
integrating "Dual-Pivot Quicksort" [2] as a whole-cloth replacement for the
B&M implementation. That seems to have some chance of working out, because
it's more or less acknowledged by Bentley himself to be a kind of successor
to his own industrial quicksort implementation [3] -- it's derived from it.
Note that Java 7 uses dual-pivot quicksort when sorting integers.

In general, we've had a lot of success with optimizing sorting in the past
few years by focusing on things like avoiding cache misses in comparators.
There has been much less success with algorithmic improvements, and no
success at all with novel algorithmic enhancements. PostgreSQL development
just isn't a great way to do that sort of thing.
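To make the customizations concrete, here is a minimal C sketch of the three
ideas discussed above: a presorted-input precheck, an insertion-sort cutoff
for short arrays, and using real recursion only for the smaller partition
while iterating on the larger one to bound stack depth. This is a
hypothetical illustration, not PostgreSQL's actual src/port/qsort.c; all
names (sort_sketch, INSERTION_THRESHOLD, etc.) are made up, and the
partition scheme here is a simple median-of-three Lomuto variant rather than
the B&M three-way partition.

```c
#include <stddef.h>

#define INSERTION_THRESHOLD 7	/* small-array cutoff, in the spirit of B&M */

static void
swap_int(int *x, int *y)
{
	int		tmp = *x;

	*x = *y;
	*y = tmp;
}

static void
insertion_sort(int *a, size_t n)
{
	for (size_t i = 1; i < n; i++)
	{
		int		key = a[i];
		size_t	j = i;

		while (j > 0 && a[j - 1] > key)
		{
			a[j] = a[j - 1];
			j--;
		}
		a[j] = key;
	}
}

static void
quicksort_sketch(int *a, size_t n)
{
	while (n > INSERTION_THRESHOLD)
	{
		size_t	mid = n / 2,
				store = 1;
		int		pivot;

		/* median-of-three: order a[0], a[mid], a[n-1] */
		if (a[mid] < a[0])
			swap_int(&a[mid], &a[0]);
		if (a[n - 1] < a[mid])
			swap_int(&a[n - 1], &a[mid]);
		if (a[mid] < a[0])
			swap_int(&a[mid], &a[0]);

		/* park the median pivot at a[n-2]; a[0], a[n-1] act as sentinels */
		swap_int(&a[mid], &a[n - 2]);
		pivot = a[n - 2];

		/* Lomuto-style partition of a[1..n-3] */
		for (size_t j = 1; j <= n - 3; j++)
			if (a[j] < pivot)
				swap_int(&a[j], &a[store++]);
		swap_int(&a[store], &a[n - 2]);		/* pivot into final position */

		/* real recursion only for the smaller side; loop on the larger */
		if (store < n - store - 1)
		{
			quicksort_sketch(a, store);
			a += store + 1;
			n -= store + 1;
		}
		else
		{
			quicksort_sketch(a + store + 1, n - store - 1);
			n = store;
		}
	}
	insertion_sort(a, n);
}

void
sort_sketch(int *a, size_t n)
{
	size_t	i;

	/* precheck: return immediately when input is already sorted */
	for (i = 1; i < n; i++)
		if (a[i] < a[i - 1])
			break;
	if (i >= n)
		return;
	quicksort_sketch(a, n);
}
```

Because the larger partition is handled by the loop rather than by a
recursive call, the recursion depth is bounded at O(log n) even on adversarial
input, which is the stack-overflow protection referred to above.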
BTW, I noticed that you go on to say this:

"""
However, since disk operations are much more expensive than in-memory
sorting, I am not sure if we can observe a significant difference in this
way.
"""

I think that you'd be surprised at how often this isn't true these days, at
least when sorting enough data for spilling to disk to be relevant. The
main reason is that the cost of writing out runs grows only linearly, and
so eventually becomes very small compared to the cost of the sorting
itself, which grows at a linearithmic rate.

[1] http://cs.fit.edu/~pkc/classes/writing/samples/bentley93engineering.pdf
[2] http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf
[3] http://mail.openjdk.java.net/pipermail/core-libs-dev/2009-September/002630.html
--
Peter Geoghegan