On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan <pe...@2ndquadrant.com> wrote: > [ new patch ]
I spent quite a bit of time looking at this today - the patch specifically, and the issue of making quicksort fast more generally. It seemed to me that if we're going to have separate copies of the quicksort code for tuple sorting, we might as well go whole hog and specialize those copies to the particular needs of tuplesort.c as much as possible. Accordingly, I whacked the code around so that it knows that it is sorting SortTuple objects and need not conditionalize at runtime on the size of the objects being swapped. You suggested upthread that this might be worthwhile, and it seems that it is, so I think we should do it. Your patch removes the CHECK_FOR_INTERRUPTS() call from comparetup_heap, which is no good. However, since I'd already decided to specialize the copies of quicksort intended for sorting as much as possible, it made sense to me to refactor things so that the qsort routine itself, rather than the comparator, is responsible for calling CHECK_FOR_INTERRUPTS(). This slightly reduces the number of times we CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons before doing it. I find that your pg_always_inline macro is equivalent to just plain inline on my system (MacOS X v10.6.8, gcc 4.2.1). It seems to need something like this: +#elif __GNUC__ +#define pg_always_inline inline __attribute__((always_inline)) ...but I'm not very happy about relying on that, because I don't know that it will work on every gcc version (never mind non-gcc compilers), and I'm not convinced it's actually improving performance even on this one. The documentation seems to indicate that this is intended to force inlining even when not optimizing, which may have something to do with the lack of effect: that's not really the point here anyway. What I did instead is to replace template_qsort_arg.h with a script called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c that tuplesort.c then #includes. This seems more flexible to me than the macro-based approach. In particular, it allows me to generate versions of qsort with different call signatures. The attached patch generates two: static void qsort_tuple(SortTuple *a, size_t n, SortTupleComparator cmp_tuple, Tuplesortstate *state); static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup); The first of these is a drop-in replacement for qsort_arg() - any tuplesort can use it, not just heap sorts. But it is faster than qsort_arg() because of the specializations for the SortTuple data type. The second handles the special case where we are sorting by a single key that has an associated SortSupport object. In this case we don't need to carry the overhead of passing around the Tuplesortstate and dereferencing it, nor do we need the SortTupleComparator: we can just pass the SortSupport itself. Maybe there's a way to get this effect using macros, but I couldn't figure it out. At any rate, at least for the single-key case, this approach effectively forces the comparator to be inlined without requiring pg_always_inline. With this patch, I get the following results, as compared with your 2012-02-10 version and master, using the same test cases I tested before. select * from nodups order by g offset 10001; tps on master: 289.471274, 289.967984, 289.595958 tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900 tps on attached version: 388.212793, 386.085083, 386.867478 select * from twocol order by a, b offset 10000; tps on master: 261.676611, 260.440886, 259.529362 tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208 tps on attached version: 283.146463, 278.344827, 280.727798 select * from f8 order by g offset 10000; tps on master: 228.299924, 222.650355, 227.408506 tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456 tps on attached version: 276.985299, 275.341575, 274.428095 There's some variation (which I can't account for) between the results on master now and the results on master before - possibly just code shifting around between cache lines due to unrelated changes, or maybe some inadvertent change in my test setup. But it looks to me like your 2012-02-10 version, without any type-specific optimizations, does pretty much just as well on multi-key sorting as your previous version, which had them - or if there is a difference, it's pretty small. Overall, I think the numbers for the version I'm attaching here look pretty good: the single-key performance is clearly better than your last version, and the multi-key performance is very slightly worse. I think that slight worsening is a good trade-off, though, because this version can use qsort_tuple() for all kinds of tuplesorts, not just heap tuplesorts. Still, it seems like we ought to be able to do even better: the multi-key specialization that you had in your patch can be coded in this framework, too, and in theory those are ndependent of the swapcode improvements. I tried coding up a multi-key specialization which I believe to be quite similar to what you did, but it didn't seem to do much. I'm attaching it here; maybe there's some way to improve it (or a different test case where it pays off). It strikes me that if we wanted to take this further, we could look at squeezing out ApplySortComparator. For example, suppose that, upon discovering that we can do an in-memory quicksort on a single sort key, we make an initial pass over the data where we check whether it's sorted and, as we go, swap all the entries with isnull1 = true to the end of the memtuples array. We then sort the isnull1 = true entries with the standard comparator, and the isnull1 = false entries with an optimized comparator that elides most of ApplySortComparator and instead just calls the comparison function directly. We then decide on whether to emit the isnull1 = true entries first or last based on NULLS FIRST/LAST, and decide whether to emit the remaining entries in forward or reverse order based on ASC/DESC. Or maybe not exactly that thing, but something like that, so that we sort the null and non-null entries separately. The additional complexity in the read-out logic would probably be more than offset by being able to use a simpler comparator. What do you think of this version? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
fastpath_sort_2012_02_14.patch
Description: Binary data
multi-key.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers