On Fri, Sep 9, 2016 at 6:14 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
> 1. SortTuple.tupindex is not used when the sort fits in memory. If we
> also got rid of the isnull1 flag, we could shrink SortTuple from 24
> bytes to 16 bytes (on 64-bit systems). That would allow you to pack more
> tuples into memory, which generally speeds things up. We could for
> example steal the least-significant bit from the 'tuple' pointer for
> that, because the pointers are always at least 4-byte aligned. (But see
> next point)
I implemented what you sketched here back in 2016. That is, I found a way to make SortTuples 16 bytes on 64-bit platforms, down from 24 bytes (24 bytes includes alignment padding). Note that this only became possible after commit 8b304b8b72b removed replacement selection sort, which reduced the number of things that use the SortTuple.tupindex field to just one: it is now only used for the tapenum during merging. (I also had to remove SortTuple.isnull1 to get down to 16 bytes.)

The easy part was removing SortTuple.tupindex itself -- it was fairly natural to stash that in the slab allocation for each tape. I used the aset.c trick of having a metadata "chunk" immediately prior to the address that represents the allocation proper -- we can just back up by a few bytes from stup.tuple to find the place to stash the tape number during merging. The worst thing about this change was that it made a tape slab allocation mandatory in cases that previously had no need for a stup.tuple allocation at all (e.g. datum tuplesorts of pass-by-value types), though only during merging: since we must always store the tapenum when merging, we always need a slab buffer for each tape. This aspect wasn't so bad.

The hard/ugly part was getting rid of the remaining "unnecessary" SortTuple field, isnull1. This involved squeezing an extra bit out of the stup.tuple pointer by stealing its least-significant bit. This was invasive in about the way you'd expect it to be. It wasn't awful, but it also wasn't something I'd countenance pursuing without getting a fairly noticeable benefit for users. (Actually, the code that I've written so far *is* pretty awful, but I could certainly clean it up some more if I felt like it.) Rough sketches of both tricks appear below.

I think that the rough patch that I came up with gives us an accurate picture of the benefits of making SortTuples only 16 bytes wide -- and those benefits seem kind of underwhelming at this point. For something like a "SELECT COUNT(distinct my_int4_col) FROM tab" query, which uses the qsort_ssup() qsort specialization, we can easily go from getting an external sort to getting an internal sort. We can maybe end up sorting about 20% faster if things really work out for the patch. But in cases that users really care about, such as REINDEX, the difference is in the noise. ISTM that this is simply not worth the trouble at this time. These days external sorts are often slightly faster than internal sorts in practice, because we can do an on-the-fly merge with external sorts, so we could easily hurt performance by making more memory available!

I don't want to completely close the door on the idea of shrinking SortTuple to 16 bytes; I can imagine a world in which that matters. For example, perhaps there will one day be a strong case for SIMD-based internal sort algorithms for simple cases, such as the "COUNT(distinct my_int4_col)" query case that I mentioned. Probably SIMD-based multiway merge sort. I understand that such algorithms are very sensitive to things like SIMD CPU register sizes, and were only considered plausible competitors to quicksort with the advent of 512-bit SIMD registers, which haven't been available in mainstream CPUs for that long. I have to admit that this SIMD sorting stuff seems like a bit of a stretch, though. The papers in this area all seem to make rather optimistic assumptions about the distribution of values.
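
Here are the sketches I mentioned. To be clear, this is not the actual patch -- the names (TAPE_CHUNK_HDRSZ, stash_tapenum(), fetch_tapenum()) are invented for illustration. For tupindex, the idea is simply that the tape's slab allocator reserves a few bytes in front of every tuple it hands out, so the tape number can be recovered from the bare stup.tuple pointer while merging:

#include "postgres.h"

/* space the tape's slab allocator reserves in front of each tuple */
#define TAPE_CHUNK_HDRSZ    MAXALIGN(sizeof(int))

static inline void
stash_tapenum(void *tuple, int tapenum)
{
    /* the metadata "chunk" lives immediately before the tuple proper */
    *(int *) ((char *) tuple - TAPE_CHUNK_HDRSZ) = tapenum;
}

static inline int
fetch_tapenum(void *tuple)
{
    return *(int *) ((char *) tuple - TAPE_CHUNK_HDRSZ);
}

The real code has to teach the slab allocator to set aside the header space and keep the returned addresses MAXALIGN'd, which is why a slab allocation becomes mandatory while merging even for pass-by-value datum sorts.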
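The isnull1 side looks roughly like this (again just a sketch with invented accessor names, relying on the same postgres.h include). With the flag folded into bit 0 of the pointer, SortTuple comes down to a pointer plus datum1 -- the 16 bytes I mentioned:

typedef struct SortTuple
{
    void       *tuple;          /* tuple proper; bit 0 holds isnull1 */
    Datum       datum1;         /* value of first key column */
} SortTuple;

static inline void
sorttuple_set(SortTuple *stup, void *tuple, bool isnull1)
{
    /* palloc'd tuples are always at least 4-byte aligned, so bit 0 is free */
    Assert(((uintptr_t) tuple & 1) == 0);
    stup->tuple = (void *) ((uintptr_t) tuple | (uintptr_t) isnull1);
}

static inline void *
sorttuple_tuple(const SortTuple *stup)
{
    return (void *) ((uintptr_t) stup->tuple & ~(uintptr_t) 1);
}

static inline bool
sorttuple_isnull1(const SortTuple *stup)
{
    return ((uintptr_t) stup->tuple & 1) != 0;
}

The invasive part is that every place that currently touches stup->tuple or stup->isnull1 directly -- most of tuplesort.c, plus the comparetup/copytup/readtup routines -- has to go through accessors like these instead.
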
Getting back to SIMD: I think we'd have to be even more aggressive about shrinking SortTuples in order to realize the benefits of SIMD-based sorting. Besides, sorting itself is less and less the bottleneck for tuplesort-using operations these days -- the only remaining interesting bottleneck is probably in code like index_form_tuple(), which seems like a good target for JIT. In general, it's much harder to make tuplesort.c noticeably faster than it used to be -- we've picked all the low-hanging fruit.

--
Peter Geoghegan