On Wed, Jul 30, 2014 at 10:52 AM, Robert Haas <robertmh...@gmail.com> wrote: > I don't think we should commit anything that's not clearly under the > PostgreSQL license.
I thought that we were comfortable with using MIT licensed code. But, come to think of it I don't know what our exact policy is. Wikipedia says "The Simplified BSD license used by FreeBSD is essentially identical[citation needed][discuss] to the MIT License, as it contains neither an advertising clause, nor a prohibition on promotional use of the copyright holder's name". There are plenty of suitably licensed hyperLogLog implementations in any case, and many are similar. Some direction here would be useful, since I'm not an expert on licensing. > Heikki previously made quite firmly the point that you shouldn't blend > the addition of sortsupport logic in general with the poor man's key > optimization in particular; they should be separate patches. He's > right. Ordinarily, I'd agree with this. But the fact is that there was an extensive discussion of that patch. We both know quite a lot about it, because you wrote it and I reviewed it [1]. What more is there to learn about it? It's not very controversial. It provides a text sort support routine that consistently wins by removing avoidable overhead. For the strcoll() case, you put the improvement at 7%. It clearly cannot regress anything. It's easy to reason about. It's a fine patch. We had a silly disagreement about it, but I would almost have been willing to commit it there and then (although Windows was still broken with that patch, something I've only just fixed, AFAIK). This patch may cause regressions. They may not be that bad, but they need to be characterized and avoided. It can also make many representative cases more than 3 times faster (an improvement in excess of 200%). Even low cardinality cases (e.g sorting a list of cities in the world with more than a thousand inhabitants by country) can be up to 150% faster. That's a very strong improvement, and the lack of this kind of facility surely accounts for why other systems are said to sort text so much more efficiently than Postgres (something we discussed privately years ago). But, regressions are bad. Plus, I freely admit that this whole idea is ugly as sin. That's the only thing to discuss, AFAICT. We know that the best case performance is almost entirely attributable to the normalized key stuff. > Some other concerns: > > 1. As I think I mentioned before, I think the term "poor man's key" is > just terrible. It conveys, at least to me, no useful information > about what is really being done. If you called it, say, > "strxfrm-prefix comparison", it would be only a few more letters but a > whole lot more informative to future readers of the code. I don't have a problem with changing the name. But the name that you propose is all about text. This patch is intended to add an extensible infrastructure (a new part of sort support), plus one client of that more complete extensible infrastructure (sort support for text). I think there's a good chance that it could work well for another type too, like numeric, if we could come up with a good system for encoding numeric normalized keys, and if we can similarly get over the fact that there is going to be a minority of cases that won't be helped at all. That's a whole other discussion, though, and text is clearly the really compelling case. Do you have another suggestion? > 2. The need to modify analyze.c and nodeMergeAppend.c to disable this > optimization seems odd, and the comments are uninformative, saying > only that it "isn't feasible", but not why. If it were only analyze.c > I might expect that it required some kind of transaction state that > isn't present there, but that doesn't explain the merge-append case. The reason is that there is no convenient way to inject a conversion routine. A binary heap is calling heap_compare_slots(), which is where comparisons occur, so it isn't really sorting at all. It would be ugly, and probably even counter-productive to do normalization just in time, as tuples are pulled up. In general I assume that it's only useful to do a full normalization pass before sorting (preferably at a convenient juncture like copytup_heap()). Note that we still elide the fmgr trampoline, so only the additional normalization optimization fails to apply. I should add that there is one other case where we arbitrarily don't use the additional optimization that doesn't actually make any sense. Within nodeAgg.c, "We use a plain Datum sorter when there's a single input column; otherwise sort the full tuple". And thus, the optimization may not be used there more or less by accident. tuplesort_putdatum() needs to be taught about the new optimization too, I think. I feel it should be possible for both the Datum and Heap sort cases to use the new optimization (since they can always use sort support) wherever they can afford to make that normalization pass (they usually can, since copytup* is usually passed through). Eventually, I think we'll want to extend sort support to work with btree index builds. Please don't review the abort logic until I finish producing a new revision. It won't be long. I've come up with something appreciably better [2]. [1] http://www.postgresql.org/message-id/CA+Tgmoa8by24gd+YbuPX=5gsgmn0w5sgipzwwq7_8is26vl...@mail.gmail.com [2] http://www.postgresql.org/message-id/cam3swzryktbotvsun2r1j95xr5gnrvm+zbvz+rxhlq0clz4...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers