On Sun, Jul 27, 2014 at 12:34 PM, Peter Geoghegan <p...@heroku.com> wrote:
> It's more or less testing for a primary weight level (i.e. the first
> part of the blob) that is no larger than the original characters of
> the string, and has no "header bytes" or other redundancies. It also
> matches secondary and subsequent weight levels to ensure that they
> match, since the two strings tested have identical case, use of
> diacritics, etc (they're both lowercase ASCII-safe strings). I don't
> set a locale, but that shouldn't matter.

Actually, come to think of it, that might not quite be true. Consider
this output from Robert's strxfrm test program:

pg@hamster:~/code$ ./strxfrm hu_HU.utf8 potyty
"potyty" -> 2826303001090909090109090909 (14 bytes)
pg@hamster:~/code$ ./strxfrm hu_HU.utf8 potyta
"potyta" -> 2826302e0c010909090909010909090909 (17 bytes)

This is a very esoteric Hungarian collation rule [1], which at one
point we found we had to plaster over within varstr_cmp() to prevent
indexes giving wrong answers [2]. It turns out that with this
collation, strcoll("potyty", "potty") == 0.

The point specifically is that in principle, collations can alter the
number of weights that appear in the primary level of the blob. This
might imply that the number of primary level bytes for the ASCII-safe
string "abcdefgh" might not equal that of "ijklmnop" for some
collation, because of the application of some similar esoteric rule.
In principle, collations are at liberty to make that happen, even
though this hardly ever occurs in practice (we first heard about it in
2005, although the Unicode algorithm standard warns of this), and even
though in any of the cases where it does occur, it probably happens
not to affect my little AC_TRY_RUN program.

Even so, I'm not comfortable with the deficiency of the program. I
don't want my optimization to accidentally not apply just because some
hypothetical collation where this is true was in use when Postgres was
built. It probably couldn't happen, but I must admit that guaranteeing
that it can't is a mess.

I suppose I could fix this by no longer assuming that the number of
bytes that appear in the primary level is fixed at n for strings of n
ASCII code points. I think that in theory even that could break,
though, because we have no principled way of parsing out different
weight levels (the Unicode standard has some ideas about how to do so
given strxfrm()'s "no NUL bytes in blob" restriction, but that's
clearly implementation defined).

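For anyone who wants to poke at this without Robert's program (which
isn't reproduced here), the following is a minimal sketch of the kind
of helper that could produce hex dumps like the ones above. The names
xfrm_hex() and same_blob_length() are hypothetical, made up for
illustration. It relies only on standard strxfrm() behavior, including
the documented strxfrm(NULL, s, 0) idiom for obtaining the blob length.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd hex dump of the strxfrm() blob for "s" under the
 * current LC_COLLATE setting, or NULL if allocation fails.
 * Hypothetical helper; not Robert's actual test program.
 */
static char *
xfrm_hex(const char *s)
{
    size_t          len;
    size_t          i;
    unsigned char  *blob;
    char           *hex;

    assert(s != NULL);

    /* With a zero-size buffer, strxfrm() only reports the blob length */
    len = strxfrm(NULL, s, 0);
    blob = malloc(len + 1);
    hex = malloc(2 * len + 1);
    if (blob == NULL || hex == NULL)
    {
        free(blob);
        free(hex);
        return NULL;
    }

    strxfrm((char *) blob, s, len + 1);
    for (i = 0; i < len; i++)
        sprintf(hex + 2 * i, "%02x", (unsigned int) blob[i]);
    hex[2 * len] = '\0';

    free(blob);
    return hex;
}

/*
 * Report whether two strings yield blobs of the same total length.
 * This compares whole blobs only; it makes no attempt to separate
 * weight levels, since there is no portable way to do that.
 */
static int
same_blob_length(const char *a, const char *b)
{
    return strxfrm(NULL, a, 0) == strxfrm(NULL, b, 0);
}
```

A caller would first do setlocale(LC_COLLATE, "hu_HU.utf8") (assuming
that locale is installed) and then print xfrm_hex("potyty") and
xfrm_hex("potyta"). The exact bytes depend on the libc and its locale
data, so the output need not match what is shown above.
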
Given that Mac OS X is the only platform that appears to have this
header byte problem at all, I think we'd be better off specifically
disabling the optimization on Mac OS X. I was very surprised to learn
of the problem on Mac OS X. Clearly it's going against the grain by
having the problem.

[1] http://www.postgresql.org/message-id/43a16bb7.7030...@mage.hu
[2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

-- 
Peter Geoghegan