On Sat, Sep 24, 2016 at 10:13 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Fri, Sep 23, 2016 at 7:27 AM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>> It looks like varstr_abbrev_convert calls strxfrm unconditionally
>> (assuming TRUST_STRXFRM is defined). <captain-obvious>This needs to
>> use ucol_getSortKey instead when appropriate.</> It looks like it's a
>> bit more helpful than strxfrm about telling you the output buffer size
>> it wants, and it doesn't need nul termination, which is nice.
>> Unfortunately it is like strxfrm in that the output buffer's contents
>> is unspecified if it ran out of space.
>
> One can use the ucol_nextSortKeyPart() interface to just get the first
> 4/8 bytes of an abbreviated key, reducing the overhead somewhat, so
> the output buffer size limitation is probably irrelevant. The ICU
> documentation says something about this being useful for Radix sort,
> but I suspect it's more often used to generate abbreviated keys.
> Abbreviated keys were not my original idea. They're really just a
> standard technique.
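
To make the 4/8-byte idea quoted above concrete, here is a minimal
sketch of packing the leading bytes of a sort key into a fixed-width
integer. The helper name pack_abbrev_key is hypothetical and this is
not the PostgreSQL implementation; it only assumes that the sort key
bytes are meant to be compared bytewise, as with memcmp().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not PostgreSQL code): pack up to 8 leading sort
 * key bytes into a uint64_t, most significant byte first, zero-padding
 * keys shorter than 8 bytes.  Unsigned comparison of two packed values
 * then gives the same answer as memcmp() on the zero-padded 8-byte
 * prefixes.
 */
static uint64_t
pack_abbrev_key(const unsigned char *sortkey, size_t len)
{
    uint64_t    key = 0;
    size_t      i;

    assert(sortkey != NULL || len == 0);

    for (i = 0; i < 8; i++)
    {
        key <<= 8;              /* make room for the next byte */
        if (i < len)
            key |= sortkey[i];  /* bytes past the end stay zero */
    }
    return key;
}
```

Fed the en_US bytes for "ab" (29 2b 01 06 01 06) and "abc"
(29 2b 2d 01 07 01 07) listed later in this message, the packed values
are 0x292b010601060000 and 0x292b2d0107010700, so "ab" sorts first. Two
equal packed values only mean the prefixes tie; a full comparison is
still needed to break the tie.
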

Nice!  The other advantage of ucol_nextSortKeyPart is that you don't
have to convert the whole string to UChar (UTF16) first, as I think you
would need to with ucol_getSortKey, because the UCharIterator mechanism
can read directly from a UTF8 string.  I see in the documentation that
ucol_nextSortKeyPart and ucol_getSortKey don't have compatible output,
and this caveat may be related to whether sort key compression is used.

I don't understand what sort of compression is involved but out of
curiosity I asked ICU to spit out some sort keys from
ucol_nextSortKeyPart so I could see their size.  As you say, we can ask
it to stop at 4 or 8 bytes which is very convenient for our purposes,
but here I asked for more to get the full output so I could see where
the primary weight part ends.  The primary weight took one byte for the
Latin letters I tried and two for the Japanese characters I tried
(except 一 which was just 0xaa).

ucol_nextSortKeyPart(en_US, "a", ...) -> 29 01 05 01 05
ucol_nextSortKeyPart(en_US, "ab", ...) -> 29 2b 01 06 01 06
ucol_nextSortKeyPart(en_US, "abc", ...) -> 29 2b 2d 01 07 01 07
ucol_nextSortKeyPart(en_US, "abcd", ...) -> 29 2b 2d 2f 01 08 01 08
ucol_nextSortKeyPart(en_US, "A", ...) -> 29 01 05 01 dc
ucol_nextSortKeyPart(en_US, "AB", ...) -> 29 2b 01 06 01 dc dc
ucol_nextSortKeyPart(en_US, "ABC", ...) -> 29 2b 2d 01 07 01 dc dc dc
ucol_nextSortKeyPart(en_US, "ABCD", ...) -> 29 2b 2d 2f 01 08 01 dc dc dc dc
ucol_nextSortKeyPart(ja_JP, "一", ...) -> aa 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "一二", ...) -> aa d0 0f 01 06 01 06
ucol_nextSortKeyPart(ja_JP, "一二三", ...) -> aa d0 0f cb b8 01 07 01 07
ucol_nextSortKeyPart(ja_JP, "一二三四", ...) -> aa d0 0f cb b8 cb d5 01 08 01 08
ucol_nextSortKeyPart(ja_JP, "日", ...) -> d0 18 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "日本", ...) -> d0 18 d1 d0 01 06 01 06
ucol_nextSortKeyPart(fr_FR, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_FR, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_FR, "coté", ...) -> 2d 45 4f 31 01 42 88 01 09
ucol_nextSortKeyPart(fr_FR, "côté", ...) -> 2d 45 4f 31 01 44 8e 44 88 01 0a
ucol_nextSortKeyPart(fr_CA, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_CA, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_CA, "coté", ...) -> 2d 45 4f 31 01 88 08 01 09
ucol_nextSortKeyPart(fr_CA, "côté", ...) -> 2d 45 4f 31 01 88 44 8e 06 01 0a

I wonder how it manages to deal with fr_CA's reversed secondary
weighting rule which requires you to consider diacritics in reverse
order -- apparently abandoned in France but still used in Canada --
using a fixed size space for state between calls.

-- 
Thomas Munro
http://www.enterprisedb.com
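
P.S. For anyone wanting to check "where the primary weight part ends"
mechanically: in the dumps above each level is followed by a 01 byte,
which matches ICU's use of 0x01 as the level separator in sort keys.
A minimal sketch follows, with a hypothetical helper name, assuming
the input is a complete sort key rather than a truncated part.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the number of leading bytes that belong
 * to the primary level, i.e. everything before the first 0x01 level
 * separator.  If no separator is found (for example because the key
 * was truncated), the whole buffer length is returned.
 */
static size_t
primary_weight_len(const unsigned char *sortkey, size_t len)
{
    size_t      i;

    assert(sortkey != NULL || len == 0);

    for (i = 0; i < len; i++)
    {
        if (sortkey[i] == 0x01)
            break;              /* first separator ends the primary level */
    }
    return i;
}
```

Applied to the en_US key for "abcd" (29 2b 2d 2f 01 08 01 08) this
gives 4, one byte per letter, and applied to the ja_JP key for "日本"
(d0 18 d1 d0 01 06 01 06) it also gives 4, two bytes per character.
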