On Sun, Jul 27, 2014 at 12:00 AM, Peter Geoghegan <p...@heroku.com> wrote: > Robert pointed out a case where varying character case of an English > word did not alter the primary level bytes (and thus the poor man's > normalized key was the same). He acknowledged that most of the entropy > of the first 8 bytes of the string went into the first 8 bytes of the > blob/key. This can actually be very useful to the optimization in some > cases. In particular, with most Latin alphabets you'll see the same > pattern when diacritics are used. This means that even though the > original string has (say) an accented character that would take 2 > bytes to store in UTF-8, the weight in the primary level is the same > as an equivalent unaccented character (and so only takes one byte to > store at that level, with differences only in subsequent levels). > Whole strxfrm() blobs are the same length regardless of how many > accents appear in otherwise equivalent original Latin string, and you > get exactly the same high concentrations of entropy in the first 8 > bytes in pretty much all Latin alphabets (the *absence* of diacritics > is stored in later weight levels too, even with the "en_US.UTF-8" > collation).
There are many other interesting cases where en_US.UTF-8, and presumably all other collations concentrate much more entropy into leading bytes than might be expected. Consider: pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "abc" "abc" -> 0c0d0e0109090901090909 (11 bytes) pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# abc" "# abc" -> 0c0d0e01090909010909090101760135 (16 bytes) pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "***** abc" "***** abc" -> 0c0d0e010909090109090901017301730173017301730135 (24 bytes) and, to show you what this looks like when the primary weights/original codepoints appear backwards: pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "cba" "cba" -> 0e0d0c0109090901090909 (11 bytes) pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# cba" "# cba" -> 0e0d0c01090909010909090101760135 (16 bytes) pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "***** cba" "***** cba" -> 0e0d0c010909090109090901017301730173017301730135 (24 bytes) Evidently, the implementation always places primary weights first, corresponding to "abc" (and later "cba") - the bytes "\0c\0d\0e" (and later "\0e\0d\0c") - no matter how many "extraneous" characters are placed in front. They're handled later. Space don't appear in the primary weight level at all: pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "a b c" "a b c" -> 0c0d0e01090909010909090102350235 (16 bytes) Lots of punctuation-type characters will not affect the primary weight level: pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "%@!()\/#-+,:^~? a b c" "%@!()\/#-+,:^~? a b c" -> 0c0d0e0109090901090909010177015d013e01500152017401420176013a0178013b013d011201170140013502350235 (48 bytes) Some non-alphabetic ASCII characters will affect the primary level, though. For example: pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "1.) a b c" "1.) a b c" -> 030c0d0e010909090901090909090102440152013502350235 (25 bytes) There is one extra byte here, in front of the "abc" bytes "\0c\0d\0e", a primary weight "\03" corresponding to '1'. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers