On 11/21/2010 06:09 PM, Robert Haas wrote:
> I think that's fair. It actually doesn't seem like it should be that hard if we knew that the server encoding were UTF8 - it's just a big translation table somewhere, no?
No, it's far more complex. See for example <http://unicode.org/reports/tr21/tr21-3.html>, which says:
    There are a number of complications to case mappings that occur once
    the repertoire of characters is expanded beyond ASCII.

      * Because of the inclusion of certain composite characters for
        compatibility, such as 01F1 "DZ" /capital dz/, there is a third
        case, called /titlecase/, which is used where the first letter of
        a word is to be capitalized (e.g. Titlecase, vs. UPPERCASE, or
        lowercase).
          o For example, the title case of the example character is
            01F2 "Dz" /capital d with small z/.
      * Case mappings may produce strings of different length than the
        original.
          o For example, the German character 00DF "ß" /small letter
            sharp s/ expands when uppercased to the sequence of two
            characters "SS". This also occurs where there is no
            precomposed character corresponding to a case mapping, such
            as with 0149 "'n" /latin small letter n preceded by
            apostrophe/.
      * Characters may also have different case mappings, depending on
        the context.
          o For example, 03A3 "Σ" /capital sigma/ lowercases to 03C3 "σ"
            /small sigma/ if it is followed by another letter, but
            lowercases to 03C2 "ς" /small final sigma/ if it is not.
      * Characters may have case mappings that depend on the locale.
          o For example, in Turkish the letter 0049 "I" /capital letter
            i/ lowercases to 0131 "ı" /small dotless i/.
      * Case mappings are not, in general, reversible.
          o For example, once the string "McGowan" has been uppercased,
            lowercased or titlecased, the original cannot be recovered
            by applying another uppercase, lowercase, or titlecase
            operation.

cheers

andrew
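
The cases listed in the report can be checked in a few lines. This is only an illustrative sketch, assuming Python 3 (3.3 or later), whose built-in str methods apply Unicode's locale-independent case mappings; it is not how the server does or would do it, and the variable names are made up for the example. Because these methods ignore the locale, the Turkish rule is not reproduced, which is itself the point of that item.

```python
# Sketch only: Python 3 str methods stand in for a Unicode-aware case mapper.
# All variable names here are hypothetical and chosen for illustration.

# 1. Titlecase is a third case: U+01F1 has distinct upper, title and lower forms.
dz_upper = "\u01F1"
dz_title = dz_upper.title()            # U+01F2, capital d with small z
dz_lower = dz_upper.lower()            # U+01F3, small dz

# 2. A mapping can change the length of the string.
sharp_s_upper = "\u00DF".upper()       # sharp s becomes the two characters "SS"
n_apos_upper = "\u0149".upper()        # no precomposed capital: U+02BC followed by "N"

# 3. A mapping can depend on context: capital sigma inside vs. at the end of a word.
sigma_medial = "\u03A3\u0391".lower()  # sigma followed by a letter -> U+03C3
sigma_final = "\u0391\u03A3".lower()   # sigma at the end of a word -> U+03C2

# 4. A mapping can depend on the locale. Python applies the default mapping,
#    so this yields "i"; Turkish rules would require U+0131 (dotless i).
i_lower = "I".lower()

# 5. Mappings are not reversible: the inner capital "G" is lost for good.
round_trip = "McGowan".upper().title()

print(len("\u00DF"), len(sharp_s_upper))  # length before and after uppercasing
print(round_trip)
```

A single one-to-one translation table cannot express items 2 to 4: the output may be longer than the input, may depend on neighbouring characters, and may depend on the locale.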