Did we ever address this?

---------------------------------------------------------------------------

Tom Lane wrote:
> I've been able to reproduce the behavior described here:
> http://archives.postgresql.org/pgsql-general/2011-03/msg00538.php
> It's specific to UTF8 locales on Mac OS X.  I'm not sure if the
> problem can manifest anywhere else; considering that OS X's UTF8
> locales have a general reputation of being broken, it may only
> happen on that platform.
>
> What is happening is that downcase_truncate_identifier() tries to
> downcase identifiers like this:
>
>     unsigned char ch = (unsigned char) ident[i];
>
>     if (ch >= 'A' && ch <= 'Z')
>         ch += 'a' - 'A';
>     else if (IS_HIGHBIT_SET(ch) && isupper(ch))
>         ch = tolower(ch);
>     result[i] = (char) ch;
>
> This is of course incapable of successfully downcasing any multibyte
> characters, but there's an assumption that isupper() won't return TRUE
> for a character fragment in a multibyte locale.  However, on OS X
> it seems that that's not the case :-(.  For the particular example
> cited by Francisco Figueiredo, I see the byte sequence \303\251
> converted to \343\251, because isupper() returns TRUE for \303 and
> then tolower() returns \343.  The byte \251 is not changed, but the
> damage is already done: we now have an invalidly-encoded string.
>
> It looks like the blame for the subsequent "disappearance" of the bogus
> data lies with fprintf back on the client side; that surprises me a bit
> because I'd only heard of glibc being so cavalier with data it thought
> was invalidly encoded.  But anyway, the origin of the problem is in the
> downcasing transformation.
>
> We could possibly fix this by not attempting the downcasing
> transformation on high-bit-set characters unless the encoding is
> single-byte.  However, we have the exact same downcasing logic embedded
> in the functions in src/port/pgstrcasecmp.c, and those don't have any
> convenient way of knowing what the prevailing encoding is --- when
> compiled for frontend use, they can't use pg_database_encoding_max_length.
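
For reference, the first option above (leave high-bit-set bytes alone unless the encoding is single-byte) might look roughly like the sketch below. The function name downcase_bytes_sketch and the encoding_is_single_byte flag are made up for illustration, and (ch & 0x80) stands in for IS_HIGHBIT_SET; this is not the actual PostgreSQL code, and it sidesteps the question of how the pgstrcasecmp.c functions would learn the encoding.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL's actual code: copy `len` bytes of
 * `ident` into `result`, downcasing as we go.  ASCII letters are always
 * folded; high-bit-set bytes are handed to isupper()/tolower() only when
 * the caller asserts that the encoding is single-byte.  `result` must have
 * room for len + 1 bytes.
 */
static void
downcase_bytes_sketch(const char *ident, size_t len, char *result,
                      bool encoding_is_single_byte)
{
    size_t      i;

    assert(ident != NULL && result != NULL);

    for (i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (encoding_is_single_byte && (ch & 0x80) && isupper(ch))
            ch = (unsigned char) tolower(ch);
        /* In a multibyte encoding, high-bit-set bytes are copied as-is. */
        result[i] = (char) ch;
    }
    result[len] = '\0';
}
```

With the flag false, the bytes \303\251 pass through untouched while ASCII letters are still folded, so "Caf\303\251" would come out as "caf\303\251" regardless of what the platform's isupper() thinks of \303.
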
>
> Or we could bite the bullet and start using str_tolower(), but the
> performance implications of that are unpleasant; not to mention that
> we really don't want to re-introduce the "Turkish problem" with
> unexpected handling of i/I in identifiers.
>
> Or we could go the other way and stop downcasing non-ASCII letters
> altogether.
>
> None of these options seem terribly attractive.  Thoughts?
>
>             regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +