Re: [HACKERS] tolower() identifier downcasing versus multibyte encodings

Bruce Momjian Mon, 05 Sep 2011 19:18:56 -0700

Did we ever address this?

---------------------------------------------------------------------------


Tom Lane wrote:
> I've been able to reproduce the behavior described here:
> http://archives.postgresql.org/pgsql-general/2011-03/msg00538.php
> It's specific to UTF8 locales on Mac OS X.  I'm not sure if the
> problem can manifest anywhere else; considering that OS X's UTF8
> locales have a general reputation of being broken, it may only
> happen on that platform.
> 
> What is happening is that downcase_truncate_identifier() tries to
> downcase identifiers like this:
> 
>               unsigned char ch = (unsigned char) ident[i];
> 
>               if (ch >= 'A' && ch <= 'Z')
>                       ch += 'a' - 'A';
>               else if (IS_HIGHBIT_SET(ch) && isupper(ch))
>                       ch = tolower(ch);
>               result[i] = (char) ch;
> 
> This is of course incapable of successfully downcasing any multibyte
> characters, but there's an assumption that isupper() won't return TRUE
> for a character fragment in a multibyte locale.  However, on OS X
> it seems that that's not the case :-(.  For the particular example
> cited by Francisco Figueiredo, I see the byte sequence \303\251
> converted to \343\251, because isupper() returns TRUE for \303 and
> then tolower() returns \343.  The byte \251 is not changed, but the
> damage is already done: we now have an invalidly-encoded string.
> 
> It looks like the blame for the subsequent "disappearance" of the bogus
> data lies with fprintf back on the client side; that surprises me a bit
> because I'd only heard of glibc being so cavalier with data it thought
> was invalidly encoded.  But anyway, the origin of the problem is in the
> downcasing transformation.
> 
> We could possibly fix this by not attempting the downcasing
> transformation on high-bit-set characters unless the encoding is
> single-byte.  However, we have the exact same downcasing logic embedded
> in the functions in src/port/pgstrcasecmp.c, and those don't have any
> convenient way of knowing what the prevailing encoding is --- when
> compiled for frontend use, they can't use pg_database_encoding_max_length.
> 
> Or we could bite the bullet and start using str_tolower(), but the
> performance implications of that are unpleasant; not to mention that
> we really don't want to re-introduce the "Turkish problem" with
> unexpected handling of i/I in identifiers.
> 
> Or we could go the other way and stop downcasing non-ASCII letters
> altogether.
> 
> None of these options seem terribly attractive.  Thoughts?
> 
>                       regards, tom lane
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
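
For reference, the first option in the quoted message (leave high-bit-set bytes alone unless the encoding is single-byte) might look roughly like the sketch below. It is only an illustration, not the actual downcase_truncate_identifier() code: the function name downcase_ident_sketch and its encoding_max_length parameter are hypothetical, the caller is assumed to obtain that value from something like pg_database_encoding_max_length(), and IS_HIGHBIT_SET is redefined locally as a stand-in for the backend macro.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the backend macro of the same name. */
#define IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & 0x80)

/*
 * Hypothetical sketch: copy len bytes of src into dst, downcasing as we go.
 * dst must have room for len + 1 bytes.  encoding_max_length is assumed to
 * be supplied by the caller (1 means a single-byte encoding).
 */
void
downcase_ident_sketch(const char *src, char *dst, size_t len,
					  int encoding_max_length)
{
	bool		single_byte = (encoding_max_length == 1);
	size_t		i;

	for (i = 0; i < len; i++)
	{
		unsigned char ch = (unsigned char) src[i];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';	/* ASCII is always folded, locale-independent */
		else if (single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
			ch = (unsigned char) tolower(ch);	/* only safe for whole characters */
		/* in a multibyte encoding, high-bit-set bytes are copied untouched */
		dst[i] = (char) ch;
	}
	dst[len] = '\0';
}
```

Called with encoding_max_length greater than 1, the bytes \303\251 from the example pass through unchanged no matter what isupper() reports for \303, so the result stays validly encoded. This does nothing for the copies of the logic in src/port/pgstrcasecmp.c, which, as noted above, have no convenient way to learn the encoding.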

-- 
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

