[HACKERS] Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

Andrew Dunstan Sat, 08 Jun 2013 20:52:26 -0700


On 06/08/2013 10:52 PM, Noah Misch wrote:

On Sat, Jun 08, 2013 at 08:09:15PM -0400, Robert Haas wrote:

On Sat, Jun 8, 2013 at 10:25 AM, Andrew Dunstan <and...@dunslane.net> wrote:

Don't downcase non-ascii identifier chars in multi-byte encodings.


Long-standing code has called tolower() on identifier character bytes
with the high bit set. This is clearly an error and produces junk output
when the encoding is multi-byte. This patch therefore restricts this
activity to cases where there is a character with the high bit set AND
the encoding is single-byte.

There have been numerous gripes about this, most recently from Martin
Schäfer.

Backpatch to all live releases.

I'm all for changing this, but back-patching seems like a terrible
idea.  This could easily break queries that are working now.

If more than one encoding covers the characters used in a given application,
that application's semantics should be the same regardless of which of those
encodings is in use.  We certainly don't _guarantee_ that today; PostgreSQL
leaves much to libc, which may not implement the relevant locales compatibly.
However, this change bakes into PostgreSQL itself a departure from that
principle.  If a database contains tables "ä" and "Ä", which of those "SELECT
* FROM Ä" finds will be encoding-dependent.  If we're going to improve the
current (granted, worse) downcase_truncate_identifier() behavior, we should
not adopt another specification bearing such surprises.

Let's return to the drawing board on this one.  I would be inclined to keep
the current bad behavior until we implement the i18n-aware case folding
required by SQL.  If I'm alone in thinking that, perhaps switch to downcasing
only ASCII characters regardless of the encoding.  That at least gives
consistent application behavior.

I apologize for not noticing to comment on this week's thread.

The behaviour which this fixes is an unambiguous bug. Calling tolower()
on the individual bytes of a multi-byte character can't possibly produce
any sort of correct result. A database that contains such corrupted
names, probably not valid in any encoding at all, is almost certainly
not restorable, and I'm not sure if it's dumpable either. It's already
produced several complaints in recent months, so ISTM that returning to
it for any period of time is unthinkable.
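
To make the rule concrete, here is a minimal sketch of the folding logic the commit message describes. It is an illustration only, not the committed code: fold_identifier() and its single_byte_encoding flag are hypothetical names, and the caller is assumed to know whether the database encoding is single-byte.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the PostgreSQL function): copy len bytes of
 * src into dst, folding to lower case.  dst must hold len + 1 bytes.
 */
void
fold_identifier(const char *src, size_t len,
                bool single_byte_encoding, char *dst)
{
    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) src[i];

        if (ch >= 'A' && ch <= 'Z')
        {
            /* Plain ASCII: safe to fold in every encoding. */
            ch += 'a' - 'A';
        }
        else if (single_byte_encoding && ch >= 0x80 && isupper(ch))
        {
            /*
             * High bit set, but one byte is one character here, so
             * the locale-dependent tolower() sees a whole character.
             */
            ch = (unsigned char) tolower(ch);
        }

        /*
         * Otherwise the byte is copied unchanged.  In a multi-byte
         * encoding a high-bit byte is only a fragment of a character,
         * so passing it to tolower() could corrupt the sequence.
         */
        dst[i] = (char) ch;
    }
    dst[len] = '\0';
}
```

With single_byte_encoding set to false, the UTF-8 bytes 0xC3 0x84 (the letter "Ä") pass through untouched while surrounding ASCII letters are folded, so the result is still a valid UTF-8 string. What the single-byte branch does depends on the active locale.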


cheers

andrew





