Re: [BUGS] equal operator fails on two identical strings if initdb

Tom Lane Wed, 24 Nov 2004 19:55:59 -0800

Kent Tong <[EMAIL PROTECTED]> writes:
> You mean the OS fails to convert Unicode strings to Big5, or the
> OS assumes the bytes are already in Big5?


The latter.

> Is it the locale used for initdb, or the default system locale
> set in Windows, that is used by the collation routines that you
> mentioned above?

The former.

The real problem here, IMHO, is that Postgres allows you to select a
"database encoding" setting that is different from the encoding implied
by the initdb locale (i.e., the LC_CTYPE setting).  If you make this
mistake, PG will carefully store data as byte sequences in the specified
"database encoding" ... and then pass them to strcoll() for comparison
... and strcoll() will assume that the data is in the encoding
associated with LC_CTYPE.
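
To make the mismatch concrete, here is a small Python sketch (not
PostgreSQL code). It only simulates the effect: misread() is a
hypothetical helper, and UTF-8 and Big5 are assumed as the database
encoding and the locale's encoding respectively. The real strcoll()
does not decode to Unicode like this, but it misinterprets the stored
bytes in the same way.

```python
def misread(text: str, stored_as: str, assumed: str) -> str:
    """Encode text in one encoding, then decode the bytes under another.

    stored_as stands in for the database encoding; assumed stands in for
    the encoding implied by LC_CTYPE. Both names are illustrative only.
    """
    raw = text.encode(stored_as)                   # bytes as stored on disk
    return raw.decode(assumed, errors="replace")   # what the library "sees"


original = "中文"

# Encodings agree: the bytes round-trip to the same string.
print(repr(misread(original, "utf-8", "utf-8")))

# Encodings disagree: the same bytes are read as different (or invalid) characters.
print(repr(misread(original, "utf-8", "big5")))
```

When the two encodings differ, the comparison routine is effectively
working on a different string than the one that was stored, so its
results cannot be trusted.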

This is partially bad design on our part (we should really not have
invented a per-database encoding selection when the locale setting is
not per-database) and partially bad design on the part of the C standard
(which doesn't provide any very sane way to find out what encoding is
implied by an LC_CTYPE setting).

I think the only real fix is to abandon the C library's locale routines
and find or write our own library with a better API.  This has been on
the TODO list for a long time but no one's quite wished to face up to
doing it ...

In the meantime, make sure your encoding setting agrees with the
LC_CTYPE value that initdb used.
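
A rough way to sanity-check this is sketched below in Python. It is an
assumption-laden illustration, not a PostgreSQL facility:
locale_codeset() and encodings_agree() are hypothetical helpers, the
locale name is assumed to be POSIX-style ("language_TERRITORY.codeset"),
and the encoding names are compared through Python's codec aliases,
which do not cover every PostgreSQL encoding name. Windows-style locale
names may need separate handling.

```python
import codecs
from typing import Optional


def locale_codeset(lc_ctype: str) -> Optional[str]:
    """Return the codeset part of a POSIX-style locale name, if present."""
    name = lc_ctype.split("@", 1)[0]      # drop a modifier such as "@euro"
    if "." not in name:
        return None                       # e.g. "C" names no codeset
    return name.split(".", 1)[1]


def encodings_agree(db_encoding: str, lc_ctype: str) -> Optional[bool]:
    """True/False if both names resolve to codecs; None if we cannot tell."""
    codeset = locale_codeset(lc_ctype)
    if codeset is None:
        return None
    try:
        # codecs.lookup() normalizes aliases, e.g. "UTF8" and "UTF-8".
        return codecs.lookup(db_encoding).name == codecs.lookup(codeset).name
    except LookupError:
        return None                       # a name Python does not recognize


print(encodings_agree("UTF8", "en_US.UTF-8"))
print(encodings_agree("UTF8", "zh_TW.Big5"))
```

A result of None means the check could not decide, not that the
settings are safe.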

                        regards, tom lane
