Re: [HACKERS] Patch for collation using ICU

Palle Girgensohn Sat, 07 May 2005 06:19:53 -0700

--On l�rdag, maj 07, 2005 08.37.05 -0400 Bruce Momjian <[email protected]> wrote:

Palle Girgensohn wrote:

>
> Is this patch ready for application?

I don't think so, not quite. I have not had any positive reports from
linux  users, this is only tested in a FreeBSD environment. I'd say it
needs some  more testing.

OK.


John Hansen just reported that it does work on linux. fine!

Also, apparently, ICU is installed by default in many linux
distributions,  and usually it is version 2.8. Some linux users have
asked me if there are  plans for a patch that works with ICU 2.8. That's
probably a good idea. IBM  and the ICU folks seem to consider 3.2 to be
the stable version, older  versions are hard to find on their sites, but
most linux distributers seem  to consider it too bleeding edge, even
gentoo. I don't know why they don't  agree.


Good point.  Why would linux folks need ICU?  Doesn't their OS support
encodings natively?  I am particularly excited about this for OSs that
don't have such encodings, like UTF8 support for Win32.

Because ICU will not be used unless enabled by configure, it seems we
are fine with only supporting the newest version.  Do Linux users need
to use ICU for any reason?

There are corner cases where it is impossible to upper/lowercase one character at the time. for example:

-- without ICU
select upper('E�er');
upper
-------
E�ER
(1 row)

-- with ICU
select upper('E�er');
upper
-------
ESSER
(1 rad)

This is because in the standard postgres implementation, upper/lower is done one character at the time. A proper upper/lower cannot do it that way. Other known example is in Turkish, where an � (?) should look different whether it is an initial letter or not. This fails in standard postgresql for all platforms.

> I do have a few questions:
>
> Why don't you use the lc_ctype_is_c() part of this test?
>
>    if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())

Um, well, I didn't think about that. :)  What would be the locale in
this  case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a
wide  encoding, indeed. Then the strings will be handled like byte-wide
chars.  Yeah, it's a bug. I'll fix it! Thanks.


The additional test is more of an optmization, and it fixes a problem
with some OSs that have processing problems with UTF8 when the locale is
supposed to be turned off, like in "C".  I realize ICU might be fine
with it but the optimization still is an issue.

Well, the results are quite different, depending on whether ICU is used or not. See separate mail.

> Why is so much code added, for example, in lower()?  The existing
> multibyte code is much smaller, and lots of code is added in other
> places too.

ICU uses UTF-16 internally, so all strings must be converted from the
database encoding to UTF-16. Since that means the strings need to be
copied, I took the same approach as in varlena.c:varstr_cmp(), where
small  strings use the heap and only larger strings use a palloc.
Comments in  varstr_cmp about performance made me use that approach.


Oh, interesting.   I think you need to create new functions that
factor out that common code so the patch is smaller and easier to
maintain.


Hmm, yes, perhaps it can be refactored a bit. It has ocurred to me...

Also, in the latest patch, I also added checks and logging for *every*
status returned from ICU. I hope this will help debugging on debian,
where  previous version didn't work. That excessive status checking is
hardly be  necessary once the stuff is better tested.

I think the string copying and heap/palloc choices stands for most of
the  code bloat, together with the excessive status checking and logging.

OK, move that into some common functions and I think it will be better.

Best way for upper/lower/initcap is probably to use a function pointer... uhh...

> Why do you need to add a mapping of encoding names from iana to our
> names?

This was already answered by John Hansen... There's an old thread here
about the choice of the name "UNICODE" to describe an encoding, which it
doesn't. There's half a dozen unicode based encodings... UTF-8 is used
by  postgresql, that would have been a better name... Similarly for most
other  encodings, really. ICU expect a setlocale(3) string (i.e. IANA).
PostgreSQL  can't provide it, so a mapping table is required.


We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
help?

I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately.

I use this patch in production on one FreeBSD 4.10 server at the moment.
With the latest version, I've had no problems. Logging is swithed on for
now, and it shows no signs of ICU complaining. I'd like more reports on
Linux, though.


OK, I certainly would like this all done for 8.1 which should have
feature freeze on July 1.


That shouldn't be a problem.

/Palle


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  [email protected]               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania
19073


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Patch for collation using ICU

Reply via email to