Re: [HACKERS] Patch for collation using ICU

John Hansen Sat, 07 May 2005 06:12:41 -0700

Bruce Momjian wrote:
> Palle Girgensohn wrote:
> > >
> > > Is this patch ready for application?
> > 
> > I don't think so, not quite. I have not had any positive 
> reports from 
> > linux users, this is only tested in a FreeBSD environment. 
> I'd say it 
> > needs some more testing.
> 
> OK.
> 
> > Also, apparently, ICU is installed by default in many linux 
> > distributions, and usually it is version 2.8. Some linux users have 
> > asked me if there are plans for a patch that works with ICU 2.8. 
> > That's probably a good idea. IBM and the ICU folks seem to consider 
> > 3.2 to be the stable version, older versions are hard to 
> find on their 
> > sites, but most linux distributers seem to consider it too bleeding 
> > edge, even gentoo. I don't know why they don't agree.
> 
> Good point.  Why would linux folks need ICU?  Doesn't their 
> OS support encodings natively?  I am particularly excited 
> about this for OSs that don't have such encodings, like UTF8 
> support for Win32.
> 
> Because ICU will not be used unless enabled by configure, it 
> seems we are fine with only supporting the newest version.  
> Do Linux users need to use ICU for any reason?


Yes, because on many linux platforms locale support is broken.
Also, ICU enables full unicode support, particularly in multi-language
situations where locale is C, and makes upper/lower/initcap work as
expected, except where it depends on locale information.

There are also many other useful things in ICU that could be
implemented. Transliteration, and break-iterators for example.
Break-iteration particularly interresting for converting a text to a
list of words. Another is it's builtin substring searches.

> 
> > > I do have a few questions:
> > >
> > > Why don't you use the lc_ctype_is_c() part of this test?
> > >
> > >   if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
> > 
> > Um, well, I didn't think about that. :)  What would be the 
> locale in 
> > this case? c_C.UTF-8? ;)  Hmm, it is possible to have 
> CTYPE=C and use 
> > a wide encoding, indeed. Then the strings will be handled 
> like byte-wide chars.
> > Yeah, it's a bug. I'll fix it! Thanks.
> 
> The additional test is more of an optmization, and it fixes a 
> problem with some OSs that have processing problems with UTF8 
> when the locale is supposed to be turned off, like in "C".  I 
> realize ICU might be fine with it but the optimization still 
> is an issue.

That the locale is supposed to be turned off, doesn't mean it shouldn't
use ICU.
ICU is more than just locales.

> > > Why is so much code added, for example, in lower()?  The existing 
> > > multibyte code is much smaller, and lots of code is added 
> in other 
> > > places too.
> > 
> > ICU uses UTF-16 internally, so all strings must be 
> converted from the 
> > database encoding to UTF-16. Since that means the strings 
> need to be 
> > copied, I took the same approach as in 
> varlena.c:varstr_cmp(), where 
> > small strings use the heap and only larger strings use a palloc. 
> > Comments in varstr_cmp about performance made me use that approach.
> 
> Oh, interesting.   I think you need to create new functions that
> factor out that common code so the patch is smaller and 
> easier to maintain.
> 
> > Also, in the latest patch, I also added checks and logging 
> for *every* 
> > status returned from ICU. I hope this will help debugging 
> on debian, 
> > where previous version didn't work. That excessive status 
> checking is 
> > hardly be necessary once the stuff is better tested.
> > 
> > I think the string copying and heap/palloc choices stands 
> for most of 
> > the code bloat, together with the excessive status checking 
> and logging.
> 
> OK, move that into some common functions and I think it will 
> be better.
> 
> > > Why do you need to add a mapping of encoding names from 
> iana to our 
> > > names?
> > 
> > This was already answered by John Hansen... There's an old 
> thread here 
> > about the choice of the name "UNICODE" to describe an 
> encoding, which 
> > it doesn't. There's half a dozen unicode based encodings... 
> UTF-8 is 
> > used by postgresql, that would have been a better name... Similarly 
> > for most other encodings, really. ICU expect a setlocale(3) string 
> > (i.e. IANA). PostgreSQL can't provide it, so a mapping 
> table is required.
> 
> We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). 
>  Does that help?
> 
> > I use this patch in production on one FreeBSD 4.10 server 
> at the moment. 
> > With the latest version, I've had no problems. Logging is 
> swithed on 
> > for now, and it shows no signs of ICU complaining. I'd like more 
> > reports on Linux, though.
> 
> OK, I certainly would like this all done for 8.1 which should 
> have feature freeze on July 1.
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   [email protected]               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, 
> Pennsylvania 19073
> 
> ---------------------------(end of 
> broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
> 
> 

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match

Re: [HACKERS] Patch for collation using ICU

Reply via email to