Bruce Momjian wrote: > Palle Girgensohn wrote: > > > > > > Is this patch ready for application? > > > > I don't think so, not quite. I have not had any positive > reports from > > linux users, this is only tested in a FreeBSD environment. > I'd say it > > needs some more testing. > > OK. > > > Also, apparently, ICU is installed by default in many linux > > distributions, and usually it is version 2.8. Some linux users have > > asked me if there are plans for a patch that works with ICU 2.8. > > That's probably a good idea. IBM and the ICU folks seem to consider > > 3.2 to be the stable version, older versions are hard to > find on their > > sites, but most linux distributers seem to consider it too bleeding > > edge, even gentoo. I don't know why they don't agree. > > Good point. Why would linux folks need ICU? Doesn't their > OS support encodings natively? I am particularly excited > about this for OSs that don't have such encodings, like UTF8 > support for Win32. > > Because ICU will not be used unless enabled by configure, it > seems we are fine with only supporting the newest version. > Do Linux users need to use ICU for any reason?
Yes, because on many linux platforms locale support is broken. Also, ICU enables full unicode support, particularly in multi-language situations where locale is C, and makes upper/lower/initcap work as expected, except where it depends on locale information. There are also many other useful things in ICU that could be implemented. Transliteration, and break-iterators for example. Break-iteration particularly interresting for converting a text to a list of words. Another is it's builtin substring searches. > > > > I do have a few questions: > > > > > > Why don't you use the lc_ctype_is_c() part of this test? > > > > > > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) > > > > Um, well, I didn't think about that. :) What would be the > locale in > > this case? c_C.UTF-8? ;) Hmm, it is possible to have > CTYPE=C and use > > a wide encoding, indeed. Then the strings will be handled > like byte-wide chars. > > Yeah, it's a bug. I'll fix it! Thanks. > > The additional test is more of an optmization, and it fixes a > problem with some OSs that have processing problems with UTF8 > when the locale is supposed to be turned off, like in "C". I > realize ICU might be fine with it but the optimization still > is an issue. That the locale is supposed to be turned off, doesn't mean it shouldn't use ICU. ICU is more than just locales. > > > Why is so much code added, for example, in lower()? The existing > > > multibyte code is much smaller, and lots of code is added > in other > > > places too. > > > > ICU uses UTF-16 internally, so all strings must be > converted from the > > database encoding to UTF-16. Since that means the strings > need to be > > copied, I took the same approach as in > varlena.c:varstr_cmp(), where > > small strings use the heap and only larger strings use a palloc. > > Comments in varstr_cmp about performance made me use that approach. > > Oh, interesting. I think you need to create new functions that > factor out that common code so the patch is smaller and > easier to maintain. > > > Also, in the latest patch, I also added checks and logging > for *every* > > status returned from ICU. I hope this will help debugging > on debian, > > where previous version didn't work. That excessive status > checking is > > hardly be necessary once the stuff is better tested. > > > > I think the string copying and heap/palloc choices stands > for most of > > the code bloat, together with the excessive status checking > and logging. > > OK, move that into some common functions and I think it will > be better. > > > > Why do you need to add a mapping of encoding names from > iana to our > > > names? > > > > This was already answered by John Hansen... There's an old > thread here > > about the choice of the name "UNICODE" to describe an > encoding, which > > it doesn't. There's half a dozen unicode based encodings... > UTF-8 is > > used by postgresql, that would have been a better name... Similarly > > for most other encodings, really. ICU expect a setlocale(3) string > > (i.e. IANA). PostgreSQL can't provide it, so a mapping > table is required. > > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). > Does that help? > > > I use this patch in production on one FreeBSD 4.10 server > at the moment. > > With the latest version, I've had no problems. Logging is > swithed on > > for now, and it shows no signs of ICU complaining. I'd like more > > reports on Linux, though. > > OK, I certainly would like this all done for 8.1 which should > have feature freeze on July 1. > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, > Pennsylvania 19073 > > ---------------------------(end of > broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > > ---------------------------(end of broadcast)--------------------------- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match