> On 4 Aug 2016, at 02:40, Bruce Momjian <br...@momjian.us> wrote:
>
> On Thu, Aug 4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
>
> Uh, this email from July says Peter Eisentraut will submit it in
> September :-)
>
> https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa228892...@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support for COLLATE on columns.

They are at https://github.com/girgen/postgres/ in the branches icu/XXX, where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for encodings other than UTF-8, mostly since I don't know how to test them, but also because I see little point in supporting anything else with ICU.

I have one question for someone with knowledge about Turkish (Devrim?). This is the diff from the regression tests, when running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out  2016-08-10 21:09:03.000000000 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out        2016-08-10 21:12:53.000000000 +0200
***************
*** 373,379 ****
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  f
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 ----
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  t
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***************
*** 385,391 ****
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  t
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 ----
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  f
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.

======================================================================

The Linux locale behaves differently from ICU for the above (corner?) cases. Any ideas if one is more correct than the other? It seems unclear to me. Perhaps it depends on whether the case-insensitive match is done using lower(both) or upper(both)?

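To make the lower(both) versus upper(both) question concrete, here is a minimal Python sketch. It is only a toy model: the helpers tr_lower, tr_upper and contains_ci are hypothetical names made up for illustration, the Turkish mappings are hand-written for the dotted/dotless i pairs, and nothing here reflects how the PostgreSQL regex code, glibc, or ICU actually implement ~*.

```python
# Toy model only: fold both sides with one case mapping, then search.
# tr_lower, tr_upper and contains_ci are hypothetical helpers, not
# PostgreSQL, glibc, or ICU code.

# Turkish-specific pairs: I (U+0049) <-> dotless i (U+0131),
# and dotted capital I (U+0130) <-> i (U+0069).
TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})
TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})


def tr_lower(s):
    # Apply the Turkish pairs first, then the default lowercase mapping.
    return s.translate(TR_LOWER).lower()


def tr_upper(s):
    # Apply the Turkish pairs first, then the default uppercase mapping.
    return s.translate(TR_UPPER).upper()


def contains_ci(text, pattern, fold):
    # Naive stand-in for a case-insensitive substring match.
    return fold(pattern) in fold(text)


FOLDS = {
    "default lower": str.lower,
    "default upper": str.upper,
    "turkish lower": tr_lower,
    "turkish upper": tr_upper,
}

CASES = [("T\u00fcrkiye", "KI"), ("b\u0131t", "BIT")]

if __name__ == "__main__":
    for name, fold in FOLDS.items():
        row = [contains_ci(text, pat, fold) for text, pat in CASES]
        print(name, row)
```

In this sketch the two Turkish folds agree with each other on both test strings, while the two default folds disagree on 'bıt' versus 'BIT', because the default uppercase of dotless ı is plain I but the default lowercase of I is dotted i. That is an observation about the toy model only, not a diagnosis of either implementation.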
I haven't investigated this yet. @Devrim, is one more correct than the other?

As Thomas points out, using ucol_strcoll is quick, since no copying is needed. I will get some benchmarks soon.

Palle

[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html