Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

Palle Girgensohn Wed, 10 Aug 2016 13:42:37 -0700

> On 4 Aug 2016, at 02:40, Bruce Momjian <br...@momjian.us> wrote:
> 
> On Thu, Aug  4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
> 
> Uh, this email from July says Peter Eisentraut will submit it in
> September  :-)
> 
>       
> https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa228892...@2ndquadrant.com


Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support 
for COLLATE on columns.


https://github.com/girgen/postgres/


in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I 
have of course updated them regularly. In this latest version, I've removed 
support for encodings other than UTF-8, mostly since I don't know how to test 
them, but also because I see little point in supporting anything else with ICU.



I have one question for someone with knowledge about Turkish (Devrim?). This is 
the diff from regression tests, when running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out	2016-08-10 21:09:03.000000000 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out	2016-08-10 21:12:53.000000000 +0200
***************
*** 373,379 ****
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  f
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 ----
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  t
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***************
*** 385,391 ****
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  t
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 ----
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  f
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.

======================================================================

The Linux locale behaves differently from ICU for the above (corner?) cases. 
Any ideas whether one is more correct than the other? It seems unclear to me. 
Perhaps it depends on whether the case-insensitive match is done using 
lower(both) or upper(both)? I haven't investigated this yet. @Devrim, is one 
more correct than the other?
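
To make the lower(both) / upper(both) question concrete, here is a minimal 
sketch in Python. It is only an illustration under stated assumptions: the 
Turkish rules are hand-written (I <-> ı, İ <-> i), the helper names are 
made up for this example, and a plain substring test stands in for ~*. It is 
not how PostgreSQL, glibc or ICU actually implement the match.

```python
# Illustration only: hand-written Turkish case rules, hypothetical helper names.
# This is NOT PostgreSQL, glibc or ICU code.

# Turkish pairs dotted with dotted and dotless with dotless: I <-> ı, İ <-> i
TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})
TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})


def tr_lower(s):
    """Lowercase using Turkish rules for the I/ı and İ/i pairs."""
    return s.translate(TR_LOWER).lower()


def tr_upper(s):
    """Uppercase using Turkish rules for the I/ı and İ/i pairs."""
    return s.translate(TR_UPPER).upper()


def imatch(text, pattern, fold):
    """Crude stand-in for ~*: fold both sides, then do a substring test."""
    return fold(pattern) in fold(text)


# The two cases from the regression diff above.
CASES = [("Türkiye", "KI"), ("bıt", "BIT")]

FOLDS = {
    "turkish lower": tr_lower,
    "turkish upper": tr_upper,
    "default lower": str.lower,  # locale-independent Unicode mapping
    "default upper": str.upper,
}


def results():
    """Return {fold name: [result for case 1, result for case 2]}."""
    return {
        name: [imatch(text, pat, fold) for text, pat in CASES]
        for name, fold in FOLDS.items()
    }


if __name__ == "__main__":
    for name, res in results().items():
        print(name, res)
```

In this sketch, both Turkish folds give (f, t), which is what the expected 
output for "tr_TR" contains. A locale-independent lower(both) gives (t, f), 
which is what the ICU build returned, and a locale-independent upper(both) 
gives (t, t). So the choice between lower and upper matters only when the 
Turkish rules are not applied, at least in this simplified model.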


As Thomas points out, using ucol_strcoll is quick, since no copying is 
needed. I will post some benchmarks soon.

Palle



[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html

