libicu global support

2021-07-11 Thread Jakub Jedelsky
Hi,

during the adoption of CentOS 8 on our servers we ran into sorting
performance problems with PostgreSQL (13.3) and the glibc delivered by
CentOS. Because of that we're planning to use the ICU collations
(en-x-icu), but the current implementation is quite complicated to adopt, as
an ICU collation can't be set globally per cluster (initdb) or when
creating a database.

So, my silly question: is there any chance that work on this could be done
for a new version anytime soon?

There were already some discussions about this some time ago:
https://www.postgresql.org/message-id/flat/3366.1498183854%40sss.pgh.pa.us#3366.1498183...@sss.pgh.pa.us
https://www.postgresql.org/message-id/flat/5e756dd6-0e91-d778-96fd-b1bcb06c161a%402ndquadrant.com

Thank you,

- jj


case insensitive collation of Greek's sigma

2021-11-25 Thread Jakub Jedelsky
Hello,

during our tests of Postgres with ICU we found an issue with ILIKE and the
upper and lowercase sigma (Σ). The letter has two lowercase variants, σ and
ς (the latter at the end of a word). I'm working with the en_US and
en-US-x-icu collations and the results are a bit unexpected, as they are
inverted between the two:

postgres=# SELECT
postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US",
postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US"
postgres-# ;
 ?column? | ?column?
----------+----------
 t        | f
(1 row)

postgres=# SELECT
postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu",
postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu";
 ?column? | ?column?
----------+----------
 f        | t
(1 row)

I ran those commands on the latest (14.1) official Docker image.

Is it possible to unify the behaviour? And which one is correct from the
community's point of view?

If I may start, I think both results are wrong, as both comparisons should
return true. If I got it right, a lower() function runs in the background
to compare the strings, which is not enough for such cases (unless the
left side is taken as a standalone word).
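
To illustrate the difference between lowercasing and case folding, here is
a minimal sketch in Python (not PostgreSQL code). It assumes that
lowercasing one character at a time is a fair stand-in for a towlower()
style comparison and that whole-string lowercasing is a fair stand-in for
a context-aware lowercasing; the helper names are made up for this
example. With those assumptions the first two rows reproduce the two
results above, and only case folding makes both comparisons true.

```python
def lower_per_char(s: str) -> str:
    """Lowercase each character in isolation, with no word context."""
    return "".join(ch.lower() for ch in s)


def lower_whole(s: str) -> str:
    """Lowercase the whole string; a word-final capital sigma becomes 'ς'."""
    return s.lower()


def fold(s: str) -> str:
    """Unicode case folding; both 'σ' and 'ς' fold to 'σ'."""
    return s.casefold()


def same(normalize, left: str, right: str) -> bool:
    """Compare two strings after applying the same normalization to both."""
    return normalize(left) == normalize(right)


if __name__ == "__main__":
    for name, fn in [
        ("per-character lower", lower_per_char),
        ("whole-string lower", lower_whole),
        ("case folding", fold),
    ]:
        print(f"{name:20} {same(fn, 'ΣΣ', 'σσ')!s:6} {same(fn, 'ΣΣ', 'σς')!s:6}")
```

Folding both sides is what makes the comparison independent of where the
sigma sits in the word, because no final-sigma rule is involved at all.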

Thanks,

- jj


Re: case insensitive collation of Greek's sigma

2021-12-02 Thread Jakub Jedelsky
On Wed, Dec 1, 2021 at 8:49 PM Tom Lane wrote:

> Peter Eisentraut writes:
> > Running lower() like this is really the wrong thing to do.  We should be
> > doing "case folding" instead, which normalizes these differences for the
> > purpose of case-insensitive comparisons.
>
> That just begs the question: if tolower (or towlower) isn't the
> appropriate API, what is?  Perhaps ICU has something for a more
> generalized notion of case-similarity, but I'm not aware of any such
> thing in the POSIX API.
>
> BTW, I think it's only accidental that the regex example shown upthread
> gets the right answer.  In that example, what's happening is that we
> consider a letter in a case-insensitive regex to match itself, or
> tolower() of itself, or toupper() of itself.  Both σ and ς have Σ
> as toupper() so they both work.  But if you'd written Σ in the regex,
> only one of σ and ς would match that as a data character.  (Haven't
> actually tested this, but given the way the code works I'm pretty
> sure it's so.)  Again, it's hard to see how to do better atop a POSIX
> locale library.
>

Thanks for digging into the issue.
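
For reference, the matching rule described in the quoted message (a
pattern letter matches itself, its tolower() or its toupper()) can be
modelled in a few lines of Python. This is only a sketch of the rule as
described, not the regex engine's actual code; char_matches is a
hypothetical helper and Python's single-character lower()/upper() stand in
for tolower()/toupper().

```python
def char_matches(pattern_ch: str, data_ch: str) -> bool:
    """A pattern letter matches itself, its lowercase or its uppercase form."""
    candidates = {pattern_ch, pattern_ch.lower(), pattern_ch.upper()}
    return data_ch in candidates


if __name__ == "__main__":
    # Lowercase sigma variants in the pattern, capital sigma in the data:
    print(char_matches("σ", "Σ"), char_matches("ς", "Σ"))
    # Capital sigma in the pattern, lowercase variants in the data:
    print(char_matches("Σ", "σ"), char_matches("Σ", "ς"))
```

In this model both σ and ς in the pattern reach Σ through the uppercase
form, while Σ in the pattern reaches only σ, which is the asymmetry the
quoted message predicts.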

Based on the GNU docs [1], the POSIX APIs are not ready for that. Anyway, is
it possible to keep the current lowercasing behaviour for POSIX as a
fallback and have the correct solution for ICU? I think (not an expert,
though) ICU should have had working code for case folding for some time.

[1] https://www.gnu.org/software/libunistring/
"""
Text files are nowadays usually encoded in Unicode, and may consist of very
different scripts – from Latin letters to Chinese Hanzi –, with many kinds
of special characters – accents, right-to-left writing marks, hyphens,
Roman numbers, and much more. But the POSIX platform APIs for text do not
contain adequate functions for dealing with particular properties of many
Unicode characters. In fact, the POSIX APIs for text have several
assumptions at their base which don't hold for Unicode text.
"""