On Tue, Aug 30, 2016 at 7:46 PM, Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

I'm really happy that you're working on this. This is more important than is widely appreciated, and very important overall. In a world where ICU becomes the de facto standard (i.e. is used on major platforms by default), what remaining barriers are there to documenting and enforcing the binary compatibility of replicas? I may be mistaken, but offhand I can't think of any. Being able to describe exactly what works and what doesn't is very important. After all, failure to adhere to "the rules" today, such as they are, can leave you with a subtly broken replica. I'd like to make that scenario mechanically impossible, by locking everything down.

> I'm not sure how well it will work to replace all the bits of LIKE and
> regular expressions with ICU API calls. One problem is that ICU likes
> to do case folding as a whole string, not by character. I need to do
> more research about that.

My guess is that there are cultural reasons why it wants to operate on a whole string, at least in some cases.

> Also note that ICU locales are encoding-independent and don't support a
> separate collcollate and collctype, so the existing catalog structure is
> not optimal.

That makes more sense to me, personally. ICU very explicitly decouples technical issues (like the representation of strxfrm() keys, and, I gather, encoding) from cultural issues (the actual user-visible behaviors). This allows us to use strxfrm()-style binary keys in indexes directly, since they're versioned independently from their underlying collation. The ICU developers can add a new optimization to strxfrm()-key generation in the next ICU version, and we can detect that and require a REINDEX, even when the collation version itself (the user-visible behavior) is unchanged. I'm getting ahead of myself here, but that does seem very useful.

The Unicode collation algorithm [1] that ICU is directly based on knows plenty about the requirements of indexing. It contains guidance about equivalence vs. equality that we learned the hard way in commit 656beff5, for example.

> Where it gets really interesting is what to do with the database
> locales. They just set the global process locale. So in order to port
> that to ICU we'd need to check every implicit use of the process locale
> and tweak it. We could add a datcollprovider column or something. But
> we also rely on the datctype setting to validate the encoding of the
> database. Maybe we wouldn't need that anymore, but it sounds risky.

Not sure about that.

Whatever we come up with here needs to mesh well with the existing conventions around collation versioning that ICU has, in the context of various operating system packages in particular. We can arrange it so that in practice, an ICU upgrade doesn't often break your indexes due to a collation rule change; ICU is happy to have multiple versions of a collation at a time, and you'll probably retain the old collation version in ICU. Even if your old collation version isn't available in a new ICU release (which I think is unlikely in practice), or you downgrade ICU, it might be possible to give guidance on how to download a "Collation Resource Bundle" [2][3] that *does* have the right collation version, which presumably satisfies the requirement immediately.

Firebird already uses ICU. Maybe we have something to learn from them here. In particular, where do they (by which I mean the ICU version that Firebird links to) get their collations from in practice? I think that the CLDR Data collations were at one time not even distributed with ICU source. It might be a matter of individual OS packagers of ICU deciding what exact CLDR data to use, which may or may not be of any significant consequence in practice.

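To make the versioning idea concrete, here is a rough sketch of the check I have in mind: an index remembers one version stamp for the collation rules and another for the binary sort-key format, and we compare both against what the library reports now. Everything here (CollationStamp, IndexAction, check_index_stamp) is a hypothetical name made up for illustration, not an existing PostgreSQL or ICU interface, and how the two stamps would actually be obtained from ICU is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Four-part version stamp; the width is an assumption for this sketch. */
#define VERSION_PARTS 4

/* Hypothetical per-index record, not an existing catalog structure. */
typedef struct CollationStamp
{
    uint8_t rules_version[VERSION_PARTS];   /* user-visible ordering rules */
    uint8_t sortkey_version[VERSION_PARTS]; /* binary sort-key format */
} CollationStamp;

typedef enum IndexAction
{
    INDEX_OK,                    /* index can be used as-is */
    INDEX_REINDEX_KEYS_ONLY,     /* same ordering, new key format */
    INDEX_REINDEX_RULES_CHANGED  /* ordering itself has changed */
} IndexAction;

static bool
version_equal(const uint8_t *a, const uint8_t *b)
{
    return memcmp(a, b, VERSION_PARTS) == 0;
}

/*
 * Compare the stamp recorded when the index was built against the stamp
 * reported by the currently loaded library.  The sort-key version only
 * matters for indexes that store binary sort keys directly.
 */
IndexAction
check_index_stamp(const CollationStamp *stored,
                  const CollationStamp *current,
                  bool index_stores_sortkeys)
{
    assert(stored != NULL && current != NULL);

    if (!version_equal(stored->rules_version, current->rules_version))
        return INDEX_REINDEX_RULES_CHANGED;

    if (index_stores_sortkeys &&
        !version_equal(stored->sortkey_version, current->sortkey_version))
        return INDEX_REINDEX_KEYS_ONLY;

    return INDEX_OK;
}
```

The point of keeping the two stamps separate is that an index holding only the original strings survives a sort-key format change, while an index holding binary keys does not.
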
[1] http://unicode.org/reports/tr10
[2] http://site.icu-project.org/design/size/collation
[3] http://userguide.icu-project.org/icudata

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers