As I pointed out a couple of times already [1], we don't currently sanitize ICU's BCP 47 language tags within CREATE COLLATION. As things stand, CREATE COLLATION will accept literally any string as a language tag, even when the string is unambiguously bogus. While I accept that there are limits on how far you can take sanitizing the BCP 47 tag format, due to its extensibility and "best effort" emphasis on forward and backward compatibility, we can and should do more here, IMHO. We should at least do the bare minimum, which has no possible downside and some notable upsides.

If I hack the CREATE COLLATION code to put any language tag string provided by the user through the same sanitization process that initdb already puts ICU language tags through, then we do much better. CREATE COLLATION rejects syntax errors, which seems desirable:

postgres=# CREATE COLLATION test1 (provider = icu, locale = 'en-x-icu');
CREATE COLLATION
postgres=# CREATE COLLATION test2 (provider = icu, locale = 'foo bar baz');
ERROR: XX000: could not convert locale name "foo bar baz" to language tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION: get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test3 (provider = icu, locale = 'en-gb-icu');
ERROR: XX000: could not convert locale name "en-gb-icu" to language tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION: get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test4 (provider = icu, locale = 'not-a-country');
CREATE COLLATION

(To be more specific, I'm calling get_icu_language_tag()/uloc_toLanguageTag() [2] as an extra step for CREATE COLLATION here.)

It's not like the current behavior is a disaster, or that the alternative behavior that I propose is perfect. The collation behavior you end up with today, having specified a language tag with a syntax error, is the totally generic base DUCET collation behavior. Using 'foo bar baz' is effectively the same as using the preinstalled 'und-x-icu' collation, which I'm pretty sure is the same as using any current English locale anyway. That said, falling back on the most useful collation behavior based on inferences about the language tag is supposed to be about rolling with the complexities of internationalization, like political changes that are not yet accounted for by the CLDR/ICU version, and so on. It's no justification for not letting the user know when they've fat-fingered a language tag, which they could easily miss. These are *syntax* errors.
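
To make the "syntax error" distinction concrete, here is a minimal standalone sketch of a well-formedness check against the langtag grammar in RFC 5646 (BCP 47). It is not ICU's implementation and not the code path in get_icu_language_tag(): the function name langtag_is_well_formed() and the MAX_SUBTAGS cap are hypothetical, grandfathered tags are deliberately ignored, and no subtag is looked up in any registry. uloc_toLanguageTag() also accepts ICU locale ID forms that this sketch does not model.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SUBTAGS 32          /* arbitrary cap chosen for this sketch */

static bool is_alpha(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
static bool is_digit(char c) { return c >= '0' && c <= '9'; }

/* True if the first n bytes of s are all ASCII letters. */
static bool all_alpha(const char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!is_alpha(s[i])) return false;
    return true;
}

/* True if the first n bytes of s are all ASCII digits. */
static bool all_digit(const char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!is_digit(s[i])) return false;
    return true;
}

/* True if the subtag is the private-use singleton "x". */
static bool is_x(const char *s, size_t n)
{
    return n == 1 && (s[0] == 'x' || s[0] == 'X');
}

/*
 * Hypothetical helper: checks syntax only, approximating the RFC 5646
 * "langtag" and "privateuse" productions. Grandfathered tags are rejected.
 */
bool langtag_is_well_formed(const char *tag)
{
    const char *sub[MAX_SUBTAGS];
    size_t      len[MAX_SUBTAGS];
    size_t      n = 0, i = 0;
    const char *p = tag;

    /* Split on '-'; every subtag must be 1 to 8 ASCII alphanumerics. */
    for (;;)
    {
        size_t l = strcspn(p, "-");

        if (n == MAX_SUBTAGS || l < 1 || l > 8)
            return false;
        for (size_t k = 0; k < l; k++)
            if (!is_alpha(p[k]) && !is_digit(p[k]))
                return false;
        sub[n] = p;
        len[n] = l;
        n++;
        if (p[l] == '\0')
            break;
        p += l + 1;
    }

    if (!is_x(sub[0], len[0]))
    {
        /* language: 2 to 8 letters */
        if (len[0] < 2 || !all_alpha(sub[0], len[0]))
            return false;
        i = 1;
        /* up to three 3-letter extlangs, only after a 2-3 letter language */
        if (len[0] <= 3)
            for (int k = 0; k < 3 && i < n && len[i] == 3 && all_alpha(sub[i], 3); k++)
                i++;
        /* optional script: 4 letters */
        if (i < n && len[i] == 4 && all_alpha(sub[i], 4))
            i++;
        /* optional region: 2 letters or 3 digits */
        if (i < n && ((len[i] == 2 && all_alpha(sub[i], 2)) ||
                      (len[i] == 3 && all_digit(sub[i], 3))))
            i++;
        /* variants: 5-8 alphanumerics, or 4 starting with a digit */
        while (i < n && (len[i] >= 5 || (len[i] == 4 && is_digit(sub[i][0]))))
            i++;
        /* extensions: a singleton other than "x", then 1+ subtags of 2-8 chars */
        while (i < n && len[i] == 1 && !is_x(sub[i], len[i]))
        {
            i++;
            if (i >= n || len[i] < 2)
                return false;
            while (i < n && len[i] >= 2)
                i++;
        }
    }

    /* optional private use: "x", then at least one subtag (already checked) */
    if (i < n && is_x(sub[i], len[i]))
    {
        if (i + 1 >= n)
            return false;
        i = n;
    }

    return i == n;
}
```

For the four strings in the session above, this sketch accepts 'en-x-icu' and 'not-a-country' and rejects 'foo bar baz' and 'en-gb-icu', which matches which statements succeeded and which failed. Under this grammar, 'not-a-country' parses as a 3-letter language followed by an extension introduced by the singleton "a", which is one plausible reason a purely syntactic check lets it through; whether ICU reasons the same way is an assumption.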

At one point a couple of months back, it was understood that get_icu_language_tag() might not always work with (assumed) valid locale names -- that is at least the impression that the commit message of eccead9 left me with. But, that was only with ICU 4.2, and in any case we've since stopped creating keyword variants at initdb time for other reasons (see 2bfd1b1 for details of those other reasons). I tend to think that we should not install any language tag that uloc_toLanguageTag() does not accept as valid on general principle (so not just at initdb time, when it's actually least needed).

Thoughts? I can write a patch for this, if that helps. It should be straightforward.

[1] postgr.es/m/cah2-wzm22vtxvd-e1oz90de8z_m61_8amhsdozf1pwrkfrm...@mail.gmail.com
[2] https://ssl.icu-project.org/apiref/icu4c/uloc_8h.html#a1d50c91925ca3853fce6f28cf7390c3c

--
Peter Geoghegan