On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote:
> > Would it be correct to interpret LC_COLLATE=C.UTF-8 as
> > LC_COLLATE=C, but leave LC_CTYPE=C.UTF-8 as-is?
>
> Yes. The basic idea, at least for these two OSes, is that every
> category behaves as if set to C, except LC_CTYPE.

If that's true, and we version C.UTF-8, then users could still get the
behavior they want, a stable collation order, and benefit from the
optimized code path by setting LC_COLLATE=C and LC_CTYPE=C.UTF-8.

The only caveat is to be careful with things that depend on ctype in
indexes and constraints. While still a problem, it's a smaller problem
than unversioned collation. We should think a little more about solving
it, because I think there's a strong case to be made that a default
collation of C and a database ctype of something else is a good
combination (it makes less sense for a case-insensitive collation, but
those aren't allowed as a default collation).

In any case, we're better off following the rule "version anything that
goes to any external provider, period". And by "version", I really mean
a best effort, because we don't always have great information, but I
think it's better to record what we do have than not. We have just seen
too many examples of weird behavior. On top of that, it's simply
inconsistent to assume that C=C.UTF-8 for collation version, but not
for the collation implementation.

Users might get frustrated that the collation for C.UTF-8 is versioned,
of course. But I don't think it will affect anyone for quite some time,
because existing users will have a datcollversion=NULL; so they won't
get the warnings until they refresh the versions (or create new
collations/databases), and then after that upgrade libc. Right? So they
should have time to adjust to use LC_COLLATE=C if that's what they
want.

An alternative would be to define lc_collate_is_c("C.UTF-8") == true
while leaving lc_ctype_is_c("C.UTF-8") == false and
get_collation_actual_version("C.UTF-8") == NULL. In that case we would
not be passing it to an external provider, so we don't have to version
it. But that might be a little too magical and I'm not inclined to do
that.

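To make that alternative concrete, here is a rough standalone sketch of
the three answers it would give. The sketch_* functions are
hypothetical stand-ins I made up for illustration, not the backend
functions named above, and the non-NULL version string is only a
placeholder for whatever a provider would report.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the collation check: under the alternative,
 * "C.UTF-8" is treated like "C" for collation purposes.
 */
static bool
sketch_lc_collate_is_c(const char *locale)
{
    return strcmp(locale, "C") == 0 ||
           strcmp(locale, "POSIX") == 0 ||
           strcmp(locale, "C.UTF-8") == 0;
}

/*
 * Hypothetical stand-in for the ctype check: "C.UTF-8" is NOT treated
 * like "C", so ctype behavior would still come from the provider.
 */
static bool
sketch_lc_ctype_is_c(const char *locale)
{
    return strcmp(locale, "C") == 0 ||
           strcmp(locale, "POSIX") == 0;
}

/*
 * Hypothetical stand-in for the version lookup: NULL means "nothing to
 * record", because collation never reaches an external provider.
 */
static const char *
sketch_collation_version(const char *locale)
{
    if (sketch_lc_collate_is_c(locale))
        return NULL;
    return "provider-reported-version";   /* placeholder only */
}
```

The asymmetry between the first two functions is the "magical" part:
the same locale name gets the internal path for collation but the
provider path for ctype.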
Another alternative would be to implement C.UTF-8 internally according
to the "true" semantics, if they are truly simple and well-defined and
stable. But I don't think ctype=C.UTF-8 is actually stable because new
characters can be added, right?

Regards,
	Jeff Davis