Re: locale specific ordering in EN_US vs. characterset collation rules for UTF-8

Paolo Bonzini Fri, 28 Jun 2013 02:51:05 -0700

Il 28/06/2013 07:04, Linda Walsh ha scritto:
> 
> 
> Chet Ramey wrote:
>> The world is larger than glibc and the glibc locale definitions.  We need
>> a solution that encompasses all of it.  That solution should, and maybe
>> will, include glibc, but that is not sufficient by itself.
> ----
>     I don't suppose it is possible to use the Unicode
> collation order when using unicode?


When matching regular expressions, people usually want case to be
significant; for example, [A-E] should exclude lowercase a/b/c/d/e.
Unfortunately, that is not how collation treats case elsewhere.  The
Unicode collation standard in fact says ("1.1 Multi-Level Comparison"):

   Case differences (uppercase versus lowercase), are typically
   ignored, if the base letters or their accents differ

So for example "ABC" < "def" _and_ "abc" < "DEF" (whether "ABC" sorts
before or after "abc" is customizable, see "6.6 Case Comparisons").
When working with files, people often want "ABC < DEF < abc < def",
which is a different order.
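
To make the two orders concrete, here is a minimal Python sketch.  It
does not implement the Unicode collation algorithm; it only
approximates the multi-level idea with a two-part sort key (casefolded
letters first, raw string as tie-breaker), and the word list is made up
for illustration.

```python
# Hypothetical sample data; any mix of upper- and lowercase words would do.
words = ["def", "ABC", "abc", "DEF"]

def multilevel_key(s):
    # First level: the letters with case ignored (approximated by casefold()).
    # Tie-breaker: the raw string, so case only matters when the letters match.
    return (s.casefold(), s)

# Case is a minor difference: both "ABC" and "abc" sort before "DEF" and "def".
uca_like = sorted(words, key=multilevel_key)

# Plain code point order: all uppercase words sort before all lowercase ones.
codepoint = sorted(words)

print(uca_like)
print(codepoint)
```

The second list is the "ABC < DEF < abc < def" order that people often
expect for file names.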

So it boils down to different weighting of case, not to the choice of
collation algorithm.

Now, GNU libc implements two different collation orders: a weight-based
one similar to the Unicode algorithm, and "collating element order"
(CEO).  Collating element order is simply the order in which the locale
definitions specify the collating elements, and it is only used for
regular expression matching and globbing.  It need not be portable and
need not match any international standard for languages.

Modifying collating element order to weight case more heavily would make
the GNU libc regular expression matcher fit the above needs better, and
the order could at the same time be fine-tuned to various scripts and
encodings.
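
As a rough illustration of why the order matters for ranges, the sketch
below models a collating element order as a plain string and a range as
a slice of it.  Both orders and the helper ceo_range are hypothetical;
they are not taken from any real glibc locale definition.

```python
def ceo_range(order, lo, hi):
    """Return the collating elements from lo to hi, inclusive, in this order."""
    start, end = order.index(lo), order.index(hi)
    return order[start:end + 1]

# Hypothetical order that interleaves lowercase and uppercase letters.
INTERLEAVED = "aAbBcCdDeE"

# Hypothetical order that weights case more heavily: all uppercase first.
CASE_FIRST = "ABCDEabcde"

print(ceo_range(INTERLEAVED, "A", "E"))  # AbBcCdDeE: lowercase b-e leak in
print(ceo_range(CASE_FIRST, "A", "E"))   # ABCDE: uppercase only
```

With the case-heavy order, [A-E] excludes lowercase letters, which is
what people usually expect from a regular expression.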

Note that there are some problems with CEO that cannot be solved just by
tweaking the order.  For example, [a-e] would match either "à" or "è"
(the order could be something like "aàbcdeè" or "àabcdèe", respectively)
but not both.  This would have to be solved differently.  For example,
one could include all symbols in the same equivalence class as one of
the endpoints, transforming "[a-e]" into "[a-e[=a=][=e=]]".
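
The problem and the proposed fix can be sketched in Python as follows.
This is only a model: the two orders are the ones given above, while the
equivalence-class table and the function names are assumptions made for
the example, not glibc data structures.

```python
def ceo_range(order, lo, hi):
    """Set of collating elements from lo to hi, inclusive, in this order."""
    start, end = order.index(lo), order.index(hi)
    return set(order[start:end + 1])

def ceo_range_with_classes(order, lo, hi, classes):
    """Like ceo_range, but also add the equivalence class of each endpoint,
    i.e. treat [lo-hi] as [lo-hi[=lo=][=hi=]]."""
    members = ceo_range(order, lo, hi)
    for endpoint in (lo, hi):
        members |= classes.get(endpoint, {endpoint})
    return members

ORDER_1 = "aàbcdeè"   # accented letter after its base letter
ORDER_2 = "àabcdèe"   # accented letter before its base letter

# Hypothetical equivalence classes for the two endpoints.
CLASSES = {"a": {"a", "à"}, "e": {"e", "è"}}

for order in (ORDER_1, ORDER_2):
    plain = ceo_range(order, "a", "e")
    fixed = ceo_range_with_classes(order, "a", "e", CLASSES)
    print(order, "à" in plain, "è" in plain, "à" in fixed, "è" in fixed)
```

Under either order the plain range contains only one of the two accented
letters, while the expanded range contains both.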

Paolo

> algorithm reference: http://www.unicode.org/reports/tr10/tr10-24.html
> 
> Collation order chart:
> http://www.unicode.org/Public/UCA/latest/allkeys.txt
> 
> How does one get UTF-8 collation order?
> 
> I would think that a character specific ordering specified
> in LC_COLLATE would take precedence over a less specific regional ordering.
> 
> I.e., LC_COLLATE="XXX.UTF-8" -- Seems like it should use the UTF-8 rules
> over the XXX rules for COLLATION.  If they wanted regional rules,
> then "XXX" alone without specifying an international standard like unicode,
> would allow regional rules to take precedence.
> 
> But if they specify a specific character encoding for the characters,
> under collation, why wouldn't the character set's collation order be used?
> 
> So how does one get UTF-8's Unicode collation ordering?
> 
> 
>
