On 28/06/2013 07:04, Linda Walsh wrote:
>
>
> Chet Ramey wrote:
>> The world is larger than glibc and the glibc locale definitions.  We need
>> a solution that encompasses all of it.  That solution should, and maybe
>> will, include glibc, but that is not sufficient by itself.
> ----
>     I don't suppose it is possible to use the Unicode
> collation order when using unicode?

When matching regular expressions, people usually want to treat case
specially; for example, [A-E] should exclude lowercase a/b/c/d/e.
Unfortunately, this is not what happens when collating other things.
The Unicode collation standard in fact says ("1.1 Multi-Level Comparison"):

    Case differences (uppercase versus lowercase), are typically ignored,
    if the base letters or their accents differ

So for example "ABC" < "def" _and_ "abc" < "DEF" (whether "ABC" is < or >
than "abc" is customizable, see "6.6 Case Comparisons").

When working with files, people often want "ABC < DEF < abc < def", which
is a different order.  So it boils down to a different weighting of case,
not to the choice of collation algorithm.

Now, GNU libc implements two different collation orders: a weight-based
one similar to the Unicode algorithm, and "collating element order" (CEO).
Collating element order is simply the order in which the locale
definitions specify the collating elements, and it is only used for
regular expression matching and globbing.  It need not be portable and
need not match any international standard for languages.

Modifying collating element order to weigh case more heavily would make
the GNU libc regular expression matcher fit the above needs better, and
the order could at the same time be fine-tuned to various scripts and
encodings.

Note that there are some problems with CEO that cannot be solved just by
tweaking the order.  For example, [a-e] would match either "à" or "è"
(the order could be something like "aàbcdeè" or "àabcdèe", respectively)
but not both.  This would have to be solved differently.  For example,
one could include all symbols in the same equivalence class as one of
the endpoints, transforming "[a-e]" into "[a-e[=a=][=e=]]".

Paolo

> algorithm reference: http://www.unicode.org/reports/tr10/tr10-24.html
>
> Collation order chart:
> http://www.unicode.org/Public/UCA/latest/allkeys.txt
>
> How does one get UTF-8 collation order?
>
> I would think that a character specific ordering specified
> in LC_COLLATE would take precedence over a less specific regional ordering.
>
> I.e. LC_COLLATE="XXX.UTF-8" -- Seems like it should use the UTF-8 rules
> over the XXX rules for COLLATION.  If they wanted regional rules,
> then "XXX" alone without specifying an international standard like unicode,
> would allow regional rules to take precedence.
>
> But if they specify a specific character encoding for the characters,
> under collation, why wouldn't the character set's collation order be used?
>
> So how does one get UTF-8's Unicode collation ordering?
>
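
As a rough illustration of the two points in the reply above, here is a toy
sketch in Python.  It is not how glibc implements collation or bracket
expressions; the function names, the two sample orders and the two-element
equivalence classes are all made up for the example.  The first part
contrasts an ordering where case only breaks ties with one where case
dominates; the second part shows why [a-e] misses either "à" or "è" under
a single collating element order, and how adding the endpoints'
equivalence classes would cover both.

```python
# Toy model only: names, orders and equivalence classes are assumptions.

WORDS = ["def", "ABC", "abc", "DEF"]

def case_secondary_key(s):
    """Compare base letters first; use case only to break ties."""
    return (s.lower(), s)

def sort_case_secondary(words):
    # Roughly the multi-level idea: "ABC" < "def" and "abc" < "DEF".
    return sorted(words, key=case_secondary_key)

def sort_case_primary(words):
    # Plain code point order: for ASCII letters, uppercase sorts first.
    return sorted(words)

def range_members(order, lo, hi):
    """Elements from lo to hi (inclusive) in a given element order."""
    return set(order[order.index(lo):order.index(hi) + 1])

def range_with_equivalents(order, lo, hi, classes):
    """Same range, plus every element equivalent to either endpoint,
    in the spirit of rewriting [a-e] as [a-e[=a=][=e=]]."""
    members = range_members(order, lo, hi)
    for cls in classes:
        if lo in cls or hi in cls:
            members |= set(cls)
    return members

CLASSES = ["aà", "eè"]  # hypothetical equivalence classes

if __name__ == "__main__":
    print(sort_case_secondary(WORDS))
    print(sort_case_primary(WORDS))
    for order in ("aàbcdeè", "àabcdèe"):
        plain = range_members(order, "a", "e")
        fixed = range_with_equivalents(order, "a", "e", CLASSES)
        print(order, "à" in plain, "è" in plain, "à" in fixed, "è" in fixed)
```

With the first sample order the plain range contains "à" but not "è"; with
the second it is the other way round.  The variant that adds the
equivalence classes contains both in either case.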