On Fri, Feb 26, 2010 at 12:38:08AM -0600, Jonathan Nieder wrote:
> Computers are dumb
> ------------------
>
> Andras wrote:
>
> > 1. grep has no way of knowing whether a "zs" sequence is a "single letter"
> > or two letters, because the combination can occur in compound words without
> > becoming a "zs" letter; for example, in "fúvószenekar" ("fúvós" +
> > "zenekar"), it's simply an "s" and a "z" letter next to each other. There
> > may even exist words that make (a different) sense either way, but I can't
> > think of any right now.
>
> Are there simple heuristics that would make this condition easy to
> discover? For example, vowels that would never appear before a true
> "sz" letter, things like that? I am just curious; please feel free to
> e-mail me privately about this.
>
> This sounds like a (hard to fix) bug in the collation algorithm, but
> not a reason not to make 'sort' follow the conventions of the language.

Sorting is actually also tricky with dumb computers, because there is no
way for sort to know whether e.g. "nyolcszáz" contains a "cs" collating
symbol followed by "z" or a "c" followed by an "sz" collating symbol (the
latter is in fact the case). cs+z would be sorted after "cz" (because
"cs" comes after "c"), but c+sz would be sorted _before_ "cz" because
"sz" precedes "z".

I'd say this is unfixable. There is no way, short of understanding the
natural language, for a program to determine whether two (or three)
characters represent a single collating symbol or themselves.

Clearly, the Hungarian language must be fixed, either by introducing
separate glyphs for the composite "letters", or by no longer insisting
that something represented by more than one character is "a single
letter". :)

Andras

-- 
If Chuck Norris had been Spartan, the movie would simply have been called 1.
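
For anyone who wants to see the cs+z versus c+sz problem concretely, here
is a minimal Python sketch. Everything in it is an assumption made for
illustration: the ALPHABET table is a simplified ranking of the Hungarian
letters (it gives accented vowels their own rank, which real collation
rules do not do in this way), greedy_segment is a hypothetical
longest-match tokenizer and not what any real sort or libc does, and
"nyolcz" is a made-up string used only as a comparison point.

```python
# Hungarian letters in conventional order; digraphs and the trigraph "dzs"
# count as single letters. Simplification (assumption): accented vowels get
# their own rank here, which is not how real Hungarian collation works.
ALPHABET = [
    "a", "á", "b", "c", "cs", "d", "dz", "dzs", "e", "é", "f", "g", "gy",
    "h", "i", "í", "j", "k", "l", "ly", "m", "n", "ny", "o", "ó", "ö", "ő",
    "p", "q", "r", "s", "sz", "t", "ty", "u", "ú", "ü", "ű", "v", "w", "x",
    "y", "z", "zs",
]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def sort_key(letters):
    """Map an already-segmented word to a list of alphabet ranks."""
    return [RANK[letter] for letter in letters]


def greedy_segment(word):
    """Hypothetical naive tokenizer: always take the longest known letter."""
    letters, i = [], 0
    while i < len(word):
        for size in (3, 2, 1):
            chunk = word[i:i + size]
            if chunk in RANK:
                letters.append(chunk)
                i += len(chunk)
                break
        else:
            raise ValueError(f"unknown character at position {i}: {word[i]!r}")
    return letters


# What a program without knowledge of the language would guess: cs + z.
guessed = greedy_segment("nyolcszáz")
# What a Hungarian speaker knows ("nyolc" + "száz"): c + sz.
correct = ["ny", "o", "l", "c", "sz", "á", "z"]
# Made-up comparison string ending in plain c + z.
reference = ["ny", "o", "l", "c", "z"]

print(guessed)                                   # ['ny', 'o', 'l', 'cs', 'z', 'á', 'z']
print(sort_key(guessed) > sort_key(reference))   # True: cs+z sorts after "cz"
print(sort_key(correct) < sort_key(reference))   # True: c+sz sorts before "cz"
```

The two segmentations put the same byte string on opposite sides of the
reference, and nothing in the characters themselves says which one is
right; the tokenizer would need a dictionary or morphological knowledge
to choose.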