Paul Eggert wrote:
>>>>Using strcoll is inefficient anyway
>>>
>>>Don't we know it!  If we can avoid it, we'd like to.
>>
>>Well, the mbstowcs+wcscoll solution I presented
>>should be equivalent to strcoll on any platform,
>>and it's much faster in my tests.
>
>
> That's good to know, though I'm puzzled as to why it's true.  For a
> single comparison, can't strcoll typically return an answer without
> examining all the input, and wouldn't that be faster than
> mbstowcs+wcscoll?
>
> But if it is true, perhaps we should rewrite memcoll to use the
> mbstowcs+wcscoll combination as well.
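
For reference, the mbstowcs+wcscoll combination under discussion can be
sketched roughly as below. This is only a minimal illustration, not the
code used in my tests: the name wcs_compare is hypothetical, and the
fallback to strcoll on invalid input or allocation failure is an
assumption made for the sketch, not something settled in this thread.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: compare two NUL-terminated multibyte strings
   by converting both to wide strings and calling wcscoll.
   Assumption for this sketch: fall back to strcoll if either string
   is not valid in the current locale's encoding, or if allocation
   fails.  */
static int
wcs_compare (char const *a, char const *b)
{
  /* A multibyte string of n bytes converts to at most n wide
     characters, so n + 1 elements always hold the result plus
     the terminating null.  */
  size_t alen = strlen (a) + 1;
  size_t blen = strlen (b) + 1;
  wchar_t *wa = malloc (alen * sizeof *wa);
  wchar_t *wb = malloc (blen * sizeof *wb);
  int result;

  if (wa && wb
      && mbstowcs (wa, a, alen) != (size_t) -1
      && mbstowcs (wb, b, blen) != (size_t) -1)
    result = wcscoll (wa, wb);   /* locale-aware wide comparison */
  else
    result = strcoll (a, b);     /* invalid sequence or out of memory */

  free (wa);
  free (wb);
  return result;
}
```

A caller would run setlocale (LC_ALL, "") first and then use
wcs_compare (x, y) wherever strcoll (x, y) was used. Note that both
strings are converted in full before any comparison starts, which is
the cost raised in the question above.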

I missed out a test case in my performance runs: same-length lines
with random data (where strcoll can break out early).
I'll run that and comment more.
I was also using the string length comparison shortcut on the wide
string. I'm unsure whether this is valid (on all platforms).

>>>>but it probably is possible in ICU?
>>>
>>>Sorry, don't know.
>>
>>I wonder could we add this as a dependency?
>
>
> You mean, ship ICU code?  Or depend on it already being installed?

Probably ship it?

> Sorry, I'm not familiar with the ICU code.  Is it free software and is
> it well maintained?  Where else is it being used, outside ICU itself?

I am not familiar with it myself, but note that it's used
for various things in Python, Mozilla, OpenOffice, ...

>>Also I don't agree with splitting entities into
>>valid multibyte ranges and "C" for the rest.
>>That is probably not what the user wants the data interpreted as,
>>and I think (at least for uniq which I've thought about),
>>that it's just best to treat the whole entity as "C"
>>if there are invalid multibyte sequences in the entity.
>
>
> We can't adopt this approach in general, since it would mean that our
> comparison operation could return inconsistent answers.  Suppose "Y"
> has an invalid byte sequence but "X" and "Z" are valid.  Then we might
> have "X" < "Y" < "Z" (using C-locale comparison), but "Z" < "X" (using
> some other locale's comparison).  This will lead to inconsistencies,
> which will be hard to document and will confuse users.

Garbage In Garbage Out.
As for confusing users, my solution was to print
a warning indicating the invalid input.

> Worse, it can
> even lead to buffer overruns: e.g., qsort has undefined behavior if
> you pass it a comparison function that is not a total order.

Thanks for pointing that out. I'll look into that.

cheers,
Pádraig.

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils