A couple of weeks ago in this thread, Philippe Verdy said:

> Breaking on words, even if it requires a very modest buffering,
> will significantly improve the processing time, 
> because each word in the long texts will be scanned only 
> once, and all the rest will occur within the small and 
> constantly reused buffer.
...
> I don't forget that in most practical cases, sorts will operate 
> on texts whose collation keys have been only partly 
> generated and truncated, because they really speed up and 
> reduce the number of compares to perform  ...

and so on.

Rather than continuing the discussion with a back and forth in
email, I decided to write a Unicode Technical Note
on the general topic, including a case study of alternative
orderings for a French topic list.

Those who are interested in collation and in the particular issues
that were discussed in this thread may wish to take a look:

http://www.unicode.org/notes/tn34/

--Ken

