A couple of weeks ago, in this thread, Philippe Verdy said:

> Breaking on words, even if it requires a very modest buffering,
> will significantly improve the processing time,
> because each word in the long texts will be scanned only
> once, and all the rest will occur within the small and
> constantly reused buffer.
...
> I don't forget that in most practical cases, sorts will operate
> on texts whose collation keys have been only partly
> generated and truncated, because they really speed up and
> reduce the number of compares to perform ...

and so on.

Rather than continuing the discussion with a back and forth in email, I decided to write a Unicode Technical Note on the general topic, including a case study of alternative orderings for a French topic list.

Those who are interested in collation and in the particular issues that were discussed in this thread may wish to take a look:

http://www.unicode.org/notes/tn34/

--Ken