I'm not sure if I am reading the Sword code correctly, but it appears that it is sorting at a byte level and not a character level. That isn't by code points.
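
To make the difference concrete, here is a minimal sketch (not the SWORD or JSword code) that sorts the same keys two ways: by raw UTF-8 bytes, and with a locale-aware collator. The class and method names (KeyOrderDemo, sortByBytes, sortByCollation) are made up for illustration, and I am using the JDK's java.text.Collator as a stand-in for ICU's collator, which offers far more control. As a side note, when keys are UTF-8 and the bytes are compared as unsigned values, byte order and code point order coincide; either way the result is not what a reader of the language expects.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; none of these names come from SWORD or JSword.
class KeyOrderDemo {

    // Compare two keys by their UTF-8 bytes, treating each byte as unsigned.
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // All shared bytes are equal, so the shorter key sorts first.
        return x.length - y.length;
    }

    // Return a copy of the keys in byte order.
    static List<String> sortByBytes(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(KeyOrderDemo::compareUtf8Bytes);
        return sorted;
    }

    // Return a copy of the keys in the collation order of the given locale.
    static List<String> sortByCollation(List<String> keys, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // Treat precomposed and decomposed accented letters alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        // "\u00e9dition" is "edition" with an acute accent on the first letter.
        List<String> keys = Arrays.asList(
                "zebra", "Zebra", "apple", "\u00e9dition", "eagle");
        System.out.println("byte order:      " + sortByBytes(keys));
        System.out.println("collation order: " + sortByCollation(keys, Locale.FRENCH));
    }
}
```

In byte order the upper-case "Zebra" comes first and the accented word lands after "zebra", which is the behaviour Frank describes in the quoted message. With the collator, the accented word sorts among the other words beginning with "e".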

I think that we discussed this a little while ago and concluded that some work needs to be done in the engine.

Here is my thought on the matter, for what it is worth. Today the sort serves two purposes: order and search. But it is search that constrains the order to be as it is. I think it would be ideal if we could search independently of the order of keys in the module.

One simple way for any application to provide this is to create a Lucene index for the dictionary, similar to what we do for a Bible, that consists of the term (stored and indexed), the offset in the module (stored, so that the entry can be retrieved and the previous and next indexes can be found), and the entry for the term (indexed, but not stored). The application can then create any kind of collation of the keys (using the excellent facilities of ICU) that suits the user's needs. Then, using this double handle, it can present the keys in part (as in BibleCS) or in whole (as in BibleDesktop, MacSword, ...) in the order that the user expects.

There are some related problems:

A user may expect to be able to find a Hebrew word in a Hebrew dictionary independent of the pointing of the word in the dictionary (i.e. a user may wish to search without specifying accents).

A user may expect to find a word by stem, not just by prefix.

A user may expect to be able to type "photos" (a transliteration) and find the real Greek word in a Greek dictionary.

I'm cross-posting to J-Sword because this will be of interest there as well.

In His Service,
DM Smith

On Oct 28, 2007, at 9:13 PM, Frank wrote:

> peter wrote:
>> Is this really only a Vietnamese problem, but will not all latinate
>> scripts with extra signs have exactly the same problem?
>>
>> Or actually all scripts which are treated as derived scripts -
>> Farsi, Urdu and Malay from Arabic, Tajik, Uzbek, Azeri from Russian
>> etc - the code points are initially for the "main" characters and
>> then there is always a bunch of extra characters which are used only
>> in one or other language.
>>
>> But maybe I am just showing my ignorance here. I need to look at some
>> dictionaries - never had any installed.
> Any language that uses letters outside the ASCII range will be
> affected unless they collate the letter after "z"... and if it's
> strictly in Unicode point order, then all upper case will collate
> before lower case...
>
> --
> Blessings
>
> Frank
>
> _______________________________________________
> sword-devel mailing list: sword-devel@crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page