I've been chasing up the reason why udmsearch does not index non-English text too well, and after a chat with the developer it all comes down to charsets.
English basically has a word charset of [A-Za-z0-9]: easy stuff, and all 7-bit. But other languages have other charsets. The charsets I already have are:

  Cyrillic:       cp1251, koi8r, cp866, iso88595, maccyr
  Western:        iso-8859-1
  Central Europe: iso-8859-2, cp1250
  Arabic:         cp1256

OK, so all charsets include the ASCII [A-Za-z0-9], but what I need to know from the translators is the charset for their language. I need the upper case characters first, then the lower case ones. If a character has no upper/lower equivalent, then put it in twice. If the language has no concept of upper/lower at all, then just include the set once and let me know it doesn't have upper/lower.

The format is flexible:

  characters in their weird form:
    "áâ÷çäå³öúéêëìíîïðòóôõæèãþûýÿùøüàñÁÂ×ÇÄÅ£ÖÚÉÊËÌÍÎÏÐÒÓÔÕÆÈÃÞÛÝßÙØÜÀÑ"

  characters in their decimal equivalents:
    193,195,194,196,161,198,200,199,207,201,204,203,202,208,205,206,197

  characters in their hex equivalents:
    0x8d, 0x8e, 0x90, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7

As long as I can do something like

  char mycharset[] = <your stuff>;

in C then I'm happy.

I will pass these charsets on to upstream for inclusion in udmsearch proper. I'll try to make sure you get acknowledged (include whatever email address you want in there).

Hope it isn't too much trouble, but it will mean that udmsearch will index in your language very nicely.

For the dual-byte folks, I don't think this will work. The upstream author is willing to work with you, but he's not sure how to do it. Actually it may work... if you put both bytes into the charset. It depends on what your whitespace looks like.

 - Craig

-- 
Craig Small VK2XLZ  GnuPG:1C1B D893 1418 2AF4 45EE 95CB C76C E5AC 12CA DFA5
Eye-Net Consulting http://www.eye-net.com.au/ <[EMAIL PROTECTED]>
MIEEE <[EMAIL PROTECTED]>  Debian developer <[EMAIL PROTECTED]>
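
To make the requested format concrete, here is a rough sketch of how a charset given as "upper case first, then lower case" could be turned into lookup tables. This is not udmsearch's actual code: the names is_word, to_lower, build_tables and first_word_len are made up for illustration, the six sample bytes are ISO-8859-1 letters chosen only as an example, and the ASCII loops assume an ASCII-compatible machine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sample charset: three ISO-8859-1 letters, upper case
   first (0xC0 0xC9 0xD6), then the matching lower case in the same
   order (0xE0 0xE9 0xF6). */
static const unsigned char mycharset[] = {
    0xC0, 0xC9, 0xD6,
    0xE0, 0xE9, 0xF6
};

static unsigned char is_word[256];   /* 1 if the byte belongs to a word */
static unsigned char to_lower[256];  /* case-folding map, byte -> byte  */

/* Fill both tables from a charset laid out as n/2 upper case bytes
   followed by n/2 lower case bytes. */
static void build_tables(const unsigned char *set, size_t n)
{
    size_t half = n / 2;
    size_t i;
    int c;

    assert(n % 2 == 0);  /* upper and lower halves must pair up */

    memset(is_word, 0, sizeof is_word);
    for (c = 0; c < 256; c++)
        to_lower[c] = (unsigned char)c;  /* default: maps to itself */

    /* Every charset includes the ASCII [A-Za-z0-9]. */
    for (c = '0'; c <= '9'; c++)
        is_word[c] = 1;
    for (c = 'a'; c <= 'z'; c++)
        is_word[c] = 1;
    for (c = 'A'; c <= 'Z'; c++) {
        is_word[c] = 1;
        to_lower[c] = (unsigned char)(c - 'A' + 'a');
    }

    /* Language-specific characters: pair upper[i] with lower[i]. */
    for (i = 0; i < half; i++) {
        is_word[set[i]] = 1;
        is_word[set[half + i]] = 1;
        to_lower[set[i]] = set[half + i];
    }
}

/* Length in bytes of the run of word characters at the start of text. */
static size_t first_word_len(const unsigned char *text)
{
    size_t n = 0;

    while (text[n] != '\0' && is_word[text[n]])
        n++;
    return n;
}
```

After calling build_tables(mycharset, sizeof mycharset), a byte such as 0xC9 counts as a word character and folds to 0xE9, while whitespace still ends a word. A character with no case pair, entered twice as described above, simply maps to itself.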