On Fri, Sep 22, 2000 at 09:20:29AM +0900, NOKUBI Takatsugu wrote:
> Hmm... I don't know about Chinese. I think, it is hard to determine
> word boundary in Chinese. So some word segmentation tools need for
> processing Chinese (like kakasi, chasen in Japanese). I looked in the
> output of "apt-get search chinese", but it seems there are no such
> tool...

The way udmsearch works is that it has a big array of characters. Those
characters make up a word; anything not in that array is treated as
whitespace. Easy stuff for single-byte character sets.
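
As a rough illustration of the table approach described above (this is not
udmsearch's actual code; the names WORD_BYTES, IS_WORD_CHAR and split_words
are made up for this sketch, and treating only ASCII letters and digits as
word characters is an assumption):

```python
# Hypothetical sketch: a 256-entry lookup table marks which byte values
# count as word characters; every other byte acts as whitespace.
WORD_BYTES = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

IS_WORD_CHAR = [False] * 256
for byte_value in WORD_BYTES:
    IS_WORD_CHAR[byte_value] = True


def split_words(data):
    """Split a single-byte-encoded bytes object into a list of words."""
    words = []
    current = bytearray()
    for byte_value in data:
        if IS_WORD_CHAR[byte_value]:
            # Still inside a word: keep collecting bytes.
            current.append(byte_value)
        elif current:
            # Hit a separator: close off the word collected so far.
            words.append(bytes(current))
            current = bytearray()
    if current:
        # Flush the last word if the input did not end on a separator.
        words.append(bytes(current))
    return words
```

With this table, split_words(b"single-byte sets.") gives
[b"single", b"byte", b"sets"]. The sketch depends on one byte being one
character, which is exactly the assumption that breaks for multi-byte
encodings.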
I don't understand dual-byte character sets at all, but I'm guessing you
could do a similar thing: have an array of the dual-byte values which make
up characters. I really don't know what to do here.

> There is the another solution. It is "letter indexing
> approach". However, that approach is more difficult to implement than
> "word indexing approach". It sould be hard to implement it in Glimpse.

Sounds hard to implement at all. Is there any solution that searches
well for both single- and dual-byte character sets?

  - Craig

--
Craig Small VK2XLZ  GnuPG:1C1B D893 1418 2AF4 45EE 95CB C76C E5AC 12CA DFA5
Eye-Net Consulting http://www.eye-net.com.au/ <[EMAIL PROTECTED]>
MIEEE <[EMAIL PROTECTED]>  Debian developer <[EMAIL PROTECTED]>