On Mon, 2007-12-10 at 23:29 +0800, Joe Wong wrote:
> Hi Timo,
>
> Just take your suggestion. I have another collections of emails and running
> full text search on that did not encounter any problem no matter they are on
> NFS or local disk.
>
> You mentioned that full text search is only working on for english only
> mailbox, what is the current limitation of it? Is there any plan to support
> non-english email ( conversion to UTF8? )

It should work with any UTF-8 input, and I've tested that it works with some mails containing non-ASCII characters. There's nothing in the design that prevents it, so I guess there is some bug that causes these problems. If you could send me a test mailbox where this happens, I could take a look at fixing it.

Although now that you mention it, I wonder if the current design could be optimized to work a bit differently with Chinese/Japanese/etc. Currently it works by indexing 4-character blocks, so with non-ASCII UTF-8 input it may end up indexing more than 4 bytes per block. How many bytes does a typical Chinese UTF-8 character take? How many characters does a typical Chinese word take? How many characters are in your typical search word?

I was just wondering: if there are a lot of 1-3 character words, maybe the indexing could limit each block to something like min(4 characters, ~8 bytes). That would then take less space and memory.
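
To make the block-size question concrete, here is a minimal sketch of the "4 characters or about 8 bytes, whichever is smaller" idea. It is not the actual indexer code: the function name `block_at` and its parameters are hypothetical, and it only shows how a byte cap would shorten blocks for multi-byte text. The one hard fact it relies on is that characters in the range U+0800 to U+FFFF, which covers the commonly used Chinese characters, take 3 bytes each in UTF-8.

```python
def block_at(text, start, max_chars=4, max_bytes=None):
    """Return the index block beginning at character offset `start`.

    Hypothetical helper for illustration only. A block holds at most
    `max_chars` characters and, if `max_bytes` is given, at most that
    many bytes once encoded as UTF-8.
    """
    block = ""
    size = 0
    for ch in text[start:start + max_chars]:
        n = len(ch.encode("utf-8"))  # 1 byte for ASCII, 3 for most CJK
        if max_bytes is not None and size + n > max_bytes:
            break  # adding this character would exceed the byte cap
        block += ch
        size += n
    return block


# Without a byte cap, a 4-character Chinese block is 12 bytes, not 4.
uncapped = block_at("中文邮件搜索", 0)
print(uncapped, len(uncapped.encode("utf-8")))   # 中文邮件 12

# With the assumed ~8 byte cap, only two 3-byte characters fit.
capped = block_at("中文邮件搜索", 0, max_bytes=8)
print(capped, len(capped.encode("utf-8")))       # 中文 6

# ASCII input is unaffected by the cap.
print(block_at("dovecot", 0, max_bytes=8))       # dove
```

With an 8-byte cap, Chinese text would be indexed in 2-character blocks, which only makes sense if typical words and search terms are that short; that is what the questions above are trying to find out.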