On Tue, 12 Apr 2016 13:41:51 -0400 Yu Qian <jinggeqianyu1...@gmail.com> wrote:
> Yup, that's right, it becomes difficult if we want to support multiple
> language in one spam detection solution. and it's true that there are
> some best practice for single language. but didn't see too much
> support multiple

The only practical approach is to normalize everything into Unicode
and tokenize Unicode characters.  (We actually use UTF-8 as the
on-disk representation.)

We have a custom Bayes engine that treats any character in the
CJK Unified Ideographs range as a word.  This is not strictly correct
because there are two-character (and longer) CJK words, but it's close
enough, especially because our Bayes engine also looks at word pairs.

I think this is a Summer of Code project for SpamAssassin. :)

Regards,

Dianne.
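
For illustration only, here is a minimal sketch of the tokenization idea
described above. It is not the custom Bayes engine from this message. The
function names (tokenize, with_pairs), the use of NFC normalization, the
lowercasing of non-CJK words, and the restriction to the basic CJK Unified
Ideographs block (U+4E00 to U+9FFF) are all assumptions made for the example.

```python
import unicodedata

# Assumption: only the basic CJK Unified Ideographs block is covered here;
# the extension blocks are left out to keep the sketch short.
CJK_START, CJK_END = 0x4E00, 0x9FFF


def is_cjk(ch):
    """Return True if ch falls in the basic CJK Unified Ideographs block."""
    return CJK_START <= ord(ch) <= CJK_END


def tokenize(text):
    """Split text into tokens.

    Each CJK ideograph becomes a token of its own; any other run of
    alphanumeric characters is kept together as one lowercased token.
    """
    # Assumption: NFC is used as the normalization form.
    text = unicodedata.normalize("NFC", text)
    tokens = []
    current = []
    for ch in text:
        if is_cjk(ch):
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(ch)
        elif ch.isalnum():
            current.append(ch.lower())
        elif current:
            # Any other character (space, punctuation) ends the current word.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def with_pairs(tokens):
    """Return the tokens followed by every adjacent token pair."""
    pairs = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    return tokens + pairs


if __name__ == "__main__":
    # Decode from UTF-8 first, then tokenize the resulting Unicode string.
    raw = "Buy 便宜 pills".encode("utf-8")
    print(with_pairs(tokenize(raw.decode("utf-8", errors="replace"))))
```

The pair tokens are what let a classifier pick up a two-character CJK word
such as 便宜 even though each ideograph is also counted separately.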