On Tue, 12 Apr 2016 13:41:51 -0400
Yu Qian <jinggeqianyu1...@gmail.com> wrote:

> Yup, that's right, it becomes difficult if we want to support multiple
> languages in one spam detection solution. And it's true that there are
> some best practices for a single language, but didn't see too much
> support multiple

The only practical approach is to normalize everything into Unicode and
tokenize Unicode characters.  (We actually use UTF-8 as the on-disk
representation.)
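
As a rough illustration of that normalization step, here is a minimal
Python sketch. It is not our production code. The function names, the
choice of NFKC as the normalization form, and the fallback to UTF-8
with replacement characters are all assumptions made for the example.

```python
import unicodedata


def to_unicode(raw: bytes, charset: str = "utf-8") -> str:
    """Decode raw message bytes into a normalized Unicode string.

    Hypothetical helper: NFKC and the fallback policy are assumptions.
    """
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or mislabeled charset: fall back to UTF-8 and replace
        # undecodable bytes instead of rejecting the message.
        text = raw.decode("utf-8", errors="replace")
    # NFKC folds compatibility variants (e.g. fullwidth Latin letters)
    # into their plain equivalents before tokenizing.
    return unicodedata.normalize("NFKC", text)


def to_disk(text: str) -> bytes:
    """Serialize normalized text as UTF-8 for on-disk storage."""
    return text.encode("utf-8")
```

Whatever charset a message arrives in, everything downstream then sees
one consistent Unicode string, and only UTF-8 bytes are written out.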

We have a custom Bayes engine that treats any character in the CJK
Unified Ideographs range as a word.  This is not strictly correct
because there are two-character (and longer) CJK words, but it's close
enough, especially because our Bayes engine also looks at word pairs.
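
To make the idea concrete, here is a small Python sketch of that kind
of tokenizer. It is only an approximation written for this message, not
our engine. The names, the lowercasing, and the way a word pair is
joined with a space are assumptions. The range used is the basic CJK
Unified Ideographs block, U+4E00 through U+9FFF.

```python
import re

# Either a single CJK Unified Ideograph (U+4E00..U+9FFF), or a run of
# word characters that are not in that range.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")


def tokenize(text: str) -> list[str]:
    """Split text into tokens, treating each CJK ideograph as a word."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


def word_pairs(tokens: list[str]) -> list[str]:
    """Build adjacent-token pairs (joining with a space is an assumption)."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def features(text: str) -> list[str]:
    """Single tokens plus adjacent pairs, as inputs for a Bayes classifier."""
    tokens = tokenize(text)
    return tokens + word_pairs(tokens)
```

With this sketch, a two-character CJK word is split into two
single-character tokens, but the pair of those two tokens still shows
up as a feature of its own, which is why the pairs make up for most of
the missing word segmentation.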

I think this is a Summer of Code project for SpamAssassin. :)

Regards,

Dianne.
