On Tue, 12 Apr 2016 14:15:50 -0400 Dianne Skoll wrote:

> On Tue, 12 Apr 2016 13:41:51 -0400
> Yu Qian <jinggeqianyu1...@gmail.com> wrote:
>
> > Yup, that's right, it becomes difficult if we want to support
> > multiple languages in one spam detection solution. And it's true
> > that there are some best practices for a single language, but I
> > didn't see much support for multiple languages.
>
> The only practical approach is to normalize everything into Unicode
> and tokenize Unicode characters. (We actually use UTF-8 as the
> on-disk representation.)
>
> We have a custom Bayes engine that treats any character in the CJK
> Unified Ideographs range as a word. This is not strictly correct
> because there are two-character (and longer) CJK words, but it's close
> enough.
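For anyone curious, a rough Python sketch of that per-character
approach (illustrative only - this is not Dianne's actual engine, and
the range below covers just the basic CJK Unified Ideographs block,
U+4E00-U+9FFF):

import re

def tokenize(text):
    # Each CJK ideograph becomes a token by itself; runs of other
    # letters are tokenized as ordinary words.
    return re.findall(r'[\u4e00-\u9fff]|[^\W\d_]+', text)

print(tokenize("Buy 伟哥 now"))   # ['Buy', '伟', '哥', 'now']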
What happens in mainstream SpamAssassin is that if a word is over 15
bytes long, then 3- and 4-byte UTF-8 characters are extracted as tokens
in place of the original word (rough sketch below). Everything can be
normalized to UTF-8 with "normalize_charset 1".

This will likely work fairly well for CJK, but it won't work well for
any 3- or 4-byte UTF-8 alphabet that isn't composed of ideograms
(unless that alphabet only ever appears in spam), and that includes
most Asian and African languages. I think the best solution is simply
to retain the original long word as a token - or at least to allow that
as an option.

Setting normalize_charset also helps with custom rules if you edit them
as UTF-8, but it's important to remember that SA sees a multibyte
character as a sequence of bytes rather than as a single character. For
example, you can't put a non-ASCII character between square brackets
(second sketch below).
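First, the long-word behaviour in rough Python pseudocode (heavily
simplified - the real logic is in SA's Perl Bayes tokenizer; this just
shows the shape of it):

def bayes_tokens(word, max_bytes=15):
    # Short words are kept whole.
    if len(word.encode('utf-8')) <= max_bytes:
        return [word]
    # Over-long words are replaced by their 3- and 4-byte UTF-8
    # characters as individual tokens; the word itself is discarded.
    return [ch for ch in word if len(ch.encode('utf-8')) >= 3]

print(bayes_tokens("viagra"))         # ['viagra'] - kept whole
print(bayes_tokens("这是一封垃圾邮件"))  # one token per ideograph: fine
print(bayes_tokens("नमस्ते"))           # Devanagari: shredded per character

For the Devanagari case the word identity is gone entirely, which is
exactly why retaining the original long word (or making that an
option) seems the better default.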
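And the square-bracket pitfall made concrete, using Python's bytes
regexes as a stand-in for a byte-oriented matcher (illustrative only):

import re

text = '100 €'.encode('utf-8')    # '€' is three bytes: e2 82 ac

# Inside a character class the euro sign degenerates into its three
# individual bytes, so the class can never match it as a unit:
bad = re.compile('[€$]'.encode('utf-8'))
print(bad.findall(text))          # [b'\xe2', b'\x82', b'\xac']

# Alternation keeps the multibyte sequence intact - the usual
# workaround when writing rules against raw UTF-8 bytes:
good = re.compile(r'(?:€|\$)'.encode('utf-8'))
print(good.findall(text))         # [b'\xe2\x82\xac']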