On Tue, 12 Apr 2016 14:15:50 -0400
Dianne Skoll wrote:

> On Tue, 12 Apr 2016 13:41:51 -0400
> Yu Qian <jinggeqianyu1...@gmail.com> wrote:
> 
> > Yup, that's right, it becomes difficult if we want to support
> > multiple languages in one spam detection solution. And it's true
> > that there are some best practices for a single language, but I
> > haven't seen much support for multiple languages.
> 
> The only practical approach is to normalize everything into Unicode
> and tokenize Unicode characters.  (We actually use UTF-8 as the
> on-disk representation.)
> 
> We have a custom Bayes engine that treats any character in the CJK
> Unified Ideographs range as a word.  This is not strictly correct
> because there are two-character (and longer) CJK words, but it's close
> enough.

What happens in mainstream SpamAssassin is that if a word is over 15
bytes long, then any 3- and 4-byte UTF-8 characters in it are
extracted as tokens in place of the original word. Everything can be
normalized to UTF-8 with "normalize_charset 1".

This will likely work fairly well for CJK, but it won't work well for
any 3- or 4-byte UTF-8 alphabet that isn't composed of ideograms
(unless it appears only in spam). That includes most Asian and African
languages.

I think the best solution to this is simply to retain the original
long word as a token, or at least to allow that as an option.
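
A rough Perl sketch of the behaviour described above, with the
suggested keep-the-original-word option added - this is not SA's
actual tokenizer code, and the function name is mine:

    use strict;
    use warnings;
    use utf8;
    use Encode qw(encode_utf8);
    binmode STDOUT, ':encoding(UTF-8)';

    # Split an over-long word into its 3- and 4-byte UTF-8
    # characters, optionally retaining the whole word as well.
    sub tokenize_word {
        my ($word, $keep_original) = @_;
        return ($word) if length(encode_utf8($word)) <= 15;
        my @tokens = grep { length(encode_utf8($_)) >= 3 }
                     split //, $word;
        push @tokens, $word if $keep_original;
        return @tokens;
    }

    # CJK: the per-character tokens are still meaningful words.
    print join(' ', tokenize_word('垃圾邮件过滤器')), "\n";
    # Thai: per-character tokens lose the word entirely, which is
    # why retaining the original word as a token would help.
    print join(' ', tokenize_word('ประสิทธิภาพ', 1)), "\n";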

Setting normalize_charset also helps with custom rules if you edit
them as UTF-8, but it's important to remember that SA sees a multibyte
character as a sequence of bytes rather than as a single character.
For example, you can't usefully put a non-ASCII character between
square brackets in a character class.
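
For instance (hypothetical rule names, assuming normalize_charset is
set and the rule file is saved as UTF-8):

    # Broken: SA treats the class as a set of the individual UTF-8
    # bytes, so it matches stray bytes rather than the characters.
    body  LOCAL_ACCENTS_BAD  /[éè]/
    # Works: alternation keeps each multibyte sequence intact.
    body  LOCAL_ACCENTS_OK   /(?:é|è)/
    score LOCAL_ACCENTS_OK   0.1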
