It's nice to hear that SpamAssassin can look at word pairs. I am new to SpamAssassin, so I am still finding out more interesting things about it.
Regarding word pairs: can SpamAssassin also detect a word whose letters are separated by spaces, for example "Free" appearing in an email as "F R E E"?

---
Yu Qian
Ottawa Ontario
Phone: (514)-553-0198

On Tue, Apr 12, 2016 at 2:15 PM, Dianne Skoll <d...@roaringpenguin.com> wrote:

> On Tue, 12 Apr 2016 13:41:51 -0400
> Yu Qian <jinggeqianyu1...@gmail.com> wrote:
>
> > Yup, that's right, it becomes difficult if we want to support multiple
> > language in one spam detection solution. and it's true that there are
> > some best practice for single language. but didn't see too much
> > support multiple
>
> The only practical approach is to normalize everything into Unicode and
> tokenize Unicode characters. (We actually use UTF-8 as the on-disk
> representation.)
>
> We have a custom Bayes engine that treats any character in the CJK
> Unified Ideographs range as a word. This is not strictly correct
> because there are two-character (and longer) CJK words, but it's close
> enough, especially because our Bayes engine also looks at word pairs.
>
> I think this is a Summer of Code project for SpamAssassin. :)
>
> Regards,
>
> Dianne.
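
To make the question concrete, here is a rough Python sketch of the tokenization idea described in the quoted message: normalize the text, treat each character in the CJK Unified Ideographs range as its own word, and also emit word pairs. It is only an illustration under my own assumptions, not the actual code of the custom Bayes engine or of SpamAssassin. The function names (tokenize, with_pairs, collapse_spaced_letters), the choice of NFKC normalization, and the restriction to the basic CJK block U+4E00..U+9FFF are all hypothetical. The collapse_spaced_letters step is just one naive way the "F R E E" case might be handled before tokenizing.

```python
import re
import unicodedata

# Basic CJK Unified Ideographs block only (assumption: extension
# blocks are ignored to keep the sketch short).
CJK_START, CJK_END = 0x4E00, 0x9FFF

# Three or more single letters separated by single spaces, e.g. "F R E E".
SPACED_LETTERS = re.compile(r"\b(?:[^\W\d_] ){2,}[^\W\d_]\b")


def is_cjk(ch):
    """Return True if ch is in the basic CJK Unified Ideographs block."""
    return CJK_START <= ord(ch) <= CJK_END


def collapse_spaced_letters(text):
    """Join runs of space-separated single letters ("F R E E" -> "FREE").

    Naive on purpose: it will also join legitimate sequences of
    one-letter words.
    """
    return SPACED_LETTERS.sub(lambda m: m.group(0).replace(" ", ""), text)


def tokenize(text):
    """Split text into lowercase words; each CJK ideograph is one token."""
    text = unicodedata.normalize("NFKC", text).lower()
    tokens, word = [], []
    for ch in text:
        if is_cjk(ch):
            # Flush any pending non-CJK word, then emit the ideograph alone.
            if word:
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch)
        else:
            # Whitespace or punctuation ends the current word.
            if word:
                tokens.append("".join(word))
                word = []
    if word:
        tokens.append("".join(word))
    return tokens


def with_pairs(tokens):
    """Return the single tokens followed by adjacent word pairs."""
    pairs = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    return tokens + pairs


if __name__ == "__main__":
    message = "Get F R E E pills 免费"
    print(with_pairs(tokenize(collapse_spaced_letters(message))))
```

With the pairs included, a two-character CJK word such as 免费 still shows up as the single feature "免 费", which is the reason the quoted message calls the per-character approach "close enough".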