Re: How does SpamAssassin processing languages other than English

Yu Qian Tue, 12 Apr 2016 14:01:04 -0700

It's nice to hear that SpamAssassin can look at word pairs. I am new to
SpamAssassin, so I am still finding out more about what it can do.


On the subject of word pairs, can SpamAssassin detect a word that has been
split up by spaces, for example "Free" appearing in an email in the form
"F R E E"?
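
To make the question concrete, here is a minimal sketch of the kind of
preprocessing I have in mind. It is not SpamAssassin's own method, only my
illustration in Python; the function name collapse_spaced_letters and the
"three or more single letters" threshold are my assumptions.

```python
import re

# Three or more single ASCII letters, each separated by one space or tab,
# e.g. "F R E E". The threshold of three letters is an arbitrary assumption.
_SPACED = re.compile(r"\b(?:[A-Za-z][ \t]){2,}[A-Za-z]\b")


def collapse_spaced_letters(text: str) -> str:
    """Join runs of space-separated single letters back into one word."""
    return _SPACED.sub(lambda m: re.sub(r"[ \t]", "", m.group(0)), text)


if __name__ == "__main__":
    print(collapse_spaced_letters("Get F R E E pills today"))
```

A filter could run this before tokenizing so that the rejoined word is
scored like the normal one. It will also join legitimate single-letter
runs (for example, a leading "a" directly before the spaced word), so it
is only a rough starting point.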


---
Yu Qian
Ottawa Ontario
Phone: (514)-553-0198



On Tue, Apr 12, 2016 at 2:15 PM, Dianne Skoll <d...@roaringpenguin.com>
wrote:

> On Tue, 12 Apr 2016 13:41:51 -0400
> Yu Qian <jinggeqianyu1...@gmail.com> wrote:
>
> > Yup, that's right, it becomes difficult if we want to support multiple
> > languages in one spam detection solution. And it's true that there are
> > some best practices for a single language, but I haven't seen much
> > support for multiple
>
> The only practical approach is to normalize everything into Unicode and
> tokenize Unicode characters.  (We actually use UTF-8 as the on-disk
> representation.)
>
> We have a custom Bayes engine that treats any character in the CJK
> Unified Ideographs range as a word.  This is not strictly correct
> because there are two-character (and longer) CJK words, but it's close
> enough, especially because our Bayes engine also looks at word pairs.
>
> I think this is a Summer of Code project for SpamAssassin. :)
>
> Regards,
>
> Dianne.
>
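
For anyone reading along, the approach quoted above (one token per CJK
ideograph, plus word pairs) could be sketched roughly as follows. This is
an illustration under my own assumptions, not Dianne's engine or
SpamAssassin code: the names tokenize and features are hypothetical, and
only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is covered.

```python
import re

# Either a single ideograph from the basic CJK Unified Ideographs block
# (U+4E00..U+9FFF), or a run of word characters that are not in that block.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")


def tokenize(text: str) -> list[str]:
    """Split text into words, treating each CJK ideograph as its own word."""
    return _TOKEN.findall(text.lower())


def features(text: str) -> list[str]:
    """Return single tokens followed by adjacent token pairs."""
    words = tokenize(text)
    pairs = [a + " " + b for a, b in zip(words, words[1:])]
    return words + pairs


if __name__ == "__main__":
    print(features("免费 pills"))
```

The pairs are what recover two-character CJK words: the two ideographs of
a word such as 免费 end up as a single pair feature even though each
ideograph is tokenized separately.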
