Jason Haar wrote:
Hi there
I just did a stat-run on email I received 31st Oct, and found that of
the mail SA scored lower than 5/5 (i.e. SA classified as "ham"), a large
amount was SPAM. In fact it only caught 80% of the SPAM I received that
day (this is with SA 3.1.0)
Of that I was able to tell that the vast majority of "missed" SPAM was
actually Asian SPAM - the Subject: lines alone were 100% non-ASCII - bit
of a give-away as I am ignorant and can't speak anything but
Kiwi-English ;-)
If I removed that Asian SPAM from the figures, the effectiveness of SA
shot up to 98% - pretty darn good!
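
A rough sketch of how a stat-run like this could be reproduced in Python. It assumes you already have the raw Subject: header and the SA score for each message you hand-classified as spam. The function names, the sample scores, and the 5.0 threshold default are illustrative and do not come from SA itself.

```python
from email.header import decode_header, make_header


def subject_is_all_non_ascii(raw_subject: str) -> bool:
    """Return True if every non-whitespace character of the decoded
    Subject is outside the ASCII range.

    Assumption: the header uses RFC 2047 encoded-words with a charset
    Python knows; an unknown charset would raise LookupError here.
    """
    decoded = str(make_header(decode_header(raw_subject)))
    chars = [c for c in decoded if not c.isspace()]
    return bool(chars) and all(ord(c) > 127 for c in chars)


def catch_rate(spam_scores, threshold=5.0):
    """Fraction of known-spam messages that scored at or above the
    threshold (i.e. that SA would have flagged)."""
    if not spam_scores:
        return 0.0
    caught = sum(1 for score in spam_scores if score >= threshold)
    return caught / len(spam_scores)


if __name__ == "__main__":
    # Hypothetical (subject, score) pairs for hand-verified spam.
    spam = [
        ("Cheap meds", 7.2),
        ("=?utf-8?b?5L2g5aW9?=", 1.3),
        ("Hot stocks", 5.0),
    ]
    overall = catch_rate([score for _, score in spam])
    ascii_only = catch_rate(
        [score for subj, score in spam if not subject_is_all_non_ascii(subj)]
    )
    print(f"overall: {overall:.0%}, excluding non-ASCII subjects: {ascii_only:.0%}")
```

Comparing the two rates shows how much of the miss rate comes from mail with entirely non-ASCII subjects.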
Now personally I can run SA on my workstation with "ok_locales en" and
bang extra points onto non-English mail - but I certainly can't do that
for our company as a whole - which has customers from every
country/nationality, etc.
So the only thing I can think of is that there appears to be a need for
more non-English rulesets to add points for different language usages of
viagra/porn/whatever.
Am I correct in my thinking, and if so is the SA group getting help from
non-English developers to make this happen? I see a couple of
"body_test" rules that appear to be for Spanish and Polish - but no others?
Jason,
I know that I have personally contributed some rules to catch certain
phrases in Japanese; however, this seems like a really good scenario for
manual Bayes training.
While the auto-learning is convenient and often "good enough", I think
the general consensus is that you should do at least a certain bit of
manual training so that your Bayes databases better represent your mail
traffic patterns.
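
As a minimal sketch of what manual training might look like when scripted, the Python below wraps the `sa-learn` command that ships with SA, using its `--spam` and `--ham` options. The helper names and the example paths are hypothetical, and it assumes `sa-learn` is on the PATH of the user whose Bayes database you want to train.

```python
import subprocess


def sa_learn_command(kind: str, path: str) -> list:
    """Build the sa-learn argument list for a file or directory of
    hand-sorted mail. `kind` must be "spam" or "ham"."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", f"--{kind}", path]


def train(kind: str, path: str) -> int:
    """Run sa-learn on `path` and return its exit status.

    Assumption: sa-learn is installed and this runs as the user that
    owns the Bayes database.
    """
    return subprocess.run(sa_learn_command(kind, path)).returncode


if __name__ == "__main__":
    # Hypothetical folders holding hand-classified messages.
    print(sa_learn_command("spam", "/path/to/missed-spam"))
    print(sa_learn_command("ham", "/path/to/good-mail"))
```

Feeding it the missed non-English spam together with a sample of legitimate non-English mail is what lets the database reflect your actual traffic.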
hope this helps,
alan