Hello,

David Jones [mailto:djo...@ena.com] wrote:
> With non-English email flow, it's more challenging.  If no RBLs hit, then you 
> really must train your Bayes properly which requires some way to accurately 
> determine the ham and spam.  You must keep a copy of the 
> ham and spam corpi and be allowed to review suspicious email.

I really appreciate you taking the time to help with this.

Yes, I can confirm that we usually have issues with Hungarian spam. English 
spam is often caught by the default rules.

As far as I understand it now, I need to rebuild the Bayes database from 
scratch:

1. turn off autolearning

2. populate the spam database
The people behind the http://artinvoice.hu/spams/ site are doing excellent work; 
they publish the spam they catch in mbox format.
I checked: many of the spam e-mails that were sent for investigation are in 
their mbox.

3. populate the ham database
That's the tricky part. As I mentioned earlier, I don't really want end-users 
involved in this, and I don't have the resources to do it manually.
I assume I can hack something into the mail flow that copies all outgoing 
e-mails to a separate mailbox and, assuming that every outgoing e-mail is ham, 
learns these mails as ham.
Would that do it?
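
For reference, the three steps above could be sketched roughly as follows. This 
is only a sketch under assumptions: the two mbox paths are hypothetical 
placeholders, and it assumes autolearning has already been switched off with 
"bayes_auto_learn 0" in local.cf. The script only builds and prints the sa-learn 
command lines; it does not run them.

```python
import shlex


def sa_learn_commands(spam_mbox, ham_mbox):
    """Return the sa-learn invocations for a from-scratch retrain, in order."""
    return [
        # Wipe the existing Bayes database so training starts clean.
        ["sa-learn", "--clear"],
        # Learn the collected spam corpus (mbox format).
        ["sa-learn", "--spam", "--mbox", spam_mbox],
        # Learn the copied outgoing mail as ham (mbox format).
        ["sa-learn", "--ham", "--mbox", ham_mbox],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    commands = sa_learn_commands("/var/tmp/hu-spam.mbox",
                                 "/var/tmp/outgoing-ham.mbox")
    for cmd in commands:
        print(shlex.join(cmd))
```

To actually run the commands, each list could be passed to 
subprocess.run(cmd, check=True) as the user that owns the Bayes database.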

End-users work in a heavily controlled environment (both technically and 
legally); in the last ten years, we haven't seen any spam sent from inside. 
That's why I would blindly trust outgoing e-mails as ham.

One more question: is there a recommended ham to spam ratio? 1:1? 

If you consider my idea of populating the ham database automatically from 
outgoing e-mails complete nonsense, then I could find sysadmin resources to 
collect 2000 legitimate e-mails and train on them as ham, but I cannot allocate 
2 work hours/day for months. (Also, I'm not sure whether 2000 legitimate 
e-mails are enough for training.)
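
To see where the two corpora stand before training, something like the 
following could count the messages in each mbox and report the ham-to-spam 
ratio. It is a sketch, not a recommendation on the ratio itself: the threshold 
of 200 is SpamAssassin's default for bayes_min_ham_num and bayes_min_spam_num 
(the point at which Bayes starts being used at all), and the function names are 
made up for illustration.

```python
import mailbox

# SpamAssassin's default for bayes_min_ham_num / bayes_min_spam_num.
# Below this many learned messages per class, Bayes is not used.
MIN_PER_CLASS = 200


def count_messages(mbox_path):
    """Count the messages in an existing mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def corpus_summary(ham_count, spam_count):
    """Summarise corpus sizes and the ham-per-spam ratio."""
    ratio = ham_count / spam_count if spam_count else None
    return {
        "ham": ham_count,
        "spam": spam_count,
        "ham_per_spam": ratio,
        "enough": ham_count >= MIN_PER_CLASS and spam_count >= MIN_PER_CLASS,
    }
```

For example, corpus_summary(count_messages(ham_path), count_messages(spam_path)) 
with the two mbox paths gives a quick check before running sa-learn.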

Best regards,
  Szabolcs Horvath
