On Tue, 13 Feb 2018, Horváth Szabolcs wrote:

> 3. populate the ham database
>
> That's the tricky part. As I mentioned earlier, I don't really want end-users involved in this.

You might be able to find a few that are somewhat technically competent and don't mind their ham samples being manually reviewed.

> One more question: is there a recommended ham to spam ratio? 1:1?

I suggest "try to match your ham:spam ratio at your MTA before filtering", but others may have different advice. Generally: the more *reliable* data you can feed Bayes, the better it does.
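
To make the ratio suggestion concrete, here is a minimal sketch of the arithmetic. It assumes you can count ham and spam arriving at the MTA over some period (from logs, for example); the function name and the numbers are made up for illustration and are not part of SpamAssassin.

```python
def ham_target(mta_ham: int, mta_spam: int, spam_samples: int) -> int:
    """Return how many ham samples match the MTA's pre-filter ham:spam ratio.

    mta_ham / mta_spam: message counts observed at the MTA before filtering.
    spam_samples: number of spam messages already available for training.
    """
    if mta_spam <= 0:
        raise ValueError("need at least one spam message observed at the MTA")
    # Scale the spam corpus by the observed ham:spam ratio.
    return round(spam_samples * mta_ham / mta_spam)


# Hypothetical figures: 3000 ham and 7000 spam seen at the MTA,
# 2000 spam samples on hand for training.
print(ham_target(3000, 7000, 2000))
```

With a 1:1 mix at the MTA this simply returns the size of the spam corpus.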

> If you think my idea of populating the ham database automatically with outgoing emails is complete nonsense, I could find sysadmin time to collect 2000 legitimate emails and train them as ham, but I cannot allocate 2 work hours per day for months. (I'm also not sure whether 2000 legitimate emails are enough for training.)

2000 is enough to start, but it would have to be ongoing as the nature of mail changes over time.
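
One way to check that training really is ongoing is to watch the message counters Bayes keeps. The sketch below assumes the output of `sa-learn --dump magic` contains lines ending in `non-token data: nspam` and `non-token data: nham`, with the count in the third column; the parser name and the sample text are invented for illustration, so compare against what your installation actually prints.

```python
import re

# Assumed line shape (verify against your own output):
#   <prob> <spam_count> <value> <atime>  non-token data: <name>
_MAGIC_LINE = re.compile(
    r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (.+?)\s*$"
)


def parse_magic(dump_text: str) -> dict:
    """Map each 'non-token data' name to its integer value."""
    counters = {}
    for line in dump_text.splitlines():
        match = _MAGIC_LINE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


# Made-up sample in the assumed format, not real output.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       2150          0  non-token data: nspam
0.000          0       2000          0  non-token data: nham
"""

counts = parse_magic(SAMPLE)
print("ham:", counts["nham"], "spam:", counts["nspam"])
```

Running this periodically and comparing the counters shows whether new ham and spam are still being learned.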

Generally training on misclassifications is what you do after the initial training. So if a ham drops into a user's quarantine folder, you'd want to train that as ham.
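
Retraining a misclassified message can be scripted. This is only a sketch: it relies on the `--ham` and `--spam` options of `sa-learn`, the helper names and the example path are hypothetical, and it assumes the script runs as the user who owns the Bayes database.

```python
import subprocess


def relearn_command(path: str, correct_class: str) -> list:
    """Build the sa-learn command line for one misclassified message."""
    if correct_class not in ("ham", "spam"):
        raise ValueError("correct_class must be 'ham' or 'spam'")
    return ["sa-learn", "--" + correct_class, path]


def relearn(path: str, correct_class: str) -> None:
    """Run sa-learn on the message; raises CalledProcessError on failure."""
    subprocess.run(relearn_command(path, correct_class), check=True)


# Hypothetical path to a ham that landed in a user's quarantine folder.
print(relearn_command("/var/quarantine/user1/msg-0001.eml", "ham"))
```

Calling `relearn()` on each message a user or admin flags as misclassified keeps the ongoing training limited to the cases Bayes got wrong.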

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
