Hi,

We have accumulated quite a large list of whitelisted users, primarily
because their mail was previously tagged as spam incorrectly. I've
extracted a copy of all whitelisted mail into a separate mbox.

Certainly there is some spam in there as well, but assuming I only
learn the ham, would it make sense to train bayes using the emails
from this folder? It's all business-related, but I'm concerned that
these messages still contain whatever caused them to be tagged in the
first place, such as excessive HTML or being sent from a host with no
reverse DNS, which is why the senders had to be whitelisted.

Looking at the logs from before the addresses were added to the
whitelist, I see quite a few messages that hit BAYES_99, probably
because they resemble mailing-list mail, such as that from
networkworld. IOW, I wouldn't want to whitelist an email from
networkworld.com, but one of the company's partners could send the
company an email with many of the same characteristics.
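
To put a number on this, here is a rough sketch of how the rule hits
in the whitelisted mbox could be tallied. It assumes the messages
still carry SpamAssassin's X-Spam-Status header with a "tests=" list;
the function name rule_hits and the mbox path are made up for
illustration, and mail that was never scanned or had its headers
stripped is simply skipped.

```python
import mailbox
import re
from collections import Counter


def rule_hits(mbox_path):
    """Count how often each rule name appears in X-Spam-Status headers.

    Assumes each scanned message has a header containing "tests=RULE_A,RULE_B".
    Messages without the header are ignored.
    """
    counts = Counter()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            status = msg.get("X-Spam-Status")
            if status is None:
                continue
            # Unfold continuation lines and collapse whitespace.
            flat = " ".join(str(status).split())
            # Capture the tests list up to the next "key=" field or end of header.
            match = re.search(r"tests=(.*?)(?:\s+\w+=|$)", flat)
            if not match:
                continue
            for rule in match.group(1).split(","):
                rule = rule.strip()
                if rule and rule.lower() != "none":
                    counts[rule] += 1
    finally:
        box.close()
    return counts


if __name__ == "__main__":
    # Hypothetical path; replace with the mbox of whitelisted mail.
    for rule, n in rule_hits("whitelisted.mbox").most_common(20):
        print(f"{n:6d}  {rule}")
```

If BAYES_99 shows up near the top of that list, it would at least
confirm that bayes, rather than the HTML or DNS rules, is what pushes
this mail over the threshold.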

Someone may also send them a one-line email with a small GIF
attached, such as the corporate logo in their signature. This would
be a valid email, but it closely resembles typical spam.

The goal of all this is to train bayes to better recognize corporate
email, and so cut down on the number of senders that must be
whitelisted in the future (that is, corporate email that gets tagged
and then has to be whitelisted).

Ideas greatly appreciated.
Thanks,
Alex
