Hi,

We have accumulated quite a large list of whitelisted users, primarily because their mail was previously tagged incorrectly. I've extracted a copy of all whitelisted mail into a separate mbox.
There is certainly some spam in there as well, but assuming I only learn the ham, would it make sense to train bayes using the emails from this folder? It's all business-related, but I'm concerned that these messages contain the very things that caused them to be tagged, and therefore whitelisted, in the first place: excessive HTML, being sent from a host with no reverse DNS, etc.

Looking at the logs from before the addresses were added to the whitelist, I see quite a few messages that hit BAYES_99, probably because they resemble mailing-list mail, such as that from networkworld. IOW, I wouldn't want to whitelist an email from networkworld.com, but one of the company's partners could send the company an email with many of the same characteristics. Someone may also send a one-line email with a small GIF attached, such as the corporate logo in their signature. That would be a valid email, but it closely resembles typical spam.

The goal of all this is to train bayes to better recognize corporate email, and so cut down on the number of senders that must be whitelisted in the future (that is, corporate email that gets tagged and then has to be whitelisted).

Ideas greatly appreciated.

Thanks,
Alex
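
For reference, one way to gauge how much of the extracted mbox hit a given rule before learning any of it as ham is sketched below. This is only a rough sketch under assumptions not stated above: that each message still carries the X-Spam-Status header added at scan time, and that the rule names appear in that header. The function name rule_hits and the example path are hypothetical.

```python
import mailbox
import re
from collections import Counter


def rule_hits(mbox_path, rules=("BAYES_99",), header="X-Spam-Status"):
    """Count how many messages in an mbox list each rule name in `header`.

    Returns (total_messages, Counter mapping rule name -> message count).
    Assumes the scanner's status header is still present on each message.
    """
    # Match each rule name as a whole token, so BAYES_9 would not match BAYES_99.
    patterns = {r: re.compile(r"\b" + re.escape(r) + r"\b") for r in rules}
    counts = Counter()
    total = 0
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            total += 1
            # A folded header keeps its line breaks; a plain search still works.
            status = str(msg.get(header, ""))
            for rule, pattern in patterns.items():
                if pattern.search(status):
                    counts[rule] += 1
    finally:
        box.close()
    return total, counts


if __name__ == "__main__":
    # Hypothetical path to the extracted whitelist mbox.
    total, counts = rule_hits("whitelisted.mbox", rules=("BAYES_99", "HTML_MESSAGE"))
    print(total, dict(counts))
```

If a large share of the messages turns out to carry BAYES_99, that would suggest reviewing them by hand before feeding the cleaned file to something like `sa-learn --ham --mbox <file>`.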