Charles Sprickman wrote:
> Hello,
>
> I'm not seeing it in the FAQ/wiki, but I've missed things in there
> before, so I thought I'd ask a quick question here.
>
> I assume everyone else sees spam sneak through that contains a "spammy"
> subject (usually mentioning drugs with some mis-spellings/obfu), an
> attached image that apparently has the actual spam "message" in it, then
> some text that is very hammy in it's content.
>
> I've been assuming that this is what people refer to as "bayes poison"
> and I do not feed sa-learn with these.
>
> Is this correct, or would information in the headers still prove
> valuable to bayes?

Yes, that is what people mean by bayes poison. However, it is not correct that you should avoid training on these messages.

Try to train SA realistically. Don't try to second-guess and censor its input. If it's spam, train it as spam. If it's nonspam, train it as nonspam. Do this without regard for what the body "looks like".

I see a lot of admins out there pushing the idea of only training "ideal" spam and "ideal" nonspam, on the assumption that avoiding the oddball cases gives better results. This is the opposite of the truth. If you bias your training with unrealistic input, you're going to get unrealistic output.

The way bayes works, training on bayes poison won't make SA "instant spam" every message that contains the same words. Instead, SA learns that these words occur in both types of mail, which gives those tokens a more mid-line probability. SA's use of chi-squared combining means it is influenced most by words that occur exclusively in one type or the other, so these "present in both" tokens will have little impact on bayes scoring.
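
To make the last point concrete, here is a rough sketch of chi-squared (Fisher-style) combining as described by Gary Robinson. It is not SA's actual code: the function names `chi2q` and `combine` and the token probabilities are made up for illustration, and SA applies its own token selection and tuning on top of this. It only shows why tokens sitting near 0.5 barely move the result.

```python
import math

def chi2q(x, dof):
    """Chi-squared survival function P(X > x), valid for even dof only."""
    if dof % 2 != 0:
        raise ValueError("dof must be even")
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Illustrative only: assumes every probability is strictly
    between 0 and 1, and uses all tokens passed in.
    """
    n = len(probs)
    # Evidence for spam: small (1 - p) values make this statistic large.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for ham: small p values make this statistic large.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0

# Hypothetical token probabilities, not taken from a real bayes database.
strong_spam = [0.99] * 5   # tokens seen almost only in spam
poison = [0.5] * 10        # tokens seen equally in spam and ham

score_strong = combine(strong_spam)
score_mixed = combine(strong_spam + poison)
score_poison_only = combine(poison)
```

With these made-up numbers, `score_poison_only` sits at the neutral midpoint, while `score_mixed` stays close to `score_strong`: the mid-line tokens dilute the result only slightly, and the tokens that occur in just one type of mail still decide the outcome.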