Charles Sprickman wrote:
> Hello,
> 
> I'm not seeing it in the FAQ/wiki, but I've missed things in there
> before, so I thought I'd ask a quick question here.
> 
> I assume everyone else sees spam sneak through that contains a "spammy"
> subject (usually mentioning drugs with some mis-spellings/obfu), an
> attached image that apparently has the actual spam "message" in it, then
> some text that is very hammy in its content.
> 
> I've been assuming that this is what people refer to as "bayes poison"
> and I do not feed sa-learn with these.
> 
> Is this correct, or would information in the headers still prove
> valuable to bayes?

Yes, that is what people mean by "bayes poison". However, it is not correct
that you should avoid training on these messages.

Try to train SA realistically. Don't try to second-guess it or censor its input.
If it's spam, train it as spam. If it's nonspam, train it as nonspam. Do this
without regard for what the body "looks like".

I see a lot of admins out there pushing the idea of only training "ideal" spam
and "ideal" nonspam, with the assumption that by avoiding the oddball cases
they'll get better results. This is completely the opposite of the truth. By
biasing your training with unrealistic input, you're going to get unrealistic
output.

Because of the way bayes works, training on these messages won't cause other
messages containing the same words as the bayes-poison to be instantly tagged
as spam. Instead, SA becomes more aware that these words are used in both types
of mail, so each such token ends up with a more mid-line probability. SA's use
of chi-squared combining means the result is influenced most by words that
occur exclusively in one type or the other, and these "present in both" tokens
will have little impact on bayes scoring.
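
To make that concrete, here is a rough Python sketch of the idea. It is not
SA's actual code: it assumes a simplified Robinson-style token probability and
a Fisher-style chi-squared combiner of the kind commonly described for bayes
filters. The function names, the smoothing constants (s=1.0, x=0.5), and the
message counts are all made up for illustration.

```python
import math

def token_prob(spam_count, ham_count, n_spam, n_ham, s=1.0, x=0.5):
    """Smoothed probability that a token indicates spam (illustrative only)."""
    spam_ratio = spam_count / n_spam
    ham_ratio = ham_count / n_ham
    if spam_ratio + ham_ratio == 0:
        return x  # token never seen: fall back to the neutral prior
    p = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    # Pull p toward the prior x when there are few observations.
    return (s * x + n * p) / (s + n)

def chi2q(x2, dof):
    """Chi-squared survival function, valid for even degrees of freedom."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Fisher-style chi-squared combining of per-token probabilities."""
    n = len(probs)
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0

if __name__ == "__main__":
    # Hypothetical corpus: 100 spam and 100 ham messages trained.
    strong = token_prob(40, 0, 100, 100)    # word seen only in spam
    hammy = token_prob(0, 30, 100, 100)     # poison word, poison NOT trained
    midline = token_prob(30, 30, 100, 100)  # same word after training poison

    print("poison not trained:", combine([strong, hammy, hammy]))
    print("poison trained:    ", combine([strong, midline, midline]))
    print("only poison words: ", combine([midline, midline]))
```

In this toy model, a word seen equally often in spam and ham lands at 0.5, and a
message made up only of such words combines to a neutral result rather than a
spam one. When the poison has not been trained, the same words look strongly
hammy and drag the poisoned spam's score down.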


