Re: Skipping RBL checks for internal servers

RW Thu, 19 Mar 2015 15:52:55 -0700

On Thu, 19 Mar 2015 20:46:10 +0100
Reindl Harald wrote:

> 
> 
> On 19.03.2015 at 20:35, RW wrote:
> > On Thu, 19 Mar 2015 01:12:15 +0100
> > Reindl Harald wrote:


> >>
> >> the last point is easy to prove by having the old, unmodified
> >> corpus and run spamc against the cleaned bayes database and the
> >> final result is that you stop training in circles because you need
> >> a ton of classified ham messages to reduce the poison impact
> >
> >
> > But you're testing mail that's already been trained into the
> > database. Even though you stripped the "Bayes-poison" when
> > training, you'll have left enough rare tokens from the headers and
> > elsewhere to effectively "fingerprint" that spam. It's pretty much
> > inevitable that it hits BAYES_99[9].
> 
> you didn't get what i wrote

I think I did.

> * i removed poison and rebuilt bayes
> * i verified the *original* junk still containing poison against
>    the new bayes because i am not an idiot to verify cleaned samples
>    against a bayes built of the same contents

The mail you trained on was an edited copy of the mail you tested
with, which invalidates the result.

When you train on a spam you typically add a few dozen hapaxes (tokens
seen only once) to the database, and substantially alter the
probabilities of many low-count tokens. This means that if you train
and then retest the same mail, the new result almost always matches
the training.

When you train with spam that's had its "Bayes poison" removed, you
still skew the result of a test with the full spam unless removing the
poison also removes all of the hapaxes and low-count tokens, and that's
highly unlikely.
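
To make the effect concrete, here is a toy sketch in Python. It is not
SpamAssassin's Bayes code: the ToyBayes class, the smoothing constants,
the naive log-odds combining and every token name are assumptions made
up for illustration. It trains on a spam with the poison stripped, then
scores the full, still-poisoned spam, and compares that with a
different spam carrying the same poison but unseen header tokens.

```python
import math
from collections import Counter


class ToyBayes:
    """Toy token-count classifier (hypothetical, not SpamAssassin's)."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spams containing it
        self.ham = Counter()    # token -> number of hams containing it
        self.nspam = 0
        self.nham = 0

    def train(self, tokens, is_spam):
        # Count each token once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def token_prob(self, tok):
        s = self.spam[tok] / max(self.nspam, 1)
        h = self.ham[tok] / max(self.nham, 1)
        n = self.spam[tok] + self.ham[tok]
        p = s / (s + h) if (s + h) > 0 else 0.5
        # Smooth towards 0.5 with strength 1 (assumed constants), so an
        # unseen token is neutral and no probability reaches 0 or 1.
        return (0.5 + n * p) / (1 + n)

    def score(self, tokens):
        # Naive combination: sum the log-odds of every distinct token.
        logit = 0.0
        for tok in set(tokens):
            p = self.token_prob(tok)
            logit += math.log(p) - math.log(1 - p)
        return 1 / (1 + math.exp(-logit))


poison = {"meeting", "report", "lunch"}
headers = ["hdr:msgid-x9f3", "hdr:relay-q7", "hdr:helo-zzk",
           "hdr:mailer-v2", "hdr:boundary-7c", "hdr:from-kx1"]
full_spam = headers + ["cheap", "pills"] + sorted(poison)
stripped_spam = [t for t in full_spam if t not in poison]

db = ToyBayes()
db.train(["meeting", "report", "lunch", "monday"], is_spam=False)
db.train(["meeting", "report", "budget"], is_spam=False)
db.train(["lunch", "monday", "budget", "report"], is_spam=False)
db.train(["cheap", "watches", "hdr:relay-b2"], is_spam=True)

# Score the poisoned spam before the database has ever seen it.
before = db.score(full_spam)

# Train on the stripped copy, then retest the original poisoned spam.
db.train(stripped_spam, is_spam=True)
after = db.score(full_spam)

# A different spam with the same poison but header tokens never trained.
held_out = db.score(["hdr:msgid-other", "hdr:relay-other",
                     "cheap", "pills"] + sorted(poison))

print(f"before training:   {before:.3f}")
print(f"retest after:      {after:.3f}")
print(f"held-out, similar: {held_out:.3f}")
```

In this toy model the retested spam scores high mainly because of the
header hapaxes it left behind, while the held-out spam with the same
poison still scores low. A fair test would use mail that was never
trained in any form.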
