On Thu, 19 Mar 2015 20:46:10 +0100
Reindl Harald wrote:

> On 19.03.2015 at 20:35, RW wrote:
> > On Thu, 19 Mar 2015 01:12:15 +0100
> > Reindl Harald wrote:
> > > the last point is easy to prove by having the old, unmodified
> > > corpus and run spamc against the cleaned bayes database and the
> > > final result is that you stop training in circles because you need
> > > a ton of classified ham messages to reduce the poison impact
> >
> > But you're testing mail that's already been trained into the
> > database. Even though you stripped the "Bayes-poison" when
> > training, you'll have left enough rare tokens from the headers and
> > elsewhere to effectively "fingerprint" that spam. It's pretty much
> > inevitable that it hits BAYES_99[9].
>
> you didn't get what i wrote

I think I did.

> * i removed poison and rebuilt bayes
> * i verified the *original* junk still containing poison against
>   the new bayes because i am not an idiot to verify cleaned samples
>   against a bayes built of the same contents

The mail you used to train was edited from the mail you used to test,
which invalidates the result.

When you train a spam you typically add a few dozen hapaxes (tokens
seen only once) to the database, and substantially alter the
probabilities of many low-count tokens. This means that if you train
and retest, the new result almost always matches the training.

When you train with spam that's had its "Bayes poison" removed you
still skew the result of a test with the full spam, unless removing
the poison also removes all of the hapaxes and low-count tokens, and
that's highly unlikely.
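
To make the fingerprinting effect concrete, here is a toy sketch in
Python. It is not SpamAssassin's Bayes implementation: the ToyBayes
class, the smoothing constants, the naive log-odds combining and every
token in the sample messages are assumptions made up for illustration.
It trains only the cleaned copy of a spam, then scores the full,
poisoned original and a different, never-trained spam carrying the
same poison.

```python
import math
from collections import Counter


class ToyBayes:
    """Toy per-token Bayes filter; NOT SpamAssassin's real implementation."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # keyed by is_spam
        self.msgs = {True: 0, False: 0}

    def train(self, tokens, is_spam):
        # Count each token once per message.
        self.counts[is_spam].update(set(tokens))
        self.msgs[is_spam] += 1

    def token_prob(self, tok):
        n_spam = self.counts[True][tok]
        n_ham = self.counts[False][tok]
        n = n_spam + n_ham
        if n == 0:
            return 0.5  # unseen token carries no evidence
        s = n_spam / max(self.msgs[True], 1)
        h = n_ham / max(self.msgs[False], 1)
        p = s / (s + h)
        # Robinson-style smoothing toward 0.5 with strength 1, so a
        # hapax scores 0.75 or 0.25 rather than a hard 1.0 or 0.0.
        return (0.5 + n * p) / (1 + n)

    def score(self, tokens):
        # Naive log-odds sum, chosen for brevity (an assumption).
        logit = 0.0
        for tok in set(tokens):
            p = self.token_prob(tok)
            logit += math.log(p) - math.log(1 - p)
        return 1 / (1 + math.exp(-logit))


# Hypothetical corpus: every message and token below is invented.
ham = [
    "meeting agenda for monday project report",
    "thanks for the report see you at the meeting",
    "lunch on monday thanks",
    "project budget report attached",
]
old_spam = ["cheap pills buy now", "buy cheap watches now"]

poison = "meeting report thanks monday project".split()
core = ("buy pills x-mailer:bulk9 relay:203.0.113.7 msgid:q7zk "
        "subj:rx-deal url:pillz.example from:promo77 "
        "helo:mx9.example body:xqz-offer").split()
full_spam = core + poison  # the original, still poisoned
fresh_spam = ("buy pills msgid:k2vv relay:198.51.100.4 "
              "from:promo12").split() + poison  # never trained

db = ToyBayes()
for msg in ham:
    db.train(msg.split(), is_spam=False)
for msg in old_spam:
    db.train(msg.split(), is_spam=True)

before = db.score(full_spam)
db.train(core, is_spam=True)  # train ONLY the cleaned copy
after = db.score(full_spam)   # retest the poisoned original
fresh = db.score(fresh_spam)  # same poison, different rare tokens

print(f"original before training: {before:.3f}")
print(f"original after training:  {after:.3f}")
print(f"fresh spam, same poison:  {fresh:.3f}")
```

In this sketch the poisoned original jumps from a ham-like score to a
spam-like one purely because of the rare header-style tokens left in
the cleaned copy, while the fresh spam with the same poison still
scores low. A meaningful test of poison removal would therefore need
mail that was never trained in any form.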