On 19.03.2015 at 23:52, RW wrote:
> On Thu, 19 Mar 2015 20:46:10 +0100 Reindl Harald wrote:
>> On 19.03.2015 at 20:35, RW wrote:
>>> On Thu, 19 Mar 2015 01:12:15 +0100 Reindl Harald wrote:
>>>> the last point is easy to prove by having the old, unmodified corpus and running spamc against the cleaned bayes database, and the final result is that you stop training in circles because you need a ton of classified ham messages to reduce the poison impact
>>>
>>> But you're testing mail that's already been trained into the database. Even though you stripped the "Bayes-poison" when training, you'll have left enough rare tokens from the headers and elsewhere to effectively "fingerprint" that spam. It's pretty much inevitable that it hits BAYES_99[9].
>>
>> you didn't get what i wrote
>
> I think I did.
>
>> * i removed poison and rebuilt bayes
>> * i verified the *original* junk still containing poison against the new bayes
>>
>> because i am not an idiot to verify cleaned samples against a bayes built of the same contents
>
> The mail you used to train was edited from the mail you used to test, which invalidates the result. When you train a spam you typically add a few dozen hapaxes to the database, and substantially alter the probabilities of many low-count tokens. This means that if you train and retest, the new result almost always matches the training.
the same happens in the other direction if somebody sends you a small, legit mail with just a question and one of the dumb fortune-footers many people use, which was sadly part of the bayes-poison

that mail would get BAYES_95 or BAYES_99 just because of the footer
> When you train with spam that's had its "Bayes poison" removed you still skew the result of a test with the full spam unless removing the poison removes all of the hapaxes and low-count tokens, and that's highly unlikely.
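
A toy sketch of the effect described in the quoted paragraph above. It is not SpamAssassin's Bayes implementation: the tokenizer, the smoothing, the way probabilities are combined, and every message and token name below are made up for illustration. It trains on a spam with its poison stripped, then scores the full original and an unrelated spam carrying the same poison.

```python
import math
from collections import Counter


def tokens(text):
    """Very crude tokenizer: lowercase, split on whitespace, dedupe."""
    return set(text.lower().split())


class ToyBayes:
    """Hypothetical miniature classifier, only meant to show the hapax effect."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # True = spam
        self.n = {True: 0, False: 0}

    def train(self, text, is_spam):
        self.counts[is_spam].update(tokens(text))
        self.n[is_spam] += 1

    def token_prob(self, tok):
        # Laplace-smoothed spam probability for a single token.
        s = (self.counts[True][tok] + 1) / (self.n[True] + 2)
        h = (self.counts[False][tok] + 1) / (self.n[False] + 2)
        return s / (s + h)

    def score(self, text):
        # Sum the log-odds of tokens seen in training; unseen tokens are ignored.
        seen = [t for t in tokens(text)
                if t in self.counts[True] or t in self.counts[False]]
        logit = sum(math.log(p / (1 - p)) for p in map(self.token_prob, seen))
        return 1 / (1 + math.exp(-logit))


poison = "the river poem by twain"  # text that also occurs in ham
# Spam with the poison stripped; the odd tokens stand in for header hapaxes.
stripped = ("buy pills msgid-77fa3 relay-zq81 xmailer-bulk9 tracker-kk42 "
            "helo-mx19 bnd-5521 unsub-q7 img-c0ffee")
full = stripped + " " + poison  # the original, unmodified spam
# A different spam with the same poison but none of the trained hapaxes.
fresh = "buy pills msgid-10ab9 relay-pp07 xmailer-new3 tracker-zz55 " + poison

db = ToyBayes()
for ham in ("meeting tomorrow about the river project",
            "the poem by twain was a good read",
            "lunch tomorrow by the river"):
    db.train(ham, is_spam=False)
for spam in ("cheap pills buy now", "buy cheap watches now", stripped):
    db.train(spam, is_spam=True)

retest_score = db.score(full)   # mail derived from what was trained
fresh_score = db.score(fresh)   # mail the database has never seen
print(f"retest of trained spam: {retest_score:.3f}")
print(f"unseen spam, same poison: {fresh_score:.3f}")
```

In this toy setup the retested message scores above 0.9 while the unseen one stays below 0.5, because the eight trained hapaxes outweigh the poison tokens. Real databases are far larger, so the numbers will differ; the sketch only shows why a retest of already-trained mail cannot tell the two cases apart.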
the point is: when you remove 70% of a message because it is poison in the form of mark twain poems and similar bad jokes, and *after* that test the unaltered message with the poem included, and it gets BAYES_99 on a corpus with 30000 samples, training works as expected

the final result is no BAYES_50 hits in the whole ham corpus, where they were around 2% before the cleanups, and that was also "testing mail that's already been trained into the database"
why would you want poems or cooking recipes trained as spam?
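
A minimal sketch of the cleanup step, assuming the poison has already been identified by hand as a list of known passages. The function name `strip_poison` and the paragraph-level matching are hypothetical; the thread does not say how the corpus was actually edited.

```python
def strip_poison(body, poison_passages):
    """Drop every paragraph of body that matches a known poison passage.

    Matching ignores case and whitespace differences, so a poem that was
    re-wrapped by the sender is still recognised.
    """
    def norm(text):
        return " ".join(text.lower().split())

    known = {norm(p) for p in poison_passages}
    paragraphs = body.split("\n\n")  # assumes blank-line separated paragraphs
    kept = [p for p in paragraphs if norm(p) not in known]
    return "\n\n".join(kept)


# Made-up sample message: payload, poison poem, payload.
sample = "Buy pills now\n\nThe river poem\nby Twain\n\nClick here"
cleaned = strip_poison(sample, ["the river poem by twain"])
print(cleaned)
```

Only the cleaned text would then be trained as spam, so the poem and similar filler never enter the spam token counts.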