On 19.03.2015 at 23:52, RW wrote:
On Thu, 19 Mar 2015 20:46:10 +0100, Reindl Harald wrote:

On 19.03.2015 at 20:35, RW wrote:
On Thu, 19 Mar 2015 01:12:15 +0100, Reindl Harald wrote:


The last point is easy to prove by keeping the old, unmodified
corpus and running spamc against the cleaned Bayes database. The
final result is that you stop training in circles, because otherwise
you need a ton of classified ham messages to reduce the poison impact.
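
To put a rough number on the "ton of ham" point, here is a minimal
Python sketch of a Robinson-style per-token probability. This is not
SpamAssassin's actual Bayes code; the function, the hit counts and the
corpus size are made up purely for illustration.

def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham, s=1.0, x=0.5):
    # Robinson-style degree of belief f(w) for one token: relative
    # frequencies, shrunk towards the neutral prior x while the token
    # has only been seen a few times.
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    p = spam_freq / (spam_freq + ham_freq) if spam_freq + ham_freq else x
    n = spam_hits + ham_hits
    return (s * x + n * p) / (s + n)

N_SPAM = N_HAM = 15000   # hypothetical corpus sizes, numbers made up

# a "poison" token (a word from a quoted poem) trained in 50 spams:
print(token_spam_prob(50, 0, N_SPAM, N_HAM))   # ~0.99

# how many trained ham messages containing that same token does it
# take to pull it back to neutral?
ham_hits = 0
while token_spam_prob(50, ham_hits, N_SPAM, N_HAM) > 0.5:
    ham_hits += 1
print(ham_hits)   # 50 -- one ham hit per spam hit, and poem words
                  # almost never show up in real ham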

But you're testing mail that's already been trained into the
database. Even though you stripped the "Bayes-poison" when
training, you'll have left enough rare tokens from the headers and
elsewhere to effectively "fingerprint" that spam. It's pretty much
inevitable that it hits BAYES_99[9].

You didn't get what I wrote.

I think I did.

* I removed the poison and rebuilt Bayes
* I verified the *original* junk, still containing the poison, against
    the new Bayes, because I am not an idiot who verifies cleaned
    samples against a Bayes database built from the same contents

The mail you used to train was edited from the mail you used to
test, which invalidates the result.

When you train a spam you typically add a few dozen hapaxes to the
database, and substantially alter the probabilities of many low-count
tokens. This means that if you train and retest, the new result almost
always matches the training.
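
As a rough illustration of that hapax effect, here is a small Python
sketch with made-up token probabilities. It combines them naively
rather than with SpamAssassin's chi-squared method, but the direction
of the effect is the same; the 0.75 per-hapax value assumes a
Robinson-style f(w) with one spam hit and no ham hits.

from math import prod

def combine(probs):
    # naive-Bayes style combination of per-token spam probabilities
    spamminess = prod(probs)
    hamminess = prod(1 - p for p in probs)
    return spamminess / (spamminess + hamminess)

# training one spam leaves a few dozen hapaxes in the database:
# tokens seen exactly once, only in spam, each well above neutral
hapaxes = [0.75] * 30
neutral = [0.5] * 200          # the rest of the message, no signal

print(combine(neutral))             # 0.5  -- nothing to go on
print(combine(neutral + hapaxes))   # ~1.0 -- the hapaxes alone decide it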

The same happens in the other direction: somebody sends you a small,
legit mail with just a question plus one of the dumb fortune footers
many people use, and that footer was sadly part of the Bayes poison.

That mail would get BAYES_95 or BAYES_99 just because of the footer.

When you train with spam that's had its "Bayes poison" removed you
still skew the result of a test with the full spam unless removing the
poison removes all of the hapaxes and low-count tokens, and that's
highly unlikely.
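
The disputed scenario, again sketched with made-up numbers rather than
SpamAssassin's real combining: the spam is trained with the quoted
poem stripped out, and the original, unstripped mail is then scored
against that database.

from math import prod

def combine(probs):
    # same naive combination as in the sketch above, repeated so this
    # snippet stays self-contained
    spamminess = prod(probs)
    hamminess = prod(1 - p for p in probs)
    return spamminess / (spamminess + hamminess)

trained_rest = [0.75] * 40    # header/body tokens that were trained
poem_tokens  = [0.5] * 120    # stripped before training, so still
                              # unknown and stuck at the neutral prior

print(combine(trained_rest + poem_tokens))   # ~1.0 -- the full mail
                                             # still hits on its
                                             # trained remainder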

The point is: when you remove 70% of a message because it is poison
in the form of Mark Twain poems and similarly bad jokes, and *after*
that you test the unaltered message with the poem included and it
gets BAYES_99 on a corpus with 30000 samples, then training works as
expected.

The final result is zero BAYES_50 hits in the whole ham corpus; they
were around 2% before the cleanups, and that test was also "testing
mail that's already been trained into the database".

why would you want poems or cooking recipes trained as spam?

