On 20.02.2015 at 21:29, Dave Warren wrote:
On 2015-02-20 09:44, Bowie Bailey wrote:
On 2/20/2015 12:35 PM, Kevin Miller wrote:
When a fresh spam flood comes in, sometimes 50 or more of my users
will get hit with the same message - just a different user in the To:
line.  When one trains the bayes database, is there a significant
difference between training on all 50+ or just grabbing a few of the
messages and training on them?  Will bayes be more convinced of the
spamminess of a particular message if it sees dozens rather than a
couple?

Yes, there will be a difference.  Training the exact same message
multiple times will not do anything, but if you have 50 copies of the
message that are all slightly different, train them all.

In general, train as much as you can manage.  Ideally, you would train
bayes on every message that passes through your server.  The more data
bayes has, the better it works.

And I'd suggest the same for non-spam: train duplicate ham even if it
happens to be similarly addressed to different users. More data is
(nearly) always better for Bayesian learning systems.

Of course.

When in doubt, the amounts of trained ham and spam should each be close to 50%. That is not a strict rule, since this is a statistical decision and not every mail has the same number of tokens, the same length and so on, but it is a figure you can keep an eye on to check the balance.

*When in doubt*, train more ham than spam: a user can easily delete a spam message that slipped through, but cannot retrieve a false positive that was rejected (on a system that rejects mail scoring above a defined threshold).

$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0      10731          0  non-token data: nspam
0.000          0      11260          0  non-token data: nham
0.000          0    1507863          0  non-token data: ntokens
0.000          0  993467899          0  non-token data: oldest atime
0.000          0 1424463233          0  non-token data: newest atime
0.000          0 1424463292          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
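To keep an eye on that balance without reading the dump by hand, something like the following could work. It is only a rough sketch: it assumes the dump keeps the column layout shown above (the count in the third column, followed by a "non-token data:" label), and the names parse_magic and ham_share are made up for this example, not part of SpamAssassin. The sample text is copied from the output above.

```python
import re

# Sample copied from the `sa-learn --dump magic` output in this mail.
DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0      10731          0  non-token data: nspam
0.000          0      11260          0  non-token data: nham
0.000          0    1507863          0  non-token data: ntokens
"""

# Assumed layout: four whitespace-separated columns, then the label.
LINE_RE = re.compile(r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s*(.+?)\s*$")


def parse_magic(text):
    """Return {label: value} for the 'non-token data' lines of the dump."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            values[match.group(2)] = int(match.group(1))
    return values


def ham_share(values):
    """Fraction of trained messages that were ham, or None if nothing is trained."""
    total = values["nspam"] + values["nham"]
    return values["nham"] / total if total else None


magic = parse_magic(DUMP)
share = ham_share(magic)
print(f"nspam={magic['nspam']} nham={magic['nham']} ham share={share:.1%}")
```

For the numbers above this comes out at a little over half ham, which is the "near 50%, slightly more ham" situation described. On a live system the text would come from running sa-learn --dump magic as the user that owns the bayes database instead of from the DUMP constant.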
