> On Thu, 29 Apr 2010 18:32:04 +0200
> "Giampaolo Tomassoni" <g.tomass...@libero.it> wrote:
> 
> > > what you need to do is write a script that divides the metadata
> > > num_spam value and all the token Nspam counts by 3. The updated
> > > database can then be loaded back in with --restore.
> >
> > I don't know if this is going to be effective. After all, this way
> > you are basically lowering the effectiveness of all the spam tokens,
> > even potentially remarkable ones.
> 
> Correct, but if those counts came from autolearning 90% of spam and 30%
> of ham, then rescaling may be the correct thing to do.
> 
> It may also be pragmatic, if a high spam/ham ratio is leading to FPs,
> to keep the learned ratio closer to 1:1 than the actual ratio.

I was almost ready to think your statement of pragmatism was wrong in
principle, but after checking the bayesian filtering equation and seeing
what happens when a 1:1 ratio of received spam to ham messages is forced,
I see that:

        P'(S|W) <= P(S|W) iff P(H) <= P(S)

that is, the probability that a message is spam because it contains word W
is lower after forcing the 1:1 ratio if and only if the original probability
of a message (irrespective of its content) being ham is lower than the
probability of it being spam. Which is the case.
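As a quick sanity check (plain single-token Bayes, not SA code, with token
frequencies I made up for the purpose), the numbers come out the same way:

    # Single-token Bayes: P(S|W) = P(W|S)P(S) / (P(W|S)P(S) + P(W|H)P(H)).
    # The token frequencies below are made up for illustration only.
    def p_spam_given_w(p_w_s, p_w_h, p_s):
        p_h = 1.0 - p_s
        return (p_w_s * p_s) / (p_w_s * p_s + p_w_h * p_h)

    p_w_s, p_w_h = 0.02, 0.01                    # P(W|S), P(W|H)
    real   = p_spam_given_w(p_w_s, p_w_h, 0.75)  # real-world prior, P(S) = .75
    forced = p_spam_given_w(p_w_s, p_w_h, 0.50)  # forced 1:1 ratio
    print(real, forced, forced <= real)          # ~0.857  ~0.667  True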

However, the fact that Frank's database comes from about three times more
spam messages than ham ones means his bayes learned from 75% spam and 25%
ham. This is in line with worldwide statistics about spam, which roughly put
it at 70%-80% of all mail.

So, while your approach may yield a lower FP rate (which is a very important
result), it doesn't mean that Bayes is going to be more accurate overall.
This is because the bayes filtering equation is designed around the fact
that the probability of receiving a spam message is very different from
(i.e.: greater than) that of receiving a ham one, in order to better mimic
real-world data.
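For what it's worth, the rescaling suggested at the top of the thread could
be done on a sa-learn --backup dump with a small script like the sketch
below. I'm assuming from memory that the dump uses tab-separated lines of
the form "v <value> <name>" and "t <nspam> <nham> <atime> <token>"; please
check against your own dump before feeding the result to sa-learn --restore:

    # Sketch: divide num_spam and every token's nspam count by 3 in a
    # "sa-learn --backup" dump read on stdin, writing the result to stdout.
    # The field layout is an assumption, not checked against the SA sources.
    import sys

    FACTOR = 3

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "v" and len(fields) > 2 and fields[2] == "num_spam":
            fields[1] = str(int(fields[1]) // FACTOR)   # global spam count
        elif fields[0] == "t":
            fields[1] = str(int(fields[1]) // FACTOR)   # per-token nspam
        sys.stdout.write("\t".join(fields) + "\n")

Something like "sa-learn --backup > old.txt", then the script, then
"sa-learn --clear" and "sa-learn --restore new.txt".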


> > I would instead, in order of effectiveness:
> >
> >     a) expire old tokens;
> 
> Token retention is a good thing.  The only reason for ageing-out tokens
> is to limit the database size.

That is not the only reason to age out tokens. Ham and spam tokens evolve
over time: there comes a point at which a token's nspam and nham counts no
longer reflect the current world. Expiring them gets (partially) rid of that
stale data. Expiring old tokens also means adapting to a changing world.

Please note that, from this point of view, the SA implementation of bayesian
filtering is less than optimal, since it doesn't expire tokens which fall
outside a given time window, but rather the ones which have not been seen
for a while. The reason for this is, of course, to keep database size and
workload low.


> >     b) eliminate tokens with very few ham/spam occurrences.
> 
> Some Bayesian filters, such as dspam, allow low-count tokens to be
> aged-out quicker, but the point of that is to free-up space for longer
> retention of high-count tokens.
> 
> There's no other reason for deleting them. Either a low-count token is
> never seen again, in which case it's just wasting space, or we are
> still learning its frequencies, in which case resetting the counters
> makes no sense.

My suggestion doesn't mean that one has to do b) and/or c) every day. It is
something one might do once a year, so that the error introduced by removing
tokens which are still being learned is small and limited in time.


> >     c) eliminate tokens with very close nham to nspam values;
> 
> This is only superficially appealing - similar arguments apply to "b".
> What's the point in deleting a token with counts of 7483:7922 when an
> hour later it might be back at 2:0?

2:0 means a definitive answer about the token's spamminess or hamminess.
Removing tokens where nham ~ nspam means discarding the history of a token
which no longer plays any role, giving it a new chance in the current world.

Let's say 7483 is nham; then:

        P(S) = .75
        P(H) = .25
        P(S|W) = 7922*.75 / (7922*.75 + 7483*.25) ~ .76

an hour later, after two more ham hits, it is:

        P(S|W) = 7922*.75 / (7922*.75 + 7485*.25) ~ .76

which is the same: this token doesn't help anymore.

Instead, with nham=2 and nspam=0 you get:

        P(S|W) = 0*.75 / (0*.75 + 2*.25) = 0

i.e.: a definitive answer. The same is of course true with nspam=2 and
nham=0, so this token may seem dangerous; but please note that SA uses a
*naive* bayesian filter, so you "put together" the P(S|W) for each W in the
message (not only this one), and you didn't forget all the tokens: only the
ones which were no longer storing any interesting information.
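A sketch of c) along the same lines; the 0.05 margin and the 100-occurrence
floor are arbitrary numbers of mine, and the priors match the 3:1 example
above:

    # Sketch of c): drop tokens whose weighted spam probability is so close
    # to the overall prior that they no longer carry any information. Same
    # assumed dump layout as before; thresholds are arbitrary.
    import sys

    P_S, P_H = 0.75, 0.25   # priors from the 3:1 spam/ham ratio
    EPSILON  = 0.05         # "close enough to the prior" margin
    MIN_SEEN = 100          # only touch tokens with plenty of history

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "t":
            nspam, nham = int(fields[1]), int(fields[2])
            if nspam + nham >= MIN_SEEN:
                p_s_w = nspam * P_S / (nspam * P_S + nham * P_H)
                if abs(p_s_w - P_S) < EPSILON:
                    continue     # nham ~ nspam: the token tells us nothing
        sys.stdout.write(line)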


> I don't see anything here that would reduce FPs, "b" and "c" simply
> free-up some space, but "a" means you are not taking any advantage of
> it.

Mine wasn't a short-term solution. It was instead meant to bring tokens to a
new life.

Anyway, if you fear the dangers of some short-term misbehavior after
applying b) or c), I have a more conservative approach.

Disable the token expiration functionality and take a backup of your bayes
database. After, say, 6 months, take a snapshot of your current bayes db and
rebuild it by subtracting from each token's nham and nspam columns the
respective values in the old backup. Do the same with the num_spam and
num_nonspam fields. Remove the tokens which didn't change in 6 months (i.e.:
nham and nspam are both 0 after subtraction). Also remove the seen entries
relative to messages present in both snapshots. Load the resulting image
back into bayes.

After this operation, your bayes database will look as if it had been
started 6 months ago, which may be appealing because it will now contain at
most 6-month-old data.
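A rough sketch of that subtraction, again working on two sa-learn --backup
dumps (same assumed field layout as before; seen entries are assumed to be
"s" lines with the message-id in the last field):

    # Sketch: rebuild a bayes image containing only the last 6 months of
    # learning, by subtracting an old "sa-learn --backup" dump (old.txt)
    # from a current one (new.txt) and printing the result on stdout.
    import sys

    old_tokens = {}    # token -> (nspam, nham)
    old_counts = {}    # num_spam / num_nonspam -> value
    old_seen = set()   # message-ids already present in the old snapshot

    with open("old.txt") as old:
        for line in old:
            f = line.rstrip("\n").split("\t")
            if f[0] == "t":
                old_tokens[f[4]] = (int(f[1]), int(f[2]))
            elif f[0] == "v" and f[2] in ("num_spam", "num_nonspam"):
                old_counts[f[2]] = int(f[1])
            elif f[0] == "s":
                old_seen.add(f[-1])

    with open("new.txt") as new:
        for line in new:
            f = line.rstrip("\n").split("\t")
            if f[0] == "t":
                old_s, old_h = old_tokens.get(f[4], (0, 0))
                nspam, nham = int(f[1]) - old_s, int(f[2]) - old_h
                if nspam <= 0 and nham <= 0:
                    continue                 # token unchanged in 6 months
                f[1], f[2] = str(max(nspam, 0)), str(max(nham, 0))
            elif f[0] == "v" and f[2] in ("num_spam", "num_nonspam"):
                f[1] = str(int(f[1]) - old_counts.get(f[2], 0))
            elif f[0] == "s" and f[-1] in old_seen:
                continue                     # message present in both snapshots
            sys.stdout.write("\t".join(f) + "\n")

The result can then be loaded with sa-learn --clear followed by sa-learn
--restore, as usual.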

Giampaolo
