> On Thu, 29 Apr 2010 18:32:04 +0200
> "Giampaolo Tomassoni" <g.tomass...@libero.it> wrote:
>
> > > what you need to do write a script that divides the metadata
> > > num_spam value and all the token Nspam counts by 3. The updated
> > > database can then be loaded back in with --restore.
> >
> > I don't know if this is going to be effective. After all, this way
> > you are basically lowering the effectiveness of all the spam tokens,
> > even potentially remarkable ones.
>
> Correct, but if those counts came from autolearning 90% of spam and 30%
> of ham, then rescaling may be the correct thing to do.
>
> It may also be pragmatic, if a high spam/ham ratio is leading to FPs,
> to keep the learned ratio closer to 1:1 than the actual ratio.
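For concreteness, such a rescaling script could look roughly like the
sketch below. It assumes the usual tab-separated "sa-learn --backup"
dump layout, i.e. "v <value> num_spam" metadata lines and
"t <nspam> <nham> <atime> <token>" token lines; the field order here
is an assumption, so check a line of your own dump before trusting it:

  #!/usr/bin/env python
  # Rough, untested sketch: divide the spam counts in an
  # "sa-learn --backup" text dump by 3, pass everything else through.
  import sys

  FACTOR = 3

  for line in sys.stdin:
      fields = line.rstrip("\n").split("\t")
      if fields[0] == "v" and len(fields) > 2 and fields[2] == "num_spam":
          fields[1] = str(int(fields[1]) // FACTOR)  # total spam message count
      elif fields[0] == "t":
          fields[1] = str(int(fields[1]) // FACTOR)  # per-token spam count
      print("\t".join(fields))

Roughly: sa-learn --backup > dump.txt, run the script on dump.txt,
then sa-learn --clear and sa-learn --restore the rescaled file.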
I was almost convinced your pragmatic suggestion was wrong in
principle, but after checking the Bayesian filtering equation and
seeing what happens when a 1:1 ratio between received spam and ham
messages is forced, I see that it holds:

  P'(S|W) <= P(S|W)  iff  P(H) <= P(S)

That is, the probability of a message being spam because it contains
word W is lower after forcing the 1:1 ratio if and only if the prior
probability of a message (irrespective of its content) being ham is
lower than its probability of being spam. Which is the case here.

However, the fact that Frank's database was built from about three
times more spam messages than ham ones means his Bayes learned 75%
spam and 25% ham. This is in line with worldwide statistics, which
put spam at roughly 70%-80% of all mail.

So, while your approach may yield a lower FP rate (which is a very
important result), it doesn't mean Bayes is going to be more accurate
overall. The Bayes filtering equation is designed around the fact
that the probability of receiving a spam message is quite different
from (i.e. greater than) that of receiving a ham one, precisely in
order to better mimic real-world data.

> > I would instead, in order of effectiveness:
> >
> > a) expire old tokens;
>
> Token retention is a good thing. The only reason for ageing-out tokens
> is to limit the database size.

That is not the only reason for ageing out tokens. Ham and spam
tokens evolve with time: there is a point at which a token's nspam
and nham counts no longer reflect the current world. Expiring them
gets (at least partially) rid of stale data. Expiring old tokens also
means adapting to a changing world.

Please note that, from this point of view, the SA implementation of
Bayesian filtering is less than optimal, since it doesn't expire
tokens that roll out of a given time window, but rather the ones that
haven't been seen for a while. The reason for this, of course, is to
keep database size and workload low.

> > b) eliminate tokens with very few ham/spam occurrences.
>
> Some Bayesian filters, such as dspam, allow low-count tokens to be
> aged-out quicker, but the point of that is to free-up space for longer
> retention of high-count tokens.
>
> There's no other reason for deleting them. Either a low-count token is
> never seen again, in which case it's just wasting space, or we are
> still learning it's frequencies, in which case resetting the counters
> makes no sense.

My suggestion doesn't mean one has to do b) and/or c) every day. It
is something one might do once a year, so that the error introduced
by removing "currently learning" tokens stays small and limited in
time.

> > c) eliminate tokens with very close nham to nspam values;
>
> This is only superficially appealing - similar arguments apply to "b".
> What's the point in deleting a token with counts of 7483:7922 when an
> hour later it might be back at 2:0?

2:0 means a definitive answer about a token's spamminess or
hamminess. Removing tokens where nham ~ nspam means discarding the
history of a token that currently plays no role, letting it have a
fresh start in today's world.

Let's say 7483 is nham. Then:

  P(S) = .75
  P(H) = .25
  P(S|W) = 7922*.75 / (7922*.75 + 7483*.25) ~ .76

An hour later it is:

  P(S|W) = 7922*.75 / (7922*.75 + 7485*.25) ~ .76

which is the same: this token doesn't help anymore. Instead, with
nham=2 and nspam=0 you get:

  P(S|W) = 0*.75 / (0*.75 + 2*.25) = 0

i.e. a definitive answer.
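If it helps, the tiny snippet below reproduces the arithmetic above
with the same simplified single-token formula,
P(S|W) = nspam*P(S) / (nspam*P(S) + nham*P(H)), and also shows the
P'(S|W) <= P(S|W) effect of forcing a 1:1 prior mentioned at the top:

  # Sketch: single-token P(S|W) with an explicit prior.
  def p_spam_given_w(nspam, nham, ps=0.75, ph=0.25):
      num = nspam * ps
      den = nspam * ps + nham * ph
      return num / den if den else 0.5   # no data at all: stay neutral

  print(p_spam_given_w(7922, 7483))            # ~0.76
  print(p_spam_given_w(7922, 7485))            # still ~0.76: the token adds nothing
  print(p_spam_given_w(0, 2))                  # 0.0: a definitive "ham" answer
  print(p_spam_given_w(7922, 7483, 0.5, 0.5))  # forced 1:1 prior: ~0.51 <= 0.76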
The same is of course true with nspam=2 and nham=0, so such a token
may seem dangerous. But please note that SA uses a *naive* Bayesian
filter, so you combine the P(S|W) of every W in the message (not only
this one), and you haven't forgotten all the tokens: only the ones
that were no longer storing any interesting information.

> I don't see anything here that would reduce FPs, "b" and "c" simply
> free-up some space, but "a" means you are not taking any advantage of
> it.

Mine wasn't meant as a short-term solution. It was instead meant to
give tokens a new life.

Anyway, if you fear some short-term misbehavior after applying b) or
c), here is a more conservative approach (a rough sketch is in the
P.S. below). Disable the token-expiration functionality and take a
backup of your Bayes database. After, say, 6 months, take a snapshot
of your current Bayes DB and rebuild it by subtracting from each
token's nham and nspam columns the respective values in the old
backup. Do the same with the num_spam and num_nonspam fields. Remove
the tokens which didn't change in those 6 months (i.e. nham and nspam
are both 0 after subtraction). Also remove the "seen" entries relating
to messages present in both snapshots. Then load the resulting image
back into Bayes.

After this operation, your Bayes database will look as if it had been
started 6 months ago, which may be appealing because it will now
contain data that is at most 6 months old.

Giampaolo
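P.S.: below is a rough, untested sketch of the snapshot-subtraction
procedure above, again working on "sa-learn --backup" text dumps and
again assuming "v <value> <name>" metadata lines and
"t <nspam> <nham> <atime> <token>" token lines (the field order and
the handling of "seen" lines are assumptions, so verify against your
own dump):

  #!/usr/bin/env python
  # Sketch: subtract an old bayes dump from a new one, keeping only the
  # learning that happened in between (e.g. the last 6 months).
  import sys

  def load(path):
      meta, tokens, seen, other = {}, {}, set(), []
      with open(path) as fh:
          for line in fh:
              f = line.rstrip("\n").split("\t")
              if f[0] == "v" and len(f) > 2 and f[2] in ("num_spam", "num_nonspam"):
                  meta[f[2]] = int(f[1])
              elif f[0] == "v":
                  other.append(line)               # e.g. the db_version line
              elif f[0] == "t":
                  tokens[f[4]] = (int(f[1]), int(f[2]), f[3])
              elif f[0] == "s":
                  seen.add(tuple(f[1:]))
      return meta, tokens, seen, other

  old_meta, old_tok, old_seen, _ = load(sys.argv[1])       # backup from 6 months ago
  new_meta, new_tok, new_seen, header = load(sys.argv[2])  # snapshot taken now

  out = sys.stdout
  out.writelines(header)                                   # keep db_version etc. as-is
  for name in ("num_spam", "num_nonspam"):
      out.write("v\t%d\t%s\n" % (new_meta.get(name, 0) - old_meta.get(name, 0), name))
  for tok, (ns, nh, atime) in new_tok.items():
      old_ns, old_nh, _ = old_tok.get(tok, (0, 0, None))
      ns, nh = ns - old_ns, nh - old_nh
      if ns > 0 or nh > 0:                                 # drop tokens unchanged since the backup
          out.write("t\t%d\t%d\t%s\t%s\n" % (ns, nh, atime, tok))
  for s in sorted(new_seen - old_seen):                    # drop "seen" entries present in both dumps
      out.write("s\t" + "\t".join(s) + "\n")

The result should then be loadable with sa-learn --clear followed by
sa-learn --restore.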