Vivek Khera <[EMAIL PROTECTED]> writes:

> I'm curious how you GA score the RBL hits.  RBL's are by definition
> dynamic with IPs going in and out of the lists all the time.  It
> seems to me the only reliable way to score it would be to see if the
> IP being tested was in the RBL at the time the message was
> originally received (or perhaps even a short while later), not at
> the time the GA test is run, perhaps many months after it was sent.

I had the GA compute scores for sets of messages of various ages (less than 6, 3, 2, and 1 months old). Generally, the score for each RBL improved as the message age went down. To avoid overfitting (since the "short while later" requirement meant a smaller corpus), I ran the GA multiple times and used an averaged score.

The score for any given RBL might have been slightly higher if scoring were real-time, but the scores did not vary all that much. Why? Because the main effect of non-real-time scoring appears to be lower hit rates on spam, not particularly higher FP rates. Most of my FPs appear to be persistent errors, such as kernel.org appearing on SPEWS or spammer DSL lines inherited by innocent users.

Luckily, I did the new GA scores before the new GA algorithm, because that algorithm does care about hit percentages, and I think real-time results would be more necessary now.

The average age of the messages used for the test was about 2 weeks, although I had to go further back for some RBLs which don't get many hits.

I then tested the result on recently received messages, and in all cases (messages more recent than 6 months, 3 months, and 1 month) the overall FP and FN rates improved by quite a bit (I posted the results on a bugzilla ticket somewhere).

I should also mention that I modified the GA software to leave the normally GA-scored tests unmodified; the only scores it was able to change were the non-local tests plus some of the other tests in the non-GA section.

Obviously, we need to work towards some sort of real-time GA scoring, not just for correctness, but because I think it's the only way to get any significant improvement on the current scores. It's not the easiest problem, since we periodically change the set of RBLs used and ... well, it's not a simple problem.

Before anyone complains about how my methodology was horrible and completely non-optimal, IT IS MUCH MORE OPTIMAL THAN THE HAND-SET WILD-ASS-GUESS SCORES THAT WERE THERE BEFORE. Thank you.
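
A minimal Python sketch of the two bookkeeping steps described above: averaging per-rule scores over several GA runs, and measuring a rule's spam hit rate and ham FP rate on messages below an age cutoff. This is not the actual GA software. The function names, the rule names (RBL_A, RBL_B), and the message fields (age_days, is_spam, hits) are all hypothetical, chosen only for illustration.

```python
from statistics import mean


def average_scores(runs):
    """Average each rule's score over several GA runs.

    `runs` is a list of dicts mapping rule name -> score from one run
    (hypothetical format). A rule missing from some runs is averaged
    over the runs that do include it.
    """
    rules = set().union(*runs)
    return {
        rule: mean(run[rule] for run in runs if rule in run)
        for rule in sorted(rules)
    }


def rates_by_age(messages, rule, max_age_days):
    """Return (spam hit rate, ham FP rate) for `rule`, using only
    messages no older than `max_age_days`.

    Each message is a dict with hypothetical keys:
      age_days: int, is_spam: bool, hits: set of rule names that fired.
    """
    recent = [m for m in messages if m["age_days"] <= max_age_days]
    spam = [m for m in recent if m["is_spam"]]
    ham = [m for m in recent if not m["is_spam"]]
    hit_rate = sum(rule in m["hits"] for m in spam) / len(spam) if spam else 0.0
    fp_rate = sum(rule in m["hits"] for m in ham) / len(ham) if ham else 0.0
    return hit_rate, fp_rate
```

Calling rates_by_age with cutoffs of roughly 180, 90, 60, and 30 days would show how hit and FP rates for one RBL rule move as the corpus gets younger, and average_scores would smooth out run-to-run noise when the younger corpus is small.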
:-)

Dan

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk