Vivek Khera <[EMAIL PROTECTED]> writes:

> I'm curious how you GA score the RBL hits.  RBL's are by definition
> dynamic with IPs going in and out of the lists all the time.  It
> seems to me the only reliable way to score it would be to see if the
> IP being tested was in the RBL at the time the message was
> originally received (or perhaps even a short while later), not at
> the time the GA test is run, perhaps many months after it was sent.

I had the GA score for sets of messages with various ages (less than
6, 3, 2, and 1 months old).  Generally, the score for each RBL
improved as the age of the message reduced.  To avoid overfitting
(since the "short while later" meant a reduced corpus size), I ran the
GA multiple times and used an averaged score.  The score for any RBL
might have been slightly higher if scoring was real-time, but the
scores did not vary all that much.  Why?  Because the main effect of
non-realtime scoring appears to be lower hit rates on spam, not
particularly higher FP rates.  Most of my FPs appear to be persistent
errors such as kernel.org appearing on SPEWS or spammer DSL lines
inherited by innocent users.
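
A minimal sketch of the averaging step described above, in Python. It
only shows how per-rule scores from several GA runs could be averaged;
the rule names and numbers are made up for illustration, and this is
not the actual GA code.

```python
from statistics import mean


def average_scores(runs):
    """Average per-rule scores over several GA runs.

    `runs` is a list of dicts, each mapping a rule name to the score
    one GA run assigned to it.  A rule missing from some runs is
    averaged over the runs that do contain it.
    """
    collected = {}
    for run in runs:
        for rule, score in run.items():
            collected.setdefault(rule, []).append(score)
    return {rule: mean(scores) for rule, scores in collected.items()}


# Hypothetical results of three GA runs (names and values are invented).
runs = [
    {"RBL_A": 2.0, "RBL_B": 1.0},
    {"RBL_A": 3.0, "RBL_B": 1.5},
    {"RBL_A": 2.5, "RBL_B": 0.5},
]
averaged = average_scores(runs)
```

Averaging across runs damps the run-to-run noise that a small corpus
produces, which is the overfitting concern mentioned above.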

Luckily, I did the new GA scores before the new GA algorithm, because
the new algorithm does care about hit percentages and I think
real-time results would be more necessary now.

The average age of messages used for the test was about 2 weeks,
although I had to go further back for some RBLs which don't get many
hits.  I then tested the result on messages received recently and in
all cases (messages more recent than 6 months, 3 months, and 1 month),
the overall FP and FN rates improved by quite a bit (I posted the
results on a bugzilla ticket somewhere).
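
As an illustration of that kind of check, here is a rough Python
sketch that computes FP and FN rates for messages newer than a given
age.  The message layout, rule names, scores, and threshold are all
assumptions for the example, not the data or tooling actually used.

```python
from datetime import datetime, timedelta


def error_rates(messages, scores, threshold, now, max_age_days):
    """Return (fp_rate, fn_rate) for messages newer than max_age_days.

    `messages` is a list of (received, is_spam, hits) tuples, where
    `hits` is the set of rule names that fired on the message.
    """
    cutoff = now - timedelta(days=max_age_days)
    ham = spam = fp = fn = 0
    for received, is_spam, hits in messages:
        if received < cutoff:
            continue  # too old for this age bucket
        total = sum(scores.get(rule, 0.0) for rule in hits)
        flagged = total >= threshold
        if is_spam:
            spam += 1
            fn += not flagged  # spam that slipped through
        else:
            ham += 1
            fp += flagged      # ham wrongly flagged
    fp_rate = fp / ham if ham else 0.0
    fn_rate = fn / spam if spam else 0.0
    return fp_rate, fn_rate


# Invented corpus and scores, purely for demonstration.
scores = {"RBL_A": 2.5, "RBL_B": 1.0}
now = datetime(2002, 10, 1)
messages = [
    (datetime(2002, 9, 25), True, {"RBL_A"}),
    (datetime(2002, 9, 20), True, set()),
    (datetime(2002, 9, 22), False, {"RBL_B"}),
    (datetime(2002, 9, 10), False, {"RBL_A"}),
    (datetime(2002, 1, 1), True, set()),  # dropped by the 30-day cutoff
]
fp_rate, fn_rate = error_rates(messages, scores, 2.0, now, 30)
```

Running it once per age bucket (6 months, 3 months, 1 month) with the
old and the new score sets gives the before/after comparison.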

I should also mention that I modified the GA software to leave the
normally GA-scored tests unmodified; the only scores it had the
ability to change were the non-local tests plus some of the other
tests in the non-GA section.
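
The idea of freezing most scores can be sketched in Python as a
mutation step that only touches an explicit set of mutable rules.
This is a hypothetical illustration of the constraint, with invented
rule names; it is not the modified GA software itself.

```python
import random


def mutate(scores, mutable, rng, step=0.5):
    """Return a copy of `scores` with only the `mutable` rules perturbed.

    Rules outside `mutable` keep their existing score, so the
    optimizer cannot disturb them.
    """
    new_scores = dict(scores)
    for rule in mutable:
        if rule in new_scores:
            new_scores[rule] += rng.uniform(-step, step)
    return new_scores


# Invented example: one frozen local test, one mutable network test.
scores = {"LOCAL_TEST": 1.2, "RBL_A": 2.5}
mutable = {"RBL_A"}
rng = random.Random(0)  # seeded so repeated runs are reproducible
candidate = mutate(scores, mutable, rng)
```

Here `LOCAL_TEST` always comes back unchanged, while `RBL_A` moves by
at most `step` in either direction.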

Obviously, we need to work towards some sort of real-time GA scoring,
not just for correctness, but because I think it's the only way to get
any significant improvement on the current scores.  It's not the
easiest problem since we periodically change the set of RBLs used and
...  well, it's not a simple problem.

Before anyone complains about how my methodology was horrible and
completely non-optimal, IT IS MUCH MORE OPTIMAL THAN THE HAND-SET
WILD-ASS-GUESS SCORES THAT WERE THERE BEFORE.  Thank you.  :-)

Dan


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
