Kingsley G. Morse Jr. writes:

> Being an old AI/GA programmer who just started using
> SA, your post fascinates me. Thanks for the update on
> your research.
> [...]
> It seems to me that it would be interesting to consider a _summary_ of 
> 
>         a.) The percentage of false positives and
>         negatives _before_ testing for date differences
> 
>         b.) The percentage of false positives and
>         negatives _after_ testing for date differences

Agreed.  I think Craig might do something like this for each major
version of SpamAssassin.

I used my limited data set (6146 messages) and a very simple guess at
scores: I kept the original score for differences of 96 hours or more
and halved it for each shorter period.  I think most of the GA scores
will end up much higher than these, but I had to start somewhere, and I
usually err on the low side to avoid false positives.

  score DATE_IN_FUTURE_03_06           0.072
  score DATE_IN_FUTURE_06_12           0.145
  score DATE_IN_FUTURE_12_24           0.290
  score DATE_IN_FUTURE_24_48           0.580
  score DATE_IN_FUTURE_48_96           1.159
  score DATE_IN_FUTURE_96_XX           2.318
  score DATE_IN_PAST_03_06             0.072
  score DATE_IN_PAST_06_12             0.145
  score DATE_IN_PAST_12_24             0.290
  score DATE_IN_PAST_24_48             0.580
  score DATE_IN_PAST_48_96             1.159
  score DATE_IN_PAST_96_XX             2.318
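
For anyone who wants to regenerate the list from a different starting
value, the halving scheme described above can be sketched in a few lines
of Python.  This is only an illustration: the names BASE_SCORE, BUCKETS,
halved_scores and score_lines are made up for this sketch, and rounding
half-up to three decimals is an assumption that happens to reproduce the
numbers listed.

```python
from decimal import Decimal, ROUND_HALF_UP

# Starting score for the 96-hours-or-more bucket, taken from the list above.
BASE_SCORE = Decimal("2.318")

# Buckets ordered from the largest date difference to the smallest.
BUCKETS = ["96_XX", "48_96", "24_48", "12_24", "06_12", "03_06"]


def halved_scores(base=BASE_SCORE, buckets=BUCKETS):
    """Return {bucket: score}, halving the score for each shorter period."""
    scores = {}
    for steps, bucket in enumerate(buckets):
        value = base / (2 ** steps)
        # Assumed rounding: half-up to three decimal places.
        scores[bucket] = value.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
    return scores


def score_lines():
    """Build 'score RULE VALUE' lines for both the FUTURE and PAST rules."""
    lines = []
    for direction in ("FUTURE", "PAST"):
        for bucket, score in sorted(halved_scores().items()):
            lines.append(f"score DATE_IN_{direction}_{bucket} {score}")
    return lines


if __name__ == "__main__":
    print("\n".join(score_lines()))
```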

before:

  nonspam     4819 correct, 5 false positives
  spam        1170 correct, 152 false negatives

after:

  nonspam     4819 correct, 5 false positives
  spam        1172 correct, 150 false negatives

So, no additional false positives and 2 fewer false negatives.  I think
the GA will widen that gain by quite a bit.  Until then, I believe it's
premature to do this sort of analysis, but since you asked for it...
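
Since the question was about percentages, here is a small Python sketch
that turns the counts above into rates.  The function name rates and the
choice of denominators (false positives over all nonspam, false
negatives over all spam) are my assumptions for this sketch; other
definitions are possible.

```python
def rates(nonspam_correct, false_positives, spam_correct, false_negatives):
    """Convert raw counts into false positive / false negative percentages."""
    nonspam_total = nonspam_correct + false_positives
    spam_total = spam_correct + false_negatives
    return {
        # Share of nonspam messages wrongly flagged as spam.
        "fp_pct": 100.0 * false_positives / nonspam_total,
        # Share of spam messages that slipped through.
        "fn_pct": 100.0 * false_negatives / spam_total,
        "total": nonspam_total + spam_total,
    }


# Counts copied from the before/after results above.
before = rates(4819, 5, 1170, 152)
after = rates(4819, 5, 1172, 150)

for label, r in (("before", before), ("after", after)):
    print(f"{label}: FP {r['fp_pct']:.2f}%  FN {r['fn_pct']:.2f}%  "
          f"({r['total']} messages)")
```

The totals come to 6146 messages in both cases, which matches the size
of the data set.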

>         c.) _How_many_ more rules would be added.

11 additional rules, but only the first invocation takes any significant
amount of time.  Subsequent invocations are probably faster than most
regular expression rules, since each one just does a numerical
comparison against a cached number.
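
To illustrate the caching idea, here is a rough Python sketch.  It is
not SpamAssassin's actual code (which is Perl); the Message class, the
computations counter, and the boundary handling (inclusive lower bound,
exclusive upper bound) are all assumptions made for the example.  The
point is only that the date difference is computed once and every rule
after that is a plain numeric range check.

```python
class Message:
    """Hypothetical message holder; times are seconds since the epoch."""

    def __init__(self, date_header_time, received_time):
        self.date_header_time = date_header_time
        self.received_time = received_time
        self._diff_hours = None  # cache, filled on first use
        self.computations = 0    # counts how often the diff is computed

    def date_diff_hours(self):
        """Hours the Date header is ahead of (+) or behind (-) receipt."""
        if self._diff_hours is None:
            self.computations += 1
            delta = self.date_header_time - self.received_time
            self._diff_hours = delta / 3600.0
        return self._diff_hours


# (low, high) bounds in hours; None means no upper bound.
BUCKETS = [(3, 6), (6, 12), (12, 24), (24, 48), (48, 96), (96, None)]


def rule_name(direction, low, high):
    top = "XX" if high is None else f"{high:02d}"
    return f"DATE_IN_{direction}_{low:02d}_{top}"


def matching_rules(msg):
    """Return the names of all date rules that fire for this message."""
    hits = []
    for low, high in BUCKETS:
        for direction, sign in (("FUTURE", 1), ("PAST", -1)):
            diff = sign * msg.date_diff_hours()  # cheap after the first call
            if diff >= low and (high is None or diff < high):
                hits.append(rule_name(direction, low, high))
    return hits
```

For a message whose Date header is 30 hours ahead of the time it was
received, matching_rules() returns only DATE_IN_FUTURE_24_48, and the
difference is computed a single time even though all twelve range
checks run.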

I think you're better off adding the rules, seeing how they work, and
removing the slowest and worst performers later.

Dan

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
