I'm in the process of revising the date difference testing. So far, here's what I've done:
- fixed timezone addition/subtraction (it was sign-reversed!)
- don't compare unparseable dates (caused false positives)
- don't require seconds (per RFC 2822)
- added support for North American timezone names (per RFC 2822)
- separated tests: past vs. future, and differing error regions

My general guess was that:

- dates from the future are "spammier" than dates from the past
- larger errors are also "spammier"

So, I wanted to break the test down further to give the GA more
flexibility in scoring and to match more spam.

Original results
----------------

The current CVS tree. The test set is basically 4 months of my
incoming mail.

1322 spam messages:
    73 DATE_IN_FUTURE
  1249 negative

4824 good messages:
     0 DATE_IN_FUTURE
  4824 negative

First pass results
------------------

The numbers in rule names are ranges of days.

1322 spam messages:
    64 DATE_IN_PAST_4_MORE
     2 DATE_IN_PAST_2_4
    27 DATE_IN_PAST_1_2
  1207 negative
    12 DATE_IN_FUTURE_1_2
     7 DATE_IN_FUTURE_2_4
     3 DATE_IN_FUTURE_4_MORE

All 6 messages that hit the old DATE_IN_FUTURE test but neither
DATE_IN_PAST_4_MORE nor DATE_IN_FUTURE_4_MORE were unparseable, but
each was caught by the current INVALID_DATE test.

4824 good messages:
     0 DATE_IN_PAST_4_MORE
     2 DATE_IN_PAST_2_4
     1 DATE_IN_PAST_1_2
  4821 negative
     0 DATE_IN_FUTURE_1_2
     0 DATE_IN_FUTURE_2_4
     0 DATE_IN_FUTURE_4_MORE

Second pass results
-------------------

With such a small number of false positives, I figured I would go
crazy and add tests for date errors of less than one day in the
future or past. It seemed crazy since the current cutoff is 4 days,
but it worked much better than I expected (for once). The numbers in
rule names are now ranges of hours.
1322 spam messages:
    64 DATE_IN_PAST_96_XX
     2 DATE_IN_PAST_48_96
    27 DATE_IN_PAST_24_48
    63 DATE_IN_PAST_12_24
    64 DATE_IN_PAST_06_12
    56 DATE_IN_PAST_03_06
   785 negative
    17 DATE_IN_FUTURE_03_06
   157 DATE_IN_FUTURE_06_12
    65 DATE_IN_FUTURE_12_24
    12 DATE_IN_FUTURE_24_48
     7 DATE_IN_FUTURE_48_96
     3 DATE_IN_FUTURE_96_XX

4824 good messages:
     0 DATE_IN_PAST_96_XX
     2 DATE_IN_PAST_48_96
     1 DATE_IN_PAST_24_48
     4 DATE_IN_PAST_12_24
     5 DATE_IN_PAST_06_12
     3 DATE_IN_PAST_03_06
  4800 negative
     8 DATE_IN_FUTURE_03_06
     1 DATE_IN_FUTURE_06_12
     0 DATE_IN_FUTURE_12_24
     0 DATE_IN_FUTURE_24_48
     0 DATE_IN_FUTURE_48_96
     0 DATE_IN_FUTURE_96_XX

Based on these results, my inclination is to add all of the above
rules and let the GA sort it out. They all look pretty good to me
except for DATE_IN_FUTURE_03_06 (it could just be my small sample
size, though, because 7 out of the 9 false positives were from one
guy). If any rules at the lower end of the range have a negative
score, we can remove them later.

I also tried 1-6, 6-12, 12-18, and 18-24 hour regions, but there were
too many false positives from about 1-3 hours (which makes sense), so
I settled on 3-6, 6-12, and 12-24 hour regions. That's also one fewer
test, and it looks nicer anyway (each region is twice the size of its
smaller neighbor). I suppose that makes the above my third pass. :-)

My only gripe is that having so many rules is somewhat clumsy in the
scores file, even using arguments. What if spamassassin supported
something like this?

  header     DATE_SHIFT_PAST eval:check_for_shifted_date($1, $2)
  describe   DATE_SHIFT_PAST Date: is $1 to $2 hours before Received: date
  iterations DATE_SHIFT_PAST (3,6),(6,12),(12,24),(24,48),(48,96),(96,Infinity)
  score      DATE_SHIFT_PAST(3,6)  0.541
  score      DATE_SHIFT_PAST(6,12) 1.249
  ...
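For what it's worth, here is a minimal sketch of the bucketing logic the
rules above imply. It's in Python rather than SpamAssassin's Perl, and the
`classify` function and `BUCKETS` table are hypothetical names of mine, not
the actual implementation: parse the Date: header (seconds optional, zone
names allowed, per RFC 2822), take the signed difference against the
Received: time, and map its magnitude into the hour ranges.

```python
from email.utils import mktime_tz, parsedate_tz

# (low, high) hour ranges from the third pass; None marks the open-ended
# "96_XX" bucket.
BUCKETS = [(3, 6), (6, 12), (12, 24), (24, 48), (48, 96), (96, None)]

def classify(date_header, received_epoch):
    """Return a rule name such as 'DATE_IN_FUTURE_06_12', or None."""
    parsed = parsedate_tz(date_header)
    if parsed is None or parsed[9] is None:
        # Unparseable (or no timezone): don't compare -- leave it to the
        # existing INVALID_DATE test rather than risk a false positive.
        return None
    # mktime_tz applies the timezone offset with the correct sign (the bug
    # fixed above was exactly a reversed sign here), giving UTC epoch seconds.
    diff_hours = (mktime_tz(parsed) - received_epoch) / 3600.0
    direction = "FUTURE" if diff_hours > 0 else "PAST"
    magnitude = abs(diff_hours)
    for low, high in BUCKETS:
        if magnitude >= low and (high is None or magnitude < high):
            high_label = "XX" if high is None else "%02d" % high
            return "DATE_IN_%s_%02d_%s" % (direction, low, high_label)
    return None  # within 3 hours: no rule fires
```

So a Date: seven hours ahead of the Received: time lands in
DATE_IN_FUTURE_06_12, and anything off by less than 3 hours matches nothing.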
Dan