I'm in the process of revising the date difference testing.  So far,
here's what I've done:

  - fixed timezone addition/subtraction (it was sign-reversed!)
  - stopped comparing unparseable dates (caused false positives)
  - stopped requiring seconds (they are optional per RFC 2822)
  - added support for North American timezones (per RFC 2822)
  - separated tests: past vs. future and differing error regions
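
As a rough illustration of the first four items, here is a minimal Python
sketch. It is not the SpamAssassin code; it leans on Python's standard
email.utils parser, and the function name date_to_utc_seconds is made up
for this example. The point is the sign: a "-0400" zone means local time
is 4 hours behind UTC, so the offset is subtracted from the local clock
reading to get UTC.

```python
from email.utils import parsedate_tz, mktime_tz


def date_to_utc_seconds(header_value):
    """Return a Date:/Received: timestamp as UTC epoch seconds, or None.

    None means the date could not be parsed, in which case the caller
    should skip the comparison instead of treating it as a date shift.
    """
    parsed = parsedate_tz(header_value)
    if parsed is None:
        return None
    # mktime_tz computes UTC = local clock reading - zone offset,
    # so "10:00 -0400" becomes 14:00 UTC.
    return mktime_tz(parsed)


# Seconds are optional, and North American zone names are accepted.
numeric = date_to_utc_seconds("Mon, 5 Aug 2002 10:00 -0400")
named = date_to_utc_seconds("Mon, 5 Aug 2002 10:00 EDT")
utc = date_to_utc_seconds("Mon, 5 Aug 2002 14:00:00 +0000")
print(numeric == named == utc)
print(date_to_utc_seconds("not a date"))
```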

My general guess was that:

  - dates from the future are "spammier" than dates from the past
  - larger errors are also "spammier"

So, I wanted to break down the test further to allow the GA more
flexibility in scoring and to match more spam.

Original results
----------------

These results are from the current CVS tree.

The test set is basically 4 months of my incoming mail.

1322 spam messages:

        73      DATE_IN_FUTURE
        1249    negative

4824 good messages:

        0       DATE_IN_FUTURE
        4824    negative

First pass results
------------------

The numbers in rule names are ranges of days.

1322 spam messages:

        64      DATE_IN_PAST_4_MORE
        2       DATE_IN_PAST_2_4
        27      DATE_IN_PAST_1_2
        1207    negative
        12      DATE_IN_FUTURE_1_2
        7       DATE_IN_FUTURE_2_4
        3       DATE_IN_FUTURE_4_MORE

All 6 messages that hit the old DATE_IN_FUTURE test but hit neither
DATE_IN_PAST_4_MORE nor DATE_IN_FUTURE_4_MORE had unparseable dates,
and each of them is caught by the current INVALID_DATE test.

4824 good messages:

        0       DATE_IN_PAST_4_MORE
        2       DATE_IN_PAST_2_4
        1       DATE_IN_PAST_1_2
        4821    negative
        0       DATE_IN_FUTURE_1_2
        0       DATE_IN_FUTURE_2_4
        0       DATE_IN_FUTURE_4_MORE

Second pass results
-------------------

With such a small number of false positives, I figured I would go
crazy and add tests for date errors of less than one day in the
future or past.  At least, it seemed crazy since the current cutoff
was 4 days, but it worked much better than I expected (for once).

The numbers in rule names are now ranges of hours.

1322 spam messages:

        64      DATE_IN_PAST_96_XX
        2       DATE_IN_PAST_48_96
        27      DATE_IN_PAST_24_48
        63      DATE_IN_PAST_12_24
        64      DATE_IN_PAST_06_12
        56      DATE_IN_PAST_03_06
        785     negative
        17      DATE_IN_FUTURE_03_06
        157     DATE_IN_FUTURE_06_12
        65      DATE_IN_FUTURE_12_24
        12      DATE_IN_FUTURE_24_48
        7       DATE_IN_FUTURE_48_96
        3       DATE_IN_FUTURE_96_XX

4824 good messages:

        0       DATE_IN_PAST_96_XX
        2       DATE_IN_PAST_48_96
        1       DATE_IN_PAST_24_48
        4       DATE_IN_PAST_12_24
        5       DATE_IN_PAST_06_12
        3       DATE_IN_PAST_03_06
        4800    negative
        8       DATE_IN_FUTURE_03_06
        1       DATE_IN_FUTURE_06_12
        0       DATE_IN_FUTURE_12_24
        0       DATE_IN_FUTURE_24_48
        0       DATE_IN_FUTURE_48_96
        0       DATE_IN_FUTURE_96_XX

Based on these results, my inclination is to add all of the above rules
and let the GA sort it out.  They all look pretty good to me except for
DATE_IN_FUTURE_03_06 (it could just be my small sample size, though,
because 7 out of the 9 false positives were from one guy).  If any rules
at the lower end of the range have a negative score, we can remove them
later.
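
Since the two corpora differ in size, comparing hit rates rather than raw
counts makes the picture clearer. The sketch below just redoes the
arithmetic on the second pass tables; the counts are copied from them,
and the helper name hit_rates is only for illustration. For
DATE_IN_FUTURE_03_06 it works out to roughly 1.3% of spam against
roughly 0.17% of good mail.

```python
SPAM_TOTAL = 1322
GOOD_TOTAL = 4824

# (rule, spam hits, good hits), copied from the second pass tables
SECOND_PASS = [
    ("DATE_IN_PAST_96_XX", 64, 0),
    ("DATE_IN_PAST_48_96", 2, 2),
    ("DATE_IN_PAST_24_48", 27, 1),
    ("DATE_IN_PAST_12_24", 63, 4),
    ("DATE_IN_PAST_06_12", 64, 5),
    ("DATE_IN_PAST_03_06", 56, 3),
    ("DATE_IN_FUTURE_03_06", 17, 8),
    ("DATE_IN_FUTURE_06_12", 157, 1),
    ("DATE_IN_FUTURE_12_24", 65, 0),
    ("DATE_IN_FUTURE_24_48", 12, 0),
    ("DATE_IN_FUTURE_48_96", 7, 0),
    ("DATE_IN_FUTURE_96_XX", 3, 0),
]


def hit_rates(spam_hits, good_hits):
    """Return (percent of spam hit, percent of good mail hit)."""
    return 100.0 * spam_hits / SPAM_TOTAL, 100.0 * good_hits / GOOD_TOTAL


for rule, spam_hits, good_hits in SECOND_PASS:
    spam_pct, good_pct = hit_rates(spam_hits, good_hits)
    print(f"{rule:22s} spam {spam_pct:6.2f}%   good {good_pct:5.2f}%")
```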

I also tried 1-6, 6-12, 12-18, and 18-24 hour regions, but there were
too many false positives from about 1-3 hours (which makes sense), so
I settled on 3-6, 6-12, and 12-24 hour regions, which is also one fewer
test and looks nicer anyway (each test is twice the size of its
smaller neighbor).  I suppose that makes the above my third pass.  :-)
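
For reference, mapping a date shift onto these regions could look like
the Python sketch below. It is only an illustration, not the actual
test code: the function name rule_for_shift is invented, and treating
each region as lower-bound inclusive and upper-bound exclusive is an
assumption, since the boundary handling is not spelled out here.

```python
# Hour regions from the second pass; None means "no upper bound".
REGIONS = [(3, 6), (6, 12), (12, 24), (24, 48), (48, 96), (96, None)]


def rule_for_shift(shift_hours):
    """Map a Date: shift in hours to a rule name, or None if under 3 hours.

    Positive shift_hours means Date: is later than the Received: date
    (future); negative means earlier (past).
    """
    direction = "FUTURE" if shift_hours > 0 else "PAST"
    size = abs(shift_hours)
    for low, high in REGIONS:
        # Assumed boundaries: low <= size < high
        if size >= low and (high is None or size < high):
            upper = "XX" if high is None else f"{high:02d}"
            return f"DATE_IN_{direction}_{low:02d}_{upper}"
    return None


print(rule_for_shift(-4.5))
print(rule_for_shift(7))
print(rule_for_shift(200))
print(rule_for_shift(1))
```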

My only gripe is that having so many rules is somewhat clumsy in the
scores file, even using arguments.  What if SpamAssassin supported
something like this?

  header DATE_SHIFT_PAST     eval:check_for_shifted_date($1, $2)
  describe DATE_SHIFT_PAST   Date: is $1 to $2 hours before Received: date
  iterations DATE_SHIFT_PAST (3,6),(6,12),(12,24),(24,48),(48,96),(96,Infinity)

  score DATE_SHIFT_PAST(3,6)    0.541
  score DATE_SHIFT_PAST(6,12)   1.249
  ...
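
To make the idea concrete, here is a small Python sketch of how such an
"iterations" line might be expanded into one rule per range. Everything
in it is hypothetical: the "iterations" keyword does not exist, the
expand function is invented for this example, and the ranges are the
hour regions from the second pass.

```python
# Hypothetical ranges, following the second pass hour regions.
ITERATIONS = [(3, 6), (6, 12), (12, 24), (24, 48), (48, 96), (96, "Infinity")]


def expand(name, eval_func, description, iterations):
    """Expand one templated rule into header/describe lines per range."""
    lines = []
    for low, high in iterations:
        rule = f"{name}({low},{high})"
        text = description.replace("$1", str(low)).replace("$2", str(high))
        lines.append(f"header {rule} eval:{eval_func}({low}, {high})")
        lines.append(f"describe {rule} {text}")
    return lines


for line in expand(
    "DATE_SHIFT_PAST",
    "check_for_shifted_date",
    "Date: is $1 to $2 hours before Received: date",
    ITERATIONS,
):
    print(line)
```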

Dan

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
