I've been using freqdiff for SpamAssassin (SA) rule testing and I think it
might be interesting to other people doing the same.  Testing SA
modifications is much more convenient with it.

I originally wrote it to compare the frequency of words, headers, and
similar frequency inputs generated using "sort|uniq -c", but it does a
bit more than that now.  :-)

Before, if I made some modification and ran it over a spam corpus, I'd
often compare the results as follows:

  $ cd masses
  $ ./mass-check <options> spam-folder > spam-orig.log
  $ ./mass-check <options> good-folder > good-orig.log
  [ make some changes ]
  $ ./mass-check <options> spam-folder > spam.log
  $ ./mass-check <options> good-folder > good.log
  $ awk '{print $4}' spam-orig.log|tr , '\n'|sort|uniq -c|sort -nr > spam-orig.count
  $ awk '{print $4}' spam.log|tr , '\n'|sort|uniq -c|sort -nr > spam.count
  $ diff -U 0 spam-orig.count spam.count

That would be followed by a similar sequence for the nonspam corpus.
With freqdiff, the last three commands are shrunk down to one (it can
read mass-check logs directly in addition to the "sort|uniq -c" type of
format) and it's clearer than reading diff output.

  $ freqdiff spam-orig.log spam.log
  7       SUBJ_FULL_OF_8BITS
  4       SUBJ_ALL_CAPS
  3       DATE_IN_FUTURE_06_12
  -2      FROM_BTAMAIL
  -3      INVALID_DATE

The above means that SUBJ_FULL_OF_8BITS appears 7 more times in
spam.log than it did in spam-orig.log, and so on.  Items which appear the
same number of times in each input are not printed (unless you are
printing percentages, see below).
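
For anyone curious what those numbers amount to, here is a rough sketch
of the same counting and differencing in plain shell and awk.  It is not
the attached script: rule_counts and count_diff are names made up for
this example, and the only thing assumed about the mass-check log is that
field 4 holds the comma-separated rule list, as in the awk command
earlier.

```shell
# Sketch only: NOT the attached freqdiff script.

# rule_counts LOG
# Print "count<TAB>rule" for every rule named in field 4 of LOG,
# most frequent first.  (Field 4 = rule list is taken from the awk
# command in the manual workflow; nothing else about the log is assumed.)
rule_counts() {
    awk '
        {
            n = split($4, rules, ",")
            for (i = 1; i <= n; i++)
                if (rules[i] != "") count[rules[i]]++
        }
        END { for (r in count) print count[r] "\t" r }
    ' "$1" | sort -k1,1nr -k2
}

# count_diff OLD NEW
# OLD and NEW hold "count item" lines ("sort | uniq -c" style).
# Print "delta<TAB>item" for each item whose count changed,
# largest increase first; unchanged items are skipped.
count_diff() {
    awk '
        NF < 2        { next }               # ignore blank or short lines
        side == "old" { old[$2] = $1 }
        side == "new" { new[$2] = $1 }
                      { seen[$2] = 1 }
        END {
            for (item in seen) {
                delta = new[item] - old[item]   # a missing count is 0
                if (delta != 0) printf "%d\t%s\n", delta, item
            }
        }
    ' side=old "$1" side=new "$2" | sort -k1,1nr -k2
}
```

With those two helpers, the manual comparison would look something like:

  $ rule_counts spam-orig.log > spam-orig.count
  $ rule_counts spam.log > spam.count
  $ count_diff spam-orig.count spam.count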

To show rule percentages:

  $ freqdiff -r spam.log good.log | tail -20 | head -10
  76.34   MSG_ID_ADDED_BY_MTA
  75.23   X_PRIORITY_HIGH
  74.20   PORN_10
  70.82   FROM_AND_TO_SAME
  68.34   DOUBLE_CAPSWORD
  68.27   DATE_IN_PAST_48_96
  66.85   URI_IS_POUND
  66.64   TO_LOCALPART_EQ_REAL
  61.74   REPLY_TO_EMPTY
  48.58   WEIRD_PORT

The above means that 74.20% of the time PORN_10 appears, it is in a
spam message.  If you reversed the order of spam.log and good.log on the
command line, PORN_10 would then be 25.80%.

The -r flag tells freqdiff to account for the differing sizes of spam.log
and good.log.  If you want absolute percentages, just drop the "-r".  There
are several other options, and you can also display the scores as part of
the output, but freqdiff will generally do the right thing without any
options (the exception being -r, which has to be given explicitly).
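
To make the arithmetic concrete, here is a small sketch of one way such
a percentage could be computed.  The formula is my assumption about what
"account for the differing sizes" means (compare hit rates per input
rather than raw counts); it is not taken from the freqdiff source, and
share is a made-up helper name.

```shell
# Sketch only: an ASSUMED reading of the size adjustment, not freqdiff code.
#
# share HITS_A TOTAL_A HITS_B TOTAL_B
#   HITS_x  = number of messages in input x that matched the item
#   TOTAL_x = total number of messages in input x
# Prints the percentage of the size-adjusted hits that fall in input A.
share() {
    awk -v ha="$1" -v ta="$2" -v hb="$3" -v tb="$4" 'BEGIN {
        if (ta <= 0 || tb <= 0) {
            print "share: totals must be positive" > "/dev/stderr"
            exit 1
        }
        ra = ha / ta          # hit rate in input A
        rb = hb / tb          # hit rate in input B
        if (ra + rb == 0) {   # item never appears in either input
            print "share: item has no hits" > "/dev/stderr"
            exit 1
        }
        printf "%.2f\n", 100 * ra / (ra + rb)
    }'
}
```

For example, "share 30 100 20 200" prints 75.00, and swapping the two
inputs ("share 20 200 30 100") prints 25.00, the same complement
relationship as PORN_10 going from 74.20 to 25.80.  Passing 1 for both
totals gives the unadjusted percentage: "share 30 1 20 1" prints 60.00.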

If you wanted to do a simple comparison of word frequencies:

  $ perl -pe 's/\s+/\n/g' spam-folder | sort | uniq -c > spam.freq
  $ perl -pe 's/\s+/\n/g' good-folder | sort | uniq -c > good.freq
  $ freqdiff -r spam.freq good.freq
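
The word-splitting step can also be wrapped in a small function.  This
is only a sketch: word_freq is a made-up name, and it assumes each
folder is a plain text file (an mbox, say) that can be fed in on stdin.
It also drops empty lines so that a blank "word" is not counted.

```shell
# Sketch only; word_freq is a hypothetical helper, not part of freqdiff.
#
# word_freq
# Read text on stdin and print "count word" lines in "sort | uniq -c"
# format, one line per distinct whitespace-separated word.
word_freq() {
    tr -s '[:space:]' '\n' |   # one word per line, runs of whitespace squeezed
        awk 'NF' |             # drop the empty line left by leading whitespace
        sort | uniq -c
}
```

Usage would then be along the lines of:

  $ word_freq < spam-folder > spam.freq
  $ word_freq < good-folder > good.freq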

Anyway, I hope that gives you the general idea.  The script is attached
below.  Please let me know if you have any comments.

Dan

[ATTACHMENT ~/scripts/freqdiff, text/plain]

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
