I've been using freqdiff for SA rule testing and I think it might be interesting to other people doing the same. Testing SA modifications is much more convenient with it.
I originally wrote it to compare the frequency of words, headers, and similar counts generated using "sort|uniq -c", but it does a bit more than that now. :-)

Before freqdiff, when I made a modification and ran mass-check over a spam corpus, I'd often compare the results as follows:

  $ cd masses
  $ ./mass-check <options> spam-folder > spam-orig.log
  $ ./mass-check <options> good-folder > good-orig.log
  [ make some changes ]
  $ ./mass-check <options> spam-folder > spam.log
  $ ./mass-check <options> good-folder > good.log
  $ awk '{print $4}' spam-orig.log|tr , '\n'|sort|uniq -c|sort -nr > spam-orig.count
  $ awk '{print $4}' spam.log|tr , '\n'|sort|uniq -c|sort -nr > spam.count
  $ diff -U 0 spam-orig.count spam.count

That would be followed by a similar sequence for the nonspam corpus.

With freqdiff, the last three commands shrink down to one (it can read mass-check logs directly in addition to the "sort|uniq -c" type of format), and the result is clearer than reading diff output.

  $ freqdiff spam-orig.log spam.log
     7 SUBJ_FULL_OF_8BITS
     4 SUBJ_ALL_CAPS
     3 DATE_IN_FUTURE_06_12
    -2 FROM_BTAMAIL
    -3 INVALID_DATE

The above means that SUBJ_FULL_OF_8BITS appears 7 more times in spam.log than it did in spam-orig.log, and so on. Items which appear the same number of times in each input are not printed (unless you are printing percentages, see below).

To show rule percentages:

  $ freqdiff -r spam.log good.log | tail -20 | head -10
   76.34 MSG_ID_ADDED_BY_MTA
   75.23 X_PRIORITY_HIGH
   74.20 PORN_10
   70.82 FROM_AND_TO_SAME
   68.34 DOUBLE_CAPSWORD
   68.27 DATE_IN_PAST_48_96
   66.85 URI_IS_POUND
   66.64 TO_LOCALPART_EQ_REAL
   61.74 REPLY_TO_EMPTY
   48.58 WEIRD_PORT

The above means that 74.20% of the time PORN_10 appears, it matches a spam. If you reversed the order of spam.log and good.log on the command line, PORN_10 would then be 25.80%. The -r flag tells freqdiff to account for the differing sizes of spam.log and good.log. If you want absolute percentages, just drop the "-r".
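The arithmetic behind the two kinds of output can be sketched roughly as follows. This is an illustrative Python sketch, not the attached script: the function names and the sample counts are made up, and the -r formula (scale each count by the size of its input, then take the first input's share) is an assumption inferred from the description above, chosen because it makes the two orderings add up to 100%.

```python
def count_diff(old, new):
    """Return {item: new_count - old_count}, skipping unchanged items."""
    diff = {}
    for item in set(old) | set(new):
        delta = new.get(item, 0) - old.get(item, 0)
        if delta != 0:
            diff[item] = delta
    return diff


def relative_percent(count_a, total_a, count_b, total_b):
    """Assumed meaning of -r: share of hits in input A after scaling
    each count by the number of messages in its input."""
    rate_a = count_a / total_a
    rate_b = count_b / total_b
    return 100.0 * rate_a / (rate_a + rate_b)


def absolute_percent(count_a, count_b):
    """Assumed meaning without -r: share of the raw hits found in input A."""
    return 100.0 * count_a / (count_a + count_b)


# Hypothetical rule counts before and after a change.
old = {"SUBJ_ALL_CAPS": 10, "INVALID_DATE": 8, "PORN_10": 5}
new = {"SUBJ_ALL_CAPS": 14, "INVALID_DATE": 5, "PORN_10": 5}

# Largest increase first, as in the freqdiff listing.
for item, delta in sorted(count_diff(old, new).items(), key=lambda kv: -kv[1]):
    print(f"{delta:4d} {item}")

# Hypothetical rule: 300 hits in 1000 spam messages, 50 hits in 500 good ones.
print(f"{relative_percent(300, 1000, 50, 500):.2f}")
print(f"{absolute_percent(300, 50):.2f}")
```

Under this assumed formula, swapping the two inputs gives the complementary percentage, which matches the 74.20% and 25.80% figures quoted for PORN_10.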
There are several other options, and you can also display the scores as part of the output, but freqdiff will generally do the right thing without any options (with the exception of -r, which has to be given explicitly).

If you wanted to do a simple comparison of word frequencies:

  $ perl -pe 's/\s+/\n/g' spam-folder | sort | uniq -c > spam.freq
  $ perl -pe 's/\s+/\n/g' good-folder | sort | uniq -c > good.freq
  $ freqdiff -r spam.freq good.freq

Anyway, I hope that gives you the general idea. The script is attached below. Please let me know if you have any comments.

Dan

[ATTACHMENT ~/scripts/freqdiff, text/plain]