I've had SA 2.55 with Bayes running for some time now, and I'd like to take it a bit further. Here's what I'm wondering and thinking... I'd like to know if others do similar spam maintenance, and if they could share the best methods for getting useful info out of my spam.
- I've got two collections of mail: spam and ham. I already train Bayes with them, but I'd like to use them for more in-depth reports. I'd like to be able to add new rules to a temp (non-active) config file and get feedback on how many messages in the spam corpus each new rule would flag, and also how many it would flag in the non-spam corpus.
- I'd like to be able to create a quick ranking of the most popular rules, and also the least popular rules that get used on these corpuses. I think that might be useful for adjusting scores to custom-fit my spam.
- I'd like to get feedback on my range of hits: to see what my spam scores on average, and consider whether I need to adjust my general limit (right now I've got it set at 5.5 with global rules in place, no individual rules). Also, I'd like to learn what my average ham message scores.
All of the above features can be obtained by running the mass-check tool, followed by the hit-frequencies tool. This will generate the exact same report as the STATISTICS.txt file that comes with SA, but for YOUR corpus, ruleset and scores. In fact, this is exactly how the SA developers do it.
With a bit of creative use of the command-line sort utility, you can re-sort your report by spam-hit count, or any other column.
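If the sort utility gets awkward (the NAME column can contain spaces, and the two "(all messages)" rows are not rules), the same re-sorting can be sketched in a few lines of Python. This is only an illustration: it assumes the report uses the whitespace-separated column layout shown in the STATISTICS.txt excerpt in this message, and the function names parse_report and sort_rules are made up for the example, not part of SA.

```python
# Sample rows copied from the STATISTICS.txt excerpt in this message.
SAMPLE_REPORT = """\
OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 232475   101697   130778    0.437   0.00    0.00  (all messages)
100.000  43.7453  56.2547    0.437   0.00    0.00  (all messages as %)
 19.399   0.0059  34.4798    0.000   1.00   -0.50  REFERENCES
 17.224   0.0128  30.6076    0.000   0.99   -0.50  EMAIL_ATTRIBUTION
"""

# Assumed column order, taken from the header line of the excerpt.
NUMERIC_COLUMNS = ["OVERALL%", "SPAM%", "HAM%", "S/O", "RANK", "SCORE"]


def parse_report(text):
    """Return one dict per rule row, skipping comments, header and totals."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("OVERALL"):
            continue
        # Six numeric fields, then the rule name (which may contain spaces).
        fields = line.split(None, 6)
        if len(fields) != 7 or fields[6].startswith("(all messages"):
            continue
        try:
            values = [float(f) for f in fields[:6]]
        except ValueError:
            continue  # not a data row
        row = dict(zip(NUMERIC_COLUMNS, values))
        row["NAME"] = fields[6]
        rows.append(row)
    return rows


def sort_rules(rows, column, descending=True):
    """Sort rule rows by any numeric column, e.g. "SPAM%" or "HAM%"."""
    return sorted(rows, key=lambda r: r[column], reverse=descending)


if __name__ == "__main__":
    for row in sort_rules(parse_report(SAMPLE_REPORT), "SPAM%"):
        print(f'{row["SPAM%"]:9.4f} {row["HAM%"]:9.4f}  {row["NAME"]}')
```

Sorting on "SPAM%" descending gives the rules that hit the most spam; sorting ascending gives the least-used ones, which covers the "most popular / least popular" ranking asked about above.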
The mass-check and hit-frequencies tools can be found in the masses/ subdirectory of the tarball. STATISTICS.txt is in the rules/ subdirectory.
Some bits and pieces of a STATISTICS.txt:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 130678  56.21%  (99.92% of non-spam corpus)
# Correctly spam:      90057  38.74%  (88.55% of spam corpus)
# False positives:       100   0.04%  (0.08% of nonspam, 4551 weighted)
# False negatives:     11640   5.01%  (11.45% of spam, 39435 weighted)
# Average score for spam: 16.4   nonspam: -1.3
# Average for false-pos:   5.9   false-neg: 3.4
# TOTAL:              232475 100.00%
<snip - statistics for other score thresholds follow>
OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 232475   101697   130778    0.437   0.00    0.00  (all messages)
100.000  43.7453  56.2547    0.437   0.00    0.00  (all messages as %)
 19.399   0.0059  34.4798    0.000   1.00   -0.50  REFERENCES
 17.224   0.0128  30.6076    0.000   0.99   -0.50  EMAIL_ATTRIBUTION
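To read the summary block for your own corpus, it may help to see how its percentages follow from the four raw counts. The short Python check below only uses numbers quoted in this message; the variable and function names are made up for the example. Each count is reported once as a share of all messages and once as a share of its own corpus (ham or spam).

```python
# Raw counts from the "SUMMARY for threshold 5.0" block quoted in this message.
correct_ham = 130678   # non-spam scored below the threshold
correct_spam = 90057   # spam scored at or above the threshold
false_pos = 100        # non-spam wrongly flagged as spam
false_neg = 11640      # spam that slipped through

ham_total = correct_ham + false_pos      # size of the non-spam corpus
spam_total = correct_spam + false_neg    # size of the spam corpus
total = ham_total + spam_total           # all messages checked


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


summary = {
    "correct_ham_of_all": pct(correct_ham, total),
    "correct_ham_of_ham": pct(correct_ham, ham_total),
    "correct_spam_of_all": pct(correct_spam, total),
    "correct_spam_of_spam": pct(correct_spam, spam_total),
    "false_pos_of_ham": pct(false_pos, ham_total),
    "false_neg_of_spam": pct(false_neg, spam_total),
}

if __name__ == "__main__":
    print("total messages:", total)
    for name, value in summary.items():
        print(f"{name:22s} {value:6.2f}%")
```

When weighing a different threshold (5.5, say), the two numbers to watch are the last two: the share of ham wrongly flagged and the share of spam missed, as reported in the block for that threshold.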
- Feedback on domains would be fantastic, but I'm not sure whether this is better suited to SA or to my mailer (qmail with vpopmail for the user accounts). I'd like to know which domains send me the most spam.
I don't know how to get you that one.
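One possible approach outside of SA, offered purely as a sketch: tally the domain part of the From: header across the spam corpus with Python's standard library. The helper names sender_domain and count_domains are invented for this example. Keep in mind that From: addresses in spam are very often forged, so the result shows which domains are claimed, not necessarily which hosts actually sent the mail.

```python
from collections import Counter
from email import message_from_string
from email.utils import parseaddr


def sender_domain(msg):
    """Return the lower-cased domain of the From: address, or None."""
    _, addr = parseaddr(str(msg.get("From", "")))
    if "@" not in addr:
        return None
    return addr.rsplit("@", 1)[1].lower()


def count_domains(messages):
    """Count From: domains over any iterable of email message objects."""
    counts = Counter()
    for msg in messages:
        domain = sender_domain(msg)
        if domain:
            counts[domain] += 1
    return counts


# Tiny made-up messages, only to show the shape of the result.
DEMO_MESSAGES = [
    message_from_string("From: Deals <offers@Example.com>\n\nbuy now\n"),
    message_from_string("From: win@example.com\n\nyou won\n"),
    message_from_string("From: someone@example.net\n\nhello\n"),
    message_from_string("Subject: no sender header\n\n...\n"),
]

if __name__ == "__main__":
    for domain, hits in count_domains(DEMO_MESSAGES).most_common(10):
        print(f"{hits:6d}  {domain}")
```

Since vpopmail stores mail in Maildir format, a real corpus could be fed in with something like count_domains(mailbox.Maildir("/path/to/spam", create=False)) after an "import mailbox"; the path here is a placeholder.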