-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Kettler wrote:
| At 03:39 PM 11/30/2004, Bob Amen wrote:
|
|>     Thank you for that very well written and helpful explanation! Now,
|> do you have a script that computes the test values from a SA log file
|> that you'd care to share?
|
|
| You can't measure any of those performance metrics from logfiles alone..
| there's no way to determine FP and FN count from just logs... Gotta have
| a human for that part.
|
| There are really two ways:
|         1) set up pre-sorted corpus pair and run against that, then
| calculate. You can detect FP and FN by doing separate runs on each half
| of the corpus. Any positives in the ham corpus are FPs...
|
|         2) go through your mail and hand-decide all the FPs and FNs, and
| combine that with the total statistics for that account from your logs.
| Of course, a script that breaks out logs by user, spam count and ham
| count could make that easier.. If your logs have the delivery account in
| the same line as the spam/ham claims of your filter, this could just be
| a simple pair of greps..
|                 grep "mkettler" /var/log/maillog | grep "is spam" | wc -l

There's an alternative to using log analysis or corpus snapshots: record
the data you're looking for in a database in real time, as users
interact with their mail (e.g. through a quarantine management system).

The Maia Mailguard package (which uses SpamAssassin via a
heavily-modified amavisd-new base) lets users manage their filter
settings, whitelists/blacklists, quarantines, and a "ham cache", all
from a web-based interface built with PHP.

Users can release any false positives from their quarantine page, and
similarly they can report false negatives from their ham cache.  Doing
either of these implicitly adjusts the FP and FN stats in Maia's database.

Users (or administrators) essentially do the confirming of ham and spam,
and the reporting of false positives and negatives, just by managing
their quarantines and ham caches.  This then allows a set of Perl
scripts to run at scheduled intervals behind the scenes to train the
Bayes database and do reporting of spam to Razor/Pyzor/DCC.

Since the counts of spam, ham, FP, and FN are maintained in a database
table on a per-user basis, it then becomes trivial to compute PPV, NPV,
Sensitivity, Specificity, and Efficiency for individual users, or the
system as a whole.  Maia summarizes all of this data at the bottom of
its stats page, e.g. <http://www.renaissoft.com/mail/public.php>.
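
For anyone who wants to do the same arithmetic outside of Maia, here is
a minimal Python sketch of the five calculations. It is not Maia's own
code. It assumes "positive" means "flagged as spam" and that you have
already reduced your data to four counts (true/false positives and
negatives); the names FilterCounts and filter_metrics are made up for
this illustration, and how you derive the four counts from your own
tables or logs is up to you.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FilterCounts:
    """Hypothetical container for confirmed counts (positive = spam)."""
    tp: int  # spam correctly flagged as spam
    tn: int  # ham correctly passed as ham
    fp: int  # ham wrongly flagged as spam
    fn: int  # spam wrongly passed as ham


def _ratio(num: int, den: int) -> Optional[float]:
    # Return None rather than dividing by zero when a category is empty.
    return num / den if den else None


def filter_metrics(c: FilterCounts) -> Dict[str, Optional[float]]:
    total = c.tp + c.tn + c.fp + c.fn
    return {
        "ppv": _ratio(c.tp, c.tp + c.fp),          # flagged mail that was spam
        "npv": _ratio(c.tn, c.tn + c.fn),          # passed mail that was ham
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # spam that was caught
        "specificity": _ratio(c.tn, c.tn + c.fp),  # ham that was passed
        "efficiency": _ratio(c.tp + c.tn, total),  # all mail handled correctly
    }


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    for name, value in filter_metrics(FilterCounts(tp=90, tn=880, fp=10, fn=20)).items():
        print(name, "n/a" if value is None else f"{value:.4f}")
```

Summing the four counts over all users before calling filter_metrics
gives the system-wide figures; calling it per user gives the individual
ones.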

- --
Robert LeBlanc <[EMAIL PROTECTED]>
Renaissoft, Inc.
Maia Mailguard <http://www.maiamailguard.com/>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFBrOeDGmqOER2NHewRAl5rAJ4/X5gVMQ/ZnAx25mFC6mbf5KNjpwCfc4WA
DMs4YaSVZBTXJ7nLQlXIE0A=
=Ncf/
-----END PGP SIGNATURE-----
