On Tuesday, 9 May 2006 23:14 Bowie Bailey wrote:
> When I look at the overall stats, bayes does pretty good:
> RANK    RULE NAME   COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
> ------------------------------------------------------------
>    6    BAYES_99    26754     4.19   44.49   67.00    3.06

A 3% HAM hit rate for BAYES_99 is horrible, not good. It's the false
positives (FPs) that should alarm you.

> But when I do it for only our domain (which is where all the manual
> training happens), it hits less ham, but less spam as well:
> RANK    RULE NAME   COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
> ------------------------------------------------------------
>    8    BAYES_99     4649     3.29   33.41   54.64    0.20

At least that's a much better FP rate, by a factor of 15!
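
For anyone who wants to check the factor, here is a quick sketch of the
arithmetic. It only uses the two %OFHAM values for BAYES_99 from the
tables quoted above; the variable names are mine.

```python
# %OFHAM for BAYES_99, copied from the two sa-stats tables quoted above.
ham_hit_overall = 3.06  # all mail
ham_hit_domain = 0.20   # only the manually trained domain

# How many times lower the HAM hit rate is for the trained domain.
factor = ham_hit_overall / ham_hit_domain
print(f"HAM hit rate drops by a factor of about {factor:.1f}")
```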

> Just my personal email address (which is trained aggressively) gets
> very few ham hits (partly because I lowered my threshold to 4.0), but
> less spam than overall:
> RANK    RULE NAME   COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
> ------------------------------------------------------------
>    5    BAYES_99     1643     3.08   27.05   65.72    0.08

Again, the FPs are reduced...

> And then when I modify sa-stats to exclude our domain, I find that
> our customers (who are trained exclusively with autolearn) seem to do
> better than us:
> RANK    RULE NAME   COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
> ------------------------------------------------------------
>    6    BAYES_99    22105     4.44   47.83   70.35    4.11

No, a 4% FP rate is nothing you should be happy with.

> Based on these results, it almost seems like the more training Bayes
> gets, the worse it does!

But remember that sa-stats can never tell whether that HAM/SPAM really
is what it is labeled as; it just tells you what SA *believed* was
HAM/SPAM.

> Are these anomalies just an artifact of sa-stats relying on SA to
> judge ham and spam properly?  Can these numbers be trusted at all if
> my users don't reliably report false negatives and positives?

As I said on the other thread: be very careful what you feed to bayes.
Try to find those 4% of FPs, and check whether they really are FPs.
Maybe your SA made the mistakes because you don't have enough rules to
detect all the SPAM.
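
One way to hunt for those messages is to scan a mailbox for mail that SA
called HAM although BAYES_99 fired. The following is only a rough
sketch, not something from sa-stats: it assumes your SA writes an
X-Spam-Status header of the form "No, score=... tests=RULE1,RULE2 ..."
and that the mail sits in an mbox file. The function names are made up
for illustration.

```python
import mailbox
import re


def is_suspect_ham(status, rule="BAYES_99"):
    """Return True if SA judged the message HAM although `rule` fired.

    Assumes a header value such as:
      No, score=2.1 required=5.0 tests=BAYES_99,HTML_MESSAGE autolearn=no
    """
    # Unfold wrapped header lines and close gaps after commas in the list.
    status = re.sub(r",\s+", ",", " ".join(status.split()))
    if not status.startswith("No"):
        return False
    match = re.search(r"tests=(\S+)", status)
    return bool(match) and rule in match.group(1).split(",")


def find_suspect_ham(mbox_path, rule="BAYES_99"):
    """Yield (Message-ID, Subject) for each suspect message in an mbox."""
    for msg in mailbox.mbox(mbox_path):
        status = str(msg.get("X-Spam-Status", ""))
        if is_suspect_ham(status, rule):
            yield msg.get("Message-ID"), msg.get("Subject")
```

Calling find_suspect_ham() on a mailbox path gives you a list to review
by hand. Anything in it that turns out to be SPAM was a miss by SA, not
a bayes FP.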

Best regards, zmi
-- 
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0660/4156531                          .network.your.ideas.
// PGP Key:   "lynx -source http://zmi.at/zmi3.asc | gpg --import"
// Fingerprint: 44A3 C1EC B71E C71A B4C2  9AA6 C818 847C 55CB A4EE
// Keyserver: www.keyserver.net                 Key-ID: 0x55CBA4EE

