On Friday, 11 February 2005 17:58, Matt Kettler wrote:
> At 11:15 AM 2/11/2005, Matías López Bergero wrote:
> >The sa-learn man page says that for a good training of the Bayesian
> >filter, you need to train it with equal amounts of spam and ham, or more
> >ham if possible. So if I sa-learn the spam folder, the spam tokens
> >are going to grow a lot compared to ham tokens.
> >Here are my training efforts:
> >
[..]

> >Would this possible increase in the spam data have adverse effects on
> >how the Bayes filter classifies spam and ham messages?
>
> The manpage is suggesting an ideal situation. Really, you can be pretty
> wildly off and bayes will work reasonably well.
>
> Training a lot more spam than ham makes bayes more likely to misclassify a
> nonspam email as spam, but really, I'm VERY off balance and I've not had
> any problems with this at all. The difference between spam and nonspam here
> is just too great. Even a massive imbalance isn't causing FPs.
>
> Look at my stats:
>
> 0.000          0          2          0  non-token data: bayes db version
> 0.000          0     565896          0  non-token data: nspam
> 0.000          0      24693          0  non-token data: nham
> 0.000          0     180900          0  non-token data: ntokens
>
> My spam training outnumbers my ham training by about 23:1. That's pretty far off
> from the ideal 1:1 or 1:1+. I've got more FN problems than FP problems with
> my bayes DB, but I rarely have problems with either.
>
> Also, I trust Dan and the SADevs to know bayes better than I do, and they
> have tested and found 1:1 to work best. However, from a mathematical
> perspective I'm still not sure why that works best. My original impression
> was that the ratio should match your real-world ratio. Sometime when I have
> some spare time I intend to test this myself so I can better understand it.

Agreed: mathematically, your database should represent the real ratio. The
best way is to train SA with every message. If you train on too little spam,
the spam probability comes out too low, and in that case you will get fewer
false positives. My spam : ham ratio is 1 : 40 and it works fine.
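
To make the ratio question concrete, here is a minimal sketch in Python. It is
not SpamAssassin's actual code. The two function names and the example token
(one that shows up in about 1% of both corpora) are made up for illustration;
only the nspam and nham counts come from the stats quoted above. It compares a
raw-count estimate, which inherits the training imbalance, with a Graham-style
estimate that divides by the corpus sizes first.

```python
# Counts taken from the "non-token data" lines quoted above.
NSPAM = 565896
NHAM = 24693


def naive_token_probability(spam_hits, ham_hits):
    """Raw-count estimate: the training imbalance leaks straight in."""
    return spam_hits / (spam_hits + ham_hits)


def normalized_token_probability(spam_hits, ham_hits, nspam, nham):
    """Graham-style estimate: compare the fraction of spam messages that
    contain the token with the fraction of ham messages that contain it,
    so the corpus sizes cancel out."""
    spam_freq = spam_hits / nspam
    ham_freq = ham_hits / nham
    return spam_freq / (spam_freq + ham_freq)


# Spam messages trained per ham message (about 22.9).
ratio = NSPAM / NHAM

# Hypothetical token that appears in roughly 1% of BOTH corpora,
# i.e. it carries no real evidence either way.
spam_hits = NSPAM // 100
ham_hits = NHAM // 100

naive = naive_token_probability(spam_hits, ham_hits)
normalized = normalized_token_probability(spam_hits, ham_hits, NSPAM, NHAM)

print(f"spam:ham training ratio = {ratio:.1f}:1")
print(f"raw-count estimate      = {naive:.3f}")
print(f"normalized estimate     = {normalized:.3f}")
```

With the raw-count estimate, a neutral token looks strongly spammy simply
because far more spam was trained. The normalized estimate stays close to 0.5,
which may be part of why even a large imbalance does little harm in practice.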

Thomas
 
-- 
icq:133073900
http://www.t-arend.de
