At 04:55 PM 1/8/2005, Fajar Priyanto wrote:
Thanks Matt,
So talking statistically, does it mean I have to train SA about 'ham' as many
as 'spam'? Right now, I train SA mostly on spams.

Ideally, yes.

( Personally, my understanding of statistics would say that real-world ratios would be ideal, but Dan Q has pointed out that the SA dev testing shows 50/50 works best. I trust Dan's real test of SA more than my own theoretical observations. )

However, I'd also point out that my own training is wildly imbalanced and works fine. SA's bayes system is quite tolerant of wild variations in the training ratio.

My training ratio is even more imbalanced than real-world spam-ham ratios are. My current training is about 4.1% ham, 95.9% spam, and I have a daily feed of both ham and spam training. My real-world rate is about 40% ham, 60% spam.


I would also say it's fairly important to regularly train at least some ham when you train in spam. Even if the ratio isn't 40/60 or 50/50, it shouldn't be 0/100.
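
As a rough illustration of the ratios discussed above, here is a minimal Python sketch that turns message counts into ham/spam percentages and flags the 0/100 case. The function names and the example counts (41 ham, 959 spam) are hypothetical, picked only so the result matches the 4.1% / 95.9% split mentioned earlier; they are not actual figures from any bayes database.

```python
def ham_spam_percentages(n_ham, n_spam):
    """Return (ham_pct, spam_pct) for the given trained-message counts."""
    total = n_ham + n_spam
    if total == 0:
        raise ValueError("no messages trained yet")
    ham_pct = 100.0 * n_ham / total
    return ham_pct, 100.0 - ham_pct


def needs_ham_training(n_ham, n_spam):
    """True when spam has been trained but no ham at all (the 0/100 case)."""
    return n_spam > 0 and n_ham == 0


if __name__ == "__main__":
    # Hypothetical counts, chosen to reproduce the 4.1% / 95.9% split.
    ham_pct, spam_pct = ham_spam_percentages(41, 959)
    print(f"{ham_pct:.1f}% ham, {spam_pct:.1f}% spam")
    if needs_ham_training(41, 959):
        print("train some ham as well")
```

The check is deliberately loose: it only complains when no ham has been trained at all, since an imbalanced ratio by itself seems to be tolerated.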



