Hi list,

I'm currently trying to build a new Bayes DB here, since the autobuilt DB fubared (as expected, no need to throw things at me ;)). Building up the spam part is easy, as we get more than enough of it, but building up the ham part is a problem. Much of the mail from affiliated companies and customers arrives directly via Lotus Notes replication, so there is nothing to feed from there. Much of the inbound SMTP mail contains private or confidential information, so I cannot use it: I keep the source of the Bayes messages in a Notes DB on the server, and I'd run into privacy issues.

So much for where I'm coming from. Now the question: will a small ham corpus (let's say we take the minimum of 200 to begin with) compared to a fast-growing spam corpus (currently around 2000 spam) be a problem and possibly lead to false Bayes scoring? I'm fairly certain that the possibility exists, of course - it's natural - but the question is how badly it could influence the scoring, and how high the probability is (approximately).

Any insights on this would be most appreciated.

regards
sash

--------------------------------------------------
Sascha Runschke
Network Administration
IT-Services
ABIT AG
Robert-Bosch-Str. 1
40668 Meerbusch
Tel.: +49 (0) 2150.9153.226
Mobile: +49 (0) 173.5419665
mailto:[EMAIL PROTECTED]
http://www.abit.net
http://www.abit-epos.net
---------------------------------
Security note regarding email communication:
http://www.abit.net/sicherheitshinweis.html