Bayes - how bad is a small ham corpus with a big spam corpus?

srunschke Mon, 16 Jan 2006 02:34:08 -0800

Hi list,

I'm currently trying to build up a new Bayes DB here, since the auto-built
DB got fubared (as expected, no need to throw things at me ;)). It's rather
easy to build up the spam part, as we are getting more than enough of it,
yet building up the ham part poses a problem.
Much of our mail from affiliated companies or customers comes in
directly via Lotus Notes replication, so there is nothing to feed from there.
Much of the inbound SMTP mail contains private or confidential information,
so I cannot use it: I keep the source of the Bayes messages in a Notes
DB server-side and would run into privacy issues.


So much for where I'm coming from, but now the question is:

Will a small ham corpus - let's say we take the minimum of 200 for the
beginning - compared to a fast-growing spam corpus (currently at around
2000 spam) be a problem and possibly lead to false Bayes scoring?
I'm fairly certain that this possibility exists, of course - it's natural -
but the question is how badly it could influence the scoring and how high
the probability is (approximately).

Any insights on this would be most appreciated.

regards
        sash

--------------------------------------------------
Sascha Runschke
Network Administration
IT-Services

ABIT AG
Robert-Bosch-Str. 1
40668 Meerbusch

Tel.:+49 (0) 2150.9153.226
Mobil:+49 (0) 173.5419665
mailto:[EMAIL PROTECTED]

http://www.abit.net
http://www.abit-epos.net
---------------------------------
Security note regarding email communication:
http://www.abit.net/sicherheitshinweis.html
