[SAtalk] bayesian pollution

hank Fri, 31 Oct 2003 08:39:11 -0800

Greetings,

I am currently running a multi-user system in which mail is filtered
using a centralized database of tokens.  While I realize it is not the
ideal solution for filtering, I am in the process of implementing a system
that will allow users to submit Spam/Ham samples to their own separated
database.  For some this is the ideal solution, for others it represents
a new level of complication that they would rather not deal with.


My question concerns recent reports of Spam with large blocks of
unrelated text appended.  Some of these blocks are movie reviews or other
random articles which, I fear, may 'pollute' our token database in a way
that makes it less effective.  I am seeking recommendations on what to do
with these messages: if I allow them to be learned from, will there be
any negative impact?

Admittedly, I don't have a firm enough grasp of how classification works;
in particular, how the Bayesian component of SA decides which words are
the most "interesting".  However, this may not be relevant to the problem.
Will the simple task of submitting enough Ham resolve this issue?

While I continue to search the Web for an answer, I encourage anyone
interested in this topic to comment.  Thank you for helping me understand
this situation better.

Yours truly,
Adam J. Henry


_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
