NOTE:  I read the Corvigo whitepaper, but I don't know ANYTHING about their
product.  Everything below is based entirely on what I inferred from that
single document.

> Essentially AI interpretation of the meaning (or intent as they put it)
> of language in order to identify spam.
>

Correct me if I'm wrong (and I'm sure I have at least over-simplified).
The key to this seems to be in the example "Make m0-ney fast from home!",
where Corvigo ignores "m0-ney" because it is "unknown" and then keys off
of "Make ... fast from home!".  Their claim is that Corvigo's offering is
better than Bayesian filtering because it ignores unknown tokens, instead of
classifying them and then scoring the message on the value of
possibly-improperly-weighted unknowns (like m0-ney).

But doesn't the naive Bayesian algo take care of this inherently?  Doesn't
limiting the number of extrema reduce the likelihood that a new or bizarre
word is considered at all when scoring the e-mail as a whole?  (See -k in
'man bmf' for my definition of extrema.)  Once I hit the sweet spot for the
extrema count (somewhere between 10 and 20 tokens per message, based upon
what is ignored), I am not looking at anything but the "meat" of the
message.  Bayes' algo shows us that if a message contains ~20 distinct
"spam" tokens, the message has a very high (99%) likelihood of being actual
spam.
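
To make the extrema argument concrete, here is a minimal sketch of that kind
of scoring.  It is not bmf's actual code: the function names, the neutral
value of 0.5 for unknown tokens, and k=15 are my own assumptions for
illustration.  The point is only that a token sitting at the neutral value
is the first to be dropped when picking extrema, and contributes nothing
even if it is kept.

```python
from math import prod

def top_extrema(probs, k=15):
    """Keep the k probabilities that lie furthest from the neutral 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:k]

def combined_spam_probability(probs):
    """Naive Bayes combination of per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

def score(tokens, known_probs, k=15, unknown_prob=0.5):
    """Score a message; unknown_prob is an assumed neutral default."""
    probs = [known_probs.get(t, unknown_prob) for t in set(tokens)]
    return combined_spam_probability(top_extrema(probs, k))

# Hypothetical per-token probabilities, made up for this example.
known = {"make": 0.8, "fast": 0.85, "from": 0.55, "home!": 0.9}
with_unknown = "make m0-ney fast from home!".split()
without_unknown = "make fast from home!".split()
```

Under these assumptions, score(with_unknown, known) and
score(without_unknown, known) come out the same, which is the behaviour
Corvigo describes for "m0-ney".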

This can be improved (I think this is one of the ways that CRM-114 works?)
by tokenizing individual words and their neighbors (Make money) and (money
fast).  Again, Bayes can do what Corvigo claims "contextually".
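
A rough sketch of the neighbor idea, assuming plain whitespace tokenizing
and adjacent word pairs only.  I am not claiming this is how CRM-114 does
it; the function name is made up for illustration.

```python
def tokens_with_neighbors(text):
    """Return single words plus each adjacent word pair as extra tokens."""
    words = text.lower().split()
    # Pair every word with the word that follows it.
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs

example_tokens = tokens_with_neighbors("Make money fast")
```

Each pair then gets trained and scored like any other token, so "make
money" can carry its own weight separate from "make" and "money".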

On the surface, it seems to me that Corvigo is offering a mild improvement
to Bayes, which could be easily incorporated into current Bayesian
offerings.  Consider:  Train your filter with a significant, personal corpus
of hand-sorted spam and ham.  Then, be able to "turn off" the learning
feature of your Bayesian filter.  This would essentially give you what
Corvigo has (at least, per their simple example).  If my Bayes doesn't know
"m0-ney" then it keys off of "Make fast from home!".
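
As a sketch of what "turning off" learning could look like (a hypothetical
class of my own, not taken from any existing filter): once frozen, the
token database stops growing, so a never-seen token like "m0-ney" stays
unknown and is simply left out of the tokens that get scored.

```python
from collections import Counter

class FreezableFilter:
    """Toy token store whose training can be switched off."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.learning = True

    def train(self, tokens, is_spam):
        """Count each distinct token once per message, unless frozen."""
        if not self.learning:
            return
        (self.spam if is_spam else self.ham).update(set(tokens))

    def known_tokens(self, tokens):
        """Return only the tokens seen during training, in message order."""
        return [t for t in tokens if t in self.spam or t in self.ham]

f = FreezableFilter()
f.train("make money fast from home!".split(), is_spam=True)
f.learning = False                                 # freeze the corpus
f.train("make m0-ney fast".split(), is_spam=True)  # ignored while frozen
kept = f.known_tokens("make m0-ney fast from home!".split())
```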

> Is this approach being pursued in open source space?

I think we already have it, with minor tweaks to existing code (if any are
even necessary).


