Regarding your interest in advanced classifiers, along the lines of what
the Corvigo Mailgate product whitepapers suggest:

        Open-source CRM114 hashes many permutations of subphrases.  It is
        slower than bag-of-words because it generates and checks many
        more clues: there is more work per message, plus a slowdown from
        cache-busting memory access.  It's a very good classifier.

        The most popular open-source spam classifiers (SpamAssassin,
        SpamBayes, bogofilter) are based on bag-of-words and not n-grams
        or phrases.  Classification results seem to be pretty good for
        most of us, and they are fast.
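
To make the difference in clue counts concrete, here is a rough sketch in
Python.  It is only an illustration: the function names, the "_" skip
marker, and the window size are my own assumptions, and this is not
CRM114's actual hashing scheme nor the tokenizer of any of the filters
named above.

```python
def bag_of_words(text):
    """One clue per distinct token; word order is thrown away."""
    return set(text.lower().split())


def subphrase_features(text, window=4):
    """Sliding-window subphrase clues (hypothetical illustration).

    For each token, look back at up to window - 1 earlier tokens and
    emit every order-preserving combination of them followed by the
    current token.  Skipped positions are marked with "_".
    """
    tokens = text.lower().split()
    features = set()
    for end in range(len(tokens)):
        start = max(0, end - window + 1)
        earlier = tokens[start:end]
        # Each bit of mask decides whether one earlier token is kept.
        for mask in range(2 ** len(earlier)):
            phrase = [tok if (mask >> i) & 1 else "_"
                      for i, tok in enumerate(earlier)]
            phrase.append(tokens[end])
            # Drop leading skip markers so a clue starts on a real token.
            while phrase[0] == "_":
                phrase.pop(0)
            features.add(" ".join(phrase))
    return features


message = "buy cheap pills now"
words = bag_of_words(message)
phrases = subphrase_features(message)
print(len(words), len(phrases))
```

For this four-word message the bag-of-words tokenizer yields 4 clues and
the subphrase tokenizer yields 15, including ones like "buy _ pills" that
survive a changed middle word.  The count roughly doubles with each extra
token in the window, which is where the extra work and the larger, less
cache-friendly tables come from.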

There is certainly room for both semantic indexing and more aggressive
tokenization and parsing methods.  For many users the quality difference
between quite-good bag-of-words and even-better phrase-based
classification will not matter.  For others it may be crucial, as long
as the cost (in speed and processing) is not too high.

We have reached the point of making cost/performance tradeoffs.  For most
users, all of the above are well past "good enough to use."

Liudvikas Bukys
[EMAIL PROTECTED]


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
