Regarding your interest in advanced classifiers, along the lines of what the Corvigo Mailgate product whitepapers suggest:
The open-source CRM114 hashes many permutations of subphrases. It is a very good classifier, but it is slower than bag-of-words: it generates and checks many more clues per message, so there is more work to do, and its cache-busting behavior slows it further.

The most popular open-source spam classifiers (SpamAssassin, SpamBayes, bogofilter) are based on bag-of-words rather than n-grams or phrases. Their classification results seem to be pretty good for most of us, and they are fast.

There is certainly room for both semantic indexing and more aggressive tokenization and parsing methods. For many users, the quality difference between quite-good bag-of-words and even-better phrase-based classification will not matter. For others it may be crucial, as long as the cost (in speed and processing) is not too high.

We have reached the point of making cost/performance tradeoffs. For most users, all of the above are well past "good enough to use."

Liudvikas Bukys
[EMAIL PROTECTED]

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
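
To make the difference in clue counts concrete, here is a rough Python sketch comparing a bag-of-words tokenizer with a subphrase tokenizer. It is only loosely in the spirit of the subphrase hashing described above and is not CRM114's actual code: the window of 5 tokens, the `<skip>` placeholder, the CRC32 hash, and all function names are my own assumptions for illustration.

```python
from itertools import product
import zlib


def bag_of_words(text):
    """One clue per distinct token; word order is ignored."""
    return set(text.lower().split())


def subphrase_clues(text, window=5):
    """Hypothetical phrase tokenizer (an illustration, not CRM114's code).

    For each token, take up to window-1 preceding tokens and emit every
    keep/skip combination of them, followed by the token itself.
    """
    tokens = text.lower().split()
    clues = set()
    for i, tok in enumerate(tokens):
        context = tokens[max(0, i - window + 1):i]
        # Each context word is either kept or replaced by a placeholder.
        for mask in product((False, True), repeat=len(context)):
            parts = [w if keep else "<skip>" for w, keep in zip(context, mask)]
            parts.append(tok)
            clues.add(" ".join(parts))
    return clues


def hash_clue(clue):
    """Map a clue to a 32-bit bucket, as a hashed feature table would."""
    return zlib.crc32(clue.encode("utf-8"))


msg = "buy cheap pills online now"
print(len(bag_of_words(msg)))     # 5
print(len(subphrase_clues(msg)))  # 31
```

In this sketch, every token past the fourth contributes 16 clues instead of 1, and each of those is a separate table lookup, which is where the extra work comes from.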