Scott A Crosby <[EMAIL PROTECTED]> writes:

> The thing is that a gibberish token (not matching the statistics of
> $LANG, not in a dictionary) should, as a new token, be given a
> different Bayes category than one that is in a dictionary, etc.
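(As an aside, here is a rough sketch of what such a per-token check could look like. This is hypothetical Python for illustration only, not SpamAssassin code; the wordlist, category names, and heuristics are all made up for the example.)

```python
# Hypothetical sketch: sort a new Bayes token into "dictionary",
# "word-like but unknown", or "gibberish", so each class could be
# tracked with its own statistics. Not SpamAssassin code.

import re

# Stand-in for a real wordlist (e.g. /usr/share/dict/words).
DICTIONARY = {"free", "money", "hello", "meeting", "report"}

def looks_gibberish(token):
    """Cheap heuristics for hash-like random tokens: letters mixed
    with digits mid-word, no vowels at all, or long consonant runs."""
    t = token.lower()
    if not re.fullmatch(r"[a-z0-9]+", t):
        return True
    if re.search(r"[a-z]", t) and re.search(r"[0-9]", t):
        return True   # e.g. "x7k2qf9"-style tokens
    if not re.search(r"[aeiouy]", t):
        return True   # no vowels at all
    if re.search(r"[bcdfghjklmnpqrstvwxz]{5,}", t):
        return True   # five or more consonants in a row
    return False

def token_category(token):
    t = token.lower()
    if t in DICTIONARY:
        return "NEW_DICT_WORD"
    if looks_gibberish(t):
        return "NEW_GIBBERISH"
    return "NEW_NON_DICT_WORD"
```

Each word costs only a set lookup and a few regex scans, so per-token expense is mostly the dictionary lookup.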
Perhaps. It would probably be somewhat expensive to test every word
for gibberish.

>> My initial testing indicates that new tokens (in the body) have
>> a spam probability of about 0.83, at least for me.

> Can you do testing to see if new non-English or new non-dictionary
> tokens have a higher spam probability?

Umm, I could. :-)  I'm sure the non-English ones score higher, due to
foreign-language spam. That, by the way, could also make a useful
token in itself (as Justin Mason suggested): the possible language
guesses from the TextCat.pm module (which is *really* slow, but some
people already use it and it works well).

Daniel

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk