Scott A Crosby <[EMAIL PROTECTED]> writes:

> The thing is that a gibberish token (not matching the statistics of
> $LANG, not in a dictionary) should, as a new token, be given a
> different Bayes category than one that is in a dictionary, etc.

Perhaps.  It would probably be somewhat expensive to test every word for
gibberish.
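To make the cost concrete, here is one cheap way such a test could work: a dictionary lookup plus a character-bigram likelihood check. This is just an illustrative sketch, not anything SpamAssassin does; the sample text, floor probability, and threshold are all made-up assumptions.

```python
# Hypothetical gibberish heuristic: a token is suspect if it is not in a
# dictionary AND its character bigrams are unlikely for English text.
# The seed text, floor, and threshold below are illustrative assumptions.
import math
from collections import Counter

ENGLISH_SAMPLE = ("the quick brown fox jumps over the lazy dog and then "
                  "some more ordinary english text to seed bigram counts")

def bigram_model(text):
    # Relative frequency of each character pair in the seed text.
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(pairs.values())
    return {bg: c / total for bg, c in pairs.items()}

MODEL = bigram_model(ENGLISH_SAMPLE)
FLOOR = 1e-4  # probability assigned to unseen bigrams (assumption)

def gibberish_score(token):
    # Average negative log-likelihood per bigram; higher = weirder.
    token = token.lower()
    if len(token) < 2:
        return 0.0
    logp = sum(math.log(MODEL.get(token[i:i + 2], FLOOR))
               for i in range(len(token) - 1))
    return -logp / (len(token) - 1)

def looks_gibberish(token, dictionary, threshold=8.0):
    return token.lower() not in dictionary and gibberish_score(token) > threshold
```

Per token this is just a hash lookup and a handful of log additions, so the expense is mostly in keeping a decent dictionary and a per-language model around.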

>> My initial testing indicates that new tokens (in the body) have
>> a spam probability of about 0.83, at least for me.

> Can you do testing to see if new non-english or new non-dictionary
> tokens have a higher spam probability?

Umm, I could.  :-)
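For what it's worth, the kind of per-token probability being measured here can be sketched with a Graham-style estimate ("A Plan for Spam"). This is an illustrative formula, not SpamAssassin's actual Bayes code, and the counts in the usage below are invented:

```python
# Hedged sketch: Graham-style per-token spam probability from corpus counts.
def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham):
    # Ham occurrences are weighted double, as in Graham's original scheme;
    # the result is clamped to [0.01, 0.99].
    g = 2 * ham_hits
    b = spam_hits
    if g + b == 0:
        return 0.4  # Graham's default for never-seen tokens
    p = (b / n_spam) / (g / n_ham + b / n_spam)
    return min(0.99, max(0.01, p))
```

Under this scheme a brand-new token would get the fixed 0.4 default, which is interestingly lower than the ~0.83 observed empirically above.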

I'm sure the non-English ones score higher, due to foreign spam.  Which,
by the way, could also be a useful token (as Justin Mason suggested):
the language guesses from the TextCat.pm module (which is *really* slow,
but some people already use it and it works well).
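TextCat is based on Cavnar & Trenkle's rank-order n-gram method, which is easy to sketch. The tiny training strings below are stand-ins for real language profiles, so treat this as a toy, not TextCat itself:

```python
# Toy TextCat-style language guesser: rank character n-grams by frequency
# and compare profiles with the "out-of-place" distance (Cavnar & Trenkle).
# The two training strings are illustrative, not TextCat's real profiles.
from collections import Counter

def profile(text, max_n=3, top=300):
    grams = Counter()
    text = text.lower()
    for n in range(1, max_n + 1):
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    ranked = [g for g, _ in grams.most_common(top)]
    return {g: rank for rank, g in enumerate(ranked)}

def distance(doc_profile, lang_profile):
    # Sum of rank differences; n-grams missing from the language profile
    # pay a fixed maximum penalty.
    penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(g, penalty))
               for g, rank in doc_profile.items())

LANGS = {
    "en": profile("the quick brown fox jumps over the lazy dog "
                  "this is plain english text"),
    "de": profile("der schnelle braune fuchs springt ueber den faulen "
                  "hund das ist deutscher text"),
}

def guess_language(text):
    doc = profile(text)
    return min(LANGS, key=lambda lang: distance(doc, LANGS[lang]))
```

The slowness people complain about presumably comes from building the document profile and scanning every language's ranked list per message, which scales with message size times the number of languages.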

Daniel


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
