Good evening, all,

On Wed, 8 Oct 2003, Daniel Quinlan wrote:
> Scott A Crosby <[EMAIL PROTECTED]> writes:
>
> > The thing is that a gibberish token (not-with the statistics of $LANG,
> > not-dictionary) should, as a new token, be given a different bayes
> > category than one that is in a dictionary, etc.
>
> Perhaps.  It would probably be somewhat expensive to test every word for
> gibberish.

I'm almost _certain_ I'm about to look incredibly stupid here, but
might I suggest:

Could we simply test for letter frequency?  For a given language, the
letter frequencies should stay fairly predictable; random strings of
characters would show up with noticeably different histograms.

Note that I handwave over the fact that we probably don't know the
intended language beforehand.  :-(

As I said, my apologies for a one-half^Wone-quarter^Wone-eighth baked
idea.

Cheers,
- Bill

---------------------------------------------------------------------------
"``Threads are like salt. You like salt, I like salt, but we eat a
lot more pasta than salt.'' The thread guys are trying to tell you that
diet of salt is a good idea. They are wrong, don't listen, eat more
pasta and be happy."
        -- Larry McVoy <[EMAIL PROTECTED]>
--------------------------------------------------------------------------
William Stearns ([EMAIL PROTECTED]).  Mason, Buildkernel, freedups, p0f,
rsync-backup, ssh-keyinstall, dns-check, more at: http://www.stearns.org
Linux articles at: http://www.opensourcedigest.com
--------------------------------------------------------------------------

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
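
P.S. For what it's worth, here is a rough sketch of the letter-frequency
test suggested above, in Python. This is not SpamAssassin code. The
frequency table holds approximate, commonly quoted values for English and
is an assumption, as are the names letter_score and looks_like_gibberish
and the default threshold of 0.0. The score is the mean per-letter
log-likelihood under the English table minus the same quantity under a
uniform 26-letter distribution, so positive means "more English-like than
random".

```python
import math

# Approximate letter frequencies (percent) for English text.  These are
# assumed reference values for illustration; a real filter would measure
# them from a corpus in the target language.
ENGLISH_FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
    's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
    'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
    'p': 1.9, 'b': 1.5, 'v': 0.98, 'k': 0.77, 'j': 0.15, 'x': 0.15,
    'q': 0.095, 'z': 0.074,
}

_TOTAL = sum(ENGLISH_FREQ.values())
LOG_PROB = {c: math.log(f / _TOTAL) for c, f in ENGLISH_FREQ.items()}
UNIFORM_LOG_PROB = math.log(1.0 / 26.0)


def letter_score(token):
    """Mean per-letter log-likelihood relative to a uniform baseline.

    Returns None when the token contains no ASCII letters a-z.
    """
    letters = [c for c in token.lower() if c in LOG_PROB]
    if not letters:
        return None
    mean = sum(LOG_PROB[c] for c in letters) / len(letters)
    return mean - UNIFORM_LOG_PROB


def looks_like_gibberish(token, threshold=0.0):
    """Hypothetical helper: flag tokens scoring below the threshold."""
    score = letter_score(token)
    return score is not None and score < threshold


if __name__ == "__main__":
    for word in ("statistics", "xqzjvk"):
        print(word, letter_score(word), looks_like_gibberish(word))
```

The cost is one table lookup per character, so it is cheap. The catch is
that short tokens are very noisy: a perfectly good word built from rare
letters can score below zero, so the threshold would need tuning against
real mail, and the language problem noted above is still unsolved.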