Craig R Hughes writes:
> Better than a straight dictionary of single words is a dictionary of
> phrases, weighted by their frequency in spam vs nonspam. Hmm, wait,
> that sounds familiar somehow... ;) I suppose we ought to turn spam
> phrases back on.... I'll work on that right now, and check it in
> once working.

Craig, are the spam phrases extracted by hand, or does an automated
program extract them from your spam corpus?

It's on my TODO list to try extracting a "spam-dialect" English language
model, to see whether its N-gram distribution differs enough from the
regular English language model that the language code can accurately
guess when someone is using spam-dialect.

I was also wondering if we could merge the word code with the N-gram
extraction code so that only one pass is required. Something like:

    foreach $word (@words) {
        $wcount{$word}++;
        foreach $ngram (ngrams($word)) {
            $ncount{$ngram}++;
        }
    }

You could then use the language guess and/or locale to compare against
spam phrases for that particular language/locale.

Dan
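
For reference, a minimal self-contained sketch of the one-pass idea above. It assumes ngrams() returns character n-grams of length 1 to 3 for a single word, padded with "_" at the word boundaries. The helper names, the padding, and the default length are assumptions made for illustration, not taken from the SpamAssassin code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return all character n-grams of length 1..$max_n
# for one word, padded with "_" to mark the word boundaries.
sub ngrams {
    my ($word, $max_n) = @_;
    $max_n = 3 unless defined $max_n;
    my $padded = "_${word}_";
    my @grams;
    for my $n (1 .. $max_n) {
        # The range is empty when the padded word is shorter than $n.
        for my $i (0 .. length($padded) - $n) {
            push @grams, substr($padded, $i, $n);
        }
    }
    return @grams;
}

# One pass over the text: count whole words and their n-grams together.
# Returns references to the word-count and n-gram-count hashes.
sub count_words_and_ngrams {
    my ($text, $max_n) = @_;
    my (%wcount, %ncount);
    my @words = map { lc($_) } ($text =~ /(\w+)/g);
    for my $word (@words) {
        $wcount{$word}++;
        $ncount{$_}++ for ngrams($word, $max_n);
    }
    return (\%wcount, \%ncount);
}

# Example usage: print the word counts for a short sample string.
my ($words, $grams) = count_words_and_ngrams("Free money, free offer");
printf "%-6s %d\n", $_, $words->{$_} for sort keys %$words;
```

Running the same routine over a spam corpus and a nonspam corpus would give two pairs of tables that could then be compared. How best to weight or compare them is left open here.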