Daniel,

Look in the wordfreqs/ directory of the distribution.

C

Daniel Quinlan wrote:

DQ> Craig R Hughes writes:
DQ>
DQ> > Better than a straight dictionary of single words is a dictionary of
DQ> > phrases, weighted by their frequency in spam vs nonspam.  Hmm, wait,
DQ> > that sounds familiar somehow...  ;) I suppose we ought to turn spam
DQ> > phrases back on....  I'll work on that right now, and check it in
DQ> > once working.
DQ>
DQ> Craig, are the spam phrases extracted by hand or does an automated
DQ> program extract them from your spam corpus?
DQ>
DQ> It's on my TODO list to try extracting a "spam-dialect" English language
DQ> model to see if the N-gram distribution is sufficiently different from
DQ> the regular English language model and see if the language code can
DQ> accurately guess whether someone is using spam-dialect.
DQ>
DQ> I was also wondering if we could merge the word code in somehow with the
DQ> N-gram extraction code so only one pass is required.  Something like:
DQ>
DQ>   foreach $word (@words) {
DQ>       $wcount{$word}++;                # per-word frequency
DQ>       foreach $ngram (ngrams($word)) {
DQ>           $ncount{$ngram}++;           # per-N-gram frequency
DQ>       }
DQ>   }
DQ>
DQ> You could then use the language guess and/or locale to compare against
DQ> spam-phrases for that particular language/locale.
DQ>
DQ> Dan
DQ>
DQ>
DQ>

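A rough, self-contained sketch of the one-pass idea quoted above might look like the following. Note the assumptions: the quoted message does not define ngrams(), so the version here (character N-grams of a single word, up to a maximum length of 3) is hypothetical, as are the name count_words_and_ngrams() and the whitespace-based word splitting. It is not the code from the distribution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not defined in the quoted message): return every
# character N-gram of length 1..$max found in a single word.
sub ngrams {
    my ($word, $max) = @_;
    $max = 3 unless defined $max;    # assumed default, for illustration only
    my @grams;
    my $len = length $word;
    for my $n (1 .. $max) {
        last if $n > $len;
        for my $i (0 .. $len - $n) {
            push @grams, substr($word, $i, $n);
        }
    }
    return @grams;
}

# Count words and N-grams in a single pass over the text.
# Words are lowercased and split on whitespace (an assumption).
sub count_words_and_ngrams {
    my ($text, $max) = @_;
    my (%wcount, %ncount);
    for my $word (split ' ', lc($text)) {
        $wcount{$word}++;
        $ncount{$_}++ for ngrams($word, $max);
    }
    return (\%wcount, \%ncount);
}

# Example usage: print the word counts for a short sample string.
my ($wcount, $ncount) = count_words_and_ngrams("buy now buy cheap");
printf "%-6s %d\n", $_, $wcount->{$_} for sort keys %$wcount;
```

Both hashes are filled in the same loop, so the text is only walked once; the N-gram table could then be fed to the language guesser and the word table to the phrase matching.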

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk