Re: [SAtalk] scoring individual words

Daniel Quinlan Sat, 01 Jun 2002 15:18:47 -0700

Craig R Hughes writes:

> Better than a straight dictionary of single words is a dictionary of
> phrases, weighted by their frequency in spam vs nonspam.  Hmm, wait,
> that sounds familiar somehow...  ;) I suppose we ought to turn spam
> phrases back on....  I'll work on that right now, and check it in
> once working.


Craig, are the spam phrases extracted by hand or does an automated
program extract them from your spam corpus?

It's on my TODO list to try extracting a "spam-dialect" English language
model, to see whether its N-gram distribution differs enough from the
regular English language model for the language code to accurately guess
when someone is using spam-dialect.
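
As a rough illustration of what "sufficiently different" could mean, here
is a minimal sketch that builds ranked N-gram profiles and compares them
with a rank-order ("out-of-place") distance. This is an assumption about
how the comparison might work, not a description of the existing language
code; ngrams(), profile() and distance() are hypothetical names, and the
sample strings are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: character N-grams (default trigrams) of one word,
# lowercased and padded with "_" to mark the word boundaries.
sub ngrams {
    my ($word, $n) = @_;
    $n ||= 3;
    my $padded = '_' . lc($word) . '_';
    return () if length($padded) < $n;
    return map { substr($padded, $_, $n) } 0 .. length($padded) - $n;
}

# Build a rank table (N-gram => rank, 0 = most frequent) from raw text.
sub profile {
    my ($text) = @_;
    my %count;
    $count{$_}++ for map { ngrams($_) } ($text =~ /\w+/g);
    # Ties are broken alphabetically so the ranking is deterministic.
    my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    my %rank;
    @rank{@ranked} = 0 .. $#ranked;
    return \%rank;
}

# Out-of-place distance: sum of rank differences, with a fixed penalty
# (the size of the reference profile) for N-grams the reference lacks.
sub distance {
    my ($doc, $ref) = @_;
    my $penalty = keys %$ref;
    my $total   = 0;
    for my $gram (keys %$doc) {
        $total += exists $ref->{$gram}
            ? abs($doc->{$gram} - $ref->{$gram})
            : $penalty;
    }
    return $total;
}

# Made-up sample text; a real test would use the spam and nonspam corpora.
my $spam = profile("free money now click here free offer");
my $ham  = profile("the meeting is moved to friday afternoon");
my $msg  = profile("click here for free money");
printf "distance to spam: %d, to nonspam: %d\n",
    distance($msg, $spam), distance($msg, $ham);
```

A smaller distance means the message's N-gram ranking looks more like
that profile. Whether the spam-dialect and regular English profiles are
far enough apart to be useful is exactly the open question.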

I was also wondering if we could somehow merge the word code with the
N-gram extraction code so that only one pass is required.  Something like:

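  # Single pass: count each word and its N-grams together.
  # ngrams() is assumed to return the list of N-grams for one word.
  foreach my $word (@words) {
      $wcount{$word}++;
      foreach my $ngram (ngrams($word)) {
          $ncount{$ngram}++;
      }
  }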

You could then use the language guess and/or locale to compare against
spam-phrases for that particular language/locale.
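
The one-pass idea above can be fleshed out into something runnable. This
is only a sketch: ngrams() is a hypothetical helper (character trigrams
here), count_words_and_ngrams() is an invented name, and the tokenizing
on \w+ is an assumption rather than what the existing word code does.

```perl
use strict;
use warnings;

# Hypothetical helper (not defined in this message): character N-grams
# of one word, padded with "_" to mark the word boundaries.
sub ngrams {
    my ($word, $n) = @_;
    $n ||= 3;
    my $padded = "_${word}_";
    return () if length($padded) < $n;
    return map { substr($padded, $_, $n) } 0 .. length($padded) - $n;
}

# One pass over the text fills both the word table and the N-gram table.
sub count_words_and_ngrams {
    my ($text) = @_;
    my (%wcount, %ncount);
    foreach my $word (map { lc($_) } ($text =~ /\w+/g)) {
        $wcount{$word}++;
        $ncount{$_}++ for ngrams($word);
    }
    return (\%wcount, \%ncount);
}

my ($words, $grams) = count_words_and_ngrams("Free money, free offer");
printf "%-6s %d\n", $_, $words->{$_} for sort keys %$words;
printf "%d distinct N-grams\n", scalar keys %$grams;
```

The word table could then feed the phrase scoring while the N-gram table
feeds the language guess, without walking the message twice.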

Dan
