Daniel, look in the wordfreqs/ directory of the distribution.
C

Daniel Quinlan wrote:

DQ> Craig R Hughes writes:
DQ>
DQ> > Better than a straight dictionary of single words is a dictionary of
DQ> > phrases, weighted by their frequency in spam vs nonspam.  Hmm, wait,
DQ> > that sounds familiar somehow... ;)  I suppose we ought to turn spam
DQ> > phrases back on....  I'll work on that right now, and check it in
DQ> > once working.
DQ>
DQ> Craig, are the spam phrases extracted by hand or does an automated
DQ> program extract them from your spam corpus?
DQ>
DQ> It's on my TODO list to try extracting a "spam-dialect" English language
DQ> model to see if the N-gram distribution is sufficiently different from
DQ> the regular English language model and see if the language code can
DQ> accurately guess whether someone is using spam-dialect.
DQ>
DQ> I was also wondering if we could merge the word code in somehow with the
DQ> N-gram extraction code so only one pass is required.  Something like:
DQ>
DQ>   foreach $word (@words) {
DQ>     $wcount{$word}++;
DQ>     foreach $ngram (ngrams($word)) {
DQ>       $ncount{$ngram}++;
DQ>     }
DQ>   }
DQ>
DQ> You could then use the language guess and/or locale to compare against
DQ> spam-phrases for that particular language/locale.
DQ>
DQ> Dan

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
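
For anyone who wants to try the one-pass idea Dan describes, here is a rough sketch in Python rather than Perl. It is not the SpamAssassin code. The helper name char_ngrams, the "_" word-boundary padding, and the n_max=3 limit are all assumptions made for illustration, and it assumes that ngrams($word) in the quoted loop means character n-grams of a single word.

```python
from collections import Counter


def char_ngrams(word, n_max=3):
    """Yield the character n-grams of one word for n = 1..n_max.

    Hypothetical helper: the "_" padding marks word boundaries and is an
    assumption, not something taken from the original message.
    """
    padded = f"_{word}_"
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


def count_words_and_ngrams(words, n_max=3):
    """Build word counts and n-gram counts in a single pass over the words."""
    wcount = Counter()  # plays the role of %wcount in the quoted loop
    ncount = Counter()  # plays the role of %ncount in the quoted loop
    for word in words:
        wcount[word] += 1
        ncount.update(char_ngrams(word, n_max))
    return wcount, ncount


words = "free money free offer".split()
wcount, ncount = count_words_and_ngrams(words)
print(wcount.most_common(2))
print(ncount.most_common(5))
```

Both tables come out of the same loop over the words, so a language guess based on ncount and a phrase or word lookup based on wcount could share one tokenizing pass.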