Theo Van Dinter wrote:
On Thu, Oct 26, 2006 at 12:19:23PM -0400, Peter H. Lemieux wrote:
No, because there are going to be a lot of mails that would hit that.
Really? Maybe it's because I live in the US, but I can't think of a legitimate message I've ever received consisting only of a base64 blob.

You look at a lot of raw messages?  ;)

Doesn't everybody?

Seriously, I do look at a lot of raw messages; for instance, I review the full text of nearly every spam message that doesn't get caught by my filters and shows up in my inbox. Obviously I don't get much mail from Blackberry users or Ticketmaster!

Rather than making anyone else do the work for me, is there something I can read about how to determine the frequency of different message features appearing in the corpus?

Well, there isn't "a" SA corpus, so there's no answer to that question.

Ah, I hadn't read this page before:
        http://wiki.apache.org/spamassassin/HandClassifiedCorpora
My recollection was that 2.x used a centrally-defined corpus rather than a variety of developers' corpora (see, I read the wiki). Either things changed with the switch in scoring algorithms in 3.x, or my recollection is shoddy. Probably the latter.

You can generate some rules and use mass-check to run against your own corpus
to gather some statistics.  I'm willing to run some rules for you against my
corpus if you want.  I just don't have time to come up with the rules right
now.
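
As a rough illustration of gathering such statistics on one's own corpus, here is a minimal sketch. It is not mass-check and does not use SpamAssassin at all; it uses Python's standard email module to count messages whose whole body is a single base64 part. The function names and the one-message-per-file corpus layout are assumptions made for this example.

```python
import email
from pathlib import Path


def is_base64_only(raw: bytes) -> bool:
    """True if the message is non-multipart and its body is base64-encoded."""
    msg = email.message_from_bytes(raw)
    if msg.is_multipart():
        return False
    cte = str(msg.get("Content-Transfer-Encoding", "")).strip().lower()
    return cte == "base64"


def feature_frequency(corpus_dir: str) -> tuple[int, int]:
    """Return (hits, total) over every regular file under corpus_dir.

    Assumes one raw message per file (hypothetical layout).
    """
    hits = total = 0
    for path in Path(corpus_dir).rglob("*"):
        if path.is_file():
            total += 1
            hits += is_base64_only(path.read_bytes())
    return hits, total
```

Running feature_frequency separately on a ham directory and a spam directory would give a first estimate of how often the feature appears in each, which is the kind of number needed before proposing a rule.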

Thanks for the offer, Theo, but don't spend your valuable time on this. I'll give it a shot some day when I've got some spare moments. If I do get some candidate rules, I'll pass them along to you for testing.


Thanks again!
Peter
