Re: Scoring base64 blob messages

Peter H. Lemieux Fri, 27 Oct 2006 14:30:43 -0700

Theo Van Dinter wrote:

On Thu, Oct 26, 2006 at 12:19:23PM -0400, Peter H. Lemieux wrote:
No, because there are going to be a lot of mails that would hit that.
Really? Maybe it's because I live in the US, but I can't think of a legitimate message I've ever received consisting only of a base64 blob.
You look at a lot of raw messages?  ;)


Doesn't everybody?

Seriously, I do look at a lot of raw messages; for instance, I review the full text of nearly every spam message that doesn't get caught by my filters and shows up in my inbox. Obviously I don't get much mail from Blackberry users or Ticketmaster!

Rather than making anyone else do the work for me, is there something I can read about how to determine the frequency of different message features appearing in the corpus?

Well, there isn't "a" SA corpus, so there's no answer to that question.


Ah, I hadn't read this page before:
        http://wiki.apache.org/spamassassin/HandClassifiedCorpora

My recollection was that 2.x used a centrally-defined corpus rather than a variety of developers' corpora (see, I read the wiki). Either things changed with the switch in scoring algorithms in 3.x, or my recollection is shoddy. Probably the latter.

You can generate some rules and use mass-check to run against your own corpus
to gather some statistics.  I'm willing to run some rules for you against my
corpus if you want.  I just don't have time to come up with the rules right
now.

Thanks for the offer, Theo, but don't spend your valuable time on this. I'll give it a shot some day when I've got some spare moments. If I do get some candidate rules, I'll pass them along to you for testing.
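
As a rough illustration of gathering that kind of statistic outside of mass-check, here is a minimal sketch that assumes a local corpus stored as mbox files. The function names (is_base64_blob, blob_frequency, mbox_blob_frequency) are hypothetical, and "base64 blob" is taken here to mean a non-multipart message whose Content-Transfer-Encoding is base64. It uses only the Python standard library and is not a SpamAssassin rule or API.

```python
import email.message
import mailbox


def is_base64_blob(msg: email.message.Message) -> bool:
    """Return True for a non-multipart message whose body is base64-encoded.

    Assumption: this is one plausible reading of "consisting only of a
    base64 blob"; it does not inspect the decoded content.
    """
    if msg.is_multipart():
        return False
    cte = str(msg.get("Content-Transfer-Encoding", "")).strip().lower()
    return cte == "base64"


def blob_frequency(messages) -> float:
    """Fraction of the given messages that look like base64 blobs."""
    total = 0
    hits = 0
    for msg in messages:
        total += 1
        if is_base64_blob(msg):
            hits += 1
    return hits / total if total else 0.0


def mbox_blob_frequency(path: str) -> float:
    """Compute the fraction over one mbox file (hypothetical corpus layout)."""
    # create=False avoids silently creating an empty mailbox at a wrong path.
    box = mailbox.mbox(path, create=False)
    try:
        return blob_frequency(box)
    finally:
        box.close()
```

Running mbox_blob_frequency separately over a ham mbox and a spam mbox would give the two rates to compare before deciding whether a rule is worth writing.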



Thanks again!
Peter
