Hi,

For my new Bayesian filter (my diploma thesis) I need these mails too.
But I think it will be no problem to identify these spams: they have to transport some kind of message or advertising, and even if that is only a link to a homepage, the link itself is a valid token. Please send me these spams.

Thorsten

On Thursday, 26 June 2003 at 23:05, alex avriette wrote:
> "Daniel Quinlan" <[EMAIL PROTECTED]> writes:
> > Well, if everyone stopped using SpamAssassin, it would work better too,
> > so I blame all users.
>
> Yeah. Blame users. Users are the easiest to blame, anyways.
>
> > Well, I heard about this weakness before we even adopted Bayes. It was
> > coming, one way or another.
> >
> > There's one non-Bayesian rule in 2.60 to catch some of these, but we
> > could probably use more rules to catch tricks like this.
>
> Does anyone have a stockpile of this stuff? I was thinking of some
> filters for this myself earlier. What strategies have other people
> been using or pondering for it?
>
> I think most times, these messages look like:
>
>   studs hairiness
>   miltonic waldo pseudoinstruction RzneXzvxrRzneXongpu.pbzRzneX
>   monocotyledon conceals rickshaws raising lutheranizers gels modulating
>   cautions bowed verbally storeyed tormenting dairy pruners height
>
> A couple of things could be done with this. How frequently do you see
> the word "rickshaws" in email? "pseudoinstruction"? I think if you see a
> word you've only ever seen once in ham before, and then you see four
> others like it, you're looking at gibberish.
>
> Additionally, you'll notice that most of these words are > 5 characters:
>
>   [scorch:~] alex% perl -lne '$s+=length;END{print $s/$.}' /usr/share/dict/words
>   9.58507174263739
>
> However, in practice, if we take this email, we'll find that the long
> words are actually interspersed with many smaller words: if, the, and,
> I, me, my, and so on. So if we see several occurrences of words we don't
> see frequently, along with a lack of the smaller words that make speech
> understandable, it is probably gibberish.
>
> I suppose to defeat it you could use a sort of anti-Bayesian filtering:
> come up with a file full of commonly used words and intersperse them
> with smaller words, or use legitimate text (somebody mentioned the
> Declaration of Independence). A counter-countermeasure, I guess, would
> be to use fuzzy logic and determine a normal word-frequency ratio or a
> normal large:small word ratio. In combination with other filters, it
> might be helpful to implement.
>
> It isn't a regex, though, and would be somewhat slow.
>
> alex
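As a rough sketch of the large:small word ratio idea above, a standalone check might look like the following. This is not SpamAssassin code; the stopword list, the "> 5 characters" cutoff, and the 0.2 threshold are invented here purely for illustration:

    #!/usr/bin/perl -w
    # gibberish.pl - toy sketch of the large:small word ratio heuristic.
    # Reads a message body on stdin and prints a verdict. The stopword
    # list and the 0.2 threshold are made-up numbers, not tuned values.
    use strict;

    my %stop = map { $_ => 1 }
        qw(a an and are as at be but by for i if in is it me my
           of on or so the to we you);

    my ($small, $large) = (0, 0);
    while (my $line = <STDIN>) {
        for my $w (map { lc } $line =~ /([A-Za-z']+)/g) {
            if    ($stop{$w})     { $small++ }   # glue word
            elsif (length $w > 5) { $large++ }   # candidate gibberish
        }
    }

    my $total = $small + $large;
    # Real text keeps plenty of glue words; random dictionary dumps don't.
    if ($total >= 20 && $small / $total < 0.2) {
        print "probably gibberish ($small small, $large large words)\n";
    } else {
        print "probably normal text ($small small, $large large words)\n";
    }

Run as "perl gibberish.pl < message". On the sample above it should report gibberish, since almost none of the glue words that hold real sentences together ever appear; a decoy message padded with common small words, as alex describes, would of course evade this and push you toward the fuzzier ratio approach.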