"Daniel Quinlan" <[EMAIL PROTECTED]> writes:

Well, if everyone stopped using SpamAssassin, it would work better too,
so I blame all users.

Yeah. Blame users. Users are the easiest to blame, anyways.


Well, I heard about this weakness before we even adopted Bayes.  It was
coming, one way or another.

There's one non-Bayesian rule in 2.60 to catch some of these, but we
could probably use more rules to catch tricks like this.

Does anyone have a stockpile of this stuff? I was thinking of some filters for this myself earlier. What are the strategies other people have been using/pondering for it?


I think most times, these messages look like:

studs hairiness
miltonic waldo pseudoinstruction RzneXzvxrRzneXongpu.pbzRzneX
monocotyledon conceals rickshaws raising lutheranizers gels modulating
cautions bowed verbally storeyed tormenting dairy pruners height

A couple of things could be done with this. How often do you see the word "rickshaws" in email? Or "pseudoinstruction"? I think if you see a word you've only ever seen once in "ham" before, and then you see four more like it in the same message, you're looking at gibberish.
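As a rough sketch of that rare-word idea (not anything SpamAssassin actually does): count the distinct words in a message that have shown up at most once in your ham, and flag the message when there are five or more. The function names, the toy ham corpus, and the threshold of 5 are all assumptions made up for illustration; a real check would use token counts from a much larger ham collection.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic runs only; a deliberately crude tokenizer.
    return re.findall(r"[a-z]+", text.lower())

def count_rare_words(message, ham_counts, max_seen=1):
    # Number of distinct words seen at most `max_seen` times in ham.
    words = set(tokenize(message))
    return sum(1 for w in words if ham_counts.get(w, 0) <= max_seen)

def looks_like_rare_word_gibberish(message, ham_counts, max_seen=1, threshold=5):
    # threshold=5 mirrors "one rare word, then four others"; it is a guess.
    return count_rare_words(message, ham_counts, max_seen) >= threshold

# Toy ham corpus (hypothetical); real counts would come from stored ham.
ham_corpus = [
    "did you get the patch i sent to the list",
    "the rule scores in my config look fine to me",
]
ham_counts = Counter(w for msg in ham_corpus for w in tokenize(msg))

sample = "monocotyledon conceals rickshaws raising lutheranizers gels modulating"
print(count_rare_words(sample, ham_counts))
print(looks_like_rare_word_gibberish(sample, ham_counts))
```

With a ham corpus this small nearly every word counts as rare, so the threshold only becomes meaningful once the counts come from a realistic amount of mail.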

Additionally, you'll notice that most of these words are longer than 5 characters. The average length of a dictionary word is higher still:

[scorch:~] alex% perl -lne '$s+=length;END{print $s/$.}' /usr/share/dict/words
9.58507174263739


However, in practice, if we take this email as an example, we'll find that the longer words are interspersed with many smaller words: if, the, and, I, me, my, and so on. So if we see several words that we rarely see, along with a lack of the smaller words that make speech understandable, the message is probably gibberish.
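The short-word half of that check could be sketched like this. The cutoff of 3 characters for a "short" word and the minimum fraction of 0.2 are assumptions picked for illustration, not measured values; you would want to calibrate both against real ham.

```python
import re

SHORT_WORD_MAX_LEN = 3  # assumption: "short" means 3 characters or fewer

def short_word_fraction(text, max_len=SHORT_WORD_MAX_LEN):
    # Fraction of words in `text` that are at most `max_len` characters long.
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    short = sum(1 for w in words if len(w) <= max_len)
    return short / len(words)

def lacks_short_words(text, min_fraction=0.2):
    # min_fraction is a guessed threshold, not a calibrated one.
    return short_word_fraction(text) < min_fraction

gibberish = ("monocotyledon conceals rickshaws raising lutheranizers gels "
             "modulating cautions bowed verbally storeyed tormenting dairy "
             "pruners height")
prose = "So if we see several words we do not see often"

print(short_word_fraction(gibberish))
print(short_word_fraction(prose))
```

On its own this would also flag terse but legitimate text, such as a list of product names, so it probably only makes sense alongside the rare-word count.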

I suppose to defeat it a spammer could use a sort of anti-Bayesian technique: come up with a file full of commonly used words and intersperse them with smaller words, or use legitimate text (somebody mentioned the Declaration of Independence). A counter-countermeasure, I guess, would be to use fuzzy logic and determine a normal word frequency ratio or a normal large:small word ratio. In combination with other filters, it might be helpful to implement.

It isn't a regex, though, and would be somewhat slow.

alex



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
