Hi,

For my new Bayesian filter (my diploma thesis) I need these mails too.
But I think it will be no problem to identify these spams: they have to transport some kind of message or advertising, and even if that is only a link to a homepage, the link itself is a valid token. Please send me these spams.

Thorsten

On Thursday, 26 June 2003 at 23:05, alex avriette wrote:
> "Daniel Quinlan" <[EMAIL PROTECTED]> writes:
> > Well, if everyone stopped using SpamAssassin, it would work better too,
> > so I blame all users.
>
> Yeah. Blame users. Users are the easiest to blame, anyways.
>
> > Well, I heard about this weakness before we even adopted Bayes. It was
> > coming, one way or another.
> >
> > There's one non-Bayesian rule in 2.60 to catch some of these, but we
> > could probably use more rules to catch tricks like this.
>
> Does anyone have a stockpile of this stuff? I was thinking of some
> filters for this myself earlier. What strategies have other people
> been using or pondering for it?
>
> I think most times, these messages look like:
>
>   studs hairiness
>   miltonic waldo pseudoinstruction RzneXzvxrRzneXongpu.pbzRzneX
>   monocotyledon conceals rickshaws raising lutheranizers gels modulating
>   cautions bowed verbally storeyed tormenting dairy pruners height
>
> A couple of things could be done with this. How frequently do you see
> the word "rickshaws" in email? "pseudoinstruction"? I think if you see a
> word you've only ever seen once in ham before, and then you see four
> others like it, you're looking at gibberish.
>
> Additionally, you'll notice that most of these words are > 5 characters:
>
>   [scorch:~] alex% perl -lne '$s+=length;END{print $s/$.}' /usr/share/dict/words
>   9.58507174263739
>
> However, in practice, if we take this email, we'll find that the long
> words are actually interspersed with many smaller words: if, the, and,
> I, me, my, and so on. So if we see several occurrences of words we don't
> see frequently, along with a lack of the smaller words that make speech
> understandable, it is probably gibberish.
>
> I suppose to defeat it you could use a sort of anti-Bayesian filtering:
> come up with a file full of commonly used words and intersperse them
> with smaller words, or use legitimate text (somebody mentioned the
> Declaration of Independence). A counter-countermeasure, I guess, would
> be to use fuzzy logic and determine a normal word-frequency ratio or a
> normal large:small word ratio. In combination with other filters, it
> might be helpful to implement.
>
> It isn't a regex, though, and would be somewhat slow.
>
> alex
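As a rough sketch of the large:small word ratio idea above, a standalone check might look like the following. This is not SpamAssassin code; the stopword list, the "> 5 characters" cutoff, and the 0.2 threshold are invented here purely for illustration:

    #!/usr/bin/perl -w
    # gibberish.pl - toy sketch of the large:small word ratio heuristic.
    # Reads a message body on stdin and prints a verdict. The stopword
    # list and the 0.2 threshold are made-up numbers, not tuned values.
    use strict;

    my %stop = map { $_ => 1 }
        qw(a an and are as at be but by for i if in is it me my
           of on or so the to we you);

    my ($small, $large) = (0, 0);
    while (my $line = <STDIN>) {
        for my $w (map { lc } $line =~ /([A-Za-z']+)/g) {
            if    ($stop{$w})     { $small++ }   # glue word
            elsif (length $w > 5) { $large++ }   # candidate gibberish
        }
    }

    my $total = $small + $large;
    # Real text keeps plenty of glue words; random dictionary dumps don't.
    if ($total >= 20 && $small / $total < 0.2) {
        print "probably gibberish ($small small, $large large words)\n";
    } else {
        print "probably normal text ($small small, $large large words)\n";
    }

Run as "perl gibberish.pl < message". On the sample above it should report gibberish, since almost none of the glue words that hold real sentences together ever appear; a decoy message padded with common small words, as alex describes, would of course evade this and push you toward the fuzzier ratio approach.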