On Wed, 26 Nov 2003 14:17:30 +0600, Alexander Litvinov <[EMAIL PROTECTED]> writes:
> > Solution is to learn a monogram, bigram and trigram character model
> > for the ham you receive. Mix the statistics together (to account for
> > partial information) and that'll be very good at detecting gibberish
> > and foreign languages. Assume that if it's not been seen before, it's a
> > spam sign. Canonicalize the non-alphabetic tokens and it could detect,
> > weakly, mangled text like V.I.A.G.....
> >
> This can be the solution, but V.I.A.G is good example of byers work.
>

Not really. Spammers can use:

    V.I.A.G.R.A    V.I.A.G.R A    V.I.A.G.R,A    V.I.A.G.R_A
    V.I.A.G R.A    V.I.A.G R A    V.I.A.G R,A    V.I.A.G R_A
    V.I.A.G,R.A    V.I.A.G,R A    V.I.A.G,R,A    V.I.A.G,R_A
    V.I.A.G_R.A    V.I.A.G_R A    V.I.A.G_R,A    V.I.A.G_R_A

and so on. Ignoring those that use " ", which would break the word into
two tokens, that is 3^5, or 243 distinct tokens. If they add a random
[,_.] at the beginning and end, that's 3^7 = 2187 distinct tokens, or
3^2*4^5 = 9216 distinct ways to write viagr once " " is allowed between
the letters, without mangling a single letter.

This can be applied to readily mangle any 6 letter word in 9216 distinct
ways. Do this to two different words in an email and now each message is
unique. That pretty much kills the current pyzor implementation. Secondly,
a spammer can use this to track who may have submitted an email to a
blacklist.

Fortunately, this can be caught with a regexp along the lines of:

    V[ _,.]*I[ _,.]*A[ _,.]*G[ _,.]*R[ _,.]*A

Just be happy that the porn spammers haven't started to use this trick.
Skimming over 20_porn.cf and 20_phrases.cf, not a single rule looks for
this, and no HTML required!

I know that the more modern rulesets on the list look directly for
obfuscation, but doing it at the individual regexp level is going to be
better than looking for lots of .'s. I care about V_I,A.G_R,A, or
V.I.A.G.R.A, not U.S.A.

You need a way to programmatically transform regexps, en masse, into a set
that can detect trivial obfuscations like the above.

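
For the simplest case, a plain literal word, that transform and the
counts above can be sketched roughly as follows. This is an illustrative
Python sketch only, not the OCaml tool discussed in this message; the
names `SEPARATORS`, `obfuscation_pattern` and `variants` are made up for
the example, and a real version would have to walk a parsed regexp rather
than a plain string.

```python
import itertools
import re

# Assumption: the separator set is the one used in the regexp above.
SEPARATORS = " _,."


def obfuscation_pattern(word, separators=SEPARATORS):
    """Return regexp source matching `word` with any run of separator
    characters (including none) allowed between adjacent letters."""
    gap = "[" + "".join(re.escape(c) for c in separators) + "]*"
    return gap.join(re.escape(ch) for ch in word)


def variants(word, separators):
    """Yield every spelling of `word` that has exactly one separator
    in each gap between letters."""
    gaps = len(word) - 1
    for seps in itertools.product(separators, repeat=gaps):
        # Pair each letter with the separator that follows it; the
        # last letter is followed by nothing.
        yield "".join(a + b for a, b in zip(word, seps + ("",)))


pattern = re.compile(obfuscation_pattern("VIAGRA"))

# Three separators in five gaps: 3^5 spellings, each a single token.
print(len(list(variants("VIAGRA", "_,."))))  # 243

# Every one-separator-per-gap spelling is caught by the one pattern.
print(all(pattern.fullmatch(v) for v in variants("VIAGRA", SEPARATORS)))

# Unrelated dotted abbreviations are left alone.
print(pattern.search("U.S.A") is None)
```

The generated pattern is case-sensitive, like the hand-written regexp
above; compiling it with `re.IGNORECASE` would be one way to loosen that.
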
I've got an OCaml regexp parser/unparser that handles almost all Perl
re's. It would be almost trivial to adapt it to do the above transform.
If I do the transform, would someone be willing to get the regexps into SA?
If there is interest, I can also see about releasing the code.

Scott

Amusingly enough, the first shipment of this email was bounced by
sourceforge with:

    Diagnostic-Code: X-Postfix; host mail.sourceforge.net[66.35.250.206] said:
        550-This message matches a blacklisted regular expression ([Vv] *[Ii] *[Aa]
        550 *[Gg] *[Rr] *[Aa]) (in reply to end of DATA command)

Note this version has 'viagr' in ONE place, but over 20 obfuscated versions
of that word. Let's see if this one passes. :)

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk