On 27 Nov 2003 10:06:46 +0200, [EMAIL PROTECTED] writes:

> On 27 Nov 2003 01:13:04 -0600, Scott A Crosby <[EMAIL PROTECTED]>
> posted to spamassassin-devel and spamassassin-talk:
> > On Wed, 26 Nov 2003 14:17:30 +0600, Alexander Litvinov
> > <[EMAIL PROTECTED]> writes:
> >> > Solution is to learn a monogram, bigram and trigram character model
> >> > for the ham you receive. Mix the statistics together (to account for
> >> > partial information) and that'll be very good at detecting gibberish
> >> > and foreign languages. Assume that if it's not been seen before, it's
> >> > a spam sign. Canonicalize the non-alphabetic tokens and it could
> >> > detect, weakly, mangled text like V.I.A.G.....
> >> This can be the solution, but V.I.A.G is a good example of Bayes' work.
> > Not really. Spammers can use:
> > V.I.A.G.R.A
> > V.I.A.G.R A
> <...>
> > V.I.A.G_R,A
> > V.I.A.G_R_A
> > and so on. Ignoring those that use " ", which would break the word into
> > two tokens, that is 3^5 = 243 distinct tokens. If they add a random
> > [,_.] at the beginning and end, that's 3^7 = 2187 distinct tokens, or
> > 3^2*4^5 = 9216 distinct ways to write viagra, without mangling a single
> > letter.
> The solution to this is to "normalize" each message before you pass it
> to the rules which examine the n-grams. I believe that's what was
> meant by "canonicalize" in the earliest message quoted above -- you'd
> replace all punctuation (and maybe whitespace too) with a single
> punctuation character ... or even strip out all punctuation and
> whitespace entirely and then look at the resulting n-grams.
Yup. That's exactly what I was referring to for the n-gram language
model, but that was a side effect. The n-gram language model would be
there to detect foreign or gibberish text; it would only weakly detect
obfuscated text as a secondary effect.

But I was describing an attack that works right now on all of the porn
rules. Really, *each* of them should be rewritten just as I rewrote
/VI.../ into /V[._, ]I[._, ].../. For more robustness, the letters
themselves also need to be rewritten, for instance /I/ into /[Ii1|]/.
The sort of thing I'm referring to is already up at
http://www.exit0.us/index.php/ChrissMediocreObfuScript -- except that
it does textual substitution; you can do better with a
parser/modifier/unparser. For instance, you can reduce the chance of
false positives by only having it transform the parts of the regexp
that are sufficiently long.

You can also do deeper things. Simple canonicalization or removal of
punctuation from the input before matching can't stop a spammer from
writing things like VxIxAxGxRxA and bypassing the filters. A
rule-transformer can take each and every rule, allow for an 'x' after
every character, and do the above automatically (but only
conditionally :) -- you wouldn't want to do it for short rules where
there's a high risk of false positives). There's a rough sketch of
what I mean in the postscript below. The catch for all this....

> More generally, I believe it would make sense to define a handful of
> different "normal forms" for different classes of rules.

Having lots of rules generally causes things to become slower. Also,
doing this sort of transformation will severely impact Perl's regexp
optimizer, because it breaks its constant-substring optimization.
Ergo, my current and past advocacy for a DFA engine that won't care
how many patterns it gets fed.

Scott
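
P.S. To make the rule-transformer idea a bit more concrete, here's a
rough sketch in Perl. This is illustrative only: it isn't anything in
SpamAssassin today, it works on plain words rather than doing a real
parse/modify/unparse of an arbitrary regexp, and the junk-character
class, the substitution table and the length cutoff are all made up.
It does show the two pieces I'm arguing for, though: tolerate junk
between letters (including an 'x', for the VxIxAxGxRxA case), and skip
the transformation for short patterns where the false-positive risk is
too high.

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Junk a spammer might wedge between letters.  Purely illustrative;
  # note the literal 'x' so that VxIxAxGxRxA is caught too.
  my $junk = qr/[._,\-|*x ]/;

  # Don't loosen short rules -- too much false-positive risk.
  my $min_len = 6;

  # A couple of example letter substitutions, as in /I/ -> /[Ii1|]/.
  my %subs = ( i => '[i1|!]', a => '[a@4]', o => '[o0]' );

  # Turn a plain word into a pattern that tolerates junk between
  # letters and simple letter substitutions.
  sub obfuscate_rule {
      my ($word) = @_;
      return qr/\Q$word\E/i if length($word) < $min_len;
      my @parts = map { $subs{$_} // quotemeta($_) } split //, lc $word;
      my $body  = join "(?:$junk){0,2}", @parts;   # <=2 junk chars per gap
      return qr/$body/i;
  }

  # Quick demonstration.
  my $re = obfuscate_rule('viagra');
  for my $t ('VIAGRA', 'V.I.A.G_R,A', 'VxIxAxGxRxA', 'vigor') {
      printf "%-15s %s\n", $t, ($t =~ $re ? 'match' : 'no match');
  }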
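
P.P.S. And for completeness, the character n-gram "gibberish detector"
from the top of the thread would be in a similar spirit to the sketch
below: normalize the text (here by stripping everything but letters),
then ask what fraction of its trigrams were never seen in your ham.
Again, the sub names and the toy corpus are invented for illustration;
a real version would mix monogram, bigram and trigram statistics
rather than using raw trigram counts alone.

  #!/usr/bin/perl
  use strict;
  use warnings;

  # "Normalize": lowercase and strip everything but letters before
  # looking at character n-grams, as suggested above.
  sub normalize {
      my ($text) = @_;
      $text = lc $text;
      $text =~ s/[^a-z]+//g;
      return $text;
  }

  # Record trigram counts from a piece of known-ham text.
  sub train_trigrams {
      my ($seen, $text) = @_;
      my $s = normalize($text);
      $seen->{ substr($s, $_, 3) }++ for 0 .. length($s) - 3;
  }

  # Fraction of a message's trigrams never seen in ham; a high value
  # suggests gibberish or an unfamiliar language.
  sub unseen_fraction {
      my ($seen, $text) = @_;
      my $s = normalize($text);
      return 0 if length($s) < 3;
      my ($total, $unseen) = (0, 0);
      for my $i (0 .. length($s) - 3) {
          $total++;
          $unseen++ unless $seen->{ substr($s, $i, 3) };
      }
      return $unseen / $total;
  }

  # Toy demonstration on a one-line "corpus".
  my %ham;
  train_trigrams(\%ham, 'please find the meeting notes attached, thanks');
  printf "%.2f\n", unseen_fraction(\%ham, 'the meeting notes');    # low
  printf "%.2f\n", unseen_fraction(\%ham, 'xqzvkj wpfgh qqrrzz');  # high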