On Wed, 26 Nov 2003 14:17:30 +0600, Alexander Litvinov <[EMAIL PROTECTED]> writes:

> > Solution is to learn a monogram, bigram and trigram character model
> > for the ham you receive. Mix the statistics together (to account for
> > partial information) and that'll be very good at detecting gibberish
> > and foreign languages. Assume if it's not been seen before that it's a
> > spam sign. Canonicalize the non-alphabetic tokens and it could detect,
> > weakly, mangled text like V.I.A.G.....
> 
> This can be the solution, but V.I.A.G is good example of byers work.
> 

Not really. Spammers can use:

V.I.A.G.R.A
V.I.A.G.R A
V.I.A.G.R,A
V.I.A.G.R_A

V.I.A.G R.A
V.I.A.G R A
V.I.A.G R,A
V.I.A.G R_A

V.I.A.G,R.A
V.I.A.G,R A
V.I.A.G,R,A
V.I.A.G,R_A

V.I.A.G_R.A
V.I.A.G_R A
V.I.A.G_R,A
V.I.A.G_R_A

and so on. Ignoring those that use " ", which would break the word into
two tokens, that is 3^5 = 243 distinct tokens. If they add a random
[,_.] at the beginning and end, that's 3^7 = 2187 distinct tokens.
Counting the " " variants as well gives 3^2*4^5 = 9216 distinct ways to
write viagr, without mangling a single letter.
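
The counts above can be checked with a short Python sketch. The helper
names (count_variants, enumerate_variants) and the placeholder word
"ABCDEF" are made up for illustration; only the word length and the
separator sets matter.

```python
from itertools import product

def count_variants(n_letters, inner, outer=0):
    """Count spellings with one separator in each gap between letters.

    inner: number of separator choices for each of the n_letters - 1 gaps.
    outer: number of separator choices at each end (0 = no end separators).
    """
    gaps = n_letters - 1
    ends = outer ** 2 if outer else 1
    return ends * inner ** gaps

def enumerate_variants(word, seps):
    """Yield every spelling of word with one separator from seps per gap."""
    for combo in product(seps, repeat=len(word) - 1):
        # Pair each letter except the last with the separator that follows it.
        yield "".join(ch + sep for ch, sep in zip(word, combo)) + word[-1]

print(count_variants(6, 3))      # 3^5: [,_.] in each of the 5 gaps
print(count_variants(6, 3, 3))   # 3^7: plus one of [,_.] at each end
print(count_variants(6, 4, 3))   # 3^2 * 4^5: also allowing " " in the gaps
print(len(set(enumerate_variants("ABCDEF", ".,_"))))  # brute-force check of 3^5
```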

This can be applied to readably mangle any six-letter word in 9216
distinct ways. Do this to two different words in an email (9216^2, or
about 85 million combinations) and now each message is unique. That
pretty much kills the current Pyzor implementation. Secondly, a spammer
can use this to track who may have submitted an email to a blacklist.

Fortunately, this can be caught with a regexp along the lines of:

V[ _,.]*I[ _,.]*A[ _,.]*G[ _,.]*R[ _,.]*A
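
As a quick sanity check, here is the same regexp run from Python against
a few of the variants listed earlier. This is only a sketch: the helper
name is_hit is made up, and the match is case-sensitive as written.

```python
import re

# The regexp from above, unchanged: any run of " ", "_", "," or "."
# (including none at all) is allowed between consecutive letters.
PATTERN = re.compile(r"V[ _,.]*I[ _,.]*A[ _,.]*G[ _,.]*R[ _,.]*A")

def is_hit(text):
    """Return True if the obfuscation-tolerant regexp matches anywhere in text."""
    return PATTERN.search(text) is not None

for sample in ["V.I.A.G.R.A", "V_I,A.G_R,A", "V.I.A.G R A", "U.S.A.", "V.I.P."]:
    print(repr(sample), is_hit(sample))
```

The first three samples should match and the last two should not, which
is the point of keying on the letters rather than on the dots.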

Just be happy that the porn spammers haven't started to use this
trick. Skimming over 20_porn.cf and 20_phrases.cf, not a single rule
looks for this, and no HTML required! I know that the more modern
rulesets on the list look directly for obfuscation, but doing it at
the individual regexp level is going to be better than looking for
lots of .'s. I care about V_I,A.G_R,A, or V.I.A.G.R.A, not U.S.A.

You need a way to programmatically transform regexps, en masse, into a
set that can detect trivial obfuscations like the above. I've got an
OCaml regexp parser/unparser that handles almost all Perl regexps. It
would be almost trivial to adapt it to do the above transform.
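
To give a rough idea of the transform, here is a minimal Python sketch
for the simplest case only: a rule that is a plain literal word. It is
not the OCaml tool, and it does not parse real regexps (character
classes, alternation, quantifiers and so on would need an actual
parser). The names SEPARATORS and obfuscation_pattern are made up for
illustration.

```python
import re

# Separator class taken from the hand-written regexp above.
SEPARATORS = "[ _,.]*"

def obfuscation_pattern(word):
    """Turn a literal word into a regexp that tolerates separators between letters."""
    # re.escape keeps any regexp metacharacter in the word literal.
    return SEPARATORS.join(re.escape(ch) for ch in word)

# Rebuild the hand-written rule from its letters and try it out.
pattern = obfuscation_pattern("VIAGRA")
print(pattern)
rule = re.compile(pattern, re.IGNORECASE)
print(bool(rule.search("v_i,a.g_r,a")))
print(bool(rule.search("U.S.A.")))
```

Compiling with re.IGNORECASE is a choice made here, not something the
hand-written regexp does; drop the flag to keep the match case-sensitive.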

If I do the transform, would someone be willing to get the regexps
into SA? If there is interest, I can also see about releasing the
code.

Scott


Amusingly enough, the first attempt to send this email was bounced by
SourceForge with:

Diagnostic-Code: X-Postfix; host mail.sourceforge.net[66.35.250.206] said:
    550-This message matches a blacklisted regular expression ([Vv] *[Ii] *[Aa]
    550 *[Gg] *[Rr] *[Aa]) (in reply to end of DATA command)


Note this version has 'viagr' in ONE place, but over 20 obfuscated
versions of that word. Let's see if this one passes. :)



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
