On 11 Dec 2003 08:11:43 +0200, [EMAIL PROTECTED] writes:

> Getting back on topic, the problem with a stepwise normalization of
> the message is that you sort of assume that transformations are
> applied consistently and mechanically. What would be really neat would
> be to have an automaton which recognizes all possible variants at the
> same time. 

It exists. Check out the dragon book, chapter 3. I'm working on a
variant specifically targeted for content filtering --- designed for
up to 10,000 or so rules, supporting overlapping matches, automatic
transformation of rules (a la the obfu script), all at about 10-100
times the performance of the current SA engine.

It *is* possible, but I don't have time to work on it aggressively so
it may be many months until I have a demo ready.

> The obfu script (look back in the archives for a few days)
> is a nice start, but it could obviously be improved. In the grand
> scheme of things, I imagine you would have to use another formalism
> instead of regular expressions to really capture what the spammers are
> doing.

For each substitution rule in the canonicalization, of the form:
   [o0O]  ---->  o     (i.e. R -> S)

you can apply it to the regular expression by substituting the
premise of the rule (R) for the consequent (S) in all character sets
and literals, along the lines of:

foo ->  f[o0O][o0O]
col[ou]r -> c[o0O]l[o0Ou]r
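
As a rough illustration only (not the engine I am working on), the
substitution step could be sketched in Python like this. The CANON
table and the function name are made up for the example, and it only
handles plain literals and simple bracket sets (no escapes, ranges,
negation or quantifiers):

```python
# Hypothetical canonicalization table: consequent (canonical char)
# mapped to the characters of its premise, i.e. [o0O] -> o.
CANON = {"o": "o0O"}


def expand_substitutions(pattern, canon=CANON):
    """Replace each canonical char with its premise in literals and sets.

    Assumes the pattern contains only literal characters and simple
    bracket sets such as [ou].
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            # Character set: expand every member, keeping order, no dupes.
            j = pattern.index("]", i)
            expanded = []
            for member in pattern[i + 1:j]:
                for variant in canon.get(member, member):
                    if variant not in expanded:
                        expanded.append(variant)
            out.append("[" + "".join(expanded) + "]")
            i = j + 1
        else:
            # Literal: becomes a set only if the rule table mentions it.
            variants = canon.get(ch, ch)
            out.append("[" + variants + "]" if len(variants) > 1 else ch)
            i += 1
    return "".join(out)


print(expand_substitutions("foo"))       # f[o0O][o0O]
print(expand_substitutions("col[ou]r"))  # c[o0O]l[o0Ou]r
```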

For each 'delete' rule, such as

   [_ ]* -> EPSILON

just put the premise between all tokens in the regexp (and at both
ends), which would give:

foo ->  [_ ]*f[_ ]*o[_ ]*o[_ ]*
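
Again purely as a sketch, the delete rule could be applied like this
in Python. The function names and the default filler are assumptions
for the example, and the tokenizer only understands literal
characters and simple bracket sets:

```python
import re


def tokenize(pattern):
    """Split a pattern into tokens: single literals or [..] sets."""
    tokens = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            j = pattern.index("]", i)
            tokens.append(pattern[i:j + 1])
            i = j + 1
        else:
            tokens.append(pattern[i])
            i += 1
    return tokens


def insert_deletions(pattern, filler="[_ ]*"):
    """Put the delete rule's premise around every token of the pattern."""
    return filler + "".join(tok + filler for tok in tokenize(pattern))


rule = insert_deletions("foo")
print(rule)  # [_ ]*f[_ ]*o[_ ]*o[_ ]*

# The transformed rule still matches the plain word and now also
# matches variants padded with the deletable characters.
print(bool(re.fullmatch(rule, "foo")))     # True
print(bool(re.fullmatch(rule, "f_o_ o")))  # True
```

Because whole bracket sets are treated as single tokens, this can be
run on the output of the substitution step as well.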

The major catch with this particular implementation is that it cannot
deal with nondeterministic transformations: the consequent of any
substitution rule must be a single character ('4 -> for' would be
bad). I don't think that is going to be a real problem in practice.
Another problem is that with a few good transforming rulesets, you've
just increased the size of the regexp ruleset 5x. The matching engine
has to support that without becoming even more of a resource hog.
This would be a problem for the perl regexp engine that SA uses, but
not for an automata-based matcher like the one I have been proposing
and implementing.

On the plus side, this sort of regexp transformation is fully
automatable. The real plus side is that it can transform *all* rules
and catch mail like:

 Subject: ***SPAM*** discount [EMAIL PROTECTED] ut pb dvifjzw

Scott


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
