On 11 Dec 2003 08:11:43 +0200, [EMAIL PROTECTED] writes:
> Getting back on topic, the problem with a stepwise normalization of
> the message is that you sort of assume that transformations are
> applied consistently and mechanically. What would be really neat would
> be to have an automaton which recognizes all possible variants at the
> same time.

It exists. Check out the dragon book, chapter 3. I'm working on a
variant specifically targeted at content filtering: designed for up to
10,000 or so rules, supporting overlapping matches and automatic
transformation of rules (a la the obfu script), all at about 10-100
times the performance of the current SA engine. It *is* possible, but I
don't have time to work on it aggressively, so it may be many months
until I have a demo ready.

> The obfu script (look back in the archives for a few days)
> is a nice start, but it could obviously be improved. In the grand
> scheme of things, I imagine you would have to use another formalism
> instead of regular expressions to really capture what the spammers are
> doing.

For each rule in the canonicalization where you do:

    [o0O] ----> o

i.e. R -> S, you can apply that to the regular expression by
substituting the premise of the rule for the consequent in all
character sets and literals, along the lines of:

    foo      -> f[o0O][o0O]
    col[ou]r -> c[o0O]l[o0Ou]r

For each 'delete' rule, just put it between all tokens in the regexp:

    [_ ]* -> EPSILON

which would be:

    foo -> [_ ]*f[_ ]*o[_ ]*o[_ ]*

The major catch with this particular implementation is that it cannot
deal with nondeterministic transformations. What this means is that any
consequent for a substitute rule must be a single character ('4 -> for'
would be bad). That's not something that I think is going to be a real
problem in practice.

Another problem is that with a few good transforming rulesets, you've
just increased the regexp ruleset 5x. The matching engine has to
support that without becoming even more of a resource hog. This would
be a problem for the Perl regexp engine that SA uses, but not for an
automata-based matcher like what I have been proposing and implementing.

On the plus side, this sort of regexp transformation is fully
automatable.
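
The two rewrite steps described above (substitute rules and delete rules) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the obfu script and not the automaton matcher: the names `tokenize`, `expand`, `transform`, `SUBSTITUTIONS` and `DELETABLE` are made up for this sketch, and the tokenizer assumes a pattern holds nothing but literal characters and simple [...] sets (no operators, groups, escapes or negated sets).

```python
import re

# Hypothetical rule tables mirroring the examples in this message.
# Substitute rule "[o0O] ----> o": key is the single-character consequent,
# value is the set of premise characters.
SUBSTITUTIONS = {"o": "o0O"}
# Delete rule "[_ ]* -> EPSILON": the premise that may appear between tokens.
DELETABLE = "[_ ]*"


def tokenize(pattern):
    """Split a pattern into literal characters and [...] character sets."""
    tokens = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)  # assumes a well-formed, simple set
            tokens.append(("set", pattern[i + 1:end]))
            i = end + 1
        else:
            tokens.append(("lit", pattern[i]))
            i += 1
    return tokens


def expand(members, subs):
    """Swap each consequent character for its premise characters, no duplicates."""
    out = []
    for ch in members:
        for c in subs.get(ch, ch):
            if c not in out:
                out.append(c)
    return "".join(out)


def transform(pattern, subs=SUBSTITUTIONS, deletable=None):
    """Apply substitute rules to every token, then interleave the delete rule."""
    parts = []
    for kind, text in tokenize(pattern):
        members = expand(text, subs)
        if kind == "lit" and len(members) == 1:
            parts.append(members)              # literal untouched by any rule
        else:
            parts.append("[" + members + "]")  # literal or set widened by a rule
    if deletable is None:
        return "".join(parts)
    # Put the deletable pattern before the first token and after every token.
    return deletable + "".join(p + deletable for p in parts)


print(transform("foo"))                                # f[o0O][o0O]
print(transform("col[ou]r"))                           # c[o0O]l[o0Ou]r
print(transform("foo", subs={}, deletable=DELETABLE))  # [_ ]*f[_ ]*o[_ ]*o[_ ]*
```

Keying the substitution table by a single character is what encodes the restriction that a consequent must be one character; a rule like '4 -> for' has no place in this table.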

The really plus side is that it can transform *all* rules and catch
mail like:

    Subject: ***SPAM*** discount [EMAIL PROTECTED] ut pb dvifjzw

Scott