Re: [SAtalk] Re: paris hilton

Chris Thielen Fri, 05 Dec 2003 11:09:27 -0800

[EMAIL PROTECTED] said:
<snip>
> If I enter a single-character string in the "easy mode" text box, the
> rules will somehow manage to drop the character from the obfuscated
> rules. I.e. for the input "d" I get the regex /(?!\bd\b)\b/i (and not
> the nonsensical /(?!\bd\b)\bd\b/i or an error message in the case when
> the default "obfu only" option is selected).


Thanks for the bug report.  I expect to fix that early next week.  No
doubt I assumed somewhere in the code that the input would be more than
one character long.

> Why are character classes not used consistently? For the input "lad"
> and with -g but no -o it gives me the regex
>     /(?:\b[l1I]|[\|\xA3]|(?:\xC5[\x80-\x82]|\xC4[\xB9-\xBF]))
<snip>
>     /(?:\b[l1I|\xA3]|(?:\xC5[\x80-\x82]|\xC4[\xB9-\xBF]))
<and>
>     (?:d\b|[\xD0]|\xC4[\x8E-\x91])/i
> or actually even with the last line being
>     (?:[d\xD0]|\xC4[\x8E-\x91])\b/i
>
> instead. I don't have any timings to back it up, but probably it will
> be slightly faster as well as more human-readable if you normalize the
> expressions to use classes wherever you can.
<snip>

What you're seeing here is the bugfix regarding word boundaries mentioned
on the home page and in the version history.  I'll explain why it works
the way it does.

Take this simple regex, for example:
/asdf/i
Let's pretend my rules-gen script is much simpler than it is.  It generates:
/[EMAIL PROTECTED]/i
This rule matches " ASDF ", "bananasASDF", " @SDF ", "[EMAIL PROTECTED]", etc...
and life is good.

Now, we add word boundaries to the original rule:
/\basdf/i
Simply adding \b word boundaries to the generated rule gives us:
/[EMAIL PROTECTED]/i
This rule matches " asdf ", "[EMAIL PROTECTED]", etc.  Because of the word
boundary, it no longer matches "bananasASDF" which is probably what we
wanted in the first place. However, notice that it also no longer matches
" @sdf ".  A \b only matches between a word character (\w) and a non-word
character (\W).  A space and an @ are both non-word characters, so there
is no boundary between them and the \b doesn't match.
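
To see the effect for yourself, here is a rough Python sketch.  The
pattern is only my stand-in for a simplified generated rule (the
assumption is that "a" may also be written as "@"); it is not the
actual output of the script.

```python
import re

# Hypothetical simplified rule: "a" may also appear as "@".
# The \b is simply prepended, as in the naive approach described above.
naive = re.compile(r"\b[a@]sdf", re.IGNORECASE)

for text in (" asdf ", "bananasASDF", " @sdf "):
    # Only " asdf " matches: it is the only input with a word boundary
    # directly in front of a character from the class.
    print(repr(text), bool(naive.search(text)))
```

The last input fails because both the space and the @ are \W, so no
word boundary exists between them.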

My solution was to split the tokens into word/nonword classes and group
them.  The characters in the word character class get the \b word boundary
check, while the non-word character classes simply match regardless of
what's on the other side.
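
A minimal sketch of that grouping in Python, for illustration only.
The helper name first_token and the variant list are made up here, and
the real script handles far more cases (multi-byte sequences, trailing
boundaries and so on).

```python
import re

def first_token(variants):
    # Split the variants of the first character into word and non-word
    # groups; only the word group gets the leading word-boundary check.
    word = [c for c in variants if re.fullmatch(r"\w", c)]
    nonword = [c for c in variants if not re.fullmatch(r"\w", c)]
    parts = []
    if word:
        parts.append(r"\b[" + "".join(re.escape(c) for c in word) + "]")
    if nonword:
        parts.append("[" + "".join(re.escape(c) for c in nonword) + "]")
    return "(?:" + "|".join(parts) + ")"

# Hypothetical variants for "a"; the rest of the phrase is left literal.
rule = re.compile(first_token(["a", "@"]) + "sdf", re.IGNORECASE)

for text in (" asdf ", "bananasASDF", " @sdf "):
    print(repr(text), bool(rule.search(text)))
```

With the groups split this way, " asdf " and " @sdf " both match while
"bananasASDF" still does not.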

Makes sense?  It really does allow for better matching, methinks.


> Thanks for a useful tool, BTW! I wish I had thought of setting that up.
>
> /* era */

Glad you like it!


--
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
