[EMAIL PROTECTED] said:
<snip>
> If I enter a single-character string in the "easy mode" text box, the
> rules will somehow manage to drop the character from the obfuscated
> rules. I.e. for the input "d" I get the regex /(?!\bd\b)\b/i (and not
> the nonsensical /(?!\bd\b)\bd\b/i or an error message in the case when
> the default "obfu only" option is selected).

Thanks for the bug report. I expect to fix that early next week. No doubt
I assumed somewhere in the code that there would be more than one
character.

> Why are character classes not used consistently? For the input "lad"
> and with -g but no -o it gives me the regex
> /(?:\b[l1I]|[\|\xA3]|(?:\xC5[\x80-\x82]|\xC4[\xB9-\xBF]))
<snip>
> /(?:\b[l1I|\xA3]|(?:\xC5[\x80-\x82]|\xC4[\xB9-\xBF]))
<and>
> (?:d\b|[\xD0]|\xC4[\x8E-\x91])/i
> or actually even with the last line being
> (?:[d\xD0]|\xC4[\x8E-\x91])\b/i
>
> instead. I don't have any timings to back it up, but probably it will
> be slightly faster as well as more human-readable if you normalize the
> expressions to use classes wherever you can.
<snip>

What you're seeing here is the bugfix regarding word boundaries mentioned
on the home page and in the version history. I'll explain why it works
the way it does.

Take this simple regex, for example:

/asdf/i

Let's pretend my rules-gen script is much simpler than it is. It
generates:

/[EMAIL PROTECTED]/i

This rule matches " ASDF ", "bananasASDF", " @SDF ", "[EMAIL PROTECTED]",
etc... and life is good.

Now, we add word boundaries to the original rule:

/\basdf/i

Simply adding \b word boundaries to the generated rule gives us:

/[EMAIL PROTECTED]/i

This rule matches " asdf ", "[EMAIL PROTECTED]", etc. Because of the word
boundary, it no longer matches "bananasASDF", which is probably what we
wanted in the first place. However, notice that it also no longer matches
" @sdf ". This is because a space and an @ are both non-word characters
(\W), so there is no word boundary between them and the \b doesn't match.

My solution was to split the tokens into word and non-word character
classes and group them. The word character class gets the \b word
boundary check, while the non-word character classes simply match
regardless of what's on the other side.

Makes sense? It really does allow for better matching, methinks.

> Thanks for a useful tool, BTW! I wish I had thought of setting that up.
>
> /* era */

Glad you like it!

--
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/
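
The word-boundary behaviour described in the reply can be checked with a short script. This is only a sketch in Python's `re` module, not the rules-gen script itself. The generated regexes are redacted in the archived message, so the class `[a@]` used here is an assumption chosen to reproduce the matches the reply lists; `NAIVE`, `SPLIT`, `SAMPLES` and `matches` are hypothetical names.

```python
import re

# Assumed, much-simplified obfuscation: "a" may also be written "@".
# Naive form: \b is applied in front of the whole character class.
NAIVE = re.compile(r"\b[a@]sdf", re.IGNORECASE)

# Split form: the word character keeps its \b check, while the
# non-word character matches regardless of what precedes it.
SPLIT = re.compile(r"(?:\b[a]|[@])sdf", re.IGNORECASE)

SAMPLES = [" asdf ", " @sdf ", "bananasASDF"]


def matches(pattern, samples=SAMPLES):
    """Return the samples in which the pattern is found."""
    return [s for s in samples if pattern.search(s)]


if __name__ == "__main__":
    # A space followed by "@" is two non-word characters in a row,
    # so no \b exists between them and the naive form misses " @sdf ".
    print("naive:", matches(NAIVE))
    # The split form catches " @sdf " and still rejects "bananasASDF".
    print("split:", matches(SPLIT))
```

The split form is also why the generated rules keep `\b[l1I]` and `[\|\xA3]` as separate alternatives: merging them into one class would put the `\b` check in front of the non-word characters again.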