On 8/9/2010 8:27 AM, Henrik K wrote:
Nope, people constantly underestimate the power of regexes. Of course you
can easily write bad ones, but Perl can run huge lists of simple alternations
FAST.

I downloaded a pack of 10000 random names and made a quick hack to turn it
into a regex with my favourite module, Regexp::Assemble.

------------------------------
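#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
my ($cnt, $idx) = (0, 0);
while (<STDIN>) {
    chomp;
    # Read comma separated names from stdin: Firstname,Lastname
    my ($firstname, $lastname) = split(',', lc);
    # Firstname Lastname
    # (add() takes a pattern, so regex metacharacters in a name stay live)
    $ra->add("$firstname $lastname");
    # Lastname,? Firstname
    $ra->add("$lastname,? $firstname");
    # Print rule every 10000 names, and once more at end of input
    # (?:^| ) instead of \b since "Kate" would hit "Mary-Kate"
    if (++$cnt % 10000 == 0 || eof STDIN) {
        print 'body TEST_NAMES_' . ++$idx;
        # as_string consumes the patterns added so far, so the next rule
        # starts from an empty assembler
        print ' /(?:^| )' . $ra->as_string . '(?:$| )/i' . "\n";
    }
}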
------------------------------
./names.pl < names.csv > names.cf
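
To illustrate the comment about (?:^| ) versus \b, here is a minimal sketch
that uses only core Perl. The single name 'kate smith' and the two sample
sentences are made up for the demonstration; they stand in for the assembled
alternation and for real message bodies.

```perl
use strict;
use warnings;

# Hypothetical single name standing in for the assembled alternation.
my $name = 'kate smith';

# Word-boundary variant: \b also matches between "-" and "k".
my $boundary = qr/\b\Q$name\E\b/i;

# Space/anchor variant, concatenated the same way the script builds its rule.
my $delimited_src = '(?:^| )' . quotemeta($name) . '(?:$| )';
my $delimited = qr/$delimited_src/i;

# Return 1 if the text matches the regex, 0 otherwise.
sub hits {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

for my $text ('call kate smith today', 'call mary-kate smith today') {
    printf "%-28s \\b=%d  space=%d\n",
        $text, hits($boundary, $text), hits($delimited, $text);
}
```

The \b variant fires on both sentences, while the space-delimited variant
skips the hyphenated one. The trade-off is that a name followed directly by
punctuation such as "." is also skipped by the space-delimited form.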

The resulting single 170000 byte rule did not affect SA in any way; there was
virtually no difference in my mass-check tests. Running the regex manually
over a file gives about 80000 lines/second on one 3 GHz core.
I think you could make rules/REs that are megabytes in size, but there is
probably nothing to gain from it.
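
A manual lines-per-second check like the one mentioned above could look
roughly like this. It is only a sketch, not the script used for the numbers
quoted: the scan() helper, the two-name stand-in pattern, and the generated
sample text are all assumptions. In practice the regex would be taken from
names.cf and the input would be a real mail corpus.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run one compiled regex over every line of a filehandle and return
# (lines read, lines matched, elapsed seconds).
sub scan {
    my ($re, $fh) = @_;
    my ($lines, $matches) = (0, 0);
    my $start = time;
    while (my $line = <$fh>) {
        $lines++;
        $matches++ if $line =~ $re;
    }
    return ($lines, $matches, time - $start);
}

# Hypothetical stand-in for the generated rule body.
my $re = qr/(?:^| )(?:kate smith|smith,? kate)(?:$| )/i;

# Hypothetical sample input: 2000 lines, half of which contain the name.
my $sample = join '', map {
    "line $_ mentions kate smith here\nline $_ mentions nobody\n"
} 1 .. 1000;

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my ($lines, $matches, $elapsed) = scan($re, $fh);
close $fh;

printf "%d lines, %d matches, %.0f lines/second\n",
    $lines, $matches, $elapsed > 0 ? $lines / $elapsed : 0;
```

The throughput figure depends heavily on line length and on the machine, so
it is only comparable between runs over the same input.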

About ClamAV...

+ It would probably handle this even faster
+ Easy logging of exact signature that got hit (single name per sig)
- It would also match any header like To: From: etc (PRETTY BAD...)

I'd choose SA since it's way more flexible. I doubt performance here is a
factor, especially with outgoing mail..

Thanks for the info.

- It would also match any header like To: From: etc (PRETTY BAD...)

That could be an issue. I will check whether I can find a workaround; if not, ClamAV may not be an option.
