Re: Why is regex so slow?

Roy Smith Wed, 19 Jun 2013 13:04:02 -0700

On Wednesday, June 19, 2013 9:21:43 AM UTC-4, Duncan Booth wrote:

> I'd just like to point out that your simple loop is looking at every 
> character of the input string. The simple "'ENQ' not in line" test can look 
> at the third character of the string and if it's none of 'E', 'N' or 'Q' 
> skip to checking the 6th and then the 9th. It doesn't have to touch the 
> intervening characters at all.


It's been a while since I looked at boyer-moore in detail.  Looking at 
Objects/stringlib/fastsearch.h from the 2.7.4 source, it occurs to me that:

        /* create compressed boyer-moore delta 1 table */

        /* process pattern[:-1] */
        for (i = 0; i < mlast; i++) {
            STRINGLIB_BLOOM_ADD(mask, p[i]);
            if (p[i] == p[mlast])
                skip = mlast - i - 1;
        }
        /* process pattern[-1] outside the loop */
        STRINGLIB_BLOOM_ADD(mask, p[mlast]);

is essentially (well, sort-if) the same as the compile() step of a regex.  For 
the (presumably) common use case of searching many strings for the same 
substring (which is what we're doing here), it seems like it would be a win to 
cache the mask and reuse it if the search string id is the same as the last 
search string id.  The overhead on cache misses would be a single pointer 
comparison.  Has anybody looked at doing that?
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Why is regex so slow?

Reply via email to