On Wednesday, June 19, 2013 9:21:43 AM UTC-4, Duncan Booth wrote: > I'd just like to point out that your simple loop is looking at every > character of the input string. The simple "'ENQ' not in line" test can look > at the third character of the string and if it's none of 'E', 'N' or 'Q' > skip to checking the 6th and then the 9th. It doesn't have to touch the > intervening characters at all.
It's been a while since I looked at boyer-moore in detail. Looking at Objects/stringlib/fastsearch.h from the 2.7.4 source, it occurs to me that: /* create compressed boyer-moore delta 1 table */ /* process pattern[:-1] */ for (i = 0; i < mlast; i++) { STRINGLIB_BLOOM_ADD(mask, p[i]); if (p[i] == p[mlast]) skip = mlast - i - 1; } /* process pattern[-1] outside the loop */ STRINGLIB_BLOOM_ADD(mask, p[mlast]); is essentially (well, sort-if) the same as the compile() step of a regex. For the (presumably) common use case of searching many strings for the same substring (which is what we're doing here), it seems like it would be a win to cache the mask and reuse it if the search string id is the same as the last search string id. The overhead on cache misses would be a single pointer comparison. Has anybody looked at doing that? -- http://mail.python.org/mailman/listinfo/python-list