Justin Mason wrote:
There's an interesting discussion on my weblog at
http://taint.org/2006/07/07/184022a.html , regarding the
recently-contributed "trie" and Aho-Corasick regexp optimizations, which
have been added to perl 5.9.x.

These could provide a way to get *serious* speedups in SpamAssassin.

Note that these work really well on fixed text. Once you start adding STAR, PLUS and CURLY the gain diminishes.

If anyone is interested in working on speedups, this would be a really
great place to do it; body rules are by far the biggest CPU time sink in
our code.

One source of slowness is that a lot of SA pattern writers use (?=[abcd]) zwla's in the belief, I think, to supply more information to the engine, but even in older perls this was always a loss. You have to have a something like 95-99% of strings not matching /[abcd]/ for the benefit to kick in, at least that has been my experience.

I think the biggest problem is the fact that the scanning algorithm consists of applying one pattern after the other to see what it triggers.

I wrote Regexp::Assemble to allow one to take a large slab of patterns and concoct a single expression (also structured as a trie, incidentally) as a result. The benefit is that all the common paths are shared, so all the expensive curlies and stars are visited only one.

For instance, with /a+b+c+/ and /a+c+e+c+/ the result is /a+(c+e+|b+)c+/, so if you are give aaaafc, you don't have to try the first pattern, fail and then the second pattern and fail. As soon as the alternation in the middle of the resulting pattern above fails, there's no backtracking and so you stop. When you have three thousand patterns instead of two, this starts to become very interesting.

Regexp::Assemble also has a mode whereby you can build up a hash of pattern => coderef pairs, assemble all the patterns into a single pattern, and then feed that the body text.

This then means that your entire scan of the body just becomes a single //g pattern match, and each time something matches, you can go back to the dispatch table and call out the action according to what just matched. The more patterns you have, the more efficient it becomes.

I'd be happy to explain this in more detail if this sounds like something you want to explore.

David
--
Much of the propaganda that passes for news in our own society is given to immobilising and pacifying people and diverting them from the idea that they can confront power. -- John Pilger

Reply via email to