Bowie Bailey wrote:
Try using a string that's longer than 3200 characters that starts with a
short comment.
e.g.: '<!-- comment --> blah blah blah blah.....'
This is where your original version will fail. Your original regex
translates as "a string starting with a comment opener followed by at
least 3200 characters that do not start with a comment closer". So a
long string that starts with a short comment will match your original
regexp. I confirmed this by running your code above and moving the
comment closer from the end to just after the first "foo".
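A minimal sketch of the difference described above, written in Python because the original Perl code is not shown here (the look-ahead syntax is the same in both). Both patterns and the threshold N = 20 are assumptions made up for illustration; they are not the actual rules from this thread.

```python
import re

# Hypothetical threshold, kept small so the demo strings stay readable.
N = 20

# Look-ahead tested once, immediately after the comment opener.
ORIGINAL = re.compile(r"<!--(?!-->).{%d,}" % N, re.DOTALL)

# Look-ahead tested before every character, so the run of at least N
# characters can never cross a comment closer.
TEMPERED = re.compile(r"<!--(?:(?!-->).){%d,}" % N, re.DOTALL)

# A short comment followed by plenty of ordinary text.
short_comment = "<!-- foo --> " + "blah " * 10
# A single comment whose body is longer than N characters.
long_comment = "<!-- " + "foo " * 10 + "-->"


def matches(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None


for label, text in (("short comment", short_comment),
                    ("long comment", long_comment)):
    print(label, "original:", matches(ORIGINAL, text),
          "tempered:", matches(TEMPERED, text))
```

The first pattern only checks that the opener is not immediately followed by a closer, so any long string beginning with a short comment still satisfies it. The second repeats the check at every position, which is where the extra look-aheads come from.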
Ah, right, OK. The example in the Camel Book isn't very clear on
exactly how the condition attaches to either the regex as a whole, or
any particular part of it.
None of the variants seem to be *too* nasty on the CPU though; feeding
one of these monster messages through a minimal Perl script as above
that just runs a handful of regexes showed:
real 0m0.050s
user 0m0.045s
sys 0m0.012s
That doesn't look too bad. I compared the two variants on my own with a
large test string (over 32000 chars) and found that the extra
look-aheads in the working regexp took my case from 26ms to 36ms, an
increase of roughly 38%. Probably not enough to cause a problem, but
definitely significant.
However, this only occurs when there is a huge comment. If the comment
is small, both versions run the same, so you are probably ok as far as
that goes.
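A comparison along these lines could be sketched as follows. This is Python rather than the Perl used in the thread, the two patterns are assumed stand-ins for the variants under discussion, and the test strings are invented. Absolute timings will differ by machine and regex engine, so the figures quoted above should not be expected to reproduce.

```python
import re
import timeit

# Threshold taken from the description earlier in the thread; the patterns
# themselves are assumptions, not the real rules.
N = 3200

ONCE = re.compile(r"<!--(?!-->).{%d,}" % N, re.DOTALL)       # one look-ahead
EACH = re.compile(r"<!--(?:(?!-->).){%d,}" % N, re.DOTALL)   # one per character

# Invented test strings, each a little over 32000 characters long.
huge = "<!-- " + "x" * 40000 + " -->"        # one huge comment
small = "<!-- x -->" + "y" * 40000           # short comment, long body


def time_ms(pattern, text, repeat=20):
    """Average wall-clock time of a single search, in milliseconds."""
    total = timeit.timeit(lambda: pattern.search(text), number=repeat)
    return total / repeat * 1000.0


for label, text in (("huge comment", huge), ("small comment", small)):
    print("%-14s once: %.3f ms   each: %.3f ms"
          % (label, time_ms(ONCE, text), time_ms(EACH, text)))
```

The per-character look-ahead only costs extra while the engine is walking through a long comment body, which fits the observation that small comments show no difference.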
It's probably a lot nastier on large *legitimate* messages with many
(small) HTML comments, but those already take a long time to scan anyway
and the best thing I can do about them is whitelist or blacklist them
upstream of SA (depending on user preference).
Closer inspection of one of these spams showed it was actually several
very long HTML comments in between the actual content tags - all four or
five of them. Stripping the comments trims it down to less than 1K -
essentially just a couple of <img> tags pointing to remote servers for
the actual spam payload images.
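The effect of stripping the comments can be illustrated with a small Python sketch. The message below is made up to resemble the description (a few image tags padded with long comments); the URLs and filler text are placeholders. A non-greedy regex is only an approximation of how a real HTML parser treats comments.

```python
import re

# Non-greedy match so each comment ends at its own closer.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Return html with every <!-- ... --> span removed."""
    return COMMENT.sub("", html)


# Invented stand-in for one of the spams: long filler comments around
# a couple of remote image tags.
filler = "<!-- " + "lorem ipsum " * 1000 + "-->"
message = (
    filler
    + '<img src="http://example.invalid/a.gif">'
    + filler
    + '<img src="http://example.invalid/b.gif">'
    + filler
)

stripped = strip_comments(message)
print(len(message), "bytes before,", len(stripped), "bytes after")
```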
-kgd