Bowie Bailey wrote:
Try using a string that's longer than 3200 characters that starts with a
short comment.

i.e.:    '<!-- comment -->  blah blah blah blah.....'

This is where your original version will fail.  Your original regex
translates as "a string starting with a comment opener followed by at
least 3200 characters that do not start with a comment closer".  So a
long string that starts with a short comment will match your original
regexp.  I confirmed this by running your code above and moving the
comment closer from the end to just after the first "foo".
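
For anyone following along, here is a rough sketch of the difference between the two forms. It is written in Python rather than Perl, but the look-ahead, grouping and {n,} syntax used here is the same in both. The exact patterns from the original rule are not reproduced in this message, so `loose` and `strict` below are assumed reconstructions from the description above, and the 3200 threshold is taken from that description as well.

```python
import re

THRESHOLD = 3200  # assumed, taken from the description of the regex above

# Look-ahead tested once, immediately after the opener: it only guarantees
# that the comment is not empty ("<!---->"), nothing about what follows.
loose = re.compile(r"<!--(?!-->).{%d,}" % THRESHOLD, re.DOTALL)

# Look-ahead tested before every character: none of the THRESHOLD
# characters after the opener may begin a comment closer.
strict = re.compile(r"<!--(?:(?!-->).){%d,}" % THRESHOLD, re.DOTALL)

# A short comment followed by a lot of ordinary text (made-up sample).
short_comment = "<!-- comment -->" + "blah " * 1000
# A genuinely huge comment (made-up sample).
huge_comment = "<!--" + "x" * 4000 + "-->"

results = {
    "loose/short": bool(loose.search(short_comment)),
    "strict/short": bool(strict.search(short_comment)),
    "loose/huge": bool(loose.search(huge_comment)),
    "strict/huge": bool(strict.search(huge_comment)),
}
for name, hit in results.items():
    print(name, hit)
```

The loose form fires on the short comment because `.` happily runs past the closer, while the strict form only fires when the comment itself is long.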

Ah, right, OK. The example in the Camel Book isn't very clear on exactly how the condition attaches to either the regex as a whole, or any particular part of it.

None of the variants seem to be *too* nasty on the CPU though;  feeding
one of these monster messages through a minimal Perl script as above
that just runs a handful of regexes showed:

real    0m0.050s
user    0m0.045s
sys     0m0.012s

That doesn't look too bad.  I compared the two variants on my own with a
large test string (over 32000 chars) and found that the extra
look-aheads in the working regexp took my case from 26ms to 36ms.
Probably not enough to cause a problem, but definitely significant.
However, this only occurs when there is a huge comment.  If the comment
is small, both versions run the same, so you are probably ok as far as
that goes.
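
A comparison along these lines can be sketched as below. This is Python rather than the Perl script used for the figures above, the two patterns are assumed reconstructions rather than the actual rule, and the test strings are made up, so the absolute numbers will not line up with the 26ms and 36ms quoted. It is only meant to show the shape of the measurement.

```python
import re
import timeit

THRESHOLD = 3200  # assumed threshold, not taken from the actual rule

# Assumed reconstructions of the two variants under discussion.
loose = re.compile(r"<!--(?!-->).{%d,}" % THRESHOLD, re.DOTALL)
strict = re.compile(r"<!--(?:(?!-->).){%d,}" % THRESHOLD, re.DOTALL)

# Made-up test strings: one huge comment, and one small comment
# followed by a large amount of ordinary markup.
huge = "<!--" + "x" * 32000 + "-->"
small = "<!-- hi -->" + "<p>text</p>" * 3000


def seconds_per_search(pattern, text, runs=20):
    """Average wall-clock time of one pattern.search(text) call."""
    return timeit.timeit(lambda: pattern.search(text), number=runs) / runs


timings = {}
for text_name, text in (("huge", huge), ("small", small)):
    for pat_name, pattern in (("loose", loose), ("strict", strict)):
        timings[(pat_name, text_name)] = seconds_per_search(pattern, text)

for (pat_name, text_name), secs in sorted(timings.items()):
    print(f"{pat_name:6s} {text_name:5s} {secs * 1000:.3f} ms")
```

Averaging over several runs smooths out the noise you get from timing a single match.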

It's probably a lot nastier on large *legitimate* messages with many (small) HTML comments, but those already take a long time to scan anyway and the best thing I can do about them is whitelist or blacklist them upstream of SA (depending on user preference).

Closer inspection of one of these spams showed it was actually several very long HTML comments in between the actual content tags - all four or five of them. Stripping the comments trims it down to less than 1K - essentially just a couple of <img> tags pointing to remote servers for the actual spam payload images.
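
To give an idea of the stripping step, here is a minimal sketch in Python. The message body is a made-up stand-in (not one of the actual spams), the `example.invalid` URLs are placeholders, and `strip_comments` is a hypothetical helper name, not anything SA provides.

```python
import re

# Non-greedy match so that each comment is removed separately and the
# content between two comments is left alone.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Return html with every <!-- ... --> comment removed."""
    return COMMENT.sub("", html)


# Made-up sample: a couple of <img> tags separated by very long comments.
padding = "<!--" + "lorem ipsum " * 1000 + "-->"
message = (
    "<html><body>" + padding
    + '<img src="http://example.invalid/a.gif">' + padding
    + '<img src="http://example.invalid/b.gif">' + padding
    + "</body></html>"
)

stripped = strip_comments(message)
print(len(message), "->", len(stripped))
```

Comparing the two lengths gives a quick feel for how much of a message is comment padding.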

-kgd
