Re: Regex help (targetting very long HTML comments)

Kris Deugau Tue, 03 Apr 2012 09:12:22 -0700

Bowie Bailey wrote:

Try using a string that's longer than 3200 characters that starts with a
short comment.


i.e.:    '<!-- comment -->  blah blah blah blah.....'

This is where your original version will fail.  Your original regex
translates as "a string starting with a comment opener followed by at
least 3200 characters that do not start with a comment closer".  So a
long string that starts with a short comment will match your original
regexp.  I confirmed this by running your code above and moving the
comment closer from the end to just after the first "foo".
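
To make the difference concrete, here is a minimal sketch. The two patterns are my own reconstruction from the description above, not the actual rules from this thread, and it is in Python only because that is easy to run standalone; the look-ahead syntax is the same in Perl. ONCE checks the look-ahead a single time right after the opener; EVERY re-checks it before each character, so the run can't cross a "-->".

```python
import re

# Hypothetical reconstructions; the thread's real regexes are not shown here.
# ONCE: look-ahead tested one time, immediately after the comment opener.
ONCE = re.compile(r"<!--(?!-->).{3200,}", re.S)
# EVERY: look-ahead tested before each character, so the 3200+ run
# must sit entirely inside a single comment.
EVERY = re.compile(r"<!--(?:(?!-->).){3200,}-->", re.S)

# Short comment followed by a long body (the failing case described above).
short_then_long = "<!-- comment -->" + " blah" * 1000
# One genuinely long comment (the case the rule is meant to catch).
one_long_comment = "<!--" + " foo" * 1000 + "-->"

results = {
    "short_then_long": (bool(ONCE.search(short_then_long)),
                        bool(EVERY.search(short_then_long))),
    "one_long_comment": (bool(ONCE.search(one_long_comment)),
                         bool(EVERY.search(one_long_comment))),
}

for name, (once_hit, every_hit) in results.items():
    print(f"{name}: once={once_hit} every={every_hit}")
```

Under these assumptions, only the ONCE form fires on the short-comment string; both fire on the long comment.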

Ah, right, OK. The example in the Camel Book isn't very clear on exactly how the condition attaches to either the regex as a whole, or any particular part of it.

None of the variants seem to be *too* nasty on the CPU though;  feeding
one of these monster messages through a minimal Perl script as above
that just runs a handful of regexes showed:

real    0m0.050s
user    0m0.045s
sys     0m0.012s


That doesn't look too bad.  I compared the two variants on my own with a
large test string (over 32000 chars) and found that the extra
look-aheads in the working regexp took my case from 26ms to 36ms.
Probably not enough to cause a problem, but definitely significant.
However, this only occurs when there is a huge comment.  If the comment
is small, both versions run the same, so you are probably ok as far as
that goes.
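
For anyone who wants to repeat that kind of comparison, a rough timing sketch follows. The patterns are the same hypothetical reconstructions (not the thread's actual rules), the test strings are made up, and Python's regex engine is not Perl's, so treat the numbers as relative at best.

```python
import re
import timeit

# Hypothetical patterns, reconstructed from the discussion.
ONCE = re.compile(r"<!--(?!-->).{3200,}", re.S)
EVERY = re.compile(r"<!--(?:(?!-->).){3200,}-->", re.S)

# One huge comment, a little over 32000 characters in total.
huge = "<!--" + " foo" * 8000 + "-->"
# A short message with many small comments; neither pattern should match it.
small = "<!-- note --><p>text</p>" * 100

RUNS = 200  # arbitrary repeat count for averaging

for label, text in (("huge comment", huge), ("small comments", small)):
    for name, rx in (("once", ONCE), ("every", EVERY)):
        secs = timeit.timeit(lambda: rx.search(text), number=RUNS)
        print(f"{label:15s} {name:6s} {secs / RUNS * 1e6:10.1f} us per search")
```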

It's probably a lot nastier on large *legitimate* messages with many (small) HTML comments, but those already take a long time to scan anyway and the best thing I can do about them is whitelist or blacklist them upstream of SA (depending on user preference).

Closer inspection of one of these spams showed it was actually several very long HTML comments in between the actual content tags - all four or five of them. Stripping the comments trims it down to less than 1K - essentially just a couple of <img> tags pointing to remote servers for the actual spam payload images.
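
A quick way to see that effect on a sample, sketched under assumptions: the message below is synthetic (padding text and example.invalid URLs are made up), and a non-greedy regex is only a rough stand-in for real HTML comment parsing.

```python
import re

# Non-greedy match from each opener to the nearest closer.
COMMENT = re.compile(r"<!--.*?-->", re.S)

# Synthetic stand-in for the spam body: long filler comments around two tags.
padding = "<!--" + " lorem ipsum" * 500 + "-->"
message = (
    padding
    + '<img src="http://example.invalid/a.jpg">'
    + padding
    + '<img src="http://example.invalid/b.jpg">'
    + padding
)

stripped = COMMENT.sub("", message)
print(len(message), "bytes before,", len(stripped), "bytes after")
print(stripped)
```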


-kgd
