On 4/2/2012 6:03 PM, Kris Deugau wrote:
>> On 4/2/2012 12:58 PM, Stephane Chazelas wrote:
>>> Don't know about the spamassassin issue, but that regexp
>>> matches <!-- followed by a sequence of 32000 or more characters
>>> provided that sequence doesn't start with "-->".
>>>
>>> ITYM
>>>
>>> m|<!--(?:(?!-->).){32000,}|s
>>>
>>> That is you need to look ahead at each character of the sequence
>>> to look for the closing comment tag, otherwise you'll match on
>>> <!-- short comment -->  <31982 or more characters>
> Actually, no, it works as intended.
>
> If you uncomment the string fragments below, the 320-character versions 
> both match but the 32000-character ones don't.  As-is, neither matches.
>
> my $shorty = "<!-- foo bar baxkja safdjwelkj werf kjwlekrjwlekr jlwkerk jawelkj awlekj lakewjflakwjef lakj ".
> #"awelkj alkfj awlekfj lawie fjalwief jlawijfe lawiejflfiwj elifj lawiej4lti j34wlit j43wli jliajs lij flisaj ".
> #"flsaidfj liasjdf lisdj lijsa fldi fa;slkjf;lask j;lkaj fs; jfsdjf sak hflkshf lksj fhlksaj fhlska fhlkajs ".
> "fhlkajshlkjashflkjasdfhlkjsahdflkjas hlkfh lwelif hwli3u fhliuwae fhliuawfheliuhfliu fhwei ufhsd fg/sd ".
> "/dsf/g/sdafg /sdf/ gdf/sg/sdf/gds/g/sd/th/ser/h/ser/ghs/rg/srg/ser/gs/erg/ser/g/ser/g/ser/g/ser/g/serg -->";
>
> my @regex = ('<!--(?:(?!-->).){32000,}', '<!--(?:(?!-->).){320,}', 
> '<!--(?!-->).{32000,}', '<!--(?!-->).{320,}');
>
> foreach (@regex) {
>    print "$_ shorty ok\n" if $shorty =~ m/$_/s;
> }
>
> (And yes, this is almost exactly what I'm seeing in these monster 
> comments, although they're usually at least mostly real words, and they 
> are in the ~100K+-characters length range.)

Try using a string that's longer than 320 characters that starts with a
short comment.

i.e.:    '<!-- comment --> blah blah blah blah.....'

This is where your original version will fail.  Your original regex
translates as "a string starting with a comment opener followed by at
least 32000 characters that do not start with a comment closer".  The
look-ahead is only checked once, right after the opener.  So a long
string that starts with a short comment will match your original
regexp.  I confirmed this by running your code above and moving the
comment closer from the end to just after the first "foo".

Output:
<!--(?!-->).{320,} shorty ok
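
A minimal sketch of the same check, for anyone who wants to reproduce it.
The test string and variable names here are made up for illustration, and
the threshold is scaled down to 320 to keep the string short.  Only the
original pattern should report a match:

```perl
use strict;
use warnings;

# Hypothetical test string: a short comment followed by 500 characters
# of ordinary text, none of which is inside a comment.
my $text = '<!-- foo -->' . ('blah ' x 100);

my $original  = '<!--(?!-->).{320,}';       # look-ahead checked once, right after <!--
my $corrected = '<!--(?:(?!-->).){320,}';   # look-ahead checked before every character

my $orig_hit = ($text =~ m/$original/s)  ? 1 : 0;
my $corr_hit = ($text =~ m/$corrected/s) ? 1 : 0;

print "original:  ", ($orig_hit ? "match" : "no match"), "\n";
print "corrected: ", ($corr_hit ? "match" : "no match"), "\n";
```

The corrected pattern stops counting as soon as it reaches the "-->" after
"foo", so it never gets anywhere near 320 characters.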

> Bowie Bailey wrote:
>> And you may or may not want to match on a closing comment at the end.
>>
>> m|<!--(?:(?!-->).){32000,}-->|s
> Enh, I don't think it matters.

Maybe not, but I like to be as specific as possible in a regexp.  If it
doesn't end with a comment closer, is it really a comment?
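
As a rough illustration of the difference, here is a sketch with the
threshold scaled down to 320 and made-up test strings.  The version with
the closer should reject an opener that is never closed, while the
open-ended version accepts both:

```perl
use strict;
use warnings;

# Hypothetical test data: 400 filler characters, above the scaled-down
# 320 threshold.
my $body     = 'x' x 400;
my $closed   = '<!--' . $body . '-->';   # properly terminated comment
my $unclosed = '<!--' . $body;           # opener that is never closed

my $open_only   = qr/<!--(?:(?!-->).){320,}/s;      # closer not required
my $with_closer = qr/<!--(?:(?!-->).){320,}-->/s;   # closer required

my @results;
for my $case ([closed => $closed], [unclosed => $unclosed]) {
    my ($name, $text) = @$case;
    my $open = ($text =~ $open_only)   ? 'match' : 'no match';
    my $full = ($text =~ $with_closer) ? 'match' : 'no match';
    push @results, "${name}: open-only=$open, with-closer=$full";
}
print "$_\n" for @results;
```

Whether an unterminated opener should count as a hit is a judgment call
for the rule author; the sketch only shows how the two patterns differ.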

> However, when testing in a minimal Perl script that just tries to match 
> on the whole raw message, my original works fine;  I don't need the 
> extra non-capturing parentheses.

See my comment above.  I don't think you've tested it properly yet.

>
>> Also, because of all of the lookaheads, this may be an expensive
>> regexp.  If you try it, keep a close eye on your SA.  If it slows down
>> to a crawl, this is probably the culprit.
> None of the variants seem to be *too* nasty on the CPU though;  feeding 
> one of these monster messages through a minimal Perl script as above 
> that just runs a handful of regexes showed:
>
> real    0m0.050s
> user    0m0.045s
> sys     0m0.012s

That doesn't look too bad.  I compared the two variants on my own with a
large test string (over 32000 chars) and found that the extra
look-aheads in the working regexp took my case from 26ms to 36ms, an
increase of roughly 40%.  That is probably not enough to cause a
problem, but it is a noticeable difference.  However, this only occurs
when there is a huge comment.  If the comment is small, both versions
run the same, so you are probably ok as far as that goes.
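
A comparison along these lines can be sketched as follows.  This is not
the exact script behind the numbers above; the test string, the loop count
of 100, and the use of Time::HiRes are all assumptions for illustration,
and the timings will vary by machine and Perl version:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical test data: one comment with a 32400-character body,
# just over the 32000 threshold.
my $big = '<!--' . ('lorem ipsum ' x 2700) . '-->';

my %regex = (
    original  => qr/<!--(?!-->).{32000,}/s,
    corrected => qr/<!--(?:(?!-->).){32000,}/s,
);

my %matched;
for my $name (sort keys %regex) {
    my $start = time;
    # Run each pattern 100 times so the elapsed time is measurable.
    $matched{$name} = ($big =~ $regex{$name}) ? 1 : 0 for 1 .. 100;
    printf "%-9s matched=%d  %.3f ms per run\n",
        $name, $matched{$name}, (time - $start) * 1000 / 100;
}
```

Both patterns should match this string, so any difference in the reported
times comes from the per-character look-ahead.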

-- 
Bowie
