On 4/2/2012 6:03 PM, Kris Deugau wrote:
>> On 4/2/2012 12:58 PM, Stephane Chazelas wrote:
>>> Don't know about the spamassassin issue, but that regexp
>>> matches <!-- followed by a sequence of 32000 or more characters
>>> provided that sequence doesn't start with "-->".
>>>
>>> ITYM
>>>
>>> m|<!--(?:(?!-->).){32000,}|s
>>>
>>> That is you need to look ahead at each character of the sequence
>>> to look for the closing comment tag, otherwise you'll match on
>>> <!-- short comment --> <31982 or more characters>
> Actually, no, it works as intended.
>
> If you uncomment the string fragments below, the 320-character versions
> both match but the 32000-character ones don't. As-is, neither matches.
>
> my $shorty = "<!-- foo bar baxkja safdjwelkj werf kjwlekrjwlekr jlwkerk jawelkj awlekj lakewjflakwjef lakj ".
> #"awelkj alkfj awlekfj lawie fjalwief jlawijfe lawiejflfiwj elifj lawiej4lti j34wlit j43wli jliajs lij flisaj ".
> #"flsaidfj liasjdf lisdj lijsa fldi fa;slkjf;lask j;lkaj fs; jfsdjf sak hflkshf lksj fhlksaj fhlska fhlkajs ".
> "fhlkajshlkjashflkjasdfhlkjsahdflkjas hlkfh lwelif hwli3u fhliuwae fhliuawfheliuhfliu fhwei ufhsd fg/sd ".
> "/dsf/g/sdafg /sdf/ gdf/sg/sdf/gds/g/sd/th/ser/h/ser/ghs/rg/srg/ser/gs/erg/ser/g/ser/g/ser/g/ser/g/serg -->";
>
> my @regex = ('<!--(?:(?!-->).){32000,}', '<!--(?:(?!-->).){320,}',
>              '<!--(?!-->).{32000,}', '<!--(?!-->).{320,}');
>
> foreach (@regex) {
>     print "$_ shorty ok\n" if $shorty =~ m/$_/s;
> }
>
> (And yes, this is almost exactly what I'm seeing in these monster
> comments, although they're usually at least mostly real words, and they
> are in the ~100K+-characters length range.)

Try using a string that's longer than 320 characters that starts with a
short comment, i.e.:

'<!-- comment --> blah blah blah blah.....'

This is where your original version will fail. Your original regex
translates as "a string starting with a comment opener followed by at
least 32000 characters that do not start with a comment closer". The
lookahead is checked only once, immediately after the opener, so a long
string that starts with a short comment will match your original regexp.

I confirmed this by running your code above and moving the comment
closer from the end to just after the first "foo". Output:

<!--(?!-->).{320,} shorty ok

> Bowie Bailey wrote:
>> And you may or may not want to match on a closing comment at the end.
>>
>> m|<!--(?:(?!-->).){32000,}-->|s
> Enh, I don't think it matters.

Maybe not, but I like to be as specific as possible in a regexp. If it
doesn't end with a comment closer, is it really a comment?

> However, when testing in a minimal Perl script that just tries to match
> on the whole raw message, my original works fine; I don't need the
> extra non-capturing parentheses.

See my comment above. I don't think you've tested it properly yet.

>
>> Also, because of all of the lookaheads, this may be an expensive
>> regexp. If you try it, keep a close eye on your SA. If it slows down
>> to a crawl, this is probably the culprit.
> None of the variants seem to be *too* nasty on the CPU though; feeding
> one of these monster messages through a minimal Perl script as above
> that just runs a handful of regexes showed:
>
> real 0m0.050s
> user 0m0.045s
> sys 0m0.012s

That doesn't look too bad. I compared the two variants on my own with a
large test string (over 32000 chars) and found that the extra
look-aheads in the working regexp took my case from 26ms to 36ms.
Probably not enough to cause a problem, but definitely significant.
However, this only occurs when there is a huge comment. If the comment
is small, both versions run the same, so you are probably ok as far as
that goes.

--
Bowie
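
For anyone who wants to reproduce the difference, here is a minimal sketch of the kind of test described above. The test strings, the names in %regex, and the matches() helper are made up for illustration and are not taken from the real messages. It uses the 320-character threshold rather than 32000 only to keep the strings small.

```perl
use strict;
use warnings;

# Hypothetical test data (not from the real spam):
# a short, properly closed comment followed by 500 characters of text...
my $short_then_text = '<!-- comment -->' . ('blah ' x 100);
# ...and one genuinely oversized comment of about 500 characters.
my $long_comment    = '<!-- ' . ('blah ' x 100) . '-->';

my %regex = (
    # Lookahead checked once, right after the opener.
    single_lookahead => qr/<!--(?!-->).{320,}/s,
    # Lookahead checked before every character consumed.
    per_char         => qr/<!--(?:(?!-->).){320,}/s,
    # Same, but also requires the comment closer at the end.
    per_char_closed  => qr/<!--(?:(?!-->).){320,}-->/s,
);

# Returns 1 if the named regex matches the text, 0 otherwise.
sub matches {
    my ($name, $text) = @_;
    return $text =~ $regex{$name} ? 1 : 0;
}

for my $name (sort keys %regex) {
    printf "%-17s short+text=%d long_comment=%d\n",
        $name,
        matches($name, $short_then_text),
        matches($name, $long_comment);
}
```

With these strings, only single_lookahead should report a match on the short comment followed by text, while all three should match the oversized comment. That is the false positive being discussed: the single lookahead only rules out a closer immediately after the opener.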