Keith Ivey wrote:

Jesse Houwing wrote:

BODY TABLEOBFU m{<td([^>]+|"[^"]+)>(<([^>]+|"[^"]+)>)*[a-z]{1,2}(<([^>]+|"[^"]+)>)*</td([^>]+|"[^"]+)>}i



I think you may want a * after the ) inside the <>. As it is, you're looking for either a bunch of characters that are not > or a quote followed by a bunch of characters that are not quote. In fact, I think what was really intended was something more like this (note that this also requires an ending quote on contained quoted strings and allows ""):


m{<td([^>"]+|"[^"]*")*>(<([^>"]+|"[^"]*")*>)*[a-z]{1,2}(<([^>"]+|"[^"]*")*>)*</td([^>"]+|"[^"]*")*>}i


The other problem with the pattern as written (with no *) is that the subpatterns don't match plain <td> or </td>, since they require at least one character between the td and the >.
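
To make the difference concrete, here is a minimal sketch that runs the rule as first posted and the revised pattern against a few sample cells. It uses Python's `re` module as a stand-in for the Perl regex engine (these particular constructs should behave the same in both); the sample strings and the `compare` helper are made up for illustration and are not part of either rule.

```python
import re

# Rule as first posted: every tag needs at least one character after the
# tag name, and a quoted string does not need a closing quote.
ORIGINAL = re.compile(
    r'<td([^>]+|"[^"]+)>(<([^>]+|"[^"]+)>)*[a-z]{1,2}'
    r'(<([^>]+|"[^"]+)>)*</td([^>]+|"[^"]+)>',
    re.IGNORECASE,
)

# Revised pattern: the alternation is repeated with *, quoted strings must
# be closed, and "" is allowed.
REVISED = re.compile(
    r'<td([^>"]+|"[^"]*")*>(<([^>"]+|"[^"]*")*>)*[a-z]{1,2}'
    r'(<([^>"]+|"[^"]*")*>)*</td([^>"]+|"[^"]*")*>',
    re.IGNORECASE,
)

# Hypothetical sample cells, chosen only to exercise the two patterns.
SAMPLES = [
    '<td>a</td>',                     # plain tags, nothing after "td"
    '<td class="x">a</td>',           # attribute on the opening tag only
    '<td title="a>b"><b>a</b></td>',  # a ">" inside a quoted attribute
    '<td width=1><b>a</b></td >',     # both tags carry extra characters
    '<td>hello</td>',                 # more than two letters in the cell
]

def compare(samples=SAMPLES):
    """Return (sample, original_hit, revised_hit) for each sample."""
    return [
        (s, bool(ORIGINAL.search(s)), bool(REVISED.search(s)))
        for s in samples
    ]

for sample, original_hit, revised_hit in compare():
    print(f"{sample!r}: original={original_hit} revised={revised_hit}")
```

The first three samples are the cases described above: the original misses plain `<td>` and `</td>` and stops at a `>` inside quotes, while the revised pattern handles all three.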


It was late ;)

I'm currently running tests on a couple of alternatives:

rawbody tblobfu_opttag /<td(?:[^>'"]|"[^"]*"|'[^']*')*>(?:<(?!\/?td)(?:[^>'"]|"[^"]*"|'[^']*')*>){0,5}(?![oi][ns]|an?|en|of|de|l[ae]|us|no|tm)[a-z]{1,2}(?:<(?!\/?td)(?:[^>'"]|"[^"]*"|'[^']*')*>){0,5}<\/td(?:[^>'"]|"[^"]*"|'[^']*')*>/i
rawbody tblobfu_tag /<td(?:[^>'"]|"[^"]*"|'[^']*')*>(?:<(?!\/?td)(?:[^>'"]|"[^"]*"|'[^']*')*>){1,5}(?![oi][ns]|an?|en|of|de|l[ae]|us|no|tm)[a-z]{1,2}(?:<(?!\/?td)(?:[^>'"]|"[^"]*"|'[^']*')*>){1,5}<\/td(?:[^>'"]|"[^"]*"|'[^']*')*>/i
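
The two alternatives differ only in how many inner tags they require around the letters: `{0,5}` in tblobfu_opttag versus `{1,5}` in tblobfu_tag. As a rough sketch of how that plays out, the snippet below rebuilds both patterns from shared pieces and tries them on a few made-up cells. It uses Python's `re` as a stand-in for Perl, writes `/` where the rules have `\/`, and the names `ATTRS`, `INNER_TAG`, `SHORT_WORDS` and `build_rule` are invented here for readability.

```python
import re

# Tag contents: unquoted characters, or complete "..." / '...' strings.
ATTRS = r"""(?:[^>'"]|"[^"]*"|'[^']*')*"""
# Any tag except <td ...> or </td ...>.
INNER_TAG = r"<(?!/?td)" + ATTRS + ">"
# Negative lookahead from the rules, apparently meant to skip common short words.
SHORT_WORDS = r"(?![oi][ns]|an?|en|of|de|l[ae]|us|no|tm)"

def build_rule(min_tags):
    """Assemble one rule; min_tags is 0 for opttag and 1 for tag."""
    tags = "(?:" + INNER_TAG + "){%d,5}" % min_tags
    return re.compile(
        "<td" + ATTRS + ">" + tags + SHORT_WORDS + "[a-z]{1,2}"
        + tags + "</td" + ATTRS + ">",
        re.IGNORECASE,
    )

OPTTAG = build_rule(0)  # tblobfu_opttag: inner tags are optional
TAG = build_rule(1)     # tblobfu_tag: at least one inner tag on each side

def hits(cell):
    """Return (opttag_hit, tag_hit) for a single HTML fragment."""
    return bool(OPTTAG.search(cell)), bool(TAG.search(cell))

# Hypothetical sample cells.
for cell in ["<td>x</td>", "<td><b>x</b></td>", "<td><b>of</b></td>", "<td>ab</td>"]:
    print(cell, hits(cell))
```

One thing the sketch brings out: the lookahead is not anchored at the end of the word, so `an?` rejects any cell whose letters start with `a` (such as `ab`), not just "a" and "an". Whether that matters will probably show up in the ham/spam comparison.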

Please note that before making this final I will be replacing the splats (*) with some usable limits, but I want to compare the number of ham/spam hits first before making the final rules.

Jesse



