On 10/14, [email protected] wrote:
> rawbody __SPOOFED_URL m/<a\s[^>]{0,2048}\bhref=(?:3D)?.?(https?:[^>"'\# ]{8,29}[^>"'\# :\/?&=])[^>]{0,2048}>(?:[^<]{0,1024}<(?!\/a)[^>]{1,1024}>){0,99}\s{0,10}(?!\1)https?[^\w<]{1,3}[^<]{5}/i
>
> I agree it seems like we should be able to improve it. Maybe make
> exceptions for known marketing trackers, as Adam Katz mentioned it has
> problems with.

I dug some of the hits out of my own corpora. Of the 9 emails I looked at, in *all* cases where it looked like this rule could have hit, the two URLs matched at the host name level. So I think there is definite room for improvement there: just check for a matching host name and ignore all the extra gunk after it. Although I'm not certain it doesn't already try to do that; maybe I should take more time to read it.

Okay, it's starting to sink in, and it looks like it's trying to match the whole URL.

Several examples were cases where somebody with a Gmail account replied to an email of mine and Gmail converted the URL in my plain-text signature to HTML:

  throats."<br>=A0- Henry Louis Mencken (1880-1956)<br><a href=3D"http:/=
  /www.chaosreigns.com/" target=3D"_blank">http://www.ChaosReigns.com</a><br>

And I did get to see lots of gross HTML, particularly from Yahoo Groups. So maybe it would help to do some more HTML parsing (un-escaping) before this rule. I don't know how much work that would take.

But I didn't find any of the marketing trackers Adam mentioned.

-- 
"Think, or I will set you on fire."
http://www.ChaosReigns.com
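
To make the "match only the host name, after un-escaping" idea concrete, here is a rough sketch in Python rather than as a SpamAssassin rule. It assumes the raw body is quoted-printable HTML like the Gmail sample above. The names AnchorCollector, host_of and spoofed_links are made up for illustration and are not part of SpamAssassin.

```python
import quopri
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pairs = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text)))
            self._href = None


def host_of(url):
    """Lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url.strip()).hostname or "").lower()


def spoofed_links(raw_body):
    """Return anchors whose visible URL names a different host than the href."""
    # Assumption: the body is quoted-printable; undo =3D, =A0 and soft breaks.
    decoded = quopri.decodestring(raw_body.encode("latin-1", "replace"))
    collector = AnchorCollector()
    collector.feed(decoded.decode("latin-1"))
    collector.close()
    hits = []
    for href, text in collector.pairs:
        shown = re.search(r"https?://[^\s<>\"']+", text, re.I)
        if not shown:
            continue  # link text is not a URL, nothing to compare
        href_host, shown_host = host_of(href), host_of(shown.group(0))
        if href_host and shown_host and href_host != shown_host:
            hits.append((href, shown.group(0)))
    return hits
```

With this comparison the Gmail signature sample is not flagged, because both URLs resolve to the same host once the soft line break is removed and case is ignored. Whether an exact host match is too strict (for example www. versus the bare domain) is a separate question this sketch does not try to answer.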
