On 10/14, [email protected] wrote:
> rawbody  __SPOOFED_URL        
> m/<a\s[^>]{0,2048}\bhref=(?:3D)?.?(https?:[^>"'\# ]{8,29}[^>"'\# 
> :\/?&=])[^>]{0,2048}>(?:[^<]{0,1024}<(?!\/a)[^>]{1,1024}>){0,99}\s{0,10}(?!\1)https?[^\w<]{1,3}[^<]{5}/i

> I agree it seems like we should be able to improve it.  Maybe make
> exceptions for known marketing trackers, as Adam Katz mentioned it has
> problems with.  

I dug some of the hits out of my own corpora.  Of the 9 emails I looked at,
*all* the cases where it looked like this rule could have hit matched at
the host name level (the href and the visible url pointed at the same
host).  So I think there is definite room for improvement there: just check
for a matching host name and ignore all the extra gunk after it.  At first
I wasn't certain the rule doesn't already try to do that, but after reading
it more closely, it looks like it's trying to match the whole url.
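
To make the host-name-only idea concrete, here is a rough standalone sketch
in Python.  It is not a SpamAssassin rule and not how SpamAssassin evaluates
anything; the names ANCHOR_RE, TEXT_URL_RE, host and spoofed_links are made
up for illustration, and it assumes the html has already been decoded.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns for illustration only; real-world html is messier.
ANCHOR_RE = re.compile(
    r'<a\s[^>]*\bhref=["\']?(https?://[^\s"\'>]+)[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
TEXT_URL_RE = re.compile(r'https?://[^\s<>"\']+', re.IGNORECASE)


def host(url):
    """Return the lower-cased host name of a url, or '' if there is none."""
    return urlsplit(url).hostname or ""


def spoofed_links(html):
    """Return (href, visible_url) pairs whose host names differ.

    Path, query string and letter case of the host are ignored, so a
    link is only reported when the two host names really are different.
    """
    hits = []
    for href, text in ANCHOR_RE.findall(html):
        shown = TEXT_URL_RE.search(text)
        if shown and host(href) != host(shown.group(0)):
            hits.append((href, shown.group(0)))
    return hits
```

With this comparison, a link whose href is http://www.chaosreigns.com/ and
whose text is http://www.ChaosReigns.com is not reported, because only the
case and the trailing slash differ.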

Several examples were cases where somebody with a gmail account replied to
an email of mine and gmail converted the url in my plain text signature
to html:

throats.&quot;<br>=A0- Henry Louis Mencken (1880-1956)<br><a href=3D"http:/=
/www.chaosreigns.com/" target=3D"_blank">http://www.ChaosReigns.com</a><br>

And I did get to see lots of gross html.  Particularly from yahoo groups.
So maybe it would help to do some more html parsing (un-escaping) before
this rule.  I don't know how much work that would take.
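
For what the un-escaping step could look like outside of SpamAssassin, here
is a small Python sketch using only the standard library.  normalize_body
is a made-up name, and treating the bytes as Latin-1 is an assumption on my
part; the real charset would have to come from the message headers.

```python
import html
import quopri


def normalize_body(raw):
    """Undo quoted-printable encoding, then html entity escaping.

    Assumes (for this sketch only) that the text is Latin-1.
    """
    # Joins soft line breaks ("=" at end of line) and decodes =XX escapes.
    decoded = quopri.decodestring(raw.encode("latin-1")).decode("latin-1")
    # Turns entities such as &quot; back into literal characters.
    return html.unescape(decoded)
```

Run over the gmail example above, this joins the href that was split across
two lines and turns href=3D" back into href=", so a simpler pattern would
have a chance of matching it.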

But I didn't find any of the marketing trackers Adam mentioned.  

-- 
"Think, or I will set you on fire."
http://www.ChaosReigns.com
