Randy Ramsdell wrote:
Matt Kettler wrote:
Joseph Brennan wrote:

I was surprised that this rule...

 uri CU_CN_LINK      /http:..\w+\.cn\b/

matches not only this...

 <a href="http://foobar.cn">

but also this...

<a href="http://www.columbia.edu/foo.html">KooXoo Buys Kuxun.cn Domain</a>


First, I did not realize that SpamAssassin's idea of "uri" includes not
only the uri, but the start tag, end tag, and all in between.  That's
useful but not real clear in Mail::SpamAssassin::Conf.
Actually, it doesn't. Your second example has two URIs as far as SpamAssassin is concerned: "http://www.columbia.edu/foo.html" and "http://Kuxun.cn". Two separate URIs.

Since many email clients "auto-link" domains in text portions, like www.google.com, SpamAssassin tries to find text strings that clients will treat as URIs and use them in the URI tests as well.
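
To make the "two separate URIs" point concrete, here is a rough Python sketch. It is not SpamAssassin's actual parser: the UriCollector class, the collect_uris helper, and the tiny TLD list in BARE_DOMAIN are all made up for illustration. It gathers the href value and any domain-like string in the link text as separate URIs, then tries a Python translation of the CU_CN_LINK pattern against each one.

```python
import re
from html.parser import HTMLParser

# Python translation of the CU_CN_LINK pattern from the rule above.
CU_CN_LINK = re.compile(r"http:..\w+\.cn\b")

# Hypothetical bare-domain matcher with a tiny TLD list (an assumption;
# a real implementation would need a complete list of TLDs).
BARE_DOMAIN = re.compile(
    r"\b[\w-]+(?:\.[\w-]+)*\.(?:cn|com|edu|it)\b", re.IGNORECASE
)


class UriCollector(HTMLParser):
    """Collect href values and domain-like strings found in text."""

    def __init__(self):
        super().__init__()
        self.uris = []

    def handle_starttag(self, tag, attrs):
        # An explicit link target counts as one URI.
        for name, value in attrs:
            if tag == "a" and name == "href" and value:
                self.uris.append(value)

    def handle_data(self, data):
        # Text a mail client might auto-link counts as another URI.
        for match in BARE_DOMAIN.finditer(data):
            self.uris.append("http://" + match.group(0))


def collect_uris(html):
    collector = UriCollector()
    collector.feed(html)
    collector.close()
    return collector.uris


html = '<a href="http://www.columbia.edu/foo.html">KooXoo Buys Kuxun.cn Domain</a>'
uris = collect_uris(html)

# The rule is tried against each URI on its own, never against the raw HTML.
hits = [u for u in uris if CU_CN_LINK.search(u)]
print(uris)
print(hits)
```

In this sketch the rule fires on the second URI only, so \w+ never has to span any punctuation or spaces.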


How so? How does SpamAssassin's URI check decide that Kuxun.cn is a URI, as opposed to text from someone who forgot to add a space after the end of a sentence? Is it because it is located within the "a" tag?

Try putting this
   "I often forget spaces.it happens to me all the time..."
in a message and run it with -D. You'll see:

...
[74536] dbg: uridnsbl: domains to query: spaces.it
...
[74536] dbg: rules: ran uri rule __LOCAL_PP_NONPPURL ======> got hit: "http://spaces.it"
...

As you see, SA can't guess that a space is missing, so it checks the "resulting" URI anyway.
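
A minimal sketch of why "spaces.it" gets picked up, assuming a simple "word, dot, known TLD" heuristic. The KNOWN_TLDS set and the guess_uris function are invented for this example and are not SpamAssassin's real code.

```python
import re

# Hypothetical, deliberately tiny TLD set for illustration only.
KNOWN_TLDS = {"com", "cn", "it", "edu"}

# Two or more dot-separated labels with no whitespace in between.
CANDIDATE = re.compile(r"\b([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)\b")


def guess_uris(text):
    """Return http:// URIs for every dotted token that ends in a known TLD."""
    found = []
    for match in CANDIDATE.finditer(text):
        host = match.group(1)
        tld = host.rsplit(".", 1)[1].lower()
        if tld in KNOWN_TLDS:
            found.append("http://" + host)
    return found


text = "I often forget spaces.it happens to me all the time..."
guessed = guess_uris(text)
print(guessed)
```

Nothing here knows about sentence boundaries, so a missing space before a word that happens to be a TLD looks the same as a real domain, while the trailing "time..." is ignored because no label follows the dots.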


Things get "tricky" when you want to hit things like
   Did you visit http://www.example.com/foo/bar?if so...
and you are looking for specific patterns in the "bar" part...
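
One way this can bite, sketched in Python. The extracted URI below is an assumption: with no space after the "?", the "if" is taken to be part of the URI. The two patterns are only illustrations of the problem, not rules from any shipped ruleset.

```python
import re

# Assumed result of URI extraction for the text above.
uri = "http://www.example.com/foo/bar?if"

# Anchoring on end-of-string misses because of the trailing "?if".
strict = re.compile(r"/foo/bar$")

# Accept "bar" when it is followed by "?", "#", or the end of the URI.
tolerant = re.compile(r"/foo/bar(?=[?#]|$)")

strict_hit = strict.search(uri) is not None
tolerant_hit = tolerant.search(uri) is not None
clean_hit = tolerant.search("http://www.example.com/foo/bar") is not None
print(strict_hit, tolerant_hit, clean_hit)
```

The tolerant pattern still rejects longer path segments such as "/foo/barn", since the lookahead only allows a query, a fragment, or the end of the URI after "bar".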




Second, I can't figure out how \w+ matches the punctuation and spaces!
It doesn't. :)



