Joseph Brennan wrote:
I was surprised that this rule...
uri CU_CN_LINK /http:..\w+\.cn\b/
matches not only this...
<a href="http://foobar.cn">
but also this...
<a href="http://www.columbia.edu/foo.html">KooXoo Buys Kuxun.cn
Domain</a>
First, I did not realize that SpamAssassin's idea of "uri" includes not
only the uri, but the start tag, end tag, and all in between. That's
useful but not real clear in Mail::SpamAssassin::Conf.
Actually, it doesn't. Your second example has two URIs as far as
SpamAssassin is concerned: "http://www.columbia.edu/foo.html" and
"http://Kuxun.cn". Two separate URIs.
Since many email clients "auto-link" domains in text portions, like
www.google.com, SpamAssassin tries to find text strings that clients
will treat as URIs and use them in the URI tests as well.
Second, I can't figure out how \w+ matches the punctuation and spaces!
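It doesn't. :) The rule is hitting the second URI, "http://Kuxun.cn", all
by itself: ".." covers the "//" and \w+ only has to cover "Kuxun".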
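
If you want to convince yourself, here is a rough sketch that runs the rule's
pattern against each string separately. It uses Python's re module rather than
Perl, on the assumption that this particular pattern behaves the same in both.
The two-URI list is hard-coded from the explanation above, not produced by
SpamAssassin, and the names RULE, rule_hits and whole_anchor are made up for
the illustration.

```python
import re

# The pattern from the CU_CN_LINK rule, copied as written in the post.
RULE = re.compile(r"http:..\w+\.cn\b")

# The two URIs SpamAssassin is said to see in the second example
# (hard-coded here; this script does not call SpamAssassin).
uris = [
    "http://www.columbia.edu/foo.html",
    "http://Kuxun.cn",
]

# The whole anchor as one string, i.e. what the rule would have to match
# if "uri" really meant start tag, end tag and everything in between.
# The line break from the original message is replaced by a space.
whole_anchor = (
    '<a href="http://www.columbia.edu/foo.html">'
    "KooXoo Buys Kuxun.cn Domain</a>"
)


def rule_hits(text):
    """Return True if the rule pattern matches anywhere in text."""
    return RULE.search(text) is not None


for uri in uris:
    print(uri, "->", rule_hits(uri))
print("whole anchor ->", rule_hits(whole_anchor))
```

Only "http://Kuxun.cn" matches. The whole anchor taken as a single string does
not, which is consistent with \w+ never crossing the punctuation and spaces.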