On Fri, 15 Sep 2017, Robert Boyl wrote:

uri __KAM_SHORT /(\/|^|\b)(?:j\.mp|bit\.ly|goo\.gl|x\.co|t\.co|t\.cn|tinyurl\.com|hop\.kz|urla\.ru|fw\.to)(\/|$|\b)/i

Seems a bit complicated.

The idea would be to make this rule check that the suffixes are at the
end of the URI.

uri __TEST_URLS /\b(\.vn|\.pl|\.my|\.lu|\.vn|\.ar)\b/i

I believe this does it, correct?

uri __TEST_URLS /\b(\.vn$|\.pl$|\.my$|\.lu$|\.vn$|\.ar$)\b/i

As Paul said, if you're just looking at URIs, enlist_uri_host might be
the better way to go.  It also has the advantage that you don't have
to use (some might say abuse) regular expressions.

I believe URIs as collected for the uri tests consist of more than
just the server part of the URI, but maybe I'm wrong (or maybe the
list includes the server part only as well as the full URI).  If I'm
correct, then using the "$" will not work where URIs have a local part
and might not work where there's only a trailing "/".
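
To make the "$" concern concrete, here is a quick sketch in Python
rather than in SpamAssassin itself.  As far as I know, "$" and "\b"
behave the same way in Python's re module as in Perl for a pattern
like this.  The sample URIs are made up for illustration, and I'm
assuming the uri test sees strings shaped like them.

```python
import re

# The "$"-anchored pattern from the question ("vn" listed once, since the
# duplicate alternative changes nothing).
anchored = re.compile(r"\b(\.vn$|\.pl$|\.my$|\.lu$|\.ar$)\b", re.I)

# Hypothetical sample URIs, not taken from real mail.
samples = [
    "http://example.vn",             # bare host, nothing after the TLD
    "http://example.vn/",            # trailing slash only
    "http://example.vn/index.html",  # host followed by a path
]

# "$" only matches at the very end of the string, so anything after the
# TLD (even a single "/") prevents the match.
results = {uri: bool(anchored.search(uri)) for uri in samples}
for uri, hit in results.items():
    print(f"{uri!r}: {'match' if hit else 'no match'}")
```

Only the first sample matches; the trailing "/" and the path both
defeat the "$" anchor.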

In the case where you're only looking at the TLD, you don't have to
worry about the front word boundary because you're explicitly
anchoring the front of the match with the "\." part.  At the end, you
need to make sure that you're not allowing characters that would
indicate the server part of the URI continues past your intended match
(to avoid things like matching "blah.com" when you're really trying to
match ".co").  In my estimation, the characters that might indicate
continuation of the URI are letters, numbers, underscores, hyphens,
and the literal ".".

So, my rule for just matching TLDs looks like:

uri __TEST_URLS  /\.(vn|pl|my|lu|vn|ar)\b[^\.-]/i

The "\b" part excludes the letters, numbers and underscore because
those wouldn't be a word boundary.  The "[^\.-]" part excludes the
hyphen and literal "." from being on the right side of that word
boundary.
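
Here is a small Python sketch of how that pattern behaves, again
assuming Python's re module treats "\b" and the character class the
same way Perl does.  The sample URIs are invented for illustration.
One thing it shows is that "[^\.-]" has to consume a character, so a
URI that ends right at the TLD does not match.  Whether SpamAssassin
ever hands such a bare form to uri rules is not something I've
checked.

```python
import re

# Same alternatives as the rule above ("vn" listed once).
tld_rule = re.compile(r"\.(vn|pl|my|lu|ar)\b[^\.-]", re.I)

# Hypothetical sample URIs, not taken from real mail.
samples = [
    "http://example.vn/",          # "/" follows the TLD
    "http://example.vn:8080/",     # ":" follows the TLD
    "http://example.vnx.com/",     # letter follows, so "\b" fails
    "http://example.vn.com/",      # "." follows, excluded by the class
    "http://example.vn-shop.com/", # "-" follows, excluded by the class
    "http://example.vn",           # nothing follows, class cannot match
]

hits = {uri: bool(tld_rule.search(uri)) for uri in samples}
for uri, hit in hits.items():
    print(f"{uri!r}: {'match' if hit else 'no match'}")
```

Only the first two samples match.  If the bare form matters, a
negative lookahead such as "(?![.-])" in place of "[^\.-]" would be one
way to allow it, though that is my suggestion and not something I've
tested in a live rule.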

And now that I'm looking at it, I'm wondering if it would match a
URI like "https://legit.domain.com/great.beer/" ("beer" being one of
the TLDs my rule contains).  Like I said, the enlist_uri_host method
might be worth it just to avoid regular expressions.
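
A quick Python check suggests the answer is yes, at least for the
regular expression on its own.  The sketch below is not how
SpamAssassin does it; the TLD set and the host_has_listed_tld helper
are made up for illustration.  It just contrasts running the pattern
over the whole URI with looking only at the host part.

```python
import re
from urllib.parse import urlsplit

# Hypothetical TLD list; "beer" is included because the example uses it.
TLDS = {"vn", "pl", "my", "lu", "ar", "beer"}

# Same shape as the TLD rule, built from the list above.
tld_rule = re.compile(r"\.(?:" + "|".join(sorted(TLDS)) + r")\b[^\.-]", re.I)


def host_has_listed_tld(uri: str) -> bool:
    """Compare only the last label of the host against the TLD list."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower() in TLDS


uri = "https://legit.domain.com/great.beer/"
regex_hit = bool(tld_rule.search(uri))  # matches ".beer/" in the path
host_hit = host_has_listed_tld(uri)     # host ends in "com", so no hit

print("regex over whole URI:", regex_hit)
print("host-only check:     ", host_hit)
```

The regular expression fires on the path, while the host-only check
does not, which is the main argument for a host-based approach.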

--
Public key #7BBC68D9 at            |                 Shane Williams
http://pgp.mit.edu/                |      System Admin - UT CompSci
=----------------------------------+-------------------------------
All syllogisms contain three lines |              sha...@shanew.net
Therefore this is not a syllogism  | www.ischool.utexas.edu/~shanew
