Check out the 'urlparse' module, in the standard library, unless for some reason you *have* to use regular expressions.


/arg

On Dec 6, 2004, at 7:46 PM, Vivek wrote:

Hi,

I am trying to construct a regular expression using the re module that
matches for
1. my hostname
2. absolute from the root URLs including just "/"
3. relative URLs.

Basically I want the attern to not match for URLs that are not on my
host.

The following statement satisfies numbers 1 and 2, but not 3:

line =
re.sub(r'(href=")(http[s]?://'+hostname+'[/]? |/)([^"]*?)(")',r'\1\2\3'+sInfo+r'\4',line)


An improvement that also partially satisfies number 3 is

line =
re.sub(r'(href=")(http[s]?://'+hostname+'[/]?|/ |[^h][^t][^t][^p][^:][^/][^/])([^"]*?)(")',r'\1\2\3'+sInfo+r'\4',line)


This is not complete because if the relative url is less than seven
characters, than it will not match.

Any suggestions?

Thanx.

--
http://mail.python.org/mailman/listinfo/python-list

-- http://mail.python.org/mailman/listinfo/python-list

Reply via email to