More regex help

Support Desk Wed, 24 Sep 2008 09:27:26 -0700

I am working on a Python web crawler that extracts all links from an
HTML page and adds them to a queue. The problem I am having is building
absolute links from relative links, as there are so many different types of
relative links. If I just append each relative link to the current URL, some
websites send the crawler into a never-ending loop.
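
For illustration, here is a minimal sketch of resolving relative links against the page they came from, using the standard library instead of string appending. It assumes Python 3, where urljoin and urldefrag live in urllib.parse (in Python 2 the same functions are in the urlparse module). The helper name absolutize and the sample hrefs are hypothetical, not part of the crawler described above.

```python
from urllib.parse import urljoin, urldefrag


def absolutize(base_url, hrefs):
    """Resolve each href against base_url, dropping fragments and duplicates.

    Hypothetical helper for illustration only.
    """
    seen = set()
    result = []
    for href in hrefs:
        # urljoin handles '../x', '/x', 'x' and already-absolute URLs.
        # urldefrag strips '#fragment' so the same page is not queued twice.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


base = 'http://example.com/stuff/stuff/morestuff/index.html'
links = absolutize(base, [
    '../other.html',
    '/top.html',
    'page.html#intro',
    'page.html',
    'http://anotherexample.com/stuff/index.php',
])
```

Keeping a similar set of already-visited absolute URLs across the whole crawl is one common way to avoid the never-ending loop, since a page that links back to itself through a different relative path resolves to the same absolute URL.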


What I am looking for is a regexp that will extract the root URL from any
URL string I pass to it, such as

'http://example.com/stuff/stuff/morestuff/index.html'

Regexp = 'http://example.com'

'http://anotherexample.com/stuff/index.php'

Regexp = 'http://anotherexample.com'

'http://example.com/stuff/stuff/'

Regexp = 'http://example.com'
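
One possible sketch of such a regexp follows, together with a non-regex alternative based on urlsplit. It assumes Python 3 (urllib.parse; the Python 2 equivalent is the urlparse module), assumes only http and https URLs matter, and returns the root without a trailing slash. The names ROOT_RE, root_url_regex and root_url_split are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Scheme, '://', then everything up to the first '/', '?' or '#'.
ROOT_RE = re.compile(r'^(https?://[^/?#]+)', re.IGNORECASE)


def root_url_regex(url):
    """Return the scheme and host part of url, or None if it does not match."""
    match = ROOT_RE.match(url)
    return match.group(1) if match else None


def root_url_split(url):
    """Same result without a regexp, using the standard URL parser."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return None  # relative link: no root to extract
    return f'{parts.scheme}://{parts.netloc}'
```

Both functions return None for a relative link such as 'index.html', which has no root of its own and has to be resolved against the page URL first.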





--
http://mail.python.org/mailman/listinfo/python-list
