Hi, I used SGMLParser to parse all the hrefs in an HTML file. Now I need to cut some strings. For example:

    http://www.example.com/dir/example.html

I would like to cut the string so that only the domain and directory are left. Expected result:

    http://www.example.com/dir/

I know how to do this in bash, but not in Python. How can this be done?

The next problem is to extract not only hrefs but also images. An href is easy:

    <a href="install.php">Install</a>

But an image is a little harder:

    <img class="bild" src="images/marine.jpg">

This is my current example code:

    from sgmllib import SGMLParser

    leach_url = "http://stargus.sourceforge.net/"

    class URLLister(SGMLParser):
        def reset(self):
            # Called by SGMLParser on construction; start with an empty list.
            SGMLParser.reset(self)
            self.urls = []

        def start_a(self, attrs):
            # attrs is a list of (name, value) pairs for the <a> tag.
            href = [v for k, v in attrs if k == 'href']
            if href:
                self.urls.extend(href)

    if __name__ == "__main__":
        import urllib
        usock = urllib.urlopen(leach_url)
        parser = URLLister()
        parser.feed(usock.read())
        parser.close()
        usock.close()
        for url in parser.urls:
            print url

Do you have any tips on how to solve these problems?

regards
Andreas
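
One possible approach to both questions, as a rough sketch rather than a definitive answer: `urljoin` from the standard library can resolve the relative reference "." against a URL, which leaves only the domain and directory, and a parser can collect the `src` attribute of `<img>` tags the same way it collects `href` from `<a>` tags. The names `base_dir` and `LinkAndImageLister` are made up for this illustration. The sketch also swaps in `HTMLParser` for `SGMLParser`, since `sgmllib` only exists in Python 2. In the original `URLLister`, the analogous change would presumably be a `start_img` method that collects `src`.

```python
try:                                   # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urljoin
except ImportError:                    # Python 2
    from HTMLParser import HTMLParser
    from urlparse import urljoin


def base_dir(url):
    """Return the URL with its last path component (the file name) removed."""
    # Resolving the relative reference "." against a URL yields its directory.
    return urljoin(url, ".")


class LinkAndImageLister(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        target = self.urls if tag == "a" else self.images
        target.extend(v for k, v in attrs if k == wanted and v)


if __name__ == "__main__":
    print(base_dir("http://www.example.com/dir/example.html"))
    parser = LinkAndImageLister()
    parser.feed('<a href="install.php">Install</a>'
                '<img class="bild" src="images/marine.jpg">')
    parser.close()
    print(parser.urls)
    print(parser.images)
```

The collected values are still relative (for example `images/marine.jpg`). Passing each one through `urljoin(leach_url, value)` should turn it into an absolute URL before `base_dir` is applied.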