Andreas Volz wrote:

Hi,

I used SGMLParser to parse all hrefs in an HTML file. Now I need to cut
some strings. For example:

http://www.example.com/dir/example.html

Now I would like to cut the string so that only the domain and directory
are left over. Expected result:


http://www.example.com/dir/

I know how to do this in bash, but not in Python. How could
this be done?

The next problem is to extract not only hrefs, but also images. An href
is easy:

<a href="install.php">Install</a>

But an image is a little harder:

<img class="bild" src="images/marine.jpg">

This is my current example code:

from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
        def reset(self):
                SGMLParser.reset(self)
                self.urls = []

        def start_a(self, attrs):
                href = [v for k, v in attrs if k=='href']
                if href:
                        self.urls.extend(href)

if __name__ == "__main__":
        import urllib
        usock = urllib.urlopen(leach_url)
        parser = URLLister()
        parser.feed(usock.read())
        parser.close()
        usock.close()
        for url in parser.urls:
                print url



Perhaps you have some tips on how to solve these problems?

from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
        
        def reset(self):
                SGMLParser.reset(self)
                self.urls = []
                self.images = []

        def start_a(self, attrs):
                href = [v for k, v in attrs if k=='href']
                if href:
                        self.urls.extend(href)

        def do_img(self, attrs):
                "We assume each image *has* a src attribute."
                for k, v in attrs:
                        if k == 'src':
                                self.images.append(v)
                                break
                
                
if __name__ == "__main__":
        import urllib
        usock = urllib.urlopen(leach_url)
        parser = URLLister()
        parser.feed(usock.read())
        parser.close()
        usock.close()
        print "URLs:"
        for url in parser.urls:
                print url
        print "IMGs:"
        for img in parser.images:
                print img
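
The sgmllib module used above exists only in Python 2; it was removed in
Python 3. For readers on Python 3, the same idea can be sketched with the
standard html.parser module. This is only an illustration, not the code
from the original post: the list_links helper and the sample markup in the
main block are assumptions made for the example, and it parses a string
instead of fetching the page.

```python
# Python 3 sketch: sgmllib is gone, so this uses html.parser instead.
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when an
        # attribute is written without a value.
        for name, value in attrs:
            if value is None:
                continue
            if tag == "a" and name == "href":
                self.urls.append(value)
            elif tag == "img" and name == "src":
                self.images.append(value)


def list_links(html_text):
    """Hypothetical helper: return (urls, images) found in an HTML string."""
    parser = URLLister()
    parser.feed(html_text)
    parser.close()
    return parser.urls, parser.images


if __name__ == "__main__":
    # Sample markup taken from the question; no network access is needed.
    sample = (
        '<a href="install.php">Install</a>'
        '<img class="bild" src="images/marine.jpg">'
    )
    urls, images = list_links(sample)
    print("URLs:", urls)
    print("IMGs:", images)
```

Unlike sgmllib, html.parser has no per-tag start_a or do_img hooks, so a
single handle_starttag method checks the tag name itself. The run shown
next is from the sgmllib script, not from this sketch.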

$ python sgml1.py
URLs:
about.php
install.php
features.php
http://www.stratagus.org/
http://www.stratagus.org/
http://www.blizzard.com/
http://sourceforge.net/projects/stargus/
IMGs:
images/stargus_banner.jpg
images/marine.jpg
http://sourceforge.net/sflogo.php?group_id=119561&amp;type=1
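
The first question, cutting a URL down to its domain and directory, can be
handled with the standard URL-splitting functions. The sketch below is one
possible approach, written for Python 3, where these functions live in
urllib.parse (in Python 2 they were in the urlparse module). The
directory_url name is made up for this example. The same module's urljoin
can turn the relative links listed above into absolute ones.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def directory_url(url):
    """Hypothetical helper: drop the last path segment, query and fragment."""
    parts = urlsplit(url)
    # Keep everything up to and including the last "/" of the path.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


if __name__ == "__main__":
    base = directory_url("http://www.example.com/dir/example.html")
    print(base)
    # Resolve a relative link against the directory it was found in.
    print(urljoin(base, "images/marine.jpg"))
```

With the URL from the question, directory_url returns
http://www.example.com/dir/, and urljoin leaves links that are already
absolute, such as http://www.stratagus.org/, unchanged.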

regards
 Steve
--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119
--
http://mail.python.org/mailman/listinfo/python-list
