Andreas Volz wrote:
Hi,
I used SGMLParser to parse all hrefs in an HTML file. Now I need to cut
some strings. For example:
http://www.example.com/dir/example.html
Now I'd like to cut the string so that only the domain and directory are
left over. Expected result:
http://www.example.com/dir/
I know how to do this in bash, but not in Python. How could
this be done?
The next problem is to extract not only hrefs, but also images. An href
is easy:
<a href="install.php">Install</a>
But an image is a little harder:
<img class="bild" src="images/marine.jpg">
This is my current example code:
from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url
Perhaps you have some tips on how to solve these problems?
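For the first question, one possible approach is to split the URL into its
parts and keep the path only up to its last slash. This is just a sketch:
strip_filename is a name made up for this example, and it assumes that any
query string or fragment can be thrown away. The import fallback is there
because the URL-splitting functions live in urlparse on Python 2 and in
urllib.parse on Python 3.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit      # Python 2


def strip_filename(url):
    """Return url reduced to scheme, host and directory (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep everything in the path before the last '/', then restore the slash.
    directory = parts.path.rpartition('/')[0] + '/'
    # Query and fragment are dropped on purpose (an assumption of this sketch).
    return urlunsplit((parts.scheme, parts.netloc, directory, '', ''))


print(strip_filename("http://www.example.com/dir/example.html"))
# prints http://www.example.com/dir/
```

A URL that already ends in a slash comes back unchanged, and one with no
path at all just gets a trailing slash added.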
from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []
        self.images = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

    def do_img(self, attrs):
        "We assume each image *has* a src attribute."
        for k, v in attrs:
            if k == 'src':
                self.images.append(v)
                break

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    print "URLs:"
    for url in parser.urls:
        print url
    print "IMGs:"
    for img in parser.images:
        print img
$ python sgml1.py
URLs:
about.php
install.php
features.php
http://www.stratagus.org/
http://www.stratagus.org/
http://www.blizzard.com/
http://sourceforge.net/projects/stargus/
IMGs:
images/stargus_banner.jpg
images/marine.jpg
http://sourceforge.net/sflogo.php?group_id=119561&type=1
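Note that sgmllib only exists in Python 2; it was removed in Python 3. If
you ever need the same thing there, a rough equivalent can be built on
html.parser.HTMLParser. The sketch below also resolves relative links such
as install.php against the page URL with urljoin, on the assumption that
absolute URLs are what you want in the end. The class name LinkImageLister
and the sample HTML string are invented for the example, and the page is
fed from a string rather than fetched over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkImageLister(HTMLParser):
    """Collect <a href> and <img src> values (hypothetical example class)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.urls.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


# Sample markup made up from the two tags quoted in the question.
sample = ('<a href="install.php">Install</a>'
          '<img class="bild" src="images/marine.jpg">')

parser = LinkImageLister("http://stargus.sourceforge.net/")
parser.feed(sample)
parser.close()

print(parser.urls)    # ['http://stargus.sourceforge.net/install.php']
print(parser.images)  # ['http://stargus.sourceforge.net/images/marine.jpg']
```

Unlike the do_img version above, this one silently skips an image that has
no src attribute instead of assuming there always is one.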
regards
Steve
--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119