I'm not really sure what you want to do. Do you want to get the URLs from
the a tags of an HTML file? I think the easiest method would be a regular
expression:

>>> import urllib, sre
>>> html = urllib.urlopen("http://www.google.com").read()
>>> sre.findall('href="([^>]+)"', html)
>>> sre.findall('href=[^>]+>([^<]+)</a>', html)
['Bilder', 'Groups', 'Verzeichnis', 'News', 'Froogle',
'Mehr&nbsp;&raquo;', 'Erweiterte Suche', 'Einstellungen',
'Sprachtools', 'Werbung', 'Unternehmensangebote',
'Alles \xfcber Google', 'Google.com in English']

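Note that Google's HTML is a little odd: some href attributes have no
quotation marks, so the first pattern above (which expects href="...")
will miss them. For example:
<a href=http://www.google.com/ncr>Google.com in English</a>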
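
If the unquoted links matter, one option might be a pattern that accepts double-quoted, single-quoted and bare href values. The following is only a rough sketch in current Python 3 using the re module; SAMPLE_HTML, HREF_RE and extract_hrefs are names I made up for illustration, and the example.com links are invented test data. A regular expression like this is still fragile on real-world pages, so treat it as a starting point rather than a robust parser.

```python
import re

# Hypothetical sample markup; the last link mimics Google's unquoted href.
SAMPLE_HTML = (
    '<a href="http://www.example.com/a">First</a> '
    "<a href='http://www.example.com/b'>Second</a> "
    '<a href=http://www.google.com/ncr>Google.com in English</a>'
)

# Match href="...", href='...' or a bare href=... that runs up to
# the next whitespace character or '>'.
HREF_RE = re.compile(
    r'''href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))''',
    re.IGNORECASE,
)


def extract_hrefs(html):
    """Return the href values found in html, quoted or not."""
    # findall yields one 3-tuple per match; at most one group is non-empty.
    return [dq or sq or bare for dq, sq, bare in HREF_RE.findall(html)]


print(extract_hrefs(SAMPLE_HTML))
```

To run it against a live page you would fetch and decode the HTML first (in Python 3, urllib.request.urlopen takes the place of urllib.urlopen) and pass the resulting string to extract_hrefs.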

