Re: Parsing an HTML a tag

beza1e1 Sat, 24 Sep 2005 11:06:06 -0700

I'm not really sure what you want to do. Get the URLs from the a
tags of an HTML file? I think the easiest method would be a regular
expression.


>>> import urllib, sre
>>> html = urllib.urlopen("http://www.google.com").read()
>>> sre.findall('href="([^"]+)"', html)
['/imghp?hl=de&tab=wi&ie=UTF-8',
'http://groups.google.de/grphp?hl=de&tab=wg&ie=UTF-8',
'/dirhp?hl=de&tab=wd&ie=UTF-8',
'http://news.google.de/nwshp?hl=de&tab=wn&ie=UTF-8',
'http://froogle.google.de/frghp?hl=de&tab=wf&ie=UTF-8',
'/intl/de/options/']
>>> sre.findall('href=[^>]+>([^<]+)</a>', html)
['Bilder', 'Groups', 'Verzeichnis', 'News', 'Froogle',
'Mehr&nbsp;&raquo;', 'Erweiterte Suche', 'Einstellungen',
'Sprachtools', 'Werbung', 'Unternehmensangebote',
'Alles \xfcber Google', 'Google.com in English']

Google has some strange HTML: some href attributes have no quotation
marks, e.g. <a href=http://www.google.com/ncr>Google.com in English</a>.
A pattern that requires href="..." will not find those links.
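
If the unquoted attributes are a problem, an HTML parser may be more
robust than a regular expression. Below is a minimal sketch, assuming
Python 3 and its standard html.parser module (not what the session
above uses). The names LinkCollector and extract_links are made up for
this example, and the sample string is just the two kinds of link
mentioned above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; quoted and
            # unquoted attribute values are both delivered here.
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Sample input: one quoted and one unquoted href.
sample = ('<a href="/intl/de/options/">Mehr</a> '
          '<a href=http://www.google.com/ncr>Google.com in English</a>')
print(extract_links(sample))
```

This gives the URL and the link text in one pass, so the two findall
calls are not needed. Nested or unclosed a tags are not handled
specially in this sketch.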

-- 
http://mail.python.org/mailman/listinfo/python-list
