"beza1e1" <[EMAIL PROTECTED]> writes: > I do not really know, what you want to do. Getting he urls from the a > tags of a html file? I think the easiest method would be a regular > expression.

I think this ranks as #2 on the list of "difficult one-day hacks". Yeah,
it's simple to write an RE that works most of the time. It's a major PITA
to write one that works in all the legal cases. Getting one that also
handles all the cases seen in the wild is damn near impossible.

>>>>import urllib, sre
>>>>html = urllib.urlopen("http://www.google.com").read()
>>>>sre.findall('href="([^>]+)"', html)

This fails in a number of cases. Whitespace around the "=" sign for
attributes. Quotes around other attributes in the tag (required by XHTML):
the greedy [^>]+ runs on to the last quote before the '>', so it swallows
them. '>' in the URL (legal, but disrecommended). Attributes quoted with
single quotes instead of double quotes, or just unquoted. It misses IMG SRC
attributes. It hands back relative URLs as such, instead of resolving them
to the absolute URL (which requires checking for the base URL in the HEAD),
which may or may not be acceptable.

> Google has some strange html, href without quotation marks: <a
> href=http://www.google.com/ncr>Google.com in English</a>

That's not strange. That's just a bit unusual. Perfectly legal, though -
any browser (or other html processor) that fails to handle it is broken.

        <mike
--
Mike Meyer <[EMAIL PROTECTED]>  http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
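
For what it's worth, several of the failure cases listed above (whitespace around "=", single-quoted or unquoted values, IMG SRC, relative URLs and a base URL in the HEAD) go away if a real HTML parser does the tokenizing instead of an RE. The following is only a minimal sketch, not something from the original thread: it assumes a current Python 3 with the standard html.parser and urllib.parse modules, and the names LinkCollector and extract_links are made up for illustration. It only looks at A HREF and IMG SRC, and makes no claim about every kind of markup seen in the wild.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect A HREF and IMG SRC values as absolute URLs."""

    # (tag, attribute) pairs treated as links in this sketch; extend as needed.
    LINK_ATTRS = {("a", "href"), ("img", "src")}

    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url   # replaced if the document has <base href>
        self.base_seen = False     # only the first <base href> is honoured
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips the quotes
        # from values, so quoting style does not matter here.
        for name, value in attrs:
            if value is None:      # attribute given without a value
                continue
            if tag == "base" and name == "href" and not self.base_seen:
                self.base_url = urljoin(self.base_url, value)
                self.base_seen = True
            elif (tag, name) in self.LINK_ATTRS:
                # Resolve relative URLs against the current base URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, page_url):
    """Return the absolute URLs found in html, which was fetched from page_url."""
    parser = LinkCollector(page_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Typical use would be extract_links(text, url), where text is the decoded page source and url is the address it was fetched from, so that relative links have something to resolve against.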