On Mar 29, 1:50 am, John Nagle <[EMAIL PROTECTED]> wrote:
> Here's a construct with which BeautifulSoup has problems. It's
> from "http://support.microsoft.com/contactussupport/?ws=support".
>
> This is the original:
>
> <a href="http://www.microsoft.com/usability/enroll.mspx"
> id="L_75998"
> title="<!--http://www.microsoft.com/usability/information.mspx->"
> onclick="return MS_HandleClick(this,'C_32179', true);">
> Help us improve our products
> </a>
>
<snip>
>
> Strictly speaking, it's Microsoft's fault.
>
> title="<!--http://www.microsoft.com/usability/information.mspx->"
>
> is supposed to be an HTML comment. But it's improperly terminated.
> It should end with "-->". So all that following stuff is from what
> follows the next "-->" which terminates a comment.
>

No, that comment is inside a quoted string, so it should be ok.

If you are just trying to extract <a href=...> tags, this pyparsing
scraper gets them, including this problematic one:

    import urllib
    from pyparsing import makeHTMLTags

    pg = urllib.urlopen("http://support.microsoft.com/contactussupport/?ws=support")
    htmlSrc = pg.read()
    pg.close()

    # only take first tag returned from makeHTMLTags, not interested in
    # closing </a> tags
    anchorTag = makeHTMLTags("A")[0]

    for a in anchorTag.searchString(htmlSrc):
        if "title" in a:
            print "Title:", a.title
            print "HREF:", a.href
            # or use this statement to dump the complete tag contents
            # print a.dump()
            print

Prints:

    Title: <!--http://www.microsoft.com/usability/information.mspx->
    HREF: http://www.microsoft.com/usability/enroll.mspx

    Title: Print this page
    HREF: /gp/noscript/

    Title: Print this page
    HREF: /gp/noscript/

    Title: E-mail this page
    HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport

    Title: E-mail this page
    HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport

    Title: Microsoft Worldwide
    HREF: /common/international.aspx?rdPath=0

    Title: Microsoft Worldwide
    HREF: /common/international.aspx?rdPath=0

    Title: Save to My Support Favorites
    HREF: /gp/noscript/

    Title: Save to My Support Favorites
    HREF: /gp/noscript/

    Title: Go to My Support Favorites
    HREF: /gp/noscript/

    Title: Go to My Support Favorites
    HREF: /gp/noscript/

    Title: Send Feedback
    HREF: /gp/noscript/

    Title: Send Feedback
    HREF: /gp/noscript/

-- Paul

--
http://mail.python.org/mailman/listinfo/python-list
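
For anyone who wants to check the "inside a quoted string" point without fetching the live page, here is a minimal offline sketch. It is not the pyparsing approach from this post: it swaps in the standard library's html.parser under Python 3 and feeds it only the anchor tag quoted in the original message. The names SNIPPET, AnchorCollector and extract_anchors are made up for this illustration.

```python
from html.parser import HTMLParser

# The anchor tag quoted earlier in the thread, copied as a literal so that
# no network access is needed.
SNIPPET = '''<a href="http://www.microsoft.com/usability/enroll.mspx"
   id="L_75998"
   title="<!--http://www.microsoft.com/usability/information.mspx->"
   onclick="return MS_HandleClick(this,'C_32179', true);">
Help us improve our products
</a>'''


class AnchorCollector(HTMLParser):
    """Hypothetical helper: records <a> attributes and any comments seen."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # one dict of attributes per <a> start tag
        self.comments = []  # text of every HTML comment the parser reports

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this
        if tag == "a":
            self.anchors.append(dict(attrs))

    def handle_comment(self, data):
        self.comments.append(data)


def extract_anchors(html_text):
    """Return (anchors, comments) found in html_text."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return collector.anchors, collector.comments


anchors, comments = extract_anchors(SNIPPET)
for a in anchors:
    if "title" in a:
        print("Title:", a["title"])
        print("HREF:", a["href"])
print("Comments seen:", len(comments))
```

In this sketch the "<!--" text is handed back as part of the title attribute's value and the comment handler is never called, which is the behaviour you would expect when the sequence sits inside a quoted attribute value rather than in the document body.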