BobAalsma wrote:

> I'm trying to understand the HTMLParser so I've copied some code from
> http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser
> and tried that on my LinkedIn page.
> No errors, but some of the tags seem to go missing for no apparent
> reason - any advice?
> I have searched extensively for this, but seem to be the only one with
> missing data from HTMLParser :(
>
> Code:
> import urllib2
> from HTMLParser import HTMLParser
>
> from GetHttpFileContents import getHttpFileContents
>
> # create a subclass and override the handler methods
> class MyHTMLParser(HTMLParser):
>     def handle_starttag(self, tag, attrs):
>         print "Start tag:\n\t", tag
>         for attr in attrs:
>             print "\t\tattr:", attr
>         # end for attr in attrs:
>     #
>     def handle_endtag(self, tag):
>         print "End tag :\n\t", tag
>     #
>     def handle_data(self, data):
>         if data != '\n\n':
>             if data != '\n':
>                 print "Data :\t\t", data
>             # end if 1
>         # end if 2

Please no! A kitten dies every time you write one of those comments ;)

> def removeHtmlFromFileContents():
>     TextOut = ''
>
>     parser = MyHTMLParser()
>     parser.feed(urllib2.urlopen(
>         'http://nl.linkedin.com/in/bobaalsma').read())
>
>     return TextOut
> #
> # ---------------------------------------------------------------------
> #
> if __name__ == '__main__':
>     TextOut = removeHtmlFromFileContents()

After removing

> from GetHttpFileContents import getHttpFileContents

from your script I get the following output (using python 2.7):

$ python parse_orig.py | grep meta -C2
	 script
Start tag:
	 meta
		attr: ('http-equiv', 'content-type')
		attr: ('content', 'text/html; charset=UTF-8')
Start tag:
	 meta
		attr: ('http-equiv', 'X-UA-Compatible')
		attr: ('content', 'IE=8')
Start tag:
	 meta
		attr: ('name', 'description')
		attr: ('content', 'Bekijk het (Nederland) professionele profiel van Bob Aalsma op LinkedIn. LinkedIn is het grootste zakelijke netwerk ter wereld. Professionals als Bob Aalsma kunnen hiermee interne connecties met aanbevolen kandidaten, branchedeskundigen en businesspartners vinden.')
Start tag:
	 meta
		attr: ('name', 'pageImpressionID')
		attr: ('content', '711eedaa-8273-45ca-a0dd-77eb96749134')
Start tag:
	 meta
		attr: ('name', 'pageKey')
		attr: ('content', 'nprofile-public-success')
Start tag:
	 meta
		attr: ('name', 'analyticsURL')
		attr: ('content', '/analytics/noauthtracker')
$

So there definitely are some meta tags. Note that if you're logged in to
a site, the HTML the browser is "seeing" may differ from the HTML you are
retrieving via urllib2.urlopen(...).read(). Perhaps that is why you don't
get what you expect.
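
One way to check whether tags really go missing is to collect them into a list instead of printing them, so the result can be inspected or compared. What follows is only a rough sketch, not your script: the class MetaCollector, the function collect_metas and the SAMPLE string are made up for illustration, and the import tries the Python 3 module name first before falling back to the Python 2 name used in this thread. It parses a fixed string rather than fetching the LinkedIn page.

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 name, as used in the thread


class MetaCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <meta> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'meta':
            self.metas.append(dict(attrs))


def collect_metas(html):
    """Feed one HTML string to a fresh parser and return the meta attributes."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.metas


# Made-up sample standing in for the downloaded page.
SAMPLE = (
    '<html><head>'
    '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
    '<meta name="pageKey" content="nprofile-public-success">'
    '</head><body><p>hello</p></body></html>'
)

metas = collect_metas(SAMPLE)
for attrs in metas:
    print(sorted(attrs.items()))
```

If you save the page source from your logged-in browser to a file and run both that text and the urllib2 download through collect_metas, any difference between the two lists would suggest the server is sending different HTML, rather than HTMLParser dropping tags.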