I'm trying to understand HTMLParser, so I copied the example code from http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser and ran it on my LinkedIn page. It runs without errors, but some of the tags seem to go missing for no apparent reason. Any advice? I have searched extensively for this, but I seem to be the only one with missing data from HTMLParser :(
Code:

    import urllib2
    from HTMLParser import HTMLParser
    from GetHttpFileContents import getHttpFileContents

    # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "Start tag:\n\t", tag
            for attr in attrs:
                print "\t\tattr:", attr
            # end for attr in attrs:
        #
        def handle_endtag(self, tag):
            print "End tag :\n\t", tag
        #
        def handle_data(self, data):
            if data != '\n\n':
                if data != '\n':
                    print "Data :\t\t", data
                # end if 1
            # end if 2
        #
    #
    # ---------------------------------------------------------------------
    #
    def removeHtmlFromFileContents():
        TextOut = ''
        parser = MyHTMLParser()
        parser.feed(urllib2.urlopen('http://nl.linkedin.com/in/bobaalsma').read())
        return TextOut
    #
    # ---------------------------------------------------------------------
    #
    if __name__ == '__main__':
        TextOut = removeHtmlFromFileContents()

Part of the output:

    End tag :
        script
    Start tag:
        title
    Data :      Bob Aalsma - Nederland | LinkedIn
    End tag :
        title
    Start tag:
        script
            attr: ('type', 'text/javascript')
            attr: ('src', 'http://www.linkedin.com/uas/authping?url=http%3A%2F%2Fnl%2Elinkedin%2Ecom%2Fin%2Fbobaalsma')
    End tag :
        script
    Start tag:
        link
            attr: ('rel', 'stylesheet')
            attr: ('type', 'text/css')
            attr: ('href', 'http://s3.licdn.com/scds/concat/common/css?h=5v4lkweptdvona6w56qelodrj-7pfvsr76gzb22ys278pbj80xm-b1io9ndljf1bvpack85gyxhv4-5xxmkfcm1ny97biv0pwj7ch69')
    Start tag:
        script
            attr: ('type', 'text/javascript')
            attr: ('src', 'http://s4.licdn.com/scds/concat/common/js?h=7nhn6ycbvnz80dydsu88wbuk-1kjdwxpxv0c3z97afuz9dlr9g-dlsf699o6xkxgppoxivctlunb-8v6o0480wy5u6j7f3sh92hzxo')
    End tag :
        script
    End tag :
        head

But the source text for this part of the page is as follows, and all of the "<meta ...>" tags seem to go missing:

    </script>
    <title>Bob Aalsma | LinkedIn</title>
    <link rel="stylesheet" type="text/css" href="https://s3-s.licdn.com/scds/concat/common/css?h=7d22iuuoi1bmp3a2jb6jyv5z5">
    <link rel="stylesheet" type="text/css" href="https://s4-s.licdn.com/scds/concat/common/css?h=b1io9ndljf1bvpack85gyxhv4-6qrj4gxbwq8loasfnyfmyuphe-dhog2e5h8scik4whkpqccnzou-dmo1gwj6nlhvdvzx7rmluambv-69sgyia02rmcjmco0t9d3xpvo">
    <meta name="LinkedInBookmarkType" content="profile">
    <meta name="ShortTitle" content="Bob Aalsma">
    <meta name="Description" content="Bob Aalsma: Project Manager at DripFeed in the Information Services industry (Amsterdam Area, Netherlands)">
    <meta name="UniqueID" content="24198692">
    <meta name="SaveURL" content="/profile/view?id=24198692&authType=name&authToken=KhOG">
    </head>
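One way to narrow this down might be to take the network out of the picture. The following is a minimal sketch, not a definitive test: it feeds a few of the source lines quoted above straight into a parser subclass and records which start tags reach handle_starttag. The names TagCollector, SNIPPET and collect_start_tags are hypothetical, the snippet is only a subset of the quoted source, and the import fallback is an assumption added so the sketch also runs on Python 3.

```python
try:
    from html.parser import HTMLParser   # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser    # Python 2, as used in the post


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag instead of printing it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep it as a dict
        self.start_tags.append((tag, dict(attrs)))


# A subset of the source text quoted in the post, held in a string
SNIPPET = '''<title>Bob Aalsma | LinkedIn</title>
<link rel="stylesheet" type="text/css" href="https://s3-s.licdn.com/scds/concat/common/css?h=7d22iuuoi1bmp3a2jb6jyv5z5">
<meta name="LinkedInBookmarkType" content="profile">
<meta name="ShortTitle" content="Bob Aalsma">
<meta name="UniqueID" content="24198692">
</head>'''


def collect_start_tags(html):
    """Parse an HTML string and return the (tag, attrs) pairs that were seen."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.start_tags


tags = collect_start_tags(SNIPPET)
# Pull out the name attribute of every meta tag the parser reported
meta_names = [attrs.get('name') for tag, attrs in tags if tag == 'meta']
print(meta_names)
```

If the meta tags are listed here, the parser itself handles them, and the HTML that urllib2 receives may simply differ from the source viewed in the browser. The titles already differ: "Bob Aalsma - Nederland | LinkedIn" in the output versus "Bob Aalsma | LinkedIn" in the source.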