Diez B. Roggisch wrote:
> Francach wrote:
>
>> Hi,
>>
>> I'm trying to use the Beautiful Soup package to parse through the
>> "bookmarks.html" file which Firefox exports all your bookmarks into.
>> I've been struggling with the documentation trying to figure out how to
>> extract all the urls. Has anybody got a couple of longer examples using
>> Beautiful Soup I could play around with?
>
>
> Why do you use BeautifulSoup on that? It's generated content, and I
> suppose it is well-formed, most probably even xml. So use a standard
> parser here, better yet something like lxml/elementtree
>
> Diez

Some time ago I wrote some code on this subject for my own purposes, so maybe it can serve as a starting point (tested a bit, but consider it a kind of alpha release):

<code>
from urllib import urlopen
from sgmllib import SGMLParser

class mySGMLParserClassProvidingListOf_HREFs(SGMLParser):
    # provides only HREFs <a href="someURL"> for links to other pages, skipping
    # references to:
    #   - internal links on the same page : "#..."
    #   - email addresses                 : "mailto:..."
    # and dropping any appended internal link info, so that e.g.:
    #   - "LinkSpec#internalLinkID" will be listed as "LinkSpec" only

    # ---
    # reset() overrides SGMLParser.reset(), which the SGMLParser constructor
    # calls, so the list of HREFs is created along with the parser state
    def reset(self):
        SGMLParser.reset(self)
        self.A_HREFs = []
    #: def reset(self)

    # start_a() is found by SGMLParser by its name and will be called each time
    # the SGMLParser detects an <a ...> tag within the feed(ed) HTML document:
    def start_a(self, tagAttributes_asListOfNameValuePairs):
        for attrName, attrValue in tagAttributes_asListOfNameValuePairs:
            if attrName == 'href':
                # startswith() is also safe for an empty href="" value,
                # where attrValue[0] would raise an IndexError
                if not attrValue.startswith('#') and not attrValue.startswith('mailto:'):
                    if attrValue.find('#') >= 0:
                        attrValue = attrValue[:attrValue.find('#')]
                    #: if
                    self.A_HREFs.append(attrValue)
                #: if
            #: if
        #: for
    #: def start_a(self, tagAttributes_asListOfNameValuePairs)
#: class mySGMLParserClassProvidingListOf_HREFs(SGMLParser)

# ------------------------------------------------------------------------------
# ---
# Execution block:
# set URL (urlopen needs the scheme, otherwise it looks for a local file)
fileLikeObjFrom_urlopen = urlopen('http://www.google.com')
mySGMLParserClassObj_withListOfHREFs = mySGMLParserClassProvidingListOf_HREFs()
mySGMLParserClassObj_withListOfHREFs.feed(fileLikeObjFrom_urlopen.read())
mySGMLParserClassObj_withListOfHREFs.close()
fileLikeObjFrom_urlopen.close()

for href in mySGMLParserClassObj_withListOfHREFs.A_HREFs:
    print href
#: for
</code>

Claudio Grondi
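
For the bookmarks.html case from the original question, the same idea should work on text read from a local file instead of a URL. Below is a minimal sketch, not a tested tool: it uses the standard library's HTMLParser rather than SGMLParser (sgmllib no longer exists in Python 3), and the names BookmarkLinkParser, extract_hrefs and SAMPLE are made up for illustration. The sample text only imitates the layout of an exported bookmarks file.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class BookmarkLinkParser(HTMLParser):
    """Collects the href values of <a> tags (hypothetical helper class)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased, so <A HREF="..."> matches.
        if tag != 'a':
            return
        for name, value in attrs:
            if name != 'href' or not value:
                continue
            # Same filtering as in the SGMLParser version: skip same-page
            # links and mail addresses, cut off any "#fragment" part.
            if value.startswith('#') or value.startswith('mailto:'):
                continue
            self.hrefs.append(value.split('#', 1)[0])


def extract_hrefs(html_text):
    """Return all collected link targets found in html_text."""
    parser = BookmarkLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Made-up sample imitating an exported bookmarks file (an assumption,
# not copied from a real export).
SAMPLE = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
    <DT><A HREF="http://www.python.org/" ADD_DATE="1155000000">Python</A>
    <DT><A HREF="http://example.com/page#section">Example</A>
    <DT><A HREF="mailto:someone@example.com">Mail</A>
</DL><p>
"""

if __name__ == '__main__':
    for href in extract_hrefs(SAMPLE):
        print(href)
```

To try it on a real export, read the file and pass its contents, e.g. extract_hrefs(open('bookmarks.html').read()); the file name and its encoding are assumptions about your setup.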