Thomas SMETS wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> Dear,
>
> I need to parse XHTML/HTML files in all ways :
> ~ _ Removing comments and javascripts is a first issue
> ~ _ Retrieving the list of fields to submit is my following item (todo)
>
> Any idea where I could find this already made ... ?

You could try XIST (http://www.livinglogic.de/Python/xist). Removing
comments and JavaScript works like this:

---
from ll.xist import xsc, parsers
from ll.xist.ns import html

# Parse the page; tidy=True cleans up broken HTML before building the tree
e = parsers.parseURL("http://www.python.org/", tidy=True)

def removestuff(node, converter):
    # Replace comments and JavaScript <script> elements with the empty node
    if isinstance(node, xsc.Comment):
        node = xsc.Null
    elif isinstance(node, html.script) and \
        (unicode(node["type"]) == u"text/javascript" or \
         unicode(node["language"]) == u"Javascript" \
        ):
        node = xsc.Null
    return node

# mapped() builds a new tree by applying removestuff to the nodes
e = e.mapped(removestuff)

print e.asBytes()
---

Retrieving the list of fields from all forms on a page might look like
this:

---
from ll.xist import xsc, parsers, xfind
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

# Walk every <form>, then every <input>/<textarea> inside it
for form in e//html.form:
    print "Fields for %s" % form["action"]
    for field in form//xfind.is_(html.input, html.textarea):
        # Prefer the id attribute; fall back to name
        if "id" in field.attrs:
            print "\t%s" % field["id"]
        else:
            print "\t%s" % field["name"]
---

This prints:

Fields for http://www.google.com/search
	q
	domains
	sitesearch
	sourceid
	submit

Hope that helps!

Bye,
   Walter Dörwald
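
For comparison, here is a rough sketch of the same field listing using only the Python 3 standard library (html.parser) instead of XIST. It is not part of XIST and does no tidying of broken markup. The names FormFieldCollector and list_form_fields are made up for this illustration, and the sample markup is invented. Like the XIST example above, it looks at input and textarea elements only and prefers id over name.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect (action, [field labels]) for every <form> in a document.

    Hypothetical helper for illustration; not an XIST class.
    """

    FIELD_TAGS = {"input", "textarea"}

    def __init__(self):
        super().__init__()
        self.forms = []       # list of (action, [field labels])
        self._current = None  # field list of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # Start a new field list and remember the form's action
            self._current = []
            self.forms.append((attrs.get("action", ""), self._current))
        elif tag in self.FIELD_TAGS and self._current is not None:
            # Prefer id, fall back to name; skip fields that have neither
            label = attrs.get("id") or attrs.get("name")
            if label:
                self._current.append(label)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_form_fields(markup):
    """Return a list of (action, [field labels]) tuples for an HTML string."""
    parser = FormFieldCollector()
    parser.feed(markup)
    parser.close()
    return parser.forms


if __name__ == "__main__":
    # Invented sample markup, only to show the call
    sample = """
    <form action="http://www.google.com/search">
      <input name="q"><input type="hidden" name="domains">
      <textarea id="notes"></textarea>
    </form>
    """
    for action, fields in list_form_fields(sample):
        print("Fields for %s" % action)
        for label in fields:
            print("\t%s" % label)
```

Fields outside any form are ignored, which matches the form-by-form traversal in the XIST version. Real pages with badly nested markup may still need a tidying step first.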