On Jan 18, 8:40 am, Stefan Behnel <stefan...@behnel.de> wrote:
> Simon Forman wrote:
> > I want to take a webpage, find all URLs (links, img src, etc.) and
> > rewrite them in-place, and I'd like to do it in python (pure python
> > preferred.)
>
> lxml.html has functions specifically for this problem.
>
> http://codespeak.net/lxml/lxmlhtml.html#working-with-links
>
> Code would be something like
>
> html_doc = lxml.html.parse(b"http://.../xyz.html")
> html_doc.rewrite_links( ... )
> print( lxml.html.tostring(html_doc) )
>
> It also handles links in CSS or JavaScript, as well as broken HTML
> documents.
>
> Stefan
Thank you so much! This is exactly what I needed.

(For what it's worth, parse() seems to return a "plain" ElementTree
object, but document_fromstring() gave me a special html object with
the rewrite_links() method. I think, but I didn't try it, that

    html_doc = lxml.html.parse(b"http://.../xyz.html")
    lxml.html.rewrite_links(html_doc, ... )

would work too; see the sketch in the P.S. below.)

Thanks again!

Regards,
~Simon
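P.S. In case it helps anyone searching the archives later, here is a
rough, untested sketch of the variants I believe work. The sample HTML,
the repl() function, and the mirror.example.com prefix are made-up
stand-ins, not anything that comes from lxml itself:

    from io import StringIO
    import lxml.html

    def repl(link):
        # toy link_repl_func: point every URL at a hypothetical mirror
        return "http://mirror.example.com/" + link.lstrip("/")

    html = '<html><body><a href="/a.html">a</a><img src="pic.png"></body></html>'

    # 1) document_fromstring() returns an HtmlElement, which has the
    #    rewrite_links() method directly (it modifies the tree in place)
    doc = lxml.html.document_fromstring(html)
    doc.rewrite_links(repl)
    print(lxml.html.tostring(doc))

    # 2) parse() returns an ElementTree; its getroot() is the HtmlElement,
    #    so the same method is available one step down
    tree = lxml.html.parse(StringIO(html))   # a filename or URL works here too
    tree.getroot().rewrite_links(repl)
    print(lxml.html.tostring(tree))

    # 3) the module-level helper accepts an HTML string and returns a new
    #    string with the links rewritten, leaving the input untouched
    print(lxml.html.rewrite_links(html, repl))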