On Fri, Jul 13, 2007 at 11:18:18AM +0100, Nic James Ferrier wrote:
>
> Derek Anderson <[EMAIL PROTECTED]> writes:
>
> > hey all,
> >
> > could anyone point me to a python html sanitizer implementation (or
> > example)? i don't mean to strip all html, just tags and attributes not
> > on a whitelist, such as I/B/A href/U/etc.
>
> I use libxml2/libxslt, something like:
>
>   doc = libxml2.htmlParseDoc(whatever, "utf8")
>   result = libxslt.applyStylesheetFile(doc, "strip.xslt", {})
>
> There are loads of ways of stripping in xslt depending on what you
> want to do.

Only works on well-formed XHTML documents though... which, although they
should be the norm, really aren't! I tend to use the sanitizer from
feedparser:

    # _HTMLSanitizer is a private class (note the leading underscore),
    # so it may change between feedparser releases.
    from feedparser import _HTMLSanitizer as HTMLSanitizer

    # content is the untrusted HTML string to clean
    sanitizer = HTMLSanitizer('utf8')
    sanitizer.feed(content)
    output = sanitizer.output()

Cheers,
--
Brett Parker
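
For anyone who would rather not depend on a private feedparser class, the whitelist idea asked about in this thread can be sketched with the standard library's html.parser alone. This is only an illustration, not a security-reviewed sanitizer and not how feedparser does it: the names ALLOWED, SAFE_SCHEMES, WhitelistSanitizer and sanitize are made up for this example, and the whitelist itself (I/B/U plus A with href) is just the one mentioned in the original question.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration: tag name -> allowed attribute names.
ALLOWED = {"i": set(), "b": set(), "u": set(), "a": {"href"}}

# Assumption: only absolute links with these prefixes are kept, so relative
# links are dropped along with javascript: URLs.
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # drop the tag itself; its text content is still kept
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.parts.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth and tag in ALLOWED:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text inside <script>/<style> is discarded; everything else is escaped.
        if not self.skip_depth:
            self.parts.append(escape(data, quote=False))

    def output(self):
        return "".join(self.parts)


def sanitize(content):
    parser = WhitelistSanitizer()
    parser.feed(content)
    parser.close()
    return parser.output()
```

Usage mirrors the feedparser snippet: output = sanitize(content). Tags that are not on the whitelist are removed but their text is kept, which may or may not be what you want for your site.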