On Fri, Jul 13, 2007 at 11:18:18AM +0100, Nic James Ferrier wrote:
> 
> Derek Anderson <[EMAIL PROTECTED]> writes:
> 
> > hey all,
> >
> > could anyone point me to a python html sanitizer implementation (or 
> > example)?  i don't mean to strip all html, just tags and attributes not 
> > on a whitelist, such as I/B/A href/U/etc.
> 
> I use libxml2/libxslt, something like:
> 
>   doc = libxml2.htmlParseDoc(whatever, "utf8")
>   result = libxslt.applyStylesheetFile(doc, "strip.xslt", {})
> 
> There are loads of ways of stripping in xslt depending on what you
> want to do.

That only works on well-formed XHTML documents, though... which, although
they should be the norm, really aren't!

I tend to use the sanitizer from feedparser.

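# _HTMLSanitizer is a private (underscore-prefixed) feedparser class,
# so it may change between releases.
from feedparser import _HTMLSanitizer as HTMLSanitizer

sanitizer = HTMLSanitizer('utf8')  # encoding of the input
sanitizer.feed(content)            # content: the untrusted HTML string
output = sanitizer.output()        # the sanitized markup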
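
If you'd rather not lean on a private class, the same whitelist idea can be
sketched with nothing but the standard library. This is only a rough
illustration, not what feedparser does internally: it assumes Python 3's
html.parser, and the names ALLOWED, SAFE_SCHEMES, WhitelistSanitizer and
sanitize() are all made up for the example. Tags off the whitelist are
dropped but their text is kept, and href values are only kept for a few
URL schemes.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attribute names allowed on that tag.
ALLOWED = {"i": set(), "b": set(), "u": set(), "a": {"href"}}
# Assumption: only absolute links with these prefixes are kept
# (relative URLs are dropped too).
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself, keep whatever text it wraps
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unexpected schemes
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.parts.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        self.parts.append(escape(data, quote=False))

    def output(self):
        return "".join(self.parts)


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return parser.output()
```

Usage is just output = sanitize(content). Comments and doctype declarations
fall through the parser's default no-op handlers, so they are dropped as well.
Anything facing real users deserves a properly maintained sanitizer rather
than a sketch like this.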
            
Cheers,
-- 
Brett Parker

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en
-~----------~----~----~----~------~----~------~--~---
