On 15 Jan 2006, at 23:22, tonemcd wrote:
> If your articles have HTML in them, you'll need to be careful that no
> 'dangerous' HTML is included (javascript is the most common). A good
> library is stripogram -
> http://www.zope.org/Members/chrisw/StripOGram/readme
Stripogram is inadequate for protecting against XSS attacks. It
doesn't strip style="" attributes (which can contain executable code)
and its filtering of javascript:-style links is very simplistic.
Here's their code for attribute filtering:
    if lower(k[0:2]) != 'on' and lower(v[0:10]) != 'javascript':
        self.result += ' %s="%s"' % (k, v)
And here are three ways off the top of my head to defeat that:
<a href=" javascript:alert('XSS')">Click me</a> (Note the leading space)
<a href="vbscript:alert('XSS')">Click me</a> (IE will run this)
<a href="java
script:alert('XSS')">Click me</a> (The line break inside "javascript"
is deliberate. IE will run this too; it was part of the MySpace worm:
http://namb.la/popular/tech.html )
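To make the failure concrete, here is a minimal sketch that reproduces only the condition quoted above and feeds it the three attribute values. The function name stripogram_allows is hypothetical, and the check is rewritten with str.lower() so it runs on current Python; it is not the library's actual code path, just the same comparison.

```python
def stripogram_allows(name, value):
    """Return True if the quoted check would keep this attribute.

    Hypothetical helper: mirrors only the condition quoted from
    stripogram, not the rest of the library.
    """
    return name[0:2].lower() != 'on' and value[0:10].lower() != 'javascript'


# The three bypasses from the message, as (attribute, value) pairs.
bypasses = [
    ('href', " javascript:alert('XSS')"),   # leading space
    ('href', "vbscript:alert('XSS')"),      # different scripting scheme
    ('href', "java\nscript:alert('XSS')"),  # newline inside the scheme
]

# True means the attribute is written to the output unchanged.
results = [stripogram_allows(name, value) for name, value in bypasses]

# Only the plain, unobfuscated form is caught.
plain_is_kept = stripogram_allows('href', "javascript:alert('XSS')")
```

All three bypass values pass the check, because it compares only the first ten characters of the value against one literal string.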
Filtering unsafe HTML is a deceptively hard problem - you need to be
aware not just of the HTML spec but also of the full details of all
of the common implementations and their bugs. Since the most
widespread of these is closed source, good luck!
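One common alternative to blacklisting bad prefixes is to allow only URL schemes you explicitly trust. The sketch below illustrates that idea under several assumptions: the names ALLOWED_SCHEMES and is_safe_url are made up for this example, the scheme list is arbitrary, and the value is assumed to have already had HTML entities decoded by a parser. It is not a vetted filter and does not cover style attributes or other browser quirks.

```python
import re

# Assumption: an illustrative allowlist; adjust to what your site needs.
ALLOWED_SCHEMES = {'http', 'https', 'mailto'}

# Whitespace and control characters that some browsers ignore in a URL.
_IGNORED_CHARS = re.compile(r'[\x00-\x20\x7f]')

# A URL scheme: a letter, then letters, digits, '+', '.' or '-', then ':'.
_SCHEME = re.compile(r'([A-Za-z][A-Za-z0-9+.\-]*):')


def is_safe_url(value):
    """Return True if the URL has no scheme or an allowlisted scheme.

    Hypothetical helper: expects an attribute value that has already
    been entity-decoded.
    """
    cleaned = _IGNORED_CHARS.sub('', value)
    match = _SCHEME.match(cleaned)
    if match is None:
        # No scheme at all, e.g. a relative link such as "/about".
        return True
    return match.group(1).lower() in ALLOWED_SCHEMES
```

With this check, all three example links above are rejected, while ordinary http links and relative paths are kept. Anything not on the list is dropped by default, so a scheme nobody thought of fails closed rather than open.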
Definitely don't use stripogram though. It will give you nothing more
than a false sense of security. I'm going to submit these bugs to the
library author.
The best Python stripping code I've seen is in Mark Pilgrim's
feedparser. You might want to try extracting it.
Cheers,
Simon