On Thu, Oct 4, 2012 at 2:50 PM, Jason Grout <jason-s...@creativetrax.com> wrote:
> (apologies for possible multiple posts--I've sent this twice to gmane and it
> hasn't appeared)
>
> I've implemented some sanitizing of public worksheets [1] and applied it to
> demo.sagenb.org as a test.  The concerns from before were that javascript
> was executing on the page, leading to malware being on the page.
>
> Can people test the new html sanitizing being done on demo.sagenb.org?
>
> If it looks good (especially to William), I'll roll it out to the other
> *.sagenb.org servers and we'll turn published worksheets back on.

I hate to criticize this, since I know it was a lot of work to
implement.  But since I'm the one whose network gets cut off when
things go wrong, I've got to take this seriously.

I happen to have read straight through the book "The Tangled Web: A
Guide to Securing Modern Web Applications" this summer.   There are
thousands of known and truly dangerous exploits that one can sneak
into an HTML+CSS document.  (It's amazing the sorts of crazy evil
things one can do using weird character sets combined with browser
implementation issues, especially if said browser is IE.)   The book
convincingly argues that the only way to have any hope of making a
safe HTML document is to use a whitelist approach.

The code you wrote uses the exact opposite approach -- the blacklist
approach.    Here's a typical stackoverflow discussion about exactly
the approach in your code:

   
http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/2702587#2702587

Note the comment there: "Note that this uses a blacklist approach to
filter out evil bits, rather than whitelist, but only a whitelisting
approach can guarantee safety. – Søren Løvborg Nov 26 '11 at 21:10"

Also, the lxml documentation describes the function you're using as:
"Removes unwanted tags and content."    Instead, the "whitelist
approach" to sanitizing an HTML+CSS document is to "include wanted
tags and content", which is a very, very different thing.
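
To make the distinction concrete, here is a minimal sketch of the
whitelist idea using only Python's standard html.parser module.  It is
not the sagenb code and not a vetted sanitizer; the names
ALLOWED_TAGS, ALLOWED_ATTRS, ALLOWED_SCHEMES, DROP_CONTENT,
WhitelistSanitizer and sanitize are all made up for illustration, and
the particular tags allowed are an arbitrary assumption.  The point is
only that anything not explicitly listed never reaches the output.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists, chosen only for illustration.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "code", "pre",
                "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = ("http://", "https://")
# Elements whose text content is dropped along with the tag itself.
DROP_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    """Rebuild the document from scratch, emitting only allowed pieces."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, not passed through
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            if name == "href" and not value.strip().lower().startswith(
                    ALLOWED_SCHEMES):
                continue  # rejects javascript:, data:, etc.
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # text is always re-escaped

    # Comments, doctypes and processing instructions are ignored by the
    # default HTMLParser handlers, so they never reach the output.


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="evil()">hi <script>alert(1)</script>'
               '<b>there</b></p>'))
```

Note that this sketch does not balance tags or look at CSS at all, and
it relies on the standard library parser agreeing with what browsers
actually do, which is exactly the kind of assumption the book warns
about.  A real deployment would want one of the existing libraries.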

There are Python libraries  that implement a whitelist approach, e.g.,
see the discussion here:

    
http://stackoverflow.com/questions/1606201/how-can-i-make-html-safe-for-web-browser-with-python

 -- William

P.S. Also, this worries me a little:

   if el.tag=='script' and el.get('type')=='math/tex' and not el.get('src'):

I wonder if there is a way to put malware into a mathjax script tag?

-- 
You received this message because you are subscribed to the Google Groups 
"sage-devel" group.
To post to this group, send email to sage-devel@googlegroups.com.