On Sat, Oct 6, 2012 at 6:06 AM, William Stein <wst...@gmail.com> wrote:
> On Thu, Oct 4, 2012 at 2:50 PM, Jason Grout <jason-s...@creativetrax.com> 
> wrote:
>> (apologies for possible multiple posts--I've sent this twice to gmane and it
>> hasn't appeared)
>>
>> I've implemented some sanitizing of public worksheets [1] and applied it to
>> demo.sagenb.org as a test.  The concerns from before were that javascript
>> was executing on the page, leading to malware being on the page.
>>
>> Can people test the new html sanitizing being done on demo.sagenb.org?
>>
>> If it looks good (especially to William), I'll roll it out to the other
>> *.sagenb.org servers and we'll turn published worksheets back on.
>
> I hate to criticize this, since I know it was a lot of work to
> implement.  But since I'm the one who gets my network cut off when
> things go wrong, I've got to take this seriously.
>
> I happen to have read straight through the book "The Tangled Web: A
> Guide to Securing Modern Web Applications" this summer.   There are
> thousands of known and truly dangerous exploits that one can sneak
> into an HTML+CSS document.  (It's amazing the sorts of crazy evil
> things one can do using weird character sets combined with browser
> implementation issues, especially if said browser is IE.)   The book
> convincingly argues that the only way to have any hope of making a
> safe HTML document is to use a whitelist approach.
>
> The code you wrote uses the exact opposite approach -- the blacklist
> approach.    Here's a typical stackoverflow discussion about exactly
> the approach in your code:
>
>    http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/2702587#2702587
>
> Note the comment there: "Note that this uses a blacklist approach to
> filter out evil bits, rather than whitelist, but only a whitelisting
> approach can guarantee safety. – Søren Løvborg Nov 26 '11 at 21:10"
>
> Also, the lxml documentation describes the function you're using as:
> "Removes unwanted tags and content."    Instead, the "whitelist
> approach" to sanitizing an HTML+CSS document is to "include wanted
> tags and content", which is a very, very different thing.
>
> There are Python libraries that implement a whitelist approach, e.g.,
> see the discussion here:
>
>     http://stackoverflow.com/questions/1606201/how-can-i-make-html-safe-for-web-browser-with-python

There's a FAQ linked to from that page that nicely explains whitelist/blacklist:

  http://jacobian.org/writing/untrusted-users-and-html/

"I’ve literally seen hundreds of recipes for stripping unsafe HTML
that are about as effective as a screen door on a submarine."
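For concreteness, the whitelist idea can be sketched with just the Python
standard library.  This is a hypothetical illustration, not the sagenb
code, and the tag/attribute sets are made up for the example -- a real
policy would need much more care (attribute values, URLs, CSS, etc.):

```python
from html import escape
from html.parser import HTMLParser

# Illustrative whitelist only -- not a vetted policy.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a", "pre", "code"}
ALLOWED_ATTRS = {"a": {"href"}}

class WhitelistSanitizer(HTMLParser):
    """Rebuild a document keeping only whitelisted tags/attributes.

    Anything not explicitly allowed is escaped, so unknown or
    malformed markup degrades to visible text instead of live HTML.
    """
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            allowed = ALLOWED_ATTRS.get(tag, set())
            kept = [(k, v) for k, v in attrs
                    if k in allowed
                    and not (v or "").lower().strip().startswith("javascript:")]
            attr_str = "".join(f' {k}="{escape(v or "")}"' for k, v in kept)
            self.out.append(f"<{tag}{attr_str}>")
        else:
            # Unknown tag: escape the raw tag text so it renders as text.
            self.out.append(escape(self.get_starttag_text() or ""))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))

def sanitize(html):
    p = WhitelistSanitizer()
    p.feed(html)
    p.close()
    return "".join(p.out)
```

The point is the default: everything is dropped or escaped unless it is
on the list, which is the opposite of lxml's "remove unwanted tags."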

William

>
>  -- William
>
> P.S. Also, this worries me a little:
>
>    if el.tag=='script' and el.get('type')=='math/tex' and not el.get('src'):
>
> I wonder if there is a way to put malware into a mathjax script tag?
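On the P.S.: one way to tighten that check -- a hypothetical helper, not
code from the patch -- is to whitelist the script *body* as well, rejecting
anything containing '<' so the content can never close the script element
or smuggle in new tags when re-serialized:

```python
import re

def is_safe_mathjax_script(el_type, el_src, body):
    """Hypothetical, stricter variant of the quoted check.

    Requires an inline (no src) script whose type starts with
    "math/tex" (MathJax also uses e.g. "math/tex; mode=display"),
    and rejects any body containing '<'.
    """
    if el_src is not None:
        return False
    if not re.match(r'^math/tex\b', (el_type or '')):
        return False
    return '<' not in body
```

This is deliberately conservative: legitimate TeX such as "a < b" would
be rejected; a real implementation might HTML-escape the body instead of
refusing it outright.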



-- 
William Stein
Professor of Mathematics
University of Washington
http://wstein.org

-- 
You received this message because you are subscribed to the Google Groups 
"sage-devel" group.
To post to this group, send email to sage-devel@googlegroups.com.
To unsubscribe from this group, send email to 
sage-devel+unsubscr...@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel?hl=en.