Before you even get to the question of black/whitelisting, you have to deal 
with malformed documents. Are your rules (black or white) going to apply to 
subtly broken tags? I think lxml does the only sane thing here: parse the 
document into a valid XML document, apply the rules, and then write 
everything out as a new (valid) XML document. 
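
Roughly, that flow looks like this (the broken fragment below is just a 
made-up example; lxml repairs the markup while it builds the tree, so the 
rules only ever see well-formed elements):

    import lxml.html

    broken = "<p>hello <b>world<script>alert(1)</script>"
    tree = lxml.html.fromstring(broken)   # the parser fixes up the markup
    # ... apply your black/whitelist rules to the element tree here ...
    print(lxml.html.tostring(tree, encoding='unicode'))  # well-formed again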

The lxml.html.clean.Cleaner class defaults to removing unknown tags and, 
for the known tags, removing those that are troublesome, so it is 
technically whitelisting. The main difference from, say, BeautifulSoup is 
that it uses C libraries to do the heavy lifting, so it's faster.
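
If you want the whitelist to be explicit rather than relying on the 
defaults, Cleaner takes an allow_tags argument. Something along these 
lines (the tag list is purely illustrative, not a vetted whitelist):

    from lxml.html.clean import Cleaner

    cleaner = Cleaner(
        scripts=True,               # drop <script> elements
        javascript=True,            # drop javascript in attributes and stylesheets
        remove_unknown_tags=False,  # must be False when allow_tags is given
        allow_tags=['p', 'a', 'b', 'i', 'em', 'strong', 'pre', 'br'],
    )
    safe = cleaner.clean_html('<p onclick="evil()">hi<script>alert(1)</script></p>')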




On Saturday, October 6, 2012 2:07:02 PM UTC+1, William wrote:
>
> On Thu, Oct 4, 2012 at 2:50 PM, Jason Grout 
> <jason...@creativetrax.com> wrote: 
> > (apologies for possible multiple posts--I've sent this twice to gmane 
> > and it hasn't appeared) 
> > 
> > I've implemented some sanitizing of public worksheets [1] and applied 
> > it to demo.sagenb.org as a test.  The concerns from before were that 
> > javascript was executing on the page, leading to malware being on the 
> > page. 
> > 
> > Can people test the new html sanitizing being done on demo.sagenb.org? 
> > 
> > If it looks good (especially to William), I'll roll it out to the other 
> > *.sagenb.org servers and we'll turn published worksheets back on. 
>
> I hate to criticize this, since I know it was a lot of work to 
> implement.  But since I'm the one who gets my network cut off when 
> things go wrong, I've got to take this seriously. 
>
> I happen to have read straight through the book "The Tangled Web: A 
> Guide to Securing Modern Web Applications" this summer.   There are 
> thousands of known and truly dangerous exploits that one can sneak 
> into an HTML+CSS document.  (It's amazing the sorts of crazy evil 
> things one can do using weird character sets combined with browser 
> implementation issues, especially if said browser is IE.)   The book 
> convincingly argues that the only way to have any hope of making a 
> safe HTML document is to use a whitelist approach. 
>
> The code you wrote uses the exact opposite approach -- the blacklist 
> approach.    Here's a typical stackoverflow discussion about exactly 
> the approach in your code: 
>
>    http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/2702587#2702587
>
> Note the comment there: "Note that this uses a blacklist approach to 
> filter out evil bits, rather than whitelist, but only a whitelisting 
> approach can guarantee safety. – Søren Løvborg Nov 26 '11 at 21:10" 
>
> Also, the lxml documentation describes the function you're using as: 
> "Removes unwanted tags and content."    Instead, the "whitelist 
> approach" to sanitizing an HTML+CSS document is to "include wanted 
> tags and content", which is a very, very different thing. 
>
> There are Python libraries that implement a whitelist approach, e.g., 
> see the discussion here: 
>
>    http://stackoverflow.com/questions/1606201/how-can-i-make-html-safe-for-web-browser-with-python
>
>  -- William 
>
> P.S. Also, this worries me a little: 
>
>    if el.tag=='script' and el.get('type')=='math/tex' and not el.get('src'): 
>
> I wonder if there is a way to put malware into a mathjax script tag? 
>
