The whitelist is lxml.html.defs.tags, see 
http://lxml.de/api/lxml.html.defs-module.html
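
For concreteness, here is a rough sketch of how that whitelist gets applied
(not the sagenb code; the names and options are just what I understand from
the lxml docs, so treat it as illustrative):

    # defs.tags is the union of lxml's known-HTML tag sets; with the default
    # remove_unknown_tags=True the Cleaner strips anything outside that set
    # (the tag is dropped, its text gets pulled up into the parent).
    from lxml.html import defs
    from lxml.html.clean import Cleaner

    print(sorted(defs.tags))  # the implicit whitelist

    cleaner = Cleaner(scripts=True,             # drop <script> elements
                      javascript=True,          # drop on* attributes etc.
                      remove_unknown_tags=True)

    # onclick should be stripped and the unknown <foo> tag dropped,
    # leaving its text behind.
    print(cleaner.clean_html('<p onclick="evil()">hi <foo>there</foo></p>'))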

Cleaning CSS should probably be considered a separate problem, especially 
since Microsoft decided in their infinite wisdom to allow embedded 
javascript in CSS files (hence the _css_javascript_re). But we use none of 
that since Jason's patch explicitly removes all style tags.
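
If anyone wants a quick sanity check of that, something along these lines
should do (again just a sketch assuming lxml's documented behaviour; exact
output may vary between versions):

    from lxml.html.clean import Cleaner

    # The IE-only expression() construct is a way to embed JavaScript in
    # CSS.  With style=True the entire <style> element is killed, content
    # and all, so _css_javascript_re never even comes into play for it.
    html = '<div><style>* { width: expression(alert(1)) }</style>hi</div>'
    print(Cleaner(style=True, javascript=True).clean_html(html))
    # expected: roughly '<div>hi</div>'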



On Saturday, October 6, 2012 4:59:02 PM UTC+1, William wrote:
>
> On Sat, Oct 6, 2012 at 8:19 AM, Volker Braun <vbrau...@gmail.com> wrote: 
> > Before you even get to the question of black/whitelisting you have to deal 
> > with malformed documents. Are your rules (black or white) going to apply to 
> > subtly broken tags? I think lxml does the only sane thing here: Parse the 
> > document into a valid xml document, apply rules, and then write everything 
> > out into a new (valid) xml document. 
> > 
> > The lxml.html.clean.Cleaner class defaults to removing unknown tags and, for 
> > the known tags, removing those that are troublesome. So it is technically 
> > whitelisting. 
>
> What is the whitelist it is using, and why is that whitelist a good 
> choice?  I don't see a whitelist in the lxml.html.clean docs. 
>
> The page [1] has a publicly discussed and thought-through whitelist 
> of tags and CSS.  In fact, [1] is much longer than the source code [2] 
> of lxml's Cleaner.  Moreover, reading the source of [2] gives me no 
> confidence whatever that it generates something safe.  E.g., this is 
> not confidence-inspiring to me: 
>
>     # This is an IE-specific construct you can have in a stylesheet to 
>     # run some Javascript: 
>     _css_javascript_re = re.compile( 
>         r'expression\s*\(.*?\)', re.S|re.I) 
>
>
> [1]  http://wiki.whatwg.org/wiki/Sanitization_rules 
> [2]  http://lxml.de/api/lxml.html.clean-pysrc.html 
>
>
> > The main difference to BeautifulSoup, say, is that it uses C 
> > libraries to do the heavy lifting, so it's faster. 
>
>
>
> > 
> > 
> > 
> > 
> > On Saturday, October 6, 2012 2:07:02 PM UTC+1, William wrote: 
> >> 
> >> On Thu, Oct 4, 2012 at 2:50 PM, Jason Grout <jason...@creativetrax.com> wrote: 
> >> > (apologies for possible multiple posts--I've sent this twice to gmane 
> >> > and it hasn't appeared) 
> >> > 
> >> > I've implemented some sanitizing of public worksheets [1] and applied it 
> >> > to demo.sagenb.org as a test.  The concerns from before were that 
> >> > javascript was executing on the page, leading to malware being on the page. 
> >> > 
> >> > Can people test the new html sanitizing being done on demo.sagenb.org? 
> >> > 
> >> > If it looks good (especially to William), I'll roll it out to the other 
> >> > *.sagenb.org servers and we'll turn published worksheets back on. 
> >> 
> >> I hate to criticize this, since I know it was a lot of work to 
> >> implement.  But since I'm the one who gets my network cut off when 
> >> things go wrong, I've got to take this seriously. 
> >> 
> >> I happen to have read straight through the book "The Tangled Web: A 
> >> Guide to Securing Modern Web Applications" this summer.   There are 
> >> thousands of known and truly dangerous exploits that one can sneak 
> >> into an HTML+CSS document.  (It's amazing the sorts of crazy evil 
> >> things one can do using weird character sets combined with browser 
> >> implementation issues, especially if said browser is IE.)   The book 
> >> convincingly argues that the only way to have any hope of making a 
> >> safe HTML document is to use a whitelist approach. 
> >> 
> >> The code you wrote uses the exact opposite approach -- the blacklist 
> >> approach.    Here's a typical stackoverflow discussion about exactly 
> >> the approach in your code: 
> >> 
> >> http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/2702587#2702587
> >> 
> >> Note the comment there: "Note that this uses a blacklist approach to 
> >> filter out evil bits, rather than whitelist, but only a whitelisting 
> >> approach can guarantee safety. – Søren Løvborg Nov 26 '11 at 21:10" 
> >> 
> >> Also, the lxml documentation describes the function you're using as: 
> >> "Removes unwanted tags and content."    Instead, the "whitelist 
> >> approach" to sanitizing an HTML+CSS document is to "include wanted 
> >> tags and content", which is a very, very different thing. 
> >> 
> >> There are Python libraries that implement a whitelist approach, e.g., 
> >> see the discussion here: 
> >> 
> >> http://stackoverflow.com/questions/1606201/how-can-i-make-html-safe-for-web-browser-with-python
> >> 
> >>  -- William 
> >> 
> >> P.S. Also, this worries me a little: 
> >> 
> >>     if el.tag=='script' and el.get('type')=='math/tex' and not el.get('src'): 
> >> 
> >> I wonder if there is a way to put malware into a mathjax script tag? 
> > 
> > -- 
> > You received this message because you are subscribed to the Google 
> Groups 
> > "sage-devel" group. 
> > To post to this group, send email to 
> > sage-...@googlegroups.com<javascript:>. 
>
> > To unsubscribe from this group, send email to 
> > sage-devel+...@googlegroups.com <javascript:>. 
> > Visit this group at http://groups.google.com/group/sage-devel?hl=en. 
> > 
> > 
>
>
>
> -- 
> William Stein 
> Professor of Mathematics 
> University of Washington 
> http://wstein.org 
>
