On Sat, Oct 6, 2012 at 8:19 AM, Volker Braun <vbraun.n...@gmail.com> wrote:
> Before you even get to the question of black/whitelisting you have to deal
> with malformed documents. Are your rules (black or white) going to apply to
> subtly broken tags? I think lxml does the only sane thing here: Parse the
> document into a valid xml document, apply rules, and then write everything
> out into a new (valid) xml document.
>
> The lxml.html.clean.Cleaner class defaults to removing unknown tags and, for
> the known tags, removing those that are troublesome. So it is technically
> whitelisting.
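
A minimal sketch of the parse, apply rules, re-serialize pipeline described in the quoted paragraph. It uses Python's standard-library html.parser rather than lxml so that it is self-contained. The tag and attribute lists and the names WhitelistSanitizer and sanitize are hypothetical, chosen only for illustration; they are not lxml's defaults or any published list, and this is not a vetted sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist, for illustration only.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "a", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = ("http://", "https://")
VOID_TAGS = {"br"}
DROP_CONTENT = {"script", "style"}  # drop these tags and their contents


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags/attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # whitelisted tags currently open
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            value = (value or "").strip()
            if name == "href" and not value.lower().startswith(ALLOWED_SCHEMES):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return
        # Close anything left open inside this tag so the output stays balanced.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    def result(self):
        self.close()
        while self.open_tags:
            self.out.append("</%s>" % self.open_tags.pop())
        return "".join(self.out)


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    return parser.result()
```

Anything not named in the lists is dropped, including comments, event-handler attributes and non-http(s) links, and unbalanced tags are closed on output. The point of the sketch is only that the rules are applied to the parsed document, not to the raw text.
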

What whitelist is it using, and why is that whitelist a good choice?
I don't see a whitelist in the lxml.html.clean docs. The page [1]
has a publicly discussed and thought-through whitelist of tags and CSS.
In fact, [1] is much longer than the source code [2] of lxml's Cleaner.
Moreover, reading the source of [2] gives me no confidence whatsoever
that it generates something safe. For example, this does not inspire
confidence:

    # This is an IE-specific construct you can have in a stylesheet to
    # run some Javascript:
    _css_javascript_re = re.compile(
        r'expression\s*\(.*?\)', re.S|re.I)

[1] http://wiki.whatwg.org/wiki/Sanitization_rules
[2] http://lxml.de/api/lxml.html.clean-pysrc.html

> The main difference to BeautifulSoup, say, is that it uses C
> libraries to do the heavy lifting so it's faster.
>
> On Saturday, October 6, 2012 2:07:02 PM UTC+1, William wrote:
>>
>> On Thu, Oct 4, 2012 at 2:50 PM, Jason Grout <jason...@creativetrax.com>
>> wrote:
>> > (apologies for possible multiple posts--I've sent this twice to gmane
>> > and it hasn't appeared)
>> >
>> > I've implemented some sanitizing of public worksheets [1] and applied it
>> > to demo.sagenb.org as a test. The concerns from before were that
>> > javascript was executing on the page, leading to malware being on the
>> > page.
>> >
>> > Can people test the new html sanitizing being done on demo.sagenb.org?
>> >
>> > If it looks good (especially to William), I'll roll it out to the other
>> > *.sagenb.org servers and we'll turn published worksheets back on.
>>
>> I hate to criticize this, since I know it was a lot of work to
>> implement. But since I'm the one who gets my network cut off when
>> things go wrong, I've got to take this seriously.
>>
>> I happen to have read straight through the book "The Tangled Web: A
>> Guide to Securing Modern Web Applications" this summer. There are
>> thousands of known and truly dangerous exploits that one can sneak
>> into an HTML+CSS document.
>> (It's amazing the sorts of crazy evil
>> things one can do using weird character sets combined with browser
>> implementation issues, especially if said browser is IE.) The book
>> convincingly argues that the only way to have any hope of making a
>> safe HTML document is to use a whitelist approach.
>>
>> The code you wrote uses the exact opposite approach -- the blacklist
>> approach. Here's a typical stackoverflow discussion about exactly
>> the approach in your code:
>>
>> http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/2702587#2702587
>>
>> Note the comment there: "Note that this uses a blacklist approach to
>> filter out evil bits, rather than whitelist, but only a whitelisting
>> approach can guarantee safety. – Søren Løvborg Nov 26 '11 at 21:10"
>>
>> Also, the lxml documentation describes the function you're using as:
>> "Removes unwanted tags and content." Instead, the "whitelist
>> approach" to sanitizing an HTML+CSS document is to "include wanted
>> tags and content", which is a very, very different thing.
>>
>> There are Python libraries that implement a whitelist approach, e.g.,
>> see the discussion here:
>>
>> http://stackoverflow.com/questions/1606201/how-can-i-make-html-safe-for-web-browser-with-python
>>
>> -- William
>>
>> P.S. Also, this worries me a little:
>>
>>     if el.tag=='script' and el.get('type')=='math/tex' and not el.get('src'):
>>
>> I wonder if there is a way to put malware into a mathjax script tag?
>
> --
> You received this message because you are subscribed to the Google Groups
> "sage-devel" group.
> To post to this group, send email to sage-devel@googlegroups.com.
> To unsubscribe from this group, send email to
> sage-devel+unsubscr...@googlegroups.com.
> Visit this group at http://groups.google.com/group/sage-devel?hl=en.

--
William Stein
Professor of Mathematics
University of Washington
http://wstein.org