The whitelist is lxml.html.defs.tags, see http://lxml.de/api/lxml.html.defs-module.html
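For illustration, here is a small sketch of where that whitelist lives and how
the Cleaner's documented defaults relate to it. This is just my own example
(the input snippet is made up), not the exact configuration in Jason's patch:

    # Sketch only: inspect lxml's tag whitelist and run a Cleaner with the
    # documented defaults plus style stripping; the input HTML is made up.
    from lxml.html import defs
    from lxml.html.clean import Cleaner

    print(sorted(defs.tags))  # the tags the Cleaner treats as "known"

    # remove_unknown_tags=True is the default, so tags outside defs.tags are
    # stripped; style=True additionally drops <style> tags and style attributes.
    cleaner = Cleaner(scripts=True, javascript=True, style=True,
                      remove_unknown_tags=True)
    print(cleaner.clean_html('<p onclick="alert(1)">hi</p><blink>x</blink>'))
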
Cleaning CSS should probably be considered a separate problem, especially since
Microsoft decided in their infinite wisdom to allow embedded javascript in CSS
files (hence the _css_javascript_re). But we use none of that since Jason's
patch explicitly removes all style tags.

On Saturday, October 6, 2012 4:59:02 PM UTC+1, William wrote:
>
> On Sat, Oct 6, 2012 at 8:19 AM, Volker Braun <vbrau...@gmail.com> wrote:
> > Before you even get to the question of black/whitelisting you have to deal
> > with malformed documents. Are your rules (black or white) going to apply
> > to subtly broken tags? I think lxml does the only sane thing here: Parse
> > the document into a valid xml document, apply rules, and then write
> > everything out into a new (valid) xml document.
> >
> > The lxml.html.clean.Cleaner class defaults to removing unknown tags and,
> > for the known tags, removing those that are troublesome. So it is
> > technically whitelisting.
>
> What is the whitelist it is using, and why is that whitelist a good
> choice? I don't see a whitelist in the lxml.html.cleaner docs.
>
> The page [1] has a publicly discussed and thought through white list
> of tags and css. In fact, [1] is much longer than the source code [2]
> of lxml's Cleaner. Moreover, reading the source of [2] gives me no
> confidence whatever that it generates something safe. e.g., this is
> not confidence inspiring to me:
>
> # This is an IE-specific construct you can have in a stylesheet to
> # run some Javascript:
> _css_javascript_re = re.compile(
>     r'expression\s*\(.*?\)', re.S|re.I)
>
> [1] http://wiki.whatwg.org/wiki/Sanitization_rules
> [2] http://lxml.de/api/lxml.html.clean-pysrc.html
>
> > The main difference to BeautifulSoup, say, is that it uses C libraries
> > to do the heavy lifting so its faster.
> >
> > On Saturday, October 6, 2012 2:07:02 PM UTC+1, William wrote:
> >> On Thu, Oct 4, 2012 at 2:50 PM, Jason Grout <jason...@creativetrax.com>
> >> wrote:
> >> > (apologies for possible multiple posts--I've sent this twice to gmane
> >> > and it hasn't appeared)
> >> >
> >> > I've implemented some sanitizing of public worksheets [1] and applied
> >> > it to demo.sagenb.org as a test. The concerns from before were that
> >> > javascript was executing on the page, leading to malware being on the
> >> > page.
> >> >
> >> > Can people test the new html sanitizing being done on demo.sagenb.org?
> >> >
> >> > If it looks good (especially to William), I'll roll it out to the other
> >> > *.sagenb.org servers and we'll turn published worksheets back on.
> >>
> >> I hate to criticize this, since I know it was a lot of work to
> >> implement. But since I'm the one who gets my network cutoff when
> >> things go wrong, I've got to take this seriously.
> >>
> >> I happen to have read straight through the book "The Tangled Web: A
> >> Guide to Securing Modern Web Applications" this summer. There are
> >> thousands of known and truly dangerous exploits that one can sneak
> >> into an HTML+CSS document. (It's amazing the sorts of crazy evil
> >> things one can do using weird character sets combined with browser
> >> implementation issues, especially if said browser is IE.) The book
> >> convincingly argues that the only way to have any hope of making a
> >> safe HTML document is to use a whitelist approach.
> >>
> >> The code you wrote uses the exact opposite approach -- the blacklist
> >> approach.
> >> Here's a typical stackoverflow discussion about exactly the approach
> >> in your code:
> >>
> >> http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/2702587#2702587
> >>
> >> Note the comment there: "Note that this uses a blacklist approach to
> >> filter out evil bits, rather than whitelist, but only a whitelisting
> >> approach can guarantee safety. – Søren Løvborg Nov 26 '11 at 21:10"
> >>
> >> Also, the lxml documentation describes the function you're using as:
> >> "Removes unwanted tags and content." Instead, the "whitelist
> >> approach" to sanitizing an HTML+CSS document is to "include wanted
> >> tags and content", which is a very, very different thing.
> >>
> >> There are Python libraries that implement a whitelist approach, e.g.,
> >> see the discussion here:
> >>
> >> http://stackoverflow.com/questions/1606201/how-can-i-make-html-safe-for-web-browser-with-python
> >>
> >> -- William
> >>
> >> P.S. Also, this worries me a little:
> >>
> >> if el.tag=='script' and el.get('type')=='math/tex' and not el.get('src'):
> >>
> >> I wonder if there is a way to put malware into a mathjax script tag?
>
> --
> William Stein
> Professor of Mathematics
> University of Washington
> http://wstein.org
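For completeness, the same Cleaner class can also be driven from an explicit
whitelist via allow_tags, which is closer to what William is asking for. The
tag list below is a deliberately small, hypothetical example, not a vetted
list like the WHATWG sanitization rules William links to:

    # Sketch of an explicit-whitelist configuration; ALLOWED_TAGS is a
    # hypothetical, deliberately short list, not a vetted whitelist.
    from lxml.html.clean import Cleaner

    ALLOWED_TAGS = ['p', 'a', 'b', 'i', 'em', 'strong', 'pre', 'code',
                    'ul', 'ol', 'li', 'br', 'div', 'span']

    whitelist_cleaner = Cleaner(
        allow_tags=ALLOWED_TAGS,
        remove_unknown_tags=False,  # must be False when allow_tags is given
        safe_attrs_only=True,       # drop attributes outside lxml's safe set
        scripts=True, javascript=True, style=True)

    print(whitelist_cleaner.clean_html('<p>ok</p><script>alert(1)</script>'))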