On Sat, Oct 6, 2012 at 6:06 AM, William Stein <wst...@gmail.com> wrote:
> On Thu, Oct 4, 2012 at 2:50 PM, Jason Grout <jason-s...@creativetrax.com>
> wrote:
>> (apologies for possible multiple posts--I've sent this twice to gmane and it
>> hasn't appeared)
>>
>> I've implemented some sanitizing of public worksheets [1] and applied it to
>> demo.sagenb.org as a test. The concerns from before were that javascript
>> was executing on the page, leading to malware being on the page.
>>
>> Can people test the new html sanitizing being done on demo.sagenb.org?
>>
>> If it looks good (especially to William), I'll roll it out to the other
>> *.sagenb.org servers and we'll turn published worksheets back on.
>
> I hate to criticize this, since I know it was a lot of work to
> implement. But since I'm the one who gets my network cutoff when
> things go wrong, I've got to take this seriously.
>
> I happen to have read straight through the book "The Tangled Web: A
> Guide to Securing Modern Web Applications" this summer. There are
> thousands of known and truly dangerous exploits that one can sneak
> into an HTML+CSS document. (It's amazing the sorts of crazy evil
> things one can do using weird character sets combined with browser
> implementation issues, especially if said browser is IE.) The book
> convincingly argues that the only way to have any hope of making a
> safe HTML document is to use a whitelist approach.
>
> The code you wrote uses the exact opposite approach -- the blacklist
> approach. Here's a typical stackoverflow discussion about exactly
> the approach in your code:
>
>     http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/2702587#2702587
>
> Note the comment there: "Note that this uses a blacklist approach to
> filter out evil bits, rather than whitelist, but only a whitelisting
> approach can guarantee safety. – Søren Løvborg Nov 26 '11 at 21:10"
>
> Also, the lxml documentation describes the function you're using as:
> "Removes unwanted tags and content." Instead, the "whitelist
> approach" to sanitizing an HTML+CSS document is to "include wanted
> tags and content", which is a very, very different thing.
>
> There are Python libraries that implement a whitelist approach, e.g.,
> see the discussion here:
>
>     http://stackoverflow.com/questions/1606201/how-can-i-make-html-safe-for-web-browser-with-python
There's a FAQ linked to from that page that nicely explains whitelist/blacklist:

    http://jacobian.org/writing/untrusted-users-and-html/

"I’ve literally seen hundreds of recipes for stripping unsafe HTML that are about as effective as a screen door on a submarine."

William

>
> -- William
>
> P.S. Also, this worries me a little:
>
>     if el.tag=='script' and el.get('type')=='math/tex' and not el.get('src'):
>
> I wonder if there is a way to put malware into a mathjax script tag?

--
William Stein
Professor of Mathematics
University of Washington
http://wstein.org
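
To make the whitelist/blacklist distinction in this thread concrete, here is a minimal sketch of a whitelist filter using only Python's standard `html.parser` module. It is not the sagenb code and not a vetted sanitizer: the names `ALLOWED_TAGS`, `ALLOWED_ATTRS`, `DROP_CONTENT`, `WhitelistSanitizer`, and `sanitize` are hypothetical, and the allowed sets are chosen only for illustration. Anything not explicitly listed is dropped, which is the opposite of removing a list of known-bad tags.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist, for illustration only: everything else is dropped.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "pre", "code", "a"}
ALLOWED_ATTRS = {"a": {"href"}}            # per-tag attribute whitelist
ALLOWED_SCHEMES = ("http://", "https://")  # only these URL prefixes survive in href
DROP_CONTENT = {"script", "style"}         # drop these tags and the text inside them


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag itself, keep its escaped text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # e.g. onclick, style, id are all dropped
            value = value or ""
            if name == "href" and not value.strip().lower().startswith(ALLOWED_SCHEMES):
                continue  # e.g. javascript: URLs are dropped
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    """Return html_text with everything outside the whitelist removed."""
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<p onclick="evil()">Hi <b>there</b></p><script>alert(1)</script>'))
```

This sketch does not balance tags, handle odd character sets, or address the browser quirks mentioned above, so for anything public-facing one of the maintained whitelist libraries from the linked discussion is the safer choice.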