On Sat, Oct 6, 2012 at 8:19 AM, Volker Braun <vbraun.n...@gmail.com> wrote:
> Before you even get to the question of black/whitelisting you have to deal
> with malformed documents. Are your rules (black or white) going to apply to
> subtly broken tags? I think lxml does the only sane thing here: Parse the
> document into a valid xml document, apply rules, and then write everything
> out into a new (valid) xml document.
>
> The lxml.html.clean.Cleaner class defaults to removing unknown tags and, for
> the known tags, removing those that are troublesome. So it is technically
> whitelisting.

What whitelist is it using, and why is that whitelist a good
choice?  I don't see a whitelist in the lxml.html.clean docs.

The page [1] has a publicly discussed and carefully thought-through
whitelist of tags and CSS.  In fact, [1] is much longer than the source
code [2] of lxml's Cleaner.  Moreover, reading the source [2] gives me
no confidence whatsoever that it generates something safe.  For
example, this does not inspire confidence:

    # This is an IE-specific construct you can have in a stylesheet to
    # run some Javascript:
    _css_javascript_re = re.compile(
        r'expression\s*\(.*?\)', re.S|re.I)


[1]  http://wiki.whatwg.org/wiki/Sanitization_rules
[2]  http://lxml.de/api/lxml.html.clean-pysrc.html
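
For contrast, here is a rough sketch of what a whitelist-style filter looks
like, using only Python's standard html.parser. Everything in it is an
assumption made for illustration: the names ALLOWED_TAGS, ALLOWED_ATTRS,
DROP_CONTENT, WhitelistSanitizer and sanitize are hypothetical, the lists are
far shorter than the ones in [1], and it does not balance tags or handle
CSS or URLs. It is not a vetted sanitizer, just the shape of the idea:
anything not explicitly listed is dropped.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately tiny whitelists (not taken from [1]).
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "code", "pre"}
ALLOWED_ATTRS = {"title"}
# Elements whose content is discarded along with the tags themselves.
DROP_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag, keep its text
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in ALLOWED_ATTRS
        )
        self.out.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # text is always re-escaped


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="evil()">hi <b>there</b><script>alert(1)</script></p>'))
```

The onclick attribute and the script element disappear not because the code
knows they are dangerous, but because nothing put them on the list.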


> The main difference to BeautifulSoup, say, is that it uses C
> libraries to do the heavy lifting so it's faster.



>
> On Saturday, October 6, 2012 2:07:02 PM UTC+1, William wrote:
>>
>> On Thu, Oct 4, 2012 at 2:50 PM, Jason Grout <jason...@creativetrax.com> wrote:
>> > (apologies for possible multiple posts--I've sent this twice to gmane and
>> > it hasn't appeared)
>> >
>> > I've implemented some sanitizing of public worksheets [1] and applied it to
>> > demo.sagenb.org as a test.  The concerns from before were that javascript
>> > was executing on the page, leading to malware being on the page.
>> >
>> > Can people test the new html sanitizing being done on demo.sagenb.org?
>> >
>> > If it looks good (especially to William), I'll roll it out to the other
>> > *.sagenb.org servers and we'll turn published worksheets back on.
>>
>> I hate to criticize this, since I know it was a lot of work to
>> implement.  But since I'm the one who gets my network cut off when
>> things go wrong, I've got to take this seriously.
>>
>> I happen to have read straight through the book "The Tangled Web: A
>> Guide to Securing Modern Web Applications" this summer.   There are
>> thousands of known and truly dangerous exploits that one can sneak
>> into an HTML+CSS document.  (It's amazing the sorts of crazy evil
>> things one can do using weird character sets combined with browser
>> implementation issues, especially if said browser is IE.)   The book
>> convincingly argues that the only way to have any hope of making a
>> safe HTML document is to use a whitelist approach.
>>
>> The code you wrote uses the exact opposite approach -- the blacklist
>> approach.    Here's a typical stackoverflow discussion about exactly
>> the approach in your code:
>>
>>
>> http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/2702587#2702587
>>
>> Note the comment there: "Note that this uses a blacklist approach to
>> filter out evil bits, rather than whitelist, but only a whitelisting
>> approach can guarantee safety. – Søren Løvborg Nov 26 '11 at 21:10"
>>
>> Also, the lxml documentation describes the function you're using as:
>> "Removes unwanted tags and content."    Instead, the "whitelist
>> approach" to sanitizing an HTML+CSS document is to "include wanted
>> tags and content", which is a very, very different thing.
>>
>> There are Python libraries  that implement a whitelist approach, e.g.,
>> see the discussion here:
>>
>>
>> http://stackoverflow.com/questions/1606201/how-can-i-make-html-safe-for-web-browser-with-python
>>
>>  -- William
>>
>> P.S. Also, this worries me a little:
>>
>>    if el.tag=='script' and el.get('type')=='math/tex' and not el.get('src'):
>>
>> I wonder if there is a way to put malware into a mathjax script tag?
>
> --
> You received this message because you are subscribed to the Google Groups
> "sage-devel" group.
> To post to this group, send email to sage-devel@googlegroups.com.
> To unsubscribe from this group, send email to
> sage-devel+unsubscr...@googlegroups.com.
> Visit this group at http://groups.google.com/group/sage-devel?hl=en.
>
>



-- 
William Stein
Professor of Mathematics
University of Washington
http://wstein.org
