Please allow me to introduce myself briefly: I'm Wietse Venema from
IBM Research, also known as the creator of the open source Postfix
mail system, co-author of The Coroner's Toolkit and SATAN, and the
original author of TCP Wrapper. That doesn't mean everything I touch
becomes gold; if it did then I'd be a very rich man now :-)

This is a proposal to add basic Perl/Ruby-like tainting support to
PHP: an option that is turned off by default, and that programmers
may turn on at runtime to alert them when they make the common
mistake of using uncleansed input with include, echo, system, open,
etc.  This would work with unmodified third-party extensions.

Taint support is first of all an education tool. When an alert is
raised that data needs cleansing, the programmer needs to make a
conscious decision. It's their job to choose the right method.  I'll
discuss below why I think PHP shouldn't make the decision for them.

Taint support is not a sandbox; a malicious PHP script can still
open a pipe to a shell process and feed uncleansed commands to it.
Taint support can be an ingredient to build a sandbox, but that
involves lots more. See for example the Ruby reference at the end.

Of course when overhead is low enough, people might want to turn
on taint checks in production, to implement a multi-layer defense.
Wise people know that no single layer provides perfect protection.
People already do this with other scripting languages.

Last month I did a proof of concept implementation to raise an alert
when raw network data is passed to a few PHP primitives (echo/print,
system/exec, eval/include, open stream, and manipulate file/directory).

If this proposal survives review, then I could spend a chunk of
time in 2007 working out more details and doing a full production
quality implementation.

Before I discuss my taint proposal details I first want to explain
why I see problems with other approaches.

Why not have PHP automatically detect and cleanse malicious data?
=================================================================

In preparation for this I studied several years' worth of literature
about making PHP etc. applications "secure" (see references at the
end).  I found that many researchers have worked on systems that
try to secure PHP etc. applications without changing PHP script
source code.

These systems explicitly permit the use of uncleansed data in
html/shell/sql commands. When an html/shell/sql request contains
"forbidden" characters in certain positions, these systems either
"neuter" the request before execution, or they abort the request.
In the normal case where requests contain benign inputs only, no-one
will ever notice that these protections exist, except perhaps by
their lack of responsiveness.

I see many problems with systems that do automatic data cleansing.
The main problems are not technical but psychological:

- Education: automatic cleansing systems don't make programmers
  aware that network data is inherently untrustworthy. Instead,
  they teach the exact opposite: don't worry about data hygiene.
  This of course means they will get bitten elsewhere anyway.

- Expectation: automatic cleansing systems have to be perfect. If
  the safety net catches some but not all cross-site scripting or
  SQL injection attacks, then the system has a security hole and
  people lose confidence. This gives security a bad reputation.

These two problems are inherent to all solutions that automatically
fix security problems for the programmer. They encourage programmers
to write sloppy code. I want to help them write better code instead.

But wait, there is more :-) There are technical problems, too:

- Overhead: as strings are sliced, diced, and tossed around, the
  automatic cleansing safety net has to keep track of exactly which
  characters in a substring are derived from untrusted input, and
  which characters are not, so that the safety net can later recognize
  malicious content in the middle of html/shell/sql/etc.  commands.

- More overhead: special-purpose code is needed in all functions
  and all primitives that execute html/shell/sql/etc.  commands.
  This code is needed because each context has a different definition
  of what is "malicious" content in the middle of a request.

- Collateral damage: providers of PHP extensions need to implement
  their own special-purpose code to detect or neuter untrusted
  substrings in inputs, or to mark untrusted substrings in return
  values, otherwise the safety net is incomplete, and the PHP
  extension introduces a security hole.

Compared to this, the run-time overhead of maintaining and testing
taint bits in PHP is minuscule, if my experiences with the prototype
are meaningful.
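
To make the bookkeeping difference concrete, here is a small sketch
in Python (used only as a modeling language; none of this is PHP
engine code, and all function names are made up for illustration).
It contrasts a per-character taint mask, as an automatic cleansing
system would need, with a single taint bit per value.

```python
def concat_per_char(a, b):
    # Each value is (text, mask); the mask holds one flag per character.
    return a[0] + b[0], a[1] + b[1]


def substr_per_char(v, start, length):
    # Every slicing operation must slice the mask as well as the text.
    text, mask = v
    return text[start:start + length], mask[start:start + length]


def concat_per_value(a, b):
    # Each value is (text, tainted); one bit covers the whole string.
    return a[0] + b[0], a[1] or b[1]


def substr_per_value(v, start, length):
    # The single bit is simply copied along with the result.
    text, tainted = v
    return text[start:start + length], tainted


prefix = "SELECT * FROM t WHERE id="
user = "1 OR 1=1"                       # pretend this came from the network

precise = concat_per_char((prefix, [False] * len(prefix)),
                          (user, [True] * len(user)))
coarse = concat_per_value((prefix, False), (user, True))

head_precise = substr_per_char(precise, 0, 6)   # "SELECT" with six clean flags
head_coarse = substr_per_value(coarse, 0, 6)    # "SELECT", tainted as a whole
```

The per-character variant carries a mask as long as the string
through every operation. The per-value variant is cheap but coarse:
even the literal "SELECT" counts as tainted once it has been joined
with untrusted input.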

A first look at PHP taint support
=================================

The general idea is to mark certain external inputs as tainted (ex:
network, file), and to disallow the use of tainted data with certain
operations that change PHP's own state (ex: include, eval), or that
access or modify external state (ex: create/open/remove file; connect
to server; generate HTML; execute shell command; execute database
command).

The exact definitions of what inputs are tainted and what operations
are sensitive will need to be made more precise later (lots of
opportunity for splitting fine hairs).  For now I want to focus on
the general mechanism.

The following is a high-level view of what would happen when taint
checking is turned on at run-time:

- Each ZVAL is marked tainted or not tainted (i.e.  we don't taint
  individual characters within substrings). Black and white is all.
  In some future, someone may want to explore the possibility of
  more than two shades. But not now.

- Primitives and functions such as echo, eval, or mysql_query are
  not allowed to receive tainted input. When this happens the script
  terminates with a run-time error.  It is a bad idea for software
  to continue after a security violation.

- PHP propagates taintedness across expressions.  If an input to
  an expression is tainted, then the result of that expression is
  tainted too. There are exceptions to this rule: these are called
  sanitizers, as discussed next.

- The PHP application programmer untaints data by explicit assignment
  with an untainted value.  For example, the result from htmlentities()
  or mysql_real_escape_string() is not tainted. People could apply
  the wrong sanitizer if they really want to. Remember, the purpose
  is to help programmers by telling them what data needs cleansing.  It
  is up to them to make the right decision.  If we wanted to force
  the use of the "right" sanitizer then we would need multiple
  shades of untaintedness. This would not be practical.
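
As a rough illustration, the four rules above can be modeled in a
few lines of Python. This is a sketch, not the prototype's code:
the Value class, concat(), html_escape() and echo() are hypothetical
stand-ins for a zval, string concatenation, htmlentities() and PHP's
echo.

```python
import html


class TaintError(Exception):
    """Raised when tainted data reaches a sensitive operation."""


class Value:
    """Stand-in for a PHP zval: some data plus one taint bit."""

    def __init__(self, data, tainted=False):
        self.data = data
        self.tainted = tainted


def concat(a, b):
    # Propagation: the result is tainted if either input is tainted.
    return Value(a.data + b.data, a.tainted or b.tainted)


def html_escape(v):
    # Sanitizer: the result is untainted regardless of the input.
    return Value(html.escape(v.data), tainted=False)


def echo(v, out):
    # Sensitive operation: stop with an error rather than continue.
    if v.tainted:
        raise TaintError("tainted data passed to echo")
    out.append(v.data)


out = []
name = Value("<b>Bob</b>", tainted=True)     # pretend this came from the network
greeting = concat(Value("Hello, "), name)    # taint propagates to the result

try:
    echo(greeting, out)
    blocked = False
except TaintError:
    blocked = True                           # the alert the programmer would see

safe = concat(Value("Hello, "), html_escape(name))
echo(safe, out)                              # accepted after explicit sanitizing
```

Note that the model accepts any sanitizer result as untainted. It
does not check that the sanitizer matches the context, which is the
same trade-off as described in the last bullet.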

That's the high-level view. At a lower level we make trade-offs
between usability, implementation cost, maintenance cost, run-time
cost, and more.  In particular, my goal is to support taint with
third-party extensions, but without changes to the source code of
those extensions.

Thus, compromises need to be found, and some perfection needs to
be sacrificed. I'm aiming for "useful" instead of "impossible".

Categorizing the functions and primitives
=========================================

Before we get into the nitty-gritty of tainting and untainting we
first need to categorize functions and primitives according to their
inputs and results.

- Some functions or primitives are not allowed to receive tainted
  input; the script is terminated instead. These functions or
  primitives are called sensitive.

- Some functions or primitives always return an untainted result.
  These functions or primitives are called sanitizers.

- Some functions or primitives return a tainted result only when
  their input is tainted.  For example, if $x is tainted, then the
  return value of substr($x, whatever) is tainted too. I'm still
  looking for a good name for this function or primitive category
  ("permeable"? The term should appeal to non-English speakers,
  including myself).

- Some functions or primitives always return tainted results. For
  example, results from mysql_query must be sanitized before use
  in a sensitive context such as echo or print, or in an SQL query(!);
  this prevents "stored cross-site scripting" and SQL injection
  problems.  I'm still looking for a good name for this function
  or primitive category (tainter?).

mysql_query is an example of a function that is at the same time
sensitive and a tainter: it can receive only untainted input, and
its results are always tainted.

If we want to support taint checking with third-party PHP extensions,
but without changes to their source code, then all we need at this
point is a table that says which functions are sensitive and/or
tainters, and so on.  If no information is available about a specific
function we could assume the worst: when taint checking is turned
on at runtime, an uncategorized function would not be allowed to
receive tainted inputs, and its results would always be tainted.
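
A table of this kind could look roughly like the following Python
sketch. The table format, the Category fields and the individual
entries are assumptions made for illustration; the real
classifications would need careful review. Functions missing from
the table fall back to the worst case described above.

```python
from dataclasses import dataclass


class TaintViolation(Exception):
    """Raised when a sensitive function receives tainted input."""


@dataclass(frozen=True)
class Category:
    sensitive: bool = False   # must not receive tainted input
    tainter: bool = False     # result is always tainted
    sanitizer: bool = False   # result is always untainted


# Hypothetical entries, keyed by function name.
CATEGORIES = {
    "echo": Category(sensitive=True),
    "htmlentities": Category(sanitizer=True),
    "substr": Category(),                       # tainted only if input is
    "mysql_query": Category(sensitive=True, tainter=True),
}

# Uncategorized functions: no tainted input allowed, result always tainted.
WORST_CASE = Category(sensitive=True, tainter=True)


def result_is_tainted(name, any_input_tainted):
    """Check one call and return the taint bit of its result."""
    cat = CATEGORIES.get(name, WORST_CASE)
    if cat.sensitive and any_input_tainted:
        raise TaintViolation(f"{name}() received tainted input")
    if cat.sanitizer:
        return False
    if cat.tainter:
        return True
    return any_input_tainted
```

For example, result_is_tainted("substr", True) returns True, while
any call to a function that is missing from the table raises
TaintViolation as soon as one of its inputs is tainted.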

All this categorization of functions makes no difference to how
applications run, until someone actually turns on taint checking
at run-time.  It is therefore completely backwards compatible, even
when the function category information is incorrect or incomplete.

Setting taint
=============

Obviously, data from the network and from databases needs to be
marked as tainted. Less obvious is whether we want to go as far as
Perl and also taint the current directory and a bunch of other things.
More study is needed. Although these decisions will greatly impact
usability, they can be made later, because they don't really affect
the over-all design or implementation of the system.

Propagating taint
=================

At this level things start to become interesting. Within the PHP
core we have a large amount of control over how taint is handled,
so we can spend a lot of time splitting fine hairs. With third-party
extensions we are very much limited by the requirement that taint
checking must not require source code changes to those extensions.

- Within the PHP core, taint propagation can be fine-grained. For
  example, we could decide that substr(string, start, length) returns
  a tainted result only if the string argument is tainted; a tainted
  start or length argument would not affect the taintedness of the
  result. For comparison, Perl even taints the result when the start
  argument is tainted; with Ruby, numbers can't be tainted, so the
  issue with tainted start or length values does not even exist.
  This is just one example of splitting fine hairs.

- With PHP extensions, taint detection and propagation can happen
  only at the function interface (the functions themselves aren't
  taint-aware, unless their implementor went through the effort).
  If an extension function is allowed to receive tainted input,
  then its return value is also tainted unless the function is
  classified as a sanitizer. With extensions that receive complex
  arguments such as arrays, there is an increased cost as the PHP
  core inspects each sub-element for taintedness before making the
  actual function call. A similar cost exists with extensions that
  return complex tainted results, as the PHP core needs to mark
  each sub-element as tainted. These issues may disappear over time
  as implementors adopt taint support into their extensions.
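
The function-interface approach for extensions can be sketched as
a wrapper around the call. The Python model below is an assumption
about how this might be structured, not a description of the
prototype: TaintedStr, contains_taint(), set_taint() and
call_extension() are hypothetical names. Only strings carry taint
here, array keys are not inspected, and tainters are not modeled.

```python
class TaintedStr(str):
    """Marker type: a string whose taint bit is set."""


def contains_taint(value):
    # Walk nested arrays; this is the per-call inspection cost.
    if isinstance(value, TaintedStr):
        return True
    if isinstance(value, dict):
        return any(contains_taint(v) for v in value.values())
    if isinstance(value, list):
        return any(contains_taint(v) for v in value)
    return False


def set_taint(value, tainted):
    # Rebuild the value with every string marked or unmarked.
    if isinstance(value, str):
        return TaintedStr(value) if tainted else str(value)
    if isinstance(value, dict):
        return {k: set_taint(v, tainted) for k, v in value.items()}
    if isinstance(value, list):
        return [set_taint(v, tainted) for v in value]
    return value


def call_extension(func, *args, sanitizer=False):
    # The extension itself knows nothing about taint; the core checks
    # the arguments before the call and marks the result afterwards.
    tainted_in = any(contains_taint(a) for a in args)
    result = func(*args)
    return set_taint(result, tainted_in and not sanitizer)


def ext_upper(params):
    # Pretend third-party extension function, unaware of taint.
    return {"upper": params["q"].upper(), "length": len(params["q"])}


dirty = call_extension(ext_upper, {"q": TaintedStr("abc")})
clean = call_extension(ext_upper, {"q": "abc"})
```

In this model dirty["upper"] comes back marked as tainted even
though ext_upper() returned a plain string, which is the point of
doing the work at the interface.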

Removing taint
==============

This last section gives just a few examples; there are a lot of
detail issues that will need to be considered before we have a
complete system.

- Removing taintedness requires an explicit assignment.  For example,
  assigning the result from a sanitizer produces an untainted result.

- Some implementations (Perl) propagate taint across, for example,
  string<->number conversions; other implementations consider the
  result not tainted (Ruby). More fine hair splitting.

- Testing a variable does not untaint it.  For example, if $x is
  tainted, the following does not change its taint status:

    if (some test involving $x) { $x is still tainted here }

  The reason for this is entirely practical: we can't reliably
  determine if the intention of the test is to sanitize input.

- Testing a tainted variable does not taint the result of expressions
  that follow the test.  For example, if $y is tainted, the following
  code does not taint $x:

    if ($y == 0) { $x = 0; } else { $x = 1; }

  There is one well-known example where such a strategy breaks down,
  and that is the degenerate case where a long chain of if/else if/...
  statements is used instead of a lookup table:

    if ($y == 0) { $x = 0; } else if ($y == 1) { $x = 1; } else ...

  Remember that my purpose is to help the programmer; my purpose
  is not fighting programmers who insist on writing bad code. After
  all, they can always open a pipe to a shell process and execute
  uncleansed commands there.
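
  The two test-related rules, and the degenerate case, can be shown
  with a small Python model. Value, is_zero_flag() and copy_by_chain()
  are hypothetical names chosen for this sketch.

```python
class Value:
    """Stand-in for a PHP zval: some data plus one taint bit."""

    def __init__(self, data, tainted=False):
        self.data = data
        self.tainted = tainted


def is_zero_flag(y):
    # Mirrors: if ($y == 0) { $x = 0; } else { $x = 1; }
    # The test reads y, but x is assigned untainted constants only.
    if y.data == 0:
        return Value(0)
    return Value(1)


def copy_by_chain(y, candidates):
    # Degenerate case: a chain of equality tests used as a lookup table
    # reproduces y's data without the result ever becoming tainted.
    for c in candidates:
        if y.data == c:
            return Value(c)
    return Value(None)


y = Value(7, tainted=True)
flag = is_zero_flag(y)              # untainted, although y was tested
copy = copy_by_chain(y, range(10))  # same data as y, but untainted
```

  After this runs, y is still tainted, flag is untainted, and copy
  holds the same data as y without the taint bit.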

This is just the beginning of a long list of things that need to
be looked into, but I will stop here for now.

Conclusion
==========

This is a proposal to add run-time taint checking to PHP. It is not
a sandbox for the execution of hostile code. It is just a tool to
help programmers find places where they need to sanitize data.  It
avoids changes to third-party extensions, and is turned off by
default.  It is therefore completely backwards compatible.

Last month I did a proof of concept implementation that protects a
few PHP primitives. If this proposal survives review, then I could
spend a chunk of time in 2007 working out the details and doing a
full production quality implementation.

References
==========

The Perl taint feature has been an example for many other efforts.
http://perldoc.perl.org/perlsec.html

Ruby implements multiple levels of taint checking. The lowest level
is like Perl. The highest level is a sandbox where code can neither
create nor modify untainted objects, read/write files or sockets,
etc. At this level, my claim that "the programmer can always open
a pipe to a shell and execute uncleansed commands there" is no
longer true.  http://www.rubycentral.com/book/taint.html

Wei Xu, Sandeep Bhatkar, R. Sekar: "Taint-Enhanced Policy Enforcement:
A Practical Approach to Defeat a Wide Range of Attacks". 15th USENIX
Security Symposium, Vancouver, BC, Canada, August 2006.
http://seclab.cs.sunysb.edu/sandeep/pubs/papers/taint_usenixsec06.pdf

    Source-to-source transformation to instrument source code with
    taint tracking. They use a rule-based policy to disallow tainted
    metacharacters in shell/sql commands, format strings, html tags
    etc.  Modest overhead.

Alex Ho, Michael Fetterman, Christopher Clark, Andrew Warfield, and
Steven Hand: "Practical Taint-Based Protection using Demand Emulation".
EuroSys 2006, Leuven, Belgium, April 2006.
http://www.cl.cam.ac.uk/~akw27/papers/taint-eurosys06.pdf
http://www.cs.kuleuven.ac.be/conference/EuroSys2006/papers/p29-ho.pdf

    This work uses Xen virtualisation for code that handles untainted
    data, and switches to Qemu emulation for code that touches
    tainted data.  Unlike anyone else they also taint individual
    disk blocks, as the process writes tainted data to file.

J. Newsome and D. Song: "Dynamic Taint Analysis: Automatic Detection,
Analysis, and Signature Generation of Exploit Attacks on Commodity
Software".  Network and Distributed Systems Security Symposium,
February 2005.  http://www.ece.cmu.edu/~dawnsong/papers/taintcheck.pdf

    Binary-to-binary translation with Valgrind. Significant overhead.

A. Nguyen-Tuong, S. Guarnieri, D. Greene, J. Shirley, and D. Evans.
"Automatically hardening web applications using precise tainting."
20th IFIP International Information Security Conference, 2005.
http://www.cs.virginia.edu/evans/pubs/infosec05.pdf

    Precise taint tracking using a modified PHP engine. Tainted
    data is forbidden (ex: SQL queries, system functions) or sanitized
    (ex: HTML output).  Data from SQL database is considered tainted.
    Careful parsing of SQL, HTML to prevent command injection and
    cross-site scripting via tainted operators or tags. Low overhead.

Tadeusz Pietraszek, Chris Vanden Berghe: "Defending Against Injection
Attacks Through Context-Sensitive String Evaluation".  Recent
Advances in Intrusion Detection (RAID), 2005.  
http://chris.vandenberghe.org/publications/csse_raid2005.pdf

    Precise taint tracking using a modified PHP engine.  Taint-awareness
    requires modifications to built-ins and to extensions. Careful
    parsing of SQL etc. to prevent command injection. Modest overhead.

Yao-Wen Huang, Fang Yu, Christian Hang, Chung-Hung Tsai, D. T. Lee,
Sy-Yen Kuo: "Securing Web Application Code by Static Analysis and
Runtime Protection". Proceedings of the 13th international conference
on World Wide Web (May 2004).
http://www2004.org/proceedings/docs/1p40.pdf 

    Hybrid system: a static source analyzer identifies code that
    may be vulnerable, then inserts sanitizer code that appears to
    be missing.  Very low overhead.
