A while ago I posted a proposal to add support for tainted variables
to PHP, to alert programmers at run-time when they make the common
mistake of using uncleansed input with include, echo, system, open,
etc. A copy of the original post can be found on-line, for example,
at http://marc.info/?l=php-internals&m=116621380305497&w=2

After working on this over the summer in bursts, I now have a first
implementation that adds taint support to the core engine and to
a selection of built-ins and extensions. Not surprisingly, the
initial plan needed to be updated in the light of experience.  The
good news is that performance is better than I hoped it would be.

A quick example
---------------

To give an idea of the functionality, consider the following program
with an obvious HTML injection bug:

    <?php
    $username = $_GET['username'];
    echo "Welcome back, $username\n";
    ?>

With default .ini settings, this program does exactly what the
programmer wrote: it echos the contents of the username request
attribute, including all the malicious HTML code that an attacker
may have supplied along with it.

When I change one .ini setting:

    taint_error_level = E_WARNING

the program produces the same output, but it also produces a warning:

    Warning: echo(): Argument contains data that is not converted
    with htmlspecialchars() or htmlentities() in /path/to/script
    on line 3

Changing E_WARNING into E_ERROR causes execution to terminate, as
one would expect.  These and other settings can appear in php.ini,
on the PHP command line, or they can be made with ini_set() calls
in the application itself.

Current status
--------------

My plan is to release something in the next few days, and to go
through a several iterations while adjusting the course based on
the feedback I receive.

The performance is good: the overhead for "make test" is within the
measurement error of 1-2% (when comparing user-mode CPU time). I
do realize that a portion of the total "make test" time is spent
in non-PHP processing, but with the current numbers it doesn't
really matter if taint overhead is 1% or 2%.  If someone has a
better "macro" benchmark, then of course I'm interested.

Although the code is still green, its present form already shows
potential to do things like labeling privacy sensitive data (social
security or credit card numbers) and preventing such data from
flying out the window via HTML web pages.


Multiple flavors of taint
-------------------------

As many know, a PHP program should use htmlspecialchars() or
htmlentities() before rendering data as HTML output.  And before
rendering data in other "output" like contexts, a program should
use the proper conversion function, too: escapeshellcmd() etc.
for shell commands, mysqli_real_escape_string() for mysqli queries.
These conversions are not only needed for security reasons, they
are also required for robustness; shell or SQL commands may fail
unexpectedly when given a legitimate name such as O'Reilly.

What I have implemented encourages programmers to do exactly that:
using the RIGHT conversion function. To achieve this goal, the code
implements multiple flavors of taint.

In my initial post I proposed black-and-white taint, but I changed
my mind for human interface reasons.  To help programmers eliminate
security holes from their code, they need to be told how to fix the
problem. With black-and-white taint, the system just didn't know
if the programmer had chosen the right conversion function. The
tool was too simplistic for PHP's complex environment.

Low-level implementation
------------------------

Taint support is implemented with some of the unused bits in the
zval data structure. The zval is the PHP equivalent of a memory
cell. Besides a type (string, number etc.) and value, each zval has
a reference count and a flag that says whether the zval is a reference
to yet another zval that contains the actual value.

Right now I am using seven bits, but there is room for more: 32-bit
UNIX compilers such as GCC add 16 bits of padding to the current
zval data structure, and this amount isn't going to be smaller on
64-bit architectures.  If I really have to squeeze the taint bits
in-between the existing bits, the taint support performance hit
goes up to perhaps 2% in the macro benchmarks that I mentioned
above (but again, this number is barely above the 1-2% measurement
error, so don't take it too literally).

The preliminary configuration user interface is rather low-level,
somewhat like MS-DOS file permissions :-(  This is good enough for
testing and debugging the taint support itself, but I would not
want to have wires hanging out of the machine like this forever.
The raw bits will need to be encapsulated so that applications can
work with meaningful names and abstractions.

To give an idea of what the nuts and bolts look like, this is the
preliminary list of bits, or should I say: binary properties,
together with the parameters that control their handling:

- TC_HTML: 
    - By default, data with this bit is not allowed in HTML output
      (parameter: taint_checks_html). This requirement is not
      enforced with the default setting of taint_error_level = 0.
    - The htmlspecialchars() and htmlentities() functions produce
      output without this bit.
    - By default, this bit is set on all data from the web (parameter:
      taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms)
      or from elsewhere (parameter: taint_marks_other).

- TC_SHELL: 
    - By default, data with this bit is not allowed in shell commands
      (parameter: taint_checks_shell). This requirement is not
      enforced with the default setting of taint_error_level = 0.
    - The escapeshellarg() and escapeshellcmd() functions produce
      output without this bit.
    - By default, this bit is set on all data from the web (parameter:
      taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms)
      or elsewhere (parameter: taint_marks_other).

- TC_MYSQL: 
    - By default, data with this bit is not allowed in mysql_query()
      (parameter: taint_checks_mysql). This requirement is not
      enforced with the default setting of taint_error_level = 0.
    - The mysql_real_escape_string() function produces output without
      this bit.
    - By default, this bit is set on all data from the web (parameter:
      taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms) or
      elsewhere (parameter: taint_marks_other).

- TC_MYSQLI:
    - Like TC_MYSQL, except this one is for the mysqli_query() API.
      The TC_MYSQLI bit is removed with mysqli_real_escape_string().

- TC_SELF: 
    - By default, data with this bit cannot be used for internal
      control operations such as eval, include, as file name argument,
      as network destination, or in other contexts where someone could
      take away control from the application (parameter:
      taint_checks_self). This requirement disappears with
      taint_error_level = 0.
    - I haven't yet implemented a dedicated conversion function;
      to shut up the warnings, this data needs to be marked as
      "safe" with a low-level untaint($foo, TC_SELF) call.
    - By default, this bit is set on all data from the web (parameter:
      taint_marks_egpcs).

- TC_USER1, TC_USER2: These are labels that an application can set
  on specific data.  For example, it could set these bits when credit
  card or social security numbers come out of a database.  The
  taint_checks_html policy for HTML output (see above) would then
  be configured to disallow data with not only with the TC_HTML
  property, but also with TC_USER1 or TC_USER2. Obviously some
  polished user interface would need to be built on top of this to
  make application-defined attributes usable.

Taint propagation policy
------------------------

Before implementing the above policies, the first order of business
was adding taint propagation to the PHP core: for each operator,
including type conversion, a decision had to made how to propagate
taint from source operands to results.

The general taint propagation rules are:

- Pure arithmetic and pure string operations propagate all the taint
  bits from their operands to their results. The rules become more
  complicated with operators whose operands have different types.

- Conversions from string to number or to boolean remove all but a
  few taint bits (by default, only the TC_SELF bit stays).

  This prevents silly warnings about having to use htmlspecialchars()
  or mysql_real_escape_string() when rendering non-string data in
  SQL/HTML/shell context, while still protecting the application
  against some control hijacking attacks.

- Conversions from number or boolean to string preserve all the
  taint bits.

- Comparison operators currently ignore taint bits.

Most of this taint propagation is done, but there are a few minor
issues that still need to be resolved.

- Something needs to be done when functions like parse_str() are
  given tainted data: the question is how to represent the taintedness
  of the resulting hash table lookup keys. These strings could be
  harmful when used in file names, in database names, or in other
  critical information.

- Something needs to be done about null strings; it would be silly
  to insist that data composed from a null string be converted with
  htmlspecialchars() etc.  On the other hand, a null string does
  change the syntactical structure of information, so we have to
  be careful.

While adding taint propagation I found that a lot of PHP source
code fails to use the official macros when initializing a zval. In
these cases I added another line of code to initialize the taint
bits by hand.  Also, more internal documentation (other than empty
man page skeletons) could have reduced development time significantly.

Major loose ends
----------------

I already mentioned the loose wires hanging out of the machine; the
user interface for taint policy control will need to be made more
suitable for people who aren't primarily interested in PHP core
hacking.

Support for tainted objects is still incomplete. In particular,
conversions from objects to non-objects may lose taint bits.

For now, I manually added taint support to a number of standard
built-ins (file, process, *scanf, *printf, and a subset of the
string functions) and extensions (mysql, mysli). I hope this will
be sufficient to get some experience with taint support.

Right now, taint-unaware extensions will work properly as long as
taint checks are disabled (the default), and as long as they are
recompiled with the patched PHP header files. When taint checking
is turned on, some extensions may cause false alarms when they fail
to use the official macros to initialize zval structures, thereby
leaving some taint bits at uninitialized values.

I still hope that it will somehow be possible to annotate extensions
so that taint support can be added without modifying lots of extension
source code. However, having multiple flavors of taint, instead of
just one, will make the job so much more interesting.

Distant future
--------------

Currently, only data is labeled (currently, only with binary
attributes). No corresponding attributes exist for sources and sinks
(files, network, databases, etc.).  If we knew that a connection
is encrypted, or whether something is an intranet or extranet
destination, or some other property, then we could implement more
sophisticated policies than the simple MS-DOS like file permissions
that I have implemented now.

But all this is miles beyond the immediate problem that I am trying
to solve today: helping programmers find the holes in their own code
before other people do.

        Wietse

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php

Reply via email to