A while ago I posted a proposal to add support for tainted variables to PHP, to alert programmers at run-time when they make the common mistake of using uncleansed input with include, echo, system, open, etc. A copy of the original post can be found on-line, for example, at http://marc.info/?l=php-internals&m=116621380305497&w=2
After working on this over the summer in bursts, I now have a first implementation that adds taint support to the core engine and to a selection of built-ins and extensions. Not surprisingly, the initial plan needed to be updated in the light of experience. The good news is that performance is better than I hoped it would be.

A quick example
---------------

To give an idea of the functionality, consider the following program with an obvious HTML injection bug:

    <?php
    $username = $_GET['username'];
    echo "Welcome back, $username\n";
    ?>

With default .ini settings, this program does exactly what the programmer wrote: it echoes the contents of the username request attribute, including all the malicious HTML code that an attacker may have supplied along with it.

When I change one .ini setting:

    taint_error_level = E_WARNING

the program produces the same output, but it also produces a warning:

    Warning: echo(): Argument contains data that is not converted with htmlspecialchars() or htmlentities() in /path/to/script on line 3

Changing E_WARNING into E_ERROR causes execution to terminate, as one would expect. These and other settings can appear in php.ini, on the PHP command line, or they can be made with ini_set() calls in the application itself.

Current status
--------------

My plan is to release something in the next few days, and to go through several iterations while adjusting the course based on the feedback I receive.

The performance is good: the overhead for "make test" is within the measurement error of 1-2% (when comparing user-mode CPU time). I do realize that a portion of the total "make test" time is spent in non-PHP processing, but with the current numbers it doesn't really matter if taint overhead is 1% or 2%. If someone has a better "macro" benchmark, then of course I'm interested.
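Going back to the quick example: the fix that the warning asks for could look like the sketch below. It runs on a stock PHP without the taint patch. The helper name welcome_message() and the hard-coded stand-in for $_GET['username'] are assumptions made for illustration only.

```php
<?php
// Sketch only: a corrected version of the "Welcome back" example.
// welcome_message() is a hypothetical helper, not part of the taint patch.
function welcome_message(string $username): string
{
    // htmlspecialchars() is one of the two conversions named in the warning.
    return 'Welcome back, ' . htmlspecialchars($username, ENT_QUOTES, 'UTF-8') . "\n";
}

// Stand-in for $_GET['username'] so the sketch runs from the command line.
$username = '<script>alert(1)</script>';
echo welcome_message($username);
```

With the conversion in place the markup arrives in the browser as text instead of being interpreted, which is the behavior the taint warning is meant to steer the programmer toward.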
Although the code is still green, its present form already shows potential to do things like labeling privacy-sensitive data (social security or credit card numbers) and preventing such data from flying out the window via HTML web pages.

Multiple flavors of taint
-------------------------

As many know, a PHP program should use htmlspecialchars() or htmlentities() before rendering data as HTML output. And before rendering data in other "output"-like contexts, a program should use the proper conversion function, too: escapeshellcmd() etc. for shell commands, mysqli_real_escape_string() for mysqli queries. These conversions are not only needed for security reasons, they are also required for robustness; shell or SQL commands may fail unexpectedly when given a legitimate name such as O'Reilly.

What I have implemented encourages programmers to do exactly that: using the RIGHT conversion function. To achieve this goal, the code implements multiple flavors of taint.

In my initial post I proposed black-and-white taint, but I changed my mind for human interface reasons. To help programmers eliminate security holes from their code, they need to be told how to fix the problem. With black-and-white taint, the system just didn't know if the programmer had chosen the right conversion function. The tool was too simplistic for PHP's complex environment.

Low-level implementation
------------------------

Taint support is implemented with some of the unused bits in the zval data structure. The zval is the PHP equivalent of a memory cell. Besides a type (string, number etc.) and value, each zval has a reference count and a flag that says whether the zval is a reference to yet another zval that contains the actual value.

Right now I am using seven bits, but there is room for more: 32-bit UNIX compilers such as GCC add 16 bits of padding to the current zval data structure, and this amount isn't going to be smaller on 64-bit architectures.
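To make the difference with black-and-white taint concrete, here is a rough userland model of two taint flavors. It is not how the patch works internally (the real bits live in the zval, not in a PHP object). The bit values, the Labeled class and the model_* and check_sink() helpers are all assumptions invented for this sketch; only the flavor names TC_HTML and TC_SHELL come from the preliminary list in this post.

```php
<?php
// Userland model only; the real patch stores these bits inside the zval.
// The numeric bit values are assumptions made for this sketch.
const TC_HTML  = 1;
const TC_SHELL = 2;

// A string paired with its taint bits (hypothetical class, not a patch API).
final class Labeled
{
    public $text;
    public $taint;

    public function __construct(string $text, int $taint)
    {
        $this->text  = $text;
        $this->taint = $taint;
    }
}

// Each conversion function clears only its own flavor of taint.
function model_htmlspecialchars(Labeled $v): Labeled
{
    return new Labeled(htmlspecialchars($v->text, ENT_QUOTES, 'UTF-8'), $v->taint & ~TC_HTML);
}

function model_escapeshellarg(Labeled $v): Labeled
{
    return new Labeled(escapeshellarg($v->text), $v->taint & ~TC_SHELL);
}

// A sink names the flavor it refuses and the conversion it expects,
// so the message can tell the programmer how to fix the problem.
function check_sink(Labeled $v, int $forbidden, string $fix): ?string
{
    if ($v->taint & $forbidden) {
        return "Argument contains data that is not converted with $fix";
    }
    return null;
}

$input = new Labeled("O'Reilly; rm -rf /", TC_HTML | TC_SHELL);
$html  = model_htmlspecialchars($input);
$shell = model_escapeshellarg($input);

// HTML-converted data is fine for HTML output: prints "ok".
echo check_sink($html, TC_HTML, 'htmlspecialchars()') ?? 'ok', "\n";
// The same data is still flagged when it reaches a shell command.
echo check_sink($html, TC_SHELL, 'escapeshellarg()') ?? 'ok', "\n";
```

With a single taint bit, the htmlspecialchars() call would have cleared everything and the second check could not have noticed that the wrong conversion was used for the shell context.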
If I really have to squeeze the taint bits in-between the existing bits, the taint support performance hit goes up to perhaps 2% in the macro benchmarks that I mentioned above (but again, this number is barely above the 1-2% measurement error, so don't take it too literally).

The preliminary configuration user interface is rather low-level, somewhat like MS-DOS file permissions :-( This is good enough for testing and debugging the taint support itself, but I would not want to have wires hanging out of the machine like this forever. The raw bits will need to be encapsulated so that applications can work with meaningful names and abstractions.

To give an idea of what the nuts and bolts look like, this is the preliminary list of bits, or should I say: binary properties, together with the parameters that control their handling:

- TC_HTML:
  - By default, data with this bit is not allowed in HTML output (parameter: taint_checks_html). This requirement is not enforced with the default setting of taint_error_level = 0.
  - The htmlspecialchars() and htmlentities() functions produce output without this bit.
  - By default, this bit is set on all data from the web (parameter: taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms) or from elsewhere (parameter: taint_marks_other).

- TC_SHELL:
  - By default, data with this bit is not allowed in shell commands (parameter: taint_checks_shell). This requirement is not enforced with the default setting of taint_error_level = 0.
  - The escapeshellarg() and escapeshellcmd() functions produce output without this bit.
  - By default, this bit is set on all data from the web (parameter: taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms) or from elsewhere (parameter: taint_marks_other).

- TC_MYSQL:
  - By default, data with this bit is not allowed in mysql_query() (parameter: taint_checks_mysql). This requirement is not enforced with the default setting of taint_error_level = 0.
  - The mysql_real_escape_string() function produces output without this bit.
  - By default, this bit is set on all data from the web (parameter: taint_marks_egpcs), from DBMS (parameter: taint_marks_dbms) or from elsewhere (parameter: taint_marks_other).

- TC_MYSQLI:
  - Like TC_MYSQL, except this one is for the mysqli_query() API. The TC_MYSQLI bit is removed with mysqli_real_escape_string().

- TC_SELF:
  - By default, data with this bit cannot be used for internal control operations such as eval, include, as a file name argument, as a network destination, or in other contexts where someone could take away control from the application (parameter: taint_checks_self). This requirement is not enforced with the default setting of taint_error_level = 0.
  - I haven't yet implemented a dedicated conversion function; to shut up the warnings, this data needs to be marked as "safe" with a low-level untaint($foo, TC_SELF) call.
  - By default, this bit is set on all data from the web (parameter: taint_marks_egpcs).

- TC_USER1, TC_USER2: These are labels that an application can set on specific data. For example, it could set these bits when credit card or social security numbers come out of a database. The taint_checks_html policy for HTML output (see above) would then be configured to disallow data not only with the TC_HTML property, but also with TC_USER1 or TC_USER2. Obviously some polished user interface would need to be built on top of this to make application-defined attributes usable.

Taint propagation policy
------------------------

Before implementing the above policies, the first order of business was adding taint propagation to the PHP core: for each operator, including type conversion, a decision had to be made how to propagate taint from source operands to results.

The general taint propagation rules are:

- Pure arithmetic and pure string operations propagate all the taint bits from their operands to their results. The rules become more complicated with operators whose operands have different types.
- Conversions from string to number or to boolean remove all but a few taint bits (by default, only the TC_SELF bit stays). This prevents silly warnings about having to use htmlspecialchars() or mysql_real_escape_string() when rendering non-string data in SQL/HTML/shell context, while still protecting the application against some control hijacking attacks.
- Conversions from number or boolean to string preserve all the taint bits.
- Comparison operators currently ignore taint bits.

Most of this taint propagation is done, but there are a few minor issues that still need to be resolved.

- Something needs to be done when functions like parse_str() are given tainted data: the question is how to represent the taintedness of the resulting hash table lookup keys. These strings could be harmful when used in file names, in database names, or in other critical information.
- Something needs to be done about null strings; it would be silly to insist that data composed from a null string be converted with htmlspecialchars() etc. On the other hand, a null string does change the syntactical structure of information, so we have to be careful.

While adding taint propagation I found that a lot of PHP source code fails to use the official macros when initializing a zval. In these cases I added another line of code to initialize the taint bits by hand. Also, more internal documentation (other than empty man page skeletons) could have reduced development time significantly.

Major loose ends
----------------

I already mentioned the loose wires hanging out of the machine; the user interface for taint policy control will need to be made more suitable for people who aren't primarily interested in PHP core hacking.

Support for tainted objects is still incomplete. In particular, conversions from objects to non-objects may lose taint bits.
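The general propagation rules listed under "Taint propagation policy" can be modeled in a few lines of userland PHP. This is a sketch under stated assumptions, not the engine code: the Tainted class, its method names and the numeric bit values are invented here, and the constants HTML, SHELL, MYSQL and SELF merely stand in for TC_HTML, TC_SHELL, TC_MYSQL and TC_SELF.

```php
<?php
// Userland model of the propagation rules; the real work happens in the engine.
// Class name, method names and bit values are assumptions made for this sketch.
final class Tainted
{
    const HTML  = 1;
    const SHELL = 2;
    const MYSQL = 4;
    const SELF  = 8;

    public $value;
    public $bits;

    public function __construct($value, int $bits)
    {
        $this->value = $value;
        $this->bits  = $bits;
    }

    // Pure string operation: the result carries the taint of both operands.
    public static function concat(Tainted $a, Tainted $b): Tainted
    {
        return new Tainted($a->value . $b->value, $a->bits | $b->bits);
    }

    // String to number: drop every flavor except SELF (the stated default).
    public function toInt(): Tainted
    {
        return new Tainted((int) $this->value, $this->bits & self::SELF);
    }

    // Number or boolean to string: all bits are preserved.
    public function toStr(): Tainted
    {
        return new Tainted((string) $this->value, $this->bits);
    }

    // Comparison: taint bits are ignored, only the values matter.
    public static function equals(Tainted $a, Tainted $b): bool
    {
        return $a->value === $b->value;
    }
}

$all   = Tainted::HTML | Tainted::SHELL | Tainted::MYSQL | Tainted::SELF;
$page  = new Tainted('42', $all);          // e.g. a request parameter
$label = new Tainted('page=', 0);          // a literal from the program itself

$joined = Tainted::concat($label, $page);  // carries all four flavors
$number = $page->toInt();                  // carries SELF only
$back   = $number->toStr();                // still carries SELF only

printf("%d %d %d\n", $joined->bits, $number->bits, $back->bits); // prints "15 8 8"
```

The model leaves out the harder cases mentioned in the post, such as operators with operands of different types, hash table keys produced by parse_str(), and null strings.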
For now, I manually added taint support to a number of standard built-ins (file, process, *scanf, *printf, and a subset of the string functions) and extensions (mysql, mysqli). I hope this will be sufficient to get some experience with taint support.

Right now, taint-unaware extensions will work properly as long as taint checks are disabled (the default), and as long as they are recompiled with the patched PHP header files. When taint checking is turned on, some extensions may cause false alarms when they fail to use the official macros to initialize zval structures, thereby leaving some taint bits at uninitialized values.

I still hope that it will somehow be possible to annotate extensions so that taint support can be added without modifying lots of extension source code. However, having multiple flavors of taint, instead of just one, will make the job so much more interesting.

Distant future
--------------

Currently, only data is labeled, and only with binary attributes. No corresponding attributes exist for sources and sinks (files, network, databases, etc.). If we knew that a connection is encrypted, or whether something is an intranet or extranet destination, or some other property, then we could implement more sophisticated policies than the simple MS-DOS like file permissions that I have implemented now.

But all this is miles beyond the immediate problem that I am trying to solve today: helping programmers find the holes in their own code before other people do.

	Wietse

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php