Hi Nikita, a few notes:

It looks like the patch mixes the attempt to catch the problem (the few
lines related to the zend_hash_find_bucket changes) with an optimization
to compensate for the collision-check overhead (everything else). I think
it's better to separate these parts.

Introducing zend_hash_add_or_return() in 7.0.1 is going to create a
forward incompatibility with 7.0.0. However, we may reserve it for
internal usage by removing ZEND_API.

I don't think PHP should prevent users from constructing whatever arrays
they like because of internal problems (as in your example).

> $s = 1 << 16; $a = [];
> for ($i = 0; count($a) < $s; $i += $s) { $a[$i] = 0; }

It causes a performance problem, but that doesn't mean we should kill it.
We have "max_execution_time" to catch them all. I think only big arrays
coming from external sources should be checked.

Your solution is incomplete anyway, because instead of crafting a single
list with 10000 collisions, an attacker will be able to use N lists with
1000 collisions each.

Thanks. Dmitry.

On Thu, Nov 26, 2015 at 8:24 PM, Nikita Popov <nikita....@gmail.com> wrote:

> Hi internals!
>
> This mail turned out to be rather long, so I'll start with a TL;DR:
>
> To fix the HashDos vulnerability for *all* cases (rather than just
> GET/POST parsing), I propose to introduce collision counting during
> hashtable insertion operations. This will throw a fatal error if the
> number of collisions during an insertion operation exceeds a certain
> threshold.
>
> Implementation: https://github.com/php/php-src/pull/1565
>
> From my testing the change has no negative performance impact. The
> change is small and does not break ABI.
>
> Tracking bug (with various implementations):
> https://bugs.php.net/bug.php?id=70644
>
> What are your thoughts on this?
>
> Nikita
>
> ---------
>
> For those not familiar with HashDos or interested in some considerations
> regarding alternative implementations, the original mail:
>
> In PHP 5.3.9 a partial fix for the HashDos vulnerability was introduced
> in the form of max_input_vars. As a recap, HashDos exploits the fact
> that hashtables (which PHP uses ubiquitously) perform very badly if the
> hashes of many elements collide. In particular, the performance of
> inserting N elements into a hashtable can degrade from O(N) to O(N^2)
> using carefully chosen keys.
>
> A very simple demo script for creating a hashtable with many collisions:
>
> $s = 1 << 16; $a = [];
> for ($i = 0; count($a) < $s; $i += $s) { $a[$i] = 0; }
>
> This code creates an array with about 65k elements and runs in 30s
> (PHP 5.6) or 4s (PHP 7) on my machine.
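>
> To see why these particular keys collide: with a power-of-two table, the
> bucket index is essentially the hash masked down to the table size, and
> for integer keys the hash is the key itself. A rough, self-contained
> illustration of that scheme (not the actual zend_hash code; the names
> are made up):
>
>     #include <stdio.h>
>     #include <stdint.h>
>
>     int main(void) {
>         uint32_t table_size = 1 << 16;      /* 65536 slots */
>         uint32_t mask = table_size - 1;
>
>         /* Keys 0, 65536, 131072, ... are multiples of the table size,
>          * so (key & mask) is 0 for every one of them: all keys land in
>          * the same bucket and form one long collision chain. */
>         for (uint64_t key = 0; key < 5 * (uint64_t)table_size; key += table_size) {
>             printf("key %llu -> bucket %u\n",
>                    (unsigned long long)key, (uint32_t)key & mask);
>         }
>         return 0;
>     }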
>
> This behavior can be exploited by an attacker whenever a server running
> PHP can be made to create a decent-sized array (or object) with
> attacker-controlled keys. This DOS vulnerability is efficient: a 700kb
> payload can easily take 5 CPU seconds to process on PHP 7, and from
> there it goes up quadratically (i.e. a 1.4MB payload takes 20 seconds,
> etc.).
>
> The most obvious attack vectors are GET and POST variables, which are
> automatically decoded into arrays. PHP 5.3.9 prevented this attack
> avenue by introducing max_input_vars, which prevents creating GET/POST
> arrays with more than a certain number of elements. As the HashDos
> attack relies on O(N^2) asymptotic runtime, limiting N to small numbers
> is an effective fix.
>
> However, GET/POST are not the only ways an attacker can trigger
> array/object creation with specific keys. For example, many PHP
> applications have endpoints that accept JSON encoded payloads, which are
> not protected by max_input_vars. The only means of protection currently
> available to userland code is to never decode JSON payloads that exceed
> some conservative size limit like 50-100KB (often not practical).
>
> Apart from JSON, there will likely be various other application-specific
> situations where arrays are generated which contain user-provided keys.
> While we could add something similar to max_input_vars to json_decode
> and unserialize, this would not solve the problem for any custom code.
>
> There are essentially three ways to prevent HashDos attacks in general
> (rather than patching specific places):
>
> 1. If there are many collisions for a hash slot, switch collision
>    resolution from linked lists to self-balancing binary trees.
> 2. Switch hashtables to a keyed cryptographic hash like SipHash.
> 3. If there are very many collisions for a hash slot, throw a fatal
>    error.
>
> I will comment on these possibilities as they apply to PHP.
>
> 1. (Self-balancing binary trees). The idea here is that a balanced
> binary tree has worst-case insertion time O(log N), while the linked
> list we normally use has worst-case insertion time O(N). This means that
> the worst-case execution time for N insertions into a hashtable goes
> from O(N^2) to O(N log N), i.e. from catastrophic to "a few times slower
> than usual".
>
> I don't think this solution is viable for PHP. Supporting binary trees
> next to linked-list collision resolution would make the hashtable
> implementation *a lot* more complicated -- and it isn't exactly simple
> as is.
>
> 2. (Randomized cryptographic hash). This approach doesn't change
> anything about how collisions are handled; instead it prevents the
> attacker from constructing such collisions in the first place.
>
> This is done by using a fast cryptographic hash function (typically
> SipHash), which is keyed with a (cryptographically strong) random key.
> As the attacker does not know the key, he cannot efficiently construct
> arrays with many collisions.
>
> There are a number of factors to take into account when trying to apply
> this to PHP:
>
> a) The SipHash key would have to be the same for all processes in an FPM
> pool. We can't use a per-process or per-request key due to arrays living
> in opcache SHM. As far as I know, SipHash is designed to be secure even
> when the used key is long-living, so this does not appear to be an
> issue.
>
> b) SipHash is slower (especially for very small strings) than the hash
> function we currently use (DJBX33A). As PHP 7 caches the hashes for
> strings, I do not believe this to be a significant problem.
>
> c) The real issue: currently PHP does not hash integer keys, but they
> are just as susceptible to HashDos as string keys. To protect against
> this vector we'd have to start hashing integer keys as well. Especially
> when using a hash function as expensive as SipHash, this would make
> operations on arrays with (non-ascending or non-dense) integer keys
> significantly slower.
>
> Here is an implementation of using SipHash for both strings and
> integers:
> https://github.com/php/php-src/compare/master...nikic:integerHash
>
> Testing this against Wordpress showed only a small (but non-trivial)
> slowdown. This is probably attributable to the fact that most
> integer-indexed arrays are packed (not using a hash) and thus this
> change has less impact than it might have had on PHP 5.
>
> However, when testing raw array operations, working on non-packed
> integer arrays can easily be three times slower with this
> implementation. I do not think this is acceptable. Maybe Wordpress
> doesn't use many unpacked integer arrays, but that does not have to hold
> generally, and this difference is just too drastic.
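>
> For a sense of what hashing integer keys with a secret key would look
> like, here is a conceptual, self-contained sketch. This is not the code
> from the integerHash branch: it uses a trivial mixer as a stand-in for
> real SipHash, and the names, constants and key value are illustrative
> only:
>
>     #include <stdint.h>
>     #include <stdio.h>
>
>     /* Pool-wide secret key; a real deployment would fill this from a
>      * CSPRNG once at startup (the same key for all workers sharing
>      * opcache SHM). */
>     static uint64_t hash_key = 0x0123456789ABCDEFULL;  /* placeholder */
>
>     /* Stand-in keyed hash for integer keys. NOT SipHash -- just a cheap
>      * mixer showing the idea: without knowing hash_key, an attacker
>      * cannot predict which bucket a key will land in. */
>     static uint64_t keyed_int_hash(uint64_t key)
>     {
>         uint64_t h = key ^ hash_key;
>         h *= 0x9E3779B97F4A7C15ULL;  /* multiply by a large odd constant */
>         h ^= h >> 32;                /* fold high bits into the low bits */
>         return h;
>     }
>
>     int main(void)
>     {
>         uint32_t table_size = 1 << 16;
>         uint32_t mask = table_size - 1;
>
>         /* The colliding keys from the demo script now spread out over
>          * different buckets. */
>         for (uint64_t key = 0; key < 5 * (uint64_t)table_size; key += table_size) {
>             printf("key %llu -> bucket %u\n",
>                    (unsigned long long)key,
>                    (uint32_t)(keyed_int_hash(key) & mask));
>         }
>         return 0;
>     }
>
> The cost is that every insert/lookup on a non-packed integer key now
> pays for the extra hashing, which is where the slowdown above comes
> from.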
>
> 3. (Fatal error on many collisions). While the two previous options try
> to ensure that hashtables stay efficient regardless of the used keys,
> the last option aims to simply detect malicious array keys and abort the
> script in such a case.
>
> This is done by counting the number of collisions during hashtable
> insertion operations and throwing a fatal error if this collision count
> exceeds a certain threshold.
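>
> As a simplified, self-contained C sketch of that idea (this is not the
> actual code from the pull request; the structure names and the threshold
> value are illustrative):
>
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     #define HT_MAX_COLLISIONS 1000          /* illustrative threshold */
>
>     typedef struct Bucket {
>         uint64_t       h;                   /* hash of the key */
>         struct Bucket *next;                /* next entry in the chain */
>     } Bucket;
>
>     typedef struct {
>         Bucket  **slots;                    /* array of chain heads */
>         uint32_t  mask;                     /* table_size - 1 */
>     } HashTable;
>
>     /* Count the entries already in the target chain while inserting and
>      * bail out if the chain is implausibly long. Legitimate data stays
>      * far below the threshold; crafted colliding keys trip it quickly. */
>     static void hash_insert(HashTable *ht, Bucket *b)
>     {
>         uint32_t idx = (uint32_t)(b->h & ht->mask);
>         uint32_t collisions = 0;
>
>         for (Bucket *p = ht->slots[idx]; p != NULL; p = p->next) {
>             if (++collisions > HT_MAX_COLLISIONS) {
>                 fprintf(stderr, "Fatal error: too many hash collisions\n");
>                 exit(1);
>             }
>         }
>
>         b->next = ht->slots[idx];
>         ht->slots[idx] = b;
>     }
>
>     int main(void)
>     {
>         Bucket   *slots[8] = {0};
>         HashTable ht = { slots, 7 };
>
>         /* All keys are multiples of 8, so they all hash to slot 0; the
>          * threshold trips after roughly 1000 insertions. */
>         for (uint64_t i = 0; ; i += 8) {
>             Bucket *b = malloc(sizeof *b);
>             b->h = i;
>             b->next = NULL;
>             hash_insert(&ht, b);
>         }
>     }
>
> Since insertion already walks the bucket's collision chain to look for
> an existing key, counting adds essentially no extra work for normal
> data.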
>
> Here is an implementation of this mechanism for PHP:
> https://github.com/php/php-src/pull/1565
>
> This is the approach I would recommend for PHP. The patch for this
> change is small and non-intrusive and does not break ABI. From my
> testing it has no adverse performance impact. (In microbenchmarks I even
> found it to be consistently faster, as it uses a more specialized
> implementation for a commonly used type of insertion operation.)
>
> While this is certainly less elegant than the alternatives, it would
> allow us to fix the security issue in a timely manner. Switching to
> SipHash, while maybe a possibility for the future, would still require a
> significant amount of work to be at all feasible.
>
> Nikita