On Sat, Nov 28, 2015 at 12:22 AM, Thomas Hruska <thru...@cubiclesoft.com>
wrote:

> On 11/27/2015 2:21 PM, Yasuo Ohgaki wrote:
>
>> Hi Thomas,
>>
>> In practice, we wouldn't have problems with max number of collisions.
>>
>
> Is CLI going to be or can CLI be excluded from max collisions?  After
> thinking about it for a long while, that's my only area of concern here.
> SAPI can (fatal) error to its heart's content and I'll find ways to live
> with it.  But CLI needs stability guarantees.  'max_input_vars' didn't
> affect CLI.  However, max collisions could negatively affect CLI because
> the change affects all arrays instead of just the superglobals.
>
> Real-world scenario:  About once every 4 to 6 months I will write a script
> to load up a single PHP array with a few million records.  Partly because I
> can and partly because I need to.  On CLI, especially for my one-off
> quick-n-dirty scripts, I don't care how much CPU, RAM, and other resources
> are used nor how much time it takes to complete.  I just care that the
> script finishes running successfully.  If the script reads in a record and
> attempts to add it to the array, the proposed max collisions might trigger
> a fatal error if it hits the max collision limit for arrays.  My experience
> is that CLI is silent about 50% of the time when it encounters a fatal
> error.  So my script would drop back to the prompt after spending many
> minutes to hours loading the data, not having done any work, and not emit
> any error(s).  I would think that it had completed successfully until I
> went to look at the results and the results I would be expecting to see
> wouldn't be there.
>
> I abuse PHP arrays.  Especially on CLI.  Sorry.
>

To make sure there's no confusion on this point: When I say "number of
collisions" I do not mean the total sum of collisions in a hashtable.
That's a meaningless value, as a large hashtable will have a large total
number of collisions even if it is not malicious. What I mean instead is
the number of collisions in a *single* collision resolution chain.
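Just to illustrate the difference in userland terms, here is a rough
simulation (hypothetical code, not the engine's actual zend_hash
implementation) that throws random keys into buckets and compares the total
number of collisions with the length of the longest single chain:

<?php
// Distribute $size random keys over $size buckets and compare the *total*
// number of collisions with the longest single chain, which is what the
// proposed limit actually checks.
$size = 1 << 16;
$counts = array_fill(0, $size, 0);
for ($i = 0; $i < $size; $i++) {
    $bucket = mt_rand(0, $size - 1);  // stand-in for hash(key) & ($size - 1)
    $counts[$bucket]++;
}

$totalCollisions = 0;
$longestChain = 0;
foreach ($counts as $c) {
    if ($c > 1) {
        $totalCollisions += $c - 1;
    }
    $longestChain = max($longestChain, $c);
}
printf("total collisions: %d, longest chain: %d\n",
       $totalCollisions, $longestChain);

With random keys this reports tens of thousands of total collisions, but a
longest chain in the single digits.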

Let's try to quantify the probability of reaching the collision limit C in
a hashtable of size N, assuming a random hash distribution. An upper bound
for this should be (N choose C) * (1/N)^(C-1), where (1/N)^(C-1) is the
probability that C elements hash to the same value and (N choose C) is the
number of C-element subsets of an N-element set. For large N with N >> C we
can approximate (N choose C) by (e*N/C)^C / sqrt(2*pi*C). As such our upper
bound becomes N * (e/C)^C / sqrt(2*pi*C). Choosing N = 2^31 (the largest
possible hashtable), this gives a probability of 8.9E-10 for C = 20 and
2.3E-149 for C = 100. The patch uses C = 1000.
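To sanity-check those numbers, a small PHP snippet evaluating the same
bound (in log space, so the larger C values don't underflow to zero) gives
the same orders of magnitude:

<?php
// Upper bound N * (e/C)^C / sqrt(2*pi*C), returned as a base-10 exponent.
function collision_bound_log10(float $n, int $c): float {
    $ln = log($n) + $c * (1 - log($c)) - 0.5 * log(2 * M_PI * $c);
    return $ln / log(10);
}

$n = 2 ** 31;
foreach ([20, 100, 1000] as $c) {
    printf("C = %4d: ~1E%.1f\n", $c, collision_bound_log10($n, $c));
}
// C =   20: ~1E-9.1   (about 8.9E-10)
// C =  100: ~1E-148.6 (about 2.3E-149)
// C = 1000: ~1E-2558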

Now my probability-foo is pretty weak, so it's not unlikely that this
result is totally wrong ^^ In any case, the point is that even for very
large arrays, there is no way you're going to reach the collision limit
with random data. Of course it can happen that your (obviously not
perfectly random) data set happens to have keys that our hash function
behaves particularly badly with. E.g. if you create a hashtable keyed by
pointers, you'll likely get an expected length of 8 for each collision
chain due to alignment. If for whatever reason you are inserting integers
that all happen to be multiples of 2^10, you could reach the collision
limit. If that's the case, you're probably still better off knowing that
you're trying to DoS yourself, rather than just silently slowing down to a
crawl ;)
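For what it's worth, the multiples-of-2^10 case is easy to reproduce from
userland. A rough sketch (assuming integer keys are bucketed by masking
against the table size, as zend_hash does for integer keys; exact timings
will vary):

<?php
// Insert $count integer keys with a given stride. Keys that are all
// multiples of a large power of two map to only a small fraction of the
// buckets, so the collision chains grow very long.
function fill(int $count, int $stride): float {
    $start = microtime(true);
    $a = [];
    for ($i = 0; $i < $count; $i++) {
        $a[$i * $stride] = true;
    }
    return microtime(true) - $start;
}

$count = 1 << 18;
printf("stride 1:    %.3fs\n", fill($count, 1));
printf("stride 2^10: %.3fs\n", fill($count, 1 << 10));

Without a collision limit the second call is noticeably slower, since the
chains approach a length of 1024; with the proposed limit (C = 1000) it
would presumably hit the error instead.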

I'm not averse to making the collision limit configurable, but it would
still be enabled by default for CLI.

Nikita
