On Sat, Nov 28, 2015 at 12:22 AM, Thomas Hruska <thru...@cubiclesoft.com> wrote:
> On 11/27/2015 2:21 PM, Yasuo Ohgaki wrote:
>> Hi Thomas,
>>
>> In practice, we wouldn't have problems with max number of collisions.
>
> Is CLI going to be or can CLI be excluded from max collisions? After
> thinking about it for a long while, that's my only area of concern
> here. SAPI can (fatal) error to its heart's content and I'll find ways
> to live with it. But CLI needs stability guarantees. 'max_input_vars'
> didn't affect CLI. However, max collisions could negatively affect CLI
> because the change affects all arrays instead of just the superglobals.
>
> Real-world scenario: About once every 4 to 6 months I will write a
> script to load up a single PHP array with a few million records.
> Partly because I can and partly because I need to. On CLI, especially
> for my one-off quick-n-dirty scripts, I don't care how much CPU, RAM,
> and other resources are used nor how much time it takes to complete.
> I just care that the script finishes running successfully. If the
> script reads in a record and attempts to add it to the array, the
> proposed max collisions might trigger a fatal error if it hits the max
> collision limit for arrays. My experience is that CLI is silent about
> 50% of the time when it encounters a fatal error. So my script would
> drop back to the prompt after spending many minutes to hours loading
> the data, not having done any work and not emitting any error(s). I
> would think that it had completed successfully until I went to look at
> the results and the results I would be expecting to see wouldn't be
> there.
>
> I abuse PHP arrays. Especially on CLI. Sorry.

To make sure there's no confusion on this point: When I say "number of
collisions" I do not mean the total sum of collisions in a hashtable.
That's a meaningless value, as a large hashtable will have a large
total number of collisions even if it is not malicious. What I mean is
the number of collisions in a *single* collision resolution chain.

Let's try to quantify the probability of reaching the collision limit C
with a hashtable of size N, assuming a random hash distribution. The
upper bound for this should be (N choose C) * (1/N)^(C-1), with
(1/N)^(C-1) being the probability that C elements hash to one value and
(N choose C) the number of C-element subsets of an N-element set. For
large N and N >> C we can approximate (N choose C) by
(e*N/C)^C / sqrt(2*pi*C). As such our upper bound becomes
N * (e/C)^C / sqrt(2*pi*C). Choosing N = 2^31 (the largest possible
hashtable) we get a probability of 8.9E-10 for C = 20 and 2.3E-149 for
C = 100. The patch uses C = 1000.

Now my probability-foo is pretty weak, so it's not unlikely that this
result is totally wrong ^^ In any case, the point is that even for very
large arrays, there is no way you're going to reach the collision limit
with random data.

Of course it can happen that your (obviously not perfectly random) data
set has keys that our hash function behaves particularly badly with.
E.g. if you create a hashtable keyed by pointers, you'll likely get an
expected length of 8 for each collision chain due to alignment. If for
whatever reason you are inserting integers that all happen to be a
multiple of 2^10, you could reach the collision limit. If that's the
case, you're probably still better off knowing that you're trying to
DoS yourself, rather than just silently slowing down to a crawl ;)

I'm not averse to making the collision limit configurable, but it would
still be enabled by default for CLI.

Nikita
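
P.S. For anyone who wants to sanity-check those numbers, here is a
rough PHP sketch of the bound above (the function name is made up, and
the result is only as good as the estimate it implements):

<?php
// Upper bound N * (e/C)^C / sqrt(2*pi*C) for the probability of a
// collision chain of length C in a table of size N, assuming a random
// hash distribution.
function chain_probability_bound($n, $c) {
    return $n * pow(M_E / $c, $c) / sqrt(2 * M_PI * $c);
}

printf("C =  20: %.1e\n", chain_probability_bound(2 ** 31, 20));  // ~8.9e-10
printf("C = 100: %.1e\n", chain_probability_bound(2 ** 31, 100)); // ~2.3e-149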
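
P.P.S. To illustrate the 2^10 case: integer keys are used as the hash
value directly, so keys that share their low bits all pile into the
same buckets modulo the (power-of-two) table size. A contrived sketch,
with the shift bumped to 11 just to keep the chains comfortably above
the proposed limit of 1000:

<?php
// Every key is a multiple of 2^11, so for any power-of-two table size
// the keys occupy only tableSize / 2^11 buckets, and each collision
// chain grows well past 1000 entries no matter how large the table gets.
$a = [];
for ($i = 0; $i < 100000; $i++) {
    $a[$i << 11] = true;
}

With the patch applied, a script like this is exactly the kind of case
that would hit the limit instead of just crawling.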