On Thu, Sep 22, 2016 at 1:07 AM, Stanislav Malyshev <smalys...@gmail.com>
wrote:

> Hi!
>
> >> Lets [try] to quantify the probability of reaching the collision limit C
> > with a hashtable of size N and assuming a random hash distribution. The
> > upper bound for this should be (N choose C) * (1/N)^(C-1), with
> (1/N)^(C-1)
> > being the probability that C elements hash to one value and (N over C)
> the
> > number of C element subsets of an N element set. For large N and N >> C
> we
> > approximate (N choose C) to (e*N/C)^C / sqrt(2pi*C). As such our upper
> > bound becomes N * (e/C)^C / sqrt(2pi*C). Choosing N = 2^31 (largest
> > possible hashtable) we get for C = 20 probability 8.9E-10 and for C = 100
> > probability 2.3E-149. The patch uses C = 1000.
>
> I assume you're talking here about per-hashtable limit?
> Here I notice two things:
>
> 1. Hash keys are not really random. Which can both improve and make it
> worse, but we have a lot of keys named "key" and not a lot of keys named
> "\x07\xF5\xDD".
>
> 2. This does not account for the fact that if we have two keys used by
> app colliding, we'll probably see this collision a lot. Of course, if
> we're counting per-hashtable, that would make is somewhat less, but still.
>
> To validate this theory, I applied the following patch:
>
> https://gist.github.com/smalyshev/2c3c085998ba81d6c163a27386aaacd8
>
> And run composer update on one of my projects. I got these lines:
>
> Destroyed hash with coll = 48454
> Destroyed hash with coll = 41871
> Destroyed hash with coll = 41828
> Destroyed hash with coll = 40535
> Destroyed hash with coll = 32774
> Destroyed hash with coll = 22782
> Destroyed hash with coll = 21381
> Destroyed hash with coll = 15232
> Destroyed hash with coll = 14745
> Destroyed hash with coll = 12216
> Destroyed hash with coll = 11742
> Destroyed hash with coll = 11625
> Destroyed hash with coll = 8916
> Destroyed hash with coll = 6820
> Destroyed hash with coll = 5911
> Destroyed hash with coll = 5429
> Destroyed hash with coll = 3390
> Destroyed hash with coll = 1188
>
> I.e. there were 18 hashes with more than 1000 collisions on the course
> of the run. Is there something wrong with my count or am I counting
> wrong things (I didn't see the patch for limiting counts so maybe I'm
> still counting wrong things).
>
> > In other words, it is extremely unlikely that you hit the collision limit
> > by accident, with non-malicious data. So no, the user does not have to
> > discriminate normal from malicious causes. If the limit triggers, it's
> > malicious.
>
> I'm not so sure of this now - unless, again, I misunderstand what we're
> counting.


As linked in my first reply, there was a previous discussion on the topic:
http://markmail.org/message/ttbgcvdu4f7uymfb
The collision-counting approach that Yasuo references is linked at the
start of the first mail: https://github.com/php/php-src/pull/1565

Collisions are counted during insertion operations. While we perform a
hashtable insertion, we first need to iterate over the existing buckets for
a certain hash to see if the insert is actually an update. The
implementation simply keeps track of how many buckets we inspect while
doing that. If the number is above a certain threshold, an error is
generated.

As such, the number of collisions per hashtable is not important. Only the
number of collisions for a certain hash-value is relevant. Getting C=1000
collisions for a single hash-value is extremely unlikely until you
explicitly design your keys to trigger this situation.

Nikita

Reply via email to