On Thu, Sep 22, 2016 at 1:07 AM, Stanislav Malyshev <smalys...@gmail.com> wrote:
> Hi! > > >> Lets [try] to quantify the probability of reaching the collision limit C > > with a hashtable of size N and assuming a random hash distribution. The > > upper bound for this should be (N choose C) * (1/N)^(C-1), with > (1/N)^(C-1) > > being the probability that C elements hash to one value and (N over C) > the > > number of C element subsets of an N element set. For large N and N >> C > we > > approximate (N choose C) to (e*N/C)^C / sqrt(2pi*C). As such our upper > > bound becomes N * (e/C)^C / sqrt(2pi*C). Choosing N = 2^31 (largest > > possible hashtable) we get for C = 20 probability 8.9E-10 and for C = 100 > > probability 2.3E-149. The patch uses C = 1000. > > I assume you're talking here about per-hashtable limit? > Here I notice two things: > > 1. Hash keys are not really random. Which can both improve and make it > worse, but we have a lot of keys named "key" and not a lot of keys named > "\x07\xF5\xDD". > > 2. This does not account for the fact that if we have two keys used by > app colliding, we'll probably see this collision a lot. Of course, if > we're counting per-hashtable, that would make is somewhat less, but still. > > To validate this theory, I applied the following patch: > > https://gist.github.com/smalyshev/2c3c085998ba81d6c163a27386aaacd8 > > And run composer update on one of my projects. I got these lines: > > Destroyed hash with coll = 48454 > Destroyed hash with coll = 41871 > Destroyed hash with coll = 41828 > Destroyed hash with coll = 40535 > Destroyed hash with coll = 32774 > Destroyed hash with coll = 22782 > Destroyed hash with coll = 21381 > Destroyed hash with coll = 15232 > Destroyed hash with coll = 14745 > Destroyed hash with coll = 12216 > Destroyed hash with coll = 11742 > Destroyed hash with coll = 11625 > Destroyed hash with coll = 8916 > Destroyed hash with coll = 6820 > Destroyed hash with coll = 5911 > Destroyed hash with coll = 5429 > Destroyed hash with coll = 3390 > Destroyed hash with coll = 1188 > > I.e. there were 18 hashes with more than 1000 collisions on the course > of the run. Is there something wrong with my count or am I counting > wrong things (I didn't see the patch for limiting counts so maybe I'm > still counting wrong things). > > > In other words, it is extremely unlikely that you hit the collision limit > > by accident, with non-malicious data. So no, the user does not have to > > discriminate normal from malicious causes. If the limit triggers, it's > > malicious. > > I'm not so sure of this now - unless, again, I misunderstand what we're > counting. As linked in my first reply, there was a previous discussion on the topic: http://markmail.org/message/ttbgcvdu4f7uymfb The collision-counting approach that Yasuo references is linked at the start of the first mail: https://github.com/php/php-src/pull/1565 Collisions are counted during insertion operations. While we perform a hashtable insertion, we first need to iterate over the existing buckets for a certain hash to see if the insert is actually an update. The implementation simply keeps track of how many buckets we inspect while doing that. If the number is above a certain threshold, an error is generated. As such, the number of collisions per hashtable is not important. Only the number of collisions for a certain hash-value is relevant. Getting C=1000 collisions for a single hash-value is extremely unlikely until you explicitly design your keys to trigger this situation. Nikita