On Sep 24 08:07:30, hru...@gmail.com wrote:
> On Fri, 20 Sep 2013, Johan Mellberg wrote:
> 
> > Your error in thinking is that if we have an extremely large set of strings,
> > a very large set is mapped to each hash value. Therefore you reason that a
> > collision is very likely. But if you are comparing two specific strings, the
> > likelihood of them hashing to the same value is EXTREMELY small (other
> > people replying have provided the numbers).
> 
> No. I do not reason like that. The collision is *by esperience* very seldom. 
> The estimation given by people, including Trigdell, are optimistic and 
> no upper bounds of the probability are given.
> 
> > Thus, if hash(A)=hash(B), A=B quite a bit more often than not.
> 
> No. You can prove (analytically, hence not empirical observation) 
> that the probability of A=B under the condition hash(A)=hash(B) 
> is close to zero.
> 
> Let us suppose, the hash function h maps X to Y, that Y have n elements,
> that X have m elements; that X_i is the subset of the elements of X
> mapped to the element i of Y, that the cardinality of X_i is m_i.
> X is the disjunct union of the n sets X_i, m is the sum of the n numbers
> m_i. We suppose that the elements of X are events, the probability is uniform
> distributed, hence, each element have probability 1/m.

> When speaking about collisions or about the probability of A=B, we are
> dealing with two elements of X, with pairs. We must concentrate on the
> product space X*X with m^2 elements, each with probability 1/m^2. For
> calculating probabilities we must count there the number of elements
> of subsets. For this purpose, we arrange the elements of X in a line,
> the elements of each X_i contigously, and we arrange the elements of
> X*X in a square whose sides are that arrange of the elements of X.
> In the diagonal of this square are the elements of the form (a,a). 
> The elements (a,b) with h(a)=h(b)=i are in smaller squares with m_i^2 
> elements and whose diagonal coincide with the diagonal of the big square.

I see you already got to page 7 of your set theory textbook,
but got the notation slightly wrong.

> The event A=B is a subset of the event h(A)=h(B) (collision).
> 
> The probability of A=B is m*(1/m^2) = 1/m.
> 
> The probability of h(A)=h(B) is (\sum m_i^2)*(1/m^2).
> 
> The probability of A=B under the condition h(A)=h(B) is the division
> of the above probabilities: m / (\sum m_i^2).
> 
> Let p=m/n. If h maps X to Y uniformly, then all m_i will be p. We observe
> m_i as p+(m_i - p):
> 
> m_i = p + (m_i - p)
> 
> m_i^2 = p^2 + 2*p*(m_i - p) + (m_i - p)^2
> 
> \sum m_i^2 = n*p^2 + 2*p*(\sum m_i - n*p) + \sum (m_i - p)^2
> 
>            = m^2/n + \sum (m_i -p)^2 
> 
>            >= m^2/n   (the equality holds only when all m_i=p, 
>                        i.e. h distributes X uniformly in Y.)
>  
> Conclusions:
> 
> (1) The probability of a collision, of h(A)=h(B), is hence *bigger or 
> equal* than 1/n, and only equal to it when h distributes X uniformly 
> in Y.
> 
> (2) The probability of A=B under the condition h(A)=h(B) is less than
> n/m, and only equal to it when h distributes X uniformly in Y. Hence,
> the probability is very small when m is much bigger than n.

Yes. There are, in fact, infinitely many possible inputs,
and only finitely many possible hashes; specifically,
2^128 for a 128bit hash function. So with a uniform
distribution, you will in fact get infinitely many
different inputs with the same hash.

But that's irrelevant to your original question,
because none of these colliding inputs
will ever be a file on your rsync server.

For example, you might have a file on your rsync server,
named ~/.signature, with the content "I am a fucking troll";
this file has a certain hash.

Now, you are right that there are infinitely many possible
files with a different content but the same hash. Note:
none of these will ever exist on your rsync server.
More precisely: the probability of that happening
is negligibly small.

Read the above until you completely understand it,
and then don't come back here.

Reply via email to