Re: cvsync, rsync

hruodr Tue, 24 Sep 2013 01:16:08 -0700

On Fri, 20 Sep 2013, Johan Mellberg wrote:

> Your error in thinking is that if we have an extremely large set of strings,
> a very large set is mapped to each hash value. Therefore you reason that a
> collision is very likely. But if you are comparing two specific strings, the
> likelihood of them hashing to the same value is EXTREMELY small (other
> people replying have provided the numbers).

No. I do not reason like that. The collision is *by esperience* very seldom. 
The estimation given by people, including Trigdell, are optimistic and 
no upper bounds of the probability are given.

> Thus, if hash(A)=hash(B), A=B quite a bit more often than not.

No. You can prove (analytically, hence not empirical observation) 
that the probability of A=B under the condition hash(A)=hash(B) 
is close to zero.

Let us suppose, the hash function h maps X to Y, that Y have n elements,
that X have m elements; that X_i is the subset of the elements of X
mapped to the element i of Y, that the cardinality of X_i is m_i.
X is the disjunct union of the n sets X_i, m is the sum of the n numbers
m_i. We suppose that the elements of X are events, the probability is uniform
distributed, hence, each element have probability 1/m.

When speaking about collisions or about the probability of A=B, we are
dealing with two elements of X, with pairs. We must concentrate on the
product space X*X with m^2 elements, each with probability 1/m^2. For
calculating probabilities we must count there the number of elements
of subsets. For this purpose, we arrange the elements of X in a line,
the elements of each X_i contigously, and we arrange the elements of
X*X in a square whose sides are that arrange of the elements of X.
In the diagonal of this square are the elements of the form (a,a). 
The elements (a,b) with h(a)=h(b)=i are in smaller squares with m_i^2 
elements and whose diagonal coincide with the diagonal of the big square.

The event A=B is a subset of the event h(A)=h(B) (collision).

The probability of A=B is m*(1/m^2) = 1/m.

The probability of h(A)=h(B) is (\sum m_i^2)*(1/m^2).

The probability of A=B under the condition h(A)=h(B) is the division
of the above probabilities: m / (\sum m_i^2).

Let p=m/n. If h maps X to Y uniformly, then all m_i will be p. We observe
m_i as p+(m_i - p):

m_i = p + (m_i - p)

m_i^2 = p^2 + 2*p*(m_i - p) + (m_i - p)^2

\sum m_i^2 = n*p^2 + 2*p*(\sum m_i - n*p) + \sum (m_i - p)^2

           = m^2/n + \sum (m_i -p)^2 

           >= m^2/n   (the equality holds only when all m_i=p, 
                       i.e. h distributes X uniformly in Y.)

Conclusions:

(1) The probability of a collision, of h(A)=h(B), is hence *bigger or 
equal* than 1/n, and only equal to it when h distributes X uniformly 
in Y.

(2) The probability of A=B under the condition h(A)=h(B) is less than
n/m, and only equal to it when h distributes X uniformly in Y. Hence,
the probability is very small when m is much bigger than n.

These are my calculations, you may correct me, but the probability of
getting insults and diffamation are bigger.

--

On Tue, 24 Sep 2013, Janne Johansson wrote:

> I didn't seem to get an answer here.
>
> How would I know that the 4G wav-file I sent from one box to another is 100%
> identical?
>
> If we assume (and I think that is what you seem to claim) that we can't
> blindly trust hashing, but we will assume that no cosmic rays nor hard-drive
> bit failures can affect the contents, [...]

To much stories about cosmic rays and the age of the universe in relation
to the probability of collitions of functions that no one knows exactly.
Let the software be software, the hardware be hardware.

Rodrigo.

Re: cvsync, rsync

Reply via email to