On Sep 6, 4:04 am, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:
> On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel <d...@davea.name> wrote:
> > For random strings (as defined below), the average compare time is
> > effectively unrelated to the size of the string, once the size passes
> > some point.
> >
> > Define random string as being a selection from a set of characters,
> > with replacement. So if we pick some set of characters, say 10 (or 256,
> > it doesn't really matter), the number of possible strings is 10**N.
> > The likelihood of not finding a mismatch within k characters is
> > (1/10)**k. The likelihood of actually reaching the end of the random
> > string is (1/10)**N (which is the reciprocal of the number of strings,
> > naturally).
> >
> > If we wanted an average number of comparisons, we'd have to calculate a
> > series, where each term is a probability times a value for k.
> >
> >     sum((k * 9*10**-k) for k in range(1, 10))
> >
> > Those terms very rapidly approach 0, so it's safe to stop after a few.
> > Looking at the first 9 items, I see a value of 1.1111111
> >
> > This may not be quite right, but the value is certainly well under 2 for
> > a population of 10 characters, chosen randomly. And notice that N
> > doesn't really come into it.
>
> It's exactly right. You can obtain this result analytically from
> Johannes' formula above. Just replace 256 with 10 to get that the
> expected number of comparisons is
>
>     (10/9)*(1 - 10**(-N))
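For what it's worth, that closed form checks out numerically too. Here's
a quick sketch (the helper names are mine, just for illustration): sum k
times the probability that the first mismatch lands at position k, add
the all-equal case, and compare against (10/9)*(1 - 10**(-N)):

def expected_comparisons(N, alphabet=10):
    # sum over k of: k * P(first mismatch is at position k),
    # plus N * P(the two strings are entirely equal)
    q = 1.0 / alphabet
    total = 0.0
    for k in range(1, N + 1):
        total += k * (1 - q) * q ** (k - 1)
    return total + N * q ** N

def closed_form(N, alphabet=10):
    # the quoted result, (10/9)*(1 - 10**(-N)) when alphabet == 10
    a = float(alphabet)
    return (a / (a - 1)) * (1 - a ** (-N))

for N in (1, 2, 5, 20):
    print("%2d %.12f %.12f" % (N, expected_comparisons(N), closed_form(N)))

Both columns settle at 10/9 = 1.111..., which is the 1.1111111 Dave saw
from the first nine terms of the series.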
The math here is simple, but of course the average running time is also
driven by usage patterns.

Let's say you have two independently produced random strings of N bits
each. The average number of bit comparisons it takes to determine that
the strings are different is between 1 and 2 for any value of N, for
the same reason that the average number of times you need to flip a
coin before either coming up heads or exhausting your N tries is always
less than 2, and pretty darn close to 2 for even fairly small values
of N.

def compute_checks_for_different_strings(n):
    # expected number of bit comparisons needed to show that two
    # random n-bit strings differ
    x = 1
    p = 0.5
    for i in range(0, n):
        x += p * i
        p = p * 0.5
    print n, x

for n in [10, 100, 1000, 10000]:
    compute_checks_for_different_strings(n)

10 1.9892578125
100 2.0
1000 2.0
10000 2.0

Now let's say your string comparison function is used to verify 100-bit
passwords, and 99% of the time the password comes from a legit user who
typed it correctly. Except for the 1% of the time when somebody
fat-fingers the copy/paste of their password, every strcmp call is
going to have to compare all 100 pairs, i.e. all N of them. If the
usage pattern says that 99% of the time you're simply verifying that
two strings are equal, then the number of bits is the main driving
factor for running time.
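To make the usage-pattern point concrete, here's a rough sketch
(compare_count and random_string are names I made up for this example,
not anything from the standard library): a naive left-to-right
comparison instrumented to report how many character pairs it actually
looked at, using a 100-character string as a stand-in for the password.

import random
import string

def compare_count(a, b):
    # naive left-to-right comparison that also reports how many
    # character pairs it inspected before deciding
    if len(a) != len(b):
        return False, 0
    checked = 0
    for x, y in zip(a, b):
        checked += 1
        if x != y:
            return False, checked
    return True, checked

def random_string(n):
    return ''.join(random.choice(string.ascii_letters) for _ in range(n))

N = 100
password = random_string(N)

# Legit user, correct password: every call inspects all N pairs.
print(compare_count(password, password))

# Wrong/random input: the first mismatch is usually found within a pair or two.
trials = 10000
total = sum(compare_count(password, random_string(N))[1] for _ in range(trials))
print(float(total) / trials)   # roughly 1.02 on average for a 52-letter alphabet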