On 10/09/2012 18:07, Dan Goodman wrote:
On 04/09/2012 03:54, Roy Smith wrote:
Let's assume you're testing two strings for equality.  You've already
done the obvious quick tests (i.e they're the same length), and you're
down to the O(n) part of comparing every character.

I'm wondering if it might be faster to start at the ends of the strings
instead of at the beginning?  If the strings are indeed equal, it's the
same amount of work starting from either end.  But, if it turns out that
for real-life situations, the ends of strings have more entropy than the
beginnings, the odds are you'll discover that they're unequal quicker by
starting at the end.

 From the rest of the thread, it looks like in most situations it won't
make much difference as typically very few characters need to be
compared if they are unequal.

However, if you were in a situation with many strings which were almost
equal, the most general way to improve the situation might be store a
hash of the string along with the string, i.e. store (hash(x), x) and
then compare equality of this tuple. Almost all of the time, if the
strings are unequal the hash will be unequal. Or, as someone else
suggested, use interned versions of the strings. This is basically the
same solution but even better. In this case, your startup costs will be
higher (creating the strings) but your comparisons will always be instant.

Just had another thought about this. Although it's unlikely to be necessary in practice since (a) it's rarely necessary at all, and (b) when it is, hashing and optionally interning seems like the better approach, I had another idea that would be more general. Rather than starting from the beginning or the end, why not do something like: check the first and last character, then the len/2 character, then the len/4, then 3*len/4, then len/8, 3*len/8, etc. You'd need to be a bit clever about making sure you hit every character but I'm sure someone's already got an efficient algorithm for this. You could probably even make this cache efficient by working on cache line length blocks. Almost certainly entirely unnecessary, but I like the original question and it's a nice theoretical problem.

Dan
--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to