Alternatively...why you should definitely use binary searches:

Python 3.5.2+ (default, Aug 30 2016, 19:08:42) 
[GCC 6.2.0 20160822] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> import timeit
>>> hashes =  [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in 
>>> range(10000000)]
>>> sortedhashes = sorted(hashes)
>>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
1.9478233020054176
>>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
18.001392804995703

I thought this was curious behavior.  I created a list of random-looking 
strings, then made a sorted copy.  I then found that using "in" to see if a 
string exists in the sorted list took over 9 times as long!

At first, I thought since both lists are the same size, and the 'in' test is a 
linear search, shouldn't they take the same amount of time?  Even if there was 
some trickery with branch prediction happening, that would have benefited the 
sorted list.  Then I remembered how lists work in Python.  The original list is 
going to be mostly contiguous in memory, making the memory cache quite 
effective.  When I create the sorted copy, I'm creating a list of references to 
strings that are all over the place in memory, causing tons of cache misses.

Of course, the best solution was to implement a binary search.  That turned the 
membership check into a 300-700 microsecond operation.
-- 
https://mail.python.org/mailman/listinfo/python-list

Reply via email to