Alternatively...why you should definitely use binary searches: Python 3.5.2+ (default, Aug 30 2016, 19:08:42) [GCC 6.2.0 20160822] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import hashlib >>> import timeit >>> hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in >>> range(10000000)] >>> sortedhashes = sorted(hashes) >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10) 1.9478233020054176 >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10) 18.001392804995703
I thought this was curious behavior. I created a list of random-looking strings, then made a sorted copy. I then found that using "in" to see if a string exists in the sorted list took over 9 times as long! At first, I thought since both lists are the same size, and the 'in' test is a linear search, shouldn't they take the same amount of time? Even if there was some trickery with branch prediction happening, that would have benefited the sorted list. Then I remembered how lists work in Python. The original list is going to be mostly contiguous in memory, making the memory cache quite effective. When I create the sorted copy, I'm creating a list of references to strings that are all over the place in memory, causing tons of cache misses. Of course, the best solution was to implement a binary search. That turned the membership check into a 300-700 microsecond operation. -- https://mail.python.org/mailman/listinfo/python-list