On Friday, December 16, 2016 at 6:27:24 PM UTC-8, Chris Angelico wrote: > On Sat, Dec 17, 2016 at 1:20 PM, <sohcahto...@gmail.com> wrote: > > I thought this was curious behavior. I created a list of random-looking > > strings, then made a sorted copy. I then found that using "in" to see if a > > string exists in the sorted list took over 9 times as long! > > > > My numbers replicate yours (though my computer's faster). But my > conclusion is different: > > Python 3.7.0a0 (default:ecd218c41cd4, Dec 16 2016, 03:08:47) > [GCC 6.2.0 20161027] on linux > Type "help", "copyright", "credits" or "license" for more information. > >>> import hashlib > >>> import timeit > >>> hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in > >>> range(10000000)] > >>> sortedhashes = sorted(hashes) > >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10) > 0.8167107938788831 > >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, > >>> number=10) > 5.029693723190576 > >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10) > 0.855183657258749 > >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, > >>> number=10) > 5.0585526106879115 > >>> sethashes = set(hashes) > >>> timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10) > 6.13601878285408e-06 > > You want containment checks? Use a set :) > > ChrisA
Ah, I forgot about the optimizations of a set. The only time I ever find myself using set is when I want to remove all the duplicates in a list. I convert it to a set and then back. For fun, I ran an experiment: ### BEGIN listsearch.py import hashlib import timeit import sys from bisect import bisect_left def bisect_search(a, x, lo=0, hi=None): # can't use a to specify default for hi # From http://stackoverflow.com/questions/212358/binary-search-bisection-in-python hi = hi if hi is not None else len(a) # hi defaults to len(a) pos = bisect_left(a,x,lo,hi) # find insertion position return (pos if pos != hi and a[pos] == x else -1) # don't walk off the end def bin_search(haystack, needle): start = 0 end = len(haystack) - 1 while True: if start > end: return False middle = (start + end) // 2 if needle < haystack[middle]: end = middle - 1 continue elif needle > haystack[middle]: start = middle + 1 continue elif needle == haystack[middle]: return True print('Python version: ', sys.version_info) print('building hashes...') hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)] print('sorting...') sortedhashes = sorted(hashes) print('creating set...') sethashes = set(hashes) print('Unsorted list:', timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)) print('Sorted:', timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)) print('set:', timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10)) print('binary search:', timeit.timeit('binsearch(hashes, "x")', globals={'hashes': sortedhashes, 'binsearch': bin_search}, number=10)) print('binary search with bisect:', timeit.timeit('binsearch(hashes, "x")', globals={'hashes': sortedhashes, 'binsearch': bisect_search}, number=10)) ### END listsearch.py > python3 listsearch.py Python version: sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0) building hashes... sorting... creating set... Unsorted list: 1.7384763684627569 Sorted: 9.248799958145042 set: 1.4614161294446149e-06 binary search: 0.00010902164328108199 binary search with bisect: 1.7829276782066472e-05 Yup. set is definitely the way to go! -- https://mail.python.org/mailman/listinfo/python-list