Re: Cache memory and its effect on list searching

sohcahtoa82 Fri, 16 Dec 2016 20:52:37 -0800

On Friday, December 16, 2016 at 6:27:24 PM UTC-8, Chris Angelico wrote:
> On Sat, Dec 17, 2016 at 1:20 PM,  <sohcahto...@gmail.com> wrote:
> > I thought this was curious behavior.  I created a list of random-looking 
> > strings, then made a sorted copy.  I then found that using "in" to see if a 
> > string exists in the sorted list took over 9 times as long!
> >
> 
> My numbers replicate yours (though my computer's faster). But my
> conclusion is different:
> 
> Python 3.7.0a0 (default:ecd218c41cd4, Dec 16 2016, 03:08:47)
> [GCC 6.2.0 20161027] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import hashlib
> >>> import timeit
> >>> hashes =  [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in 
> >>> range(10000000)]
> >>> sortedhashes = sorted(hashes)
> >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
> 0.8167107938788831
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, 
> >>> number=10)
> 5.029693723190576
> >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
> 0.855183657258749
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, 
> >>> number=10)
> 5.0585526106879115
> >>> sethashes = set(hashes)
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10)
> 6.13601878285408e-06
> 
> You want containment checks? Use a set :)
> 
> ChrisA


Ah, I forgot about the optimizations of a set.  The only time I ever find 
myself using set is when I want to remove all the duplicates in a list.  I 
convert it to a set and then back.

For fun, I ran an experiment:

### BEGIN listsearch.py
import hashlib 
import timeit
import sys

from bisect import bisect_left

def bisect_search(a, x, lo=0, hi=None):   # can't use a to specify default for 
hi
    # From 
http://stackoverflow.com/questions/212358/binary-search-bisection-in-python
    hi = hi if hi is not None else len(a) # hi defaults to len(a)
    pos = bisect_left(a,x,lo,hi)          # find insertion position
    return (pos if pos != hi and a[pos] == x else -1) # don't walk off the end

def bin_search(haystack, needle):
    start = 0
    end = len(haystack) - 1
    while True:
        if start > end:
            return False
        middle = (start + end) // 2
        if needle < haystack[middle]:
            end = middle - 1
            continue
        elif needle > haystack[middle]:
            start = middle + 1
            continue
        elif needle == haystack[middle]:
            return True

print('Python version: ', sys.version_info)
print('building hashes...')
hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in 
range(10000000)]
print('sorting...')
sortedhashes = sorted(hashes)
print('creating set...')
sethashes = set(hashes)
print('Unsorted list:', timeit.timeit('"x" in hashes', globals={'hashes': 
hashes}, number=10))
print('Sorted:', timeit.timeit('"x" in hashes', globals={'hashes': 
sortedhashes}, number=10))
print('set:', timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, 
number=10))
print('binary search:',  timeit.timeit('binsearch(hashes, "x")', 
globals={'hashes': sortedhashes, 'binsearch': bin_search}, number=10))
print('binary search with bisect:',  timeit.timeit('binsearch(hashes, "x")', 
globals={'hashes': sortedhashes, 'binsearch': bisect_search}, number=10))
### END listsearch.py

> python3 listsearch.py
Python version:  sys.version_info(major=3, minor=5, micro=2, 
releaselevel='final', serial=0)
building hashes...
sorting...
creating set...
Unsorted list: 1.7384763684627569
Sorted: 9.248799958145042
set: 1.4614161294446149e-06
binary search: 0.00010902164328108199
binary search with bisect: 1.7829276782066472e-05

Yup.  set is definitely the way to go!
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Cache memory and its effect on list searching

Reply via email to