Dmitry Chichkov wrote:

> I was looking over the documentation of the bisect module and encountered
> the following very strange statement there:
>
> From https://docs.python.org/2/library/bisect.html
>
> ...it does not make sense for the bisect() functions to have key or
> reversed arguments because that would lead to an inefficient design
> (successive calls to bisect functions would not "remember" all of the
> previous key lookups).
>
> Instead, it is better to search a list of precomputed keys to find the
> index of the record in question...
>
> Is it right that the documentation encourages using an O(N) algorithm [by
> making a copy of all keys] instead of an O(log(N)) bisect with
> key=attrgetter? And claims that O(N) is more efficient than O(log(N))?
Apparently :-)

The documentation may not be completely clear, but what it is arguing is
this: if you are making repeated bisect calls, then using a key function is
inefficient because the key gets lost after each call and has to be
recalculated over and over again.

A concrete example:

data = [i/100 for i in range(1, 7000001, 7)]
data.sort(key=str)
for x in many_x_values:
    bisect.insort(data, x, key=str)  # Pretend this works.

The first time you call insort, the key function (str in this case) will be
called O(log N) times. The second time you call insort, the key function
must be called again, even for the same data points, because the keys str(x)
are thrown away and lost after each call to insort. After M calls to insort,
there will have been on average M*(log(N)+1) calls to the key function. (The
+1 is because x gets str'ed as well, and there are M loops.)

As an alternative, suppose we do this:

data = [i/100 for i in range(1, 7000001, 7)]
data.sort(key=str)
keyed_data = [str(x) for x in data]
for x in many_x_values:
    key = str(x)
    pos = bisect.bisect(keyed_data, key)
    keyed_data.insert(pos, key)
    data.insert(pos, x)

This costs N calls to str in preparing the keyed_data list, plus M calls to
str in the loop. The documentation suggests that this will be cheaper in the
long run, in other words that:

    N + M < M*(log(N) + 1)
    N + M < M + M*log(N)

which is true whenever N < M*log(N).

The documentation assumes we should care about large N and large M. As they
say, "everything is fast for small N": if you are doing three bisections on
a list with ten items, then *who cares* how you do it, it's going to be
fast either way.

So let's say you have a list of 10000 items, that is, N = 10000. Then it is
faster to build a second keyed_data list when:

    10000 < M*log(10000) = M*13    (approximately)

i.e. M > 770, give or take. That's a fairly small number, and it's only
about 8% of the size of N.

So the general perspective is: if you have a list of N items, and you call
bisect more than about (10% of N) times, it will likely be faster to
pre-prepare a keyed list than to call a key function repeatedly on each call
to bisect. If you call bisect fewer than (10% of N) times, then it doesn't
matter, because that is comparatively a small number of calls and will be
fast enough either way.

Is the documentation correct? I don't know; I don't have an opinion on this.
I recommend that you fork the bisect module, call it mybisect, add a key
function parameter, and compare the speed of the two versions, with and
without a key function. (A rough sketch of what such a fork might look like
is appended at the end of this message.) Remember:

- Recent versions of bisect have a C accelerated version. Make sure you
  disable that, so that you are comparing the speed of Python versus Python,
  not Python versus C.

- Nobody will care about small lists. I suggest that you start with at
  least ten thousand items and move up to ten billion or so. For each N in
  (10**4, 10**5, ... 10**10), find out how many calls to insort or bisect it
  takes before the key function version becomes slower. Or possibly faster.

- Is there a difference between fast key functions like str, and slow ones
  that have to do a lot of processing?

--
Steven

--
https://mail.python.org/mailman/listinfo/python-list
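A very rough, untested sketch of what such a forked mybisect with a key
parameter might look like (the module name, the exact signature and the
identity-key default are just illustrative assumptions, not anything from
the stdlib):

# mybisect.py -- rough sketch: pure Python bisect_right/insort_right with an
# optional key function, intended only for timing experiments.

def bisect_right(a, x, lo=0, hi=None, key=None):
    """Return the index where x would be inserted to keep `a` sorted.

    If key is given, `a` is assumed to be already sorted by key, and the
    key is re-evaluated on every probe -- exactly the repeated cost
    discussed above.
    """
    if hi is None:
        hi = len(a)
    if key is None:
        key = lambda v: v            # identity key if none is given
    kx = key(x)                      # one call for the item being searched
    while lo < hi:
        mid = (lo + hi) // 2
        if kx < key(a[mid]):         # one call per probe, ~log2(N) probes
            hi = mid
        else:
            lo = mid + 1
    return lo

def insort_right(a, x, lo=0, hi=None, key=None):
    """Insert x into `a`, keeping `a` sorted (by key, if one is given)."""
    a.insert(bisect_right(a, x, lo, hi, key), x)

As for the first bullet point above, one way I believe you can disable the C
accelerator when timing the stdlib version is to block the _bisect extension
module before bisect is first imported, so that its "from _bisect import *"
fails and the pure Python definitions are left in place (this assumes bisect
has not already been imported in the process):

import sys
sys.modules['_bisect'] = None    # makes "from _bisect import *" raise ImportError
import bisect                    # bisect keeps its pure Python definitions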