Steve D'Aprano wrote:

> I have experimented with two solutions, and hope somebody might be able
> to suggest some improvements. Attached is the test code I ran;
> suggestions for improving the performance will be appreciated.
If there was an attachment to this text, it didn't make it to the mailing
list or the news group.

> I decided on these two data structures:
>
> (1) The naive solution: don't worry about duplicate values, or the fact
> that keys come in contiguous groups, and just store each individual key
> and value in a dict; this is faster but uses more memory.
>
> (2) The clever solution: use a pair of lists, one holding the starting
> value of each group of keys, and the other holding the common values.
> This saves a lot of memory, but is slower.
>
> A concrete example might help. Suppose I have 15 keys in five groups:
>
> D = {0: 10,
>      1: 20, 2: 20,
>      3: 30, 4: 30, 5: 30,
>      6: 40, 7: 40, 8: 40, 9: 40,
>      10: 50, 11: 50, 12: 50, 13: 50, 14: 50}
>
> (Remember, in reality I could have as many as a million or two keys.
> This is just a simple toy example.)
>
> Instead of using a dict, I also set up a pair of lists:
>
> L = [0, 1, 3, 6, 10, 15]  # starting value of each group
> V = [10, 20, 30, 40, 50]  # value associated with each group
>
> Note that the first list has one extra item, namely the number one past
> the final group.
>
> I can do a key look-up using either of these:
>
> D[key]
>
> V[bisect.bisect_right(L, key) - 1]
>
> I tested the memory consumption and speed of these two solutions with
> (approximately) one million keys. I found:
>
> - the dict solution uses a lot more memory, about 24 bytes per key,
> compared to the pair of lists solution, which is about 0.3 bytes per
> key;
>
> - but on my computer, dict lookups are about 3-4 times faster.

Only three to four times? You basically get that from a no-op function
call:

$ python3 -m timeit -s 'd = {1: 2}; k = 1' 'd[k]'
10000000 loops, best of 3: 0.0935 usec per loop
$ python3 -m timeit -s 'd = {1: 2}; k = 1' 'd[int(k)]'
1000000 loops, best of 3: 0.304 usec per loop

Even adding a dummy value to V, to go from

V[bisect.bisect_right(L, key) - 1]

to

V[bisect.bisect_right(L, key)]

might be visible in your benchmark (first sketch at the end of this
message).

> Any suggestions for improving the speed of the binary search version,
> or the memory consumption of the dict?

Depending on the usage pattern you might try bisect combined with an LRU
cache (second sketch at the end of this message). If runs of four or
more nearby keys are common you can remember the current span:

# untested, watch out for off-by-one errors ;)
start = stop = 0
for key in data:
    if start <= key < stop:
        pass  # key is still in the cached span, reuse value
    else:
        index = bisect.bisect_right(L, key)
        start, stop = L[index - 1:index + 1]
        value = V[index - 1]
    # use value

(You might also fill V with (value, start, stop) tuples; third sketch at
the end of this message.)

> By the way: the particular pattern of groups in the sample code (groups
> of size 1, 2, 3, ... up to 50, then repeating from 1, 2, 3, ... again)
> is just a demo. In my real data, the sizes of the groups are all over
> the place, in an unpredictable pattern.
>
> Thanks in advance.
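First sketch, the dummy-value trick: prepending a throwaway entry to V
absorbs the "- 1" inside the lookup expression. Untested, and V2 is just
my name for the padded list:

import bisect

L = [0, 1, 3, 6, 10, 15]
V = [10, 20, 30, 40, 50]

# the dummy entry shifts every group value up by one index,
# so the result of bisect_right can be used directly
V2 = [None] + V

for key in range(15):
    assert V2[bisect.bisect_right(L, key)] == V[bisect.bisect_right(L, key) - 1]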
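Second sketch, bisect combined with an LRU cache, reusing L and V from
the first sketch. This only pays off if keys repeat often enough for the
cache to hit; the maxsize below is a guess you would have to tune:

from functools import lru_cache

@lru_cache(maxsize=2**16)  # guessed size, tune to your key distribution
def lookup(key):
    # bisect_right returns the insertion point, which is one past
    # the start of the group the key belongs to
    return V[bisect.bisect_right(L, key) - 1]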
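Third sketch, the span-caching loop with (value, start, stop) tuples,
again reusing L and V; "data" stands for whatever iterable of keys you
are processing, and W is my name for the combined list. Untested in the
same spirit as the loop it replaces:

W = [(value, L[i], L[i + 1]) for i, value in enumerate(V)]

start = stop = 0
value = None
for key in data:
    if not start <= key < stop:
        # a single bisect fetches both the value and the span it covers
        value, start, stop = W[bisect.bisect_right(L, key) - 1]
    # use value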