On 12/15/2016 12:06 PM, Steve D'Aprano wrote:
I have some key:value data where the keys often are found in contiguous
ranges with identical values. For example:
{1: "foo",
2: "foo",
3: "foo",
# same for keys 4 through 99
100: "foo",
101: "bar",
102: "bar",
103: "foobar",
104: "bar",
105: "foo",
}
For this particular example (untested):
def lookup(n):
F, B, FB = 'foo', 'bar', 'foobar' # Ensure value objects reused
if 1 <= n <= 100:
return F
elif 101 <= n <= 105:
return (B, B, FB, B, F')[n - 101]
else:
raise ValueError('n must be in range(1, 106)')
So in this case, the keys 1 through 100, plus 105, all have the same value.
I have about a million or two keys, with a few hundred or perhaps a few
thousand distinct values. The size of each contiguous group of keys with
the same value can vary from 1 to perhaps a hundred or so.
Has anyone dealt with data like this and could give a recommendation of the
right data structure to use?
The obvious way is to explicitly list each key, in a regular dict. But I
wonder whether there are alternatives which may be better (faster, more
memory efficient)?
If, as in the example, the keys comprise a contiguous sequence of ints,
the 'obvious' way to me is sequence of values. A tuple of constants
gets compiled and will load quickly from a .pyc file. 4 or 8 bytes per
entry times 1 or 2 million entries is usually tolerable on a gigabyte
machine.
Or trade time for space with binary search or search in explicit binary
tree. Or combine binary search and indexed lookup as I did above.
--
Terry Jan Reedy
--
https://mail.python.org/mailman/listinfo/python-list