On 12/15/2016 12:06 PM, Steve D'Aprano wrote:
I have some key:value data where the keys often are found in contiguous
ranges with identical values. For example:

{1: "foo",
 2: "foo",
 3: "foo",
 # same for keys 4 through 99
 100: "foo",
 101: "bar",
 102: "bar",
 103: "foobar",
 104: "bar",
 105: "foo",
}

For this particular example (untested):

def lookup(n):
    F, B, FB = 'foo', 'bar', 'foobar'  # Ensure value objects reused
    if 1 <= n <= 100:
        return F
    elif 101 <= n <= 105:
        return (B, B, FB, B, F')[n - 101]
    else:
        raise ValueError('n must be in range(1, 106)')

So in this case, the keys 1 through 100, plus 105, all have the same value.

I have about a million or two keys, with a few hundred or perhaps a few
thousand distinct values. The size of each contiguous group of keys with
the same value can vary from 1 to perhaps a hundred or so.

Has anyone dealt with data like this and could give a recommendation of the
right data structure to use?

The obvious way is to explicitly list each key, in a regular dict. But I
wonder whether there are alternatives which may be better (faster, more
memory efficient)?

If, as in the example, the keys comprise a contiguous sequence of ints, the 'obvious' way to me is sequence of values. A tuple of constants gets compiled and will load quickly from a .pyc file. 4 or 8 bytes per entry times 1 or 2 million entries is usually tolerable on a gigabyte machine.

Or trade time for space with binary search or search in explicit binary tree. Or combine binary search and indexed lookup as I did above.


--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list

Reply via email to