TomF wrote:
> I have a program that manipulates lots of very large indices, which I
> implement as bit vectors (via the bitarray module). These are too
> large to keep all of them in memory, so I have to come up with a way to
> cache and load them from disk as necessary. I've been reading about
> weak references and it looks like they may be what I want.
>
> My idea is to use a WeakValueDictionary to hold references to these
> bitarrays, so Python can decide when to garbage collect them. I then
> keep a key-value database of them (via bsddb) on disk and load them
> when necessary. The basic idea for accessing one of these indexes is:
>
> _idx_to_bitvector_dict = weakref.WeakValueDictionary()
In a well-written script that cache will be almost empty: as soon as the
last strong reference to a bitvector goes away, its entry silently
disappears from the WeakValueDictionary. You should compare the weakref
approach against a least-recently-used caching strategy. In newer Pythons
you can use collections.OrderedDict to implement an LRU cache, or use the
functools.lru_cache decorator.

> def retrieve_index(idx):
>     if idx in _idx_to_bitvector_dict and _idx_to_bitvector_dict[idx] is not None:
>         return _idx_to_bitvector_dict[idx]

In a multi-threaded environment the above may still return None or raise
a KeyError: the value can be garbage-collected between the membership
test and the lookup.

>     else:  # it's been gc'd
>         bv_str = bitvector_from_db[idx]   # Load from bsddb
>         bv = cPickle.loads(bv_str)        # Deserialize the string
>         _idx_to_bitvector_dict[idx] = bv  # Re-initialize the weak dict element
>         return bv
>
> Hopefully that's not too confusing. Comments on this approach? I'm
> wondering whether the weakref stuff isn't duplicating some of the
> caching that bsddb might be doing.

Even if the raw value is cached somewhere, you save the overhead of
deserialisation.

Peter
--
http://mail.python.org/mailman/listinfo/python-list
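A minimal sketch of the OrderedDict-based LRU cache suggested above, in
modern (Python 3) spelling. The capacity and the `loader` callable are
assumptions for illustration; in TomF's setup `loader` would wrap the
bsddb lookup plus unpickling:

```python
from collections import OrderedDict

class LRUCache:
    """Keep at most `capacity` items; evict the least recently used one."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader        # called on a cache miss, e.g. a bsddb lookup
        self._data = OrderedDict()

    def __getitem__(self, key):
        if key in self._data:
            self._data.move_to_end(key)       # mark as most recently used
            return self._data[key]
        value = self.loader(key)              # cache miss: load from disk
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)    # evict the oldest entry
        return value
```

Unlike the weakref scheme, this keeps strong references, so up to
`capacity` bitvectors stay in memory regardless of what the rest of the
program holds on to. (On Python 2, where `move_to_end` does not exist,
pop the key and reinsert it instead.)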
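The check-then-lookup race mentioned above can be closed by doing a
single lookup and catching KeyError. A sketch of that variant, with
`load_from_db` standing in (as an assumption) for the
`cPickle.loads(bitvector_from_db[idx])` step in TomF's code:

```python
import weakref

_idx_to_bitvector_dict = weakref.WeakValueDictionary()

def retrieve_index(idx, load_from_db):
    """Return the cached bitvector for idx, reloading it if it was gc'd.

    One try/except replaces the `in` test plus second subscript, so the
    value cannot disappear between the check and the use.
    """
    try:
        return _idx_to_bitvector_dict[idx]
    except KeyError:                # never cached, or already collected
        bv = load_from_db(idx)      # stand-in for the bsddb + unpickle step
        _idx_to_bitvector_dict[idx] = bv
        return bv
```

Note that WeakValueDictionary only holds an entry while some other
strong reference to the value exists; the caller must keep the returned
object alive for as long as it wants the cache hit.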