On Thursday 02 August 2007, Joshua J. Kugler wrote:
> I am using shelve to store some data since it is probably the best solution
> to my "data formats, number of columns, etc can change at any time"
> problem. However, I seem to be dealing with bloat.
>
> My original data is 33MB. When each row is converted to python lists, and
> inserted into a shelve DB, it balloons to 69MB. Now, there is some
> additional data in there namely a list of all the keys containing data (vs.
> the keys that contain version/file/config information), BUT if I copy all
> the data over to a dict and dump the dict to a file using cPickle, that
> file is only 49MB. I'm using pickle protocol 2 in both cases.
>
> Is this expected? Is there really that much overhead to using shelve and
> dbm files? Are there any similar solutions that are more space efficient?
> I'd use straight pickle.dump, but loading requires pulling the entire thing
> into memory, and I don't want to have to do that every time.
>
> [Note, for those that might suggest a standard DB. Yes, I'd like to use a
> regular DB, but I have a domain where the number of data points in a sample
> may change at any time, so a timestamp-keyed dict is arguably the best
> solution, thus my use of shelve.]
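The comparison described in the quoted message can be reproduced on a small scale. The following is only a rough sketch in Python 3 (the original thread used Python 2 and cPickle); the helper names `dir_size` and `compare_sizes` and the sample data are made up for illustration, and the numbers it prints depend on the dbm backend installed, so they will not match the 69MB/49MB figures above.

```python
import os
import pickle
import shelve
import tempfile


def dir_size(path):
    """Total size in bytes of the regular files directly inside path."""
    return sum(
        os.path.getsize(os.path.join(path, name))
        for name in os.listdir(path)
        if os.path.isfile(os.path.join(path, name))
    )


def compare_sizes(rows, protocol=2):
    """Return (shelve_bytes, pickle_bytes) for the same mapping of rows."""
    with tempfile.TemporaryDirectory() as shelf_dir, \
            tempfile.TemporaryDirectory() as pickle_dir:
        # shelve may create one or several files depending on the dbm
        # backend, so measure the whole directory, not a single file name.
        db = shelve.open(os.path.join(shelf_dir, "data"), protocol=protocol)
        try:
            for key, row in rows.items():
                db[key] = row
        finally:
            db.close()

        # One pickle of the whole dict, as in the quoted comparison.
        with open(os.path.join(pickle_dir, "data.pkl"), "wb") as f:
            pickle.dump(rows, f, protocol=protocol)

        return dir_size(shelf_dir), dir_size(pickle_dir)


# Made-up sample data: timestamp-like string keys mapping to short lists.
sample = {"t%06d" % i: [i, i * 0.5, str(i)] for i in range(1000)}
shelf_bytes, pickle_bytes = compare_sizes(sample)
print("shelve:", shelf_bytes, "bytes; single pickle:", pickle_bytes, "bytes")
```

Measuring the whole directory rather than one file keeps the sketch independent of which dbm module shelve happens to pick.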
Have you considered a directory full of pickle files? (In effect, replacing
the dbm with the file system.)

That is, something like this (untested):

    import os
    import cPickle

    class DirShelf(dict):
        def __init__(self, dirname):
            self.dir = dirname
            self.__repl_dict = {}  # cache of values loaded or stored so far

        def __contains__(self, key):
            assert isinstance(key, str)
            assert key.isalnum()  # or similar portable check for is-name-ok
            return os.path.exists(os.path.join(self.dir, key))

        def has_key(self, key):
            return key in self

        def __getitem__(self, key):
            try:
                if key not in self.__repl_dict:
                    f = open(os.path.join(self.dir, key), 'rb')
                    try:
                        # load() takes no protocol argument; the protocol
                        # is detected from the pickle data itself
                        self.__repl_dict[key] = cPickle.load(f)
                    finally:
                        f.close()
                return self.__repl_dict[key]
            except IOError, e:
                raise KeyError(e)

        def __setitem__(self, key, val):
            assert isinstance(key, str)
            assert key.isalnum()  # or similar portable check for is-name-ok
            self.__repl_dict[key] = val
            self.flush()

        def flush(self):
            # writes every cached value back, one pickle file per key
            for k, v in self.__repl_dict.iteritems():
                f = open(os.path.join(self.dir, k), 'wb')
                try:
                    cPickle.dump(v, f, 2)  # pickle protocol 2
                finally:
                    f.close()

        def __del__(self):
            self.flush()

-- 
Regards,
Thomas Jollans

GPG key: 0xF421434B may be found on various keyservers, eg pgp.mit.edu
Hacker key <http://hackerkey.com/>:
v4sw6+8Yhw4/5ln3pr5Ock2ma2u7Lw2Nl7Di2e2t3/4TMb6HOPTen5/6g5OPa1XsMr9p-7/-6
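For readers on Python 3, the same directory-of-pickles idea might look roughly like the sketch below. It is not the code from the message above: the names `_cache`, `_dirty`, `close` and `roundtrip_demo` are assumptions added for illustration, it writes back only the keys that changed instead of rewriting every cached value on each assignment, and it raises `ValueError` for unusable keys where the original used `assert`.

```python
import os
import pickle
import tempfile


class DirShelf:
    """Dict-like store that keeps each value in its own pickle file."""

    def __init__(self, dirname, protocol=2):
        self.dir = dirname
        self.protocol = protocol
        self._cache = {}     # values loaded or assigned in this session
        self._dirty = set()  # keys whose files are out of date
        os.makedirs(dirname, exist_ok=True)

    def _path(self, key):
        # Keys double as file names, so restrict them to a portable subset.
        if not (isinstance(key, str) and key.isalnum()):
            raise ValueError("key must be an alphanumeric string: %r" % (key,))
        return os.path.join(self.dir, key)

    def __contains__(self, key):
        return key in self._cache or os.path.exists(self._path(key))

    def __getitem__(self, key):
        if key not in self._cache:
            try:
                with open(self._path(key), "rb") as f:
                    self._cache[key] = pickle.load(f)
            except FileNotFoundError:
                raise KeyError(key) from None
        return self._cache[key]

    def __setitem__(self, key, value):
        self._path(key)  # validates the key before caching anything
        self._cache[key] = value
        self._dirty.add(key)

    def flush(self):
        # Write only the entries changed since the last flush.
        for key in self._dirty:
            with open(self._path(key), "wb") as f:
                pickle.dump(self._cache[key], f, protocol=self.protocol)
        self._dirty.clear()

    def close(self):
        self.flush()


def roundtrip_demo():
    """Store one made-up row, reopen the directory, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        shelf = DirShelf(tmp)
        shelf["20070802T1200"] = [1.5, 2.5, 3.5]
        shelf.close()
        return DirShelf(tmp)["20070802T1200"]
```

Calling `close()` explicitly, rather than relying on `__del__`, avoids depending on when the interpreter finalizes the object. Values mutated in place after assignment are not tracked by this sketch and would need to be reassigned before flushing.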
-- http://mail.python.org/mailman/listinfo/python-list