Hello,

While experimenting with writing an inverted index (see: http://en.wikipedia.org/wiki/Inverted_index), I ran out of memory with a classic dict. (I have thousands of documents and millions of terms; stemming and other filtering are not considered yet, since I wanted to understand how to handle GB of text first.)

I found ZODB and tried using it a bit, but I think I must be misunderstanding how to use it, even after reading http://www.zope.org/Wikis/ZODB/guide/node3.html.
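
For context, the plain-dict approach that runs out of memory looks roughly like the sketch below. It is only a minimal illustration, not my actual code: the function name build_index and the three sample documents are made up, and the line number is used as the document id, as in my sample code.

```python
from collections import defaultdict


def build_index(lines):
    """Map each term to {doc_id: count}, using the line number as doc id."""
    index = defaultdict(dict)
    for doc_id, line in enumerate(lines):
        for word in line.split():
            postings = index[word]
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return index


# Hypothetical sample data, one document per line.
docs = ["the cat sat", "the dog sat down", "the the end"]
index = build_index(docs)
print(sorted(index["the"].items()))  # [(0, 1), (1, 1), (2, 2)]
print(len(index["sat"]))             # 2 documents contain "sat"
```

Everything here lives in RAM, which is what stops working once there are millions of terms.
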
I would like to use it once to build my inverted index, save it to disk via a FileStorage, and then reuse that index from the same FileStorage later. But it looks like I am unable to reload it into memory, or I am missing how to do it.

First, each time I run the code below, it looks like everything is added a second time. Is there a way to rewrite/replace the data instead? And how am I supposed to use it after the initial creation? I thought that opening the same FileStorage would reload my object inside dbroot, but it doesn't.

I am also interested in the cache mechanisms: are they transparent? Or maybe you know a good tutorial for understanding ZODB?

Thanks for any help,
regards.

Here is a sample code:

    import sys

    import transaction
    from BTrees.OOBTree import OOBTree
    from BTrees.OIBTree import OIBTree
    from persistent import Persistent
    from ZODB import FileStorage, DB

    class IDF2:
        def __init__(self):
            self.docs = OIBTree()   # doc -> number of terms seen in doc
            self.idfs = OOBTree()   # term -> OIBTree(doc -> count)

        def add(self, term, fromDoc):
            self.docs[fromDoc] = self.docs.get(fromDoc, 0) + 1
            if not self.idfs.has_key(term):
                self.idfs[term] = OIBTree()
            self.idfs[term][fromDoc] = self.idfs[term].get(fromDoc, 0) + 1

        def N(self, term):
            "total number of occurrences of 'term'"
            return sum(self.idfs[term].values())

        def n(self, term):
            "number of documents containing 'term'"
            return len(self.idfs[term])

        def ndocs(self):
            "number of documents"
            return len(self.docs)

        def __getitem__(self, key):
            return self.idfs[key]

        def iterdocs(self):
            for doc in self.docs.iterkeys():
                yield doc

        def iterterms(self):
            for term in self.idfs.iterkeys():
                yield term

    storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
    db = DB(storage)
    conn = db.open()
    dbroot = conn.root()
    if not dbroot.has_key('idfs'):
        dbroot['idfs'] = IDF2()
    idfs = dbroot['idfs']

    for i, line in enumerate(open(sys.argv[1])):
        # considering doc is linenumber...
        for word in line.split():
            idfs.add(word, i)
    # Commit the change
    transaction.commit()

On a later run, I was expecting:

    storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
    db = DB(storage)
    conn = db.open()
    dbroot = conn.root()
    print dbroot.has_key('idfs')

to return True.