Dan Stromberg wrote:
> I've been putting a little bit of time into a file indexing engine [...]

To solve the O.P.'s first problem, the facility we need is an
efficient externally-stored multimap. A multimap is like a map,
except that each key is associated with a collection of values,
not just a single value. Obviously we could simply encode multiple
values into a single string -- and that's what the O.P. did -- but
updating large strings is inefficient: every added value means
reading and rewriting the whole string.

Fortunately, the standard Python distribution now includes an
efficient multimap facility, though the standard library doc does
not yet say so. In the current version, the bsddb module is built
on bsddb3, which exposes far more features of the Berkeley DB
library than the documented bsddb interface does:

    http://pybsddb.sourceforge.net/bsddb3.html

Sleepycat Software's Berkeley DB library supports an option of
mapping keys to multiple values:

    http://sleepycat.com/docs/ref/am_conf/dup.html

Below is a simple example.

--Bryan


import bsddb


def add_words_from_file(index, fname, word_iterator):
    """ Pass the open-for-write bsddb database (B-Tree or hash,
        opened with DB_DUP), a filename, and a list (or any
        iterable) of the words in the file.
    """
    s = set()
    for word in word_iterator:
        if word not in s:
            s.add(word)
            # With DB_DUP set, put() adds another value under the
            # key instead of replacing the existing one.
            index.put(word, fname)
    index.sync()
    print


def lookup(index, word):
    """ Pass the index (as built with add_words_from_file) and a
        word to look up. Returns list of files containing the word.
    """
    l = []
    cursor = index.cursor()
    # Position on the first entry for the key, then walk its
    # duplicates.
    item = cursor.set(word)
    while item is not None:
        l.append(item[1])
        item = cursor.next_dup()
    cursor.close()
    return l


def test():
    env = bsddb.db.DBEnv()
    env.open('.', bsddb.db.DB_CREATE | bsddb.db.DB_INIT_MPOOL)
    db = bsddb.db.DB(env)
    # Set DB_DUP before open() so one key can hold several values.
    db.set_flags(bsddb.db.DB_DUP)
    db.open(
        'junktest.bdb',
        None,
        bsddb.db.DB_HASH,
        bsddb.db.DB_CREATE | bsddb.db.DB_TRUNCATE)
    data = [
        ('bryfile.txt', 'nor heed the rumble of a distant drum'),
        ('junkfile.txt', 'this is the beast, the beast so sly'),
        ('word file.txt', 'is this the way it always is here in Baltimore')
    ]
    for (fname, text) in data:
        words = text.split()
        add_words_from_file(db, fname, words)
    for word in ['is', 'the', 'heed', 'this', 'way']:
        print '"%s" is in files: %s' % (word, lookup(db, word))


test()