About indexes everywhere: Yes, you don't have to be a DB expert to know that indexes everywhere is bad. But look at this example. There are really only two ways the data is going to get accessed in regular use: either a query asks for all the files that contain a given word (most likely), or it asks which words are in a given file.
I'm going to have to name the tables now, aren't I? Here's a simple schema:

    words
    --------
    word_id
    word

    files
    ------
    file_id
    filename

    word_files
    --------------
    file_id
    word_id

If you are going to look up by word, you'll need an index on words.word. You'll also need an index on word_files.word_id, and then an index on files.file_id. If you are going to look up by file, you'll need an index on files.filename, word_files.file_id, and words.word_id. So in this situation you end up needing indexes everywhere.

Now, when you are doing the initial population, you should drop all the indexes you don't need during population. That means everything but words.word has to go. (You'll need that one to find the word_id for previously seen words.) After the initial population is the time to build and add the indexes. It's much faster to build an index when you have the entire set of data in front of you than to do it piece-by-piece. Some indexes actually get built better that way than they would've been piecemeal.

Unfortunately this is no longer strictly topical to Python. But if you free your mind from thinking in terms of SQL databases and look at indexes as dicts or whatnot, then you can see that this is really a general programming problem.

-- http://mail.python.org/mailman/listinfo/python-list
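To make the populate-first, index-later idea concrete, here's a rough sketch using the stdlib sqlite3 module. The table names come from the schema above; the index names and the add_file helper are made up for illustration, and the only index kept during loading is the one on words.word, as described:

```python
import sqlite3

# In-memory database with the three tables from the schema above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE files (file_id INTEGER PRIMARY KEY, filename TEXT);
    CREATE TABLE word_files (file_id INTEGER, word_id INTEGER);
""")

# During the initial population, keep only the index needed to find
# the word_id of previously seen words.
conn.execute("CREATE UNIQUE INDEX idx_words_word ON words (word)")

def add_file(filename, words):
    """Hypothetical loader: insert one file and its words."""
    file_id = conn.execute(
        "INSERT INTO files (filename) VALUES (?)", (filename,)).lastrowid
    for w in words:
        row = conn.execute(
            "SELECT word_id FROM words WHERE word = ?", (w,)).fetchone()
        if row is None:
            word_id = conn.execute(
                "INSERT INTO words (word) VALUES (?)", (w,)).lastrowid
        else:
            word_id = row[0]
        conn.execute(
            "INSERT INTO word_files (file_id, word_id) VALUES (?, ?)",
            (file_id, word_id))

add_file("a.txt", ["spam", "eggs"])
add_file("b.txt", ["spam"])

# Only after the bulk load do we build the remaining indexes,
# each in one pass over the complete data.
conn.execute("CREATE INDEX idx_wf_word ON word_files (word_id)")
conn.execute("CREATE INDEX idx_wf_file ON word_files (file_id)")
conn.execute("CREATE INDEX idx_files_name ON files (filename)")

# Lookup by word: all files containing "spam".
files_with_spam = [r[0] for r in conn.execute("""
    SELECT f.filename FROM files f
    JOIN word_files wf ON wf.file_id = f.file_id
    JOIN words w ON w.word_id = wf.word_id
    WHERE w.word = ?""", ("spam",))]
```

The lookup-by-file query is the mirror image: filter on files.filename and select words.word instead.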