So, in answer to some of the questions:

- There are about 15 files, each roughly representing a table.
- Within the files, each line represents a record.
- The formatting for the lines is like so:

  File1: someval1|ID|someval2|someval3|etc.
  File2: ID|someval1|someval2|someval3|etc.

  Where ID is the one and only value linking "records" from one file to "records" in another file. Moreover, as far as I can tell, the relationships are all 1:1 (or 1:0) (I don't have the full dataset yet, just a sampling, so I'm flying a bit in the dark).
- I believe that individual "records" within each of the files are unique with respect to the identifier (again, not certain because I'm only working with sample data).
- As the example shows, the position of the ID is not the same for all files.
- I don't know how big N is since I only have a sample to work with, and probably won't get the full dataset anytime soon. (Let's just take it as a given that I won't get that information until AFTER a first implementation...politics.)
- I don't know how many identifiers there are either, although the number has to be at least as large as the number of lines in the largest file (again, I don't have the actual data yet).

So as an exercise, let's assume an 800MB file, with each line of data taking up roughly 150B (a guesstimate based on examination of sample data)...so roughly 5.3 million unique IDs. With that size, I'll have to load them into a temp db. I just can't see holding that much data in memory...
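
For what it's worth, here is a minimal sketch of the temp-db idea. It assumes SQLite through Python's standard sqlite3 module (I haven't settled on a database, so that is just a convenient stand-in). The names load_file and joined, the table names, and the sample records are all made up for illustration. Each line is stored whole under its ID, and the ID position is passed per file because it differs between files.

```python
import sqlite3


def load_file(conn, table, lines, id_pos, sep="|"):
    """Store each record under its ID; id_pos is the 0-based field index of the ID."""
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" (id TEXT PRIMARY KEY, line TEXT)'
    )

    def rows():
        for raw in lines:
            line = raw.rstrip("\n")
            if line:  # skip blank lines
                yield line.split(sep)[id_pos], line

    # PRIMARY KEY makes a duplicate ID raise sqlite3.IntegrityError,
    # which checks the "IDs are unique per file" assumption for free.
    with conn:
        conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows())


def joined(conn, left, right):
    """Pair records by ID; LEFT JOIN keeps 1:0 records, with None on the right."""
    sql = (
        f'SELECT l.id, l.line, r.line FROM "{left}" AS l '
        f'LEFT JOIN "{right}" AS r ON r.id = l.id ORDER BY l.id'
    )
    return conn.execute(sql).fetchall()


# Hypothetical sample data in the two layouts described above.
file1 = ["a1|17|b1|c1\n", "a2|42|b2|c2\n"]  # ID is the second field
file2 = ["17|x1|y1|z1\n"]                   # ID is the first field

conn = sqlite3.connect(":memory:")  # a file path would keep real data on disk
load_file(conn, "file1", file1, id_pos=1)
load_file(conn, "file2", file2, id_pos=0)
result = joined(conn, "file1", "file2")
for row in result:
    print(row)
```

With the real files, an open file object can be passed as lines so that nothing beyond the current record is held in memory, and the database would live on disk instead of ":memory:".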