On May 26, 6:17 pm, "Jack" <[EMAIL PROTECTED]> wrote:
> I have tens of millions (could be more) of documents in files. Each of
> them has other properties in separate files. I need to check if they
> exist, update and merge properties, etc.
And then save the results where?

Option (0): retain it in memory
Option (1): a file
Option (2): a database

And why are you doing this agglomeration of information? Presumably so
that it can be queried. Do you plan to load the whole file into memory
in order to satisfy a simple query?

> And this is not a one-time job. Because of the quantity of the files,
> I think querying and updating a database will take a long time...

Don't think, benchmark.

> Let's say, I want to do something a search engine needs to do in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.

Having a single hash table permits two not very powerful query methods:
(1) return the data associated with a single hash key
(2) trawl through the whole hash table, applying various conditions to
the data.

If that is all you want, then comparisons with a serious search engine
are quite irrelevant. What is relevant is that the whole hash table has
to be in virtual memory before you can start either type of query. This
is not the case with a database. Type 1 queries (with a suitable index
on the primary key) should use only a fraction of the memory that a
full hash table would.

What is the primary key of your data?
-- 
http://mail.python.org/mailman/listinfo/python-list
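To make the comparison concrete, here is a minimal sketch of a type-1
query using Python's built-in sqlite3 module. The table name, schema,
and sample rows are assumptions for illustration only; the point is
that a PRIMARY KEY gives you an implicit index, so a single-key lookup
reads a few pages from disk instead of requiring the whole data set in
memory the way an in-process hash table does.

```python
import sqlite3

# Hypothetical schema: each document keyed by its path, with a
# properties string to check, update, or merge. The PRIMARY KEY
# creates an implicit index used for single-key lookups.
conn = sqlite3.connect(":memory:")  # use a filename for a persistent db
conn.execute("""
    CREATE TABLE docs (
        path TEXT PRIMARY KEY,
        properties TEXT
    )
""")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("a.txt", "lang=en"), ("b.txt", "lang=fr")],
)
conn.commit()

# Type-1 query: return the data associated with a single key.
# Only the index and the matching row are touched, not the whole table.
row = conn.execute(
    "SELECT properties FROM docs WHERE path = ?", ("a.txt",)
).fetchone()
print(row[0])  # lang=en
```

With tens of millions of rows the same lookup stays cheap, which is
exactly what benchmarking (rather than guessing) would show.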