> > 2) Maintaining a copy wastes memory
> > 3) I don't have a good solution if I delete items from the set
> > (calculating the difference will return an empty set but I need to
> > actually delete stuff).
>
> (3) is easy -- the difference originalset-finalset is the set of things
> you have to delete.

Now why didn't I think of that?
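Right - so at save time it's just a pair of differences (toy values and
obvious names below, nothing tied to my actual file format):

    original = set(["a", "b", "c"])   # as loaded from the file
    final = set(["b", "c", "d"])      # after in-memory edits
    to_delete = original - final      # set(["a"]): remove on disk
    to_add = final - original         # set(["d"]): append on disk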
> Does anyone have any ideas?
>
> Your problem is really a thinly disguised version of one of the most
> ancient and venerable ones in data processing -- the "master file /
> transaction file" problem.  With sets (which are built as hash tables)
> you have very powerful weapons for this fray (it used to be, many years
> ago, that sorting and "in-step processing of sorted files" was the only
> feasible approach to such problems), but you still have to wield them
> properly.
>
> In your shoes, I would write a class whose instances hold three sets:
> -- the "master set" is what you originally read from the file
> -- the "added set" is the set of things you've added since then
> -- the "deleted set" is the set of things you've deleted since then
>
> You can implement a set-like interface; for example, your .add method
> will need to remove the item from the deleted set (if it was there),
> then add it to the added set (if it wasn't already in the master set);
> and so on, for each and every method you actually desire (including no
> doubt __contains__ for membership testing).  E.g., with suitably
> obvious names for the three sets, we could have...:
>
> class cleverset(object):
>     ...
>     def __contains__(self, item):
>         if item in self.deleted: return False
>         elif item in self.added: return True
>         else: return item in self.master
>
> When the time comes to save back to disk, you'll perform deletions and
> additions based on self.deleted and self.added, of course.

Ok. This class sounds like a good idea. I'll try it (I've sketched a
fuller version below).

> I'm not addressing the issue of how to best represent such sets on
> disk because it's not obvious to me what operations you need to
> perform on the on-disk representation -- for example, deletions can be
> very costly unless you can afford to simply set a "logically deleted"
> flag on deleted items; but, depending on how often you delete stuff,
> that will eventually degrade performance due to fragmentation, and
> maintaining the indexing (if you need one) can be a bear.

I actually don't have very many transactions I need to perform on disk.
Once I've loaded these sets, I normally use the contains operation and
union/intersection/difference operations.  I can definitely use a
tombstone on the record to mark it as logically deleted.

The problem right now is that I don't need indexing within sets (I
either load the whole set or don't bother at all) - I'm only indexing
the entire file (one entry in the index per tablespace).  So even if I
use tombstones, I'll still have to go through the whole tablespace to
find the tombstoned entries.

Alternatively, I could write the deleted data at the end of the
tablespace and mark it with the tombstone (since I don't care about
space so much), and when I load the data, I can eliminate the
tombstoned entries.  I'm going to give that a try and report my
performance results on this thread - I absolutely love that I can make
a change this important in one or two days using Python!
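Roughly what I have in mind - the record layout below (a "+"/"-" flag
prefix, one record per line) is made up just for this sketch, not our
actual tablespace format:

    def append_record(path, item, tombstone=False):
        # a delete never touches existing data; it just appends a
        # tombstoned copy of the record at the end of the tablespace
        f = open(path, "a")
        f.write("%s%s\n" % ("-" if tombstone else "+", item))
        f.close()

    def load_set(path):
        # replay the whole tablespace; tombstoned entries eliminate
        # their earlier copies, so the caller never sees them
        items = set()
        for line in open(path):
            flag, rec = line[0], line[1:].rstrip("\n")
            if flag == "+":
                items.add(rec)
            else:
                items.discard(rec)
        return items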
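And here's a first cut at filling in the rest of Alex's cleverset
skeleton on top of that log - everything beyond __contains__ is my own
guess at what the methods should do:

    class cleverset(object):
        def __init__(self, path):
            self.path = path
            self.master = load_set(path)   # from the sketch above
            self.added = set()
            self.deleted = set()

        def __contains__(self, item):
            if item in self.deleted: return False
            elif item in self.added: return True
            else: return item in self.master

        def add(self, item):
            self.deleted.discard(item)
            if item not in self.master:
                self.added.add(item)

        def discard(self, item):
            self.added.discard(item)
            if item in self.master:
                self.deleted.add(item)

        def save(self):
            # perform deletions and additions based on self.deleted
            # and self.added, then start over with a clean master
            for item in self.added:
                append_record(self.path, item)
            for item in self.deleted:
                append_record(self.path, item, tombstone=True)
            self.master = (self.master | self.added) - self.deleted
            self.added = set()
            self.deleted = set()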
> Normally, I would use a good relational database (which has no doubt
> already solved on my behalf all of these issues) for purposes of
> persistent storage of such data -- usually sqlite for reasonably small
> amounts of data, PostgreSQL if I need enormous scalability and other
> "enterprise" features, but I hear other people are happy with many
> other engines for relational DBs, such as Oracle (used to be superb if
> you could afford full-time specialized db admins), IBM's DB2, SAP's
> database (now opensourced), Firebird, even MySQL (I've never, but
> NEVER, had any good experience with the latter, but I guess maybe it's
> only me, as otherwise it could hardly be so popular).  The main point
> here is that a relational DB's tables _should_ be already very well
> optimized for storing "sets", since those tables ARE exactly nothing
> but sets of tuples (and a 1-item tuple is a tuple for all that:-).

Thanks Alex, but we're actually implementing a (non-relational)
database engine.  This particular file is one of many files (which are
all kept in sync by the engine) that I need to persist (some files use
other methods - including pickling - for persistence).  I've already
solved most of our other problems, and according to hotshot, this is
one of the next hurdles (this commit is responsible for 8% of the
total CPU time @ 3000 items and 3 files).

If you're interested, you can check it out at
http://www.brainwavelive.com or email me.

Prateek
-- 
http://mail.python.org/mailman/listinfo/python-list