On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:

> On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
> [EMAIL PROTECTED]> wrote:
>> Licheng Fang wrote:
>> > I find myself frequently in need of classes like this for two
>> > reasons. First, it's efficient in memory.
>>
>> Are you using millions of objects, or MB size objects? Otherwise, this
>> is no argument.
>
> Yes, millions.

Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
megabytes!!!!1! How will a PC with one or two gigabytes of RAM cope?????

Tens of megabytes is not a lot of data. If the average word size is ten
characters, then one million words takes ten million bytes, or a little
shy of ten megabytes. Even if you are using four-byte characters, you've
got 40 MB, still a moderate amount of data on a modern system.

> In my natural language processing tasks, I almost always
> need to define patterns, identify their occurrences in a huge data, and
> count them. Say, I have a big text file, consisting of millions of
> words, and I want to count the frequency of trigrams:
>
> trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]
>
> I can save the counts in a dict D1. Later, I may want to recount the
> trigrams, with some minor modifications, say, doing it on every other
> line of the input file, and the counts are saved in dict D2. Problem is,
> D1 and D2 have almost the same set of keys (trigrams of the text), yet
> the keys in D2 are new instances, even though these keys probably have
> already been inserted into D1. So I end up with unnecessary duplicates
> of keys. And this can be a great waste of memory with huge input data.

All these keys will almost certainly add up to only a few hundred
megabytes, which is a reasonable size of data but not excessive. This
really sounds to me like a case of premature optimization. I think you
are wasting your time solving a non-problem.

[snip]

> Wow, I didn't know this. But exactly how Python manage these strings? My
> interpretator gave me such results:
>
> >>> a = 'this'
> >>> b = 'this'
> >>> a is b
> True
> >>> a = 'this is confusing'
> >>> b = 'this is confusing'
> >>> a is b
> False

It's an implementation detail. You shouldn't use identity testing unless
you actually care that two names refer to the same object, not because
you want to save a few bytes.
That's poor design: it's fragile, complicated, and defeats the purpose of
using a high-level language like Python.

-- 
Steven.
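
For illustration only, here is a minimal sketch of the point about equality versus identity. It is an assumption-laden example rather than code from the thread: the helper names `trigrams` and `count_trigrams` and the sample sentence are made up, and `sys.intern` is the Python 3 spelling of the `intern()` builtin that existed in 2007.

```python
import sys
from collections import Counter


def trigrams(items):
    """Return every run of three consecutive items as a tuple."""
    return [tuple(items[i:i + 3]) for i in range(len(items) - 2)]


def count_trigrams(words):
    """Count trigram frequencies; equal tuples share one dict entry."""
    return Counter(trigrams(words))


# Hypothetical sample data standing in for the "big text file".
words = "the cat sat on the mat and the cat sat on the rug".split()

d1 = count_trigrams(words)        # first pass over the text
d2 = count_trigrams(list(words))  # second pass builds fresh key tuples

# Dict lookups depend on equality and hashing, not identity, so a key
# built in one pass finds the matching entry from the other pass.
key = ("the", "cat", "sat")
print(d1[key], d2[key])

# Two equal strings built at run time are not guaranteed to be one object,
# so compare them with ==, not with "is".
a = " ".join(["this", "is", "confusing"])
b = " ".join(["this", "is", "confusing"])
print(a == b)

# Explicit interning is the supported way to get a single shared object,
# should profiling ever show the duplicates to be a real cost.
print(sys.intern(a) is sys.intern(b))
```

Whether `a is b` happens to be true for two un-interned strings depends on the interpreter, which is why the sketch never relies on it.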