On Nov 27, 10:45 am, Steven D'Aprano <[EMAIL PROTECTED]> wrote:
> On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:
> > I mentioned trigram counting as an illustrative case. In fact, you'll
> > often need to define patterns more complex than that, and tens of
> > megabytes of text may generate millions of them, and I've observed they
> > quickly ate up the 8G memory of a workstation in a few minutes.
> > Manipulating these patterns can be tricky, you can easily waste a lot of
> > memory without taking extra care. I just thought if I define my pattern
> > class with this 'atom' property, coding efforts could be easier later.
>
> I'm just not getting the same results as you when I try this. I'm finding
> that with no extra effort at all, it just works.
>
> The size of your corpus is not important. Neither is the complexity of
> how you generate the patterns. What's important is the number of patterns
> you produce, and "millions" isn't that huge a number, even without
> interning the strings.
>
> Obviously I'm not running your code, but I can build a dict with millions
> of patterns, from hundreds of megabytes of text, on a PC with just 1GB of
> memory and not run into any serious problems.
>
> I've just processed roughly 240MB of random emails, generating n-grams up
> to length 5. The emails include binary attachments and HTML etc., so I'm
> getting lots of patterns that don't normally exist in natural languages
> (e.g. 71 occurrences of 'qqq', and 5 of 'qqqq'). As I said, my PC has
> only 1GB, and that's being shared with about a dozen other apps (including
> such memory hogs as Firefox).
>
> Results? 64939962 patterns found, of which 17031467 are unique. There's
> paging, yes, and my PC runs a little slow when I try to switch from one
> running application to another, but nothing unusable. Opening a dozen
> YouTube videos at once impacts performance worse.
>
> I can't think what you're doing to use up 8GB of RAM for merely
> "millions" of strings, even if you are keeping two, three, ten redundant
> copies. Assuming an average length of twenty bytes per pattern (you said
> trigrams, but I'd rather over-estimate than under), and even assuming
> that only half the 8GB are available to Python, you should be able to
> store something of the order of one hundred million patterns easily:

My task is identifying sequences of tokens (phrases) that are possible
translations of each other, from a bilingual corpus. I need to check all
the subsequences of a sentence to get the possible phrase pairs. This
makes the problem different from n-gram counting in that the number of
possible phrases doesn't grow linearly with the sentence length n, but
approximately with n^2. My first version of the program consumed almost
twice as much memory as the current one: I discovered that, in computing
different statistics, I was regenerating the patterns, so the counting
dictionaries ended up with duplicated pattern keys (a == b, yet a is not
b). Wouldn't it be convenient if I could restrict the pattern class so
that it never creates two identical instances? That way I could avoid
such subtle but significant bugs.
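Concretely, something along these lines is what I have in mind. This is
an untested sketch, and the class and attribute names are just
placeholders: __new__ keeps a class-level cache keyed by the token tuple,
so constructing the same pattern twice hands back the same object.

class Pattern(object):
    # Class-level cache: token tuple -> the single Pattern instance for it.
    _cache = {}

    def __new__(cls, tokens):
        key = tuple(tokens)
        try:
            return cls._cache[key]
        except KeyError:
            self = object.__new__(cls)
            self.tokens = key
            cls._cache[key] = self
            return self

    def __repr__(self):
        return 'Pattern(%r)' % (self.tokens,)

# Pattern(['the', 'cat']) is Pattern(['the', 'cat'])  -->  True,
# so a counting dict keyed by patterns can't pick up duplicate keys.

If keeping every pattern alive in _cache forever turned out to be a
problem, a weakref.WeakValueDictionary could be swapped in for the plain
dict.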
> 4 bytes for a pointer plus 20 bytes for the string = 24 bytes
> 4*1024**3 / 24 = 178,956,970
>
> (This is a ballpark figure. Real Python strings will have additional
> overhead.)
>
> If you're running into problems with merely millions of patterns, then
> you're doing worse by probably two orders of magnitude.
>
> I don't think that the problem lies where you think it does. If you have
> a dict with millions of keys, it doesn't matter how many times each
> pattern exists in the corpus because the key only exists *once* in the
> dict. Duplicate the dict, or generate it from scratch even, and at worst
> you double the memory used by the keys. And probably not even that.
>
> The only thing I can think of that might explain why you're using so much
> memory is if you are generating *all* the patterns up front, say in a
> list, before adding them to the dict:
>
> # Generate one massive list of patterns containing many duplicates
> patterns = make_patterns(text)
> # returns a massive list like ['fre', 'req', 'equ', 'que' ...]
> d = {}
> for pattern in patterns:
>     d[pattern] = d.get(pattern, 0) + 1

No, I wasn't doing that. BTW, do you think the pattern counting code can
avoid hashing the pattern twice? Is there a way to do that when the
dictionary values are of a primitive type? (The best I've come up with so
far is sketched at the end of this message.)

> Notice that the real killer in the above algorithm is that you need
> enough storage, not just for the unique patterns, but for EVERY separate
> occurrence of each pattern. Nothing to do with how dicts operate, and
> everything to do with the algorithm you (hypothetically) are using.
>
> If that's what you're doing, then no wonder you're running out of memory.
> With 200MB of text, you have 209715198 trigrams in your list. The
> pointers alone will take almost a gigabyte, assuming 32-bit pointers.
>
> If this is your approach, interning the strings won't save you. You
> almost certainly should change to a lazy approach, and use a generator to
> make each pattern as needed, then throw it away:
>
> def make_patterns(s, n=3):
>     """Yield n-grams."""
>     if len(s) <= n:
>         yield s
>     else:
>         for i in range(len(s)-n+1):
>             yield s[i:i+n]
>
> d = {}
> fp = open('corpus', 'r')
> for line in fp:
>     for word in line.split():
>         for pattern in make_patterns(word.strip()):
>             d[pattern] = d.get(pattern, 0) + 1
>
> Obviously I haven't seen your code and don't actually know what you are
> doing. If my 1GB machine can deal with a dict of 17 million unique keys
> from a corpus of 240MB with no special memory management, your 8GB
> machine should too.
>
> --
> Steven.
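For reference, here is a simplified sketch of the kind of counting loop I
mean (not my real code; the file name 'corpus' and the maximum phrase
length are placeholders). collections.defaultdict(int) at least drops the
explicit .get() call, although counts[phrase] += 1 still looks the key up
twice, and I don't know of a way to get a true single-lookup increment
when the values are plain ints.

from collections import defaultdict

def subsequences(tokens, max_len=7):
    # Yield every contiguous subsequence (phrase) of up to max_len tokens.
    # Without the length cap, a sentence of n tokens has about n**2/2.
    n = len(tokens)
    for length in range(1, min(max_len, n) + 1):
        for start in range(n - length + 1):
            yield tuple(tokens[start:start + length])

counts = defaultdict(int)      # missing keys start at 0
fp = open('corpus', 'r')       # placeholder file name
for line in fp:
    for phrase in subsequences(line.split()):
        counts[phrase] += 1    # no .get(), but still two dict lookups
fp.close()

The phrase tuples here could just as well be wrapped in the interning
Pattern class sketched earlier in this message, which would rule out
duplicated keys by construction.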