Re: Memory usage per top 10x usage per heapy
Hi Tim, thanks for the response.

> - check how you're reading the data: are you iterating over the lines
>   a row at a time, or are you using .read()/.readlines() to pull in
>   the whole file and then operate on that?

I'm using enumerate() on an iterable input (which in this case is the
filehandle).

> - check how you're storing them: are you holding onto more than you
>   think you are?

I've used ipython to look through my data structures (without going into
ungainly detail, 2 dicts with X key/value pairs each, where X = number of
lines in the file), and everything seems to be working correctly. Like I
say, the heapy output looks reasonable - I don't see anything surprising
there.

In one dict I'm storing an ID string (the first token in each line of the
file) with the md5 of the contents of the line as the value (again,
without going into massive detail). The second dict has the md5 as the
key, and as the value an object with __slots__ set that stores the line
number of the file and the type of object that line represents.

> Would it hurt to switch from a dict to store your data (I'm assuming
> here) to using the anydbm module to temporarily persist the large
> quantity of data out to disk in order to keep memory usage lower?

That's the thing though - according to heapy, the memory usage *is* low
and is more or less what I expect. What I don't understand is why top is
reporting such vastly different memory usage. If a memory profiler is
saying everything's ok, it makes it very difficult to figure out what's
causing the problem. Based on heapy, a db-based solution would be
serious overkill.

-MrsE

On 9/24/2012 4:22 PM, Tim Chase wrote:
> On 09/24/12 16:59, MrsEntity wrote:
>> I'm working on some code that parses a 500kb, 2M line file line by
>> line and saves, per line, some derived strings into various data
>> structures. I thus expect that memory use should monotonically
>> increase. Currently, the program is taking up so much memory - even
>> on 1/2 sized files - that on a 2GB machine I'm thrashing swap.
>
> It might help to know what comprises the "into various data
> structures". I do a lot of ETL work on far larger files, with similar
> machine specs, and rarely touch swap.
>
>> 2) How can I diagnose (and hopefully fix) what's causing the massive
>> memory usage when it appears, from heapy, that the code is
>> performing reasonably?
>
> I seem to recall that Python holds on to memory that the VM releases,
> but that it *should* reuse it later. So you'd get the symptom of the
> memory usage always increasing, never decreasing.
>
> Things that occur to me:
>
> - check how you're reading the data: are you iterating over the lines
>   a row at a time, or are you using .read()/.readlines() to pull in
>   the whole file and then operate on that?
>
> - check how you're storing them: are you holding onto more than you
>   think you are? Would it hurt to switch from a dict to store your
>   data (I'm assuming here) to using the anydbm module to temporarily
>   persist the large quantity of data out to disk in order to keep
>   memory usage lower?
>
> Without actual code, it's hard to do a more detailed analysis.
>
> -tkc
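To make the layout concrete, here is a minimal sketch of the kind of
two-dict structure described above. All names here (LineInfo, input.txt)
are invented for illustration; apart from the 'Measurement' type string,
which appears later in the thread, none of this is the poster's actual
code.

    import hashlib

    class LineInfo(object):
        # __slots__ avoids a per-instance __dict__, as described above
        __slots__ = ('lineNumber', 'typeStr')
        def __init__(self, lineNumber, typeStr):
            self.lineNumber = lineNumber
            self.typeStr = typeStr

    id_to_md5 = {}    # first token of each line -> md5 hex of the line
    md5_to_info = {}  # md5 hex -> LineInfo(line number, type of object)

    with open('input.txt') as fh:              # hypothetical input file
        for lineno, line in enumerate(fh):
            # assumes every line starts with an ID token
            token = line.split(None, 1)[0]
            digest = hashlib.md5(line).hexdigest()
            id_to_md5[token] = digest
            md5_to_info[digest] = LineInfo(lineno, 'Measurement')

Per line this keeps two smallish str objects and one slotted object;
anything else created while parsing a line is thrown away, which is
consistent with heapy reporting only the retained structures later in
the thread.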
Re: Memory usage per top 10x usage per heapy
> Just curious; which is it, two million lines, or half a million bytes?

> I have, in fact, this very afternoon, invented a means of writing a
> carriage return character using only 2 bits of information. I am
> prepared to sell licenses to this revolutionary technology for the low
> price of $29.95 plus tax.

Sorry, that should've been a 500Mb, 2M line file.

> which machine is 2gb, the Windows machine, or the VM?

VM. Winders is 4gb.

> ...but I would point out that just because you free up the memory from
> the Python doesn't mean it gets released back to the system. The C
> runtime manages its own heap, and is pretty persistent about hanging
> onto memory once obtained. It's not normally a problem, since most
> small blocks are reused. But it can get fragmented. And I have no idea
> how well Virtual Box maps the Linux memory map into the Windows one.

Right, I understand that - but what's confusing me is that, given that
memory use is (I assume) monotonically increasing, the code should never
use more than what's reported by heapy once all the data is loaded into
memory, given that memory released by the code to the Python runtime is
reused. To the best of my ability to tell I'm not storing anything I
shouldn't, so the only thing I can think of is that all the object
creation and destruction is, for some reason, preventing reuse of
memory. I'm at a bit of a loss regarding what to try next.

Cheers, MrsE

On 9/24/2012 6:14 PM, Dave Angel wrote:
> On 09/24/2012 05:59 PM, MrsEntity wrote:
>> Hi all, I'm working on some code that parses a 500kb, 2M line file
>
> Just curious; which is it, two million lines, or half a million bytes?
>
>> line by line and saves, per line, some derived strings into various
>> data structures. I thus expect that memory use should monotonically
>> increase. Currently, the program is taking up so much memory - even
>> on 1/2 sized files - that on a 2GB machine
>
> which machine is 2gb, the Windows machine, or the VM? You could get
> thrashing at either level.
>
>> I'm thrashing swap. What's strange is that heapy
>> (http://guppy-pe.sourceforge.net/) is showing that the code uses
>> about 10x less memory than reported by top, and the heapy data seems
>> consistent with what I was expecting based on the objects the code
>> stores. I tried using memory_profiler
>> (http://pypi.python.org/pypi/memory_profiler) but it didn't really
>> provide any illuminating information. The code does create and
>> discard a number of objects per line of the file, but they should
>> not be stored anywhere, and heapy seems to confirm that. So, my
>> questions are:
>>
>> 1) For those of you kind enough to help me figure out what's going
>> on, what additional data would you like? I didn't want to swamp
>> everyone with the code and heapy/memory_profiler output but I can do
>> so if it's valuable.
>>
>> 2) How can I diagnose (and hopefully fix) what's causing the massive
>> memory usage when it appears, from heapy, that the code is
>> performing reasonably?
>>
>> Specs: Ubuntu 12.04 in Virtualbox on Win7/64, Python 2.7/64
>>
>> Thanks very much.
>
> Tim raised most of my concerns, but I would point out that just
> because you free up the memory from the Python doesn't mean it gets
> released back to the system. The C runtime manages its own heap, and
> is pretty persistent about hanging onto memory once obtained. It's not
> normally a problem, since most small blocks are reused. But it can get
> fragmented. And I have no idea how well Virtual Box maps the Linux
> memory map into the Windows one.
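Dave's point about the allocator hanging onto freed memory is easy to
watch directly: compare heapy's total against the process's resident set
size before and after dropping a large structure. A rough sketch (Linux
only, and assuming guppy/heapy is installed; heapy's result sets expose
a total byte count as .size):

    from guppy import hpy

    def rss_kb():
        """Return the VmRSS of this process in kB (Linux only)."""
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])
        return -1

    h = hpy()
    data = [str(i) * 10 for i in xrange(10 ** 6)]   # hold ~a million strings
    print 'loaded : heapy %d bytes, RSS %d kB' % (h.heap().size, rss_kb())
    del data
    print 'deleted: heapy %d bytes, RSS %d kB' % (h.heap().size, rss_kb())

heapy's total drops back after the del, but the RSS the kernel reports
(which is what top shows) typically stays high, because CPython and the
C library keep the freed arenas around for reuse. That would explain a
once-high-now-low pattern, though, not a steady ~10x gap while the data
is still live.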
Re: Memory usage per top 10x usage per heapy
> I'm a bit surprised you aren't beyond the 2gb limit, just with the
> structures you describe for the file. You do realize that each object
> has quite a few bytes of overhead, so it's not surprising to use
> several times the size of a file, to store the file in an organized
> way.

I did some back-of-the-envelope calcs which more or less agreed with
heapy. The code stores 1 string, which is, on average, about 50 chars or
so, and one MD5 hex string per line of the file. There's about 40 bytes
or so of overhead per string per sys.getsizeof(). I'm also storing an
int (24b) and a <10 char string in an object with __slots__ set. Each
object, per heapy (this is one area where I might be underestimating
things), takes 64 bytes plus instance variable storage, so per line:

    50 + 32 + 10 + 3 * 40 + 24 + 64 = 300 bytes per line
    300 bytes * 2M lines = ~600MB

plus some memory for the dicts, which is about what heapy is reporting
(note I'm currently not actually running all 2M lines, I'm just running
subsets for my tests). Is there something I'm missing?

Here's the heapy output after loading ~300k lines:

    Partition of a set of 1199849 objects. Total size = 89965376 bytes.
     Index   Count   %      Size   %  Cumulative    %  Kind
         0      59  50  38399920  43    38399920   43  str
         1       5   0  25167224  28    63567144   71  dict
         2      28  25  19199872  21    82767016   92  0xa13330
         3  299836  25   7196064   8    89963080  100  int
         4       4   0      1152   0    89964232  100  collections.defaultdict

Note that 3 of the dicts are empty. I assume that 0xa13330 is the
address of the object. I'd actually expect to see 900k strings, but the
<10 char string is always the same in this case, so perhaps the runtime
is using the same object...? At this point, top reports python as using
1.1g of virt and 1.0g of res.

> I also wonder if heapy has been written to take into account the
> larger size of pointers in a 64bit build.

That I don't know, but that would only explain, at most, a 2x increase
in memory over the heapy report, wouldn't it? Not the ~10x I'm seeing.

> Another thing is to make sure that the md5 object used in your two
> maps is the same object, and not just one with the same value.

That's certainly the way the code is written, and heapy seems to confirm
that the strings aren't duplicated in memory.

Thanks for sticking with me on this,
MrsE

On 9/25/2012 4:06 AM, Dave Angel wrote:
> On 09/25/2012 12:21 AM, Junkshops wrote:
>>> Just curious; which is it, two million lines, or half a million
>>> bytes?
>>
>> Sorry, that should've been a 500Mb, 2M line file.
>>
>>> which machine is 2gb, the Windows machine, or the VM?
>>
>> VM. Winders is 4gb.
>>
>>> ...but I would point out that just because you free up the memory
>>> from the Python doesn't mean it gets released back to the system.
>>> The C runtime manages its own heap, and is pretty persistent about
>>> hanging onto memory once obtained. It's not normally a problem,
>>> since most small blocks are reused. But it can get fragmented. And
>>> I have no idea how well Virtual Box maps the Linux memory map into
>>> the Windows one.
>>
>> Right, I understand that - but what's confusing me is that, given
>> that memory use is (I assume) monotonically increasing, the code
>> should never use more than what's reported by heapy once all the
>> data is loaded into memory, given that memory released by the code
>> to the Python runtime is reused. To the best of my ability to tell
>> I'm not storing anything I shouldn't, so the only thing I can think
>> of is that all the object creation and destruction is, for some
>> reason, preventing reuse of memory. I'm at a bit of a loss regarding
>> what to try next.
>
> I'm not familiar with heapy, but perhaps it's missing something there.
>
> I'm a bit surprised you aren't beyond the 2gb limit, just with the
> structures you describe for the file. You do realize that each object
> has quite a few bytes of overhead, so it's not surprising to use
> several times the size of a file, to store the file in an organized
> way.
>
> I also wonder if heapy has been written to take into account the
> larger size of pointers in a 64bit build.
>
> Perhaps one way to save space would be to use a long to store those
> md5 values. You'd have to measure it, but I suspect it'd help (at the
> cost of lots of extra hexlify-type calls).
>
> Another thing is to make sure that the md5 object used in your two
> maps is the same object, and not just one with the same value.
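Dave's last two suggestions are cheap to test in isolation with
sys.getsizeof. A sketch (the sizes in the comments are typical for a
64-bit CPython 2.7 and vary by build; the example line and dict names
are made up):

    import sys
    import hashlib

    line = 'MLR_124572462 some measurement data here\n'  # made-up example line
    hex_key = hashlib.md5(line).hexdigest()   # 32-char hex str
    int_key = int(hex_key, 16)                # same 128-bit value as a long

    print sys.getsizeof(hex_key)   # ~69 bytes for the 32-char str
    print sys.getsizeof(int_key)   # ~44 bytes for the 128-bit long

    # Re storing "the same object, not just one with the same value":
    # bind the key once and reuse that one reference in both dicts, so the
    # second dict costs a pointer rather than a second copy of the key.
    id_to_md5 = {}
    md5_to_info = {}
    key = int_key
    id_to_md5['MLR_124572462'] = key
    md5_to_info[key] = None   # placeholder for the __slots__ object

Converting back to hex when reporting a clash is just '%032x' % key, so
the extra hexlify-style calls Dave mentions only happen on the error
path.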
Re: Memory usage per top 10x usage per heapy
> Can you give an example of how these data structures look after
> reading only the first 5 lines?

Sure, here you go:

    In [38]: mpef._ustore._store
    Out[38]: defaultdict(<...>,
      {'Measurement': {'8991c2dc67a49b909918477ee4efd767': <...>,
       '7b38b429230f00fe4731e60419e92346': <...>,
       'b53531471b261c44d52f651add647544': <...>,
       '44ea6d949f7c8c8ac3bb4c0bf4943f82': <...>,
       '0de96f928dc471b297f8a305e71ae3e1': <...>}})

    In [39]: mpef._ustore._store['Measurement']['b53531471b261c44d52f651add647544'].typeStr
    Out[39]: 'Measurement'

    In [40]: mpef._ustore._store['Measurement']['b53531471b261c44d52f651add647544'].lineNumber
    Out[40]: 5

    In [41]: mpef._ustore._idstore
    Out[41]: defaultdict(<class 'micropheno.exchangeformat.KBaseID.IDStore'>,
      {'Measurement': <...>})

    In [43]: mpef._ustore._idstore['Measurement']._SIDstore
    Out[43]: defaultdict(<function <lambda> at 0x2ece7d0>,
      {'emailRemoved': defaultdict(<function <lambda> at 0x2c4caa0>,
        {'microPhenoShew2011': defaultdict(<type 'dict'>,
          {0: {'MLR_124572462': '8991c2dc67a49b909918477ee4efd767',
               'MLR_124572161': '7b38b429230f00fe4731e60419e92346',
               'SMMLR_12551352': 'b53531471b261c44d52f651add647544',
               'SMMLR_12551051': '0de96f928dc471b297f8a305e71ae3e1',
               'SMMLR_12550750': '44ea6d949f7c8c8ac3bb4c0bf4943f82'}})})})

-MrsE

On 9/25/2012 4:33 AM, Oscar Benjamin wrote:
> On 25 September 2012 00:58, Junkshops <junksh...@gmail.com> wrote:
>> Hi Tim, thanks for the response.
>>
>>> - check how you're reading the data: are you iterating over the
>>>   lines a row at a time, or are you using .read()/.readlines() to
>>>   pull in the whole file and then operate on that?
>>
>> I'm using enumerate() on an iterable input (which in this case is
>> the filehandle).
>>
>>> - check how you're storing them: are you holding onto more than
>>>   you think you are?
>>
>> I've used ipython to look through my data structures (without going
>> into ungainly detail, 2 dicts with X key/value pairs each, where X =
>> number of lines in the file), and everything seems to be working
>> correctly. Like I say, the heapy output looks reasonable - I don't
>> see anything surprising there.
>>
>> In one dict I'm storing an ID string (the first token in each line
>> of the file) with the md5 of the contents of the line as the value
>> (again, without going into massive detail). The second dict has the
>> md5 as the key, and as the value an object with __slots__ set that
>> stores the line number of the file and the type of object that line
>> represents.
>
> Can you give an example of how these data structures look after
> reading only the first 5 lines?
>
> Oscar
Re: Memory usage per top 10x usage per heapy
On 9/25/2012 11:17 AM, Oscar Benjamin wrote:
> On 25 September 2012 19:08, Junkshops <junksh...@gmail.com> wrote:
>> In [38]: mpef._ustore._store
>> Out[38]: defaultdict(<...>,
>>   {'Measurement': {'8991c2dc67a49b909918477ee4efd767': <...>,
>>    '7b38b429230f00fe4731e60419e92346': <...>,
>>    'b53531471b261c44d52f651add647544': <...>,
>>    '44ea6d949f7c8c8ac3bb4c0bf4943f82': <...>,
>>    '0de96f928dc471b297f8a305e71ae3e1': <...>}})
>
> Have these exceptions been raised from somewhere before being stored?
> I wonder if you're inadvertently keeping execution frames alive. There
> are some problems in CPython with this that are related to storing
> exceptions.

FileContext objects aren't exceptions. They store information about
where the stored object originally came from, so that if there's an MD5
or ID clash with a later line in the file the code can report both the
current line and the older clashing line to the user. I have an
Exception subclass that takes a FileContext as an argument. There are no
exceptions thrown in the file I processed to get the heapy results
earlier in the thread.

>> In [43]: mpef._ustore._idstore['Measurement']._SIDstore
>> Out[43]: defaultdict(<function <lambda> at 0x2ece7d0>,
>>   {'emailRemoved': defaultdict(<function <lambda> at 0x2c4caa0>,
>>     {'microPhenoShew2011': defaultdict(<type 'dict'>,
>>       {0: {'MLR_124572462': '8991c2dc67a49b909918477ee4efd767',
>>            'MLR_124572161': '7b38b429230f00fe4731e60419e92346',
>>            'SMMLR_12551352': 'b53531471b261c44d52f651add647544',
>>            'SMMLR_12551051': '0de96f928dc471b297f8a305e71ae3e1',
>>            'SMMLR_12550750': '44ea6d949f7c8c8ac3bb4c0bf4943f82'}})})})
>
> Also I think lambda functions might be able to keep the frame alive.
> Are they by any chance being created in a function that is called in a
> loop?

Here's the context for the lambdas:

    def __init__(self):
        self._SIDstore = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

So the lambda is only being called when a new key is added to the top 3
levels of the data structure, which in the test case I've been
discussing only happens once each.

Although the suggestion to change the hex strings to ints is a good one
and I'll do it, what I'm really trying to understand is why there's such
a large difference between the memory use reported by top (and the fact
that the code appears to thrash swap) and the memory use reported by
heapy and my calculations of how much memory the code should be using.

Cheers, MrsEntity
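For what it's worth, if there were any lingering doubt about the
lambdas, the same nesting can be built from module-level factory
functions, which cannot capture an enclosing frame (and make the
structure picklable as a bonus). A sketch, not the actual code; the
sample keys are taken from the session output above:

    from collections import defaultdict

    def _leaf():
        return defaultdict(dict)

    def _middle():
        return defaultdict(_leaf)

    def make_sidstore():
        # same shape as defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
        return defaultdict(_middle)

    sidstore = make_sidstore()
    sidstore['emailRemoved']['microPhenoShew2011'][0]['MLR_124572462'] = \
        '8991c2dc67a49b909918477ee4efd767'

Either way, the factories are only a handful of small objects per
IDStore, so they are not where the memory is going.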
Re: Memory usage per top 10x usage per heapy
On 9/25/2012 11:50 AM, Dave Angel wrote:
> I suspect that heapy has some limitation in its reporting, and that's
> what the discrepancy is.

That would be my first suspicion as well - except that heapy's results
agree so well with what I expect, and I can't think of any reason I'd be
using 10x more memory. If heapy is wrong, then I need to try and figure
out what's using up all that memory some other way... but I don't know
what that way might be.

> ... can be an expensive proposition if you are building millions of
> them. So can nested functions with non-local variable references, in
> case you have any of those.

Not as far as I know.

Cheers, MrsEntity
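One "other way" that doesn't depend on heapy is to ask the garbage
collector what it can see and total that up with sys.getsizeof. It's
rough - gc only tracks container objects, so the strings and ints
themselves won't show up - but if some container type appears in far
greater numbers than expected, that points at what's being retained. A
sketch:

    import gc
    import sys
    from collections import Counter

    def census(limit=10):
        objs = gc.get_objects()   # container objects currently tracked by gc
        counts = Counter(type(o).__name__ for o in objs)
        total = sum(sys.getsizeof(o) for o in objs)
        return counts.most_common(limit), total

    top_types, total = census()
    for name, n in top_types:
        print '%10d  %s' % (n, name)
    print 'gc-tracked containers: ~%d bytes (strings/ints not included)' % total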
Re: Memory usage per top 10x usage per heapy
On 9/25/2012 2:17 PM, Oscar Benjamin wrote:
> I don't know whether it would be better or worse but it might be worth
> seeing what happens if you replace the FileContext objects with
> tuples.

I originally used a string, and it was slightly better since you don't
have the object overhead, but I wanted to code to an interface for the
context information, so I started a Context abstract class that
FileContext inherits from (both have __slots__ set). Using an object
without __slots__ set was a disaster. However, the difference between a
string and an object with __slots__ isn't severe.

> I can't see anything wrong with that but then I'm not sure if the
> lambda function always keeps its frame alive. If there's only that one
> line in the __init__ function then I'd expect it to be fine.

That's it, I'm afraid.

> Perhaps you could see what objgraph comes up with:
> http://pypi.python.org/pypi/objgraph
>
> So far as I know objgraph doesn't tell you how big objects are but it
> does give a nice graphical representation of which objects are alive
> and which other objects they are referenced by. You might find that
> some other object is kept alive that you didn't expect.

I'll give it a shot and see what happens.

Cheers, MrsEntity
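A minimal way to use objgraph for this, assuming the package from the
link above is installed (the calls below are part of objgraph's
documented API; parse_more_lines is a hypothetical stand-in for the real
parsing work):

    import objgraph

    def parse_more_lines():
        # stand-in for processing another chunk of the file
        pass

    # Counts of the most common live types; if FileContext (or anything
    # else) far exceeds the number of lines processed, something is being
    # retained unexpectedly.
    objgraph.show_most_common_types(limit=20)

    # Called before and after a batch of work, shows which type counts
    # grew in between.
    objgraph.show_growth()
    parse_more_lines()
    objgraph.show_growth()

    # To see what keeps a few specific objects alive (needs graphviz):
    # objgraph.show_backrefs(objgraph.by_type('FileContext')[:3],
    #                        filename='filecontext-refs.png')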