Hi Tim, thanks for the response.
> - check how you're reading the data: are you iterating over
>   the lines a row at a time, or are you using
>   .read()/.readlines() to pull in the whole file and then
>   operate on that?
I'm using enumerate() on an iterable input (which in this case is the
filehandle).
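
The read loop is shaped roughly like this (path and handle_line are
just stand-ins, not the real names):

    with open(path) as fh:
        for line_num, line in enumerate(fh):
            # one line at a time; the file is never slurped in
            # with .read()/.readlines()
            handle_line(line_num, line)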
> - check how you're storing them: are you holding onto more
>   than you think you are?
I've used ipython to look through my data structures (without going
into ungainly detail: 2 dicts, each with X key/value pairs, where X =
the number of lines in the file), and everything seems to be working
correctly. Like I said, the heapy output looks reasonable - I don't
see anything surprising there. In one dict I'm storing an id string
(the first token in each line of the file), with the value being
(again, without going into massive detail) the md5 of the contents of
the line. The second dict has the md5 as the key and, as the value,
an object with __slots__ set that stores the line number in the file
and the type of object that line represents.
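
Boiled down, it looks something like this (the class name, path and
classify() are just stand-ins for the sketch; Python 2):

    import hashlib

    class Record(object):
        # __slots__ keeps the per-instance overhead down
        __slots__ = ('line_num', 'obj_type')

        def __init__(self, line_num, obj_type):
            self.line_num = line_num
            self.obj_type = obj_type

    id_to_md5 = {}      # first token of the line -> md5 of the line
    md5_to_record = {}  # md5 of the line -> Record(line number, type)

    with open(path) as fh:
        for line_num, line in enumerate(fh):
            id_str = line.split(None, 1)[0]
            digest = hashlib.md5(line).hexdigest()
            id_to_md5[id_str] = digest
            md5_to_record[digest] = Record(line_num, classify(line))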
> Would it hurt to switch from a
> dict to store your data (I'm assuming here) to using the
> anydbm module to temporarily persist the large quantity of
> data out to disk in order to keep memory usage lower?
That's the thing, though - according to heapy, the memory usage *is*
low and is more or less what I expect. What I don't understand is why
top is reporting such vastly different memory usage. If the memory
profiler says everything's OK, it's very difficult to figure out
what's causing the problem. Based on heapy, a db-based solution would
be serious overkill.
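
For what it's worth, the heapy check is nothing exotic - roughly this
(a sketch, not the exact code):

    from guppy import hpy

    hp = hpy()
    hp.setrelheap()   # measure growth from this point on
    # ... build the two dicts from the file ...
    print(hp.heap())  # totals here are small and look sane, while
                      # top reports a far larger resident size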
-MrsE
On 9/24/2012 4:22 PM, Tim Chase wrote:
> On 09/24/12 16:59, MrsEntity wrote:
>> I'm working on some code that parses a 500kb, 2M line file line
>> by line and saves, per line, some derived strings into various
>> data structures. I thus expect that memory use should
>> monotonically increase. Currently, the program is taking up so
>> much memory - even on 1/2-sized files - that on a 2GB machine
>> I'm thrashing swap.
> It might help to know what comprises the "into various data
> structures". I do a lot of ETL work on far larger files,
> with similar machine specs, and rarely touch swap.
>> 2) How can I diagnose (and hopefully fix) what's causing the
>> massive memory usage when it appears, from heapy, that the code
>> is performing reasonably?
> I seem to recall that Python holds on to memory that the VM
> releases, but that it *should* reuse it later. So you'd get
> the symptom of the memory-usage always increasing, never
> decreasing.
> Things that occur to me:
> - check how you're reading the data: are you iterating over
>   the lines a row at a time, or are you using
>   .read()/.readlines() to pull in the whole file and then
>   operate on that?
> - check how you're storing them: are you holding onto more
>   than you think you are? Would it hurt to switch from a
>   dict to store your data (I'm assuming here) to using the
>   anydbm module to temporarily persist the large quantity of
>   data out to disk in order to keep memory usage lower?
> Without actual code, it's hard to do a more detailed
> analysis.
> -tkc