Ben Hoyt added the comment:

I just hit this issue in a big way -- would have been nice for this fix to go 
into Python 2.7.4. :-)

It was quite hard to track down (as in, a day or two of debugging :-) because 
the symptoms didn't point directly to namedtuple. In our setup we 
pickle/unpickle some big files, and the symptom we noticed was extremely high 
memory usage after *un*pickling -- as in, 3x what we were getting before 
upgrading from Python 2.6. We first tracked it down to unpickling, and from 
there narrowed it down to namedtuple.

The first "fix" I discovered was that I could use pickletools.optimize() to 
reduce the memory-usage-after-unpickling back down to sensible levels. I don't 
know enough about pickle to know exactly why this is -- perhaps fragmentation 
due to extra unpickling data structures allocated on the heap, that optimize() 
removes?
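
For reference, here's roughly what we do (a minimal sketch -- the file names 
are placeholders, not our real ones):

    import pickletools

    # Strip unused PUT opcodes from an existing pickle and write it back out.
    # ("data.pickle" / "data-optimized.pickle" are hypothetical names.)
    with open("data.pickle", "rb") as f:
        optimized = pickletools.optimize(f.read())
    with open("data-optimized.pickle", "wb") as f:
        f.write(optimized)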

Here's the memory usage of our Python process after unpickling a ~9MB pickle 
file (protocol 2) which includes a lot of namedtuples. This is on Python 2.7.4 
64-bit. With the original collections.py -- "normal" means un-optimized pickle, 
"optimized" means run through pickletools.optimize():

Memory usage after loading normal: 106664 KB
Memory usage after loading optimized: 31424 KB

With collections.py modified so namedtuple's class template includes "def 
__getstate__(self): return None":

Memory usage after loading normal: 33676 KB
Memory usage after loading optimized: 26392 KB
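
If you can't patch collections.py, the same workaround can be applied per-type 
by subclassing -- a sketch, with a made-up Point type:

    from collections import namedtuple

    class Point(namedtuple('Point', ['x', 'y'])):
        __slots__ = ()
        def __getstate__(self):
            # Return None so pickle doesn't save the OrderedDict that the
            # __dict__ property would otherwise supply as the state.
            return None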

So you can see the Python 2.7 version of namedtuple makes the process use 
roughly 3x the RAM after unpickling (without pickletools.optimize). Note that 
Python 2.6 does *not* do this (its namedtuple doesn't define __dict__ or use 
OrderedDict, so it doesn't have this issue). And for some reason Python 3.3(.1) 
doesn't have the issue either, even though it does define __dict__ and use 
OrderedDict. I guess Python 3.3 does pickling (or garbage collection?) 
somewhat differently.

You can verify this yourself using the attached unpickletest.py script. Note 
that I'm running on Windows 7, but I presume this would happen on Linux/OS X 
too, as this issue has nothing to do with the OS. The script should work on 
non-Windows OSes, but you have to type in the RAM usage figures manually (using 
"top" or similar).

Note that I'm doing a gc.collect() immediately before fetching the memory 
usage figure, in case there's uncollected cyclical garbage floating around 
that would otherwise skew the measurement.
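
The measurement itself is just this (using the hypothetical current_rss_kb() 
helper sketched above):

    import gc

    gc.collect()  # collect any cyclical garbage before measuring
    print('Memory usage after loading: %d KB' % current_rss_kb())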

I'm not sure I fully understand the cause (where all this memory is going), or 
for that matter the fix. An OrderedDict is being pickled along with each 
namedtuple instance, because __dict__ returns an OrderedDict and pickle uses 
that as the state. But is that state staying in memory after unpickling? And 
why does optimizing the pickle largely fix the RAM usage?
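
One easy way to confirm the OrderedDict is going into the stream is to 
disassemble a small pickle on an affected 2.7 build (sketch):

    import pickle, pickletools
    from collections import namedtuple

    Point = namedtuple('Point', ['x', 'y'])
    # On an affected Python 2.7, the disassembly shows the OrderedDict
    # state being saved and then applied to the instance with BUILD.
    pickletools.dis(pickle.dumps(Point(1, 2), 2))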

In any case, I've made the __getstate__ fix in our code, and that definitely 
fixes the RAM usage for us. (We're also going to be optimizing our pickles from 
now on.)

----------
nosy: +benhoyt
Added file: http://bugs.python.org/file30021/unpickletest.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue15535>
_______________________________________