[issue14775] Slow unpickling of certain dictionaries in python 2.7 vs python 2.6

2012-05-10 Thread stw

New submission from stw :

I've found that unpickling a certain kind of dictionary is substantially slower 
in python 2.7 compared to python 2.6. The dictionary has keys that are tuples 
of strings - a 1-tuple is enough to see the effect. The problem seems to be 
caused by garbage collection, as turning it off eliminates the slowdown. Both 
pickle and cPickle modules are affected.


I've attached two files to demonstrate this. The file 'make_file.py'
creates a dictionary of specified size, with keys containing 1-tuples of random 
strings. It then dumps the dictionary to a pickle file using a specified pickle 
module.

The file 'load_file.py' unpickles the file created by 'make_file.py', using a 
specified pickle module, and prints the time taken. The code can be run with 
garbage collection either on or off.

The results below are for a dictionary of 20 entries. Each entry is the 
time taken in seconds with garbage collection on / garbage collection off. The 
row headings are the module used to pickle the data, the column headings the 
module used to unpickle it.


python 2.6, n = 20

   sizepickle  cPickle
pickle 4.3M3.02/2.65   0.786/0.559 
cPickle3.4M2.27/2.04   0.66/0.443 


python 2.7, n = 20

   sizepickle  cPickle
pickle 4.3M10.5/2.67   6.62/0.563 
cPickle2.4M1.45/1.39   0.362/0.325 


When pickle is used to pickle the data, there is a significant slowdown in 
python 2.7 compared to python 2.6 with garbage collection on. With garbage 
collection off the times in python 2.7 are essentially identical to those in 
python 2.6.

When cPickle is used to pickle the data, both unpicklers are faster in python 
2.7 than in python 2.6. Presumably the speedup is due to the dictionary 
optimizations introduced from issue #5670.


Both pickle and cPickle show a slowdown when data pickled in python 2.6 is 
unpickled in python 2.7:


pickled in python 2.6, unpickled in python 2.7, n = 20

  sizepickle (2.7)cPickle (2.7)
pickle (2.6)  4.3M10.4/2.66   6.64/0.56 
cPickle (2.6) 3.4M8.73/2.08   6.1/0.452 


I don't know enough about the internals of the pickle modules or garbage 
collector to offer an explanation/fix. The list of optimizations for python 2.7 
indicates changes to both pickle modules (issues #5670 and #5084) and the 
garbage collector (issues #4074 and #4688). It seems possible that the slowdown 
is the result of some interaction between these changes.


Further notes:

1. System details: python 2.6.5 and python 2.7.3 on Ubuntu 10.04, 1.73GHz 
Pentium M processor.

2. Only pickle files created with protocols 1 and 2 are affected. Pickling with 
protocol 0 gives similar timings on python 2.6 and 2.7.

3. The fact that the dictionary's keys are tuples is relevant, although the 
length of the tuple is not. Unpickling a dictionary whose keys are strings does 
not show any slowdown.

------
files: make_file.py
messages: 160368
nosy: stw
priority: normal
severity: normal
status: open
title: Slow unpickling of certain dictionaries in python 2.7 vs python 2.6
type: performance
versions: Python 2.7
Added file: http://bugs.python.org/file25524/make_file.py

___
Python tracker 
<http://bugs.python.org/issue14775>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14775] Slow unpickling of certain dictionaries in python 2.7 vs python 2.6

2012-05-10 Thread stw

Changes by stw :


Added file: http://bugs.python.org/file25525/load_file.py

___
Python tracker 
<http://bugs.python.org/issue14775>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14775] Slow unpickling of certain dictionaries in python 2.7 vs python 2.6

2012-05-21 Thread stw

stw  added the comment:

Thanks for the nod to the pickletools module. The structure of the streams do 
change between python 2.6 and 2.7, at least for files created with cPickle. You 
can see this from the file sizes that I quoted in my previous post. Below are 
some snippets of the disassemblies of pickle files generated using make_file.py:

pickle, python 2.6 and 2.7:

0: \x80 PROTO  2
2: }EMPTY_DICT
3: qBINPUT 0
5: (MARK
6: USHORT_BINSTRING 'ArLLX'
   13: qBINPUT 1
   15: \x85 TUPLE1
   16: qBINPUT 2
   18: KBININT10
   ...

cPickle, python 2.6:

0: \x80 PROTO  2
2: }EMPTY_DICT
3: qBINPUT 1
5: (MARK
6: USHORT_BINSTRING 'XYoGB'
   13: \x85 TUPLE1
   14: qBINPUT 2
   16: KBININT10
   ...

cPickle, python 2.7:

0: \x80 PROTO  2
2: }EMPTY_DICT
3: qBINPUT 1
5: (MARK
6: USHORT_BINSTRING 'QGgRa'
   13: \x85 TUPLE1
   14: KBININT11
   ...

As you can see, pickle memoizes both the string and the tuple. cPickle2.6 just 
memoizes the tuple, while cPickle2.7 doesn't memoize either. It therefore seems 
that the problem is somehow related to the memo...

--

___
Python tracker 
<http://bugs.python.org/issue14775>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14775] Slow unpickling of certain dictionaries in python 2.7 vs python 2.6

2012-05-21 Thread stw

stw  added the comment:

One other thing - python 2.7 added the is_tracked function to the gc module. As 
I understand it (from issue #4074) tuples of atomic types are supposed to be 
untracked. However, in python 2.7:

>>> gc.is_tracked((1, 2))
True
>>> gc.is_tracked("hello")
True

--

___
Python tracker 
<http://bugs.python.org/issue14775>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14775] Dict untracking can result in quadratic dict build-up

2012-05-22 Thread stw

stw  added the comment:

I'd come to the same conclusion - as the new dict is built up (using batch 
build) it keeps appearing in generation 0, and the gc has to walk over it to 
determine that it should be untracked. 

However, this only seems to be true if the pickle file is created using pickle 
- it doesn't happen with files generated with cPickle. With cPickle the dict 
remains tracked, and passes from generation 0 to 1 to 2. The only difference is 
that pickle memoizes the tuples, while cPickle doesn't. Why does the 
memoization make a difference?

--

___
Python tracker 
<http://bugs.python.org/issue14775>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14775] Dict untracking can result in quadratic dict build-up

2012-05-22 Thread stw

stw  added the comment:

> Probably because memoization itself uses a dict.

Right, but as far as I can tell it's not the memo dict that keeps being 
tracked/untracked. Rather, it's the dict that is being unpickled.

Anyway, I suppose the point is that the issue of whether an object is 
tracked/untracked is not solely determined by its type:

1. All containers are tracked by default.

2. Tuples can only become untracked after a generation 0 gc pass.

3. With the new patch, dicts can only become untracked after a generation 2 gc 
pass.

4. Also, am I right in thinking that whether a container gets untracked or not 
depends not only on its contents, but also on the order of the objects in the 
gc list? That is, all of the contents of a container must get untracked before 
the container is considered.

--

___
Python tracker 
<http://bugs.python.org/issue14775>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14775] Dict untracking can result in quadratic dict build-up

2012-05-23 Thread stw

stw  added the comment:

I had a thought about untracking tuples. If a tuple contains only immutable 
objects (atomics and tuples of atomics etc), then it should be untracked. Once 
untracked, it will never need to be tracked again since the tuple is immutable. 
If a tuple contains mutable objects, it will always need to be tracked.

I was wondering whether it is possible to determine whether a tuple needs to be 
tracked or not the first time it appears in generation 0 - tuples in older 
generations would then not need to be considered.

--

___
Python tracker 
<http://bugs.python.org/issue14775>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14775] Dict untracking can result in quadratic dict build-up

2012-05-23 Thread stw

stw  added the comment:

> I had a thought about untracking tuples. If a tuple contains only
> immutable objects (atomics and tuples of atomics etc), then it should
> be untracked. Once untracked, it will never need to be tracked again
> since the tuple is immutable. If a tuple contains mutable objects, it
> will always need to be tracked.

> True. However, some tuples may be in an unfinished state (they are 
> being built up internally and a GC collection occurred in the middle).

So the tuple is linked-in to the garbage collection list before its contents 
are constructed?

> I was wondering whether it is possible to determine whether a tuple
> needs to be tracked or not the first time it appears in generation 0 -
> tuples in older generations would then not need to be considered.

> I'm not sure that would make much of a difference in practice. Most
> tuples are very short, so checking them for untracking should be
> reasonably fast.

Could tuples not be untracked at creation time then, or do not enough survive 
to gc to make this worthwhile?

--

___
Python tracker 
<http://bugs.python.org/issue14775>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14775] Dict untracking can result in quadratic dict build-up

2012-05-25 Thread stw

stw  added the comment:

> So the tuple is linked-in to the garbage collection list before its
> contents are constructed?

> It is. It typically happens when you do (in C code):

Ok, thanks. I couldn't see how a tuple could be created before its contents in 
python code, but it is clearly possible in C.


I have written up some documentation on how untracking is handled in the gc - 
please see the attached patch. It summarises our discussion and some of the 
posts in issue #4688.

--
Added file: http://bugs.python.org/file25711/untracking_docs.patch

___
Python tracker 
<http://bugs.python.org/issue14775>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com