[issue14775] Slow unpickling of certain dictionaries in python 2.7 vs python 2.6
New submission from stw : I've found that unpickling a certain kind of dictionary is substantially slower in python 2.7 compared to python 2.6. The dictionary has keys that are tuples of strings - a 1-tuple is enough to see the effect. The problem seems to be caused by garbage collection, as turning it off eliminates the slowdown. Both pickle and cPickle modules are affected. I've attached two files to demonstrate this. The file 'make_file.py' creates a dictionary of specified size, with keys containing 1-tuples of random strings. It then dumps the dictionary to a pickle file using a specified pickle module. The file 'load_file.py' unpickles the file created by 'make_file.py', using a specified pickle module, and prints the time taken. The code can be run with garbage collection either on or off. The results below are for a dictionary of 20 entries. Each entry is the time taken in seconds with garbage collection on / garbage collection off. The row headings are the module used to pickle the data, the column headings the module used to unpickle it. python 2.6, n = 20 sizepickle cPickle pickle 4.3M3.02/2.65 0.786/0.559 cPickle3.4M2.27/2.04 0.66/0.443 python 2.7, n = 20 sizepickle cPickle pickle 4.3M10.5/2.67 6.62/0.563 cPickle2.4M1.45/1.39 0.362/0.325 When pickle is used to pickle the data, there is a significant slowdown in python 2.7 compared to python 2.6 with garbage collection on. With garbage collection off the times in python 2.7 are essentially identical to those in python 2.6. When cPickle is used to pickle the data, both unpicklers are faster in python 2.7 than in python 2.6. Presumably the speedup is due to the dictionary optimizations introduced from issue #5670. Both pickle and cPickle show a slowdown when data pickled in python 2.6 is unpickled in python 2.7: pickled in python 2.6, unpickled in python 2.7, n = 20 sizepickle (2.7)cPickle (2.7) pickle (2.6) 4.3M10.4/2.66 6.64/0.56 cPickle (2.6) 3.4M8.73/2.08 6.1/0.452 I don't know enough about the internals of the pickle modules or garbage collector to offer an explanation/fix. The list of optimizations for python 2.7 indicates changes to both pickle modules (issues #5670 and #5084) and the garbage collector (issues #4074 and #4688). It seems possible that the slowdown is the result of some interaction between these changes. Further notes: 1. System details: python 2.6.5 and python 2.7.3 on Ubuntu 10.04, 1.73GHz Pentium M processor. 2. Only pickle files created with protocols 1 and 2 are affected. Pickling with protocol 0 gives similar timings on python 2.6 and 2.7. 3. The fact that the dictionary's keys are tuples is relevant, although the length of the tuple is not. Unpickling a dictionary whose keys are strings does not show any slowdown. ------ files: make_file.py messages: 160368 nosy: stw priority: normal severity: normal status: open title: Slow unpickling of certain dictionaries in python 2.7 vs python 2.6 type: performance versions: Python 2.7 Added file: http://bugs.python.org/file25524/make_file.py ___ Python tracker <http://bugs.python.org/issue14775> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue14775] Slow unpickling of certain dictionaries in python 2.7 vs python 2.6
Changes by stw : Added file: http://bugs.python.org/file25525/load_file.py ___ Python tracker <http://bugs.python.org/issue14775> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue14775] Slow unpickling of certain dictionaries in python 2.7 vs python 2.6
stw added the comment: Thanks for the nod to the pickletools module. The structure of the streams do change between python 2.6 and 2.7, at least for files created with cPickle. You can see this from the file sizes that I quoted in my previous post. Below are some snippets of the disassemblies of pickle files generated using make_file.py: pickle, python 2.6 and 2.7: 0: \x80 PROTO 2 2: }EMPTY_DICT 3: qBINPUT 0 5: (MARK 6: USHORT_BINSTRING 'ArLLX' 13: qBINPUT 1 15: \x85 TUPLE1 16: qBINPUT 2 18: KBININT10 ... cPickle, python 2.6: 0: \x80 PROTO 2 2: }EMPTY_DICT 3: qBINPUT 1 5: (MARK 6: USHORT_BINSTRING 'XYoGB' 13: \x85 TUPLE1 14: qBINPUT 2 16: KBININT10 ... cPickle, python 2.7: 0: \x80 PROTO 2 2: }EMPTY_DICT 3: qBINPUT 1 5: (MARK 6: USHORT_BINSTRING 'QGgRa' 13: \x85 TUPLE1 14: KBININT11 ... As you can see, pickle memoizes both the string and the tuple. cPickle2.6 just memoizes the tuple, while cPickle2.7 doesn't memoize either. It therefore seems that the problem is somehow related to the memo... -- ___ Python tracker <http://bugs.python.org/issue14775> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue14775] Slow unpickling of certain dictionaries in python 2.7 vs python 2.6
stw added the comment: One other thing - python 2.7 added the is_tracked function to the gc module. As I understand it (from issue #4074) tuples of atomic types are supposed to be untracked. However, in python 2.7: >>> gc.is_tracked((1, 2)) True >>> gc.is_tracked("hello") True -- ___ Python tracker <http://bugs.python.org/issue14775> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue14775] Dict untracking can result in quadratic dict build-up
stw added the comment: I'd come to the same conclusion - as the new dict is built up (using batch build) it keeps appearing in generation 0, and the gc has to walk over it to determine that it should be untracked. However, this only seems to be true if the pickle file is created using pickle - it doesn't happen with files generated with cPickle. With cPickle the dict remains tracked, and passes from generation 0 to 1 to 2. The only difference is that pickle memoizes the tuples, while cPickle doesn't. Why does the memoization make a difference? -- ___ Python tracker <http://bugs.python.org/issue14775> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue14775] Dict untracking can result in quadratic dict build-up
stw added the comment: > Probably because memoization itself uses a dict. Right, but as far as I can tell it's not the memo dict that keeps being tracked/untracked. Rather, it's the dict that is being unpickled. Anyway, I suppose the point is that the issue of whether an object is tracked/untracked is not solely determined by its type: 1. All containers are tracked by default. 2. Tuples can only become untracked after a generation 0 gc pass. 3. With the new patch, dicts can only become untracked after a generation 2 gc pass. 4. Also, am I right in thinking that whether a container gets untracked or not depends not only on its contents, but also on the order of the objects in the gc list? That is, all of the contents of a container must get untracked before the container is considered. -- ___ Python tracker <http://bugs.python.org/issue14775> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue14775] Dict untracking can result in quadratic dict build-up
stw added the comment: I had a thought about untracking tuples. If a tuple contains only immutable objects (atomics and tuples of atomics etc), then it should be untracked. Once untracked, it will never need to be tracked again since the tuple is immutable. If a tuple contains mutable objects, it will always need to be tracked. I was wondering whether it is possible to determine whether a tuple needs to be tracked or not the first time it appears in generation 0 - tuples in older generations would then not need to be considered. -- ___ Python tracker <http://bugs.python.org/issue14775> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue14775] Dict untracking can result in quadratic dict build-up
stw added the comment: > I had a thought about untracking tuples. If a tuple contains only > immutable objects (atomics and tuples of atomics etc), then it should > be untracked. Once untracked, it will never need to be tracked again > since the tuple is immutable. If a tuple contains mutable objects, it > will always need to be tracked. > True. However, some tuples may be in an unfinished state (they are > being built up internally and a GC collection occurred in the middle). So the tuple is linked-in to the garbage collection list before its contents are constructed? > I was wondering whether it is possible to determine whether a tuple > needs to be tracked or not the first time it appears in generation 0 - > tuples in older generations would then not need to be considered. > I'm not sure that would make much of a difference in practice. Most > tuples are very short, so checking them for untracking should be > reasonably fast. Could tuples not be untracked at creation time then, or do not enough survive to gc to make this worthwhile? -- ___ Python tracker <http://bugs.python.org/issue14775> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue14775] Dict untracking can result in quadratic dict build-up
stw added the comment: > So the tuple is linked-in to the garbage collection list before its > contents are constructed? > It is. It typically happens when you do (in C code): Ok, thanks. I couldn't see how a tuple could be created before its contents in python code, but it is clearly possible in C. I have written up some documentation on how untracking is handled in the gc - please see the attached patch. It summarises our discussion and some of the posts in issue #4688. -- Added file: http://bugs.python.org/file25711/untracking_docs.patch ___ Python tracker <http://bugs.python.org/issue14775> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com