In an application dealing with very large text files, I need to create dictionaries indexed by tuples of words (bi-, tri-, n-grams) or nested dictionaries. The number of distinct data structures in memory grows to the order of 1E7 and beyond.
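For concreteness, here is a minimal sketch of the kind of structure I am building (the function name and the toy sentence are of course made up for illustration; the real input is millions of sentences):

```python
from collections import defaultdict

def count_ngrams(tokens, n=2):
    """Count n-grams of a token sequence, keyed by n-tuples of words."""
    counts = defaultdict(int)
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return dict(counts)

bigrams = count_ngrams("the cat sat on the mat on the cat".split())
# e.g. bigrams[("the", "cat")] == 2
```

On real corpora this produces tens of millions of small tuples, which is exactly the situation described below.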
It turns out that the default behaviour of Python is not very suitable for such a situation: garbage collection occasionally traverses all objects in memory (including all the tuples) in order to find out which of them could be collected. This leads to the situation that creating O(N) objects effectively takes O(N*N) time. The problem can be cured by switching garbage collection off before the data structures are built and back on afterwards, but a user not familiar with the fine details of garbage collection behaviour may easily come away with the impression of weak scalability, which would be a pity, as the real problem is much more specific and can be cured. The problem is already clearly visible for 1M objects, but for larger numbers it gets much more pronounced.

Here is a very simple example that displays the problem:

> python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(1000000)]'
10 loops, best of 3: 329 msec per loop
> python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(1000000)]'
10 loops, best of 3: 4.06 sec per loop
> python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 662 msec per loop
> python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 15.2 sec per loop

In the latter case, the garbage collector apparently consumes about 97% of the overall time.

I would suggest configuring the default behaviour of the garbage collector in such a way that this quadratic complexity is avoided without requiring specific knowledge or intervention by the user. Not being an expert in these details, I would like to ask the gurus how this could be done. I hope it is at least technically possible; whether it is really desirable or important for a default installation of Python can then be discussed once the disadvantages of such a setting become apparent.

Thanks a lot for your consideration, and best regards,

Andreas

------
Dr.
Andreas Eisele, Senior Researcher
DFKI GmbH, Language Technology Lab, [EMAIL PROTECTED]
Saarland University, Computational Linguistics
Stuhlsatzenhausweg 3, D-66123 Saarbrücken
tel: +49-681-302-5285, fax: +49-681-302-5338
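P.S. The workaround of switching garbage collection off before building the data structures and back on afterwards can be packaged so that it is exception-safe; a sketch (the helper name gc_paused is just my own suggestion, and contextlib is available as of Python 2.5):

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_paused():
    """Disable the cyclic garbage collector for the duration of the
    with-block, restoring its previous state even on an exception."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

with gc_paused():
    # Build the containers without the collector repeatedly
    # traversing all of them.
    l = [(i,) for i in range(1000000)]
```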