built-in pow() vs. math.pow()
I sometimes make use of the fact that the built-in pow() function has an optional third argument for modulo calculation, which is handy when dealing with tasks from number theory, very large numbers, problems from Project Euler, etc. I was unpleasantly surprised that math.pow() does not have this feature, so "from math import *" overwrites the built-in pow() with a function that lacks this functionality. I am wondering about the rationale for this. Does math.pow() do anything that the built-in version cannot do, and if not, why is it there at all?

Thanks in advance for any enlightening comment on this.

Best regards,
Andreas
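As a quick illustration of the difference described above (the concrete numbers are only examples, not from the original mail):

    # Modular exponentiation with the built-in pow(): the optional third
    # argument computes (base ** exp) % mod without ever materialising the
    # full power.
    print(pow(7, 100, 13))   # 9 -- stays within small integers throughout

    import math
    print(math.pow(7, 100))  # a float approximation of 7**100; no third
                             # argument is accepted here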
Re: built-in pow() vs. math.pow()
Andreas Eisele wrote on Thursday, 30 March 2023 at 11:16:02 UTC+2:
> I sometimes make use of the fact that the built-in pow() function has an
> optional third argument for modulo calculation, which is handy when dealing
> with tasks from number theory, very large numbers, problems from Project
> Euler, etc. I was unpleasantly surprised that math.pow() does not have this
> feature, so "from math import *" overwrites the built-in pow() with a
> function that lacks this functionality. I am wondering about the rationale
> for this. Does math.pow() do anything that the built-in version cannot do,
> and if not, why is it there at all?
> Thanks in advance for any enlightening comment on this.
> Best regards, Andreas

Thanks a lot to all of you for the insightful replies! I now understand why math.pow() behaves the way it does, and whoever tells me I should have read the documentation of the math module before using it is 100% on point.

BTW, there is another difference: the built-in pow() deals with complex arguments, while the functions in math won't accept them at all.

I also agree with the warning that importing * from any module is generally a bad idea, and I normally would not do it. But here I had written a little tool that evaluates an expression given on the command line, and in this case having the math functions available without further ado is very convenient, so it seemed appropriate to make an exception to the rule. I ended up assigning the built-in pow function to a different name before importing the math module, which is a good way to keep both variants accessible.

Best regards, and thanks again,
Andreas
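One possible arrangement along the lines described in the post (the name int_pow is illustrative, not taken from the original tool):

    int_pow = pow            # keep a reference to the built-in before importing
    from math import *       # convenient for a small expression evaluator, but
                             # this shadows the built-in pow() with math.pow()

    print(int_pow(7, 100, 13))   # modular exponentiation still available -> 9
    print(pow(2.0, 0.5))         # math.pow: float exponentiation -> 1.414...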
Tremendous slowdown due to garbage collection
In an application dealing with very large text files, I need to create dictionaries indexed by tuples of words (bi-, tri-, n-grams) or nested dictionaries. The number of different data structures in memory grows into orders beyond 1E7.

It turns out that the default behaviour of Python is not very suitable for such a situation, as garbage collection occasionally traverses all objects in memory (including all tuples) in order to find out which could be collected. This leads to the situation that creating O(N) objects effectively takes O(N*N) time. Although this can be cured by switching garbage collection off before the data structures are built and back on afterwards, it may easily give a user not familiar with the fine details of garbage collection behaviour the impression of weak scalability, which would be a pity, as the real problem is much more specific and can be cured.

The problem is already clearly visible for 1M objects, but for larger numbers it gets much more pronounced. Here is a very simple example that displays the problem:

> python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(1000000)]'
10 loops, best of 3: 329 msec per loop

> python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(1000000)]'
10 loops, best of 3: 4.06 sec per loop

> python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 662 msec per loop

> python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 15.2 sec per loop

In the latter case, the garbage collector apparently consumes about 97% of the overall time.

I would suggest configuring the default behaviour of the garbage collector in such a way that this squared complexity is avoided without requiring specific knowledge and intervention by the user. Not being an expert in these details, I would like to ask the gurus how this could be done. I hope this should be at least technically possible; whether it is really desirable or important for a default installation of Python could then be discussed once the disadvantages of such a setting become apparent.

Thanks a lot for your consideration, and best regards,
Andreas

--
Dr. Andreas Eisele, Senior Researcher
DFKI GmbH, Language Technology Lab          [EMAIL PROTECTED]
Saarland University, Computational Linguistics
Stuhlsatzenhausweg 3                        tel: +49-681-302-5285
D-66123 Saarbrücken                         fax: +49-681-302-5338
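For readers who only want the workaround mentioned in the post, a minimal sketch; the structure being built here is a made-up stand-in for the n-gram dictionaries described above:

    import gc

    gc.disable()                 # no cyclic collection while allocating millions
                                 # of tuples the collector would otherwise traverse
    ngram_counts = {}
    for i in range(2000000):     # stand-in for iterating over real n-grams
        key = (i, i + 1)         # tuple key, e.g. a bigram of word ids
        ngram_counts[key] = ngram_counts.get(key, 0) + 1
    gc.enable()                  # restore normal behaviour afterwards
    gc.collect()                 # optionally reclaim any cyclic garbage now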
Re: Tremendous slowdown due to garbage collection
I should have been more specific about possible fixes.

> > python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(2000000)]'
> > 10 loops, best of 3: 662 msec per loop
>
> > python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(2000000)]'
> > 10 loops, best of 3: 15.2 sec per loop
>
> In the latter case, the garbage collector apparently consumes about
> 97% of the overall time.

In a related thread on http://bugs.python.org/issue2607 Amaury Forgeot d'Arc suggested a setting of the GC thresholds that actually solves the problem for me:

> Disabling the gc may not be a good idea in a real application; I suggest
> you to play with the gc.set_threshold function and set larger values, at
> least while building the dictionary. (700, 1000, 10) seems to yield good
> results.

> python2.5 -m timeit 'gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 658 msec per loop

which made me suggest using these values as defaults, but then Martin v. Löwis wrote that

> No, the defaults are correct for typical applications.

At that point I felt lost, and as the general wish in that thread was to move the discussion to comp.lang.python, I brought it up here, in a modified and simplified form:

> I would suggest to configure the default behaviour of the garbage
> collector in such a way that this squared complexity is avoided
> without requiring specific knowledge and intervention by the user. Not
> being an expert in these details I would like to ask the gurus how
> this could be done.
> I hope this should be at least technically possible, whether it is
> really desirable or important for a default installation of Python
> could then be discussed once the disadvantages of such a setting would
> be apparent.

I still don't see what is so good about defaults that lead to O(N*N) computation for an O(N) problem, and I like Amaury's suggestion a lot, so I would like to see comments on its disadvantages. Please don't tell me that O(N*N) is good enough. For N>1E7 it isn't.

About some other remarks made here:

> I think the linguists of the world should write better automated
> translation systems. Not being an expert in these details I would like
> to ask the gurus how it could be done.

I fully agree, and that is why I (as someone involved with MT) would prefer to focus on MT and not on memory allocation issues. By the way, this is why I normally prefer to work in Python, as it normally lets me focus on the problem instead of the tool.

> There are going to be pathological cases in any memory allocation
> scheme. The fact that you have apparently located one calls for you to
> change your approach, not for the finely-tuned well-conceived garbage
> collection scheme to be abandoned for your convenience.

I do not agree at all. Having to deal with N>1E7 objects is IMHO not pathological, it is just daily practice in data-oriented work (we're talking about deriving models from Gigawords, not about toy systems).

> If you had made a definite proposal that could have been evaluated, your
> request would perhaps have seemed a little more approachable.

I feel it is OK to describe a generic problem without having found the answer yet. (If you disagree: where are your definite proposals wrt. MT? ;-) But I admit, I should have brought up Amaury's definite proposal right away.
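For completeness, a hedged sketch of what Amaury's threshold suggestion might look like inside an application; the dictionary-building loop is a placeholder, not code from the original thread:

    import gc

    old_thresholds = gc.get_threshold()   # CPython's default is (700, 10, 10)
    gc.set_threshold(700, 1000, 10)       # collections of the older generations
                                          # (which traverse all the long-lived
                                          # tuples) now happen far less often
    pair_counts = {}
    for i in range(2000000):              # placeholder for the real data pass
        pair_counts[(i, i + 1)] = 1

    gc.set_threshold(*old_thresholds)     # restore the defaults afterwards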
> A question often asked ... of people who are creating
> very large data structures in Python is "Why are you doing that?"
> That is, you should consider whether some kind of database solution
> would be better. You mention lots of dicts--it sounds like some
> balanced B-trees with disk loading on demand could be a good choice.

I do it because I have to count frequencies of pairs of things that appear in real data. Extremely simple stuff, only interesting because of the size of the data set. After Amaury's hint to switch GC temporarily off I can count 100M pair tokens (18M pair types) in about 15 minutes, which is very cool, and I can do it in only a few lines of code, which is even cooler.

Any approach that would touch the disk would be too slow for me, and bringing in complicated data structures myself would distract me too much from what I need to do. That's exactly why I use Python; it has lots of highly fine-tuned complicated stuff behind the scenes that is just doing the right thing, so I should not have to (and could not) re-invent this myself.

Thanks for the helpful comments,
Andreas
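As an illustration only, a "few lines of code" counting loop of the kind described above might look roughly like this; the file name and the whitespace tokenisation are made up, this is not the original code:

    import gc
    from collections import defaultdict

    pair_counts = defaultdict(int)

    gc.disable()                          # avoid repeated traversal of millions
                                          # of long-lived tuples during the build
    with open("corpus.txt") as f:         # hypothetical input file
        for line in f:
            words = line.split()          # naive whitespace tokenisation
            for w1, w2 in zip(words, words[1:]):
                pair_counts[(w1, w2)] += 1    # bigram (pair) frequency
    gc.enable()

    print(len(pair_counts), "distinct pairs")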
Re: Tremendous slowdown due to garbage collection
Sorry, I have to correct my last posting again:

> > Disabling the gc may not be a good idea in a real application; I suggest
> > you to play with the gc.set_threshold function and set larger values, at
> > least while building the dictionary. (700, 1000, 10) seems to yield good
> > results.
>
> > python2.5 -m timeit 'gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'
> > 10 loops, best of 3: 658 msec per loop

Looks nice, but it is pointless, as timeit disables gc by default, so the set_threshold has no effect. In order to see the real effect of this, I need to say

> python2.5 -m timeit 'gc.enable();gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'

which gives me

10 loops, best of 3: 1.26 sec per loop

Still a factor of more than 10 faster than the default settings.

> which made me suggest to use these as defaults, but then
> Martin v. Löwis wrote that
>
> > No, the defaults are correct for typical applications.

Complex topic...
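The same pitfall shows up when using the timeit module directly: it disables the collector around the timed statement, so gc.enable() has to be placed in the setup (or statement) explicitly. A small sketch:

    import timeit

    stmt = "l = [(i,) for i in range(2000000)]"

    # timeit wraps the timed code in gc.disable()/gc.enable(), so without an
    # explicit gc.enable() the collector never runs during the measurement.
    t_off = timeit.timeit(stmt, number=10)
    t_on  = timeit.timeit(stmt, setup="import gc; gc.enable()", number=10)

    print("gc effectively off: %.2f s" % t_off)
    print("gc enabled:         %.2f s" % t_on)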
Re: Tremendous slowdown due to garbage collection
> Martin said that the default settings for the cyclic gc works for most
> people. I agree.

> Your test case has found a pathologic corner case which is *not*
> typical for common application but typical for an artificial benchmark.

I agree that my "corner" is not typical, but I strongly disagree with the classification as pathological. The only feature of my test case that is not typical is the huge number of distinct objects that are allocated. I admit that 1E7 objects is today still fairly untypical, but there is nothing pathological about it; it is just bigger.

It is about as pathological as a file size >2G, which a few years ago seemed so outrageous that no OS bothered to support it, but which is fairly common nowadays, so that a lack of support would appear as an arbitrary and unmotivated limitation. We all enjoy seeing Python adopted on a large scale and used by a broad community, so we should not accept arbitrary size limits. You could call a string with more than 2GB pathological, but I very much appreciate the fact that Python supports such strings for the few cases where they are needed (on a 64-bit architecture).

Now, an O(N*N) effort for large numbers of objects isn't such a hard limit, but in practice it boils down to the same effect: that people cannot use Python in such circumstances. I would prefer it very much if such "soft limits" could be avoided as well. Given there is a fairly simple workaround (thanks again to Amaury!), the issue is not urgent, but I still think it is important in the long run.

> Python is optimized for regular apps, not for benchmark (like some video
> drivers).

I still think it would be worthwhile to support very large numbers of objects in a way that they can just be used, without knowledge of special tricks, and I would be fairly optimistic that those who have designed the current GC schemes could generalize them slightly so that these marginal cases will work better without imposing a penalty on the more typical cases.

> By the way you shouldn't use range for large ranges of more than a
> thousand items. xrange() should be faster and it will definitely use
> much less memory - and memory Python 2.5 and older will never release
> again. I'm going to fix the issue for Python 2.6 and 3.0.

Thanks for this hint, and for the work on the newer versions. This is very much appreciated.

Andreas
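For readers following along on Python 2, a minimal illustration of that last hint (on Python 3, range() is already a lazy sequence and xrange() no longer exists):

    # Python 2.x: range() materialises a full list of ints before the list
    # comprehension even starts; xrange() hands out the indices lazily.
    l_eager = [(i,) for i in range(2000000)]   # builds an extra 2M-element list
    l_lazy  = [(i,) for i in xrange(2000000)]  # same result, no intermediate list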