built-in pow() vs. math.pow()

2023-03-30 Thread Andreas Eisele
I sometimes make use of the fact that the built-in pow() function has an 
optional third argument for modulo calculation, which is handy when dealing 
with tasks from number theory, very large numbers, problems from Project Euler, 
etc. I was unpleasantly surprised that math.pow() does not have this feature, 
hence "from math import *" overwrites the built-in pow() function with a 
function that lacks functionality. I am wondering for the rationale of this. 
Does math.pow() do anything that the built-in version can not do, and if not, 
why is it even there?
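
For illustration, a minimal comparison (Python 3):

import math

print(pow(7, 128, 13))    # built-in pow() with a modulus: efficient modular exponentiation -> 3
print(pow(7, 128) % 13)   # same result, but only after computing the full 7**128
print(math.pow(7, 128))   # math.pow() always returns a float and takes exactly two arguments
# math.pow(7, 128, 13)    # would raise a TypeError: there is no third argument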
Thanks in advance for any enlightening comment on this.
Best regards, Andreas
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: built-in pow() vs. math.pow()

2023-04-03 Thread Andreas Eisele
Andreas Eisele schrieb am Donnerstag, 30. März 2023 um 11:16:02 UTC+2:
> I sometimes make use of the fact that the built-in pow() function has an 
> optional third argument for modulo calculation, which is handy when dealing 
> with tasks from number theory, very large numbers, problems from Project 
> Euler, etc. I was unpleasantly surprised that math.pow() does not have this 
> feature, hence "from math import *" shadows the built-in pow() function 
> with a function that lacks this functionality. I am wondering about the rationale for 
> this. Does math.pow() do anything that the built-in version can not do, and 
> if not, why is it even there? 
> Thanks in advance for any enlightening comment on this. 
> Best regards, Andreas

Thanks a lot to all of you for the insightful replies! 
I now understand why math.pow() behaves the way it does and whoever tells me I 
should have read the doc of the math module before using it is 100% on point. 
BTW, there is another difference: built-in pow() deals with complex arguments, 
while functions in math won't accept them at all.
I also agree with the warning that importing * from any module is generally a 
bad idea, and I normally would not do it.  But here I had written a little tool 
that evaluates an expression given on the command line, and in this case having 
math functions available without further ado is very convenient, so it seemed 
appropriate to make an exception to this rule.
I ended up assigning the built-in pow function to a different name before 
importing the math module, which is a good way to keep both variants accessible.
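A minimal sketch of that approach (the alias name is of course arbitrary):

builtin_pow = pow       # keep a reference before the star import shadows the name
from math import *      # from here on, pow refers to math.pow (float-only, two arguments)

print(builtin_pow(7, 128, 13))   # modular exponentiation still available -> 3
print(builtin_pow(1j, 2))        # complex arguments still work -> (-1+0j)
print(pow(2.0, 0.5))             # math.pow: plain float power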
Best regards, and thanks again, Andreas
-- 
https://mail.python.org/mailman/listinfo/python-list


Tremendous slowdown due to garbage collection

2008-04-12 Thread andreas . eisele
In an application dealing with very large text files, I need to create
dictionaries indexed by tuples of words (bi-, tri-, n-grams) or nested
dictionaries. The number of distinct data structures in memory grows
to the order of 1E7 and beyond.

It turns out that the default behaviour of Python is not very suitable
for such a situation: the cyclic garbage collector periodically (after
a fixed number of allocations) traverses all objects in memory
(including all those tuples) in order to find out which of them could
be collected. Building O(N) objects therefore triggers O(N) such
traversals of O(N) objects each, so creating the data structures
effectively takes O(N*N) time.
Although this can be cured by switching garbage collection off before
the data structures are built and back on afterwards, it may easily
give a user who is not familiar with the fine details of garbage
collection behaviour the impression of weak scalability, which would be
a pity, as the real problem is much more specific and avoidable.
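
In code, the workaround amounts to this (a minimal sketch; the list
mirrors the benchmark below):

import gc

gc.disable()                               # pause cyclic collection while allocating
try:
    data = [(i,) for i in range(2000000)]  # build the large structure in one go
finally:
    gc.enable()                            # switch collection back on afterwards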

The problem is already clearly visible for 1M objects, but for larger
numbers it gets much more pronounced. Here is a very simple example
that displays the problem.

> python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(1000000)]'
10 loops, best of 3: 329 msec per loop

> python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(1000000)]'
10 loops, best of 3: 4.06 sec per loop

> python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 662 msec per loop

> python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 15.2 sec per loop

In the latter case, the garbage collector apparently consumes about
97% of the overall time.

I would suggest configuring the default behaviour of the garbage
collector in such a way that this quadratic complexity is avoided
without requiring specific knowledge and intervention by the user. Not
being an expert in these details, I would like to ask the gurus how
this could be done.
I hope this is at least technically possible; whether it is really
desirable or important for a default installation of Python could then
be discussed once the disadvantages of such a setting become apparent.

Thanks a lot for your consideration, and best regards,
Andreas

--
Dr. Andreas Eisele, Senior Researcher
DFKI GmbH, Language Technology Lab, [EMAIL PROTECTED]
Saarland University, Computational Linguistics
Stuhlsatzenhausweg 3        tel: +49-681-302-5285
D-66123 Saarbrücken         fax: +49-681-302-5338
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Tremendous slowdown due to garbage collection

2008-04-12 Thread andreas . eisele
I should have been more specific about possible fixes.

> > python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(2000000)]'
>
> 10 loops, best of 3: 662 msec per loop
>
> > python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(2000000)]'
>
> 10 loops, best of 3: 15.2 sec per loop
>
> In the latter case, the garbage collector apparently consumes about
> 97% of the overall time.
>

In a related thread on http://bugs.python.org/issue2607
Amaury Forgeot d'Arc suggested a setting of the GC
thresholds that actually solves the problem for me:

> Disabling the gc may not be a good idea in a real application; I suggest
> you to play with the gc.set_threshold function and set larger values, at
> least while building the dictionary. (700, 1000, 10) seems to yield good
> results.

> python2.5 -m timeit 'gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 658 msec per loop

which made me suggest using these as defaults, but then
Martin v. Löwis wrote that

> No, the defaults are correct for typical applications.

At that point I felt lost, and as the general wish in that thread was
to move the discussion to comp.lang.python, I brought it up here, in a
modified and simplified form.

> I would suggest configuring the default behaviour of the garbage
> collector in such a way that this quadratic complexity is avoided
> without requiring specific knowledge and intervention by the user. Not
> being an expert in these details, I would like to ask the gurus how
> this could be done.
> I hope this is at least technically possible; whether it is really
> desirable or important for a default installation of Python could then
> be discussed once the disadvantages of such a setting become apparent.

I still don't see what is so good about defaults that lead to O(N*N)
computation for an O(N) problem, and I like Amaury's suggestion a lot,
so I would like to see comments on its disadvantages. Please don't
tell me that O(N*N) is good enough; for N > 1E7 it isn't.

About some other remarks made here:

> I think the linguists of the world should write better automated
> translation systems. Not being an expert in these details I would like
> to ask the gurus how it could be done.

I fully agree, and that is why I (as someone involved with MT) would
prefer to focus on MT rather than on memory allocation issues. By the
way, this is also why I normally prefer to work in Python: it usually
lets me focus on the problem instead of the tool.

> There are going to be pathological cases in any memory allocation
> scheme. The fact that you have apparently located one calls for you to
> change your approach, not for the finely-tuned well-conceived garbage
> collection scheme to be abandoned for your convenience.

I do not agree at all. Having to deal with N > 1E7 objects is IMHO not
pathological; it is just daily practice in data-oriented work (we're
talking about deriving models from gigawords of text, not about toy
systems).

> If you had made a definite proposal that could have been evaluated, your
> request would perhaps have seemed a little more approachable.

I feel it is ok to describe a generic problem without having found
the answer yet.
(If you disagree: where are your definite proposals w.r.t. MT? ;-)
But I admit I should have brought up Amaury's definite proposal
right away.

> A question often asked ... of people who are creating
> very large data structures in Python is "Why are you doing that?"
> That is, you should consider whether some kind of database solution
> would be better.  You mention lots of dicts--it sounds like some
> balanced B-trees with disk loading on demand could be a good choice.

I do it because I have to count frequencies of pairs of things that
appear in real data. Extremely simple stuff, only interesting because
of the size of the data set. After Amaury's hint to switch GC off
temporarily, I can count 100M pair tokens (18M pair types) in about
15 minutes, which is very cool, and I can do it in only a few lines
of code, which is even cooler.
Any approach that would touch the disk would be too slow for me, and
bringing in complicated data structures myself would distract me too
much from what I need to do. That's exactly why I use Python: it has
lots of highly fine-tuned, complicated machinery behind the scenes
that just does the right thing, so I should not have to (and could
not) re-invent it myself.
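
For the record, the counting itself is no more involved than this
sketch (names are illustrative; tokens would be a large list of words
read from the corpus):

import gc
from collections import defaultdict

def count_pairs(tokens):
    """Count frequencies of adjacent word pairs (pair tokens -> pair types)."""
    counts = defaultdict(int)
    gc.disable()                  # millions of small tuples are about to be created
    try:
        for pair in zip(tokens, tokens[1:]):
            counts[pair] += 1     # one dict entry per distinct pair type
    finally:
        gc.enable()
    return counts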

Thanks for the helpful comments,
Andreas


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Tremendous slowdown due to garbage collection

2008-04-12 Thread andreas . eisele
Sorry, I have to correct my last posting again:

> > Disabling the gc may not be a good idea in a real application; I suggest
> > you to play with the gc.set_threshold function and set larger values, at
> > least while building the dictionary. (700, 1000, 10) seems to yield good
> > results.
> > python2.5 -m timeit 'gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'
>
> 10 loops, best of 3: 658 msec per loop
>

Looks nice, but it is pointless, as timeit disables gc by default,
so the set_threshold call has no effect. In order to see the real
effect, I need to say

> python2.5 -m timeit 'gc.enable();gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'

which gives me

10 loops, best of 3: 1.26 sec per loop

Still a factor of more than 10 faster than the default settings.
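
Translated into an ordinary script (outside of timeit, where gc stays
enabled anyway), the recipe looks roughly like this; the threshold
values are the ones Amaury suggested:

import gc

old_thresholds = gc.get_threshold()      # the defaults are (700, 10, 10)
gc.set_threshold(700, 1000, 10)          # run the older-generation collections far less often
try:
    l = [(i,) for i in range(2000000)]   # build the large structure
finally:
    gc.set_threshold(*old_thresholds)    # restore the defaults afterwards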

> which made me suggest using these as defaults, but then
> Martin v. Löwis wrote that
>
> > No, the defaults are correct for typical applications.

Complex topic...
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Tremendous slowdown due to garbage collection

2008-04-12 Thread andreas . eisele

> Martin said that the default settings for the cyclic gc works for most
> people.

I agree.

> Your test case has found a pathologic corner case which is *not*
> typical for common application but typical for an artificial benchmark.

I agree that my "corner" is not typical, but I strongly disagree with
the classification as pathological. The only feature of my test case
that is not typical is the huge number of distinct objects that are
allocated. I admit that 1E7 objects is still fairly untypical today,
but there is nothing pathological about it; it is just bigger. It is
about as pathological as a file size > 2 GB, which a few years ago
seemed so outrageous that no OS bothered to support it, but is fairly
common nowadays, so that a lack of support would now appear as an
arbitrary and unmotivated limitation. We all enjoy seeing Python
adopted on a large scale and used by a broad community, so we should
not accept arbitrary size limits. You could call a string of more than
2 GB pathological, but I very much appreciate the fact that Python
supports such strings for the few cases where they are needed (on a
64-bit architecture). Now an O(N*N) effort for large numbers of
objects isn't such a hard limit, but in practice it boils down to the
same effect: people cannot use Python in such circumstances. I would
very much prefer it if such "soft limits" could be avoided as well.

Given there is a fairly simple workaround (thanks again to Amaury!),
the issue is not urgent, but I still think it is important in the long
run.

> Python is optimized for regular apps, not for benchmark (like some video
> drivers).
>

I still think it would be worthwhile to support very large numbers of
objects in a way that they can just be used, without knowledge of
special tricks. I would be fairly optimistic that those who designed
the current GC scheme could generalize it slightly so that these
marginal cases work better without imposing a penalty on the more
typical cases.

> By the way you shouldn't use range for large ranges of more than a
> thousand items. xrange() should be faster and it will definitely use
> much less memory - and memory Python 2.5 and older will never release
> again. I'm going to fix the issue for Python 2.6 and 3.0.
>

Thanks for this hint, and for the work on the newer versions. This is
very much appreciated.
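
For anyone reading along, the difference in a nutshell (Python 2
semantics; in Python 3 range() itself is lazy):

total = 0
for i in xrange(10 ** 7):    # yields one integer at a time
    total += i
# range(10 ** 7) would first materialize a list of ten million int objects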

Andreas

-- 
http://mail.python.org/mailman/listinfo/python-list