Re: [Python-Dev] [Python-checkins] cpython: Issue #14624: UTF-16 decoding is now 3x to 4x faster on various inputs.

2012-05-17 Thread Victor Stinner
> http://hg.python.org/cpython/rev/cdcc816dea85
> changeset:   76971:cdcc816dea85
> user:        Antoine Pitrou 
> date:        Tue May 15 23:48:04 2012 +0200
> summary:
>  Issue #14624: UTF-16 decoding is now 3x to 4x faster on various inputs.
> Patch by Serhiy Storchaka.

Such an optimization should be mentioned in the "What's New in Python 3.3"
doc if Python 3.3 is now faster than Python 3.2. Same remark for the
UTF-8 optimization.

Victor


Re: [Python-Dev] C-level duck typing

2012-05-17 Thread martin

So maybe it's worth thinking about making a general mechanism
available for third parties to extend the type object without
them all needing to have their own tp_flags bits and without
needing to collude with each other to avoid conflicts.


That mechanism is already available. Subclass PyTypeType, and
add whatever fields you want.
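
For concreteness, a rough C sketch of that route (the names SigTypeObject,
SigType and sig_call are invented here purely for illustration, not taken
from any existing project):

#include <Python.h>

/* The metatype's instances are type objects that carry one extra
 * C-level field.  PyHeapTypeObject is embedded so the layout stays
 * compatible with what 'type' itself expects of its instances. */
typedef struct {
    PyHeapTypeObject ht;
    double (*sig_call)(double);   /* hypothetical extra slot */
} SigTypeObject;

static PyTypeObject SigType_Type = {
    PyVarObject_HEAD_INIT(NULL, 0)
    "sigtype.SigType",            /* tp_name */
    sizeof(SigTypeObject),        /* tp_basicsize */
    0,                            /* tp_itemsize */
    /* remaining slots left at their defaults for brevity */
};

/* In the module init function one would set
 *     SigType_Type.tp_flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE;
 *     SigType_Type.tp_base  = &PyType_Type;
 * and then call PyType_Ready(&SigType_Type).  Any class whose metaclass
 * is SigType then has room for sig_call, and a consumer can read it via
 *     ((SigTypeObject *)Py_TYPE(obj))->sig_call
 * after checking that Py_TYPE(obj) really is a SigType instance. */

The catch, as noted elsewhere in the thread, is that every provider and
consumer then has to agree on that one metatype (or on a shared base
class for it).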

Regards,
Martin




Re: [Python-Dev] C-level duck typing

2012-05-17 Thread Mark Shannon

Dag Sverre Seljebotn wrote:

On 05/16/2012 10:24 PM, Robert Bradshaw wrote:
On Wed, May 16, 2012 at 11:33 AM, "Martin v. Löwis" wrote:

Does this use case make sense to everyone?

The reason why we are discussing this on python-dev is that we are
looking for a general way to expose these C level signatures within the
Python ecosystem. And Dag's idea was to expose them as part of the type
object, basically as an addition to the current Python level tp_call()
slot.


The use case makes sense, yet there is also a long-standing solution
already to expose APIs and function pointers: the capsule objects.

If you want to avoid dictionary lookups on the server side, implement
tp_getattro, comparing addresses of interned strings.
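
A rough sketch of that pattern, with an invented attribute name and a
hypothetical make_signature_capsule() helper; it assumes the consumer
interns the name it asks for, so that the pointer comparison can hit:

#include <Python.h>

static PyObject *str_signatures = NULL;   /* interned once, reused forever */

static PyObject *
provider_getattro(PyObject *self, PyObject *name)
{
    if (str_signatures == NULL) {
        str_signatures = PyUnicode_InternFromString("_c_signatures");
        if (str_signatures == NULL)
            return NULL;
    }
    /* Interned strings are unique objects, so for callers that intern
     * the name too, a pointer comparison replaces the dict lookup.
     * The value comparison is the fallback for non-interned names. */
    if (name == str_signatures ||
        PyUnicode_Compare(name, str_signatures) == 0)
        return make_signature_capsule(self);   /* hypothetical helper */
    /* everything else goes through the normal machinery */
    return PyObject_GenericGetAttr(self, name);
}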


Yes, that's an idea worth looking at. The point about implementing
tp_getattro to avoid dictionary lookup overhead is a good one, worth
trying at least. One drawback is that this approach does require the
GIL (as does _PyType_Lookup).

Regarding the C function being faster than the dictionary lookup (or
at least close enough that the lookup takes time), yes, this happens
all the time. For example one might be solving differential equations
and the "user input" is essentially a set of (usually simple) double
f(double) and its derivatives.


To underline how this is performance critical to us, perhaps a full 
Cython example is useful.


The following Cython code is a real world usecase. It is not too 
contrived in the essentials, although simplified a little bit. For 
instance undergrad engineering students could pick up Cython just to 
play with simple scalar functions like this.


from numpy import sin
# assume sin is a Python callable and that NumPy decides to support
# our spec to also support getting a "double (*sinfuncptr)(double)".

# Our mission: Avoid having the user manually import "sin" from C,
# but allow just using the NumPy object and still be fast.

# define a function to integrate
cpdef double f(double x):
    return sin(x * x)  # guess on signature and use "fastcall"!

# the integrator
def integrate(func, double a, double b, int n):
    cdef double s = 0
    cdef double dx = (b - a) / n
    for i in range(n):
        # This is also a fastcall, but can be cached so doesn't
        # matter...
        s += func(a + i * dx)
    return s * dx

integrate(f, 0, 1, 100)

There are two problems here:

 - The "sin" global can be reassigned (monkey-patched) between each call 
to "f", no way for "f" to know. Even "sin" could do the reassignment. So 
you'd need to check for reassignment to do caching...


Since Cython allows static typing why not just declare that func can 
treat sin as if it can't be monkeypatched?
Moving the load of a global variable out of the loop does seem to be a 
rather obvious optimisation, if it were declared to be legal.




 - The fastcall inside of "f" is separated from the loop in "integrate". 
And since "f" is often in another module, we can't rely on static full 
program analysis.


These problems with monkey-patching disappear if the lookup is negligible.

Some rough numbers:

 - The overhead with the tp_flags hack is a 2 ns overhead (something 
similar with a metaclass, the problems are more how to synchronize that 
metaclass across multiple 3rd party libraries)


Does your approach handle subtyping properly?



 - Dict lookup 20 ns


Did you time _PyType_Lookup() ?



 - The sin function is about 35 ns. And, "f" is probably only 2-3 ns, 
and there could very easily be multiple such functions, defined in 
different modules, in a chain, in order to build up a formula.




Such micro timings are meaningless, because the working set often tends
to fit in the hardware cache. A level 2 cache miss can take 100s of cycles.



Cheers,
Mark.


Re: [Python-Dev] C-level duck typing

2012-05-17 Thread Stefan Behnel
Mark Shannon, 17.05.2012 12:38:
> Dag Sverre Seljebotn wrote:
>> On 05/16/2012 10:24 PM, Robert Bradshaw wrote:
>>> On Wed, May 16, 2012 at 11:33 AM, "Martin v. Löwis" 
>>> wrote:
> Does this use case make sense to everyone?
>
> The reason why we are discussing this on python-dev is that we are
> looking
> for a general way to expose these C level signatures within the Python
> ecosystem. And Dag's idea was to expose them as part of the type object,
> basically as an addition to the current Python level tp_call() slot.

 The use case makes sense, yet there is also a long-standing solution
 already
 to expose APIs and function pointers: the capsule objects.

 If you want to avoid dictionary lookups on the server side, implement
 tp_getattro, comparing addresses of interned strings.
>>>
>>> Yes, that's an idea worth looking at. The point about implementing
>>> tp_getattro to avoid dictionary lookup overhead is a good one, worth
>>> trying at least. One drawback is that this approach does require the
>>> GIL (as does _PyType_Lookup).
>>>
>>> Regarding the C function being faster than the dictionary lookup (or
>>> at least close enough that the lookup takes time), yes, this happens
>>> all the time. For example one might be solving differential equations
>>> and the "user input" is essentially a set of (usually simple) double
>>> f(double) and its derivatives.
>>
>> To underline how this is performance critical to us, perhaps a full
>> Cython example is useful.
>>
>> The following Cython code is a real world usecase. It is not too
>> contrived in the essentials, although simplified a little bit. For
>> instance undergrad engineering students could pick up Cython just to play
>> with simple scalar functions like this.
>>
>> from numpy import sin
>> # assume sin is a Python callable and that NumPy decides to support
>> # our spec to also support getting a "double (*sinfuncptr)(double)".
>>
>> # Our mission: Avoid having the user manually import "sin" from C,
>> # but allow just using the NumPy object and still be fast.
>>
>> # define a function to integrate
>> cpdef double f(double x):
>>     return sin(x * x)  # guess on signature and use "fastcall"!
>>
>> # the integrator
>> def integrate(func, double a, double b, int n):
>>     cdef double s = 0
>>     cdef double dx = (b - a) / n
>>     for i in range(n):
>>         # This is also a fastcall, but can be cached so doesn't
>>         # matter...
>>         s += func(a + i * dx)
>>     return s * dx
>>
>> integrate(f, 0, 1, 100)
>>
>> There are two problems here:
>>
>>  - The "sin" global can be reassigned (monkey-patched) between each call
>> to "f", no way for "f" to know. Even "sin" could do the reassignment. So
>> you'd need to check for reassignment to do caching...
> 
> Since Cython allows static typing why not just declare that func can treat
> sin as if it can't be monkeypatched?

You'd simply say

cdef object sin    # declare it as a C variable of type 'object'
from numpy import sin

That's also the one obvious way to do it in Cython.


> Moving the load of a global variable out of the loop does seem to be a
> rather obvious optimisation, if it were declared to be legal.

My proposal was to simply extract any C function pointers at assignment
time, i.e. at import time in the example above. Signature matching can then
be done at the first call and the result can be cached as long as the
object variable isn't changed. All of that is local to the module and can
thus easily be controlled at code generation time.
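
In C terms, the generated module code could look roughly like the
following sketch. The names are invented; lookup_double_double() stands
for whatever the agreed-on spec would provide for fetching a
"double (*)(double)" entry point from an object, and the signature check
is folded into the assignment here rather than deferred to the first call:

#include <Python.h>

static PyObject *mod_sin = NULL;               /* the module-level 'sin' */
static double (*mod_sin_ptr)(double) = NULL;   /* cached C entry point   */

/* Run on every assignment to the module-level variable, e.g. at
 * "from numpy import sin": grab the C pointer once, right here. */
static void
assign_sin(PyObject *obj)
{
    Py_XINCREF(obj);
    Py_XDECREF(mod_sin);
    mod_sin = obj;
    mod_sin_ptr = lookup_double_double(obj);   /* NULL if no matching signature */
}

/* What a call like sin(x * x) inside f() would compile down to. */
static double
call_sin(double x)
{
    PyObject *res;
    double r;
    if (mod_sin_ptr != NULL)
        return mod_sin_ptr(x);                 /* fast path: plain C call */
    /* slow path: ordinary Python call protocol */
    res = PyObject_CallFunction(mod_sin, "d", x);
    if (res == NULL)
        return -1.0;                           /* error handling elided */
    r = PyFloat_AsDouble(res);
    Py_DECREF(res);
    return r;
}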

Stefan



Re: [Python-Dev] sys.implementation

2012-05-17 Thread Eric Snow
PEP 421 has reached a good place and I'd like to ask for pronouncement.  Thanks!

-eric


Re: [Python-Dev] C-level duck typing

2012-05-17 Thread Dag Sverre Seljebotn

Mark Shannon  wrote:

Dag Sverre Seljebotn wrote:

from numpy import sin
# assume sin is a Python callable and that NumPy decides to support
# our spec to also support getting a "double (*sinfuncptr)(double)".

# Our mission: Avoid having the user manually import "sin" from C,
# but allow just using the NumPy object and still be fast.

# define a function to integrate
cpdef double f(double x):
    return sin(x * x)  # guess on signature and use "fastcall"!

# the integrator
def integrate(func, double a, double b, int n):
    cdef double s = 0
    cdef double dx = (b - a) / n
    for i in range(n):
        # This is also a fastcall, but can be cached so doesn't
        # matter...
        s += func(a + i * dx)
    return s * dx

integrate(f, 0, 1, 100)

There are two problems here:

 - The "sin" global can be reassigned (monkey-patched) between each

call

to "f", no way for "f" to know. Even "sin" could do the reassignment.

So

you'd need to check for reassignment to do caching...


Since Cython allows static typing why not just declare that func can
treat sin as if it can't be monkeypatched?


If you want to manually declare stuff, you can always use a C function 
pointer too...



Moving the load of a global variable out of the loop does seem to be a
rather obvious optimisation, if it were declared to be legal.


In case you didn't notice, there were no global variable loads inside the
loop...


You can keep chasing this, but there are *always* cases where they don't
help (and you need to save the situation by manual typing).


Anyway: We should really discuss Cython on the Cython list. If my 
motivating example wasn't good enough for you there's really nothing I 
can do.



Some rough numbers:

 - The overhead with the tp_flags hack is a 2 ns overhead (something
similar with a metaclass, the problems are more how to synchronize that
metaclass across multiple 3rd party libraries)


Does your approach handle subtyping properly?


Not really.



 - Dict lookup 20 ns


Did you time _PyType_Lookup() ?


No, didn't get around to it yet (and thanks for pointing it out). 
(Though the GIL requirement is an issue too for Cython.)



 - The sin function is about 35 ns. And, "f" is probably only 2-3 ns,
and there could very easily be multiple such functions, defined in
different modules, in a chain, in order to build up a formula.



Such micro timings are meaningless, because the working set often tends
to fit in the hardware cache. A level 2 cache miss can take 100s of
cycles.


I find this sort of response arrogant -- do you know the details of 
every usecase for a programming language under the sun?


Many Cython users are scientists. And in scientific computing in 
particular you *really* have the whole range of problems and working 
sets. Honestly. In some codes you only really care about the speed of 
the disk controller. In other cases you can spend *many seconds* working 
almost only in L1 or perhaps L2 cache (for instance when integrating 
ordinary differential equations in a few variables, which is not 
entirely different in nature from the example I posted). (Then, those 
many seconds are replicated many million times for different parameters 
on a large cluster, and a 2x speedup translates directly into large 
amounts of saved money.)


Also, with numerical codes you block up the problem so that loads to L2 
are amortized over sufficient FLOPs (when you can).


Every time Cython becomes able to do stuff more easily in this domain, 
people thank us that they didn't have to dig up Fortran but can stay 
closer to Python.


Sorry for going off on a rant. I find that people will give well-meant
advice about performance, but that advice is just generalizing from
computer programs in entirely different domains (web apps?), and
sweeping generalizations have a way of giving the wrong answer.


Dag


Re: [Python-Dev] C-level duck typing

2012-05-17 Thread Dag Sverre Seljebotn

On 05/17/2012 08:13 PM, Dag Sverre Seljebotn wrote:

Mark Shannon  wrote:

Dag Sverre Seljebotn wrote:

from numpy import sin
# assume sin is a Python callable and that NumPy decides to support
# our spec to also support getting a "double (*sinfuncptr)(double)".

# Our mission: Avoid having the user manually import "sin" from C,
# but allow just using the NumPy object and still be fast.

# define a function to integrate
cpdef double f(double x):
    return sin(x * x)  # guess on signature and use "fastcall"!

# the integrator
def integrate(func, double a, double b, int n):
    cdef double s = 0
    cdef double dx = (b - a) / n
    for i in range(n):
        # This is also a fastcall, but can be cached so doesn't
        # matter...
        s += func(a + i * dx)
    return s * dx

integrate(f, 0, 1, 100)

There are two problems here:

- The "sin" global can be reassigned (monkey-patched) between each

call

to "f", no way for "f" to know. Even "sin" could do the reassignment.

So

you'd need to check for reassignment to do caching...


Since Cython allows static typing why not just declare that func can
treat sin as if it can't be monkeypatched?


If you want to manually declare stuff, you can always use a C function
pointer too...


Moving the load of a global variable out of the loop does seem to be a
rather obvious optimisation, if it were declared to be legal.


In case you didn't notice, there were no global variable loads inside the
loop...

You can keep chasing this, but there are *always* cases where they don't
help (and you need to save the situation by manual typing).

Anyway: We should really discuss Cython on the Cython list. If my
motivating example wasn't good enough for you there's really nothing I
can do.


Some rough numbers:

- The overhead with the tp_flags hack is a 2 ns overhead (something
similar with a metaclass, the problems are more how to synchronize that
metaclass across multiple 3rd party libraries)


Does your approach handle subtyping properly?


Not really.



- Dict lookup 20 ns


Did you time _PyType_Lookup() ?


No, didn't get around to it yet (and thanks for pointing it out).
(Though the GIL requirement is an issue too for Cython.)


- The sin function is about 35 ns. And, "f" is probably only 2-3 ns,
and there could very easily be multiple such functions, defined in
different modules, in a chain, in order to build up a formula.



Such micro timings are meaningless, because the working set often tends
to fit in the hardware cache. A level 2 cache miss can take 100s of
cycles.


I'm sorry; if my rant wasn't clear: Such micro-benchmarks do in fact 
mimic very closely what you'd do if you'd, say, integrate an ordinary 
differential equation. You *do* have a tight loop like that, just 
hammering on floating point numbers. Making that specific usecase more 
convenient was actually the original usecase that spawned this 
discussion on the NumPy list over a month ago...


Dag



I find this sort of response arrogant -- do you know the details of
every usecase for a programming language under the sun?

Many Cython users are scientists. And in scientific computing in
particular you *really* have the whole range of problems and working
sets. Honestly. In some codes you only really care about the speed of
the disk controller. In other cases you can spend *many seconds* working
almost only in L1 or perhaps L2 cache (for instance when integrating
ordinary differential equations in a few variables, which is not
entirely different in nature from the example I posted). (Then, those
many seconds are replicated many million times for different parameters
on a large cluster, and a 2x speedup translates directly into large
amounts of saved money.)

Also, with numerical codes you block up the problem so that loads to L2
are amortized over sufficient FLOPs (when you can).

Every time Cython becomes able to do stuff more easily in this domain,
people thank us that they didn't have to dig up Fortran but can stay
closer to Python.

Sorry for going off on a rant. I find that people will give well-meant
advice about performance, but that advice is just generalizing from
computer programs in entirely different domains (web apps?), and
sweeping generalizations have a way of giving the wrong answer.

Dag


Re: [Python-Dev] C-level duck typing

2012-05-17 Thread R. David Murray
On Thu, 17 May 2012 20:13:41 +0200, Dag Sverre Seljebotn 
 wrote:
> Every time Cython becomes able to do stuff more easily in this domain, 
> people thank us that they didn't have to dig up Fortran but can stay 
> closer to Python.
> 
> Sorry for going off on a rant. I find that people will give well-meant 
> advice about performance, but that advice is just generalizing from 
> computer programs in entirely different domains (web apps?), and 
> sweeping generalizations have a way of giving the wrong answer.

I don't have opinions on the specific topic under discussion, since I
don't get involved in the C level stuff unless I have to, but I do have
some small amount of background in scientific computing (many years ago).
I just want to chime in to say that I think it benefits the whole Python
community to extend welcoming arms to the scientific Python community
and see what we can do to help them (without, of course, compromising
Python).

I think it is safe to assume that they do have significant experience
with real applications where timings at this level of detail do matter.
The scientific computing community is pretty much by definition pushing
the limits of what's possible.

--David


Re: [Python-Dev] C-level duck typing

2012-05-17 Thread Dag Sverre Seljebotn

On 05/17/2012 05:00 AM, Greg Ewing wrote:

On 17/05/12 12:17, Robert Bradshaw wrote:


This is exactly what was proposed to start this thread (with minimal
collusion to avoid conflicts, specifically partitioning up a global ID
space).


Yes, but I think this part of the mechanism needs to be spelled out in
more detail, perhaps in the form of a draft PEP. Then there will be
something concrete to discuss in python-dev.



Well, we weren't 100% sure what is the best mechanism, so the point 
really was to solicit input, even if I got a bit argumentative along the 
way. Thanks to all of you!


If we in the end decide that we would like to propose the PEP, does
anyone feel the odds are anything but very, very slim? I don't think
I've heard a single positive word about the proposal so far except from
Cython devs, so I'm reluctant to spend my own and your time on fleshing
out a full PEP for that reason.


In a PEP, the proposal would likely be an additional pointer to a table 
of "custom PyTypeObject extensions"; not a flag bit. The whole point 
would be to only do that once, and after that PyTypeObject would be 
infinitely extensible for custom purposes without collisions (even as a 
way of pre-testing PEPs about PyTypeObject in the wild before final 
approval!). Of course, one more pointer per type object is a bigger
burden to push on others.
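
Purely as an illustration of that shape (field and type names here are
placeholders, not a worked-out design):

#include <Python.h>

/* One extra pointer on PyTypeObject, say 'tp_ext', pointing at this: */
typedef struct {
    unsigned long id;    /* identifier from a partitioned global ID space */
    void *data;          /* extension-specific payload, e.g. a signature table */
} PyTypeExtEntry;

typedef struct {
    Py_ssize_t count;
    PyTypeExtEntry *entries;
} PyTypeExtTable;

/* A consumer such as Cython would do one scan per type and cache the result: */
static void *
find_type_extension(PyTypeExtTable *table, unsigned long wanted_id)
{
    Py_ssize_t i;
    if (table == NULL)
        return NULL;
    for (i = 0; i < table->count; i++) {
        if (table->entries[i].id == wanted_id)
            return table->entries[i].data;
    }
    return NULL;
}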


The thing is, you *can* just use a subtype of PyType_Type for this 
purpose (or any purpose), it's just my opinion that it's not the best
solution here; it means many different libraries need a common 
dependency for this reason alone (or dynamically handshake on a base 
class at runtime). You could just stick that base class in CPython, 
which would be OK I guess but not great (using the type hierarchy is 
quite intrusive in general; you didn't subclass PyType_Type to stick in 
tp_as_buffer either).
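
The runtime handshake variant would amount to something like this on the
consumer side (module and attribute names are invented for the example):

#include <Python.h>

static PyTypeObject *shared_metatype = NULL;

static int
uses_shared_metatype(PyObject *obj)
{
    if (shared_metatype == NULL) {
        /* fetch the agreed-on metatype lazily, once per consumer module */
        PyObject *mod = PyImport_ImportModule("_nativecall");   /* invented name */
        PyObject *tp;
        if (mod == NULL) {
            PyErr_Clear();
            return 0;
        }
        tp = PyObject_GetAttrString(mod, "NativeCallType");
        Py_DECREF(mod);
        if (tp == NULL || !PyType_Check(tp)) {
            Py_XDECREF(tp);
            PyErr_Clear();
            return 0;
        }
        shared_metatype = (PyTypeObject *)tp;   /* kept for the module's lifetime */
    }
    /* the object's *type* must be an instance of the shared metatype */
    return PyObject_TypeCheck((PyObject *)Py_TYPE(obj), shared_metatype);
}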


Dag


Re: [Python-Dev] C-level duck typing

2012-05-17 Thread Nick Coghlan
I think the main things we'd be looking for would be:
- a clear explanation of why a new metaclass is considered too complex a
solution
- what the implications are for classes that have nothing to do with the
SciPy/NumPy ecosystem
- how subclassing would behave (both at the class and metaclass level)

Yes, defining a new metaclass for fast signature exchange has its
challenges - but it means that *our* concerns about maintaining consistent
behaviour in the default object model and avoiding adverse effects on code
that doesn't need the new behaviour are addressed automatically.

Also, I'd consider a functioning reference implementation using a custom
metaclass a requirement before we considered modifying type anyway, so I
think that's the best thing to pursue next rather than a PEP. It also has
the virtue of letting you choose which Python versions to target and
iterating at a faster rate than CPython.

Cheers,
Nick.
--
Sent from my phone, thus the relative brevity :)


Re: [Python-Dev] C-level duck typing

2012-05-17 Thread martin
If we in the end decide that we would like to propose the PEP, does
anyone feel the odds are anything but very, very slim? I don't think
I've heard a single positive word about the proposal so far except
from Cython devs, so I'm reluctant to spend my own and your time on
fleshing out a full PEP for that reason.


Before you do that, it might be useful to publish a precise, reproducible,
complete benchmark first, to support the performance figures you have been
quoting.

I'm skeptical by nature, so I don't believe any of the numbers you have given
until I can reproduce them myself. More precisely, I fail to understand what
they mean without seeing the source code that produced them (perhaps along
with an indication of what hardware, operating system, compiler version,
and Python version were used to produce them).

Regards,
Martin

