Re: [Python-Dev] [Python-checkins] cpython: Issue #14624: UTF-16 decoding is now 3x to 4x faster on various inputs.
> http://hg.python.org/cpython/rev/cdcc816dea85
> changeset: 76971:cdcc816dea85
> user: Antoine Pitrou
> date: Tue May 15 23:48:04 2012 +0200
> summary: Issue #14624: UTF-16 decoding is now 3x to 4x faster on various inputs. Patch by Serhiy Storchaka.

Such an optimization should be mentioned in the "What's New in Python 3.3" document if Python 3.3 is now faster than Python 3.2. The same remark applies to the UTF-8 optimization.

Victor

___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
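[Editor's note: the commit message gives no benchmark, but the claim is easy to sanity-check by running a sketch like the following under two interpreter builds. The payload and iteration count here are arbitrary choices, not the ones used to produce the 3x-4x figure.]

```python
import timeit

# Arbitrary UTF-16 payload; run the same script under 3.2 and 3.3 and
# compare. The 3x-4x figure comes from the commit message, not this script.
data = ("spam eggs \u00e9\u20ac " * 10000).encode("utf-16-le")

n = 200
t = timeit.timeit(lambda: data.decode("utf-16-le"), number=n)
print("%d decodes of %d bytes: %.3f s" % (n, len(data), t))
```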
Re: [Python-Dev] C-level duck typing
> So maybe it's worth thinking about making a general mechanism available
> for third parties to extend the type object without them all needing to
> have their own tp_flags bits and without needing to collude with each
> other to avoid conflicts.

That mechanism is already available: subclass PyType_Type, and add whatever fields you want.

Regards,
Martin
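[Editor's note: at the Python level, the counterpart of subclassing PyType_Type is a custom metaclass. A minimal sketch, with all names (`SignatureMeta`, `c_signature`) invented for illustration and not part of any proposed spec:]

```python
# A metaclass is a subtype of "type" and can carry extra per-type data,
# the Python-level analogue of adding fields to PyType_Type.
class SignatureMeta(type):
    def __new__(mcls, name, bases, ns, c_signature=None):
        cls = super().__new__(mcls, name, bases, ns)
        cls.c_signature = c_signature  # the extra "slot"
        return cls

    def __init__(cls, name, bases, ns, c_signature=None):
        super().__init__(name, bases, ns)

class FastSin(metaclass=SignatureMeta, c_signature="double (double)"):
    pass

print(FastSin.c_signature)  # the extra field lives on the type object
```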
Re: [Python-Dev] C-level duck typing
Dag Sverre Seljebotn wrote:
> On 05/16/2012 10:24 PM, Robert Bradshaw wrote:
>> On Wed, May 16, 2012 at 11:33 AM, "Martin v. Löwis" wrote:
>>>> Does this use case make sense to everyone?
>>>>
>>>> The reason why we are discussing this on python-dev is that we are
>>>> looking for a general way to expose these C level signatures within
>>>> the Python ecosystem. And Dag's idea was to expose them as part of
>>>> the type object, basically as an addition to the current Python
>>>> level tp_call() slot.
>>>
>>> The use case makes sense, yet there is also a long-standing solution
>>> already to expose APIs and function pointers: the capsule objects.
>>>
>>> If you want to avoid dictionary lookups on the server side, implement
>>> tp_getattro, comparing addresses of interned strings.
>>
>> Yes, that's an idea worth looking at. The point about implementing
>> tp_getattro to avoid dictionary lookup overhead is a good one, worth
>> trying at least. One drawback is that this approach does require the
>> GIL (as does _PyType_Lookup).
>>
>> Regarding the C function being faster than the dictionary lookup (or
>> at least close enough that the lookup takes time), yes, this happens
>> all the time. For example, one might be solving differential equations
>> and the "user input" is essentially a set of (usually simple)
>> double f(double) functions and their derivatives.
>
> To underline how performance critical this is to us, perhaps a full
> Cython example is useful.
>
> The following Cython code is a real-world use case. It is not too
> contrived in the essentials, although simplified a little bit. For
> instance, undergrad engineering students could pick up Cython just to
> play with simple scalar functions like this.
>
>     from numpy import sin
>     # assume sin is a Python callable and that NumPy decides to support
>     # our spec to also support getting a "double (*sinfuncptr)(double)".
>
>     # Our mission: avoid having the user manually import "sin" from C,
>     # but allow just using the NumPy object and still be fast.
>
>     # define a function to integrate
>     cpdef double f(double x):
>         return sin(x * x)  # guess on signature and use "fastcall"!
>
>     # the integrator
>     def integrate(func, double a, double b, int n):
>         cdef double s = 0
>         cdef double dx = (b - a) / n
>         for i in range(n):
>             # This is also a fastcall, but can be cached so doesn't
>             # matter...
>             s += func(a + i * dx)
>         return s * dx
>
>     integrate(f, 0, 1, 100)
>
> There are two problems here:
>
> - The "sin" global can be reassigned (monkey-patched) between each call
>   to "f", with no way for "f" to know. Even "sin" itself could do the
>   reassignment. So you'd need to check for reassignment to do caching...

Since Cython allows static typing, why not just declare that "f" can treat "sin" as if it can't be monkey-patched? Moving the load of a global variable out of the loop does seem to be a rather obvious optimisation, if it were declared to be legal.

> - The fastcall inside of "f" is separated from the loop in "integrate".
>   And since "f" is often in another module, we can't rely on static
>   whole-program analysis.
>
> These problems with monkey-patching disappear if the lookup is
> negligible.
>
> Some rough numbers:
>
> - The tp_flags hack has a 2 ns overhead (something similar with a
>   metaclass; the problem there is more how to synchronize that
>   metaclass across multiple third-party libraries).

Does your approach handle subtyping properly?

> - Dict lookup: 20 ns.

Did you time _PyType_Lookup()?

> - The sin function is about 35 ns. And "f" is probably only 2-3 ns, and
>   there could very easily be multiple such functions, defined in
>   different modules, chained together in order to build up a formula.

Such micro-timings are meaningless, because the working set often tends to fit in the hardware cache. A level 2 cache miss can take hundreds of cycles.

Cheers,
Mark.
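[Editor's note: the first problem has a pure-Python illustration with no Cython or NumPy involved; `f_frozen` and `_sin` are invented names. A module global like `sin` is re-read on every call, and the only way to opt out today is to freeze the binding by hand:]

```python
from math import sin

# "f" re-reads the global "sin" on every call, so monkey-patching is
# visible; "f_frozen" snapshots the binding at definition time, which is
# the semantics Cython would need permission to assume.
def f(x):
    return sin(x * x)

def f_frozen(x, _sin=sin):  # default argument freezes the current "sin"
    return _sin(x * x)

def integrate(func, a, b, n):
    s = 0.0
    dx = (b - a) / n
    for i in range(n):
        s += func(a + i * dx)
    return s * dx

print(integrate(f, 0.0, 1.0, 100))         # same value...
print(integrate(f_frozen, 0.0, 1.0, 100))  # ...different lookup semantics
```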
Re: [Python-Dev] C-level duck typing
Mark Shannon, 17.05.2012 12:38:
> Dag Sverre Seljebotn wrote:
>> [... same Cython integration example as quoted above ...]
>>
>> There are two problems here:
>>
>> - The "sin" global can be reassigned (monkey-patched) between each
>>   call to "f", with no way for "f" to know. Even "sin" itself could do
>>   the reassignment. So you'd need to check for reassignment to do
>>   caching...
>
> Since Cython allows static typing, why not just declare that func can
> treat sin as if it can't be monkey-patched?

You'd simply say:

    cdef object sin    # declare it as a C variable of type 'object'
    from numpy import sin

That's also the one obvious way to do it in Cython.

> Moving the load of a global variable out of the loop does seem to be a
> rather obvious optimisation, if it were declared to be legal.

My proposal was to simply extract any C function pointers at assignment time, i.e. at import time in the example above. Signature matching can then be done at the first call, and the result can be cached as long as the object variable isn't changed. All of that is local to the module and can thus easily be controlled at code generation time.

Stefan
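[Editor's note: a pure-Python model may make Stefan's proposal concrete. Everything here is invented for illustration — the `__nativecall__` attribute, the `"d->d"` signature string, and the class names are not part of any real spec. The wrapper captures the object at assignment time, matches the signature on first call, and caches the result:]

```python
import math

class NativeWrapper:
    """Capture a callable once; probe for a faster entry point on first call."""
    def __init__(self, obj, wanted_signature):
        self.obj = obj
        self.wanted = wanted_signature
        self.fast = None
        self.probed = False

    def __call__(self, x):
        if not self.probed:
            # signature matching happens exactly once, at the first call
            table = getattr(self.obj, "__nativecall__", {})
            self.fast = table.get(self.wanted)
            self.probed = True
        if self.fast is not None:
            return self.fast(x)   # cached fast path
        return self.obj(x)        # generic fallback

class Sin:
    """Stand-in for a NumPy ufunc that also advertises a specialized entry."""
    __nativecall__ = {"d->d": math.sin}
    def __call__(self, x):
        return math.sin(x)

sin = NativeWrapper(Sin(), "d->d")  # "assignment time": wrapper fixed here
print(sin(1.0))                     # first call probes and caches
print(sin.fast is math.sin)         # later calls take the fast path
```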
Re: [Python-Dev] sys.implementation
PEP 421 has reached a good place and I'd like to ask for pronouncement.

Thanks!

-eric
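[Editor's note: PEP 421 was subsequently accepted, and `sys.implementation` has shipped since Python 3.3. The attribute names below are the four required ones from the PEP; the values naturally depend on the interpreter you run this under.]

```python
import sys

# The four required attributes from PEP 421 (available since Python 3.3).
impl = sys.implementation
print(impl.name)        # lowercase implementation name, e.g. "cpython"
print(impl.version)     # implementation version, shaped like sys.version_info
print(impl.hexversion)  # the same, as an int in the sys.hexversion format
print(impl.cache_tag)   # tag used in .pyc file names; may be None
```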
Re: [Python-Dev] C-level duck typing
Mark Shannon wrote:
> Dag Sverre Seljebotn wrote:
>> [... same Cython integration example as quoted earlier in the thread ...]
>>
>> There are two problems here:
>>
>> - The "sin" global can be reassigned (monkey-patched) between each
>>   call to "f", with no way for "f" to know. Even "sin" itself could do
>>   the reassignment. So you'd need to check for reassignment to do
>>   caching...
>
> Since Cython allows static typing, why not just declare that func can
> treat sin as if it can't be monkey-patched?

If you want to manually declare stuff, you can always use a C function pointer too...

> Moving the load of a global variable out of the loop does seem to be a
> rather obvious optimisation, if it were declared to be legal.

In case you didn't notice, there were no global variable loads inside the loop... You can keep chasing this, but there are *always* cases where such declarations don't help (and you need to save the situation by manual typing). Anyway: we should really discuss Cython on the Cython list. If my motivating example wasn't good enough for you, there's really nothing I can do.

>> Some rough numbers:
>>
>> - The tp_flags hack has a 2 ns overhead (something similar with a
>>   metaclass; the problem there is more how to synchronize that
>>   metaclass across multiple third-party libraries).
>
> Does your approach handle subtyping properly?

Not really.

>> - Dict lookup: 20 ns.
>
> Did you time _PyType_Lookup()?

No, I didn't get around to it yet (and thanks for pointing it out). (Though the GIL requirement is an issue for Cython too.)

>> - The sin function is about 35 ns. And "f" is probably only 2-3 ns,
>>   and there could very easily be multiple such functions, defined in
>>   different modules, chained together in order to build up a formula.
>
> Such micro-timings are meaningless, because the working set often tends
> to fit in the hardware cache. A level 2 cache miss can take hundreds of
> cycles.

I find this sort of response arrogant -- do you know the details of every use case for a programming language under the sun?

Many Cython users are scientists, and in scientific computing in particular you *really* have the whole range of problems and working sets. Honestly. In some codes you only really care about the speed of the disk controller. In other cases you can spend *many seconds* working almost entirely in L1 or perhaps L2 cache (for instance when integrating ordinary differential equations in a few variables, which is not entirely different in nature from the example I posted). (Then those many seconds are replicated many millions of times for different parameters on a large cluster, and a 2x speedup translates directly into large amounts of saved money.) Also, with numerical codes you block up the problem so that loads to L2 are amortized over sufficient FLOPs (when you can).

Every time Cython becomes able to do this kind of thing more easily, people thank us that they didn't have to dig up Fortran but could stay closer to Python.

Sorry for going off on a rant. I find that people will give well-meant advice about performance, but that advice is often just a generalization from computer programs in entirely different domains (web apps?), and sweeping generalizations have a way of giving the wrong answer.

Dag
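[Editor's note: the "dict lookup: 20 ns" kind of figure is easy to re-measure with the stdlib alone; absolute numbers depend entirely on hardware and interpreter build, and the string-keyed dict here is only a stand-in for a type's method table.]

```python
import timeit

# Time a plain dict lookup, the baseline under discussion.
n = 1000000
t = timeit.timeit("d['fastcall']", setup="d = {'fastcall': object()}", number=n)
print("dict lookup: %.1f ns" % (t / n * 1e9))
```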
Re: [Python-Dev] C-level duck typing
On 05/17/2012 08:13 PM, Dag Sverre Seljebotn wrote:
> Mark Shannon wrote:
>> [... same quoted discussion as in the previous message ...]
>>
>> Such micro-timings are meaningless, because the working set often
>> tends to fit in the hardware cache. A level 2 cache miss can take
>> hundreds of cycles.

I'm sorry; if my rant wasn't clear: such micro-benchmarks do in fact mimic very closely what you'd do if you were to, say, integrate an ordinary differential equation. You *do* have a tight loop like that, just hammering on floating-point numbers. Making that specific use case more convenient was actually the original need that spawned this discussion on the NumPy list over a month ago...

Dag

> I find this sort of response arrogant -- do you know the details of
> every use case for a programming language under the sun?
>
> [... rest of the previous message quoted in full ...]
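[Editor's note: the workload Dag describes can be made concrete in a few lines of pure Python — a hypothetical toy, not anyone's production code. Forward-Euler integration of dy/dt = -y spends essentially all its time in a tiny float-only callback, so per-call dispatch cost dominates:]

```python
# Forward-Euler integration of the toy ODE dy/dt = -y with y(0) = 1.
# Each of the n steps is one small callback on a C double; with
# n = 100000 the call overhead is most of the runtime.
def deriv(t, y):
    return -y

def euler(f, y0, t0, t1, n):
    dt = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y += dt * f(t, y)
        t += dt
    return y

y = euler(deriv, 1.0, 0.0, 1.0, 100000)
print(y)  # converges to exp(-1) ~ 0.36788 as n grows
```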
Re: [Python-Dev] C-level duck typing
On Thu, 17 May 2012 20:13:41 +0200, Dag Sverre Seljebotn wrote:
> Every time Cython becomes able to do stuff more easily in this domain,
> people thank us that they didn't have to dig up Fortran but can stay
> closer to Python.
>
> Sorry for going off on a rant. I find that people will give well-meant
> advice about performance, but that advice is just generalizing from
> computer programs in entirely different domains (web apps?), and
> sweeping generalizations have a way of giving the wrong answer.

I don't have opinions on the specific topic under discussion, since I don't get involved in the C level stuff unless I have to, but I do have some small amount of background in scientific computing (many years ago). I just want to chime in to say that I think it benefits the whole Python community to extend welcoming arms to the scientific Python community and see what we can do to help them (without, of course, compromising Python). I think it is safe to assume that they do have significant experience with real applications where timings at this level of detail do matter. The scientific computing community is pretty much by definition pushing the limits of what's possible.

--David
Re: [Python-Dev] C-level duck typing
On 05/17/2012 05:00 AM, Greg Ewing wrote:
> On 17/05/12 12:17, Robert Bradshaw wrote:
>> This is exactly what was proposed to start this thread (with minimal
>> collusion to avoid conflicts, specifically partitioning up a global ID
>> space).
>
> Yes, but I think this part of the mechanism needs to be spelled out in
> more detail, perhaps in the form of a draft PEP. Then there will be
> something concrete to discuss on python-dev.

Well, we weren't 100% sure what the best mechanism is, so the point really was to solicit input, even if I got a bit argumentative along the way. Thanks to all of you!

If we decide in the end that we would like to propose the PEP, does anyone feel the odds of acceptance are anything but very, very slim? I don't think I've heard a single positive word about the proposal so far except from Cython devs, so I'm reluctant to spend my own and your time on fleshing out a full PEP for that reason.

In a PEP, the proposal would likely be an additional pointer to a table of "custom PyTypeObject extensions", not a flag bit. The whole point would be to do this only once; after that, PyTypeObject would be indefinitely extensible for custom purposes without collisions (even as a way of field-testing PEPs about PyTypeObject before final approval!). Of course, one more pointer per type object is a bigger burden to push on others.

The thing is, you *can* just use a subtype of PyType_Type for this purpose (or any purpose); it's just my opinion that it's not the best solution here. It means many different libraries need a common dependency for this reason alone (or have to dynamically handshake on a base class at runtime). You could just stick that base class in CPython, which would be OK I guess, but not great (using the type hierarchy for this is quite intrusive in general; you didn't subclass PyType_Type to stick in tp_as_buffer either).

Dag
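[Editor's note: a pure-Python model of the "pointer to a table of custom PyTypeObject extensions" idea may help; every name and ID value below is invented for illustration. Projects claim disjoint ID ranges, and per-type extension records live in one shared table instead of per-project flag bits:]

```python
# Hypothetical model: each interested project claims an ID range, and
# extension data for a type is stored in one shared table.
CYTHON_EXT_ID = 0x1000   # range claimed by project A
NUMPY_EXT_ID = 0x2000    # range claimed by project B

_type_extensions = {}    # (type, ext_id) -> opaque extension record

def set_extension(tp, ext_id, data):
    _type_extensions[(tp, ext_id)] = data

def get_extension(tp, ext_id):
    # walk the MRO so subclasses inherit extensions, mirroring type lookup
    for base in tp.__mro__:
        try:
            return _type_extensions[(base, ext_id)]
        except KeyError:
            pass
    return None

class MyFunc:
    pass

class Sub(MyFunc):
    pass

set_extension(MyFunc, CYTHON_EXT_ID, {"signature": "d->d"})
print(get_extension(MyFunc, CYTHON_EXT_ID))
print(get_extension(Sub, CYTHON_EXT_ID))  # inherited via the MRO walk
```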
Re: [Python-Dev] C-level duck typing
I think the main things we'd be looking for would be:

- a clear explanation of why a new metaclass is considered too complex a solution
- what the implications are for classes that have nothing to do with the SciPy/NumPy ecosystem
- how subclassing would behave (both at the class and metaclass level)

Yes, defining a new metaclass for fast signature exchange has its challenges - but it means that *our* concerns about maintaining consistent behaviour in the default object model and avoiding adverse effects on code that doesn't need the new behaviour are addressed automatically.

Also, I'd consider a functioning reference implementation using a custom metaclass a requirement before we considered modifying type anyway, so I think that's the best thing to pursue next rather than a PEP. It also has the virtue of letting you choose which Python versions to target and of iterating at a faster rate than CPython.

Cheers,
Nick.

--
Sent from my phone, thus the relative brevity :)
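[Editor's note: on the subclassing question, the baseline behaviour is already friendly to this approach: a custom metaclass propagates to subclasses automatically, so an `isinstance()` check on the type can play the role a tp_flags bit would. A minimal sketch with invented names:]

```python
# Marker metaclass: types created with it (directly or via inheritance)
# are recognizable with a single isinstance() check.
class ProvidesSignatures(type):
    """Instances of this metaclass are assumed to expose C signatures."""

class FastBase(metaclass=ProvidesSignatures):
    pass

class UserSubclass(FastBase):   # no metaclass mentioned here...
    pass

print(isinstance(UserSubclass, ProvidesSignatures))  # ...yet it qualifies
```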
Re: [Python-Dev] C-level duck typing
> If we decide in the end that we would like to propose the PEP, does
> anyone feel the odds of acceptance are anything but very, very slim? I
> don't think I've heard a single positive word about the proposal so far
> except from Cython devs, so I'm reluctant to spend my own and your time
> on fleshing out a full PEP for that reason.

Before you do that, it might be useful to publish a precise, reproducible, complete benchmark first, to support the performance figures you have been quoting. I'm skeptical by nature, so I don't believe any of the numbers you have given until I can reproduce them myself. More precisely, I fail to understand what they mean without seeing the source code that produced them (perhaps along with an indication of what hardware, operating system, compiler version, and Python version were used to produce them).

Regards,
Martin
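[Editor's note: a minimal skeleton of the kind of self-describing benchmark being asked for. The timed operation, `math.sin`, merely stands in for whatever figure is being defended, and the iteration count is arbitrary; the point is that environment details are printed next to the numbers so others can reproduce or refute them.]

```python
import platform
import sys
import timeit

# Print environment details alongside the measurement so the result is
# reproducible on other machines and interpreter builds.
def report():
    print("python  :", sys.version.replace("\n", " "))
    print("platform:", platform.platform())
    n = 1000000
    t = timeit.timeit("sin(0.5)", setup="from math import sin", number=n)
    print("sin(0.5): %.1f ns/call" % (t / n * 1e9))

report()
```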
