Re: [Python-Dev] [Python-checkins] Daily reference leaks (cbd4a6a2657e): sum=134
Hi,

I pushed a fix before you sent your message. At least test_ast should be fixed.
https://hg.python.org/cpython/rev/c5df914e73ad

FYI I'm unable to reproduce the test_collections leak.

Victor

On Tuesday 26 January 2016, Brett Cannon wrote:
> Looks like Victor's ast.Constant change introduced a refleak.
>
> On Tue, 26 Jan 2016 at 00:47  wrote:
>
>> results for cbd4a6a2657e on branch "default"
>>
>> test_ast leaked [39, 39, 39] references, sum=117
>> test_ast leaked [5, 5, 5] memory blocks, sum=15
>> test_collections leaked [-2, 0, 0] references, sum=-2
>> test_functools leaked [0, 2, 2] memory blocks, sum=4
>>
>> Command line was: ['./python', '-m', 'test.regrtest', '-uall', '-R',
>> '3:3:/home/psf-users/antoine/refleaks/reflogqIZGVY', '--timeout', '7200']
Re: [Python-Dev] FAT Python (lack of) performance
On 27 January 2016 at 05:23, Sjoerd Job Postmus wrote:
> On Mon, Jan 25, 2016 at 11:58:12PM +0100, Victor Stinner wrote:
>> ...
>> Oh, there are a lot of things to do! ...
>
> Just wondering, do you also need a set of (abusive) test-cases which
> check 100% conformity to the CPython semantics? I'm sure many of us
> would be able to whip up some ideas of things that are possible with
> Python and are of the kind "but you should not do that! That's bad
> programming!" which may or may not break the optimizations (especially
> specialized functions).
>
> I'm thinking of things like
>
> def override_length_function_test():
>     global len
>     value = len("abc")
>     len = lambda obj: ord(obj[0])
>     value += len("abc")
>     assert value == 3 + 97, "Outdated `len` used."
>
> And also cases where `len` was changed not in the function itself, but
> in another function that was called indirectly (maybe 4 functions down,
> one of which was monkey-patched in after the compilation step):
>
> # module_a.py
> def test_func(callback):
>     value = len("abc")
>     callback()
>     value += len("abc")
>     assert value == 3 + 97, "Outdated `len` used."
>     return value
>
> # module_b.py
> import module_a
> def override_length_function_test():
>     def callback():
>         module_a.len = lambda obj: ord(obj[0])
>     assert module_a.test_func(callback) == 3 + 97, "Outdated `len` used."
>
> (I'm sure some of the other readers of this list can be even more
> creative in trying to show that making optimizations like this can break
> semantics.)
>
> Other fun tricks I'd like to try is overriding the `len` method from a
> signal handler, what happens when you monkey-patch a dependent method,
> having `__getitem__` and `__getattr__` on some value override `len`.
>
> Basically: trying things that I normally should not try during my
> working hours, on account of wanting to still have a job the next day.
>
> Kind regards,
> Sjoerd Job
Maybe I'm just nasty, but IMO those kinds of "torture tests" are just
as valuable in general, so I'd encourage people to submit patches to
the main Python test suite to add them.
Paul
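For illustration, a minimal sketch of the signal-handler variant mentioned in the quoted message (not from the original thread; Unix-only since it relies on SIGALRM, and the helper name patch_len is invented):

    # Illustrative sketch of the "override `len` from a signal handler" idea.
    import builtins
    import signal

    real_len = builtins.len

    def patch_len(signum, frame):
        builtins.len = lambda obj: ord(obj[0])   # rebind the builtin asynchronously

    signal.signal(signal.SIGALRM, patch_len)
    signal.alarm(1)                              # handler fires roughly 1s from now

    value = 0
    while value == 0:
        if len("abc") == 97:                     # a specialized `len` must notice the rebinding
            value = len("abc")

    builtins.len = real_len                      # restore the real builtin
    assert value == 97, "Outdated `len` used."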
Re: [Python-Dev] FAT Python (lack of) performance
Python has test_dynamic which tests such corner cases.
For example, test_modify_builtins_while_generator_active(): "Modify
the builtins out from under a live generator."
https://hg.python.org/cpython/file/58266f5101cc/Lib/test/test_dynamic.py#l49
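Roughly, such a test looks like the following sketch (an illustration of the idea, not the actual Lib/test/test_dynamic.py code):

    # "Modify the builtins out from under a live generator": the generator
    # must re-resolve `len` on every resumption.
    import builtins

    def gen():
        while True:
            yield len("abc")

    g = gen()
    assert next(g) == 3

    real_len = builtins.len
    builtins.len = lambda obj: 42
    try:
        assert next(g) == 42   # the live generator must see the new builtin
    finally:
        builtins.len = real_len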
Victor
Re: [Python-Dev] Buildbot timing out - test suite failure - test_socket issue with UDP6?
2016-01-23 7:03 GMT+01:00 Chris Angelico :
> I just had a major crash on the system that hosts the
> angelico-debian-amd64 buildbot, and as usual, checked it carefully
> after bringing everything up. It seems now to be timing out after an
> hour of operation:
>
> http://buildbot.python.org/all/builders/AMD64%20Debian%20root%203.x/builds/3132/steps/test/logs/stdio
I opened http://bugs.python.org/issue26206 to track this issue.
> Running just that test file:
>
> $ ./python Lib/test/test_socket.py
> ... chomp lots of lines ...
> testRecvmsgPeek (__main__.RecvmsgUDP6Test) ...
>
> seems to indicate that the stall is due to IPv6 and UDP. The VM should
> have full IPv6 support, although my ISPs don't carry IPv6 traffic, so
> it won't be able to reach the internet proper; but it should be able
> to do all manner of local tests.
Try applying the attached patch and run:
$ ./python -m test -v -m testRecvmsgPeek test_socket
(...)
testRecvmsgPeek (test.test_socket.RecvmsgUDP6Test) ... CLI SOCK
SERV SOCK
CLI SOCK ('::1', 44347, 0, 0)
SERV SOCK ('::1', 40488, 0, 0)
ok
testRecvmsgPeek (test.test_socket.RecvmsgIntoUDP6Test) ... CLI SOCK
SERV SOCK
CLI SOCK ('::1', 52721, 0, 0)
SERV SOCK ('::1', 43967, 0, 0)
ok
(...)
As you can see: the test uses the local loopback interface.
Inet6TestBase.host is "::1".
You can try to run a UDP server using netcat: "nc -l -u ::1 12345".
Keep the command running in a terminal, and then run in a different
terminal: "echo abc | nc -u ::1 12345". You should receive abc in the
server.
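The same loopback check can be done from Python directly, as a rough sketch (not part of the original message):

    # Send a datagram to an IPv6 UDP socket bound on the loopback and read it
    # back, to confirm basic ::1 UDP connectivity outside the test suite.
    import socket

    serv = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    serv.bind(("::1", 0))                      # let the kernel pick a port
    cli = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    cli.sendto(b"abc", serv.getsockname())
    print(serv.recvfrom(1024))                 # expect (b'abc', ('::1', ...))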
Victor
diff -r 58266f5101cc Lib/test/test_socket.py
--- a/Lib/test/test_socket.py Tue Jan 26 21:46:03 2016 -0800
+++ b/Lib/test/test_socket.py Wed Jan 27 10:35:32 2016 +0100
@@ -2049,6 +2049,8 @@ class SendrecvmsgConnectionlessBase(Send
         return ([], [], 0, self.serv_addr)
 
     def sendToServer(self, msg):
+        print("CLI SOCK", self.cli_sock)
+        print("CLI SOCK", self.cli_sock.getsockname())
         return self.cli_sock.sendto(msg, self.serv_addr)
@@ -2371,6 +2373,8 @@ class RecvmsgGenericTests(SendrecvmsgBas
         # data without consuming it.
 
         # Receive part of data with MSG_PEEK.
+        print("SERV SOCK", self.serv_sock)
+        print("SERV SOCK", self.serv_sock.getsockname())
         msg, ancdata, flags, addr = self.doRecvmsg(self.serv_sock,
                                                    len(MSG) - 3, 0,
                                                    socket.MSG_PEEK)
Re: [Python-Dev] FAT Python (lack of) performance
On 1/26/2016 12:35 PM, Sven R. Kunze wrote:
> I completely agree with INADA.

I am not sure you do.

> It's like saying, because a specific crossroad features a higher accident
> rate, *people need to change their driving behavior*.
> *No!* People won't change and it's not necessary either. The crossroad
> needs to be changed to be safer.

Safer crossroads tend to be slower unless one switches to alternate designs
that eliminate crossing streams of traffic. Python is safer, in everyday use
as well as in artificial benchmarks, and is slower as a result.

Languages that don't have integers but use residue classes (with wraparound)
or finite integer classes (with overflow) as a (faster) substitute have, in
practice, lots of accidents (bugs) when used by non-experts. Guido noticed
this, gave up on changing coder behavior, and put the expert behavior of
checking for wraparound/overflow and switching to real integers (longs) into
the language. (I forget when this was added.)

The purpose of the artificially low input to fib() is to hide and avoid the
bugginess of most languages. The analogous trick with testing crossroads
would be to artificially restrict the density of cars to mask the
accident-proneness of a 'fast, consenting-adults' crossroads with no stop
signs and no stop lights.

> Same goes for Python. If it's slow using the very same piece of code (even
> superficially), you better make the language faster. Developers won't
> change and they won't change their code either. Just not necessary.

Instead of making people rewrite fib to dramatically increase speed, we added
the lru_cache decorator to get most of the benefit without a rewrite. But
Inada rejected this Python speedup. An ast optimizer could potentially do the
same speedup without the explicit decorator. (No side-effects? Multiple
recursive calls? Add a cache!)

> Btw. it would be a great feature for Python 3 to be faster than Python 2.

We all agree on that. One way for this to happen is to add optimizers that
would make Python 'cheat' on microbenchmarks.

--
Terry Jan Reedy
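For concreteness, the rewrite-free speedup referred to here looks roughly like this (an illustrative sketch, not code from the thread):

    # functools.lru_cache memoizes the naive recursive fib, turning
    # exponential time into linear without rewriting the algorithm.
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    print(fib(50))   # 12586269025, instantly; the undecorated version would not finish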
Re: [Python-Dev] FAT Python (lack of) performance
On 27 January 2016 at 03:35, Sven R. Kunze wrote:
> I completely agree with INADA.
>
> It's like saying, because a specific crossroad features a higher accident
> rate, people need to change their driving behavior.
> No! People won't change and it's not necessary either. The crossroad needs
> to be changed to be safer.

Umm, no, that's not how this works - developers contribute to community
driven projects for their *own* reasons. Nobody gets to tell them what to do
unless they're paying them.

Micro-optimising a poor algorithm won't deliver macro level improvements
because macro level code uses things like space-speed trade-offs to improve
the algorithmic efficiency (as in the example of applying functools.lru_cache
to a naive recursive fibonacci implementation).

Victor's work on the FAT optimiser is interesting because it offers
opportunities to speed up even code that is already algorithmically
efficient, as well as making CPython a better platform for experimenting with
those kinds of changes.

More generally though, much larger opportunities for improvement lie in
persuading people to *stop writing code*, and instead spending more of their
time on finding and assembling code other people have *already written* into
solutions to interesting problems. *That's* the kind of improvement that
turns enormously complex problems like facial recognition into 25 line Python
scripts: https://realpython.com/blog/python/face-recognition-with-python/

Cheers,
Nick.

--
Nick Coghlan | [email protected] | Brisbane, Australia
Re: [Python-Dev] FAT Python (lack of) performance
On 1/26/2016 12:51 PM, Stephen J. Turnbull wrote:
Terry Reedy writes:
> On 1/26/2016 12:02 AM, INADA Naoki wrote:
>
> > People use same algorithm on every language when compares base language
> > performance [1].
>
> The python code is NOT using the same algorithm. The proof is that the
> Python function will return the correct value for, say fib(50) while
> most if not all the other versions will not.
Let me try to be clearer.
1. Like everyone else, I would like Python function calls to be faster
either in general or in special cases detected during compilation. This
will require micro-benchmarks for function calls that do just that: first
time an empty loop, then time a loop with a call to an empty function. Do
the same for various signatures, and maybe other special
cases.
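Such a call-overhead micro-benchmark might look roughly like the following timeit sketch (illustrative only; names and numbers are made up):

    # Time an empty loop, then a loop whose body is just a call to an
    # empty function; the difference approximates call overhead.
    import timeit

    def empty():
        pass

    loop_only = timeit.timeit("for _ in range(1000): pass", number=10000)
    with_call = timeit.timeit("for _ in range(1000): empty()",
                              globals={"empty": empty}, number=10000)
    print("empty loop:  %.3f s" % loop_only)
    print("loop + call: %.3f s" % with_call)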
2. Cross-language micro-benchmarks aimed at timing specific operations
are tough. To run on multiple languages, they must be restricted to a
lowest-common-denominator of features. It is impossible to make every
implementation perform exactly the same set of operations. Some
languages may bundle features together. Some languages and
implementations have optimizations that avoid unneeded operations. Not
all optimizatons can be turned off.
3. While there are trends as to the speed of implementations of a
language, benchmarks time particular implementations. Shedskin would
compile fib to a C function that runs much faster than fib with the
restricted subset of CPython allowed for the benchmark.
> True, but that's not a reasonable criterion for "same algorithm" in
> this context. Naoki's application ("base language performance"
> benchmarking) requires fib(n) only for n < 40, and run it in a loop
> 100 times if you want 2 more decimal places of precision ("40" is
> appropriate for an implementation with 32-bit ints).
So you agree that the limit of 39 is not intrinsic to the fib function
or its uses, but is an after-the-fact limit imposed to mask the bug
proneness of using substitutes for integers.
To my mind, a fairer and more useful benchmark of 'base language
performance' based on fib would use a wider domain. The report would
say that CPython (with lru_cache disallowed) is slow but works over a
wide range of inputs, while some other implementations of other
languages run faster for small inputs but fail catastrophically for
larger inputs. Users could then make a more informed pick.
Also see my answer to Sven Kunze.
--
Terry Jan Reedy
Re: [Python-Dev] Buildbot timing out - test suite failure - test_socket issue with UDP6?
On Wed, Jan 27, 2016 at 8:39 PM, Victor Stinner
wrote:
> 2016-01-23 7:03 GMT+01:00 Chris Angelico :
>> Running just that test file:
>>
>> $ ./python Lib/test/test_socket.py
>> ... chomp lots of lines ...
>> testRecvmsgPeek (__main__.RecvmsgUDP6Test) ...
>>
>> seems to indicate that the stall is due to IPv6 and UDP. The VM should
>> have full IPv6 support, although my ISPs don't carry IPv6 traffic, so
>> it won't be able to reach the internet proper; but it should be able
>> to do all manner of local tests.
>
> Try to apply attached patch and run:
>
> $ ./python -m test -v -m testRecvmsgPeek test_socket
> (...)
> testRecvmsgPeek (test.test_socket.RecvmsgUDP6Test) ... CLI SOCK
> type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 44347, 0, 0)>
> SERV SOCK type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 40488, 0, 0)>
> CLI SOCK ('::1', 44347, 0, 0)
> SERV SOCK ('::1', 40488, 0, 0)
> ok
> testRecvmsgPeek (test.test_socket.RecvmsgIntoUDP6Test) ... CLI SOCK
> type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 52721, 0, 0)>
> SERV SOCK type=SocketKind.SOCK_DGRAM, proto=0, laddr=('::1', 43967, 0, 0)>
> CLI SOCK ('::1', 52721, 0, 0)
> SERV SOCK ('::1', 43967, 0, 0)
> ok
> (...)
>
> As you can see: the test uses the local loopback interface.
> Inet6TestBase.host is "::1".
Confirmed. It does two tests of IPv4 which run just fine, and then:
testRecvmsgPeek (test.test_socket.RecvmsgUDP6Test) ... CLI SOCK
CLI SOCK ('::1', 47518, 0, 0)
SERV SOCK
SERV SOCK ('::1', 42421, 0, 0)
and hangs until I interrupt it.
> You can try to run a UDP server using netcat: "nc -l -u ::1 12345".
> Keep the command running in a terminal, and then run in a different
> terminal: "echo abc | nc -u ::1 12345". You should receive abc in the
> server.
(Oddly, the default netcat-traditional doesn't seem to support this,
but installing netcat-openbsd adds IPv6 support.)
Yep, and that works flawlessly. It's nothing weird about that
particular port, either - nc can use 42421 without a problem.
After digging through test_socket.py for over an hour (the MRO for
RecvmsgUDP6Test is enormous!!), I've boiled the issue down to this:
import socket
MSG = b'asdf qwer zxcv'
serv = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
serv.bind(("::1", 0))
cli = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
cli.bind(("::1", 0))
cli.sendto(MSG, serv.getsockname())
print(serv.recvmsg(len(MSG) - 3, 0, socket.MSG_PEEK))
print(serv.recvmsg(len(MSG), 0, socket.MSG_PEEK))
print(serv.recvmsg(len(MSG)))
On my main system, this produces three lines of output: the first has
truncated text, the second has full text, and the third also has full
text. This proves that MSG_PEEK is working correctly. On the buildbot,
though, the first one stalls out. Commenting that line out produces
correct results - peek the full data, then read it, and all is well.
Any idea why partial read on a datagram socket would sometimes stall?
ChrisA
Re: [Python-Dev] FAT Python (lack of) performance
Terry Reedy writes:

> So you agree that the limit of 39 is not intrinsic to the fib function
> or its uses, but is an after-the-fact limit imposed to mask the bug
> proneness of using substitutes for integers.

I don't know what the limit used in the benchmark is, but it must be quite a
bit lower than 50 for 32-bit integers and could be greater than 90 for 64-bit
integers. And it's not "masking bugs", it's "respecting the domain of valid
input", s'il vous plait.

> To my mind, a fairer and more useful benchmark of 'base language
> performance' based on fib would use a wider domain.

"Fair", maybe. But why play this game at all? These benchmarks are simply not
useful to users choosing languages, unless they already know the difficulties
of interpreting benchmarks and are willing to expend the effort to account
for them. Without that knowledge and effort, choosing a programming language
based on microbenchmarks is like choosing a car based on the leg-length of
the model sitting on the hood in the TV commercial.

> The report would say that CPython (with lru_cache disallowed) is
> slow but works over a wide range of inputs,

No, the report would say "Use of this benchmark for cross-language comparison
of function call speed is more or less inaccurate due to differences in
representation of integers and in handling the possibility of exceptions in
'integer' arithmetic." You are picking one tiny difference, but there are
potentially many, some quite a bit larger on the tested domain (for example,
some languages may be able to optimize fib() to unboxed integers, in which
case they'll blow away all those that don't).

> Users could then make a more informed pick.

My point in my reply to Nick is that users aren't making informed picks. If
they were, we wouldn't even be thinking about having this conversation. I'm
not sure what they are doing (maybe, as Nick suggests, justifying their
"tribal" prejudices?), but it's not that. ;-) Sure, other things being equal,
better benchmarks will improve runtime performance, but other things are so
far from being equal even an economist can't say "ceteris paribus" here.

To expand that point: I don't really see a point in users (ie, developers in
Python and other such languages) looking at these benchmarks except for the
fun of feeling like implementers, to be honest. Even the implementers
shouldn't much care about cross-language benchmarks, except that when a
"similar"[1] language does significantly better on a particular benchmark,
it's often useful to wonder "how dey do dat?!" Typically the answer is "they
'cheat'" == "fail one of the properties we consider required", but sometimes
it's "ooh, that's cute, and I bet we could make Python work the same way" or
"urkh, we can't do *that* (yuck!) but we could FATten up Python with similar
effect". (Let me take this opportunity to say "Thank you, Victor!")

Of course in the case of a controlled experiment like "configure in Victor's
changes and run the benchmarks to make sure they're not detectably slower",
they're invaluable regression tests, and more or less valuable (ie, YMMV) as
measures of improvement to compare to costs they may impose in other features
or (even more fuzzy) in developer time.

Footnotes:
[1] Whatever that means
Re: [Python-Dev] FAT Python (lack of) performance
On 27.01.2016 11:59, Terry Reedy wrote:
> On 1/26/2016 12:35 PM, Sven R. Kunze wrote:
>> I completely agree with INADA.
> I am not sure you do.

I am sure I am. He wants to solve a problem the way that is natural to him as
a unique human being.

>> It's like saying, because a specific crossroad features a higher accident
>> rate, *people need to change their driving behavior*.
>> *No!* People won't change and it's not necessary either. The crossroad
>> needs to be changed to be safer.
> Safer crossroads tend to be slower unless one switches to alternate designs
> that eliminate crossing streams of traffic.

So Python can be safer AND faster ( = different design) if we try hard enough.

> Languages that don't have integers but use residue classes (with
> wraparound) or finite integer classes (with overflow) as a (faster)
> substitute have, in practice, lots of accidents (bugs) when used by
> non-experts. Guido noticed this, gave up on changing coder behavior, and
> put the expert behavior of checking for wraparound/overflow and switching
> to real integers (longs) into the language. (I forget when this was added.)

I am glad he did because it helps humans solve their problems in a natural
way without artificial boundaries. :)

> The purpose of the artificially low input to fib() is to hide and avoid the
> bugginess of most languages. The analogous trick with testing crossroads
> would be to artificially restrict the density of cars to mask the
> accident-proneness of a 'fast, consenting-adults' crossroads with no stop
> signs and no stop lights.

I am completely with you here, however I disagree about the suspected
hiding/avoiding mentality. You say:

Python -> *no problem with big integers but slow at small integers*
Other Language -> *faster but breaks at big integers*

Yes. That's it. We haven't solved the human side, however. A human AGAIN
would need to compromise on either speed or safety.

My point is: it would be insanely great if Python could be more like "*fast
AND no problem with big integers*". No compromise here (at least no
noticeable one). So, people could entirely *concentrate on their problem
domain* without ever worrying about such tiny little, nitty-gritty computer
science details. I love computer science but people of other domains don't
have the time nor the knowledge to decide properly. That's the reason why
they might decide by using some weird micro-benchmarks. Just humans.

>> Same goes for Python. If it's slow using the very same piece of code (even
>> superficially), you better make the language faster. Developers won't
>> change and they won't change their code either. Just not necessary.
> Instead of making people rewrite fib to dramatically increase speed, we
> added the lru_cache decorator to get most of the benefit without a rewrite.
> But Inada rejected this Python speedup. An ast optimizer could potentially
> do the same speedup without the explicit decorator. (No side-effects?
> Multiple recursive calls? Add a cache!)

Bingo! That's the spirit.

Why that decorator in the first place? Hey, I mean, if I ever want to write
some cryptic-looking source code with 3-letter abbreviations (LRU), I use
Assembler again. But I discovered and love Python and I never want to go back
when my problem domain does not require me to. So, when a machine can detect
such an optimization, hell, do it, please. It's more likely that I apply it
at the wrong function AND only in 10% of the correct cases: missing 90% and
introducing some wild errors. Again human stuff.

>> Btw. it would be a great feature for Python 3 to be faster than Python 2.
> We all agree on that. One way for this to happen is to add optimizers that
> would make Python 'cheat' on microbenchmarks.

Then, we are all set. :)

Best,
Sven
Re: [Python-Dev] Speeding up CPython 5-10%
On 2016-01-27 3:01 PM, Brett Cannon wrote:
> [..]
>> We can also optimize LOAD_METHOD. There are high chances that 'obj' in
>> 'obj.method()' will be of the same type every time we execute the code
>> object. So if we'd have an opcodes cache, LOAD_METHOD could then cache a
>> pointer to the resolved unbound method, a pointer to obj.__class__, and
>> tp_version_tag of obj.__class__. Then it would only need to check if the
>> cached object type is the same (and that it wasn't modified) and that
>> obj.__dict__ doesn't override 'method'. Long story short, this caching
>> really speeds up method calls on types implemented in C. list.append
>> becomes very fast, because list doesn't have a __dict__, so the check is
>> very cheap (with cache).
> What would it take to make this work with Python-defined classes?

It already works for Python-defined classes. But it's a bit more expensive
because you still have to check the object's __dict__. Still, there is a
very noticeable performance increase (see the results of benchmark runs).

> I guess that would require knowing the version of the instance's __dict__,
> the instance's __class__ version, the MRO, and where the method object was
> found in the MRO and any intermediary classes to know if it was suddenly
> shadowed? I think that's everything. :)

No, unfortunately we can't use the version of the instance's __dict__ as it
is very volatile. The current implementation of the opcode cache works
because types are much more stable. Remember, the cache is per *code
object*, so it should work for all times when the code object is executed.

    class F:
        def spam(self):
            self.ham()  # <- version of self.__dict__ is unstable,
                        #    so we'll end up invalidating the cache
                        #    too often

__class__ version, MRO changes etc. are covered by tp_version_tag, which I
use as one of the guards.

> Obviously that's a lot, but I wonder how many classes have a deep
> inheritance model vs. inheriting only from `object`? In that case you only
> have to check self.__dict__.ma_version, self.__class__,
> self.__class__.__dict__.ma_version, and self.__class__.__class__ == `type`.
> I guess another way to look at this is to get an idea of how complex do the
> checks have to get before caching something like this is not worth it
> (probably also depends on how often you mutate self.__dict__ thanks to
> mutating attributes, but you could in that instance just decide to always
> look at self.__dict__ for the method's key and then do the ma_version cache
> check for everything coming from the class).
>
> Otherwise we can consider looking at the caching strategies that Self
> helped pioneer (http://bibliography.selflanguage.org/) that all of the
> various JS engines lifted and consider caching all method lookups.

Yeah, hidden classes are great. But the infrastructure to support them
properly is huge. I think that to make them work you'll need a JIT -- to
trace, deoptimize, optimize, and do it all with a reasonable memory
footprint. My patch is much smaller and simpler, something we can
realistically tune and ship in 3.6.

>> A straightforward way to implement such a cache is simple, but consumes a
>> lot of memory, that would be just wasted, since we only need such a cache
>> for LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be creative about
>> the cache design. Here's what I came up with:
>>
>> 1. We add a few fields to the code object.
>>
>> 2. ceval will count how many times each code object is executed.
>>
>> 3. When the code object is executed over ~900 times, we mark it as "hot".
> What happens if you simply consider all code as hot? Is the overhead of
> building the mapping such that you really need this, or is this simply to
> avoid some memory/startup cost?

That's the first step for this patch. I think we need to profile several big
applications (I'll do it later for some of my code bases) and see how big
the memory impact is if we optimize everything. In any case, I expect it to
be noticeable (which may be acceptable), so we'll probably try to optimize
it.

>> We also create an 'unsigned char' array "MAPPING", with length set to
>> match the length of the code object. So we have a 1-to-1 mapping between
>> opcodes and the MAPPING array.
>>
>> 4. Next ~100 calls, while the code object is "hot", LOAD_GLOBAL and
>> LOAD_METHOD do "MAPPING[opcode_offset()]++".
>>
>> 5. After 1024 calls to the code object, ceval loop will iterate through
>> the MAPPING, counting all opcodes that were executed more than 50 times.
> Where did the "50 times" boundary come from? Was this measured somehow or
> did you just guess at a number?

If the number is too low, then you'll optimize code in branches that are
rarely executed. So I picked 50, because I only trace opcodes for 100 calls.
All of those numbers can be (should be?) changed, and I think we should
experiment with different heuristics.
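As a rough illustration of the warm-up heuristic described above (a plain-Python model, not the actual C implementation; the class and method names are invented):

    HOT_CALLS = 900       # calls before a code object is marked "hot"
    TRACE_CALLS = 100     # calls during which opcode hits are counted
    HOT_OPCODE_HITS = 50  # hits needed for an offset to get a cache entry

    class CodeObjectStats:
        def __init__(self, n_opcodes):
            self.calls = 0
            self.mapping = [0] * n_opcodes   # per-offset hit counters ("MAPPING")

        def on_call(self):
            self.calls += 1

        def is_tracing(self):
            # only count opcode hits while the code object is "hot"
            return HOT_CALLS < self.calls <= HOT_CALLS + TRACE_CALLS

        def on_opcode(self, offset):
            if self.is_tracing():
                self.mapping[offset] += 1

        def offsets_to_cache(self):
            # after the tracing window, cache only the busy offsets
            return [i for i, hits in enumerate(self.mapping)
                    if hits > HOT_OPCODE_HITS]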
[Python-Dev] Speeding up CPython 5-10%
Hi,

tl;dr The summary is that I have a patch that improves CPython performance up
to 5-10% on macro benchmarks. Benchmark results on Macbook Pro/Mac OS X,
desktop CPU/Linux, server CPU/Linux are available at [1]. There are no
slowdowns that I could reproduce consistently.

There are two different optimizations that yield this speedup:
LOAD_METHOD/CALL_METHOD opcodes and per-opcode cache in ceval loop.


LOAD_METHOD & CALL_METHOD
-------------------------

We had a lot of conversations with Victor about his PEP 509, and he sent me a
link to his amazing compilation of notes about CPython performance [2]. One
optimization that he pointed out to me was LOAD/CALL_METHOD opcodes, an idea
first originated in PyPy.

There is a patch that implements this optimization, it's tracked here: [3].
There are some low level details that I explained in the issue, but I'll go
over the high level design in this email as well.

Every time you access a method attribute on an object, a BoundMethod object
is created. It is a fairly expensive operation, despite a freelist of
BoundMethods (so that memory allocation is generally avoided). The idea is to
detect what looks like a method call in the compiler, and emit a pair of
specialized bytecodes for that.

So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have
LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.

LOAD_METHOD looks at the object on top of the stack, and checks if the name
resolves to a method or to a regular attribute. If it's a method, then we
push the unbound method object and the object to the stack. If it's an
attribute, we push the resolved attribute and NULL.

When CALL_METHOD looks at the stack it knows how to call the unbound method
properly (pushing the object as a first arg), or how to call a regular
callable.

This idea does make CPython faster around 2-4%. And it surely doesn't make it
slower. I think it's a safe bet to at least implement this optimization in
CPython 3.6.

So far, the patch only optimizes positional-only method calls. It's possible
to optimize all kind of calls, but this will necessitate 3 more opcodes
(explained in the issue). We'll need to do some careful benchmarking to see
if it's really needed.


Per-opcode cache in ceval
-------------------------

While reading PEP 509, I was thinking about how we can use dict->ma_version
in ceval to speed up globals lookups. One of the key assumptions (and this is
what makes JITs possible) is that real-life programs don't modify globals and
rebind builtins (often), and that most code paths operate on objects of the
same type.

In CPython, all pure Python functions have code objects. When you call a
function, ceval executes its code object in a frame. Frames contain
contextual information, including pointers to the globals and builtins dict.
The key observation here is that almost all code objects always have same
pointers to the globals (the module they were defined in) and to the
builtins. And it's not a good programming practice to mutate globals or
rebind builtins.

Let's look at this function:

    def spam():
        print(ham)

Here are its opcodes:

  2       0 LOAD_GLOBAL       0 (print)
          3 LOAD_GLOBAL       1 (ham)
          6 CALL_FUNCTION     1 (1 positional, 0 keyword pair)
          9 POP_TOP
         10 LOAD_CONST        0 (None)
         13 RETURN_VALUE

The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3. Let's look at the
first one, that loads the 'print' function from builtins. The opcode knows
the following bits of information:

- its offset (0),
- its argument (0 -> 'print'),
- its type (LOAD_GLOBAL).

And these bits of information will *never* change. So if this opcode could
resolve the 'print' name (from globals or builtins, likely the latter) and
save the pointer to it somewhere, along with globals->ma_version and
builtins->ma_version, it could, on its second call, just load this cached
info back, check that the globals and builtins dict haven't changed and push
the cached ref to the stack. That would save it from doing two dict lookups.

We can also optimize LOAD_METHOD. There are high chances that 'obj' in
'obj.method()' will be of the same type every time we execute the code
object. So if we'd have an opcodes cache, LOAD_METHOD could then cache a
pointer to the resolved unbound method, a pointer to obj.__class__, and
tp_version_tag of obj.__class__. Then it would only need to check if the
cached object type is the same (and that it wasn't modified) and that
obj.__dict__ doesn't override 'method'. Long story short, this caching really
speeds up method calls on types implemented in C. list.append becomes very
fast, because list doesn't have a __dict__, so the check is very cheap (with
cache).

A straightforward way to implement such a cache is simple, but consumes a lot
of memory, that would be just wasted, since we only need such a
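For reference, the bytecode listing earlier in this message can be reproduced with the dis module (a quick sketch; exact offsets and opcode names vary by CPython version):

    # Disassemble the example function to see the two LOAD_GLOBAL opcodes
    # that the proposed per-opcode cache targets (`ham` is intentionally
    # undefined; dis only disassembles, it does not run the function).
    import dis

    def spam():
        print(ham)

    dis.dis(spam)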
Re: [Python-Dev] FAT Python (lack of) performance
On 1/27/2016 9:12 AM, Stephen J. Turnbull wrote:
> Without that knowledge and effort, choosing a programming language based on
> microbenchmarks is like choosing a car based on the leg-length of the model
> sitting on the hood in the TV commercial.

+1 QOTD
Re: [Python-Dev] Speeding up CPython 5-10%
On 2016-01-27 3:10 PM, Damien George wrote:
> Hi Yuri,
>
> I think these are great ideas to speed up CPython. They are probably
> the simplest yet most effective ways to get performance improvements
> in the VM.
Thanks!
> MicroPython has had LOAD_METHOD/CALL_METHOD from the start (inspired
> by PyPy, and the main reason to have it is because you don't need to
> allocate on the heap when doing a simple method call). The specific
> opcodes are:
>
> LOAD_METHOD # same behaviour as you propose
> CALL_METHOD # for calls with positional and/or keyword args
> CALL_METHOD_VAR_KW # for calls with one or both of */**
>
> We also have LOAD_ATTR, CALL_FUNCTION and CALL_FUNCTION_VAR_KW for
> non-method calls.
Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize all
kind of method calls. However, I'm not sure how big the impact will be,
need to do more benchmarking.
BTW, how do you benchmark MicroPython?
> MicroPython also has dictionary lookup caching, but it's a bit
> different to your proposal. We do something much simpler: each opcode
> that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
> etc) includes a single byte in the opcode which is an offset-guess
> into the dictionary to find the desired element. Eg for LOAD_GLOBAL
> we have (pseudo code):
>
> CASE(LOAD_GLOBAL):
>     key = DECODE_KEY;
>     offset_guess = DECODE_BYTE;
>     if (global_dict[offset_guess].key == key) {
>         // found the element straight away
>     } else {
>         // not found, do a full lookup and save the offset
>         offset_guess = dict_lookup(global_dict, key);
>         UPDATE_BYTECODE(offset_guess);
>     }
>     PUSH(global_dict[offset_guess].elem);
>
> We have found that such caching gives a massive performance increase,
> on the order of 20%. The issue (for us) is that it increases bytecode
> size by a considerable amount, requires writeable bytecode, and can be
> non-deterministic in terms of lookup time. Those things are important
> in the embedded world, but not so much on the desktop.
That's a neat idea! You're right, it does require bytecode to become
writeable. I considered implementing a similar strategy, but this would
be a big change for CPython. So I decided to minimize the impact of the
patch and leave the opcodes untouched.
Thanks!
Yury
Re: [Python-Dev] FAT Python (lack of) performance
On Wed, 27 Jan 2016 at 10:12 Sven R. Kunze wrote:
> My point is: it would be insanely great if Python could be more like "*fast
> AND no problem with big integers*". No compromise here (at least no
> noticeable one).

And this is why this entire email thread has devolved into a conversation
that isn't really going anywhere. This whole thread has completely lost track
of the point of Victor's earlier email saying "I'm still working on my FAT
work and don't take any notice of the performance numbers until more stuff
gets finished". And this discussion of what benchmarks to care about is
rather pointless since the core team has an implicit understanding that any
performance improvement is taken into consideration in terms of balancing
complexity in CPython with how much improvement it gets us. So if someone
wants to speed up Fibonacci then they are welcome to try, but the solution
must be maintainable in proportion to the speed increase it buys Python as a
whole.

-Brett
Re: [Python-Dev] FAT Python (lack of) performance
On 27.01.2016 12:16, Nick Coghlan wrote:
> On 27 January 2016 at 03:35, Sven R. Kunze wrote:
>> I completely agree with INADA.
>>
>> It's like saying, because a specific crossroad features a higher accident
>> rate, people need to change their driving behavior.
>> No! People won't change and it's not necessary either. The crossroad needs
>> to be changed to be safer.
> Umm, no, that's not how this works

That's exactly how it works, Nick. INADA uses Python as I use crossroads each
day. Daily human business. If you read his post carefully, you can discover
that he just presented to you his perspective of the world. Moreover, I can
assure you that he's not alone. As usual with humans it's not about facts or
mathematically proven theorems but *perception*. It's more about marketing,
little important details (or unimportant ones depending on whom you ask) and
so on. Stating that he has a wrong perspective will not change anything.
Believing that Python is treated unfairly will not change that either.

Most people believe what they see. When they see a "FUNCTION CALL", it's the
same in every language. Why? Because it looks like a function call ( name +
parentheses ), it's called "function call" even if it's implemented
completely differently. It doesn't even matter if we use commas, 'def',
return types, etc. Because people understand the bigger concept, so that is
what people want to compare.

Average Joe doesn't care and does not understand. He looks at the benchmarks.
That is something he can understand. "While performance is not a matter when
choosing first language, slowest of three makes bad impression and people
feel less attractive about Python." << just like that

Not saying that INADA is an Average Joe, but I think you get the idea.

> - developers contribute to community driven projects for their *own*
> reasons. Nobody gets to tell them what to do unless they're paying them.

Bit off-topic.

> Micro-optimising a poor algorithm won't deliver macro level improvements
> because macro level code uses things like space-speed trade-offs to improve
> the algorithmic efficiency (as in the example of applying
> functools.lru_cache to a naive recursive fibonacci implementation).

I completely agree, Nick. :) But that isn't the issue here.

> Victor's work on the FAT optimiser is interesting because it offers
> opportunities to speed up even code that is already algorithmically
> efficient, as well as making CPython a better platform for experimenting
> with those kinds of changes.

Exactly. :)

> More generally though, much larger opportunities for improvement lie in
> persuading people to *stop writing code*, and instead spending more of
> their time on finding and assembling code other people have *already
> written* into solutions to interesting problems. *That's* the kind of
> improvement that turns enormously complex problems like facial recognition
> into 25 line Python scripts:
> https://realpython.com/blog/python/face-recognition-with-python/

Interesting post. :) Thanks.

Btw. I completely agree with you on the "improve programming education", but
not everybody can do it; and not everybody wants to learn and to practice it
properly.

Best,
Sven
Re: [Python-Dev] FAT Python (lack of) performance
On 27.01.2016 19:33, Brett Cannon wrote:
> And this is why this entire email thread has devolved into a conversation
> that isn't really going anywhere. This whole thread has completely lost
> track of the point of Victor's earlier email saying "I'm still working on
> my FAT work and don't take any notice of the performance numbers until more
> stuff gets finished". And this discussion of what benchmarks to care about
> is rather pointless since the core team has an implicit understanding that
> any performance improvement is taken into consideration in terms of
> balancing complexity in CPython with how much improvement it gets us. So if
> someone wants to speed up Fibonacci then they are welcome to try, but the
> solution must be maintainable in proportion to the speed increase it buys
> Python as a whole.

+1
Re: [Python-Dev] Speeding up CPython 5-10%
As Brett suggested, I've just run the benchmarks suite with memory tracking
on. The results are here:

https://gist.github.com/1st1/1851afb2773526fd7c58

Looks like the memory increase is around 1%. One synthetic micro-benchmark,
unpack_sequence, which contains hundreds of lines that load a global variable
and do nothing else, consumes 5%.

Yury
Re: [Python-Dev] Speeding up CPython 5-10%
Hi Yuri,
I think these are great ideas to speed up CPython. They are probably
the simplest yet most effective ways to get performance improvements
in the VM.
MicroPython has had LOAD_METHOD/CALL_METHOD from the start (inspired
by PyPy, and the main reason to have it is because you don't need to
allocate on the heap when doing a simple method call). The specific
opcodes are:
LOAD_METHOD # same behaviour as you propose
CALL_METHOD # for calls with positional and/or keyword args
CALL_METHOD_VAR_KW # for calls with one or both of */**
We also have LOAD_ATTR, CALL_FUNCTION and CALL_FUNCTION_VAR_KW for
non-method calls.
MicroPython also has dictionary lookup caching, but it's a bit
different to your proposal. We do something much simpler: each opcode
that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
etc) includes a single byte in the opcode which is an offset-guess
into the dictionary to find the desired element. Eg for LOAD_GLOBAL
we have (pseudo code):
CASE(LOAD_GLOBAL):
    key = DECODE_KEY;
    offset_guess = DECODE_BYTE;
    if (global_dict[offset_guess].key == key) {
        // found the element straight away
    } else {
        // not found, do a full lookup and save the offset
        offset_guess = dict_lookup(global_dict, key);
        UPDATE_BYTECODE(offset_guess);
    }
    PUSH(global_dict[offset_guess].elem);
We have found that such caching gives a massive performance increase,
on the order of 20%. The issue (for us) is that it increases bytecode
size by a considerable amount, requires writeable bytecode, and can be
non-deterministic in terms of lookup time. Those things are important
in the embedded world, but not so much on the desktop.
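The offset-guess idea can be modelled in plain Python as a rough illustration (this is a sketch, not MicroPython internals; the class name is invented):

    # Each call site remembers the slot where the key was last found and only
    # falls back to a full lookup when the guess misses.
    class GuessCachedNamespace:
        def __init__(self, pairs):
            self._slots = list(pairs)        # stand-in for the dict's storage

        def lookup(self, key, guess):
            if 0 <= guess < len(self._slots) and self._slots[guess][0] == key:
                return self._slots[guess][1], guess       # hit: one comparison
            for i, (k, v) in enumerate(self._slots):      # miss: full lookup
                if k == key:
                    return v, i                           # new guess to store back
            raise KeyError(key)

    ns = GuessCachedNamespace([("print", print), ("len", len)])
    value, guess = ns.lookup("len", 0)      # miss: full lookup, guess becomes 1
    value, guess = ns.lookup("len", guess)  # hit via the cached offset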
Good luck with it!
Regards,
Damien.
Re: [Python-Dev] Speeding up CPython 5-10%
BTW, this optimization also makes some old optimization tricks obsolete.

1. No need to write 'def func(len=len)'. Globals lookups will be fast.

2. No need to save bound methods:

    obj = []
    obj_append = obj.append

    for _ in range(10**6):
        obj_append(something)

This hand-optimized code would only be marginally faster, because of
LOAD_METHOD and how it's cached.

Yury
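A quick way to measure the second trick is a timeit comparison along these lines (an illustrative sketch; numbers vary by machine and Python version):

    import timeit

    plain  = timeit.timeit("obj.append(1)", setup="obj = []", number=10**6)
    cached = timeit.timeit("append(1)", setup="obj = []; append = obj.append",
                           number=10**6)
    print("obj.append(1):    %.3f s" % plain)
    print("cached append(1): %.3f s" % cached)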
Re: [Python-Dev] Speeding up CPython 5-10%
Hi Yury,
(Sorry for misspelling your name previously!)
> Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize all
> kind of method calls. However, I'm not sure how big the impact will be,
> need to do more benchmarking.
I never did such fine grained analysis with MicroPython. I don't
think there are many uses of * and ** that it'd be worth it, but
definitely there are lots of uses of plain keywords. Also, you'd want
to consider how simple/complex it is to treat all these different
opcodes in the compiler. For us, it's simpler to treat everything the
same. Otherwise your LOAD_METHOD part of the compiler will need to
peek deep into the AST to see what kind of call it is.
> BTW, how do you benchmark MicroPython?
Haha, good question! Well, we use Pystone 1.2 (unmodified) to do
basic benchmarking, and find it to be quite good. We track our code
live at:
http://micropython.org/resources/code-dashboard/
You can see there the red line, which is the Pystone result. There
was a big jump around Jan 2015 which is when we introduced opcode
dictionary caching. And since then it's been very gradually
increasing due to small optimisations here and there.
Pystone is actually a great benchmark for embedded systems because it
gives very reliable results there (almost zero variation across runs)
and if we can squeeze 5 more Pystones out with some change then we
know that it's a good optimisation (for efficiency at least).
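For reference (not from the original mail): Pystone 1.2 shipped as Lib/test/pystone.py in CPython of this era (it has since been removed), and MicroPython carries its own copy. Where the module is importable, running it looks roughly like this:

    from test import pystone   # location varies; MicroPython uses its own copy

    benchtime, stones = pystone.pystones(loops=50000)
    print("Pystone(1.2) time for 50000 passes = %.3f s" % benchtime)
    print("This machine benchmarks at %.0f pystones/second" % stones)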
For us, low RAM usage and small code size are the most important
factors, and we track those meticulously. But in fact, smaller code
size quite often correlates with more efficient code because there's
less to execute and it fits in the CPU cache (at least on the
desktop).
We do have some other benchmarks, but they are highly specialised for
us. For example, how fast can you bit bang a GPIO pin using pure
Python code. Currently we get around 200kHz on a 168MHz MCU, which
shows that pure (Micro)Python code is about 100 times slower than C.
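A minimal sketch of that kind of bit-bang loop, assuming MicroPython's machine.Pin API; the pin name "X1" is board-specific and purely illustrative:

    import machine

    pin = machine.Pin("X1", machine.Pin.OUT)   # "X1" is a placeholder pin name

    def bitbang(n):
        # Each iteration produces one full high/low cycle on the pin.
        for _ in range(n):
            pin.value(1)
            pin.value(0)

    bitbang(100000)   # on a 168MHz MCU this toggles at roughly 200kHz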
> That's a neat idea! You're right, it does require bytecode to become
> writeable. I considered implementing a similar strategy, but this would
> be a big change for CPython. So I decided to minimize the impact of the
> patch and leave the opcodes untouched.
I think you need to consider "big" changes, especially ones like this
that can have a great (and good) impact. But really, this is a
behind-the-scenes change that *should not* affect end users, and so
you should not have any second thoughts about doing it. One problem I
see with CPython is that it exposes way too much to the user (both
Python programmer and C extension writer) and this hurts both language
evolution (you constantly need to provide backwards compatibility) and
ability to optimise.
Cheers,
Damien.
Re: [Python-Dev] Speeding up CPython 5-10%
Damien,
On 2016-01-27 4:20 PM, Damien George wrote:
Hi Yury,
(Sorry for misspelling your name previously!)
NP. As long as the first letter is "y" I don't care ;)
Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize all
kind of method calls. However, I'm not sure how big the impact will be,
need to do more benchmarking.
I never did such fine grained analysis with MicroPython. I don't
think there are many uses of * and ** that it'd be worth it, but
definitely there are lots of uses of plain keywords. Also, you'd want
to consider how simple/complex it is to treat all these different
opcodes in the compiler. For us, it's simpler to treat everything the
same. Otherwise your LOAD_METHOD part of the compiler will need to
peek deep into the AST to see what kind of call it is.
BTW, how do you benchmark MicroPython?
Haha, good question! Well, we use Pystone 1.2 (unmodified) to do
basic benchmarking, and find it to be quite good. We track our code
live at:
http://micropython.org/resources/code-dashboard/
The dashboard is cool!
An off-topic question: have you ever tried hg.python.org/benchmarks
to compare MicroPython vs CPython? I'm curious if MicroPython
is faster -- in that case we'll try to copy some optimization
ideas.
You can see there the red line, which is the Pystone result. There
was a big jump around Jan 2015 which is when we introduced opcode
dictionary caching. And since then it's been very gradually
increasing due to small optimisations here and there.
Do you use opcode dictionary caching only for LOAD_GLOBAL-like
opcodes? Do you have an equivalent of LOAD_FAST, or do you use
dicts to store local variables?
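(For readers unfamiliar with the distinction being asked about, a quick dis session -- added here for illustration -- shows it: CPython compiles local variable access to LOAD_FAST, an index into a per-frame array, while globals go through LOAD_GLOBAL, a dict lookup.)

    import dis

    x_global = 1

    def f():
        x_local = 2
        return x_local + x_global

    # Expect LOAD_FAST for x_local (array slot) and LOAD_GLOBAL for
    # x_global (namespace dict lookup).
    dis.dis(f)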
That's a neat idea! You're right, it does require bytecode to become
writeable. I considered implementing a similar strategy, but this would
be a big change for CPython. So I decided to minimize the impact of the
patch and leave the opcodes untouched.
I think you need to consider "big" changes, especially ones like this
that can have a great (and good) impact. But really, this is a
behind-the-scenes change that *should not* affect end users, and so
you should not have any second thoughts about doing it.
If we change the opcode size, it will probably affect libraries
that compose or modify code objects. Modules like "dis" will
also need to be updated. And that's probably just the tip of the
iceberg.
We can still implement your approach if we add a separate
private 'unsigned char' array to each code object, so that
LOAD_GLOBAL can store the key offsets. It should be a bit
faster than my current patch, since it has one less level
of indirection. But this way we lose the ability to
optimize LOAD_METHOD, simply because it requires more memory
for its cache. In any case, I'll experiment!
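A conceptual model of that "separate private array" idea (my sketch, not the actual patch): the bytecode stays immutable, and each cacheable instruction gets a slot in a small per-code-object bytearray holding its 8-bit offset guess.

    class CachedCode:
        def __init__(self, co_code, n_cache_slots):
            self.co_code = bytes(co_code)          # bytecode stays read-only
            self.cache = bytearray(n_cache_slots)  # one 8-bit guess per cacheable opcode

        def load_guess(self, slot):
            return self.cache[slot]

        def store_guess(self, slot, offset):
            if offset <= 0xFF:                     # only guesses that fit in a byte
                self.cache[slot] = offset

    code = CachedCode(b"\x74\x00\x53\x00", n_cache_slots=1)
    code.store_guess(0, 5)                         # remember where the key was found
    assert code.load_guess(0) == 5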
One problem I
see with CPython is that it exposes way too much to the user (both
Python programmer and C extension writer) and this hurts both language
evolution (you constantly need to provide backwards compatibility) and
ability to optimise.
Right. Even though CPython explicitly states that opcodes
and code objects might change in the future, we still have to
be careful about changing them.
Yury
Re: [Python-Dev] Speeding up CPython 5-10%
On 1/27/2016 12:37 PM, Yury Selivanov wrote:
MicroPython also has dictionary lookup caching, but it's a bit
different to your proposal. We do something much simpler: each opcode
that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
etc) includes a single byte in the opcode which is an offset-guess
into the dictionary to find the desired element. Eg for LOAD_GLOBAL
we have (pseudo code):
CASE(LOAD_GLOBAL):
    key = DECODE_KEY;
    offset_guess = DECODE_BYTE;
    if (global_dict[offset_guess].key == key) {
        // found the element straight away
    } else {
        // not found, do a full lookup and save the offset
        offset_guess = dict_lookup(global_dict, key);
        UPDATE_BYTECODE(offset_guess);
    }
    PUSH(global_dict[offset_guess].elem);
We have found that such caching gives a massive performance increase,
on the order of 20%. The issue (for us) is that it increases bytecode
size by a considerable amount, requires writeable bytecode, and can be
non-deterministic in terms of lookup time. Those things are important
in the embedded world, but not so much on the desktop.
That's a neat idea! You're right, it does require bytecode to become
writeable.
Would it?
Remember "fixup lists"? Maybe they still exist for loading function
addresses from one DLL into the code of another at load time?
So the equivalent for bytecode requires a static table of offset_guess,
and the offsets into that table are allocated by the byte-code loader at
byte-code load time, and the byte-code is "fixed up" at load time to use
the correct offsets into the offset_guess table. It takes one more
indirection to find the guess, but if the result is a 20% improvement,
maybe you'd still get 19%...
Re: [Python-Dev] Speeding up CPython 5-10%
On 2016-01-27 3:46 PM, Glenn Linderman wrote:
On 1/27/2016 12:37 PM, Yury Selivanov wrote:
MicroPython also has dictionary lookup caching, but it's a bit
different to your proposal. We do something much simpler: each opcode
that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
etc) includes a single byte in the opcode which is an offset-guess
into the dictionary to find the desired element. Eg for LOAD_GLOBAL
we have (pseudo code):
CASE(LOAD_GLOBAL):
    key = DECODE_KEY;
    offset_guess = DECODE_BYTE;
    if (global_dict[offset_guess].key == key) {
        // found the element straight away
    } else {
        // not found, do a full lookup and save the offset
        offset_guess = dict_lookup(global_dict, key);
        UPDATE_BYTECODE(offset_guess);
    }
    PUSH(global_dict[offset_guess].elem);
We have found that such caching gives a massive performance increase,
on the order of 20%. The issue (for us) is that it increases bytecode
size by a considerable amount, requires writeable bytecode, and can be
non-deterministic in terms of lookup time. Those things are important
in the embedded world, but not so much on the desktop.
That's a neat idea! You're right, it does require bytecode to become
writeable.
Would it?
Remember "fixup lists"? Maybe they still exist for loading function
addresses from one DLL into the code of another at load time?
So the equivalent for bytecode requires a static table of
offset_guess, and the offsets into that table are allocated by the
byte-code loader at byte-code load time, and the byte-code is "fixed
up" at load time to use the correct offsets into the offset_guess
table. It takes one more indirection to find the guess, but if the
result is a 20% improvement, maybe you'd still get 19%...
Right, in my current patch I have an offset table per code object.
Essentially, this offset table adds 8 bits per opcode. It also means
that only the first 255 LOAD_GLOBAL/LOAD_METHOD opcodes *per code object*
are optimized (because the offset table can only store 8-bit offsets),
which is usually enough (I think you would need a function of more than
500 lines of code to reach that limit).
Yury
Re: [Python-Dev] FAT Python (lack of) performance
Brett Cannon writes:

> the core team has an implicit understanding that any performance
> improvement is taken into consideration in terms of balancing
> complexity in CPython with how much improvement it gets us.

EIBTI. I can shut up now. Thank you!
Re: [Python-Dev] FAT Python (lack of) performance
On 28 January 2016 at 04:40, Sven R. Kunze wrote:
> On 27.01.2016 12:16, Nick Coghlan wrote:
>> Umm, no, that's not how this works
>
> That's exactly how it works, Nick.
>
> INADA uses Python as I use crossroads each day. Daily human business.
>
> If you read his post carefully, you can discover that he just presented to
> you his perspective of the world. Moreover, I can assure you that he's not
> alone. As usual with humans it's not about facts or mathematically proven
> theorems but perception. It's more about marketing, little important details
> (or unimportant ones depending on whom you ask) and so on. Stating that he
> has a wrong perspective will not change anything.

The only part I disagree with is requesting that *other people* care about
marketing numbers if that's not something they're already inclined to care
about. I'm not in any way disputing that folks make decisions based on
inappropriate metrics, nor that it bothers some folks that there are dozens
of perfectly viable programming languages people may choose to use instead
of Python.

The fact remains that contributors to open source projects work on what they
want to work on or on what their employers pay them to work on (for a lucky
few, those are the same thing), so telling other contributors that they're
working on the "wrong thing" because their priorities differ from our
priorities is almost always going to be irritating rather than helpful.

Cheers,
Nick.

--
Nick Coghlan | [email protected] | Brisbane, Australia
