On Thu, Mar 14, 2013 at 4:42 AM, Thomas 'PointedEars' Lahn
<pointede...@web.de> wrote:
> Chris Angelico wrote:
>
>> On Wed, Mar 13, 2013 at 9:11 PM, rusi <rustompm...@gmail.com> wrote:
>>> Uhhh..
>>> Making the subject line useful for all readers
>>
>> I should have read this one before replying in the other thread.
>>
>> jmf, I'd like to see evidence that there has been a performance
>> regression compared against a wide build of Python 3.2. You still have
>> never answered this fundamental, that the narrow builds of Python are
>> *BUGGY* in the same way that JavaScript/ECMAScript is.
>
> Interesting. From my work I was under the impression that I knew ECMAScript
> and its implementations fairly well, yet I have never heard of this before.
>
> What do you mean by “narrow build” and “wide build” and what exactly is the
> bug “narrow builds” of Python 3.2 have in common with JavaScript/ECMAScript?
> To which implementation of ECMAScript are you referring – or are you
> referring to the Specification as such?

The ECMAScript spec defines a string as a sequence of 16-bit code
units, which for actual text means UTF-16. Python versions up to 3.2
came in two varieties: narrow (16-bit code units, so anything outside
the BMP becomes a surrogate pair), which included (I believe) the
Windows builds available on python.org, and wide (32-bit code units,
one per code point), which was (again, I think) the default Linux
config. The problem predates Python 3 and its default string being
Unicode - the Py2 unicode type has the same issue:

    Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)] on win32
    >>> u"\U00012345"
    u'\U00012345'
    >>> len(_)
    2

    Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39)
    [GCC 4.4.5] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u"\U00012345"
    u'\U00012345'
    >>> len(_)
    1

The first is the Python msi installer on Windows, and the second is
the default system Python from an Ubuntu 10.10. The exact same code
does different things on different platforms, and on the Windows
(narrow) build, it's possible to split a surrogate pair:

    >>> u"\U00012345"[0]
    u'\ud808'
    >>> u"\U00012345"[1]
    u'\udf45'

You can see the same thing in JavaScript too. Here's a little demo I
just knocked together:

    <script>
    function foo()
    {
        var txt=document.getElementById("in").value;
        var msg="";
        // One output line per string index: decimal, then hex code unit
        for (var i=0;i<txt.length;++i)
            msg+="["+i+"]: "+txt.charCodeAt(i)+" "+txt.charCodeAt(i).toString(16)+"\n";
        document.getElementById("out").value=msg;
    }
    </script>
    <input id=in><input type=button onclick="foo()" value="Show"><br>
    <textarea id=out rows=25 cols=80></textarea>

Give it an ASCII string and you'll see, as expected, one index (based
on string indexing or charCodeAt, same thing) for each character. Same
if it's all BMP. But put an astral character in and it takes up two
indices, and you'll see 00.00.d8.00/24 (oh wait, CIDR notation doesn't
work in Unicode) come up.
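
For anyone wondering where d808 and df45 come from, here's a rough
sketch of the UTF-16 surrogate arithmetic. It assumes Python 3.3 or
later, and the helper names to_surrogates and utf16_length are made up
for this illustration, not anything from the standard library:

```python
def to_surrogates(cp):
    """Split an astral code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not an astral code point: %#x" % cp)
    offset = cp - 0x10000               # 20-bit offset into the astral planes
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low (trail) surrogate
    return high, low


def utf16_length(s):
    """Count UTF-16 code units, i.e. what a narrow build's len() counted."""
    return len(s.encode("utf-16-le")) // 2


high, low = to_surrogates(0x12345)
print("%04x %04x" % (high, low))        # d808 df45
print(len("\U00012345"))                # 1 on Python 3.3+
print(utf16_length("\U00012345"))       # 2
```

The pair it prints matches the two halves the narrow build hands back
above, and the code unit count is the length that a narrow build (or
JavaScript's .length) reports for the same character.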

I raised this issue on the Google V8 list and on the ECMAScript list
es-disc...@mozilla.org, and was basically told that since JavaScript
has been buggy for so long, there's no chance of ever making it
bug-free:

https://mail.mozilla.org/pipermail/es-discuss/2012-December/027384.html

Fortunately for Python, there are version numbers, and policies that
permit bugs to actually get fixed. (Which is why, for instance, Debian
Squeeze still ships Python 2.6 rather than upgrading to 2.7 - in case
some script is broken by that change. Can't do that with web
browsers.)

As of Python 3.3, all Pythons function the same way: a string is
semantically a "wide build" (UTF-32) string, so indexing and len()
work in code points on every platform, but with a memory usage
optimization (PEP 393). That's how it needs to be.

ChrisA