Chris Angelico wrote:

> On Sat, Mar 16, 2013 at 1:44 PM, Thomas 'PointedEars' Lahn
> <pointede...@web.de> wrote:
>> Chris Angelico wrote:
>>> The ECMAScript spec says that strings are stored and represented in
>>> UTF-16.
>>
>> No, it does not (which Edition?). It says in Edition 5.1:
>
> Okay, I was sloppy in my terminology. A language will seldom, if ever,
> specify the actual storage. But it does specify a representation (to
> the script) of UTF-16,

No, it does not.

> and I seriously cannot imagine any reason for an implementation to store a
> string in any other way, given that string indexing is specifically based
> on UTF-16:

Non sequitur.

>> | The length of a String is the number of elements (i.e., 16-bit values)
>> | within it.
>> |
>> | […]
>> | When a String contains actual textual data, each element is considered
>> | to be a single UTF-16 code unit. Whether or not this is the actual
>> | storage format of a String, the characters within a String are numbered
>> | by their initial code unit element position as though they were
>> | represented using UTF-16.
>
> So, yes, it could be stored in some other way, but in terms of what I
> was saying (comparing against Python 3.2 and 3.3), it's still a
> specification that doesn't allow for the change that Python did.

Yes, it does. You must not have been reading or understanding what I quoted.

>>> You can see the same thing in Javascript too. Here's a little demo I
>>> just knocked together:
>>>
>>> <script>
>>> function foo()
>>> {
>>> var txt=document.getElementById("in").value;
>>> var msg="";
>>> for (var i=0;i<txt.length;++i) msg+="["+i+"]: "+txt.charCodeAt(i)+" "+txt.charCodeAt(i).toString(16)+"\n";
>>> document.getElementById("out").value=msg;
>>> }
>>> </script>
>>> <input id=in><input type=button onclick="foo()"
>>> value="Show"><br><textarea id=out rows=25 cols=80></textarea>
>>
>> What an awful piece of code.
>
> Ehh, it's designed to be short, not beautiful. Got any serious
> criticisms of it?

Better not here, lest another “moron” complain.

> It demonstrates what I'm talking about without being a page of code.

It could have been written readably and efficiently without that.

>>> Give it an ASCII string
>>
>> You mean a string of Unicode characters that can also be represented with
>> the US-ASCII encoding. There are no "ASCII strings" in conforming
>> ECMAScript implementations.
>> And a string of Unicode characters with code
>> points within the BMP will suffice already.
>
> You can get a string of ASCII characters and paste them into the entry
> field.

Not likely these days, no.

> They'll be turned into Unicode characters before the script
> sees them.

They will have become Windows-1252 or even Unicode characters long before.

> But yes, okay, my terminology was a bit sloppy.

It still is.

>>> and you'll see, as expected, one index (based on string indexing or
>>> charCodeAt, same thing) for each character. Same if it's all BMP. But
>>> put an astral character in and you'll see 00.00.d8.00/24 (oh wait, CIDR
>>> notation doesn't work in Unicode) come up. I raised this issue on the
>>> Google V8 list and on the ECMAScript list es-disc...@mozilla.org, and
>>> was basically told that since JavaScript has been buggy for so long,
>>> there's no chance of ever making it bug-free:
>>>
>>> https://mail.mozilla.org/pipermail/es-discuss/2012-December/027384.html
>>
>> You misunderstand, and I am not buying Rick's answer. The problem is not
>> that String values are defined as units of 16 bits. The problem is that
>> the length of a primitive String value in ECMAScript, and the position of
>> a character, is defined in terms of 16-bit units instead of characters.
>> There is no bug, because ECMAScript specifies that Unicode characters
>> beyond the Basic Multilingual Plane (BMP) need not be supported:
>
> So what you're saying is that an ES implementation is allowed to be
> even buggier than I described, and that's somehow a justification?

No, I am saying that you have no clue what you are talking about.

>> But yes, there should be native support for Unicode characters with code
>> points beyond the BMP, and evidently that does _not_ require a second
>> language; just a few tweaks to the algorithms.
>
> No, it requires either a complete change of the language, […]

No, it does not. Get yourself informed.

>>> Can't do that with web browsers.)
>>
>> Yes, you could. It has been done before.
>
> Not easily.

You still have no clue what you are talking about. Get yourself informed at
least about the (deprecated/obsolete) “language” and the (standards-compliant)
“type” attribute of SCRIPT/“script” elements before you post on this again.

-- 
PointedEars
-- 
http://mail.python.org/mailman/listinfo/python-list
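
The distinction argued over in this thread, 16-bit code units versus
characters, can be illustrated with a minimal ES5-style sketch. The helper
name codePoints and the sample string are assumptions made for illustration;
neither comes from the thread or from the specification. The helper combines
surrogate pairs by hand, so an astral character is counted once.

```javascript
// Hypothetical helper (not from the thread): decode a string into code points
// by combining UTF-16 surrogate pairs by hand.
function codePoints(str) {
  var result = [];
  for (var i = 0; i < str.length; ++i) {
    var unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate encodes one astral character.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        result.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        ++i; // skip the low surrogate that was just consumed
        continue;
      }
    }
    // BMP character or unpaired surrogate: keep the code unit as-is.
    result.push(unit);
  }
  return result;
}

var sample = "a\uD83D\uDE00";          // "a" followed by U+1F600 (astral)
var units = sample.length;             // counts 16-bit code units
var chars = codePoints(sample).length; // counts code points
```

With this sample, units is 3 and chars is 2: the String length counts both
halves of the surrogate pair, while the helper reports U+1F600 as one entry.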