On Mon, 18 Nov 2013 21:04:41 +1100, Chris Angelico wrote:

> On Mon, Nov 18, 2013 at 8:44 PM, <wxjmfa...@gmail.com> wrote:
>> string
>> Satisfied Interfaces: Category, Cloneable<List<Element>>,
>> Collection<Element>, Comparable<String>,
>> Correspondence<Integer,Element>, Iterable<Element,Null>,
>> List<Character>, Ranged<Integer,String>, Summable<String>
>>
>> A string of characters. Each character in the string is a 32-bit
>> Unicode character. The internal UTF-16 encoding is hidden from
>> clients. A string is a Category of its Characters, and of its
>> substrings:
>
> I'm trying to figure this out. Reading the docs hasn't answered this.
> If each character in a string is a 32-bit Unicode character, and (as
> can be seen in the examples) string indexing and slicing are
> supported, then does string indexing mean counting from the beginning
> to see if there were any surrogate pairs?
I can't figure out what that means, since it contradicts itself. First it
says *every* character is 32 bits (presumably UTF-32), then it says that
internally it uses UTF-16. At least one of these statements is wrong.
(They could both be wrong, but they can't both be right.)

Unless they have done something *really* clever, the language designers
lose a hundred million points for screwing up text strings. There is
*absolutely no excuse* for a new, modern language with no backwards
compatibility concerns to pick one of these three bad choices:

* choose UTF-16 or UTF-8, and have O(n) primitive string operations
  (like Haskell and, apparently, Ceylon; see the sketch below);

* choose UTF-16 without support for the supplementary planes (which
  makes it virtually UCS-2), like JavaScript;

* choose UTF-32, and use two or four times as much memory as needed.


--
Steven
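
To make the first of those costs concrete, here is a minimal Python sketch
(not part of the original post) of what code-point indexing has to do when
a string is stored as raw UTF-16 code units: the index can't be turned into
an offset directly, so the code scans from the start and steps over two code
units whenever it meets a surrogate pair, which is what makes the operation
O(n). The helper name `codepoint_at` is invented purely for illustration.

    def codepoint_at(utf16_units, index):
        """Return the code point at position `index`, counting code points
        (not code units) from the start of a list of UTF-16 code units."""
        i = 0       # position in the code-unit list
        count = 0   # number of code points stepped over so far
        while i < len(utf16_units):
            unit = utf16_units[i]
            if 0xD800 <= unit <= 0xDBFF:
                # High surrogate: combine it with the following low surrogate
                # to recover a supplementary-plane code point.
                low = utf16_units[i + 1]
                cp = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                width = 2
            else:
                cp = unit
                width = 1
            if count == index:
                return cp
            count += 1
            i += width
        raise IndexError("code point index out of range")


    raw = "a\U0001F600b".encode("utf-16-le")   # 'a', U+1F600 (astral plane), 'b'
    units = [int.from_bytes(raw[j:j + 2], "little")
             for j in range(0, len(raw), 2)]
    print([hex(u) for u in units])             # ['0x61', '0xd83d', '0xde00', '0x62']
    print(hex(codepoint_at(units, 1)))         # 0x1f600, reached only by scanning past 'a'

The memory cost of the UTF-32 option is just as easy to see: for pure-ASCII
text, len(s.encode("utf-32-le")) comes out four times larger than
len(s.encode("utf-8")).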