On Sun, 7 Jun 2015 06:21 pm, Thomas 'PointedEars' Lahn wrote:

> Ned Batchelder wrote:
>
>> On Saturday, May 23, 2015 at 9:01:29 AM UTC-4, Steven D'Aprano wrote:
>>> On Sat, 23 May 2015 10:33 pm, Thomas 'PointedEars' Lahn wrote:
>>> > If only characters were represented as sequences UTF-16 code units in
>>> > ECMAScript implementations like JavaScript, there would not be a
>>> > problem beyond the BMP;
>>>
>>> Are you being sarcastic?
>>
>> IIUC, Thomas' point is that *characters* should be sequences of
>> codepoints, not that *strings* should be.
>
> No, my point is that one character should be a sequence of code _units_
> (for a code point value).

I don't understand this sentence. "Code point value" doesn't appear to be
meaningful. A "code point" is a value in the Unicode codespace, informally
"a character" (but see below); code points can take on values in the range
0 to 1114111, usually written in hex as U+0000 to U+10FFFF. "Code value"
is an obsolete term for code unit, that is, the smallest chunk of memory
used to represent a code point. For example, UTF-8 uses 8-bit code units
and UTF-32 uses 32-bit code units. But "code point value", I'm not sure
what you mean by that.

Consequently I have no idea what you think a character should be. Is
"Hello World" a character? How about "Æ" or "û"?

The term "character" is problematic, because what counts as a character
depends on where you are and how the string is normalised. For example:

"ij" could be two characters, the letters i followed by j, or one, the
25th letter of the Dutch language [and not even the Dutch agree on this];

conversely, "ij" could be a single character, or a ligature of two
characters.

"Ḗ" (U+1E16 LATIN CAPITAL LETTER E WITH MACRON AND ACUTE) could be
considered one character, or three ('E\u0304\u0301'), depending on whether
it is normalised or not.

So I'm afraid I do not understand your sentence.

Code point: http://www.unicode.org/glossary/#code_point
Code unit: http://www.unicode.org/glossary/#code_unit
Code value: http://www.unicode.org/glossary/#code_value

See also http://unicode.org/faq/char_combmark.html

> But in ECMAScript implementations (so far), a *code point value* equals
> a character, and that is a problem in ECMAScript because there the value
> range is limited to what can be encoded in 16 bit. The problem starts
> beyond the BMP where 16 bit are no longer sufficient for a code sequence
> and code point value, and code sequence and code point value are no
> longer equal.

This is no clearer. I *think* what you are trying to say is that
ECMAScript assumes that one code point is always represented by a single
code unit.
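
To make the code point versus code unit distinction above concrete, here
is a minimal sketch, assuming Python 3.3 or later (where len() counts code
points). The helper count_code_units() and the UNIT_WIDTH table are
hypothetical names invented for this illustration, not part of any
library; only str.encode() and unicodedata.normalize() are real calls.

```python
import unicodedata

# Bytes per code unit for each encoding form. The "-le" variants are used
# so that no byte order mark gets counted as an extra code unit.
UNIT_WIDTH = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def count_code_units(text, encoding):
    """Return how many code units `encoding` needs to represent `text`."""
    return len(text.encode(encoding)) // UNIT_WIDTH[encoding]

bmp = "\u00c6"          # Æ, a code point inside the BMP
astral = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside the BMP

for s in (bmp, astral):
    counts = {enc: count_code_units(s, enc) for enc in UNIT_WIDTH}
    print("U+%04X" % ord(s), "code points:", len(s), "code units:", counts)

# Normalisation changes the number of code points for "the same" character.
composed = "\u1e16"     # LATIN CAPITAL LETTER E WITH MACRON AND ACUTE
decomposed = unicodedata.normalize("NFD", composed)
print(len(composed), len(decomposed))                         # 1 3
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

Both sample strings are one code point long, but the one outside the BMP
needs two UTF-16 code units (a surrogate pair), which is exactly where a
"one code unit is one character" assumption breaks down.
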
So a sequence of code points ABCD will be correctly interpreted as four
"characters" so long as each of those code points is in the BMP (i.e.
between U+0000 and U+FFFF inclusive), but *not* if they are from one of
the supplementary planes.

This is the same problem that older Python "narrow builds" suffered from.
The solution in Python was either to use a wide build (each code point is
represented by a single UTF-32 code unit, that is, four bytes) or to
upgrade to Python 3.3, which uses a compressed coding scheme where strings
are represented by either 1 byte per code point, 2 bytes per code point,
or 4 bytes per code point, whichever is the minimum needed for that
particular string.

My opinion is that a programming language like Python or ECMAScript should
operate on *code points*. If we want to call them "characters" informally,
that should be allowed, but whenever there is ambiguity we should remember
we're dealing with code points. The implementation shouldn't matter:
compliant Python interpreters might choose to use UTF-8 internally, or
UTF-16, or UTF-32, or something else, and still agree on how many
characters a string contains.

Normalisation is still an issue, of course, but any decent Unicode
implementation will include a way to normalise or denormalise strings.

The question of graphemes (what "ordinary people" consider letters and
characters, e.g. "ch" is two letters to an English speaker but one letter
to a Czech speaker) should be left to libraries. It's a much harder
problem to solve in the full general case, requires localisation, and is
overkill for many string-processing tasks.


--
Steven

--
https://mail.python.org/mailman/listinfo/python-list