Marko Rauhamaa wrote:

> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:
>
>> Marko Rauhamaa wrote:
>>> '\udd00' is a valid str object:
>>
>> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
>> surrogates, but that Python allows you to create lone surrogates in
>> the first place. That's not a rhetorical question. It's a genuine
>> question.
>
> The problem is that no matter how you shuffle surrogates, encoding
> schemes, coding points and the like, a wrinkle always remains.
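(For context, the behaviour under discussion is easy to reproduce in
CPython 3. A quick interactive sketch; the exact traceback wording may
vary between versions:)

    >>> s = '\udd00'       # a lone surrogate is accepted as a str
    >>> len(s)
    1
    >>> s.encode('utf-8')  # but the strict UTF-8 codec refuses it
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udd00' in position 0: surrogates not allowed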
Really? Define your terms. Can you define "wrinkles", and prove that it
is impossible to remove them? What's so bad about wrinkles anyway?

> I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ.
> But that's where the buck stops; traditional arithmetic functions are
> closed under ℂ.

That's simply incorrect. What's z/(0+0i)?

There are many more number sets used by mathematicians, some going back
to the 1800s. Here are just a few:

* ℝ-overbar or [−∞, +∞], which adds a pair of infinities to ℝ.
* ℝ-caret or ℝ+{∞}, which does the same but with a single unsigned
  infinity.
* A similar extended version of ℂ with a single infinity.
* Split-complex or hyperbolic numbers, defined similarly to ℂ except
  that the extra unit j satisfies j**2 = +1 (rather than the complex
  i**2 = -1).
* Dual numbers, which add a single infinitesimal number ε != 0 with the
  property that ε**2 = 0.
* Hyperreal numbers.
* John Conway's surreal numbers, which may be the largest possible
  number system, in the sense that they contain all finite, infinite
  and infinitesimal numbers. (The hyperreals and dual numbers can be
  considered subsets of the surreals.)

Extending ℝ to ℂ is the first step of the Cayley–Dickson construction,
and there are infinitely many algebras (and hence number sets) which
can be built by repeating it. The next few are:

* Hamilton's quaternions ℍ, very useful for dealing with rotations in
  3D space. They fell out of favour for some decades, but are now
  experiencing something of a renaissance.
* Octonions or Cayley numbers.
* Sedenions.

> Unicode apparently hasn't found a similar closure.

Similar in what way? And why do you think this is important?

It is not a requirement for every possible byte sequence to be a valid
Unicode string, any more than it is a requirement for every possible
byte sequence to be a valid JPG image, zip archive, or ELF executable.
Some byte strings simply are not JPG images, zip archives or ELF
executables -- or Unicode strings. So what? Why do you think that is a
problem that needs fixing by the Unicode standard?

It may be a problem that needs fixing by (for example) programming
languages, and Python invented the surrogateescape error handler to
smuggle such invalid bytes into strings (a short demonstration follows
below). Other solutions may exist as well. But that's not part of
Unicode and it isn't a problem for Unicode.

> That's why I think that while UTF-8 is a fabulous way to bring Unicode
> to Linux, Linux should have taken the tack that Unicode is always an
> application-level interpretation with few operating system tie-ins.

"Should have"? That is *exactly* the status quo, and while it was the
only practical solution given Linux's history, it's a horrible idea.
That Unicode is stuck on top of an OS which is unaware of Unicode is
precisely why we're left with problems like "how do you represent
arbitrary bytes as Unicode strings?".

> Unfortunately, the GNU world is busy trying to build a Unicode
> frosting everywhere. The illusion can never be complete but is
> convincing enough for application developers to forget to handle
> corner cases.
>
> To answer your question, I think every code point from 0 to 1114111
> should be treated as valid and analogous.

Your opinion isn't very relevant. What is relevant is what the Unicode
standard demands, and I think it requires that strings containing
surrogates are illegal (rather like x/0 is illegal in the real numbers).
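As a side note, the surrogateescape error handler mentioned above can be
demonstrated in a few lines (a minimal sketch, assuming CPython 3: each
undecodable byte in the 0x80-0xFF range is mapped to a lone surrogate in
U+DC80..U+DCFF on the way in, and back again on the way out):

    >>> raw = b'\xff\xfe\x80'                       # not valid UTF-8
    >>> s = raw.decode('utf-8', 'surrogateescape')  # bad bytes become lone surrogates
    >>> s
    '\udcff\udcfe\udc80'
    >>> s.encode('utf-8', 'surrogateescape')        # and round-trips back unchanged
    b'\xff\xfe\x80'

That is exactly the "smuggling" described: the bytes survive a
bytes -> str -> bytes round trip even though they are not valid UTF-8.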
Wikipedia states:

    The Unicode standard permanently reserves these code point values
    [U+D800 to U+DFFF] for UTF-16 encoding of the high and low
    surrogates, and they will never be assigned a character, so there
    should be no reason to encode them. The official Unicode standard
    says that no UTF forms, including UTF-16, can encode these code
    points.

    However UCS-2, UTF-8, and UTF-32 can encode these code points in
    trivial and obvious ways, and large amounts of software does so even
    though the standard states that such arrangements should be treated
    as encoding errors.

    It is possible to unambiguously encode them in UTF-16 by using a
    code unit equal to the code point, as long as no sequence of two
    code units can be interpreted as a legal surrogate pair (that is, as
    long as a high surrogate is never followed by a low surrogate). The
    majority of UTF-16 encoder and decoder implementations translate
    between encodings as though this were the case.

    http://en.wikipedia.org/wiki/UTF-16

So yet again we are left with the conclusion that *buggy
implementations* of Unicode cause problems, not the Unicode standard
itself.

> Thus Python is correct here:
>
>     >>> len('\udd00')
>     1
>     >>> len('\ufeff')
>     1
>
> The alternatives are far too messy to consider.

Not at all. '\udd00' should be a SyntaxError.

-- 
Steven