Marko Rauhamaa wrote:
> Michael Torrie <torr...@gmail.com>:
>
>> Unicode can only be encoded to bytes.
>> Bytes can only be decoded to unicode.
>
> I don't really like it how Unicode is equated with text, or even
> character strings.

That surely depends on the context. To be technically correct, Unicode
is a character set together with a set of rules for dealing with its
characters (e.g. rules for uppercasing characters, sorting rules, etc.).
When referring to the standard, "Unicode" is a noun; when referring to
text, it is actually an adjective being used as a noun. That is,
"Unicode text" has become abbreviated as just "Unicode" in much the same
way as "human beings" has become abbreviated as just "humans".

In that sense, "text is Unicode" just means "in the context in which we
are talking, when I say 'text' I mean 'Unicode text' as opposed to (for
example) 'ASCII text' or 'KOI-8 text'." It certainly doesn't mean that
*all* text in other contexts is Unicode, since that is obviously untrue.
(E.g. there are millions of existing files across the world containing
text which use legacy encodings that are not compatible with Unicode.)

> There's barely any difference between the truth value of these
> statements:
>
>    Python strings are ASCII.
>
>    Python strings are Latin-1.
>
>    Python strings are Unicode.
>
> Each of those statements is true as long as you stay within the
> respective character sets, and cease to be true when your text contains
> characters outside the character sets.

When we say "Python strings are FOO", we are making a statement about
arbitrary Python strings, not a particular set of concrete examples of
strings. If Python strings are FOO, that means that for all possible
Python strings s, "s is FOO" is a true statement.

We cannot say that Python strings are uppercase, because we can easily
find counter-examples such as 'xyz'. Likewise we cannot say Python
strings are ASCII, or Latin-1, because we can easily find
counter-examples such as 'Ř'.

On the other hand, Python strings *are* Unicode, because by design
Python strings are limited to Unicode. Every Python string is a Unicode
string.

> Now, it is true that Python currently limits itself to the 1,114,112
> Unicode code points.
> And it likely won't adopt more characters unless
> Unicode does it first. However, text is something more lofty and
> abstract than a sequence of Unicode code points.

You are certainly correct that in its full generality, "text" is much
more than just a string of code points. Unicode strings are a primitive
data type. A powerful and sophisticated text processing application may
even find Python strings too primitive, possibly needing something like
ropes of graphemes rather than strings of code points.

We Western and Northern European speakers -- and I don't know whether
Finns are counted as Northern Europeans or Eastern Europeans -- are
lucky in that our natural languages are well-covered by Unicode. All our
graphemes are also code points, even the "funny ones with accents". As
an English speaker, I have to remind myself that not every grapheme is a
single code point, but Devanagari or Navajo writers will never make that
mistake.

> We shouldn't call strings Unicode any more than we call numbers IEEE or
> times ISO.

We certainly shouldn't call numbers IEEE, but we might very well call
them IEEE-754. Actually, since IEEE-754 covers multiple formats, we have
to be more specific: Python floats are IEEE-754 double-precision binary
floats.

-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list
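
P.S. For anyone who wants to check the claims above at the interpreter,
here is a minimal sketch. The helper name fits() and the sample strings
are mine, picked purely for illustration, and the float check assumes a
mainstream CPython build where floats are C doubles.

```python
import struct
import sys
import unicodedata


def fits(s, encoding):
    """Return True if every character of s is encodable in `encoding`."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# A str holds code points 0 through sys.maxunicode: 1,114,112 in total.
print(sys.maxunicode + 1)

# 'Ř' (U+0158) is a counter-example to "strings are ASCII" and to
# "strings are Latin-1", yet it encodes without trouble as UTF-8.
r_caron = "\u0158"
for encoding in ("ascii", "latin-1", "utf-8"):
    print(encoding, fits(r_caron, encoding))

# A Navajo-style grapheme: a + combining ogonek + combining acute.
# Unicode has no single precomposed code point for it, so even after
# NFC normalization this one grapheme is still two code points.
a_ogonek_acute = unicodedata.normalize("NFC", "a\u0328\u0301")
print(len(a_ogonek_acute))

# Floats: a 53-bit significand is what IEEE-754 binary64 specifies, and
# struct's standard ">d" format packs 1.0 to the binary64 bit pattern.
print(sys.float_info.mant_dig, struct.pack(">d", 1.0).hex())
```

The point of fits() is the "for all strings s" reading: one string that
fails to encode is enough to refute "Python strings are ASCII", while
len() counting code points rather than graphemes is exactly why a rope
of graphemes might be wanted for serious text processing.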