On Sat, Jun 7, 2014 at 1:32 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Michael Torrie <torr...@gmail.com>:
>
>> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
>>> Ethan Furman <et...@stoneleaf.us>:
>>>> ASCII is *not* the state of "this string has no encoding" -- that
>>>> would be Unicode; a Unicode string, as a data type, has no encoding.
>>>
>>> Huh?
>>
>> [...]
>>
>> What part of his statement are you saying "Huh?" about?
>
> Unicode, like ASCII, is a code. Representing text in unicode is
> encoding.

Yes and no. "ASCII" means two things. Firstly, it's a mapping from the
letter A to the number 65, from the exclamation mark to 33, from the
backslash to 92, and so on. Secondly, it's an encoding of those numbers
into the lowest seven bits of a byte, with the high bit left clear.
Between those two, you get a means of representing the letter 'A' as
the byte 0x41, but only the second of them is an encoding.

"Unicode", on the other hand, is only the first part. It maps all the
same characters to the same numbers that ASCII does, and then adds a
few more... a few followed by a few, followed by... okay, quite a lot
more. Unicode specifies that the character OK HAND SIGN, which looks
like 👌 if you have the right font, is number 1F44C in hex (128076
decimal). This is the "Universal Character Set" or UCS.

ASCII could specify a single encoding, because that encoding makes
sense for nearly all purposes. (There are times when you transmit ASCII
text and use the high bit to mean something else, like parity or "this
is the end of a word" or something, but even then, you follow the same
convention of packing a number into the low seven bits of a byte.)
Unicode can't, because the different encodings each have their own pros
and cons, and so we have UCS Transformation Formats like UTF-8 and
UTF-32. Each one is an encoding that maps a codepoint to a sequence of
bytes.

You can't represent text in "Unicode" in a computer. Somewhere along
the way, you have to figure out how to store those codepoints as bytes,
or as something more concrete (you could, for instance, use a Python
list of Python integers; I can't say that it would be in any way more
efficient than the alternatives, but it would be plausible); and that's
the encoding.

ChrisA
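
P.S. A rough Python 3 sketch of the distinction, in case it helps. The
variable names are made up for illustration, and the codec names are
the ones the standard library ships with; treat it as a demonstration
under those assumptions rather than anything definitive.

```python
# Code points are just numbers; no bytes are involved at this stage.
text = "A\U0001F44C"  # LATIN CAPITAL LETTER A, then OK HAND SIGN
code_points = [ord(ch) for ch in text]  # the "list of integers" view

# An encoding decides how those numbers are stored as bytes.
# Different UTFs give different byte sequences for the same code points.
utf8 = text.encode("utf-8")       # variable width: 1 byte + 4 bytes
utf32 = text.encode("utf-32-be")  # fixed width: 4 bytes per code point

# ASCII only covers code points 0..127, so the hand sign cannot be encoded.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(code_points)
print(utf8, len(utf8))
print(utf32, len(utf32))
print("fits in ASCII:", ascii_ok)
```

The str object holds the code points; bytes only appear once you pick
an encoding with .encode(), and .decode() with the same codec gets you
back to the same text.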