On Thu, 12 Sep 2013 10:31:26 +1000, Chris Angelico wrote:

> On Thu, Sep 12, 2013 at 10:25 AM, Mark Janssen
> <dreamingforw...@gmail.com> wrote:
>> Well now, this is an area that is not actually well-defined. I would
>> say 16-bit Unicode is binary data if you're encoding in base 65,536,
>> just as 8-bit ascii is binary data if you're encoding in base-256.
>> Which is to say: there is no intervening data to suggest a TYPE.
>
> Unicode is not 16-bit any more than ASCII is 8-bit. And you used the
> word "encod[e]", which is the standard way to turn Unicode into bytes
> anyway. No, a Unicode string is a series of codepoints - it's most
> similar to a list of ints than to a stream of bytes.

And not necessarily ints, for that matter. Let's be clear: the most
obvious, simple, hardware-efficient way to implement a Unicode string
holding arbitrary characters is as an array of 32-bit signed integers
restricted to the range 0x0 - 0x10FFFF. That gives you a one-to-one
mapping of int <-> code point.

But it's not the only way. One could implement Unicode strings using
any similar one-to-one mapping. Taking a leaf out of the lambda
calculus, I might implement each code point like this:

NULL pointer <=> Code point 0
^NULL <=> Code point 1
^^NULL <=> Code point 2
^^^NULL <=> Code point 3

and so on, where ^ means "pointer to". Obviously this is mathematically
neat, but practically impractical: code point U+10FFFF would require a
chain of 1114111 pointer-to-pointer-to-pointers before the NULL. But it
would work.

Or alternatively, I might choose to use floats, mapping (say)
0.25 <=> U+0376. Or whatever.

What we can say, though, is that representing the full Unicode charset
requires 21 bits per code point, although you can get away with fewer
bits if you have some out-of-band mechanism for recognising restricted
subsets of the charset. (E.g. you could use just 7 bits if you only
handled the characters in ASCII, or just 3 bits if you only cared about
decimal digits.) In practice, computers tend to be much faster when
working with multiples of 8 bits, so we use 32 bits instead of 21. In
that sense, Unicode is a 32-bit character set. But Unicode is
absolutely not a 16-bit character set.

And of course you can use *more* bits than 21, or 32. If you had a
computer where the native word size was (say) 50 bits, it would make
sense to use 50 bits per character.

As for the question of "binary data versus text", well, that's a thorny
one, because really *everything* in a computer is binary data, since
it's stored using bits. But we can choose to *interpret* some binary
data as text, just as we interpret some binary data as pictures, sound
files, video, PowerPoint presentations, and so forth.

A reasonable way of defining a text file might be: if you decode the
bytes making up an alleged text file into code points, using the
correct encoding (which needs to be known a priori, or stored out of
band somehow), then provided that none of the code points have Unicode
General Category Cc, Cf, Cs, Co or Cn (control, format, surrogate,
private-use, non-character/reserved), you can claim that it is at least
plausible that the file contains text. Whether that text is meaningful
is another story.

You might wish to allow Cf and possibly even Co (format and
private-use), depending on the application.

--
Steven
--
https://mail.python.org/mailman/listinfo/python-list
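
To put the "list of ints" view above in concrete terms, a quick
Python 3 illustration (just a sketch; the string and variable names
are arbitrary):

# A str behaves like a sequence of code points: ints in 0..0x10FFFF.
s = "A\u0376\U0010FFFF"
codepoints = [ord(c) for c in s]
print(codepoints)                      # [65, 886, 1114111]

# The largest code point, U+10FFFF, needs exactly 21 bits.
print(max(codepoints).bit_length())    # 21

# And the range is a hard limit: chr() refuses anything above it.
try:
    chr(0x110000)
except ValueError as err:
    print("not a code point:", err)

ord() and chr() are just the int <-> code point mapping made explicit;
how the interpreter stores the string internally (one, two or four
bytes per character under CPython 3.3's flexible string representation)
is a separate question.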
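
And a rough sketch of the "plausible text" test described above, again
in Python 3. The function name and the small whitespace whitelist are
illustrative choices, not a standard API (newline and tab have category
Cc, so a strict reading of the rule would reject perfectly ordinary
text files):

import unicodedata

# Categories ruled out by the definition above: control, format,
# surrogate, private-use and unassigned/non-character code points.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn"}
ALLOWED_WHITESPACE = {"\n", "\r", "\t"}

def plausibly_text(raw_bytes, encoding):
    """Return True if raw_bytes decodes cleanly with the given encoding
    and contains no code points from the suspect categories."""
    try:
        decoded = raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        return False
    return all(ch in ALLOWED_WHITESPACE or
               unicodedata.category(ch) not in SUSPECT_CATEGORIES
               for ch in decoded)

print(plausibly_text("Hello, \u0376orld!\n".encode("utf-8"), "utf-8"))  # True
print(plausibly_text(b"\x07\x00binary\xff", "latin-1"))                 # False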