MRAB wrote:
Dave Angel wrote:
¯º¿Â wrote:
On 3 Aug, 18:41, Dave Angel <da...@ieee.org> wrote:
Different encodings are just different ways of storing the data on the
media, correct?
Exactly. The file is a stream of bytes, and Unicode has more than 256
possible characters. Further, even the subset of characters that *do*
take one byte is different for different encodings. So you need to
tell the editor what encoding you want to use.
For example, is an 'a' char in iso-8859-1 stored differently than an
'a' char in iso-8859-7 or an 'a' char in utf-8?
Nope, the ASCII subset is identical. It's the characters between 0x80
and 0xff that differ, and of course not all of those. Further, some of
the codes that are one byte in 8859 are two bytes in utf-8.
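For instance, a quick Python 3 check (an illustration of my own, not
part of the original exchange) makes both points concrete:

>>> 'a'.encode('iso-8859-1') == 'a'.encode('iso-8859-7') == 'a'.encode('utf-8')
True
>>> 'é'.encode('iso-8859-1')   # one byte in Latin-1
b'\xe9'
>>> 'é'.encode('utf-8')        # but two bytes in UTF-8
b'\xc3\xa9'
>>> 'α'.encode('iso-8859-7')   # one byte in the Greek code page
b'\xe1'
>>> 'α'.encode('utf-8')        # two bytes in UTF-8
b'\xce\xb1'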
You *could* just decide that you're going to hardwire the assumption
that you'll be dealing with a single character set that does fit in 8
bits, and most of this complexity goes away. But if you do that, do
*NOT* use utf-8.
But if you do want to be able to handle more than 256 characters, or
more than one encoding, read on.
Many people confuse encoding and decoding. A Unicode character is an
abstraction which represents a raw character. For convenience, the
first 128 code points map directly onto the 7-bit encoding called
ASCII. But before Unicode there were several other extensions of ASCII
to 256 characters, which were incompatible with each other. For
example, a byte which might be a European character in one such
encoding might be a katakana character in another one. Each encoding
was 8 bits, but it was difficult for a single program to handle more
than one such encoding.
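For instance, the very same byte value decodes to three different
characters under three different legacy encodings (a small Python 3
illustration of my own, not from the original message):

>>> b'\xc1'.decode('iso-8859-1')   # Western European: capital A with acute
'Á'
>>> b'\xc1'.decode('iso-8859-7')   # Greek: capital Alpha
'Α'
>>> b'\xc1'.decode('shift_jis')    # Japanese: half-width katakana TI
'ﾁ'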
One encoding might be ASCII + accented Latin, another ASCII + Greek,
another ASCII + Cyrillic, etc. If you wanted ASCII + accented Latin +
Greek then you'd need more than 1 byte per character.
If you're working with multiple alphabets it gets very messy, which is
where Unicode comes in. It contains all those characters, and UTF-8 can
encode all of them in a straightforward manner.
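A mixed Latin-plus-Greek string shows both the problem and the fix
(again just a Python 3 sketch of my own):

>>> s = 'café και ούζο'      # accented Latin plus Greek in one string
>>> s.encode('utf-8')        # UTF-8 covers both alphabets
b'caf\xc3\xa9 \xce\xba\xce\xb1\xce\xb9 \xce\xbf\xcf\x8d\xce\xb6\xce\xbf'
>>> s.encode('iso-8859-1')   # Latin-1 has no Greek letters
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 5-7: ordinal not in range(256)
>>> s.encode('iso-8859-7')   # and the Greek code page has no 'é'
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character '\xe9' in position 3: character maps to <undefined>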
So along comes Unicode, which is typically implemented in 16 or 32 bit
cells. And it has an 8-bit encoding called utf-8 which uses one byte
for the first 192 characters (I think), two bytes for some more, and
three or four bytes beyond that.
[snip]
In UTF-8 the first 128 codepoints are encoded as 1 byte each.
Thanks for the correction. As I said, I wasn't sure. I wrote a utf-8
encoder and decoder about a dozen years ago, and I remember parts of it
treat the top two bits specially. But I've checked now, and you're
right: the cutoff is 0x7f.
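For anyone who wants to see it, a quick Python 3 sketch (illustration
only) of where the byte counts change, and of the bit patterns behind
the "top two bits" remark above:

>>> [len(chr(cp).encode('utf-8')) for cp in (0x7f, 0x80, 0x7ff, 0x800, 0xffff, 0x10000)]
[1, 2, 2, 3, 3, 4]
>>> ['{:08b}'.format(b) for b in 'α'.encode('utf-8')]
['11001110', '10110001']

The lead byte of a multi-byte sequence starts with 11 (110..... for two
bytes, 1110.... for three), and every continuation byte starts with 10.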
DaveA