MRAB wrote:
Dave Angel wrote:
¯º¿Â wrote:
On 3 Aug, 18:41, Dave Angel <da...@ieee.org> wrote:
Different encodings are just different ways of storing the data on the
media, correct?
Exactly. The file is a stream of bytes, and Unicode has more than 256
possible characters. Further, even the subset of characters that *do*
take one byte is different for different encodings. So you need to
tell the editor what encoding you want to use.
For example, is an 'a' char in iso-8859-1 stored differently than an
'a' char in iso-8859-7 or an 'a' char in utf-8?
Nope, the ASCII subset is identical. It's the characters between 0x80
and 0xff that differ, and of course not all of those. Further, some of
the codes that are one byte in 8859 are two bytes in utf-8.
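For instance, a quick Python 3 check (an illustration of my own, not
part of the original exchange) makes both points concrete:

>>> 'a'.encode('iso-8859-1') == 'a'.encode('iso-8859-7') == 'a'.encode('utf-8')
True
>>> 'é'.encode('iso-8859-1')   # one byte in Latin-1
b'\xe9'
>>> 'é'.encode('utf-8')        # but two bytes in UTF-8
b'\xc3\xa9'
>>> 'α'.encode('iso-8859-7')   # one byte in the Greek code page
b'\xe1'
>>> 'α'.encode('utf-8')        # two bytes in UTF-8
b'\xce\xb1'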
You *could* just decide that you're going to hardwire the assumption
that you'll be dealing with a single character set that does fit in 8
bits, and most of this complexity goes away. But if you do that, do
*NOT* use utf-8.
But if you do want to be able to handle more than 256 characters, or
more than one encoding, read on.
Many people confuse encoding and decoding. A Unicode character is an
abstraction which represents a raw character. For convenience, the
first 128 code points map directly onto the 7-bit encoding called
ASCII. But before Unicode there were several other extensions of ASCII
to 256 characters, which were incompatible with each other. For
example, a byte which might be a European character in one such
encoding might be a katakana character in another one. Each encoding
was 8 bits, but it was difficult for a single program to handle more
than one such encoding.
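For instance, the very same byte value decodes to three different
characters under three different legacy encodings (a small Python 3
illustration of my own, not from the original message):

>>> b'\xc1'.decode('iso-8859-1')   # Western European: capital A with acute
'Á'
>>> b'\xc1'.decode('iso-8859-7')   # Greek: capital Alpha
'Α'
>>> b'\xc1'.decode('shift_jis')    # Japanese: half-width katakana TI
'ﾁ'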
One encoding might be ASCII + accented Latin, another ASCII + Greek,
another ASCII + Cyrillic, etc. If you wanted ASCII + accented Latin +
Greek then you'd need more than 1 byte per character.
If you're working with multiple alphabets it gets very messy, which is
where Unicode comes in. It contains all those characters, and UTF-8 can
encode all of them in a straightforward manner.
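A mixed Latin-plus-Greek string shows both the problem and the fix
(again just a Python 3 sketch of my own):

>>> s = 'café και ούζο'      # accented Latin plus Greek in one string
>>> s.encode('utf-8')        # UTF-8 covers both alphabets
b'caf\xc3\xa9 \xce\xba\xce\xb1\xce\xb9 \xce\xbf\xcf\x8d\xce\xb6\xce\xbf'
>>> s.encode('iso-8859-1')   # Latin-1 has no Greek letters
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 5-7: ordinal not in range(256)
>>> s.encode('iso-8859-7')   # and the Greek code page has no 'é'
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character '\xe9' in position 3: character maps to <undefined>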
So along comes Unicode, which is typically implemented in 16 or 32 bit
cells. And it has an 8-bit encoding called utf-8 which uses one byte
for the first 192 characters (I think), two bytes for some more, and
three or four bytes beyond that.
[snip]
In UTF-8 the first 128 codepoints are encoded as 1 byte each.
Thanks for the correction. As I said, I wasn't sure. I wrote a utf-8
encoder and decoder about a dozen years ago, and I remember parts of it
treat the top two bits specially. But I've checked now, and you're
right: the cutoff is 0x7f.
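For anyone who wants to see it, a quick Python 3 sketch (illustration
only) of where the byte counts change, and of the bit patterns behind
the "top two bits" remark above:

>>> [len(chr(cp).encode('utf-8')) for cp in (0x7f, 0x80, 0x7ff, 0x800, 0xffff, 0x10000)]
[1, 2, 2, 3, 3, 4]
>>> ['{:08b}'.format(b) for b in 'α'.encode('utf-8')]
['11001110', '10110001']

The lead byte of a multi-byte sequence starts with 11 (110..... for two
bytes, 1110.... for three), and every continuation byte starts with 10.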
DaveA