Re: UTF-8 and stdin/stdout?

Ulrich Eckhardt Wed, 28 May 2008 03:36:20 -0700

Chris wrote:
> On May 28, 11:08 am, [EMAIL PROTECTED] wrote:
>> Say I have a file, utf8_input, that contains a single character, é,
>> coded as UTF-8:
>>
>> $ hexdump -C utf8_input
>> 00000000  c3 a9
>> 00000002
[...]
> weird thing is 'c3 a9' is Ã© on my side... and copy/pasting the é
> gives me 'e9' with the first script giving a result of zero and second
> script gives me 1


Don't worry, those may well be equivalent. The point is that some characters
can be represented in more than one way: as a single precomposed character
(e with accent) or as a base character followed by a combining character
(e and combining accent).

Looking at http://unicode.org/charts I see that the letter above should have
codepoint 0xe9 (precomposed character) or the sequence 0x65 (e) and 0x301
(combining acute accent).
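
To check the two spellings against each other, here is a minimal sketch using
the standard unicodedata module. It assumes Python 3 string semantics (on
Python 2 you would need u"" literals), and the variable names are only for
illustration:

```python
import unicodedata

composed = "\u00e9"      # single precomposed code point
decomposed = "e\u0301"   # letter e (U+0065) followed by COMBINING ACUTE ACCENT

# The two strings differ code point by code point...
print(composed == decomposed)

# ...but normalization maps one form onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

print(unicodedata.name(composed))
```

A plain comparison says the strings are different; they only compare equal
after both have been normalized to the same form.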

0xe9 = 1110 1001 (codepoint)
0xc3 0xa9 = 1100 0011  1010 1001 (UTF-8)
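
The bit patterns above follow the two-byte UTF-8 layout 110xxxxx 10xxxxxx. As
a rough sketch (Python 3 assumed; encode_two_byte is a hypothetical helper
written for this mail, not a library function), the packing can be done by
hand and compared with the built-in encoder:

```python
def encode_two_byte(codepoint):
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= codepoint <= 0x7FF:
        raise ValueError("not a two-byte UTF-8 code point")
    lead = 0b11000000 | (codepoint >> 6)           # 110xxxxx: upper five bits
    trail = 0b10000000 | (codepoint & 0b00111111)  # 10xxxxxx: lower six bits
    return bytes([lead, trail])

encoded = encode_two_byte(0xE9)
print(encoded.hex())

# Cross-check against Python's own UTF-8 encoder.
print(encoded == "\u00e9".encode("utf-8"))
```

For 0xe9 the upper five bits are 00011 and the lower six are 101001, which
gives 1100 0011 and 1010 1001, i.e. c3 a9.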

Anyhow, looking at this further shows that your editor simply doesn't
interpret the two bytes as UTF-8 but as Latin-1 or a similar encoding, where
they represent the capital A with tilde and the copyright sign.
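
The effect is easy to reproduce. A small sketch, assuming Python 3 bytes
objects (the variable names are made up for the example):

```python
raw = b"\xc3\xa9"  # the two bytes from the hexdump

as_utf8 = raw.decode("utf-8")      # one character: e with acute accent
as_latin1 = raw.decode("latin-1")  # two characters: A with tilde, copyright sign

print(len(as_utf8), hex(ord(as_utf8)))
print(len(as_latin1), [hex(ord(ch)) for ch in as_latin1])
```

Decoded as Latin-1, every byte becomes its own character, so the same two
bytes show up as two characters instead of one.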

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

--
http://mail.python.org/mailman/listinfo/python-list
