Chris wrote:
> On May 28, 11:08 am, [EMAIL PROTECTED] wrote:
>> Say I have a file, utf8_input, that contains a single character, é,
>> coded as UTF-8:
>>
>> $ hexdump -C utf8_input
>> 00000000 c3 a9
>> 00000002
[...]
> weird thing is 'c3 a9' is é on my side... and copy/pasting the é
> gives me 'e9' with the first script giving a result of zero and second
> script gives me 1

Don't worry, it may be that those are equivalent. The point is that some
characters exist more than once, and some exist both in a precomposed form
(e with accent) and as a sequence (e followed by a combining accent).
Looking at http://unicode.org/charts I see that the letter above should
have codepoint 0xe9 (precomposed character) or 0x65 (e) followed by 0x301
(combining acute accent).

  0xe9      = 1110 1001            (codepoint)
  0xc3 0xa9 = 1100 0011 1010 1001  (UTF-8)

Anyhow, looking at this further shows that your editor simply doesn't
interpret the two bytes as UTF-8 but as Latin-1 or a similar encoding,
where they represent the capital A with tilde and the copyright sign.

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
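
P.S. Both points can be checked from the interpreter. This is only a
minimal sketch, assuming Python 3 (where str is Unicode); the variable
names are my own and not taken from the scripts discussed in this thread.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single codepoint
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Different codepoint sequences, but equal after normalization.
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# The same character as bytes in two encodings.
utf8_bytes = precomposed.encode("utf-8")      # b'\xc3\xa9'
latin1_bytes = precomposed.encode("latin-1")  # b'\xe9'

# Reading the UTF-8 bytes as Latin-1 yields two characters instead of one.
misread = utf8_bytes.decode("latin-1")
print([unicodedata.name(c) for c in misread])
# ['LATIN CAPITAL LETTER A WITH TILDE', 'COPYRIGHT SIGN']
```

So 'c3 a9' and 'e9' are the same character in two different encodings,
and which one you get on copy/paste depends on what your editor or
terminal is set to.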