On 14 July 2013 04:09, <vek.m1...@gmail.com> wrote:
> http://stackoverflow.com/questions/17632246/beazley-4e-p-e-r-page29-unicode
>
> "directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o' simply
> produces a nine-character string U+004A, U+0061, U+006C, U+0061, U+0070,
> U+0065, U+00C3, U+00B1, U+006F, which is probably not what you intended. This
> is because in UTF-8, the multi-byte sequence \xc3\xb1 is supposed to
> represent the single character U+00F1, not the two characters U+00C3 and
> U+00B1."
Correct.

> My original question was: Shouldn't this be 8 characters - not 9?

No, Python tends to be right on these things.

> He says: \xc3\xb1 is supposed to represent the single character. However
> after some interaction with fellow Pythonistas i'm even more confused.

You would be, given the way he said it.

> With reference to the above para:
> 1. What does he mean by "writing a raw UTF-8 encoded string"??

Well, that doesn't really mean much with no context like he gave it.

> In Python2, one can do 'Jalape funny-n o'. This is a 'bytes' string where
> each glyph is 1 byte long when stored internally so each glyph is associated
> with an integer as per charset ASCII or Latin-1. If these charsets have a
> funny-n glyph then yay! else nay! There is no UTF-8 here!! or UTF-16!! These
> are plain bytes (8 bits).
>
> Unicode is a really big mapping table between glyphs and integers and are
> denoted as Uxxxx or Uxxxx-xxxx.

*Waits for our resident unicode experts to explain why you're actually wrong*

> UTF-8 UTF-16 are encodings to store those big integers in an efficient
> manner. So when DB says "writing a raw UTF-8 encoded string" - well the only
> way to do this is to use Python3 where the default string literals are stored
> in Unicode which then will use a UTF-8 UTF-16 internally to store the bytes
> in their respective structures; or, one could use u'Jalape' which is unicode
> in both languages (note the leading 'u').

Correct.

> 2. So assuming this is Python 3: 'Jalape \xYY \xZZ o' (spaces for
> readability) what DB is saying is that, the stupid-user would expect Jalapeno
> with a squiggly-n but instead he gets is: Jalape funny1 funny2 o (spaces for
> readability) - 9 glyphs or 9 Unicode-points or 9-UTF8 characters. Correct?

I think so.

> 3.
> Which leaves me wondering what he means by:
> "This is because in UTF-8, the multi-byte sequence \xc3\xb1 is supposed to
> represent the single character U+00F1, not the two characters U+00C3 and
> U+00B1"

He's mixed some things up, AFAICT.

> Could someone take the time to read carefully and clarify what DB is saying??

Here's a simple explanation: you're both wrong (or you're both *almost* right):

As of Python 3:

>>> "\xc3\xb1"
'Ã±'
>>> b"\xc3\xb1".decode()
'ñ'

"WHAT?!" you scream, "THAT'S WRONG!" But it's not. Let me explain.

Python 3's strings want you to give each character separately (*winces in case
I'm wrong*). In a str literal, each "\xNN" escape names one code point, not one
byte. So Python is interpreting the "\xc3" as
"\N{LATIN CAPITAL LETTER A WITH TILDE}" and "\xb1" as "\N{PLUS-MINUS SIGN}"¹.
This means that Python is given *two* characters. Python is basically doing
this:

    number = int("c3", 16)  # Convert from base 16: 0xC3 == 195
    chr(number)             # Look up that code point in Unicode: 'Ã'

When you give Python *raw bytes*, you are saying that this is what the string
looks like *when encoded* -- you are not giving Python Unicode, but *encoded
Unicode*. This means that when you decode it (.decode()) it is free to convert
multibyte sequences to their relevant characters.

To see how an *encoded string* is not the same as the string itself, see:

>>> "Jalapeño".encode("ASCII", errors="xmlcharrefreplace")
b'Jalape&#241;o'

Those *represent* the same thing, but the first (according to Python) *is* the
thing, the second needs to be *decoded*.

Now, bringing this back to the original:

>>> "\xc3\xb1".encode()
b'\xc3\x83\xc2\xb1'

You can see that the *encoded* bytes represent the *two* characters (c3 83 is
"Ã" and c2 b1 is "±"); the string you typed above is *not the encoded one*. How
a str is stored is *internal to Python*; you only see bytes once you call
.encode().

I hope that helps; good luck.

¹ Note that I find the "\N{...}" form much easier to read, and recommend it.

-- 
http://mail.python.org/mailman/listinfo/python-list
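P.S. A minimal sketch of the str-versus-bytes difference described above, assuming Python 3 and UTF-8 as the encoding. The variable names (text, raw, decoded, repaired) are mine, purely for illustration, and the last step (going through Latin-1 to undo the mix-up) is an extra I'm adding, not something from the book:

```python
# In a str literal, each \xNN escape names one code point.
text = "Jalape\xc3\xb1o"
# In a bytes literal, each \xNN escape names one byte.
raw = b"Jalape\xc3\xb1o"

print(len(text))  # 9 code points: ... U+00C3, U+00B1 ...
print(len(raw))   # 9 bytes

# Decoding the bytes as UTF-8 merges c3 b1 into the single character U+00F1.
decoded = raw.decode("utf-8")
print(len(decoded))                  # 8
print(decoded == "Jalape\u00f1o")    # True

# Encoding the nine-character str gives two bytes for each of the two
# non-ASCII characters, so 11 bytes in total.
print(text.encode("utf-8"))          # b'Jalape\xc3\x83\xc2\xb1o'

# Assumption: the str was built from UTF-8 bytes by mistake. Latin-1 maps
# code points 0-255 back to the same byte values, so this undoes the mix-up.
repaired = text.encode("latin-1").decode("utf-8")
print(repaired == decoded)           # True
```

The Latin-1 round trip only works because every character in the mangled string is below U+0100, so treat it as a repair trick rather than a habit.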