On Sat, 13 Jul 2013 20:09:31 -0700, vek.m1234 wrote:

> http://stackoverflow.com/questions/17632246/beazley-4e-p-e-r-page29-unicode
>
> "directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o'
> simply produces a nine-character string U+004A, U+0061, U+006C, U+0061,
> U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not what you
> intended. This is because in UTF-8, the multi-byte sequence \xc3\xb1 is
> supposed to represent the single character U+00F1, not the two
> characters U+00C3 and U+00B1."

This demonstrates confusion about the fundamental concepts, while still
accidentally getting the basic facts right. No wonder it is confusing
you, it confuses me too! :-)

Encoding does not generate a character string, it generates bytes. So
the person you are quoting is causing confusion when he talks about an
"encoded string": he should either make it clear he means a string of
bytes, or not mention the word string at all. Either of these would
work:

    ... a UTF-8 encoded byte-string b'Jalape\xc3\xb1o'

    ... UTF-8 encoded bytes b'Jalape\xc3\xb1o'

For older versions of Python (2.5 or older), unfortunately the b''
notation does not work, and you have to leave out the b.
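You can watch the round trip for yourself. A quick Python 3 session
(in Python 2.6 or 2.7 you would write u'Jalape\xf1o' instead, and the
exact display depends on your terminal):

py> 'Jalapeño'.encode('utf-8')  # text string in, bytes out
b'Jalape\xc3\xb1o'
py> b'Jalape\xc3\xb1o'.decode('utf-8')  # bytes in, text string out
'Jalapeño'

Notice that encode always takes you from str to bytes, and decode
always takes you from bytes back to str.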
Even better would be if Python did not conflate ASCII characters with
bytes, and forced you to write byte strings like this:

    ... a UTF-8 encoded byte-string b'\x4a\x61\x6c\x61\x70\x65\xc3\xb1\x6f'

thus keeping the distinction between ASCII characters and bytes clear.
But that would break backwards compatibility *way* too much, and so
Python continues to conflate ASCII characters with bytes, even in
Python 3. But I digress.

The important thing here is that the bytes b'Jalape\xc3\xb1o' consist
of nine hexadecimal values, as shown above. Seven of them represent
the ASCII characters Jalape and o, and two of them are not ASCII.
Their meaning depends on what encoding you are using.

(To be precise, even the meaning of the other seven bytes depends on
the encoding. Fortunately, or unfortunately as the case may be, *most*
but not all encodings use the same hex values for ASCII characters as
ASCII itself does, so I will stop mentioning this and just pretend
that character J always equals hex byte 4A. But now you know the
truth.)

Since we're using the UTF-8 encoding, the two bytes \xc3\xb1 represent
the character ñ, also known as LATIN SMALL LETTER N WITH TILDE. In
other encodings, those two bytes will represent something different.

So, I presume that the original person's *intention* was to get a
Unicode text string 'Jalapeño'. If they were wise in the ways of
Unicode, they would write one of these:

    'Jalape\N{LATIN SMALL LETTER N WITH TILDE}o'
    'Jalape\u00F1o'
    'Jalape\U000000F1o'
    'Jalape\xF1o'  # hex
    'Jalape\361o'  # octal

and be happy. (In Python 2, they would need to prefix all of these
with u, to use Unicode strings instead of byte strings.)
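If you want to convince yourself that those five spellings really are
the same string, here is a quick check in Python 3 (chained == is fine
for this):

py> ('Jalape\N{LATIN SMALL LETTER N WITH TILDE}o' == 'Jalape\u00F1o'
...  == 'Jalape\U000000F1o' == 'Jalape\xF1o' == 'Jalape\361o')
True
py> len('Jalape\u00F1o')  # eight characters, as intended
8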
But alas they have been misled by those who propagate myths,
misunderstandings and misapprehensions about Unicode all over the
Internet, and so they looked up ñ somewhere, discovered that it has
the double-byte hex value c3b1 in UTF-8, and thought they could write
this:

    'Jalape\xc3\xb1o'

This does not do what they think it does. It creates a *text string*,
a Unicode string, with NINE characters:

    J a l a p e Ã ± o

Why? Because character Ã has ordinal value 195, which is c3 in hex,
hence \xc3 is the character Ã; likewise \xb1 is the character ± which
has ordinal value 177 (b1 in hex). And so they have discovered the
wickedness that is mojibake.

http://en.wikipedia.org/wiki/Mojibake

Instead, if they had started with a *byte-string*, and explicitly
decoded it as UTF-8, they would have been fine:

    # I manually encoded 'Jalapeño' to get the bytes below:
    bytes = b'Jalape\xc3\xb1o'
    print(bytes.decode('utf-8'))

> My original question was: Shouldn't this be 8 characters - not 9? He
> says: \xc3\xb1 is supposed to represent the single character. However
> after some interaction with fellow Pythonistas i'm even more confused.

Depends on the context. \xc3\xb1 could mean the Unicode string
'\xc3\xb1' (in Python 2, written u'\xc3\xb1') or it could mean the
byte-string b'\xc3\xb1' (in Python 2.5 or older, written without the
b).

As a string, \xc3\xb1 means two characters, with ordinal values 0xC3
(or decimal 195) and 0xB1 (or decimal 177), namely 'Ã' and '±'.

As bytes, \xc3\xb1 represent two bytes (well, duh), which could mean
nearly anything:

- the 16-bit Big Endian integer 50097
- the 16-bit Little Endian integer 45507
- a 4x4 black and white bitmap
- the character '簽' (CJK UNIFIED IDEOGRAPH-7C3D) in Big5 encoded bytes
- '뇃' (HANGUL SYLLABLE NWAES) in UTF-16 (Big Endian) encoded bytes
- 'ñ' in UTF-8 encoded bytes
- the two characters 'Ã±' in Latin-1 encoded bytes
- '√±' in MacRoman encoded bytes
- 'Γ±' in ISO-8859-7 encoded bytes

and so forth. Without knowing the context, there is no way of telling
what those two bytes represent, or whether they need to be taken
together as a pair, or as two distinct things.
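You can watch those two bytes change meaning just by changing the
codec. A Python 3 sketch (the glyphs you actually see depend on your
fonts; the integer and bitmap interpretations above are of course not
codecs at all):

py> raw = b'\xc3\xb1'
py> for codec in ('utf-8', 'latin-1', 'big5', 'utf-16-be',
...               'mac-roman', 'iso8859-7'):
...     print(codec, '->', raw.decode(codec))
...
utf-8 -> ñ
latin-1 -> Ã±
big5 -> 簽
utf-16-be -> 뇃
mac-roman -> √±
iso8859-7 -> Γ±

Same two bytes every time; only the interpretation changes.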
> With reference to the above para:
> 1. What does he mean by "writing a raw UTF-8 encoded string"??

He means he is confused. You don't get a text string by encoding, you
get bytes (I will accept "byte-string"). The adjective "raw" doesn't
really mean anything in this context. You have bytes that were
encoded, or you have a string containing characters. Raw doesn't
really mean anything except "hey, pay attention, this is low-level
stuff" (for some definition of "low level").

> In Python2, one can do 'Jalape funny-n o'.

Nothing funny about it to Spanish speakers. Personally, I have always
considered "o" to be pretty funny. Say "woman" and "women" aloud -- in
the first one, it sounds like "w-oo-man", in the second it sounds like
"w-i-men". Now that's funny. But I digress.

If you type 'Jalapeño' in Python 2 (with or without the b prefix), the
result you get will depend on your terminal settings, but the chances
are high that the terminal will internally represent the string as
UTF-8, which gives you bytes b'Jalape\xc3\xb1o', which is *nine*
bytes. When printed, your terminal will try to print each byte
separately, giving:

    byte \x4a prints as J
    byte \x61 prints as a
    byte \x6c prints as l
    ... and so forth.

If you are *unlucky* your terminal may even be smart enough to print
the two bytes \xc3\xb1 as one character, giving you the ñ you were
hoping for. Why unlucky? Because you got the right result by accident.
Next time you do the same thing, on a different terminal, or the same
terminal set to a different encoding, you will get a completely
different result, and think that Unicode is too messed up to use.

Using Python 2.5, here I print the same string three times in a row,
changing the terminal's encoding each time:

py> print 'Jalape\xc3\xb1o'  # terminal set to UTF-8
Jalapeño
py> print 'Jalape\xc3\xb1o'  # and ISO-8859-6 (Arabic)
Jalapeأ�o
py> print 'Jalape\xc3\xb1o'  # and ISO-8859-5 (Cyrillic)
JalapeУБo

Which one is "right"? Answer: none of them. Not even the first, which
by accident just happened to be what we were hoping for.

Really, don't feel bad that you are confused. Between Python 2, and
the terminal trying *really hard* to do the right thing, it is easy to
get confused, because sometimes the right thing happens and sometimes
it doesn't.

> This is a 'bytes' string where each glyph is 1 byte long

Nope. It's a string of characters. Glyphs don't come into it. Glyphs
are the little pictures of letters that you see on the screen, or
printed on paper. They could be bitmaps, or fancy vector graphics.
They are unlikely to be one byte each -- more likely 200 bytes per
glyph, based on a very rough calculation[1], but depending on whether
it is a bitmap, a Postscript font, an OpenType font, or something
else.

> when stored internally so each glyph is
> associated with an integer as per charset ASCII or Latin-1. If these
> charsets have a funny-n glyph then yay! else nay! There is no UTF-8
> here!! or UTF-16!! These are plain bytes (8 bits).

You're getting closer. And you are right: Python 2 "strings" are
byte-strings, which means UTF-8 doesn't come into it. But your
terminal might treat those bytes as UTF-8, and so accidentally do the
"right" (wrong) thing.

> Unicode is a really big mapping table between glyphs and integers and

Not glyphs. Between abstract "characters" and integers, called Code
Points. Unicode contains:

- distinct letters, digits, characters
- accented letters
- accents on their own
- symbols, emoticons
- ligatures and variant forms of characters
- chars required only for backwards-compatibility with older encodings
- whitespace
- control characters
- code points reserved for private use, which can mean anything you like
- code points reserved as "will never be used"
- code points explicitly labelled "not a character"

and possibly others I have forgotten.

> are denoted as Uxxxx or Uxxxx-xxxx.

The official Unicode notation is:

    U+xxxx
    U+xxxxx
    U+xxxxxx

that is, U+ followed by exactly four, five or six hex digits. The U is
always uppercase. Unfortunately Python doesn't support that notation,
and you have to use either four or eight hex digits, e.g.:

    \uFFFF
    \U0010FFFF

For code points (ordinals) up to 255, you can also use hex or octal
escapes, e.g.:

    \xFF
    \377

> UTF-8 UTF-16 are encodings to store
> those big integers in an efficient manner.

Almost correct. They're not necessarily efficient. Unicode code points
are just abstract numbers that we give some meaning to. Code point 65
(U+0041, because hex 41 == decimal 65) means letter A, and so forth.
Imagine these abstract code points floating in your head. How do you
get the abstract concept of a code point into concrete form on a
computer? The same way *everything* is put in a computer: as bytes. So
we have to turn each abstract code point (a number) into a series of
bytes.

Unicode code points range from U+0000 to U+10FFFF, which means we
could just use exactly three bytes, which take values from 000000 to
10FFFF in hexadecimal. Values outside of this range, say 110000, would
be an error. For reasons of efficiency, it's faster and better to use
*four* bytes, even though one of the four will always have the value
zero. In a nutshell, that's the UTF-32 encoding: every character uses
exactly four bytes. E.g. code point U+0041 (character A) is hex bytes
00000041, or possibly 41000000, depending on whether your computer is
Big Endian or Little Endian.

Since *most* text uses quite low ordinal values, that's awfully
wasteful of memory. So UTF-16 uses just two bytes per character, and a
weird scheme using so-called "surrogate pairs" for everything that
won't fit into two bytes. It works, for some definition of "works",
but it is complicated, and you really want to avoid UTF-16 if you need
code points above U+FFFF.

UTF-8 uses a neat variable-width encoding where characters with low
ordinal values get encoded as a single byte (better still: it is the
same byte as ASCII uses, which means old software that assumes
everything in the world is ASCII will keep working, well, mostly
working). Higher ordinals get encoded as two, three or four bytes[2].
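Here is a rough way to see those sizes for yourself in Python 3 (I use
the big-endian variants so that no byte-order mark inflates the
counts; U+10348 is GOTHIC LETTER HWAIR, chosen just because it lies
above U+FFFF):

py> for ch in ('A', '\xf1', '\u20ac', '\U00010348'):
...     print(ascii(ch),
...           len(ch.encode('utf-8')),
...           len(ch.encode('utf-16-be')),
...           len(ch.encode('utf-32-be')))
...
'A' 1 2 4
'\xf1' 2 2 4
'\u20ac' 3 2 4
'\U00010348' 4 4 4

Note the last row: in UTF-16 that code point needs a surrogate pair,
hence four bytes even there.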
Best of all, unlike most historical variable-width encodings, UTF-8 is
self-synchronising. In legacy encodings, if a single byte gets
corrupted, it can mangle *everything* from that point on. With UTF-8,
a single corrupted byte will mangle only the single code point
containing it; everything following will be okay.

> So when DB says "writing a
> raw UTF-8 encoded string" - well the only way to do this is to use
> Python3 where the default string literals are stored in Unicode which
> then will use a UTF-8 UTF-16 internally to store the bytes in their
> respective structures; or, one could use u'Jalape' which is unicode in
> both languages (note the leading 'u').

Python never uses UTF-8 internally for storing strings in memory.
Because it is a variable-width encoding, you cannot index strings
efficiently if they use UTF-8 for storage. Instead, Python uses one of
three different systems:

- Up to Python 3.2, you have a choice. When you compile the Python
  interpreter, you can choose whether it should use UTF-16 or UTF-32
  for in-memory storage. This choice is called a "narrow" or "wide"
  build. A narrow build uses less memory, but cannot handle code
  points above U+FFFF very well. A wide build uses more memory, but
  handles the complete range of code points perfectly.

- Starting in Python 3.3, the choice of how to store the string in
  memory is no longer decided up front when you build the Python
  interpreter. Instead, Python automatically chooses the most
  efficient internal representation for each individual string.
  Strings which only use ASCII or Latin-1 characters use one byte per
  character; strings which use code points up to U+FFFF use two bytes
  per character; and only strings which use code points above that use
  four bytes per character.
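You can watch the Python 3.3 scheme at work with sys.getsizeof. The
numbers below are from one 64-bit build; the fixed overhead varies
between versions and builds, but the per-character cost is the
interesting part:

py> import sys
py> sys.getsizeof('a' * 1000)         # ASCII: 1 byte per character
1049
py> sys.getsizeof('\xf1' * 1000)      # Latin-1: still 1 byte each
1073
py> sys.getsizeof('\u20ac' * 1000)    # up to U+FFFF: 2 bytes each
2074
py> sys.getsizeof('\U00010348' * 1000)  # above U+FFFF: 4 bytes each
4076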
> 2. So assuming this is Python 3: 'Jalape \xYY \xZZ o' (spaces for
> readability) what DB is saying is that, the stupid-user would expect
> Jalapeno with a squiggly-n but instead he gets is: Jalape funny1 funny2
> o (spaces for readability) -9 glyphs or 9 Unicode-points or 9-UTF8
> characters. Correct?

Kind of. See above.

> 3. Which leaves me wondering what he means by: "This is because in
> UTF-8, the multi- byte sequence \xc3\xb1 is supposed to represent the
> single character U+00F1, not the two characters U+00C3 and U+00B1"

He means that the single code point U+00F1 (character ñ, n with a
tilde) is stored as the two bytes c3b1 (in hexadecimal) if you encode
it using UTF-8. But if you stuff characters \xc3 \xb1 into a Unicode
string (instead of bytes), then you get two Unicode characters U+00C3
and U+00B1. To put it another way, inside strings, Python treats the
hex escape \xC3 as just a different way of writing the Unicode code
point \u00C3 or \U000000C3.

However, if you create a byte-string:

    b'Jalape\xc3\xb1o'

by looking up a table of UTF-8 encodings, as presumably the original
poster did, and then decode those bytes to a string, you will get what
you expect. Using Python 2.5, where the b prefix is not needed:

py> tasty = 'Jalape\xc3\xb1o'  # actually bytes
py> tasty.decode('utf-8')
u'Jalape\xf1o'
py> print tasty.decode('utf-8')  # oops, I forgot to reset my terminal
JalapeУБo
py> print tasty.decode('utf-8')  # terminal now set back to UTF-8
Jalapeño

> Could someone take the time to read carefully and clarify what DB is
> saying??

Hope this helps.

[1] Assume the font file is 100K in size, and it has glyphs for 512
characters. That works out to roughly 195 bytes per glyph.

[2] Technically, the UTF-8 scheme can handle 31-bit code points, up to
the (hypothetical) code point U+7FFFFFFF, using up to six bytes per
code point. But Unicode officially will never go past U+10FFFF, and so
UTF-8 also will never go past four bytes per code point.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list