> 1) If you print a unicode string:
>
> *print implicitly calls str()*

No. print does no conversion if the object is already a string or
unicode object, and calls str() only otherwise.

> a) str() calls encode(), and encode() tries to convert the unicode
> string to a regular string. encode() uses the default encoding, which
> is ascii. If encode() can't convert a character, then encode() raises
> an exception.

Yes and no. This is what str() does, but str() isn't called. Instead,
print inspects sys.stdout.encoding, and uses that encoding to encode
the string. That, in turn, may raise an exception (in particular if
sys.stdout.encoding is "ascii" or not set).

> b) repr() calls encode(), but if encode() raises an exception for a
> character, repr() catches the exception and skips over the character
> leaving the character unchanged.

No. repr() never calls encode. Instead, each type, including unicode,
may have its own __repr__ which is called. unicode.__repr__ escapes
all non-ASCII characters.

> 2) If you print a regular string containing characters in unicode
> syntax:

No. There is no such thing:

py> len("\u")
2
py> "\u"[0]
'\\'
py> "\u"[1]
'u'

In a regular string, \u has no meaning, so \ stands just for itself.

> a) str() calls encode(), but if encode() raises an exception for a
> character, str() catches the exception and skips over the character
> leaving the character unchanged. Same as 1b.

No. Printing a string never invokes .encode(), and no exception occurs
at all. Instead, the \ just gets printed as is.

> b) repr() similar to a), but repr() then escapes the escapes in the
> string.

str.__repr__ escapes the backslash just in case, so that it won't have
to check for the next character; in that sense, it generates a normal
form.

> 3) If you print a regular string containing characters in utf-8
> syntax:
>
> a) str() outputs the string to your terminal, and if your terminal can
> convert the utf-8 numerical codes to characters it does so.

Correct. In general, you should always use the terminal's encoding
when printing to the terminal.
That way, you can print everything that the terminal can display, and
get an exception if you try to print something that the terminal would
be unable to display.

> b) repr() blocks your terminal from interpreting the characters by
> escaping the escapes in your string. Why don't I see two slashes like
> in the output for 2b?

str.__repr__ produces an output that is legal Python syntax for a
string literal. len(u'\u9999'.encode('utf-8')) is 3, so this Chinese
character really encodes as three separate bytes. As these are
non-ASCII bytes, __repr__ chooses a representation that is legal Python
syntax. For those bytes, it emits the hex escapes \xe9, \xa6 and \x99
(each representing a single byte); the backslash there belongs to the
escape and is not a character in the string, so there is nothing to
double. For a backslash, Python could have generated \x5c or \134 as
well, which are all different spellings of "backslash in a string
literal". Python chose the most legible one, which is the
double-backslash.

HTH,
Martin
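
P.S. The byte-level points above can be checked with a small sketch.
It is written for Python 3 as an assumption, since the discussion here
is about Python 2: in Python 3 the "regular string" of this thread
corresponds to bytes, and the variable names are only illustrative.

```python
# Python 3 sketch; bytes plays the role of Python 2's regular str.
ch = "\u9999"                   # a single Chinese character
encoded = ch.encode("utf-8")    # its UTF-8 form is three bytes
assert len(encoded) == 3

# repr() of the bytes uses one hex escape per non-ASCII byte.
encoded_repr = repr(encoded)

# A literal backslash is shown doubled, the most legible spelling.
backslash_repr = repr(b"\\u")

# \x5c and \134 are other spellings of the same single backslash byte.
same_backslash = b"\\" == b"\x5c" == b"\134"

# Encoding with a codec that cannot represent the character raises.
try:
    ch.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(encoded_repr, backslash_repr, same_backslash, ascii_failed)
```

The hex escapes stand for one byte each, while the doubled backslash
stands for one real backslash byte in the data.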