On Thu, May 12, 2011 11:22 am, harrismh777 wrote:
> John Machin wrote:
>> (1) You cannot work without using bytes sequences. Files are byte
>> sequences. Web communication is in bytes. You need to (know / assume /
>> be able to extract / guess) the input encoding. You need to encode your
>> output using an encoding that is expected by the consumer (or use an
>> output method that will do it for you).
>>
>> (2) You don't need to use bytes to specify a Unicode code point. Just
>> use an escape sequence e.g. "\u0404" is a Cyrillic character.
>>
>
> Thanks John. In reverse order, I understand point (2). I'm less clear
> on point (1).
>
> If I generate a string of characters that I presume to be ascii/utf-8
> (no \u0404 type characters)
> and write them to a file (stdout) how does
> default encoding affect that file.by default..? I'm not seeing that
> there is anything unusual going on...
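To make points (1) and (2) concrete, here is a minimal sketch, assuming Python 3 (where a plain string literal is already Unicode). The variable names are illustrative only and do not come from the thread. It shows that the escape names a code point without any bytes being involved, and that bytes only appear once the text is encoded.

```python
import unicodedata

# Point (2): the escape sequence names a code point directly.
ch = "\u0404"
name = unicodedata.name(ch)

# Point (1): bytes appear only when the text is encoded for a file,
# a socket, or a terminal.
utf8_bytes = ch.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# Decoding the same bytes with a different encoding does not raise here;
# it quietly produces different characters.
wrong = utf8_bytes.decode("cp1252")

print(name)
print(utf8_bytes)
print(round_trip == ch, wrong == ch)
```

The one code point becomes two bytes in UTF-8, and the consumer gets the original character back only if it decodes with the same encoding.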
About """characters that I presume to be ascii/utf-8 (no \u0404 type
characters)""": All Unicode characters (including U+0404) are encodable in
bytes using UTF-8.

The result of sys.stdout.write(unicode_characters) to a TERMINAL depends
mostly on sys.stdout.encoding. This is likely to be UTF-8 on a Linux/OS X
platform. On a typical American / Western European / [former] colonies
Windows box, this is likely to be cp850 in a Command Prompt window, and
cp1252 in IDLE.

UTF-8: All Unicode characters are encodable in UTF-8. The only problem
arises if the terminal can't render the character -- you'll get spaces or
blobs or boxes with hex digits in them or nothing.

Windows (Command Prompt window): only a small subset of characters can be
encoded in e.g. cp850; anything else causes an exception.

Windows (IDLE): ignores sys.stdout.encoding and renders the characters
itself. Same outcome as *x/UTF-8 above.

If you write directly (or sys.stdout is redirected) to a FILE, the default
encoding is obtained by sys.getdefaultencoding() and is AFAIK ascii unless
the machine's site.py has been fiddled with to make it UTF-8 or something
else.

> If I open the file with vi? If
> I open the file with gedit? emacs?

Any editor will have a default encoding; if that doesn't match the file
encoding and the editor doesn't detect the mismatch, you have a (hopefully
obvious) problem. Consult your editor's docs or HTFF1K.

> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe? ... guessing, are these unicode code
> points,

yes

> and if so what is the best way to 'guess' the encoding?

google("chardet") or rummage through the mail headers (but 4 hex digits in
a box are a symptom of inability to render, not necessarily caused by an
incorrect decoding)

> ... is
> it coded in the stream somewhere...protocol?

Should be.

-- 
http://mail.python.org/mailman/listinfo/python-list
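A rough way to check the terminal and file behaviour described above on your own machine, as a sketch that assumes Python 3. The helper can_encode and the file name demo.txt are made up for illustration. The values printed by the first three lines vary by platform and Python version; on Python 3, as far as I know, open() without an encoding argument takes its default from the locale rather than from sys.getdefaultencoding(), so the sketch names the encoding explicitly.

```python
import locale
import os
import sys
import tempfile


def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


text = "\u0404"  # the Cyrillic character from the thread

# What this interpreter reports; do not assume the same values elsewhere.
print("stdout encoding:  ", sys.stdout.encoding)
print("default encoding: ", sys.getdefaultencoding())
print("locale preference:", locale.getpreferredencoding(False))

# UTF-8 can encode the character; the narrower code pages cannot.
for enc in ("utf-8", "cp850", "cp1252", "ascii"):
    print(enc, can_encode(text, enc))

# Naming the encoding when writing a file avoids relying on any default.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading the file back in binary mode shows the bytes actually stored.
with open(path, "rb") as f:
    raw = f.read()
print(raw)
```

Checking with can_encode before writing is one way to avoid the exception mentioned for cp850; passing errors="replace" to encode() is another, at the cost of losing the character.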