Piet van Oostrum wrote:
Dave Angel <da...@ieee.org> (DA) wrote:
[snip]
DA> Thanks for the correction. What I meant by "works for me" is that the
DA> single example in the docstring translated okay. But I do have a lot to
DA> learn about using Unicode in sources, and I want to learn.
DA> So tell me, how were we supposed to guess what encoding the original
DA> message used? I originally had the mailing list message (in Thunderbird
DA> email). When I copied (copy/paste) to Komodo IDE (text editor), it wouldn't
DA> let me save because the file type was ASCII. So I randomly chose latin-1
DA> for file type, and it seemed to like it.
You can see the encoding of the message in its headers. But it is not
important, as the Unicode characters you see are what matters. You
just copy and paste them into your Python file. The Python file does not
have to use the same encoding as the message from which you pasted. The
editor will do the proper conversion. (If it doesn't throw it away
immediately.) Only for the Python file you must choose an encoding that
can encode all the characters that are in the file. In this case utf-8
is the only reasonable choice, but if there are only latin-1 characters
in the file then of course latin-1 (iso-8859-1) will also be good.
Any decent editor will only allow you to save in an encoding that can
encode all the characters in the file, otherwise you will lose some
characters.
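As a rough illustration of that rule, here is a minimal sketch (assuming Python 3, where str is already Unicode) that tries a few candidate encodings and reports which ones can hold every character of a text. The helper name encodings_that_fit is made up for this example; it is not part of any library.

```python
def encodings_that_fit(text, candidates=("ascii", "latin-1", "utf-8")):
    """Return the candidate encodings that can represent every character in text."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)          # raises if any character does not fit
        except UnicodeEncodeError:
            continue
        usable.append(name)
    return usable


print(encodings_that_fit("plain ASCII"))    # only ASCII characters
print(encodings_that_fit("caf\u00e9"))      # e-acute is in latin-1 but not ASCII
print(encodings_that_fit("\u03b1\u03b2"))   # Greek alpha, beta: not in latin-1
```

An editor that refuses to save is doing essentially this check on the whole buffer.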
Because Python must also know which encoding you used, and this cannot
reliably be deduced from the file contents, you need the coding
declaration. And it must be the same as the encoding in which the file
is saved, otherwise Python will see something different than you saw in
your editor. Sooner or later this will give you a big headache.
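The headache can be simulated without touching any source file. The sketch below (again assuming Python 3) only imitates what happens when the bytes on disk and the declared encoding disagree; it uses plain encode/decode calls and is not how the interpreter itself reads a source file.

```python
# A declaration at the top of a real source file would look like:
#   # -*- coding: utf-8 -*-
# Here the mismatch is only imitated with encode/decode.

source_text = "caf\u00e9"  # the characters visible in the editor

# Case 1: the editor saves UTF-8, but the reader assumes latin-1.
saved_as_utf8 = source_text.encode("utf-8")
misread = saved_as_utf8.decode("latin-1")
print(repr(misread))  # the single e-acute has turned into two junk characters

# Case 2: the editor saves latin-1, but the reader assumes UTF-8.
saved_as_latin1 = source_text.encode("latin-1")
try:
    saved_as_latin1.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)  # the lone 0xE9 byte is not valid UTF-8
```

The first case is the nasty one, because nothing fails: you silently get different characters than the ones you typed.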
DA> At that point I expected and got errors from Python because I had no coding
DA> declaration. I used latin-1, and still had problems, though I forget what
DA> they were. Only when I changed the file encoding type again, to utf-8, did
DA> the errors go away. I agree that they should agree, but I don't know how to
DA> reconcile the copy/paste boundary, the file type (without BOM, which is
DA> another variable), the coding declaration, and the stdout implicit ASCII
DA> encoding. I understand a bunch of it, but not enough to be able to safely
DA> walk through the choices.
DA> Is this all written up in one place, to where an experienced programmer can
DA> make sense of it? I've nibbled at the edges (even wrote a UTF-8
DA> encoder/decoder a dozen years ago).
I don't know a place. Usually utf-8 is a safe bet but in some cases can
be overkill. And then in your Python input/output (read/write) you may
have to use a different encoding if the programs that you have to
communicate with expect something different.
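For example, if some other program insists on latin-1, you can keep the source file in utf-8 and still write that program's input in latin-1. A minimal sketch, assuming Python 3 and a throwaway file name (legacy.txt) chosen only for this illustration:

```python
import os
import tempfile

text = "caf\u00e9"  # Unicode text inside the program

# Hypothetical output file for a consumer that expects latin-1.
path = os.path.join(tempfile.mkdtemp(), "legacy.txt")

# The encoding argument controls the bytes on disk,
# independently of the encoding of the source file.
with open(path, "w", encoding="latin-1") as f:
    f.write(text)

with open(path, "rb") as f:
    raw = f.read()          # the bytes the other program will see

with open(path, "r", encoding="latin-1") as f:
    round_trip = f.read()   # decoding with the same encoding restores the text

print(raw)
print(round_trip == text)
```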
I know what I was missing. The copy/paste must be doing it in pure
Unicode. And the in-memory version of the source text is in Unicode.
So the text editor's encoding affects how that Unicode is encoded into 8
bit bytes for the file (and how it will be reloaded next time). OK,
that seems to make sense.
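A small sketch of that idea, assuming Python 3: one in-memory character, and the different byte sequences that different file encodings would produce for it. The utf-8-sig codec is included only to show what the BOM variant adds.

```python
text = "\u00e9"  # one Unicode character (e-acute) held in memory

# The same character as it would be stored under several file encodings.
encoded = {name: text.encode(name)
           for name in ("latin-1", "utf-8", "utf-8-sig", "utf-16-le")}

for name, data in encoded.items():
    print(name, data, len(data))
```

Reloading only gives back the same character if the bytes are decoded with the encoding that produced them.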
I know that the clipboard has type tags, but I haven't looked at them in
so long that I forget what they look like. For text, is it just ASCII
and Unicode? Or are there other possible encodings that the source and
sink negotiate?
Thanks for the clear explanation.
DaveA