On 01/13/2015 10:26 PM, Peng Yu wrote:
Hi,


First, you should always specify your Python version and OS version when asking questions here. Even if you've asked here before, many of us cannot keep track of everyone's setup, and need a standard place to look: the head of the current thread.

I'll assume you're using Python 2.7, on Linux or equivalent.

I am trying to understand what does encode() do. What are the hex
representations of "u" in main.py? Why there is UnicodeEncodeError
when main.py is piped to xxd? Why there is no such error when it is
not piped? Thanks.

~$ cat main.py
#!/usr/bin/env python

u = unichr(40960) + u'abcd' + unichr(1972)
print u

The Unicode characters in 'u' must be encoded to a byte stream before being sent to the standard output device. How they're encoded depends on the device, and what Python knows (or thinks it knows) about it.

~$ cat main_encode.py
#!/usr/bin/env python

u = unichr(40960) + u'abcd' + unichr(1972)
print u.encode('utf-8')

Here, print is given a byte string and sends it to a byte device as-is, without second-guessing anything.

$ ./main.py
ꀀabcd޴
~$ cat main.sh
#!/usr/bin/env bash

set -v
./main.py | xxd
./main_encode.py | xxd

~$ ./main.sh
./main.py | xxd
Traceback (most recent call last):
   File "./main.py", line 4, in <module>
     print u
UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in
position 0: ordinal not in range(128)
./main_encode.py | xxd
0000000: ea80 8061 6263 64de b40a                 ...abcd...
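
As for the hex representation you asked about, here is a minimal sketch that reproduces the bytes xxd shows above. I've written the string with escapes instead of unichr() so it should run on both 2.7 and 3.3+; the names code_points and hex_bytes are just mine for illustration.

```python
import binascii

# Same text as main.py: U+A000, 'abcd', U+07B4 (40960 and 1972 in decimal).
u = u'\ua000' + u'abcd' + u'\u07b4'

# Encoding turns the text into bytes; here we pick UTF-8 explicitly.
encoded = u.encode('utf-8')

# Code point of each character, and the UTF-8 bytes as a hex string.
code_points = ['U+%04X' % ord(ch) for ch in u]
hex_bytes = binascii.hexlify(encoded).decode('ascii')

print(code_points)
print(hex_bytes)
```

The trailing 0a in the xxd dump is not part of 'u'; it is the newline that print appends.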


I'm guessing (since I already guessed you're running on Linux) that when you run ./main.py directly, you're printing to a terminal window that Python already knows is UTF-8.

But in the pipe case, it cannot tell what's on the other side. So it guesses ASCII, and runs into the conversion problem.
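
A rough way to see the same failure without a pipe is to do the ASCII encoding by hand. This is only a sketch of what I believe print is effectively doing; try_encode is a hypothetical helper of mine, not anything from the standard library. As far as I know, Python 2.7 reports None for sys.stdout.encoding when stdout is a pipe, and then falls back to ASCII.

```python
import sys

u = u'\ua000' + u'abcd' + u'\u07b4'

def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# What the interpreter believes stdout wants; written to stderr so it is
# visible even when stdout is piped into another program.
stdout_encoding = getattr(sys.stdout, 'encoding', None)
sys.stderr.write('stdout encoding: %r\n' % (stdout_encoding,))

ascii_result = try_encode(u, 'ascii')   # None: U+A000 is not in range(128)
utf8_result = try_encode(u, 'utf-8')    # works: UTF-8 covers every code point
```

Running this once directly and once piped into xxd should show how the reported encoding changes between the two cases.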

(Everything's different in Python 3.x, though in general the problem still exists. If the interpreter cannot tell what encoding is needed, it has to guess.)

There are ways to tell Python 2.7 what encoding a given file object should have, so you could tell Python to use utf-8 for sys.stdout. I don't know if that's the best answer, but here's what my notes say:

    import sys, codecs
    sys.stdout = codecs.getwriter('utf8')(sys.stdout)

Once you've done that, print output will go through the specified codec on the way to the redirected pipe.
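
To check what that wrapper does without touching the real sys.stdout, here is a small sketch that wraps an in-memory byte buffer instead. The buffer is my stand-in (an assumption, purely so the result can be inspected); the codecs.getwriter call is the same one as in my notes above.

```python
import codecs
import io

u = u'\ua000' + u'abcd' + u'\u07b4'

# Stand-in for a byte-oriented stdout, so we can look at what was written.
raw = io.BytesIO()

# getwriter returns a StreamWriter class; calling it wraps the byte stream.
writer = codecs.getwriter('utf8')(raw)

# Text goes in, UTF-8 bytes come out on the underlying stream.
writer.write(u + u'\n')
written = raw.getvalue()
```

On Python 3, sys.stdout is already a text stream, so there you would wrap sys.stdout.buffer rather than sys.stdout itself.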


--
DaveA

--
https://mail.python.org/mailman/listinfo/python-list
