On May 28, 11:08 am, [EMAIL PROTECTED] wrote:
> Hi,
>
> I have problems getting my Python code to work with UTF-8 encoding
> when reading from stdin / writing to stdout.
>
> Say I have a file, utf8_input, that contains a single character, é,
> coded as UTF-8:
>
>     $ hexdump -C utf8_input
>     00000000 c3 a9
>     00000002
>
> If I read this file by opening it in this Python script:
>
>     $ cat utf8_from_file.py
>     import codecs
>     file = codecs.open('utf8_input', encoding='utf-8')
>     data = file.read()
>     print "length of data =", len(data)
>
> everything goes well:
>
>     $ python utf8_from_file.py
>     length of data = 1
>
> The contents of utf8_input is one character coded as two bytes, so
> UTF-8 decoding is working here.
>
> Now, I would like to do the same with standard input. Of course, this:
>
>     $ cat utf8_from_stdin.py
>     import sys
>     data = sys.stdin.read()
>     print "length of data =", len(data)
>
> does not work:
>
>     $ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
>     length of data = 2
>
> Here, the contents of utf8_input is not interpreted as UTF-8, so
> Python believes there are two separate characters.
>
> The question, then:
> How could one get utf8_from_stdin.py to work properly with UTF-8?
> (And same question for stdout.)
>
> I googled around, and found rather complex stuff (see, for
> example, http://blog.ianbicking.org/illusive-setdefaultencoding.html), but even
> that didn't work: I still get "length of data = 2" even after
> successively calling sys.setdefaultencoding('utf-8').
>
> -- dave
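One way to approach the stdin/stdout question, offered as a minimal sketch rather than a definitive answer: in Python 2, sys.stdin.read() returns raw bytes, and sys.setdefaultencoding() only affects implicit str/unicode conversions, so it does not change what read() returns. The stream has to be wrapped in a decoder explicitly, much as codecs.open does for a file. The names utf8_reader, utf8_writer and main below are made up for illustration; codecs.getreader and codecs.getwriter are the standard library calls. The getattr fallback is only there so the same code also runs on Python 3, where the raw bytes live in sys.stdin.buffer.

```python
import codecs
import sys


def utf8_reader(byte_stream):
    """Wrap a byte stream so that read() returns decoded text."""
    return codecs.getreader("utf-8")(byte_stream)


def utf8_writer(byte_stream):
    """Wrap a byte stream so that write() takes text and emits UTF-8 bytes."""
    return codecs.getwriter("utf-8")(byte_stream)


def main(stdin_bytes=None, stdout_bytes=None):
    # Python 3 exposes the underlying byte streams as .buffer; on Python 2
    # sys.stdin and sys.stdout are already byte streams, so use them directly.
    if stdin_bytes is None:
        stdin_bytes = getattr(sys.stdin, "buffer", sys.stdin)
    if stdout_bytes is None:
        stdout_bytes = getattr(sys.stdout, "buffer", sys.stdout)

    data = utf8_reader(stdin_bytes).read()   # text, one item per character
    out = utf8_writer(stdout_bytes)
    out.write(u"length of data = %d\n" % len(data))

# Call main() at the bottom of utf8_from_stdin.py to use it as a filter.
```

With a call to main() added at the end, running "python utf8_from_stdin.py < utf8_input" on the two-byte file above should report a length of 1 instead of 2. The same writer wrapper covers the stdout half of the question: text written through it is encoded to UTF-8 on the way out.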
The weird thing is that 'c3 a9' is é on my side... and copy/pasting the é gives me 'e9'. With that input, the first script gives a result of zero and the second script gives me 1.
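A possible explanation for those numbers, stated as an assumption about what happened rather than something the thread confirms: a lone 'e9' byte is é in Latin-1 (or a similar single-byte code page), not in UTF-8. In UTF-8, 0xE9 is the lead byte of a three-byte sequence, so a UTF-8 stream reader treats the file as truncated, keeps the byte buffered, and returns no characters, while the byte-oriented stdin script simply counts one byte. The helper name decoded_length is hypothetical.

```python
import codecs
import io

E_ACUTE = u"\xe9"  # the character é, U+00E9

utf8_bytes = E_ACUTE.encode("utf-8")      # two bytes: c3 a9
latin1_bytes = E_ACUTE.encode("latin-1")  # one byte: e9


def decoded_length(raw):
    """Number of characters a UTF-8 stream reader returns for raw bytes."""
    reader = codecs.getreader("utf-8")(io.BytesIO(raw))
    return len(reader.read())


# decoded_length(utf8_bytes) is 1: c3 a9 is a complete UTF-8 sequence.
# decoded_length(latin1_bytes) is 0: e9 alone is an incomplete sequence,
# so the reader holds it back instead of producing a character.
# len(latin1_bytes) is 1, which matches the count from the stdin script.
```

If this is what happened, saving the pasted character explicitly as UTF-8 should bring the file back to 'c3 a9' and the first script back to a length of 1.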