On 2013-06-29, fob...@gmail.com wrote:
> Hi,
>
> I am trying to use a program called MeCab, which does syntax analysis on
> Japanese text. The problem I am having is that it returns a byte string
> and if I try to print it, it prints question marks for almost all
> characters. However, if I try to use .decode, it throws an error. Here
> is my code:
>
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
>
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCab????????????????????????'
>
> result = tagger.parse(text)
> print result
>
> result = result.decode('utf-8')
> print result
>
> And here is the output:
>
> MeCab ?????? ?????? ?????????????????? ?????? ????????????
>
> Traceback (most recent call last):
>   File "test.py", line 11, in <module>
>     result = result.decode('utf-8')
>   File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
> invalid continuation byte
>
>
> ------------------
> (program exited with code: 1)
> Press return to continue
>

Find out what tagger.parse actually returns. Your program assumes it is
a byte string containing the UTF-8 encoded representation of some text,
but the UnicodeDecodeError shows that this assumption is wrong.
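
One way to check that assumption is to look at the raw bytes with repr()
and see which encodings can decode them at all. The following is only a
rough sketch: try_decodings is a hypothetical helper (not part of MeCab),
the candidate list is a guess at encodings commonly used for Japanese
text, and the sample bytes merely stand in for the value returned by
tagger.parse(text).

```python
# -*- coding: utf-8 -*-
# Hypothetical helper, not part of MeCab: report which candidate
# encodings can decode a byte string without raising an error.
def try_decodings(raw, candidates=("utf-8", "euc_jp", "shift_jis")):
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results

# Stand-in for tagger.parse(text): Japanese text encoded as EUC-JP
# (an assumption made for illustration only).
sample = u"\u65e5\u672c\u8a9e".encode("euc_jp")

print(repr(sample))        # the raw bytes, not a rendering of them
decoded = try_decodings(sample)
print(sorted(decoded))     # names of the encodings that succeeded
```

A successful decode does not prove the encoding is the right one, since
several codecs may accept the same bytes. It only narrows the choices;
the decoded text still has to be checked by eye.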