New submission from Devin Jeanpierre <jeanpierr...@gmail.com>:

tokenize only deals with bytes. Users might want to deal with unicode source (for example, if Python source is embedded into a document with an already-known encoding).
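For reference, a minimal sketch of how the current bytes-only API behaves (the latin-1 snippet below is just illustrative): tokenize.tokenize() takes a readline callable that returns bytes and does the decoding itself, trusting whatever PEP 263 coding cookie it finds in the byte stream.

    import io
    import tokenize

    # Illustrative source: the coding cookie, not the caller, decides the decoding.
    source_bytes = "# coding: latin-1\ns = 'café'\n".encode('latin-1')
    for tok in tokenize.tokenize(io.BytesIO(source_bytes).readline):
        print(tok)
    # The first token emitted is ENCODING ('iso-8859-1' here), taken from the
    # cookie in the bytes themselves.

So a caller who already has decoded text has to round-trip it back through bytes in a way that agrees with whatever cookie the source declares.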
The naive approach might be something like:

    def my_readline():
        return my_oldreadline().encode('utf-8')

But this doesn't work for Python source that declares its encoding, which might be something other than utf-8. The only safe ways are to either manually add a coding line yourself (there are lots of ways; I picked a dumb one):

    def my_readline_safe(was_read=[]):
        if not was_read:
            was_read.append(True)
            return b'# coding: utf-8'
        return my_oldreadline().encode('utf-8')

    tokenstream = tokenize.tokenize(my_readline_safe)

Or to use the same my_readline as before (no added coding line) and, instead of passing it to tokenize.tokenize, pass it to the undocumented _tokenize function:

    tokenstream = tokenize._tokenize(my_readline, 'utf-8')

Or, ideally, you'd just pass the original readline that produces unicode into a utokenize function:

    tokenstream = tokenize.utokenize(my_oldreadline)

----------
components: Library (Lib)
messages: 139733
nosy: Devin Jeanpierre
priority: normal
severity: normal
status: open
title: tokenize module should have a unicode API

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12486>
_______________________________________