New submission from Devin Jeanpierre <jeanpierr...@gmail.com>:

tokenize only deals with bytes. Users might want to tokenize unicode source 
directly (for example, when Python source is embedded in a document whose 
encoding is already known).
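
For reference, the existing bytes-level API looks like this (a minimal, 
runnable sketch):

    import io, tokenize

    # readline must return bytes; tokenize detects the encoding itself
    # (BOM or PEP 263 coding cookie, falling back to utf-8)
    source = b"x = 1\n"
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        print(tok)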

The naive approach might be something like:

  def my_readline():
      # wrap an existing str-producing readline so tokenize gets bytes
      return my_oldreadline().encode('utf-8')
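
This breaks as soon as the source declares its own encoding. A 
self-contained sketch of the failure (the latin-1 sample source is my own 
illustration):

    import io, tokenize

    source = "# coding: latin-1\nname = 'café'\n"
    buf = io.BytesIO(source.encode('utf-8'))  # what the naive wrapper produces
    for tok in tokenize.tokenize(buf.readline):
        print(tok)  # the STRING token comes out as 'cafÃ©' (mojibake)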

The problem is that tokenize honors the declared coding cookie: it decodes 
the byte stream as latin-1, while the wrapper encoded it as utf-8, so any 
non-ASCII text is silently mangled. The only safe ways are to either manually 
add a coding line yourself (there are lots of ways, I picked a deliberately 
dumb one):

  def my_readline_safe(was_read=[]):
      # first call: inject a coding line so detect_encoding settles on
      # utf-8 regardless of what the source itself declares
      if not was_read:
          was_read.append(True)
          return b'# coding: utf-8\n'
      return my_oldreadline().encode('utf-8')

  tokenstream = tokenize.tokenize(my_readline_safe)

Or you could use the same my_readline as before (no added coding line) and, 
instead of passing it to tokenize.tokenize, pass it to the undocumented 
_tokenize function:

    tokenstream = tokenize._tokenize(my_readline, 'utf-8')
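
This works today, but _tokenize is private and could change without notice; 
a sketch, assuming its current signature (a bytes-producing readline plus an 
explicit encoding):

    import io, tokenize

    source = "name = 'café'\n"
    buf = io.BytesIO(source.encode('utf-8'))
    # no encoding detection happens here; the caller's encoding is trusted
    for tok in tokenize._tokenize(buf.readline, 'utf-8'):
        print(tok)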

Or, ideally, you'd just pass the original readline that produces unicode into a 
utokenize function:

    tokenstream = tokenize.utokenize(my_oldreadline)
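
utokenize doesn't exist yet; as a hypothetical sketch, it could be little 
more than a wrapper that pins the encoding and delegates to _tokenize:

    import io, tokenize

    def utokenize(readline):
        # hypothetical: accept a str-producing readline and feed
        # _tokenize utf-8 bytes with a matching fixed encoding, so any
        # coding cookie in the source becomes irrelevant
        return tokenize._tokenize(lambda: readline().encode('utf-8'),
                                  'utf-8')

    for tok in utokenize(io.StringIO("x = 1\n").readline):
        print(tok)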

----------
components: Library (Lib)
messages: 139733
nosy: Devin Jeanpierre
priority: normal
severity: normal
status: open
title: tokenize module should have a unicode API

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12486>
_______________________________________