New submission from Eric Snow <ericsnowcurren...@gmail.com>:

(see http://mail.python.org/pipermail/python-dev/2012-April/118889.html)

The behavior of tokenize.detect_encoding() and 
PyTokenizer_FindEncodingFilename() is unexpectedly different and this has 
bearing on the current work on imports.

When a file has no encoding indicator (see PEP 263), its encoding falls back 
to UTF-8 (see PEP 3120).  The tokenize module (Lib/tokenize.py) implements this 
in "detect_encoding()".  The CPython internal tokenizer (Python/tokenizer.c) 
does so in "PyTokenizer_FindEncodingFilename()".  Both check the first two 
lines of the file, per PEP 263.
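
To make the detection rule concrete, here is a minimal sketch using only the 
tokenize side.  The helper name "detected" and the sample sources are made up 
for illustration; they are not taken from the test suite.

```python
import io
import tokenize


def detected(source: bytes) -> str:
    """Return the encoding tokenize.detect_encoding() reports for source."""
    # detect_encoding() expects a readline callable that returns bytes.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# Illustrative sources (assumptions, not files from the CPython repo).
no_cookie = b"x = 1\n"
cookie_on_line_1 = b"# -*- coding: latin-1 -*-\nx = 1\n"
cookie_on_line_2 = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"
cookie_on_line_3 = b"#!/usr/bin/env python\n# comment\n# -*- coding: latin-1 -*-\n"

for name, src in [
    ("no cookie", no_cookie),
    ("cookie on line 1", cookie_on_line_1),
    ("cookie on line 2", cookie_on_line_2),
    ("cookie on line 3", cookie_on_line_3),
]:
    print(name, "->", detected(src))
```

A cookie on the third line is ignored, so that case falls back to the default 
just like the file with no cookie at all.  Note that tokenize reports 
"latin-1" under its normalized name, "iso-8859-1".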

When faced with a file that cannot be decoded using the detected encoding, 
tokenize.detect_encoding() will gladly give you the encoding without any fuss.  
However, PyTokenizer_FindEncodingFilename() will raise a SyntaxError in that 
situation.

The 'badsyntax_pep3120' test (Lib/test/badsyntax_pep3120.py) is one module that 
demonstrates this discrepancy.  I'll use it in the following example.

 ---

For tokenize.detect_encoding():

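  import tokenize
  # detect_encoding() needs a readline that returns bytes, so open in binary mode.
  with open("cpython/Lib/test/badsyntax_pep3120.py", "rb") as f:
      enc, _ = tokenize.detect_encoding(f.readline)  # returns (encoding, lines)
  print(enc)  # "utf-8" (no SyntaxError)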

For PyTokenizer_FindEncodingFilename():

I've attached the source for a C extension module ('_tokenizer') that wraps 
PyTokenizer_FindEncodingFilename().

  import _tokenizer
  enc = _tokenizer.detect_encoding("cpython/Lib/test/badsyntax_pep3120.py")
  print(enc)  # raises SyntaxError

 ---
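
For anyone without a CPython checkout or the attached extension, the following 
is a rough, self-contained approximation of the comparison.  It makes two 
assumptions: BAD_SOURCE is a made-up stand-in for the test file (a Latin-1 
byte and no coding line), and the built-in compile() stands in for the C 
tokenizer path, which is not the same entry point as 
PyTokenizer_FindEncodingFilename().  What detect_encoding() does with such 
input depends on the Python version in use, so the sketch reports the outcome 
rather than assuming it.

```python
import io
import tokenize

# Hypothetical stand-in for the test module: one Latin-1 encoded byte (0xF6)
# inside a string literal, and no PEP 263 coding line.
BAD_SOURCE = b'print("b\xf6se")\n'


def outcome(func):
    """Return func()'s result, or the name of the exception it raised."""
    try:
        return func()
    except (SyntaxError, UnicodeDecodeError) as exc:
        return type(exc).__name__


def via_tokenize():
    # Pure-Python detection, reading from an in-memory bytes buffer.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(BAD_SOURCE).readline)
    return encoding


def via_compile():
    # Compiling bytes sends the source through the C tokenizer and parser.
    compile(BAD_SOURCE, "<bad_source>", "exec")
    return "compiled"


print("tokenize.detect_encoding():", outcome(via_tokenize))
print("compile():", outcome(via_compile))
```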

Some relevant, related notes:

The discrepancies extend further too.  The following code raises a 
UnicodeDecodeError, rather than a SyntaxError:

  
  list(tokenize.tokenize(open("/home/esnow/projects/import_cleanup/Lib/test/badsyntax_pep3120.py", "rb").readline))

In 3.1 (C-based import machinery, Python/import.c), the import below fails with 
a SyntaxError during encoding detection.  At the current repo tip 
(importlib-based import machinery, Lib/importlib/_bootstrap.py), it fails with 
a SyntaxError much later, during compilation.

  import test.badsyntax_pep3120

importlib uses tokenize.detect_encoding() and import.c uses 
PyTokenizer_FindEncodingFilename()...
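
One way a pure-Python caller could get the earlier failure back is to decode 
the whole source right after detection and report bad bytes as a SyntaxError.  
This is only a sketch of the idea: "detect_encoding_strict" is a hypothetical 
helper, not an existing function in tokenize or importlib, and not a proposed 
patch.

```python
import io
import tokenize


def detect_encoding_strict(source: bytes) -> str:
    """Detect the encoding, then check that the whole source decodes with it.

    Hypothetical helper for illustration; not part of tokenize or importlib.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    try:
        source.decode(encoding)
    except UnicodeDecodeError as exc:
        # Surface undecodable bytes as a SyntaxError at detection time.
        raise SyntaxError(f"source is not valid {encoding}: {exc}") from exc
    return encoding
```

With this check, a file whose bytes do not match its detected encoding fails 
at detection time regardless of which line the bad byte is on, while a file 
that declares a matching encoding still passes.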

----------
components: Library (Lib)
files: _tokenizer.c
messages: 158797
nosy: brett.cannon, eric.snow, loewis
priority: normal
severity: normal
status: open
title: discrepency between tokenize.detect_encoding() and 
PyTokenizer_FindEncodingFilename()
type: behavior
versions: Python 3.3
Added file: http://bugs.python.org/file25283/_tokenizer.c

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue14629>
_______________________________________