[issue26152] A non-breaking space in a source

Andrew Barnert Wed, 20 Jan 2016 14:03:12 -0800

Andrew Barnert added the comment:

> Assuming Andrew is correct, it sounds like the tokenizer is treating the NBSP 
> and the “2” as part of the same token, because NBSP is non-ASCII.


It's more complicated than that. When the tokenizer hits an invalid character, 
it splits the token up. So, in this case, you get a separate `ERRORTOKEN` from 
cols 2-3 and a `NUMBER` token from cols 3-4. Even in the case of 
`1, a\xa0\xa02`, you get a `NAME` token for the `a`, a separate `ERRORTOKEN` 
for each nbsp, and a `NUMBER` token for the `2`.

But I think the code that generates the `SyntaxError` must be trying to 
re-generate the "intended token" from the broken one. For example:

    >>> eval('1\xa0\xa02a')
    File "<string>", line 1
      1  2a
          ^
    SyntaxError: invalid character in identifier

And if you capture the error and look at it, `e.args[1][1:3]` (the line number 
and offset) is `(1, 5)`, which matches where the caret points.
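
For anyone who wants to poke at this, here is a minimal sketch of capturing the error and reading the location off it. The helper name `syntax_error_location` is made up for illustration; it reads `lineno` and `offset`, which carry the same two values as `e.args[1][1:3]`. The exact message and offset you get back may differ between Python versions.

```python
def syntax_error_location(source):
    """Compile `source` as an expression and report where it fails.

    Returns (lineno, offset, msg) from the SyntaxError, or None if the
    source compiles cleanly. Hypothetical helper, not part of the stdlib.
    """
    try:
        compile(source, '<string>', 'eval')
    except SyntaxError as exc:
        # exc.lineno and exc.offset are the same values found in
        # exc.args[1][1:3]; the offset is 1-based.
        return exc.lineno, exc.offset, exc.msg
    return None


if __name__ == '__main__':
    # The example from this message: two non-breaking spaces between 1 and 2a.
    print(syntax_error_location('1\xa0\xa02a'))
```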

But if you tokenize it, e.g., with 
`list(tokenize.tokenize(io.BytesIO('1\xa0\xa02a'.encode('utf-8')).readline))` 
(you'll probably want to wrap that up in a function if you're playing with it 
a lot), you get a `NUMBER` from 0-1, an `ERRORTOKEN` from 1-2, another 
`ERRORTOKEN` from 2-3, a `NUMBER` from 3-4, and a `NAME` from 4-5. So, why does 
the `SyntaxError` point at the `NAME` instead of the first `ERRORTOKEN`? 
Presumably there's some logic that tries to work out that the two 
`ERRORTOKEN`s, the `NUMBER`, and the `NAME` were all intended to be one big 
identifier and points at that instead.
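
A rough sketch of the kind of wrapper function mentioned above, for anyone who wants to repeat the experiment. The name `dump_tokens` and its return shape are invented here. It also catches `tokenize.TokenError` and `SyntaxError`, on the assumption that some Python versions may raise on an invalid character rather than yield an `ERRORTOKEN`; check what your own version does.

```python
import io
import tokenize


def dump_tokens(source):
    """Tokenize a source string and return (tokens, error).

    Each token is reported as (type name, string, start col, end col).
    `error` is None unless the tokenizer raised partway through.
    Hypothetical helper for experimenting, not part of the stdlib.
    """
    tokens = []
    error = None
    readline = io.BytesIO(source.encode('utf-8')).readline
    try:
        for tok in tokenize.tokenize(readline):
            name = tokenize.tok_name[tok.type]
            tokens.append((name, tok.string, tok.start[1], tok.end[1]))
    except (tokenize.TokenError, SyntaxError) as exc:
        # Assumption: depending on the version, bad input may raise
        # here instead of producing ERRORTOKEN entries.
        error = exc
    return tokens, error


if __name__ == '__main__':
    for src in ('1\xa0\xa02a', '1, a\xa0\xa02'):
        toks, err = dump_tokens(src)
        print(repr(src))
        for t in toks:
            print('   ', t)
        if err is not None:
            print('    raised:', repr(err))
```

Comparing the columns printed here with the offset stored on the `SyntaxError` is the quickest way to see the mismatch described above.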

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26152>
_______________________________________