Andrew Barnert added the comment:
> Assuming Andrew is correct, it sounds like the tokenizer is treating the NBSP
> and the “2” as part of the same token, because NBSP is non-ASCII.
It's more complicated than that. When the tokenizer hits an invalid character,
it splits the token up. So, in this case, you get a separate `ERRORTOKEN` from
cols 2-3 and a `NUMBER` token from cols 3-4. Even in the case of
`1, a\xa0\xa02`, you get a `NAME` token for the `a`, a separate `ERRORTOKEN`
for each NBSP, and a `NUMBER` token for the `2`.
But I think the code that generates the `SyntaxError` must be trying to
re-generate the "intended token" from the broken one. For example:
    >>> eval('1\xa0\xa02a')
      File "<string>", line 1
        1  2a
            ^
    SyntaxError: invalid character in identifier
And if you capture the error and look at it, `e.args[1][1:3]` is `(1, 5)`
(line 1, offset 5), which matches where the caret points.
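
For anyone who wants to check this on their own build, here is a minimal sketch of capturing the error and pulling out the position. The name `syntax_error_location` is made up for illustration, and it uses `compile(..., "eval")` rather than `eval` so that nothing is executed, on the assumption that both go through the same parser. The message and offset you get depend on the Python version, so don't expect exactly `(1, 5)` everywhere.

```python
def syntax_error_location(source):
    """Return the (lineno, offset) pair reported for `source`, or None.

    Hypothetical helper for illustration only; it is not part of any
    standard library API.
    """
    try:
        # Compile in "eval" mode: parses the expression without running it.
        compile(source, "<string>", "eval")
    except SyntaxError as e:
        # e.args is (msg, (filename, lineno, offset, text, ...)),
        # so [1:3] picks out lineno and offset.
        return e.args[1][1:3]
    return None


if __name__ == "__main__":
    # The offset printed here varies between Python versions.
    print(syntax_error_location('1\xa0\xa02a'))
```

The same pair is also available as `e.lineno` and `e.offset`, which may be easier to read than slicing `e.args`.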
But if you tokenize it, e.g. with
`list(tokenize.tokenize(io.BytesIO('1\xa0\xa02a'.encode('utf-8')).readline))`
(you'll probably want to wrap that up in a function if you're playing with it
a lot), you get a `NUMBER` from 0-1, an `ERRORTOKEN` from 1-2, another
`ERRORTOKEN` from 2-3, a `NUMBER` from 3-4, and a `NAME` from 4-5. So, why does
the `SyntaxError` point at the `NAME` instead of the first `ERRORTOKEN`?
Presumably there's some logic that tries to work out that the two
`ERRORTOKEN`s, `NUMBER`, and `NAME` were all intended to be one big identifier
and points at that instead.
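
A rough sketch of the kind of wrapper function mentioned above, in case it saves someone some typing. The name `dump_tokens` and the tuple layout are just my choices for illustration. One assumption to flag: on some Python versions `tokenize` may raise on invalid input instead of yielding `ERRORTOKEN`s, so the sketch records the exception rather than letting it propagate.

```python
import io
import tokenize


def dump_tokens(source):
    """Return (type name, string, start col, end col) for each token.

    Hypothetical helper for illustration. If tokenizing raises, the
    tokens seen so far are kept and a final ("<raised>", message,
    None, None) entry is appended.
    """
    readline = io.BytesIO(source.encode("utf-8")).readline
    result = []
    try:
        for tok in tokenize.tokenize(readline):
            name = tokenize.tok_name[tok.type]
            result.append((name, tok.string, tok.start[1], tok.end[1]))
    except (tokenize.TokenError, SyntaxError) as exc:
        # Assumption: some versions reject the input outright instead
        # of splitting it into ERRORTOKENs.
        result.append(("<raised>", str(exc), None, None))
    return result


if __name__ == "__main__":
    # Output differs between Python versions for the invalid inputs.
    for src in ('1\xa0\xa02a', '1, a\xa0\xa02'):
        print(repr(src))
        for entry in dump_tokens(src):
            print("   ", entry)
```

Comparing the column of the first `ERRORTOKEN` in this output with the offset in the `SyntaxError` is a quick way to see the mismatch described above on whatever version you are running.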
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26152>
_______________________________________