Andrew Barnert added the comment:

> Assuming Andrew is correct, it sounds like the tokenizer is treating the NBSP
> and the “2” as part of the same token, because NBSP is non-ASCII.

It's more complicated than that. When the tokenizer hits an invalid character, it splits the token up. So, in this case, you get a separate `ERRORTOKEN` from cols 2-3 and a `NUMBER` token from cols 3-4. Even in the case of `1, a\xa0\xa02`, you get a `NAME` token for the `a`, a separate `ERRORTOKEN` for each nbsp, and a `NUMBER` token for the `2`.

But I think the code that generates the `SyntaxError` must be trying to re-generate the "intended token" from the broken one. For example:

    >>> eval('1\xa0\xa02a')
      File "<string>", line 1
        1  2a
            ^
    SyntaxError: invalid character in identifier

And if you capture the error and look at it, `e.args[1][1:3]` is `(1, 5)`, i.e. line 1, offset 5, which matches what you see.

But if you tokenize it, e.g. with `list(tokenize.tokenize(io.BytesIO('1\xa0\xa02a'.encode('utf-8')).readline))` (you'll probably want to wrap that up in a function if you're playing with it a lot), you get a `NUMBER` from 0-1, an `ERRORTOKEN` from 1-2, another `ERRORTOKEN` from 2-3, a `NUMBER` from 3-4, and a `NAME` from 4-5.

So, why does the `SyntaxError` point at the `NAME` instead of the first `ERRORTOKEN`? Presumably there's some logic that tries to work out that the two `ERRORTOKEN`s, the `NUMBER`, and the `NAME` were all intended to be one big identifier, and points at that instead.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26152>
_______________________________________
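For anyone who wants to reproduce the comparison, here is a minimal sketch of the two checks described in the comment: tokenizing a string and reading the position out of the `SyntaxError`. The helper names `tokenize_string` and `syntax_error_position` are made up for illustration and are not part of the standard library. The token list and the `(1, 5)` position quoted in the comment come from the interpreter version being discussed there; newer versions may report a different message and offset, and may raise from `tokenize` rather than yielding `ERRORTOKEN`, so the sketch records the exception instead of assuming either outcome.

```python
import io
import tokenize


def tokenize_string(source):
    """Return (type name, start col, end col, string) for each token.

    Hypothetical helper: wraps the tokenize.tokenize() one-liner from the
    comment. If the tokenizer raises instead of yielding ERRORTOKEN (an
    assumption about newer Python versions), the exception is recorded
    as a final 'EXCEPTION' row rather than propagated.
    """
    readline = io.BytesIO(source.encode('utf-8')).readline
    rows = []
    try:
        for tok in tokenize.tokenize(readline):
            name = tokenize.tok_name[tok.type]
            rows.append((name, tok.start[1], tok.end[1], tok.string))
    except (tokenize.TokenError, SyntaxError) as exc:
        rows.append(('EXCEPTION', None, None, repr(exc)))
    return rows


def syntax_error_position(source):
    """Return (lineno, offset) from the SyntaxError, or None if it compiles.

    Hypothetical helper: e.args[1] is the details tuple that starts with
    (filename, lineno, offset, ...), so the slice [1:3] is (lineno, offset).
    """
    try:
        compile(source, '<string>', 'eval')
    except SyntaxError as e:
        return e.args[1][1:3]
    return None


if __name__ == '__main__':
    sample = '1\xa0\xa02a'
    for row in tokenize_string(sample):
        print(row)
    print(syntax_error_position(sample))
```

Comparing the columns printed by the first helper with the offset returned by the second shows whether the reported error position lines up with the first bad token or with a later one on a given interpreter.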