[issue40980] group names of bytes regexes are strings

Quentin Wenger Tue, 16 Jun 2020 17:27:16 -0700

Quentin Wenger <[email protected]> added the comment:

I just had an "aha moment": What re claims is that, rather than doing as I 
suggested:


> ```
> # consider the following bytestring pattern
> >>> p = b"(?P<\xc3\xba>)"
> 
> # what character does the group name correspond to?
> # maybe we can try to infer it by decoding the bytestring?
> # let's try to do it with the default encoding... that's natural, right?
> >>> p.decode()
> '(?P<ú>)'
> ```

the actual way to know what group name is represented would be to look at the 
(unicode) string with the same "graphical representation":

```
# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# to discover it, we instead consider the string that "looks the same":
>>> "(?P<\xc3\xba>)"
'(?P<Ãº>)'

# ok so the group name will be "Ãº"
```
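To make the difference concrete, here is a minimal sketch that decodes the same two bytes both ways. It deliberately does not call `re` at all, since how `re` treats non-ASCII group names in bytes patterns depends on the Python version; the variable names are purely illustrative.

```python
# The two bytes used as the group name in the pattern above.
raw_name = b"\xc3\xba"

# Interpreted as UTF-8, the two bytes form one code point: U+00FA ("ú").
as_utf8 = raw_name.decode("utf-8")

# Interpreted as latin-1, every byte becomes the code point with the same
# number: U+00C3 ("Ã") followed by U+00BA ("º").
as_latin1 = raw_name.decode("latin-1")

# latin-1 decoding is exactly the naive "byte value -> code point" mapping.
naive = "".join(chr(byte) for byte in raw_name)

print(len(as_utf8), len(as_latin1), as_latin1 == naive)
```

The lengths differ (1 versus 2) because the latin-1 reading treats each stored byte as if it were already a code point.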

Going from bytes to strings this _naively_ (a mapping that happens to be called 
latin-1) makes, IMHO, as much sense as saying that 0x10, 0b10 and 0o10 should be 
the same value just because they "look the same" in the source code.
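For reference, a small sketch of the analogy: the same digits "10" denote three different values depending on the base prefix. The dictionary name is only illustrative.

```python
# Same digits, different base prefixes, different values.
literals = {"0x10": 0x10, "0b10": 0b10, "0o10": 0o10}

for text, value in literals.items():
    print(text, "->", value)

# int() with base 0 honours the prefix when parsing the text form.
parsed = {text: int(text, 0) for text in literals}
```

This prints 16, 2 and 8 respectively, and `parsed` holds the same three values.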

This is like throwing away everything we ever learned about Unicode and how a 
code point is fundamentally different from what is stored in memory.

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue40980>
_______________________________________
