[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger
Quentin Wenger added the comment: bytes are _not_ Unicode code points, not even in the 256 range. End of the story.

[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger
Quentin Wenger added the comment: If I don't have to think about the str -> bytes direction, re should first stop going in the other direction. When I have bytes regexes I actually don't care about strings and would happily receive group names as bytes. But no, re decides that latin-1 is the

[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger
Quentin Wenger added the comment: Because utf-8 is Python's default encoding, e.g. in source files, decode() and encode(). Literally everywhere. If you ask around "I have a bytestring, I need a string, what do I do?", using latin-1 will not be the first answer (and moreover, the correct answ

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin
Ma Lin added the comment: Why do you always want to use a "utf-8" encoded identifier as the group name in a `bytes` pattern? The direction is: a group name written in a `bytes` pattern is converted to `str`. Not this direction: `str` group name -(utf8)-> `bytes` pattern -> `str` group name

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: I just had an "aha moment": What re claims is that, rather than doing as I suggested: > ``` > # consider the following bytestring pattern > >>> p = b"(?P<\xc3\xba>)" > > # what character does the group name correspond to? > # maybe we can try to infer it by d

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: You questioned my knowledge of encodings. Let's quote from one of the most famous introductory articles on the subject (https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-c

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: The problem can also be played in reverse, maybe it is more telling: ``` # consider the following bytestring pattern >>> p = b"(?P<\xc3\xba>)" # what character does the group name correspond to? # maybe we can try to infer it by decoding the bytestring? # let'

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: And there's no need for a cryptic encoding like cp1250 for this problem to arise. Here is a simple example with Python's default encoding utf-8: ``` >>> a = "ú" >>> b = list(re.match(b"(?P<" + a.encode() + b">)", b"").groupdict())[0] >>> a.isidentifier() True

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: > > this limitation to the latin-1 subset is not compatible with the > > documentation, which says that valid Python identifiers are valid group > > names. > > Not all latin-1 characters are valid identifier, for example: > > >>> '\x94'.encode('latin1')

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin
Ma Lin added the comment: Please look at these: >>> orig_name = "Ř" >>> orig_ch = orig_name.encode("cp1250") # Because why not? >>> orig_ch b'\xd8' >>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0] >>> name 'Ø' # '\xd8' >>> name ==

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin
Ma Lin added the comment: > this limitation to the latin-1 subset is not compatible with the > documentation, which says that valid Python identifiers are valid group names. Not all latin-1 characters are valid identifier, for example: >>> '\x94'.encode('latin1') b'\x94' >>> '\x9

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: I prove my point that the decoding to string is arbitrary: ``` >>> import re >>> orig_name = "Ř" >>> orig_ch = orig_name.encode("cp1250") # Because why not? >>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0] >>> name == orig_name Fa
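
The mismatch described here can be reproduced with the codecs alone, without involving re. This is a minimal sketch, assuming (as the thread states) that the group name bytes are decoded as latin-1; the variable names are illustrative only.

```python
# The same byte decodes to different characters depending on the codec.
orig_name = "Ř"
raw = orig_name.encode("cp1250")      # a single byte in cp1250

as_latin1 = raw.decode("latin-1")     # the decoding the thread attributes to re
as_cp1250 = raw.decode("cp1250")      # the decoding that matches the encoder

print(raw, as_latin1, as_cp1250)
print(as_latin1 == orig_name, as_cp1250 == orig_name)
```

Only the decoder that matches the original encoder recovers the original name, which is the sense in which the choice of latin-1 is called arbitrary above.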

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: > It seems you don't know some knowledge of encoding yet. I don't have to be ashamed of my knowledge of encoding. Yet you are right that I was missing a subtlety, which is that latin-1 is a strict subset of Unicode rather than a completely arbitrary encoding

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin
Ma Lin added the comment: It seems you don't know some knowledge of encoding yet. Naturally, `bytes` cannot contain a character whose Unicode code point is greater than \u00ff. So you can only use the "latin1" encoding, which maps from character to byte (or the reverse) directly. "utf-8", "utf-16" and

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: The issue with the second variant is that utf-8 is an arbitrary (although default) choice. But: re is doing that same arbitrary choice already in decoding the group names into a string, which is my original complaint!

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: Sorry, b"(?P<\xce\x94>)"

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: But Δ has no latin-1 representation. So Δ currently cannot be used as a group name in a bytes regex, although it is a valid Python identifier. So that's a bug. I mean, if you insist on having group names as strings even for bytes regexes, then it is not reasona

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin
Ma Lin added the comment: In this case, you can only use 'latin1', which directly maps one character (\u0000-\u00FF) to/from one byte. If you use 'utf-8', it may map one character to multiple bytes, such as 'Δ' -> b'\xce\x94'. '\x94' is an invalid identifier, so it will raise an error: >>> '\xce
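
The point about 'Δ' can be checked with str.isidentifier() alone. This is a sketch that only exercises the codecs and the identifier test; it does not call re, and the variable names are illustrative.

```python
# 'Δ' is a valid identifier, but its UTF-8 bytes read back as latin-1 are not.
delta = "Δ"
utf8_bytes = delta.encode("utf-8")        # two bytes for one character
as_latin1 = utf8_bytes.decode("latin-1")  # one character per byte

print(utf8_bytes, len(as_latin1))
print(delta.isidentifier(), as_latin1.isidentifier())
```

The second character of the latin-1 reading is '\x94', a control character, which is why the two-character result fails the identifier check.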

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: > So b'\xe9' is mapped to \u00e9, it is `é`. Yes but \xe9 is not strictly valid utf-8, or say not the canonical representation of "é". So there is no way to get \xe9 starting from é without leaving utf-8. So starting with é as group name, I cannot programmati

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin
Ma Lin added the comment: `latin1` is the character set covering the Unicode code points from \u0000 to \u00ff, and its characters are mapped directly from/to bytes. So b'\xe9' is mapped to \u00e9, it is `é`. Of course, characters with a Unicode code point greater than 0xff cannot appear in
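
The direct byte-to-code-point mapping described here can be verified for all 256 byte values. A minimal sketch using only the standard codecs; the variable names are illustrative.

```python
# latin-1 maps every byte value N to the code point U+00NN and back.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

one_to_one = [ord(ch) for ch in decoded] == list(range(256))
round_trip = decoded.encode("latin-1") == all_bytes

# A character above U+00FF has no latin-1 representation at all.
try:
    "Δ".encode("latin-1")
    delta_fits = True
except UnicodeEncodeError:
    delta_fits = False

print(one_to_one, round_trip, delta_fits)
```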

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: Of course an inconvenience in my program is not per se the reason to change the language. I just wanted to motivate that the current situation gives unexpected results. "\xe9" doesn't look like proper utf-8 to me: ``` >>> "é".encode("latin-1") b'\xe9' >>> "é

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin
Ma Lin added the comment: > a non-ascii group name will raise an error in bytes, even if encoded Looks like this is a language limitation: >>> b'é' File "<stdin>", line 1 SyntaxError: bytes can only contain ASCII literal characters. No problem if you use escaped character: >>> re.
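
The literal restriction mentioned here can be observed without breaking a source file by passing the literal to compile(). This is a sketch; the "<example>" filename is just a placeholder label.

```python
# A bytes literal containing a non-ASCII character is rejected by the parser.
try:
    compile("b'é'", "<example>", "eval")
    literal_ok = True
except SyntaxError:
    literal_ok = False

# The escaped form is accepted and denotes one byte.
escaped = b"\xe9"
same_as_latin1 = escaped == "é".encode("latin-1")
utf8_form = "é".encode("utf-8")   # two bytes, different from the escaped form

print(literal_ok, same_as_latin1, utf8_form)
```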

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: should *be a valid name

[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger
Quentin Wenger added the comment: Agreed to some extent, but there is the difference that group names are embedded in the pattern, which has to be bytes if the target is bytes. My use case is in an all-bytes, no-string project where I construct a large regular expression at startup, with semi

[issue40980] group names of bytes regexes are strings

2020-06-15 Thread Ma Lin
Ma Lin added the comment: A group name being `str` is very reasonable. Essentially it is just a name; it has nothing to do with `bytes`. Other names in Python are also of `str` type, such as codec names and hashlib names. -- nosy: +Ma Lin

[issue40980] group names of bytes regexes are strings

2020-06-15 Thread Quentin Wenger
Quentin Wenger added the comment: This also affects functions/methods expecting a group name as a parameter (e.g. match.group): the group name has to be passed as a string.
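
A small sketch of the lookup side, using an ASCII group name so that it does not depend on how non-ASCII names are handled. The rejection of a bytes name is expected to surface as IndexError ("no such group") on CPython as far as I know; TypeError is caught as well in case another implementation differs.

```python
import re

match = re.match(b"(?P<word>[a-z]+)", b"hello")

# Looking the group up by its str name works and yields bytes.
by_str = match.group("word")

# Looking it up by a bytes name is expected to be rejected.
try:
    match.group(b"word")
    bytes_name_accepted = True
except (IndexError, TypeError):
    bytes_name_accepted = False

print(by_str, bytes_name_accepted)
```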

[issue40980] group names of bytes regexes are strings

2020-06-14 Thread Quentin Wenger
New submission from Quentin Wenger: I noticed that match.groupdict() returns string keys, even for a bytes regex: ``` >>> import re >>> re.match(b"(?P<a>)", b"").groupdict() {'a': b''} ``` This seems somewhat strange, because string and bytes matching in re are kind of two separate parts, cf. d
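
The reported behaviour can be checked by inspecting the key and value types directly. A minimal sketch with an ASCII group name; the pattern and variable names are illustrative.

```python
import re

# Bytes pattern, bytes subject: values come back as bytes, keys as str.
match = re.match(b"(?P<a>x*)", b"xx")
groups = match.groupdict()

key_types = {type(k) for k in groups}
value_types = {type(v) for v in groups.values()}

print(groups, key_types, value_types)
```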