[issue40980] group names of bytes regexes are strings

Quentin Wenger Tue, 16 Jun 2020 13:38:22 -0700


Quentin Wenger <wenger.quen...@bluewin.ch> added the comment:


You questioned my knowledge of encodings. Let's quote from one of the most 
famous introductory articles on the subject 
(https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/):

> It does not make sense to have a string without knowing what encoding it uses

So I have a bytestring that comes from somewhere. Maybe it was originally 
encoded as utf-8 or cp1250 or ..., but I won't tell or don't know; the only 
thing I swear is that it was originally a valid Python identifier.
Now I pass it as a group name in re.match (it was a valid Python identifier, so 
that has to be alright per the docs) and I get back a (unicode) string.
re.match, how dare you give me back a string when _you have no clue what my 
bytestring originally represented, or rather what it was originally encoded 
with_?
Maybe re.match will even crash, because it wrongly assumes the bytestring to 
have been latin-1 encoded!
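
To make the complaint concrete, here is a minimal sketch of the behaviour in 
question. The pattern, the group name "word" and the input are made up for 
illustration, and the group name is kept ASCII-only on purpose, since the 
handling of non-ASCII group names in bytes patterns may differ between Python 
versions.

```python
import re

# Bytes pattern with an ASCII-only group name (hypothetical example data).
pattern = re.compile(rb"(?P<word>\w+)")
match = pattern.match(b"hello")

matched = match.group("word")    # the matched text stays bytes
names = list(match.groupdict())  # the group names come back as str

print(type(matched).__name__, [type(n).__name__ for n in names])
```

The pattern and the matched text are bytes throughout; only the group names 
change type.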

So: latin-1 is an arbitrary choice that is no better than any other, and the 
fact that it "naturally" converts bytes to unicode code points is an 
implementation detail.
If you want to keep it so, it ought (cf. the quote above) to be made clear in 
the docs that group names come out as strings decoded as latin-1, with all the 
restrictions that follow from that choice.
But the more logical way would be to renounce this arbitrary encoding 
altogether.
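
A small sketch of why the choice matters, using plain str/bytes methods only. 
It does not call re at all, so it shows what a latin-1 interpretation would do 
to such bytes, not what re does internally. The two identifiers are 
hypothetical examples.

```python
# latin-1 maps each byte value 0..255 to the code point with the same number.
all_bytes = bytes(range(256))
one_to_one = all_bytes.decode("latin-1") == "".join(chr(i) for i in range(256))

# Hypothetical identifiers that were encoded with something other than latin-1.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, a valid identifier
c_caron = "\u010d"  # LATIN SMALL LETTER C WITH CARON, a valid identifier

# utf-8 bytes read as latin-1: two characters, the second is U+00A9 (copyright sign).
from_utf8 = e_acute.encode("utf-8").decode("latin-1")

# cp1250 bytes read as latin-1: one character, but a different letter (U+00E8).
from_cp1250 = c_caron.encode("cp1250").decode("latin-1")

print(one_to_one)
print(e_acute.isidentifier(), from_utf8.isidentifier())
print(c_caron == from_cp1250, from_cp1250.isidentifier())
```

In the utf-8 case the name stops being a valid identifier; in the cp1250 case 
it remains valid but silently becomes a different name.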

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40980>
_______________________________________