[issue40980] group names of bytes regexes are strings

Quentin Wenger Wed, 17 Jun 2020 01:11:16 -0700


Quentin Wenger <wenger.quen...@bluewin.ch> added the comment:


Because UTF-8 is Python's default encoding: in source files, in decode() and 
encode(), literally everywhere.
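
To illustrate the point about defaults, here is a minimal sketch (the sample string is just an arbitrary example): calling str.encode() and bytes.decode() without arguments behaves the same as passing "utf-8" explicitly.

```python
# Arbitrary sample text containing one non-ASCII character.
sample = "é"

# With no argument, str.encode() uses UTF-8.
default_encoded = sample.encode()
explicit_encoded = sample.encode("utf-8")

# With no argument, bytes.decode() uses UTF-8 as well.
default_decoded = default_encoded.decode()

print(default_encoded == explicit_encoded)
print(default_decoded == sample)
```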

If you ask around "I have a bytestring, I need a string, what do I do?", using 
latin-1 will not be the first answer (and moreover, the correct answer should 
be "it depends on the encoding", which re happily ignores by just asserting 
one).
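
For reference, a small sketch of the behaviour this issue is about, as I understand it on CPython 3.x (the pattern and the group name "word" are made up for illustration): the pattern and the matched value are bytes, yet the group names come back as str.

```python
import re

# A bytes pattern with a named group; the name is written inside a bytes literal.
pattern = re.compile(rb"(?P<word>\w+)")
match = pattern.match(b"hello world")

# The matched value is bytes, as expected for a bytes pattern.
value = match.group("word")

# The group names, however, are exposed as str keys.
index_keys = list(pattern.groupindex)
dict_keys = list(match.groupdict())

print(type(value), index_keys, dict_keys)
```

So the name was written as bytes, but some encoding had to be assumed to turn it into a str key.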

Saying "just strip that b prefix, it's fine" cannot be taken seriously.

Yes, latin-1 will never raise an error when decoding a bytestring, because it 
covers all 256 byte values, but claiming that this is why it should be used 
instead of any other encoding forgets why we have Unicode in the first place. 
**It is just pretending that Unicode never was a thing**. Being able to decode 
any bytestring does not prevent it from returning garbage _when the bytestring 
is not latin-1-encoded in the first place_.
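
A minimal sketch of what I mean, with an arbitrary sample string: bytes that were encoded as UTF-8 decode as latin-1 without any error, but the result is not the original text.

```python
# Arbitrary sample text with two non-ASCII characters.
text = "naïve café"
raw = text.encode("utf-8")           # the bytes are really UTF-8

as_utf8 = raw.decode("utf-8")        # right codec: round-trips
as_latin1 = raw.decode("latin-1")    # wrong codec: no error, wrong text

# latin-1 maps every byte to exactly one character, so it can never fail.
every_byte = bytes(range(256)).decode("latin-1")

print(as_utf8)      # naïve café
print(as_latin1)    # naÃ¯ve cafÃ©
```

The absence of an exception says nothing about whether the result is meaningful.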

Take a look at the documentation: https://docs.python.org/3/howto/unicode.html
It has 7 references to latin-1, none of them saying that latin-1 is the way to 
go because it is so much better than anything else.

latin-1 used to be prominent in the 2.x world, but it is about time to 
recognize that this is over; we can no longer ignore that encoding is a 
thing.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40980>
_______________________________________
