Quentin Wenger <wenger.quen...@bluewin.ch> added the comment:

> > this limitation to the latin-1 subset is not compatible with the 
> > documentation, which says that valid Python identifiers are valid group 
> > names.
> 
> Not all latin-1 characters are valid identifiers, for example:
> 
>     >>> '\x94'.encode('latin1')
>     b'\x94'
>     >>> '\x94'.isidentifier()
>     False

True, but that's not the point. Δ is a valid Python identifier but not a valid 
group name in bytes regexes, because it is outside the latin-1 range. The 
documentation does not mention this.
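
A minimal sketch of that mismatch, using only str methods, so it assumes 
nothing about how a particular re version treats bytes group names (the 
variable names are just for illustration):

```python
name = "Δ"

# Δ is a valid Python identifier...
is_identifier = name.isidentifier()

# ...but it has no latin-1 encoding, so it cannot survive a latin-1
# conversion between a bytes group name and a str group name.
try:
    name.encode("latin-1")
    latin1_encodable = True
except UnicodeEncodeError:
    latin1_encodable = False

print(is_identifier, latin1_encodable)
```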


> There is a workaround, you can convert `bytes` to `str` with "latin-1" 
> decoder before processing, IIRC there will be no extra overhead 
> (memory/speed) during processing, then the name and content are the same 
> type. :)

I am not looking for a workaround for my current code.

And the simplest workaround is to encode the group names back to bytes with 
latin-1, because re should not decode them to str with latin-1 in the first 
place.

Are you saying that the proper way to use bytes regexes is to use string 
regexes instead?
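
For reference, the latin-1 round trip itself is lossless, since latin-1 maps 
each byte value to the code point with the same number. A small sketch of 
just that property (it says nothing about what the resulting str means):

```python
# Every possible single byte value, 0 through 255.
data = bytes(range(256))

# Decoding with latin-1 never fails and yields code points 0..255.
text = data.decode("latin-1")
code_points_match = [ord(c) for c in text] == list(data)

# Encoding back with latin-1 recovers the original bytes exactly.
round_trip = text.encode("latin-1")

print(code_points_match, round_trip == data)
```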


> Please look at these:
> 
>     >>> orig_name = "Ř"
>     >>> orig_ch = orig_name.encode("cp1250") # Because why not?
>     >>> orig_ch
>     b'\xd8'
>     >>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
>     >>> name
>     'Ø'  # '\xd8'
>     >>> name == orig_name
>     False
>     >>> name.encode("latin-1")
>     b'\xd8'
>     >>> name.encode("latin-1") == orig_ch
>     True
> 
> "Ř" (\u0158) --cp1250--> b'\xd8'
> "Ø" (\u00d8) --latin-1--> b'\xd8'

That's no surprise, I carefully crafted this example. :-)

Rather, that is exactly my point: several different strings (which can all be 
valid Python identifiers) can have the same single-byte representation, simply 
by using different encodings (duh).
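
A short sketch of that ambiguity, decoding the same byte with three codecs. 
latin-1 and cp1250 are the ones from the example above; cp1251 is an extra 
codec added here purely for illustration:

```python
raw = b"\xd8"

# The same single byte decodes to a different character under each codec.
decoded = {enc: raw.decode(enc) for enc in ("latin-1", "cp1250", "cp1251")}

# Each decoded character is a valid Python identifier on its own.
all_identifiers = all(ch.isidentifier() for ch in decoded.values())

# Encoding each character with its own codec gives back the same byte.
same_byte = all(ch.encode(enc) == raw for enc, ch in decoded.items())

print(decoded, all_identifiers, same_byte)
```

Nothing in the byte itself says which of these strings, if any, it came from.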

So why convert group names to strings when outputting them from matches, when 
you don't know where the bytes come from, or even whether they ever were 
strings? That should be left to the programmer.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40980>
_______________________________________