Re: python3, regular expression and bytes text

MRAB Sat, 12 Oct 2019 13:31:52 -0700

On 2019-10-12 20:57, Eko palypse wrote:

You cannot. First, \w in re.LOCALE works only when the text is encoded with the locale encoding (cp1252 in your case). Second, re.LOCALE supports only 8-bit charsets. So even if you set the utf-8 locale, it would not help.
Regular expressions with re.LOCALE are slow. It may be more efficient to decode the text and use a Unicode regular expression.
Thank you, I guess I'm convinced to always decode everything (re pattern and 
text) to utf8 internally and then do the re search but then I would need to 
figure out the correct position, hmm - some ongoing investigation needed, I 
guess.

You don't _decode_ to UTF-8, you _decode_ to Unicode and _encode_ to UTF-8:

     Decode: UTF-8   => Unicode

     Encode: Unicode => UTF-8

How the Unicode is stored internally is a detail of the implementation.
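
As a rough sketch of the above, and of one way to get back to byte positions after searching the decoded text: the sample string and the variable names below are made up for illustration, and the sketch assumes the input really is valid UTF-8.

```python
import re

# Hypothetical sample input: bytes as they might come from a file or socket.
raw = "naïve café".encode("utf-8")

text = raw.decode("utf-8")           # Decode: UTF-8 bytes => str (Unicode)
assert text.encode("utf-8") == raw   # Encode: str (Unicode) => UTF-8 bytes

# In Python 3, \w in a str pattern is Unicode-aware by default.
match = re.search(r"\w+$", text)
word = match.group()

# Match positions are indices into the str, not into the bytes.
# Encoding the prefix before the match gives the corresponding byte offset.
byte_start = len(text[:match.start()].encode("utf-8"))
byte_end = byte_start + len(word.encode("utf-8"))

# The byte slice covers exactly the matched word.
assert raw[byte_start:byte_end] == word.encode("utf-8")
```

Encoding the prefix costs time proportional to its length for every match, so for large inputs with many matches it may be worth building the offset mapping once instead.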
--
https://mail.python.org/mailman/listinfo/python-list
