What needs to be set in order to use an re search within UTF-8 encoded bytes?

My test, run on a Windows PC with a cp1252 setup, looks like this:

    import re
    import locale

    cp1252 = 'Ärger im Paradies'.encode('cp1252')
    utf8 = 'Ärger im Paradies'.encode('utf-8')
    print('cp1252:', cp1252)
    print('utf8 :', utf8)

    print('-'*80)
    print("search for 'Ärger'.encode('cp1252') in cp1252 encoded text")
    for m in re.finditer('Ärger'.encode('cp1252'), cp1252):
        print(m)

    print('-'*80)
    print("search for 'Ärger'.encode('') in utf8 encoded text")
    for m in re.finditer('Ärger'.encode(), utf8):
        print(m)

    print('-'*80)
    print("search for '\\w+'.encode('cp1252') in cp1252 encoded text")
    for m in re.finditer('\\w+'.encode('cp1252'), cp1252):
        print(m)

    print('-'*80)
    print("search for '\\w+'.encode('') in utf8 encoded text")
    for m in re.finditer('\\w+'.encode(), utf8):
        print(m)

    locale.setlocale(locale.LC_ALL, '')

    print('-'*80)
    print("search for '\\w+'.encode('cp1252') using re.LOCALE in cp1252 encoded text")
    for m in re.finditer('\\w+'.encode('cp1252'), cp1252, re.LOCALE):
        print(m)

    print('-'*80)
    print("search for '\\w+'.encode('') using ??? in utf8 encoded text")
    for m in re.finditer('\\w+'.encode(), utf8):
        print(m)

If you run this, you will get something like:

    cp1252: b'\xc4rger im Paradies'
    utf8 : b'\xc3\x84rger im Paradies'
    --------------------------------------------------------------------------------
    search for 'Ärger'.encode('cp1252') in cp1252 encoded text
    <re.Match object; span=(0, 5), match=b'\xc4rger'>
    --------------------------------------------------------------------------------
    search for 'Ärger'.encode('') in utf8 encoded text
    <re.Match object; span=(0, 6), match=b'\xc3\x84rger'>
    --------------------------------------------------------------------------------

These two are OK, BUT the result for \w+ shows a difference:

    search for '\w+'.encode('cp1252') in cp1252 encoded text
    <re.Match object; span=(1, 5), match=b'rger'>
    <re.Match object; span=(6, 8), match=b'im'>
    <re.Match object; span=(9, 17), match=b'Paradies'>
    --------------------------------------------------------------------------------
    search for '\w+'.encode('') in utf8 encoded text
    <re.Match object; span=(2, 6), match=b'rger'>
    <re.Match object; span=(7, 9), match=b'im'>
    <re.Match object; span=(10, 18), match=b'Paradies'>
    --------------------------------------------------------------------------------

It doesn't find the Ä, which is expected according to the documentation, and the documentation gives a hint to use locale. So let's do that. The results are:

    search for '\w+'.encode('cp1252') using re.LOCALE in cp1252 encoded text
    <re.Match object; span=(0, 5), match=b'\xc4rger'>
    <re.Match object; span=(6, 8), match=b'im'>
    <re.Match object; span=(9, 17), match=b'Paradies'>
    --------------------------------------------------------------------------------

This works for cp1252 BUT does not work for utf8:

    search for '\w+'.encode('') using ??? in utf8 encoded text
    <re.Match object; span=(2, 6), match=b'rger'>
    <re.Match object; span=(7, 9), match=b'im'>
    <re.Match object; span=(10, 18), match=b'Paradies'>

So how can I make it work with utf8 encoded text?

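
For illustration, here is a minimal sketch of one way to stay on bytes: instead of relying on \w, the UTF-8 byte ranges are spelled out in the pattern itself. It assumes that every multi-byte UTF-8 sequence counts as a word character, which is an over-approximation (it would also accept, for example, non-ASCII punctuation), and it does not reject overlong or otherwise invalid sequences. The name UTF8_WORD is made up for this example.

```python
import re

# Hypothetical helper pattern (not part of the re module).
# Assumption: ASCII word characters plus ANY 2-, 3- or 4-byte UTF-8
# sequence are treated as "word" bytes. This over-matches non-ASCII
# punctuation and does not validate the encoding strictly.
UTF8_WORD = re.compile(
    rb"(?:[A-Za-z0-9_]"                # ASCII word characters
    rb"|[\xc2-\xdf][\x80-\xbf]"        # 2-byte sequences
    rb"|[\xe0-\xef][\x80-\xbf]{2}"     # 3-byte sequences
    rb"|[\xf0-\xf4][\x80-\xbf]{3})+"   # 4-byte sequences
)

utf8 = 'Ärger im Paradies'.encode('utf-8')

# Collect (span, matched bytes) pairs without decoding the buffer.
matches = [(m.span(), m.group()) for m in UTF8_WORD.finditer(utf8)]
for span, word in matches:
    print(span, word)
```

With the test string from above, the first match covers the two bytes of the Ä together with b'rger'. Whether treating all non-ASCII sequences as word characters is acceptable depends on the data being searched.
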
Note, decoding it to a string isn't preferred, as this would mean allocating the buffer a second time, and a buffer might be several hundred MB or even GBs in size.

Thank you
Eren
-- 
https://mail.python.org/mailman/listinfo/python-list