On 2016-08-04 15:45, Random832 wrote:
> On Thu, Aug 4, 2016, at 15:22, Malcolm Greene wrote:
>> Hi Chris,
>>
>> Thanks for your suggestions. I would like to capture the specific bad
>> codes *before* they get replaced. So if a line of text has 10 bad codes
>> (each one raising UnicodeError), I would like to track each exception's
>> bad code but still return a valid decode line when finished.
> Look into writing your own error handler - there's enough information
> provided to do this.
>
> https://docs.python.org/3/library/codecs.html

You could also use the 'surrogateescape' error handler, and count the
number of low surrogate characters (each of which represents a byte that
couldn't be decoded under the specified encoding). This will give you a
text string, which you can then process to replace code points in the
range U+DC80 - U+DCFF (inclusive).
"""
In [1]: bad_byte_string = b'Some \xf9 odd \x84 bytes \xc2 here'

In [2]: decoded = bad_byte_string.decode(errors='surrogateescape')

In [3]: decoded
Out[3]: 'Some \udcf9 odd \udc84 bytes \udcc2 here'

In [4]: low_surrogate_range = range(0xdc80, 0xdd00)

In [5]: sum(ord(char) in low_surrogate_range for char in decoded)
Out[5]: 3

In [6]: from collections import Counter

In [7]: from typing import Iterable

In [8]: def get_bad_bytes(string: str) -> Iterable[bytes]:
   ...:     for char in string:
   ...:         if ord(char) in low_surrogate_range:
   ...:             yield char.encode(errors='surrogateescape')
   ...:

In [9]: bad_byte_counts = Counter(get_bad_bytes(decoded))

In [10]: bad_byte_counts
Out[10]: Counter({b'\x84': 1, b'\xc2': 1, b'\xf9': 1})
"""

MMR...
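
P.S. For comparison, here is a rough sketch of the custom error handler
route that Random832 suggests. The handler name 'record_and_replace' and
the choice of U+FFFD as the substitute character are my own assumptions
for illustration, not anything from the thread; codecs.register_error and
the object/start/end attributes of UnicodeDecodeError are standard library
features.

```python
import codecs
from collections import Counter

# Filled in as a side effect of decoding with the handler below.
bad_byte_counts = Counter()


def record_and_replace(exc):
    """Record the undecodable bytes, then substitute U+FFFD for them.

    'record_and_replace' is a hypothetical name chosen for this sketch.
    """
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only handles decoding errors.
        raise exc
    # exc.object is the full input; start/end delimit the offending bytes.
    # A single error may cover more than one byte.
    bad = exc.object[exc.start:exc.end]
    bad_byte_counts[bad] += 1
    # Return the replacement text and the position at which to resume.
    return ('\ufffd', exc.end)


codecs.register_error('record_and_replace', record_and_replace)

bad_byte_string = b'Some \xf9 odd \x84 bytes \xc2 here'
decoded = bad_byte_string.decode('utf-8', errors='record_and_replace')

print(ascii(decoded))
print(bad_byte_counts)
```

Unlike the surrogateescape approach, this collects the bad bytes while
decoding, and the resulting string contains no lone surrogates, so it can
be encoded or printed without further cleanup. The module-level Counter is
only there to keep the sketch short; a handler built from a closure or a
small class would avoid the shared state.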