Hi Chris,

Thanks for your suggestions. I would like to capture the specific bad codes *before* they get replaced. So if a line of text has 10 bad codes (each one raising a UnicodeDecodeError), I would like to track each exception's bad code but still return a valid decoded line when finished.
My goal is to count the total number of UnicodeDecodeErrors within a file (as a data quality metric) and to track the frequency of specific bad codes (via a collections.Counter) to see if there is a pattern that can be traced to a bad upstream process.

Malcolm

<snipped>
Remove them? Not sure what you mean, exactly; but would an errors="backslashreplace" decode do the job? Something like (assuming you use Python 3):

    import csv

    def read_dirty_file(fn):
        with open(fn, encoding="utf-8", errors="backslashreplace") as f:
            for row in csv.DictReader(f):
                process(row)

You'll get Unicode text, but any bytes that don't make sense in UTF-8 will be represented as e.g. \x80, with an actual backslash. Or use errors="replace" to hide them all behind U+FFFD, or other forms of error handling. That'll get done at a higher level than the CSV reader, like you suggest.
</snipped>
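[Editor's note: a minimal sketch of one way to do what Malcolm describes, building on Chris's example. A custom error handler registered with codecs.register_error can record each offending byte in a collections.Counter before substituting U+FFFD, so decoding still yields valid text. The handler name "count_and_replace", the bad_byte_counter name, and the process() placeholder are illustrative, not from the original thread.]

    import codecs
    import csv
    from collections import Counter

    # Counter keyed by the offending byte value (an int 0-255).
    bad_byte_counter = Counter()

    def count_and_replace(exc):
        """Record each undecodable byte, then substitute U+FFFD for it."""
        if isinstance(exc, UnicodeDecodeError):
            for byte in exc.object[exc.start:exc.end]:
                bad_byte_counter[byte] += 1
            # Return the replacement text and the position to resume decoding.
            return ("\ufffd" * (exc.end - exc.start), exc.end)
        raise exc

    codecs.register_error("count_and_replace", count_and_replace)

    def read_dirty_file(fn):
        with open(fn, encoding="utf-8", errors="count_and_replace") as f:
            for row in csv.DictReader(f):
                process(row)  # process() is the placeholder from Chris's sketch

After the file has been read, sum(bad_byte_counter.values()) gives the total error count for the file, and bad_byte_counter.most_common() shows which byte values recur most often, which should surface any pattern pointing at a bad upstream process.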