On Fri, Aug 5, 2016 at 4:47 AM, Malcolm Greene <pyt...@bdurham.com> wrote:
> I'm processing a lot of dirty CSV files and would like to track the bad
> codes that are raising UnicodeErrors. I'm struggling to figure out what
> the exact codes are so I can track them, then remove them, and then
> repeat the decoding process for the current line until the line has been
> fully decoded so I can pass it on to the CSV reader. At a high level, it
> seems that I need to wrap the decoding of a line until it passes without
> any errors. Any suggestions appreciated.
Remove them? Not sure what you mean, exactly; but would an
errors="backslashreplace" decode do the job? Something like (assuming
you use Python 3):

    import csv

    def read_dirty_file(fn):
        with open(fn, encoding="utf-8", errors="backslashreplace") as f:
            for row in csv.DictReader(f):
                process(row)

You'll get Unicode text, but any bytes that don't make sense in UTF-8
will be represented as e.g. \x80, with an actual backslash. Or use
errors="replace" to hide them all behind U+FFFD, or other forms of
error handling. That'll get done at a higher level than the CSV reader,
as you suggest.

ChrisA
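PS. If you also want to record exactly which bytes were bad, you can
register a custom error handler with codecs.register_error and pass its
name to open(). A minimal sketch (the handler name "log_and_replace",
the bad_bytes list, and process() are placeholders of my own, not
anything standard):

    import codecs
    import csv

    bad_bytes = []  # every undecodable byte sequence, for later review

    def log_and_replace(exc):
        # exc is a UnicodeDecodeError; exc.object holds the raw bytes,
        # and exc.start:exc.end marks the span that failed to decode.
        bad_bytes.append(exc.object[exc.start:exc.end])
        # Substitute U+FFFD and resume decoding after the bad span.
        return ("\ufffd", exc.end)

    codecs.register_error("log_and_replace", log_and_replace)

    def read_dirty_file(fn):
        with open(fn, encoding="utf-8", errors="log_and_replace") as f:
            for row in csv.DictReader(f):
                process(row)

After the file is read, bad_bytes tells you exactly which codes were
raising the UnicodeErrors, with no need to re-run the decode in a loop.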