On Thu, Aug 4, 2016 at 3:24 PM Malcolm Greene <pyt...@bdurham.com> wrote:
> Hi Chris,
>
> Thanks for your suggestions. I would like to capture the specific bad
> codes *before* they get replaced. So if a line of text has 10 bad codes
> (each one raising UnicodeError), I would like to track each exception's
> bad code but still return a valid decoded line when finished.
>
> My goal is to count the total number of UnicodeExceptions within a file
> (as a data quality metric) and track the frequency of specific bad
> codes (via a collections.Counter dict) to see if there's a pattern that
> can be traced to a bad upstream process.

Give this a shot (below). It seems to do what you want.

import csv
from collections import Counter
from io import BytesIO

def _cleanline(line, counts=Counter()):
    # Decode as much of the line as possible; on failure, tally the
    # offending byte run, drop it, and recurse on the remainder.
    try:
        return line.decode()
    except UnicodeDecodeError as e:
        counts[line[e.start:e.end]] += 1
        return line[:e.start].decode() + _cleanline(line[e.end:], counts)

def cleanlines(fp):
    '''
    convert data to text; track decoding errors

    ``fp`` is an open file-like iterable of lines
    '''
    cleanlines.errors = Counter()
    for line in fp:
        yield _cleanline(line, cleanlines.errors)

f = BytesIO(b'''\
this,is line,one
line two,has junk,\xffin it
so does,\xfa\xffline,three
''')

for row in csv.reader(cleanlines(f)):
    print(row)

print(cleanlines.errors.most_common())
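
FWIW, under CPython 3 I'd expect that to print something along these
lines (note that the \xfa\xff run gets reported as two separate
one-byte errors, because the UTF-8 decoder flags each invalid start
byte on its own):

['this', 'is line', 'one']
['line two', 'has junk', 'in it']
['so does', 'line', 'three']
[(b'\xff', 2), (b'\xfa', 1)]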
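
If the recursion bothers you, a non-recursive sketch of the same idea
(untested against your real data; the handler name 'count_and_skip' is
just one I made up) is to register a custom error handler with
codecs.register_error and let bytes.decode do the looping:

import codecs
from collections import Counter

errors = Counter()

def count_and_skip(exc):
    # tally the offending byte run, then resume decoding right after it
    errors[exc.object[exc.start:exc.end]] += 1
    return ('', exc.end)

codecs.register_error('count_and_skip', count_and_skip)

print(b'line two,has junk,\xffin it'.decode('utf-8', 'count_and_skip'))
print(errors.most_common())

The handler is called with the UnicodeDecodeError itself, and the
('', exc.end) return value tells the decoder to substitute nothing and
carry on from the end of the bad run, so a single pass over each line
replaces the explicit recursion.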