On Fri, Aug 5, 2016 at 5:22 AM, Malcolm Greene <pyt...@bdurham.com> wrote:
> Thanks for your suggestions. I would like to capture the specific bad
> codes *before* they get replaced. So if a line of text has 10 bad codes
> (each one raising UnicodeError), I would like to track each exception's
> bad code but still return a valid decode line when finished.
Interesting. Sounds to me like the simplest option is to open the file as
binary, split it on b"\n", and decode line by line before giving it to the
csv module. The csv.reader "csvfile" argument doesn't actually have to be
a file - it can be anything that yields lines. So you can put a generator
in between, like this:

def decode(binary):
    for line in binary:
        try:
            yield line.decode("utf-8")
        except UnicodeDecodeError:
            log_stats()
            # Still hand the csv module a usable line, with the bad
            # bytes replaced by U+FFFD.
            yield line.decode("utf-8", errors="replace")

def read_dirty_file(fn):
    with open(fn, "rb") as f:
        for row in csv.DictReader(decode(f)):
            process(row)

Or what Random said, which is also viable.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
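One caveat with the try/except approach: line.decode() raises on the first bad byte only, so a line with ten bad codes reports just one exception. If you want every offending byte logged, a sketch using a custom codecs error handler works (the handler name "log_replace" and the bad_bytes list are my own invention, not anything Malcolm posted):

```python
import codecs

bad_bytes = []  # collects every raw byte sequence that failed to decode

def log_and_replace(exc):
    # Record the offending bytes, then substitute U+FFFD and resume
    # decoding at the position just past them.
    bad_bytes.append(exc.object[exc.start:exc.end])
    return ("\ufffd", exc.end)

codecs.register_error("log_replace", log_and_replace)

# Latin-1 bytes embedded in otherwise valid UTF-8: two separate errors.
line = b"caf\xe9 and na\xefve"
decoded = line.decode("utf-8", errors="log_replace")
```

Because the handler resumes after each failure, decode() never raises and the whole line comes back in one call, with every bad byte captured along the way.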