Re: catch UnicodeDecodeError

Robert Miles Wed, 29 Aug 2012 22:58:15 -0700

On 7/26/2012 5:51 AM, Jaroslav Dobrek wrote:

And the cool thing is: you can! :)


In Python 2.6 and later, the new Py3 open() function is a bit more hidden,
but it's still available:

     from io import open

     filename = "somefile.txt"
     try:
         with open(filename, encoding="utf-8") as f:
             for line in f:
                 process_line(line)  # actually, I'd use "process_file(f)"
     except IOError as e:
         print("Reading file %s failed: %s" % (filename, e))
     except UnicodeDecodeError as e:
         print("Some error occurred decoding file %s: %s" % (filename, e))


Thanks. I might use this in the future.

try:
     for line in f: # here text is decoded implicitly
        do_something()
except UnicodeDecodeError():
     do_something_different()

This isn't possible for syntactic reasons.


Well, you'd normally want to leave out the parentheses after the exception
type, but otherwise, that's perfectly valid Python code. That's how these
things work.
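To illustrate the point about the parentheses, here is a minimal sketch in
which the exception raised during implicit decoding is caught by naming the
exception class without calling it. The helper name count_lines_or_fallback
and the temporary test file are assumptions made for this illustration; they
do not come from the original program.

```python
import io
import os
import tempfile

def count_lines_or_fallback(path):
    """Return the number of lines, or -1 if the file is not valid UTF-8."""
    try:
        with io.open(path, encoding="utf-8") as f:
            # Text is decoded implicitly while iterating over the file.
            return sum(1 for _ in f)
    except UnicodeDecodeError:  # the exception class, without parentheses
        return -1

# Hypothetical demonstration: a temporary file with an invalid UTF-8 byte.
fd, bad_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"ok line\n\xff broken line\n")
result = count_lines_or_fallback(bad_path)
os.remove(bad_path)
```

The try block has to enclose the loop that actually reads from the file,
because that is where the decoding happens.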


You are right. Of course this is syntactically possible. I was too
rash, sorry. I confused it with some other construction I once tried.
I can't remember it right now.

But the code above (without the brackets) is semantically bad: The
exception is not caught.

The problem is that the vast majority of the thousands of files that I
process are correctly encoded. But then, suddenly, there is a bad
character in a new file. (This is so because most files today are
generated by people who don't know that there is such a thing as
encodings.) And then I need to rewrite my very complex program just
because of one single character in one single file.


Why would that be the case? The places to change should be very local in
your code.


This is the case in a program that has many different functions which
open and parse different types of files. When I read and parse a
directory with such different types of files, a program that uses

for line in f:

will not exit with any hint as to where the error occurred. It just
exits with a UnicodeDecodeError. That means I have to look at all
functions that have some variant of

for line in f:

in them. And it is not sufficient to replace the "for line in f" part.
I would have to transform many functions that
work in terms of lines into functions that work in terms of decoded
bytes.

That is why I usually solve the problem by moving files around until I
find the bad file. Then I recode or repair the bad file manually.
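One possible way to get a hint about which file failed, while keeping the
functions line-based, is to route the line iteration through a single
generator that adds the file name to the error. This is only a sketch: the
name decoded_lines and the temporary test file are hypothetical, and UTF-8
is assumed as the expected encoding.

```python
import io
import os
import tempfile

def decoded_lines(path, encoding="utf-8"):
    """Yield decoded lines from path; name the file if decoding fails."""
    try:
        with io.open(path, encoding=encoding) as f:
            for line in f:
                yield line
    except UnicodeDecodeError as e:
        # Re-raise the same exception type with the file name in the reason.
        raise UnicodeDecodeError(e.encoding, e.object, e.start, e.end,
                                 "%s (file: %s)" % (e.reason, path))

# Hypothetical demonstration: a temporary file with an invalid UTF-8 byte.
fd, bad_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"first line\n\xff\n")
try:
    for line in decoded_lines(bad_path):
        pass
    message = None
except UnicodeDecodeError as e:
    message = str(e)
finally:
    os.remove(bad_path)
```

With such a helper, a loop written as "for line in f" could become
"for line in decoded_lines(path)" without changing the rest of the function.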



Would it be reasonable to use pieces of the old program to write a
new program that prints the name for an input file, then searches
that input file for bad characters?  If it doesn't find any, it can
then go on to the next input file, or show a message saying that no
bad characters were found.
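A rough sketch of such a checking program follows. It reads each file as raw
bytes and tries to decode it, recording the path and the byte offset of the
first undecodable byte. The function name find_badly_encoded_files, the
assumption that every file should be UTF-8, and the sample files are all
made up for this illustration.

```python
import io
import os
import shutil
import tempfile

def find_badly_encoded_files(directory, encoding="utf-8"):
    """Return (path, byte offset) pairs for files that fail to decode."""
    bad = []
    for root, _dirs, names in os.walk(directory):
        for name in sorted(names):
            path = os.path.join(root, name)
            with io.open(path, "rb") as f:
                data = f.read()
            try:
                data.decode(encoding)
            except UnicodeDecodeError as e:
                # e.start is the offset of the first undecodable byte.
                bad.append((path, e.start))
    return bad

# Hypothetical demonstration: one good file and one bad file.
workdir = tempfile.mkdtemp()
good_path = os.path.join(workdir, "good.txt")
bad_path = os.path.join(workdir, "bad.txt")
with io.open(good_path, "wb") as out:
    out.write(u"gr\u00fcn\n".encode("utf-8"))
with io.open(bad_path, "wb") as out:
    out.write(b"abc\xffdef\n")

bad_files = find_badly_encoded_files(workdir)
for path, offset in bad_files:
    print("Bad byte in %s at offset %d" % (path, offset))
if not bad_files:
    print("No bad characters were found.")
shutil.rmtree(workdir)
```

Reading whole files into memory keeps the sketch short; for very large files
an incremental decoder would be a more careful choice.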

--
http://mail.python.org/mailman/listinfo/python-list
