catch UnicodeDecodeError
Hello,

very often I have the following problem: I write a program that processes many files which it assumes to be encoded in utf-8. Then, some day, there is a non-utf-8 character in one of several hundred or thousand (new) files. The program exits with an error message like this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 60: invalid continuation byte

I usually solve the problem by moving files around and by recoding them. What I really want to do is use something like:

try:
    # open file, read line, or do something else, I don't care
except UnicodeDecodeError:
    sys.exit("Found a bad char in file " + file + " line " + str(line_number))

Yet, no matter where I put this try-except, it doesn't work. How should I use try-except with UnicodeDecodeError?

Jaroslav
Re: catch UnicodeDecodeError
On Wednesday, July 25, 2012 1:35:09 PM UTC+2, Philipp Hagemeister wrote:
> Hi Jaroslav,
>
> you can catch a UnicodeDecodeError just like any other exception. Can
> you provide a full example program that shows your problem?
>
> This works fine on my system:
>
> import sys
> open('tmp', 'wb').write(b'\xff\xff')
> try:
>     buf = open('tmp', 'rb').read()
>     buf.decode('utf-8')
> except UnicodeDecodeError as ude:
>     sys.exit("Found a bad char in file " + "tmp")

Thank you. I got it. What I need to do is explicitly decode text.

But I think trial and error with moving files around will in most cases be faster. Usually, such a problem occurs with some (usually complex) program that I wrote quite a long time ago. I don't like editing old and complex programs that work under all normal circumstances.

What I am missing (especially for Python3) is something like:

try:
    for line in sys.stdin:
except UnicodeDecodeError:
    sys.exit("Encoding problem in line " + str(line_number))

I got the point that there is no such thing as encoding-independent lines. But if no line ending can be found, then the file simply has one single line.
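A minimal sketch of that missing construction, assuming Python 3 and not taken from the thread: read the stream as raw bytes via sys.stdin.buffer and decode each line explicitly, so the failing line number is available when exiting.

    import sys

    for line_number, raw in enumerate(sys.stdin.buffer, start=1):
        try:
            line = raw.decode("utf-8")
        except UnicodeDecodeError:
            sys.exit("Encoding problem in line " + str(line_number))
        # ... process the decoded line here ...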
Re: catch UnicodeDecodeError
On Jul 25, 8:50 pm, Dave Angel wrote:
> On 07/25/2012 08:09 AM, jaroslav.dob...@gmail.com wrote:
> [...]
>
> i can't understand your question. if the problem is that the system
> doesn't magically produce a variable called line_number, then generate
> it yourself, by counting in the loop.

That was just a very incomplete and general example.

My problem is solved. What I need to do is explicitly decode text when reading it. Then I can catch exceptions. I might do this in future programs.

What I dislike about this solution is that it complicates most programs unnecessarily. In programs that open, read and process many files I don't want to explicitly decode and encode characters all the time. I just want to write:

for line in f:

or something like that. Yet, writing this means to *implicitly* decode text. And, because the decoding is implicit, you cannot say:

try:
    for line in f:  # here text is decoded implicitly
        do_something()
except UnicodeDecodeError():
    do_something_different()

This isn't possible for syntactic reasons.

The problem is that the vast majority of the thousands of files that I process are correctly encoded. But then, suddenly, there is a bad character in a new file. (This is so because most files today are generated by people who don't know that there is such a thing as encodings.) And then I need to rewrite my very complex program just because of one single character in one single file.
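If stopping at the first bad byte is not actually required, Python 3's open() also takes an errors argument, which keeps the plain for-line-in-f style; this is a sketch of that alternative (the filename is hypothetical), not something proposed in the thread.

    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising;
    # errors="surrogateescape" instead smuggles the raw bytes through, so they
    # survive if the text is later written back out with the same handler.
    with open("input.txt", encoding="utf-8", errors="replace") as f:
        for line in f:
            do_something(line)  # do_something() is the placeholder used in the thread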
Re: catch UnicodeDecodeError
> And the cool thing is: you can! :)
>
> In Python 2.6 and later, the new Py3 open() function is a bit more hidden,
> but it's still available:
>
> from io import open
>
> filename = "somefile.txt"
> try:
>     with open(filename, encoding="utf-8") as f:
>         for line in f:
>             process_line(line)  # actually, I'd use "process_file(f)"
> except IOError, e:
>     print("Reading file %s failed: %s" % (filename, e))
> except UnicodeDecodeError, e:
>     print("Some error occurred decoding file %s: %s" % (filename, e))

Thanks. I might use this in the future.

> > try:
> >     for line in f:  # here text is decoded implicitly
> >         do_something()
> > except UnicodeDecodeError():
> >     do_something_different()
>
> > This isn't possible for syntactic reasons.
>
> Well, you'd normally want to leave out the parentheses after the exception
> type, but otherwise, that's perfectly valid Python code. That's how these
> things work.

You are right. Of course this is syntactically possible. I was too rash, sorry. I confused it with some other construction I once tried. I can't remember it right now.

But the code above (without the brackets) is semantically bad: the exception is not caught.

> > The problem is that the vast majority of the thousands of files that I
> > process are correctly encoded. But then, suddenly, there is a bad
> > character in a new file. (This is so because most files today are
> > generated by people who don't know that there is such a thing as
> > encodings.) And then I need to rewrite my very complex program just
> > because of one single character in one single file.
>
> Why would that be the case? The places to change should be very local in
> your code.

This is the case in a program that has many different functions which open and parse different types of files. When I read and parse a directory with such different types of files, a program that uses

for line in f:

will not exit with any hint as to where the error occurred. It just exits with a UnicodeDecodeError. That means I have to look at all functions that have some variant of

for line in f:

in them. And it is not sufficient to replace the "for line in f" part. I would have to transform many functions that work in terms of lines into functions that work in terms of decoded bytes.

That is why I usually solve the problem by moving files around until I find the bad file. Then I recode or repair the bad file manually.
Re: catch UnicodeDecodeError
On Jul 26, 12:19 pm, wxjmfa...@gmail.com wrote:
> On Thursday, July 26, 2012 9:46:27 AM UTC+2, Jaroslav Dobrek wrote:
> [...]
>
> In my mind you are taking the problem the wrong way.
>
> Basically there is no "real UnicodeDecodeError", you are
> just wrongly attempting to read a file with the wrong
> codec. Catching a UnicodeDecodeError will not correct
> the basic problem, it will "only" show, you are using
> a wrong codec.
> There is still the possibility, you have to deal with an
> ill-formed utf-8 coding, but I doubt it is the case.
I participate in projects in which all files (raw text files, xml files, html files, ...) are supposed to be encoded in utf-8. I get many different files from many different people. They are almost always encoded in utf-8. But sometimes a whole file or, more frequently, parts of a file are not encoded in utf-8. The reason is that most of the files stem from the internet: files or strings are downloaded and, if possible, recoded, and they are often simply concatenated into larger strings or files.

I think the most straightforward thing to do is to assume that I get utf-8 and raise an error if some file or character proves to be something different.
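A sketch of such an up-front check, with hypothetical names and not taken from the thread: decode every file in a directory once and report the offending file together with the bad byte and its offset, which UnicodeDecodeError exposes through its object and start attributes.

    import os
    import sys

    def report_non_utf8(directory):
        # Decode each file as a whole; on failure, say which file, which byte, and where.
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            with open(path, "rb") as f:
                data = f.read()
            try:
                data.decode("utf-8")
            except UnicodeDecodeError as e:
                print("%s: byte 0x%02x at offset %d is not valid utf-8"
                      % (path, e.object[e.start], e.start), file=sys.stderr)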
Re: catch UnicodeDecodeError
> that tells you the exact code line where the error occurred. No need to
> look around.

You are right:

try:
    for line in f:
        do_something()
except UnicodeDecodeError:
    do_something_different()

does exactly what one would expect it to do. Thank you very much for pointing this out and sorry for all the posts. This is one of the days when nothing seems to work and when I don't seem to be able to read the simplest error message.
subtraction of floating point numbers
Hello,

when I have Python subtract floating point numbers it yields weird results. Example:

4822.40 - 4785.52 = 36.87992

Why doesn't Python simply yield the correct result? It doesn't have a problem with this:

482240 - 478552 = 3688

Can I tell Python in some way to do this differently?

Jaroslav
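Binary floats cannot represent most decimal fractions exactly, so the subtraction is carried out on the nearest representable values and the tiny error becomes visible in the result. A short sketch of two common ways around this (not from the thread):

    from decimal import Decimal

    # Exact decimal arithmetic: build Decimals from strings, not from floats.
    print(Decimal("4822.40") - Decimal("4785.52"))   # 36.88

    # Or keep binary floats and round only for display.
    print(round(4822.40 - 4785.52, 2))               # 36.88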
system call that is killed after n seconds if not finished
Hello,

I would like to execute shell commands, but only if their execution time is not longer than n seconds. Like so:

monitor(os.system("do_something"), 5)

I.e. the command do_something should be executed by the operating system. If the call has not finished after 5 seconds, the process should be killed.

How could this be done?

Jaroslav
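One way to do this, assuming Python 3.5 or later (not discussed in the thread itself), is the timeout support in the subprocess module; "do_something" below is just the placeholder command from the post.

    import subprocess

    try:
        # run() waits at most 5 seconds; on timeout it kills the child and raises.
        subprocess.run(["do_something"], timeout=5)
    except subprocess.TimeoutExpired:
        print("command did not finish within 5 seconds and was killed")

Note that with shell=True only the shell process itself is killed on timeout, so passing the command as a list (without a shell) is the safer form.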
parallel subprocess.getoutput
Hello,

I wrote the following code for using egrep on many large files:

import os
import subprocess

MY_DIR = '/my/path/to/dir'
FILES = os.listdir(MY_DIR)

def grep(regex):
    i = 0
    l = len(FILES)
    output = []
    while i < l:
        command = "egrep " + '"' + regex + '" ' + MY_DIR + '/' + FILES[i]
        result = subprocess.getoutput(command)
        if result:
            output.append(result)
        i += 1
    return output

Yet, I don't think that the files are searched in parallel. Am I right? How can I search them in parallel?

Jaroslav
Re: parallel subprocess.getoutput
Sorry, for code-historical reasons this was unnecessarily complicated. Should be:

MY_DIR = '/my/path/to/dir'
FILES = os.listdir(MY_DIR)

def grep(regex):
    output = []
    for f in FILES:
        command = "egrep " + '"' + regex + '" ' + MY_DIR + '/' + f
        result = subprocess.getoutput(command)
        if result:
            output.append(result)
    return output
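One possible way to run the searches concurrently, sketched here with concurrent.futures and not taken from the thread: each worker thread merely waits on its own egrep process, so a thread pool is enough to keep several egreps running at the same time.

    import os
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    MY_DIR = '/my/path/to/dir'
    FILES = os.listdir(MY_DIR)

    def grep_one(regex, name):
        # Run egrep on a single file; getoutput() returns '' when nothing matches.
        command = "egrep " + '"' + regex + '" ' + MY_DIR + '/' + name
        return subprocess.getoutput(command)

    def grep(regex, max_workers=8):
        # Start one egrep per file, several at a time, and keep non-empty results.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = pool.map(lambda name: grep_one(regex, name), FILES)
            return [r for r in results if r]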
read from file with mixed encodings in Python3
Hello,

in Python3, I often have this problem: I want to do something with every line of a file. Like Python3, I presuppose that every line is encoded in utf-8. If this isn't the case, I would like Python3 to do something specific (like skipping the line, writing the line to standard error, ...). Like so:

try:
    ...
except UnicodeDecodeError:
    ...

Yet, there is no place for this construction. If I simply do:

for line in f:
    print(line)

this will result in a UnicodeDecodeError if some line is not utf-8, but I can't tell Python3 to stop. This will not work:

for line in f:
    try:
        print(line)
    except UnicodeDecodeError:
        ...

because the UnicodeDecodeError is caused in the "for line in f" part. How can I catch such exceptions?

Note that recoding the file before opening it is not an option, because often files contain many different strings in many different encodings.

Jaroslav
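One way to get the skip-or-report behaviour asked for here, assuming Python 3 and a hypothetical filename (not from the post): open the file in binary mode and decode each line explicitly, so a bad line can be written to standard error and skipped while the loop keeps going.

    import sys

    with open("somefile.txt", "rb") as f:
        for line_number, raw in enumerate(f, start=1):
            try:
                line = raw.decode("utf-8")
            except UnicodeDecodeError:
                # Report the offending line on stderr and carry on with the next one.
                print("line %d is not valid utf-8, skipping" % line_number, file=sys.stderr)
                continue
            print(line, end="")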