[issue18282] Ugly behavior of binary and unicode handling on reading unknown encoded files

Sworddragon Sat, 22 Jun 2013 07:55:19 -0700

New submission from Sworddragon:

Currently Python 3 has some problems handling files with an unknown 
encoding. In this example we have a file encoded as ISO-8859-1 with the content 
"ä", which we want to read. Let's see what Python 3 can currently do 
here:


1. We can simply open the file and try to read the content. In my case the 
encoding is automatically set to UTF-8, but the read() operation throws 
an exception: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in 
position 0: unexpected end of data

2. Now let's look a little more closely at the arguments of open(): there is an 
errors argument which might be useful:
2.1. "strict" is the default behavior which was already tested.
2.2. "ignore" will not throw any exception but drops any byte which can't 
be decoded. This would be problematic in many cases.
2.3. "replace" will replace any byte which can't be decoded, which is 
problematic in many cases too.
2.4. "surrogateescape" can throw exceptions too: UnicodeEncodeError: 'utf-8' 
codec can't encode character '\udce4' in position 0: surrogates not allowed
2.5. "xmlcharrefreplace" and "backslashreplace" are not used for reading.

3. Since trying to decode the file causes many problems, we can try to read 
the file as binary content. This works in all cases but causes another 
problem: any unicode string that must be concatenated with the content of the 
file must be converted to a binary string too (like b'some_unicode_content' or 
some_unicode_variable.encode()). The same applies to unicode strings that are 
concatenated somewhere else with the newly converted unicode_to_binary 
variable, even if they don't touch the file content. This behavior can hurt 
maintainability.
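
The three approaches above can be reproduced with a short script. This is only 
a minimal sketch: it assumes a temporary file holding the single byte 0xe4 and 
passes encoding="utf-8" explicitly, because the default encoding depends on the 
locale. The helper name read_with is made up for illustration. Note that with 
"surrogateescape" the read itself succeeds here; the error from 2.4 shows up 
once the result is encoded again with the strict handler.

```python
import os
import tempfile

# Create a file holding the single byte 0xe4 ("ä" in ISO-8859-1).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\xe4")


def read_with(errors):
    """Hypothetical helper: read the file as UTF-8 with the given error handler."""
    with open(path, encoding="utf-8", errors=errors) as f:
        return f.read()


# 1. / 2.1. "strict" raises on the undecodable byte.
try:
    read_with("strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 2.2. "ignore" silently drops the byte.
ignored = read_with("ignore")

# 2.3. "replace" substitutes U+FFFD, so the original byte is lost.
replaced = read_with("replace")

# 2.4. "surrogateescape" reads without error, but the resulting lone
# surrogate cannot be encoded again with the strict handler.
escaped = read_with("surrogateescape")
try:
    escaped.encode("utf-8")
    reencode_failed = False
except UnicodeEncodeError:
    reencode_failed = True

# Encoding with the same handler restores the original byte.
round_trip = escaped.encode("utf-8", "surrogateescape")

# 3. Binary mode always works, but every str must be encoded
# before it can be concatenated with the file content.
with open(path, "rb") as f:
    raw = f.read()
combined = "prefix: ".encode("utf-8") + raw

os.remove(path)
```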


As you can see, all current solutions in Python 3 have big disadvantages. If I'm 
overlooking something, feel free to correct me. For now I have developed my 
own solution in Python which solved the problem: a function that autodetects 
the encoding of the file. Maybe there could also be a native way to do this in 
open(), or maybe another way could be found to solve this problem.
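
For illustration, one possible shape of such a function is sketched below. It 
is not the function mentioned above, only an assumption about how such a helper 
might look: it tries a fixed list of candidate encodings in order and returns 
the first one that decodes without error. The name read_text_guessing and the 
candidate list are hypothetical, and a successful decode does not prove the 
guess is right (ISO-8859-1 accepts every byte sequence).

```python
import os
import tempfile


def read_text_guessing(path, candidates=("utf-8", "iso-8859-1")):
    """Hypothetical helper: decode with the first candidate encoding that works.

    Returns a (text, encoding) tuple. The candidate order is an assumption;
    put the strictest encodings first, since ISO-8859-1 never fails.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the file")


# Usage: the same character stored in two different encodings.
results = []
for data in (b"\xe4", b"\xc3\xa4"):
    fd, tmp_path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    results.append(read_text_guessing(tmp_path))
    os.remove(tmp_path)
```

Both files decode to the same one-character string, so the rest of the program 
can keep working with ordinary unicode strings.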

----------
components: IO
messages: 191643
nosy: Sworddragon
priority: normal
severity: normal
status: open
title: Ugly behavior of binary and unicode handling on reading unknown encoded 
files
type: enhancement
versions: Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18282>
_______________________________________
_______________________________________________
Python-bugs-list mailing list