[issue10370] py3 readlines() reports wrong offset for UnicodeDecodeError

STINNER Victor Mon, 08 Nov 2010 18:03:11 -0800

STINNER Victor <victor.stin...@haypocalc.com> added the comment:

The error occurs in .readline(): .readline() fills a buffer by reading the file 
chunk by chunk. Each time a chunk is read, it is decoded by the stateful 
decoder. The problem is that the decoder doesn't know the file offset. Even if 
it knew, the start and end attributes of UnicodeDecodeError are indexes into 
the (bytes) object being decoded, not offsets into the file.
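
As a rough illustration of the point above (not the actual TextIOWrapper 
code), the sketch below feeds a byte string to an incremental decoder in small 
chunks. The helper name and the tiny chunk size are made up for the example; 
it only shows that exc.start is relative to the chunk, not to the whole input.

```python
import codecs


def offset_reported_by_chunked_decode(data, chunk_size):
    """Decode `data` chunk by chunk with a stateful UTF-8 decoder.

    Hypothetical helper for illustration only. Returns a tuple
    (exc.start, chunk_start) for the first decoding error, or None.
    It does not flush the decoder at the end, so a truncated
    multi-byte sequence at the very end of `data` is not reported.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk_start in range(0, len(data), chunk_size):
        chunk = data[chunk_start:chunk_start + chunk_size]
        try:
            decoder.decode(chunk)
        except UnicodeDecodeError as exc:
            # exc.start indexes into the bytes handed to the decoder,
            # so it knows nothing about chunk_start.
            return exc.start, chunk_start
    return None


# Invalid byte 0xff sits at absolute offset 10, but the second chunk
# starts at offset 8, so the decoder reports index 2.
sample = b"a" * 10 + b"\xff" + b"b" * 5
print(offset_reported_by_chunked_decode(sample, chunk_size=8))  # (2, 8)
```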


> but reports an error at offset 4096 (reported as "0")

4096 is the buffer_size attribute of BufferedReader: .readline() -> 
._read_chunk() -> .buffer.read1().

> The misreported offset does not occur with read(), just with readlines().

.read() is special: it reads the whole file at once and decodes the binary 
content in one pass.
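
A small sketch of that case, using an in-memory io.BytesIO in place of a real 
file (an assumption made to keep the example self-contained). Because the whole 
content is decoded in one call, the reported index happens to match the 
absolute offset.

```python
import io


def offset_reported_by_read(data):
    """Return exc.start from a failing text-mode .read(), or None.

    Hypothetical helper: wraps `data` in a TextIOWrapper and reads
    everything at once, as the comment above describes.
    """
    text_file = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
    try:
        text_file.read()
    except UnicodeDecodeError as exc:
        return exc.start
    return None


sample = b"a" * 10 + b"\xff" + b"b" * 5
print(offset_reported_by_read(sample))  # 10
```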

--

I don't consider this a bug, so I'm closing the issue as invalid.

--

Using .readline() to locate an invalid byte is not the right algorithm. If you 
would like to do that, you should open the file in binary mode and decode the 
content yourself, chunk by chunk. Or, if you are working with small files, you 
can use .read() as you wrote.
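
One possible way to do the binary-mode approach is sketched below. The 
function name, the default chunk size and the bookkeeping are assumptions of 
this sketch, not something taken from the io module. It tracks how many bytes 
were handed to the decoder and subtracts the bytes the decoder was still 
holding from an incomplete multi-byte sequence.

```python
import codecs


def find_invalid_byte(path, encoding="utf-8", chunk_size=4096):
    """Return the absolute offset of the first undecodable byte, or None.

    Hypothetical helper: reads `path` in binary mode and decodes it
    chunk by chunk with an incremental decoder.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    consumed = 0  # bytes passed to the decoder before the current chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            final = not chunk  # empty read means end of file
            # Bytes of an incomplete sequence kept from earlier chunks;
            # the decoder prepends them to the next input.
            pending = len(decoder.getstate()[0])
            try:
                decoder.decode(chunk, final=final)
            except UnicodeDecodeError as exc:
                # exc.start indexes into (pending bytes + chunk).
                return consumed - pending + exc.start
            consumed += len(chunk)
            if final:
                return None
```

For example, find_invalid_byte("data.txt") (with "data.txt" standing in for 
your own file) would return the position to pass to f.seek() in binary mode to 
inspect the offending byte.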

----------
nosy: +haypo
resolution:  -> invalid
status: open -> closed

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue10370>
_______________________________________