[issue12508] Codecs Anomaly

Ezio Melotti Sat, 03 Sep 2011 21:20:23 -0700

Ezio Melotti <ezio.melo...@gmail.com> added the comment:

IIUC this happens because StreamReader calls codecs.utf_8_decode without 
passing final=1 [0], so when the decoder finds the trailing F4 it doesn't 
decode it yet because it waits for the other 3 bytes (F4 is the start byte of 
a 4-byte UTF-8 sequence):


>>> import codecs
>>> b = b'A\xf5BC\xf4'
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 0)  # final=0
>>> chars, decnum
('A�BC', 4)  # F4 not decoded yet
>>> b = b[decnum:]
>>> b
b'\xf4'  # F4 still here
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 0)
>>> chars, decnum
('', 0)  # additional calls keep waiting for the other 3 bytes
>>> chars, decnum = codecs.utf_8_decode(b, 'replace', 1)  # final=1
>>> chars, decnum
('�', 1)  # with final=1 F4 is decoded, but StreamReader never passes it

While passing 1 makes the attached script work as expected, it breaks several 
other tests in test_codecs (apparently not all the decoders accept the 'final' 
argument).
Also, passing 1 should be done only for the last call: read can be called 
several times with a specific size, and it shouldn't use final=1 until the last 
call, to avoid errors mid-stream.
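
To illustrate the "final only on the last call" idea, here is a minimal sketch 
using the incremental decoder API rather than StreamReader. The helper name 
decode_chunks is hypothetical and this is not a proposed patch for 
Lib/codecs.py; it only shows the intended behaviour, assuming the chunks are 
the results of successive reads:

```python
import codecs

def decode_chunks(chunks, encoding="utf-8", errors="replace"):
    """Decode an iterable of byte chunks (hypothetical helper).

    final=False is used for every chunk, so an incomplete multi-byte
    sequence at a chunk boundary is buffered instead of being reported.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    parts = []
    for chunk in chunks:
        parts.append(decoder.decode(chunk, final=False))
    # Only at end of input do we flush with final=True, so any
    # leftover bytes (like the trailing F4) go through the error handler.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

text = decode_chunks([b"A\xf5B", b"C\xf4"])
```

With this approach a sequence split across two reads is still decoded 
correctly, while a truncated sequence at the very end of the input is replaced 
instead of being silently dropped.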

[0]: see Lib/codecs.py:482

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12508>
_______________________________________