New submission from Martin Panter:

Many of the incremental codecs do not handle fragmented data very well. In the past I think I was interested in using the Base-64 and Quoted-printable codecs, and playing with other codecs today reveals many more issues. A lot of the issues reflect missing functionality, so maybe the simplest solution would be to document the codecs that don’t work.
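
For comparison, here is roughly what fragment-tolerant behaviour looks like with a codec that already gets it right. This is only an illustrative sketch, not part of the report: it feeds the three UTF-8 bytes of U+2013 to the standard UTF-8 incremental decoder one at a time. The variable names are arbitrary.

```python
import codecs

# U+2013 (EN DASH) is three bytes in UTF-8; deliver them as separate fragments.
fragments = [b"\xe2", b"\x80", b"\x93"]

decoder = codecs.getincrementaldecoder("utf-8")()

# The decoder buffers the incomplete sequence and returns an empty string
# until the final byte of the character arrives.
pieces = [decoder.decode(chunk) for chunk in fragments]

# Signal end of input; this would raise if a partial character were left over.
pieces.append(decoder.decode(b"", final=True))

text = "".join(pieces)

# codecs.iterdecode() drives the same incremental decoder over an iterator.
via_iterdecode = "".join(codecs.iterdecode(iter(fragments), "utf-8"))
```

The codecs listed in this report either raise on the first fragment or silently lose data in the equivalent situation.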
Incremental decoding issues:

>>> str().join(codecs.iterdecode(iter((b"\\", b"u2013")), "unicode-escape"))
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: \ at end of string
# Same deal for raw-unicode-escape.
>>> bytes().join(codecs.iterdecode(iter((b"3", b"3")), "hex-codec"))
binascii.Error: Odd-length string
>>> bytes().join(codecs.iterdecode(iter((b"A", b"A==")), "base64-codec"))
binascii.Error: Incorrect padding
>>> bytes().join(codecs.iterdecode(iter((b"=", b"3D")), "quopri-codec"))
b'3D'  # Should return b"="
>>> codecs.getincrementaldecoder("uu-codec")().decode(b"begin ")
ValueError: Truncated input data

Incremental encoding issues:

>>> e = codecs.getincrementalencoder("base64-codec")()
>>> codecs.decode(e.encode(b"1") + e.encode(b"2", final=True), "base64-codec")
b'1'  # Should be b"12"
>>> e = codecs.getincrementalencoder("quopri-codec")()
>>> e.encode(b"1" * 50) + e.encode(b"2" * 50, final=True)
b'1111111111111111111111111111111111111111111111111122222222222222222222222222222222222222222222222222'
# I suspect the line should have been split in two
>>> e = codecs.getincrementalencoder("uu-codec")()
>>> codecs.decode(e.encode(b"1") + e.encode(b"2", final=True), "uu-codec")
b'1'  # Should be b"12"

I also noticed iterdecode() does not work with “uu-codec”:

>>> bytes().join(codecs.iterdecode(iter((b"begin 666 <data>\n \nend\n",)), "uu-codec"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/codecs.py", line 1032, in iterdecode
    output = decoder.decode(b"", True)
  File "/usr/lib/python3.3/encodings/uu_codec.py", line 80, in decode
    return uu_decode(input, self.errors)[0]
  File "/usr/lib/python3.3/encodings/uu_codec.py", line 45, in uu_decode
    raise ValueError('Missing "begin" line in input data')
ValueError: Missing "begin" line in input data

And iterencode() does not work with any of the byte encoders, because it does not know what kind of empty string to pass to
IncrementalEncoder.encode(final=True):

>>> bytes().join(codecs.iterencode(iter(()), "base64-codec"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/codecs.py", line 1014, in iterencode
    output = encoder.encode("", True)
  File "/usr/lib/python3.3/encodings/base64_codec.py", line 31, in encode
    return base64.encodebytes(input)
  File "/usr/lib/python3.3/base64.py", line 343, in encodebytes
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str

Finally, incremental UTF-7 encoding is suboptimal, and the decoder seems to buffer unlimited data, both defeating the purpose of using an incremental codec:

>>> bytes().join(codecs.iterencode("\xA9" * 2, "utf-7"))
b'+AKk-+AKk-'  # b"+AKkAqQ-" would be better
>>> d = codecs.getincrementaldecoder("utf-7")()
>>> d.decode(b"+")
''
>>> any(d.decode(b"AAAAAAAA" * 100000) for _ in range(10))
False  # No data returned: everything must be buffered
>>> d.decode(b"-") == "\x00" * 3000000
True  # Returned all buffered data in one go

----------
components: Library (Lib)
messages: 207374
nosy: vadmium
priority: normal
severity: normal
status: open
title: Many incremental codecs don’t handle fragmented data
type: behavior
versions: Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue20132>
_______________________________________
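
As a rough illustration of what fragment-tolerant decoding could look like for the hex-codec case reported above, the sketch below builds on codecs.BufferedIncrementalDecoder and holds back a dangling hex digit until its partner arrives. BufferedHexDecoder is a hypothetical name used only for this example; it is not part of the standard library and is not a proposed patch.

```python
import binascii
import codecs


class BufferedHexDecoder(codecs.BufferedIncrementalDecoder):
    """Hypothetical bytes-to-bytes hex decoder that tolerates split digit pairs."""

    def _buffer_decode(self, input, errors, final):
        data = bytes(input)
        if final:
            # At end of input everything must decode; an odd length raises
            # binascii.Error, as the one-shot codec would.
            usable = len(data)
        else:
            # Decode only complete digit pairs; the base class keeps the
            # unconsumed remainder in self.buffer for the next call.
            usable = len(data) - (len(data) % 2)
        return binascii.unhexlify(data[:usable]), usable


decoder = BufferedHexDecoder()
first = decoder.decode(b"3")               # incomplete pair, nothing emitted yet
second = decoder.decode(b"3", final=True)  # pair completed
result = first + second
```

With the same two fragments that trigger "Odd-length string" in the report, this sketch returns the single byte for hex 33 instead of raising.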