New submission from STINNER Victor <victor.stin...@haypocalc.com>:

HZ and ISO-2022 family codecs may generate an escape sequence at the end of a 
stream. For example, the HZ codec uses '~{' to switch from ASCII to GB2312, 
and '~}' to reset the encoding to ASCII. At the end of a stream, the encoding 
should be reset to ASCII. '\u804a'.encode('hz') returns b'~{AD~}', which is 
correct.
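
The one-shot behaviour can be checked from an interpreter. This is only a 
minimal sketch; the variable names are illustrative and not part of the patch:

```python
# One-shot encoding with the HZ codec: the result opens with '~{'
# (switch to GB2312) and closes with '~}' (back to ASCII).
encoded = '\u804a'.encode('hz')
print(encoded)

# Decoding the bytes gives the original character back.
decoded = encoded.decode('hz')
print(decoded == '\u804a')
```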

Incremental encoders also generate the escape sequence if the last call to 
encode() is done using final=True.
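
A short sketch of the difference the final flag makes. It assumes, as the 
CJK codecs appear to behave today, that a later encode('') call with 
final=True flushes the closing sequence; the variable names are illustrative:

```python
import codecs

# With final=True the closing escape sequence is emitted in the same call.
encoder = codecs.lookup('hz').incrementalencoder()
with_final = encoder.encode('\u804a', final=True)

# Without final=True the stream is left in GB2312 mode ...
encoder = codecs.lookup('hz').incrementalencoder()
without_final = encoder.encode('\u804a')

# ... until some later call (here with an empty string) passes final=True.
closing = encoder.encode('', final=True)

print(with_final, without_final, closing)
```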

It would be nice to be able to generate the escape sequence without the final 
flag, because sometimes you don't know which call to encode() is the last one. 
For example if you write data in a file, you may want to write the escape 
sequence at the end when the file is closed.

I propose to change the reset() method of incremental encoders: they may return 
a bytes object to close the stream. For example, the reset() method of the HZ 
codec may return b'~}' if the encoder was using GB2312 (if it previously 
emitted b'~{').

So the following three snippets should all return b'~{AD~}':

 * '\u804a'.encode('hz')
 * encoder = codecs.lookup('hz').incrementalencoder();
   encoder.encode('\u804a', final=True)
 * encoder = codecs.lookup('hz').incrementalencoder();
   encoder.encode('\u804a') + encoder.reset()

For backward compatibility, reset() returns None if there is neither a pending 
buffer nor an escape sequence to emit.
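
The proposed contract can be emulated in pure Python on top of the existing 
API. This is only a sketch: ResettingEncoder is a hypothetical wrapper, not 
the patch itself (which changes the C implementation), and its handling of 
pending input may differ from what the patch does:

```python
import codecs


class ResettingEncoder:
    """Hypothetical wrapper emulating the proposed reset() return value."""

    def __init__(self, encoding, errors='strict'):
        self._encoder = codecs.lookup(encoding).incrementalencoder(errors)

    def encode(self, text, final=False):
        return self._encoder.encode(text, final)

    def reset(self):
        # Flush the codec state by encoding an empty string with final=True,
        # then reset the underlying encoder as usual.
        tail = self._encoder.encode('', final=True)
        self._encoder.reset()
        # Proposed contract: None when there is nothing to emit.
        return tail or None


encoder = ResettingEncoder('hz')
result = encoder.encode('\u804a') + encoder.reset()

# An encoder that never left ASCII has nothing to close.
idle = ResettingEncoder('hz').reset()

print(result, idle)
```

Callers that ignore the result of reset() keep working unchanged, which is 
the point of returning None in the common case.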

--

This proposal comes from #12000: Armin Rigo noticed that the reset method of 
the incremental encoders of CJK codecs calls the decoder reset function. 
Extract from Modules/cjkcodecs/multibytecodec.c:

static PyObject *
mbiencoder_reset(MultibyteIncrementalEncoderObject *self)
{
    if (self->codec->decreset != NULL &&
        self->codec->decreset(&self->state, self->codec->config) != 0)
        return NULL;
    self->pendingsize = 0;

    Py_RETURN_NONE;
}

I suppose that codec->encreset() is not called here because we need an output 
buffer, and there is no such buffer. Or it's just a copy-paste failure.

--

I am not sure that it is really useful to emit b'~}' at the end of a HZ stream, 
but the change is simple, and if you don't care about b'~}', just ignore the 
result of reset() (as everybody does today, so it doesn't hurt).

--

Only HZ and ISO-2022 encodings implement the reset method of their incremental 
encoder. For example, encodings of the UTF family don't need to emit bytes at 
reset.

For the maximum length of reset() output: HZ may generate 2 bytes (b'~}') and 
ISO-2022 may generate 4 bytes (b'\x0F' + b'\x1B(B', i.e. SI followed by 
ESC ( B).
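
The closing bytes of a given codec can be observed with the existing API. A 
minimal sketch, assuming encode('') with final=True flushes the codec state; 
closing_bytes is a hypothetical helper. It does not exercise the 4-byte case, 
which needs an encoder that is both shifted and using a non-ASCII G0 set:

```python
import codecs


def closing_bytes(encoding, text):
    """Return the bytes emitted when the stream is finalized after text."""
    encoder = codecs.lookup(encoding).incrementalencoder()
    encoder.encode(text)  # body of the stream, discarded here
    return encoder.encode('', final=True)


hz_tail = closing_bytes('hz', '\u804a')          # HZ: back to ASCII
jp_tail = closing_bytes('iso2022_jp', '\u3042')  # ISO-2022-JP: ESC ( B
utf8_tail = closing_bytes('utf-8', '\u804a')     # stateless: nothing to emit

print(hz_tail, jp_tail, utf8_tail)
```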

--

See also issue #12100.

----------
components: Library (Lib), Unicode
files: cjk_reset_result.patch
keywords: patch
messages: 136794
nosy: arigo, gz, haypo, hyeshik.chang, lemburg
priority: normal
severity: normal
status: open
title: Reset method of the incremental encoders of CJK codecs calls the decoder 
reset function
versions: Python 3.3
Added file: http://bugs.python.org/file22096/cjk_reset_result.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12171>
_______________________________________