[issue4862] utf-16 BOM is not skipped after seek(0)

Marc-Andre Lemburg Thu, 08 Jan 2009 05:50:59 -0800

Marc-Andre Lemburg <m...@egenix.com> added the comment:

On 2009-01-07 01:21, Amaury Forgeot d'Arc wrote:
> First write a utf-16 file with its signature:
> 
> >>> f1 = open('utf16.txt', 'w', encoding='utf-16')
> >>> f1.write('0123456789')
> >>> f1.close()
> 
> Then read it twice:
> 
> >>> f2 = open('utf16.txt', 'r', encoding='utf-16')
> >>> print('read1', ascii(f2.read()))
> read1 '0123456789'
> >>> f2.seek(0)
> 0
> >>> print('read2', ascii(f2.read()))
> read2 '\ufeff0123456789'
> 
> The second read returns the BOM!
> This is because the zero in seek(0) is a "cookie" which contains both the
> position and the decoder state. Unfortunately, state=0 means 'endianness
> has been determined: native order'.
> 
> Maybe a suggestion: handle seek(0) as a special value which calls
> decoder.reset().
> The patch implements this idea.
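
The remembered byte order described in the report can be seen in isolation
with the incremental decoder. This is only a sketch: it assumes a current
CPython, uses the incremental decoder rather than the StreamReader class,
and picks a little-endian BOM purely for illustration.

```python
import codecs

# BOM plus two characters; little-endian is chosen only for illustration.
data = codecs.BOM_UTF16_LE + '01'.encode('utf-16-le')

decoder = codecs.getincrementaldecoder('utf-16')()

first = decoder.decode(data)    # BOM consumed, byte order remembered
second = decoder.decode(data)   # same bytes, no reset: BOM decoded as U+FEFF
decoder.reset()                 # forget the remembered byte order
third = decoder.decode(data)    # BOM is detected and skipped again

print(ascii(first), ascii(second), ascii(third))
```

Feeding the same bytes twice stands in for rewinding the file: once the
decoder has settled on a byte order, a BOM at the start of the data is
treated as an ordinary character.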


This is a problem with the utf_16.py codec, not the io layer.
Opening a file in append mode is something that the io layer
would have to handle, since the codec doesn't know anything about
the underlying file mode.

Using .reset() will not help. The code for the StreamReader
and StreamWriter in utf_16.py will have to be modified to undo
the adjustment of the .encode() and .decode() methods after using
.seek(0).
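
As a rough sketch of what undoing that adjustment could look like, here is
a toy reader. BomAwareReader and its methods are hypothetical names and
not the actual utf_16.py code. It relies on codecs.utf_16_ex_decode, an
undocumented helper, and assumes that it returns the detected byte order
as -1 (little-endian) or 1 (big-endian).

```python
import codecs
import io


class BomAwareReader:
    """Hypothetical, simplified reader; NOT the real utf_16.py StreamReader."""

    def __init__(self, stream):
        self.stream = stream

    def decode(self, data):
        # First call: detect the byte order from the BOM and strip the BOM.
        text, consumed, byteorder = codecs.utf_16_ex_decode(
            data, 'strict', 0, False)
        # The "adjustment": shadow this method with an endian-specific
        # decoder stored on the instance, so later calls skip BOM detection.
        if byteorder == -1:
            self.decode = codecs.utf_16_le_decode
        elif byteorder == 1:
            self.decode = codecs.utf_16_be_decode
        return text, consumed

    def read(self):
        text, _ = self.decode(self.stream.read())
        return text

    def seek(self, offset):
        self.stream.seek(offset)
        if offset == 0:
            # Undo the adjustment: drop the instance attribute so the
            # class-level, BOM-detecting decode() is used again.
            self.__dict__.pop('decode', None)


data = '0123456789'.encode('utf-16')   # BOM + text, native byte order
reader = BomAwareReader(io.BytesIO(data))

read1 = reader.read()
reader.stream.seek(0)    # rewind WITHOUT undoing the adjustment
read2 = reader.read()    # BOM comes back as U+FEFF
reader.seek(0)           # rewind and undo the adjustment
read3 = reader.read()

print(ascii(read1), ascii(read2), ascii(read3))
```

Only offset 0 is special-cased here; any other offset leaves the adjusted
decoder in place.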

Note that there's also the case .seek(1) - I guess this must
be considered as resulting in undefined behavior.

----------
nosy: +lemburg

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4862>
_______________________________________