New submission from Amaury Forgeot d'Arc <amaur...@gmail.com>:

First write a utf-16 file with its signature (the BOM):

>>> f1 = open('utf16.txt', 'w', encoding='utf-16')
>>> f1.write('0123456789')
>>> f1.close()

Then read it twice:

>>> f2 = open('utf16.txt', 'r', encoding='utf-16')
>>> print('read1', ascii(f2.read()))
read1 '0123456789'
>>> f2.seek(0)
0
>>> print('read2', ascii(f2.read()))
read2 '\ufeff0123456789'

The second read returns the BOM!
This is because the zero in seek(0) is a "cookie" which contains both the
position and the decoder state. Unfortunately, state=0 means 'endianness
has been determined: native order'.
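
The same effect can be seen at the codec level, without a file. What follows is only a minimal sketch using the standard codecs.getincrementaldecoder() API; the variable names are illustrative and this is not code from the patch. A decoder that has already settled on a byte order treats a later BOM as an ordinary U+FEFF character, and only reset() makes it look for a BOM again:

```python
import codecs

# The generic 'utf-16' codec prepends a BOM when encoding.
data = '0123456789'.encode('utf-16')

decoder = codecs.getincrementaldecoder('utf-16')()

# Fresh decoder: the BOM is consumed to determine the byte order.
first = decoder.decode(data)

# Same decoder, byte order already determined: the BOM in the input
# is decoded like any other character and shows up in the result.
second = decoder.decode(data)

# After reset() the decoder goes back to expecting a BOM.
decoder.reset()
third = decoder.decode(data)

print('fresh      ', ascii(first))
print('no reset   ', ascii(second))
print('after reset', ascii(third))
```

This is why restoring "state 0" on seek(0) is not enough: the decoder has to be returned to its initial, BOM-detecting state.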

A suggestion: handle seek(0) as a special value that calls
decoder.reset().
The attached patch implements this idea.
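
A possible way to check the intended behavior is sketched here. The helper name read_twice and the use of a temporary file are assumptions made for illustration; they are not part of the patch. On an interpreter where seek(0) resets the decoder, both reads should return the same text with no BOM:

```python
import os
import tempfile


def read_twice(encoding='utf-16', text='0123456789'):
    """Write text with a BOM, read it, seek(0), and read it again."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    try:
        with open(path, 'w', encoding=encoding) as f:
            f.write(text)
        with open(path, 'r', encoding=encoding) as f:
            first = f.read()
            f.seek(0)
            second = f.read()
    finally:
        os.remove(path)
    return first, second


first, second = read_twice()
print('read1', ascii(first))
print('read2', ascii(second))
```

On an affected interpreter the second value would instead start with '\ufeff', as in the session above.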

----------
files: io_utf16.patch
keywords: patch
messages: 79299
nosy: amaury.forgeotdarc
priority: critical
severity: normal
status: open
title: utf-16 BOM is not skipped after seek(0)
versions: Python 3.0
Added file: http://bugs.python.org/file12627/io_utf16.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4862>
_______________________________________