[issue7651] Python3: guess text file charset using the BOM

STINNER Victor Wed, 06 Jan 2010 19:04:08 -0800

New submission from STINNER Victor <victor.stin...@haypocalc.com>:

If the file starts with a BOM, open(filename) should be able to guess the 
charset. It would be helpful for many high level modules:


 - #7519: ConfigParser
 - #7185: csv
 - and any module using open() to read a text file

Currently, the user has to choose between UTF-8 and UTF-8-SIG to skip the UTF-8 
BOM. For UTF-16, the user has to specify UTF-16-LE or UTF-16-BE, even if the 
file starts with a BOM (which should be the case most of the time).
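
To illustrate the UTF-8 case, here is a minimal sketch using only the standard 
codecs (the sample text is made up for the example): decoding with UTF-8 leaves 
the BOM in the text as U+FEFF, while UTF-8-SIG skips it.

```python
import codecs

# Bytes as they would appear in a file saved with a UTF-8 BOM.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain UTF-8 keeps the BOM as a leading U+FEFF character.
as_utf8 = data.decode("utf-8")

# UTF-8-SIG strips the BOM while decoding.
as_sig = data.decode("utf-8-sig")

print(repr(as_utf8))
print(repr(as_sig))
```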

The idea is to delay the creation of the decoder and the encoder. Just after 
reading the first chunk, try to guess the charset by searching for a BOM (if 
the charset is unknown). If no BOM is found, fall back to the current guess code 
(os.device_charset() or locale.getpreferredencoding()).
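
The detection step could look roughly like the sketch below. It is not the 
proof of concept itself: the table _BOMS and the helper guess_encoding() are 
names invented for this example, and the fallback only uses 
locale.getpreferredencoding(). The UTF-32 BOMs are tested first because the 
UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.

```python
import codecs
import locale

# Hypothetical lookup table; UTF-32 must come before UTF-16 because
# BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def search_bom(chunk):
    """Return (charset, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if chunk.startswith(bom):
            return name, len(bom)
    return None, 0


def guess_encoding(chunk):
    """Guess from the first chunk; fall back to the locale encoding."""
    name, bom_len = search_bom(chunk)
    if name is None:
        # No BOM: keep the existing behaviour (assumed fallback).
        return locale.getpreferredencoding(False), 0
    return name, bom_len
```

The returned length tells the caller how many bytes to skip before handing the 
rest of the chunk to the decoder.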

Affected charsets: UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE, UTF-32-BE. Binary 
files are not affected. If an encoding is passed to open(), the behaviour 
is unchanged.

I wrote a proof of concept, but there are still open issues:

 - append mode: should we seek to zero to read the BOM?
   old=tell(); seek(0); bytes=read(4); seek(old); search_bom(bytes)
 - read+write: should we guess the charset using the BOM if the first action is 
   a write? Or only search for a BOM if the first action is a read?
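
For the append-mode question, the seek/read/seek sequence could be written as 
the sketch below. The helper name peek_bom_bytes() is invented for this 
example, and it assumes the underlying binary stream is both readable and 
seekable (a file opened with "a+b", for instance, not plain "ab").

```python
import codecs
import io


def peek_bom_bytes(raw):
    """Read the first 4 bytes of a stream and restore its position."""
    old = raw.tell()
    try:
        raw.seek(0)
        head = raw.read(4)
    finally:
        # Always go back to where the caller was, e.g. the end of the file.
        raw.seek(old)
    return head


# Simulate a stream positioned at the end, as in append mode.
buf = io.BytesIO(codecs.BOM_UTF8 + b"abc")
buf.seek(0, io.SEEK_END)
end_pos = buf.tell()
head = peek_bom_bytes(buf)
```

Four bytes are enough to cover the longest BOMs in question (UTF-32).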

----------
components: Unicode
messages: 97341
nosy: haypo
severity: normal
status: open
title: Python3: guess text file charset using the BOM
versions: Python 2.7, Python 3.2

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue7651>
_______________________________________