STINNER Victor <victor.stin...@haypocalc.com> added the comment:

Attached patch is a partial fix: it supports UTF-16-LE, UTF-16-BE and UTF-32-LE. Some remarks about my patch (sketches illustrating each point follow the list):

* UTF-32-BE is not supported, because I'm too lazy tonight to finish the patch and because such a file begins with 0x00 0x00, whereas the parser doesn't like nul bytes.

* I disabled the cookie check if the file starts with a BOM (the cookie is ignored), because the charset name is not normalized: if the cookie is not spelled exactly like the hardcoded charset name (e.g. "UTF-16LE"), the check fails. E.g. "utf-16le" != "UTF-16LE" :-(

* compile() would require much more effort to support UTF-16-* and UTF-32-*, because compile() simply rejects any string containing a nul byte. That's because it uses functions like strlen() :-/ This is why the unit test uses subprocess([sys.executable, ...]) and not simply compile().
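For illustration, here is the BOM sniffing expressed in Python (the patch does this in the C tokenizer; detect_bom() is just a name I made up here):

import codecs

def detect_bom(prefix):
    # Order matters: the UTF-32-LE BOM (FF FE 00 00) starts with the
    # UTF-16-LE BOM (FF FE), so it must be tested first.  UTF-32-BE is
    # omitted, as in the patch: its BOM (00 00 FE FF) begins with nul
    # bytes, which the parser rejects anyway.
    if prefix.startswith(codecs.BOM_UTF32_LE):
        return 'utf-32-le'
    if prefix.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16-le'
    if prefix.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16-be'
    if prefix.startswith(codecs.BOM_UTF8):
        return 'utf-8'
    return None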
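About the cookie check: the comparison fails because raw spellings are compared; codecs.lookup() would normalize them. A quick demonstration:

import codecs

# The raw spellings differ...
assert "utf-16le" != "UTF-16LE"

# ...but both resolve to the same canonical codec name:
assert codecs.lookup("utf-16le").name == "utf-16-le"
assert codecs.lookup("UTF-16LE").name == "utf-16-le"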
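The unit test therefore does roughly the following (a sketch, not the exact test code; the temporary file handling and script content are mine):

import codecs, subprocess, sys, tempfile

# compile() rejects source containing nul bytes, so run the script in
# a child interpreter instead: the child sniffs the BOM itself.
source = "print('BOM works')\n".encode("utf-16-le")
with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
    f.write(codecs.BOM_UTF16_LE + source)
    path = f.name

# This only succeeds with the patch applied, of course.
exit_code = subprocess.call([sys.executable, path])
assert exit_code == 0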
Supporting UTF-{16,32}-{LE,BE} fully would be nice, but it requires hacking the parser (especially the compile() builtin function) to support nul bytes...

----------
keywords: +patch
Added file: http://bugs.python.org/file13409/tokenizer_bom.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue1503789>
_______________________________________