New submission from STINNER Victor <victor.stin...@haypocalc.com>:

I'm trying to document the encoding of all bytes arguments of the C API: see 
#9738. I tried to understand which encoding is used by PyUnicode_FromFormat*() 
(and PyErr_Format(), which calls PyUnicode_FromFormatV()). It looks like 
ISO-8859-1; see unicodeobject.c near line 1106:

    for (f = format; *f; f++) {
        if (*f == '%') {
            ...
        } else
            *s++ = *f; <~~~~ here
    }

... oh wait, it doesn't work for non-ascii text! Test in gdb:

(gdb) print _PyObject_Dump(PyUnicodeUCS2_FromFormat("iso-8859-1:\xd0\xff"))
object  : 'iso-8859-1:\uffd0\uffff'
type    : str
refcount: 1
address : 0x83d5d80

b'\xd0\xff' is decoded as '\uffd0\uffff' :-( It's a bug.
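
The 0xff high byte in the gdb output looks like sign extension: if char is 
signed on the platform, a byte such as 0xd0 is negative and widens to 0xffd0 
when it is stored in a 16-bit unit. Below is a minimal standalone sketch of 
that effect, not the actual unicodeobject.c code. The names ucs2_t, 
widen_signed and widen_latin1 are made up for illustration, and ucs2_t only 
stands in for a 16-bit Py_UNICODE.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit Py_UNICODE (UCS2 build). */
typedef uint16_t ucs2_t;

/* Mimics "*s++ = *f" on a platform where char is signed:
 * a byte >= 0x80 is negative, so widening it sign-extends. */
static ucs2_t widen_signed(signed char byte)
{
    return (ucs2_t)byte;
}

/* Converting through unsigned char first keeps the value in
 * 0x00..0xFF, which is exactly an ISO-8859-1 decode. */
static ucs2_t widen_latin1(char byte)
{
    return (ucs2_t)(unsigned char)byte;
}
```

With these helpers, the byte 0xd0 becomes 0xffd0 through widen_signed() and 
0x00d0 through widen_latin1(); ASCII bytes are unaffected either way.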

--

PyUnicode_FromFormatV() should either raise an error on a non-ASCII format 
character, or decode it correctly as... ISO-8859-1 or something else. It's 
difficult to support multibyte encodings (like UTF-8), but ISO-8859-1 is fine. 
If we choose to raise an error, how can the user format a non-ASCII string? 
Using its_unicode_format.format(...arguments...) or 
its_unicode_format % arguments? Is it easy to call these methods in C?
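
To make the two options concrete, here is a rough standalone sketch of both 
policies applied to the literal (non-'%') part of a format string. It is only 
an illustration, not a patch: the function names copy_ascii_strict and 
copy_latin1, the uint16_t output buffer and the -1 error convention are all 
assumptions of mine, and the real function would raise a Python exception 
instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Option 1 (strict): accept only ASCII bytes.
 * Returns the number of units written, or -1 on a non-ASCII byte
 * or when the output buffer is too small. */
static long copy_ascii_strict(const char *src, uint16_t *dst, size_t cap)
{
    size_t n = 0;
    for (; *src != '\0'; src++) {
        unsigned char byte = (unsigned char)*src;
        if (byte > 0x7f)
            return -1;          /* non-ASCII byte: reject */
        if (n == cap)
            return -1;          /* output buffer too small */
        dst[n++] = byte;
    }
    return (long)n;
}

/* Option 2 (ISO-8859-1): every byte maps to the code point with the
 * same value, so no byte can fail to decode. */
static long copy_latin1(const char *src, uint16_t *dst, size_t cap)
{
    size_t n = 0;
    for (; *src != '\0'; src++) {
        if (n == cap)
            return -1;          /* output buffer too small */
        dst[n++] = (unsigned char)*src;
    }
    return (long)n;
}
```

The strict variant rejects "a\xd0", while the ISO-8859-1 variant turns 
"\xd0\xff" into the two units 0x00d0 and 0x00ff. A multibyte encoding such as 
UTF-8 would need a stateful decoder instead of this byte-per-unit loop.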

----------
components: Interpreter Core, Unicode
messages: 115542
nosy: haypo
priority: normal
severity: normal
status: open
title: PyUnicode_FromFormatV() doesn't handle non-ascii text correctly
versions: Python 3.2

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9769>
_______________________________________
