[issue7649] "u'%c' % char" broken for chars in range '\x80'-'\xFF'

Marc-Andre Lemburg Wed, 24 Feb 2010 02:02:51 -0800

Marc-Andre Lemburg <[email protected]> added the comment:

Amaury Forgeot d'Arc wrote:
> 
> Amaury Forgeot d'Arc <[email protected]> added the comment:
> 
>> Could you please check for chars above 0x7f first and then use
>> PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API
> 
> I concur: PyUnicode_FromStringAndSize() decodes with utf-8 whereas the 
> expected conversion char->unicode should use the default encoding (ascii).
> But why is it necessary to check for chars above 0x7f?


The Python default encoding has to be ASCII compatible,
so it's better to use a short-cut for pure-ASCII characters
and avoid the complete round-trip via a temporary Unicode
object.
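
As a rough illustration of the short-cut, here is a minimal Python 3 sketch. It is not the actual C patch; the helper name `char_to_unicode` and its `default_encoding` parameter are made up for this example. It only models the idea: bytes below 0x80 are mapped directly, and everything else goes through the codec.

```python
# Illustrative sketch only: models the proposed C logic in Python 3.
# `char_to_unicode` and `default_encoding` are hypothetical names.

def char_to_unicode(byte_value, default_encoding="ascii"):
    """Convert a single byte value (0-255) to a one-character str."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("byte_value must be in range(256)")
    if byte_value < 0x80:
        # Pure ASCII: every ASCII-compatible encoding maps this byte
        # to the same code point, so no decoder call is needed.
        return chr(byte_value)
    # Non-ASCII byte: defer to the codec, which may raise
    # UnicodeDecodeError (as the ASCII codec does for these bytes).
    return bytes([byte_value]).decode(default_encoding)
```

With the default of "ascii", `char_to_unicode(0x41)` takes the fast path, while `char_to_unicode(0x80)` reaches the decoder and raises UnicodeDecodeError instead of silently being treated as UTF-8.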

>> (this API should not have been backported from the Python 3.x
>> in Python 2.6,
> This function is still useful when the chars come from a C string literal in 
> the source code (btw there should be something about the encoding used in C 
> files). But it's not always correctly used even in 3.x, in posixmodule.c for 
> example.

The function is really just another interface to the
PyUnicode_DecodeUTF8() API, and its name is misleading:

Python 2.x uses the default encoding for converting strings without
known encoding to Unicode, the docs for the API say that
it decodes Latin-1 (!), and the interface makes it look like
a drop-in replacement for PyString_FromStringAndSize(), which
it isn't for Python 2.x.
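
To make the mismatch concrete, here is a small Python 3 sketch showing how the three codecs in question (Latin-1, ASCII, UTF-8) treat a single byte in the 0x80-0xFF range. The helper `try_decode` and the sample byte are assumptions made up for illustration.

```python
# Python 3 sketch; `raw` is an arbitrary example byte string and
# `try_decode` is a hypothetical helper, not part of any API.
raw = b"\xe9"  # one byte in the range 0x80-0xFF


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


latin1_result = try_decode(raw, "latin-1")  # U+00E9: Latin-1 maps bytes 1:1
ascii_result = try_decode(raw, "ascii")     # None: byte is outside ASCII
utf8_result = try_decode(raw, "utf-8")      # None: a lone 0xE9 is not valid UTF-8
```

The same byte gives three different outcomes, so which decoder an API uses is not a detail its name or docs can leave vague.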

For Python 3.x, the default encoding is fixed to UTF-8, so the
situation is different (though the docs are still wrong).
However, I don't see the advantage of using a less explicit
name over the direct use of PyUnicode_DecodeUTF8().

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue7649>
_______________________________________