On 29/03/2013 00:54, Chris Angelico wrote:
On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
strings. It's only strings in the SMPs that could need surrogate pairs,
and they don't need them in Python's implementation since it's a full
32-bit implementation. So where do the surrogate pairs come into this?
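To make the distinction concrete, here is a small sketch (assuming Python 3.3 or later, and using U+1F600 purely as an example SMP character). The str itself holds one code point; the surrogate pair only shows up once the text is put into a 16-bit form such as UTF-16:

```python
# Illustration only: one astral (SMP) character, U+1F600.
s = "\U0001F600"

# Under PEP 393 the str holds a single code point, no surrogates.
code_points = len(s)

# Encoding to UTF-16 is where the surrogate pair appears: two 16-bit units.
utf16 = s.encode("utf-16-le")
units = [int.from_bytes(utf16[i:i + 2], "little")
         for i in range(0, len(utf16), 2)]

print(code_points)              # number of code points in the str
print([hex(u) for u in units])  # high surrogate, then low surrogate
```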
PEP 393 says:
"""
wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length). wstr_length
differs from length only if there are surrogate pairs in the
representation.
utf8_length, utf8: UTF-8 representation (null-terminated).
data: shortest-form representation of the unicode string. The string
is null-terminated (in its respective representation).
All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.
"""
If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip. Same with...
I also wonder why the implementation bothers keeping a UTF-8
representation. That sounds like premature optimization to me. Surely you
only need it when writing to a file with UTF-8 encoding? For most
strings, that will never happen.
... the UTF-8 version. It'll keep it if it has it, and not otherwise. A lot
of content will go out in the same encoding it came in in, so it makes
sense to hang onto it where possible.
Though, from the same quote: The UTF-8 representation is
null-terminated. Does this mean that it can't be used if there might
be a \0 in the string?
You could ask the same question about any encoding.
It's only an issue if it's passed to a C function which expects a
null-terminated string.
Minor nitpick, btw:
(in which cast wstr_length differs form length)
Should be "in which case" and "from". Who has the power to correct
typos in PEPs?