On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
> strings. It's only strings in the SMPs that could need surrogate pairs,
> and they don't need them in Python's implementation since it's a full
> 32-bit implementation. So where do the surrogate pairs come into this?

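To make the distinction concrete, here is a minimal sketch, assuming Python 3.3 or later (where PEP 393 applies). The helper name utf16_units is made up for illustration. It only shows Python-level behaviour: a str outside the BMP has length 1, but the same text expressed in 16-bit code units needs a surrogate pair.

```python
import struct

def utf16_units(s):
    """Return the list of 16-bit code units that encode s as UTF-16."""
    raw = s.encode("utf-16-le")  # little-endian, no BOM
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

bmp = "\u20ac"      # EURO SIGN, inside the BMP
smp = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, in the SMP

# One code point each at the Python level (3.3+).
print(len(bmp), len(smp))

# One 16-bit unit for the BMP character, two (a surrogate pair) for the SMP one.
print([hex(u) for u in utf16_units(bmp)])
print([hex(u) for u in utf16_units(smp)])
```

So the pairs only show up once the text is expressed in 16-bit units, which is what a 16-bit wchar_t form would be.
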
PEP 393 says:
"""
wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length).
wstr_length differs from length only if there are surrogate pairs in
the representation.

utf8_length, utf8: UTF-8 representation (null-terminated).

data: shortest-form representation of the unicode string. The string
is null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only
while the string is being created. If the representation is absent,
the pointer is NULL, and the corresponding length field may contain
arbitrary data.
"""

If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip. Same with...

> I also wonder why the implementation bothers keeping a UTF-8
> representation. That sounds like premature optimization to me. Surely you
> only need it when writing to a file with UTF-8 encoding? For most
> strings, that will never happen.

... the UTF-8 version. It'll keep it if it has it, and not otherwise.
A lot of content will go out in the same encoding it came in in, so it
makes sense to hang onto it where possible.

Though, from the same quote: The UTF-8 representation is
null-terminated. Does this mean that it can't be used if there might
be a \0 in the string?

Minor nitpick, btw:
> (in which cast wstr_length differs form length)

Should be "in which case" and "from". Who has the power to correct
typos in PEPs?

ChrisA
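
P.S. On the \0 question, a quick sketch of what can be checked from Python itself. It says nothing about whether CPython reuses its cached UTF-8 buffer in this case; it only shows that an embedded NUL survives a UTF-8 round-trip. The PEP text quoted above lists a utf8_length next to utf8, so presumably the terminator is a convenience and the length field is what counts, but that is my assumption. The variable names are made up for illustration.

```python
s = "spam\0eggs"             # a str with an embedded NUL (U+0000)
encoded = s.encode("utf-8")  # the NUL becomes a single 0x00 byte

# The length is carried explicitly, so the NUL is just another byte.
print(len(s), len(encoded))
print(encoded)

# Decoding gives back the original string, NUL included.
decoded = encoded.decode("utf-8")
print(decoded == s)
```
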