On Fri, 29 Mar 2013 11:54:41 +1100, Chris Angelico wrote:

> On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
> <steve+comp.lang.pyt...@pearwood.info> wrote:
>> ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
>> strings. It's only strings in the SMPs that could need surrogate pairs,
>> and they don't need them in Python's implementation since it's a full
>> 32-bit implementation. So where do the surrogate pairs come into this?
>
> PEP 393 says:
> """
> wstr_length, wstr: representation in platform's wchar_t
> (null-terminated). If wchar_t is 16-bit, this form may use surrogate
> pairs (in which cast wstr_length differs form length). wstr_length
> differs from length only if there are surrogate pairs in the
> representation.
>
> utf8_length, utf8: UTF-8 representation (null-terminated).
>
> data: shortest-form representation of the unicode string. The string is
> null-terminated (in its respective representation).
>
> All three representations are optional, although the data form is
> considered the canonical representation which can be absent only while
> the string is being created. If the representation is absent, the
> pointer is NULL, and the corresponding length field may contain
> arbitrary data.
> """
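
For what it's worth, the wstr_length remark in the quoted PEP text can be illustrated from pure Python. This is only a sketch, assuming CPython 3.3 or later; utf16_code_units is a helper name made up here, not part of any API. It compares len() with the number of 16-bit units a UTF-16 form would need. Only the SMP sample should show a difference, which is the case where a 16-bit wchar_t form would have to carry a surrogate pair.

```python
def utf16_code_units(s):
    """Count the 16-bit code units needed to hold s in UTF-16.

    Hypothetical helper for illustration: each BMP character takes one
    unit, each character outside the BMP takes two (a surrogate pair).
    """
    return len(s.encode("utf-16-le")) // 2


samples = {
    "ascii": "abc",
    "latin-1": "caf\xe9",
    "bmp": "\u20ac100",        # U+20AC EURO SIGN is inside the BMP
    "smp": "a\U0001F600",      # U+1F600 is outside the BMP
}

for name, s in samples.items():
    # len() counts code points; the second number counts UTF-16 units.
    print(name, len(s), utf16_code_units(s))
```

The script only measures an encoded copy of the string. It says nothing about whether a given str object actually holds a wstr form internally.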
All the words are in English (well, most of them...) but what does it mean?

> If the string was created from a wchar_t string, that string will be
> retained, and presumably can be used to re-output the original for a
> clean and fast round-trip.

Under what circumstances will a string be created from a wchar_t string?
How, and why, would such a string be created? Why would Python still
support strings containing surrogates when it now has a nice, shiny,
surrogate-free flexible representation?

>> I also wonder why the implementation bothers keeping a UTF-8
>> representation. That sounds like premature optimization to me. Surely
>> you only need it when writing to a file with UTF-8 encoding? For most
>> strings, that will never happen.
>
> ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
> of content will go out in the same encoding it came in in, so it makes
> sense to hang onto it where possible.

Not to me. That almost doubles the size of the string, on the off-chance
that you'll need the UTF-8 encoding. Which for many uses, you don't, and
even if you do, it seems like premature optimization to keep it around
just in case. Encoding to UTF-8 will be fast for small N, and for large
N, why carry around (potentially) multiple megabytes of duplicated data
just in case the encoded version is needed some time?

> Though, from the same quote: The UTF-8 representation is
> null-terminated. Does this mean that it can't be used if there might be
> a \0 in the string?
>
> Minor nitpick, btw:
>> (in which cast wstr_length differs form length)
> Should be "in which case" and "from". Who has the power to correct typos
> in PEPs?
>
> ChrisA

--
http://mail.python.org/mailman/listinfo/python-list
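
The size and \0 questions above can be poked at from the interpreter. The following is a rough sketch, assuming CPython 3.3 or later; bytes_per_char is a name invented for this example, and sys.getsizeof reports implementation-specific numbers that may not hold elsewhere. It estimates the per-character width of the internal data form, prints the UTF-8 length of the same character next to it, and then checks that a string with an embedded \0 survives a UTF-8 round trip at the Python level.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate internal storage per character for a string of ch.

    Hypothetical helper: subtracting the sizes of two strings of
    different lengths cancels the fixed object header. The result is
    specific to the interpreter that runs it.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n


for name, ch in [
    ("ascii", "a"),
    ("latin-1", "\xe9"),
    ("bmp", "\u20ac"),
    ("smp", "\U0001F600"),
]:
    # Second number: internal width; third: UTF-8 bytes for one character.
    print(name, bytes_per_char(ch), len(ch.encode("utf-8")))

# An embedded NUL is an ordinary character as far as str is concerned.
s = "a\0b"
encoded = s.encode("utf-8")
print(len(s), encoded, encoded.decode("utf-8") == s)
```

On CPython I would expect widths of 1, 1, 2 and 4 bytes, but treat that as an observation about one implementation, not a guarantee. The quoted PEP text lists a utf8_length field next to utf8, which suggests the terminator is there for the convenience of C code rather than being the only record of where the string ends. The script does not show whether or when a cached UTF-8 copy is kept.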