On Thu, Mar 28, 2013 at 8:37 PM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
>>> I also wonder why the implementation bothers keeping a UTF-8
>>> representation. That sounds like premature optimization to me. Surely
>>> you only need it when writing to a file with UTF-8 encoding? For most
>>> strings, that will never happen.
>>
>> ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
>> of content will go out in the same encoding it came in in, so it makes
>> sense to hang onto it where possible.
>
> Not to me. That almost doubles the size of the string, on the off-chance
> that you'll need the UTF-8 encoding. Which for many uses, you don't, and
> even if you do, it seems like premature optimization to keep it around
> just in case. Encoding to UTF-8 will be fast for small N, and for large
> N, why carry around (potentially) multiple megabytes of duplicated data
> just in case the encoded version is needed some time?
From the PEP:

"""
A new function PyUnicode_AsUTF8 is provided to access the UTF-8
representation. It is thus identical to the existing _PyUnicode_AsString,
which is removed. The function will compute the utf8 representation when
first called. Since this representation will consume memory until the
string object is released, applications should use the existing
PyUnicode_AsUTF8String where possible (which generates a new string
object every time). APIs that implicitly converts a string to a char*
(such as the ParseTuple functions) will use PyUnicode_AsUTF8 to compute
a conversion.
"""

So the utf8 representation is not populated when the string is created,
but only when a utf8 representation is requested -- and only when it is
requested through the API that returns a char*, not through the API that
returns a bytes object.

--
http://mail.python.org/mailman/listinfo/python-list
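Both points can be illustrated from pure Python, without touching the C API (a rough sketch, not the C-level functions themselves): `str.encode('utf-8')` corresponds to the PyUnicode_AsUTF8String path and builds a fresh bytes object on every call, caching nothing on the string; and `sys.getsizeof` shows that a freshly created string carries only its compact fixed-width representation from PEP 393 (1, 2, or 4 bytes per character), not an extra UTF-8 copy.

```python
import sys

s = "hello world"

# str.encode() follows the PyUnicode_AsUTF8String behaviour described
# in the PEP: each call generates a brand-new bytes object that the
# caller owns, so nothing is cached on the string itself.
a = s.encode("utf-8")
b = s.encode("utf-8")
print(a is b)  # distinct objects on every call

# Per PEP 393, a new string stores only its narrowest fixed-width form:
# one byte per character for Latin-1 text, two for other BMP text.
ascii_str = "x" * 100
bmp_str = "\u0100" * 100  # code point U+0100 needs 2 bytes per character
print(sys.getsizeof(ascii_str) < sys.getsizeof(bmp_str))
```

Note that this only shows the bytes-returning API; the lazy utf8 cache itself is filled by C-level callers such as the ParseTuple functions, which is not directly observable from Python code.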