On Fri, Aug 31, 2012 at 6:32 AM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> That's one thing that I'm unclear about -- under what circumstances will
> a string be in compact versus non-compact form?

I understand it to be entirely dependent on which API is used to
construct the string. The legacy API generates legacy strings, and the
new API generates compact strings. From the comments in unicodeobject.h:

    /* ASCII-only strings created through PyUnicode_New use the
       PyASCIIObject structure. state.ascii and state.compact are set,
       and the data immediately follow the structure. utf8_length and
       wstr_length can be found in the length field; the utf8 pointer
       is equal to the data pointer. */
    ...
    Legacy strings are created by PyUnicode_FromUnicode() and
    PyUnicode_FromStringAndSize(NULL, size) functions. They become
    ready when PyUnicode_READY() is called.
    ...
    /* Non-ASCII strings allocated through PyUnicode_New use the
       PyCompactUnicodeObject structure. state.compact is set, and the
       data immediately follow the structure. */

Since I'm not sure that this is clear, note that compact vs. legacy does
not describe which character width is used (except that PyASCIIObject
strings are always 1 byte wide). Legacy and compact strings can each use
the 1, 2, or 4 byte representations. "Compact" merely denotes that the
character data is stored inline with the struct (as opposed to being
stored somewhere else and pointed at by the struct), not the relative
size of the string data. Again from the comments:

    Compact strings use only one memory block (structure + characters),
    whereas legacy strings use one block for the structure and one block
    for characters.

Cheers,
Ian
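
P.S. A rough way to see the character widths from Python itself, as a
minimal sketch rather than anything authoritative. It assumes CPython 3.3
or later, and that strings built by repetition in Python code are compact
(which I believe is the case). bytes_per_char is a hypothetical helper
name of my own, not part of any API. Header sizes differ between
versions, so the sketch measures only how much the size grows per added
character.

```python
import sys


def bytes_per_char(ch, n=100):
    """Estimate how many bytes each extra copy of `ch` adds to a string.

    Compares two strings made only of `ch`, of lengths n and 2*n, so the
    fixed struct header cancels out of the difference.
    """
    short = ch * n
    longer = ch * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // n


# One sample character from each range: ASCII, Latin-1, BMP, astral.
for ch in ("a", "\xe9", "\u0100", "\U00010000"):
    print(ascii(ch), bytes_per_char(ch))

# ASCII-only strings use the smaller PyASCIIObject header, so at the same
# length and width they should come out smaller than a Latin-1 string.
print(sys.getsizeof("a" * 100) < sys.getsizeof("\xe9" * 100))
```

On the CPython builds I would expect this to apply to, the widths come
out as 1, 1, 2 and 4 bytes, and the final comparison is True. Note that
this shows the width and the header size only; it says nothing about
compact vs. legacy, since legacy strings can only come from the C API.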