Eryk Sun added the comment:

> Why do strings cache their UTF-8 encoding?

Strings also cache the wide-string representation. For example:

    from ctypes import *
    s = '\241\242\243'
    pythonapi.PyUnicode_AsUnicodeAndSize(py_object(s), None)
    pythonapi.PyUnicode_AsUTF8AndSize(py_object(s), None)

    >>> hex(id(s))
    '0x7ffff69f8e98'

    (gdb) p *(PyCompactUnicodeObject *)0x7ffff69f8e98
    $1 = {_base = {ob_base = {_ob_next = 0x7ffff697f890,
                              _ob_prev = 0x7ffff6a04d40,
                              ob_refcnt = 1,
                              ob_type = 0x89d860 <PyUnicode_Type>},
                   length = 3,
                   hash = -5238559198920514942,
                   state = {interned = 0, kind = 1, compact = 1,
                            ascii = 0, ready = 1},
                   wstr = 0x7ffff69690a0 L"¡¢£"},
          utf8_length = 6,
          utf8 = 0x7ffff696b7e8 "¡¢£",
          wstr_length = 3}

    (gdb) p (char *)((PyCompactUnicodeObject *)0x7ffff69f8e98 + 1)
    $2 = 0x7ffff69f8ef0 "\241\242\243"

This object uses 4 bytes for the null-terminated Latin-1 string, which directly follows the PyCompactUnicodeObject struct. It uses 7 bytes for the UTF-8 string. It uses 16 bytes for the wchar_t string (4 bytes per wchar_t).

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25709>
_______________________________________
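
The byte counts quoted in the comment can be cross-checked from pure Python without gdb. The following is only a rough sketch: the helper name `cached_sizes` is hypothetical, it does not inspect the object's internals, and it assumes a string whose code points all fit in Latin-1 and that each buffer carries one terminating null, as in the dump above.

```python
import ctypes


def cached_sizes(s):
    """Estimate the bytes used by each representation of s, null included.

    Hypothetical helper for illustration; assumes every code point in s
    is in the Latin-1 range (kind = 1 in the struct dump).
    """
    # Compact Latin-1 data stored right after the struct: 1 byte per char.
    latin1 = len(s.encode('latin-1')) + 1
    # Cached UTF-8 buffer: code points above U+007F take 2 bytes each here.
    utf8 = len(s.encode('utf-8')) + 1
    # Cached wchar_t buffer: sizeof(wchar_t) is platform dependent.
    wstr = (len(s) + 1) * ctypes.sizeof(ctypes.c_wchar)
    return latin1, utf8, wstr


s = '\241\242\243'
print(cached_sizes(s))
```

On a platform where wchar_t is 4 bytes wide, as in the gdb session, this reports 4, 7 and 16 bytes. Where wchar_t is 2 bytes wide, the last figure would be 8 instead.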