On 2021-05-26 08:18, Alan Gauld via Python-list wrote:
> Does that mean that if I give Python a UTF8 string that is mostly
> single byte characters but contains one 4-byte character that
> Python will store the string as all 4-byte characters?

As best I understand it, yes: the cost of each "character" in a
string is the same for the entire string, so even one lone 4-byte
character in an otherwise 1-byte-character string is enough to push
the whole string to 4-byte characters.  It doesn't affect other
strings, though: if you had a pure 7-bit string and a Unicode
string, the former would still be 1 byte per char.  It's a
per-string property, not a global one.

If you encode these to a UTF-8 byte-string, you'll get the space
savings you seek, but at the cost of sensible O(1) indexing.
Either choice is a trade-off, and if your data consists mostly of
7-bit ASCII characters, or lots of small strings, the overhead is
less pronounced than if you have one single large blob of text as a
string.

> If so, doesn't that introduce a pretty big storage overhead for
> large strings?

Yes.  Though such large strings tend to be rarer, largely because
they become unwieldy for other reasons.

-tkc
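
P.S. If you want to check this yourself, here is a rough sketch.  It
assumes CPython 3.3 or later (the flexible string representation is
a CPython implementation detail, not a language guarantee), and the
helper name bytes_per_char is just something I made up for
illustration.  It compares the sizes of two repetitions of the same
text so that the fixed per-object header cancels out:

```python
import sys

def bytes_per_char(text: str) -> int:
    """Estimate how many bytes CPython spends per character of `text`.

    Hypothetical helper: it measures two freshly built repetitions of
    the text and subtracts, so the fixed object header cancels out.
    """
    doubled, tripled = text * 2, text * 3
    return (sys.getsizeof(tripled) - sys.getsizeof(doubled)) // len(text)

ascii_only = "a" * 1000
one_bmp = "a" * 999 + "\u20ac"          # one character outside Latin-1
one_astral = "a" * 999 + "\U0001F600"   # one character that is 4 bytes in UTF-8

for label, s in [("ascii", ascii_only), ("bmp", one_bmp), ("astral", one_astral)]:
    # Compare in-memory width per character with the UTF-8 encoded length.
    print(label, bytes_per_char(s), len(s.encode("utf-8")))
```

On the CPython builds I'm aware of, the per-character width comes
out as 1, 2 and 4 respectively, while the UTF-8 encoding of the last
string is only 1003 bytes.  The encoded form is a bytes object,
though, so indexing it gives you bytes rather than characters.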