On Wed, May 26, 2021 at 10:04 PM Alan Gauld via Python-list
<python-list@python.org> wrote:
>
> On 25/05/2021 23:23, Terry Reedy wrote:
>
> > In CPython's Flexible String Representation all characters in a string
> > are stored with the same number of bytes, depending on the largest
> > codepoint.
>
> I'm learning lots of new things in this thread!
>
> Does that mean that if I give Python a UTF8 string that is mostly single
> byte characters but contains one 4-byte character that Python will store
> the string as all 4-byte characters?
Nitpick: It won't be "a UTF-8 string"; it will be "a Unicode string".
UTF-8 is a scheme for representing Unicode as a series of bytes, so
if something is UTF-8, it'll be like b'Stra\xc3\x9fe' (with two bytes
representing one non-ASCII character), whereas the corresponding
Unicode string is 'Stra\xdfe', with a single character.

Or, for something beyond the first 256 characters: '\u2026' is an
ellipsis, and b'\xe2\x80\xa6' is the UTF-8 representation of that
same character. And beyond the BMP, '\U0001F921' is one of the few
non-ASCII characters you can legitimately write off as a "funny
character", and b'\xf0\x9f\xa4\xa1' is the UTF-8 byte sequence that
would carry it.

So: yes, if you give Python a large ASCII string with a single
non-BMP character, the entire string *will* be stored as four-byte
characters. (Or, to nitpick against myself: CPython will do this.
Other Python implementations are free to do differently; uPy
(MicroPython), for instance, actually uses UTF-8 internally, much as
you were predicting. For the rest of this post, when I say "Python",
I mean "CPython 3.3 or later".)

> If so, doesn't that introduce a pretty big storage overhead for
> large strings?
>
> > >>> sys.getsizeof('\U00011111')
> > 80
> > >>> sys.getsizeof('\U00011111'*2)
> > 84
> > >>> sys.getsizeof('a\U00011111')
> > 84
>
> Which is what this seems to be saying.

Correct. Each additional character is going to cost you four bytes.

> I confess I had just assumed the unicode strings were stored
> in native unicode UTF8 format.

UTF-8 isn't any more "native" than any other encoding. It's a good,
compact format for transmission, but it's quite inefficient for
manipulation. Python opts to spend some memory in order to save time,
because that's usually the correct tradeoff to make - it means that
indexing into a large string is fast, slicing a large string is fast,
and so on.

Also, in practice, very few strings will pay this sort of penalty.
If you have a whole lot of (say) Chinese text, there will be a small
proportion of ASCII characters, but most of the text will be wider
characters. Working with most European languages requires the BMP
(which means 16-bit text), but nothing beyond it. And if someone uses
one emoji from the supplementary planes (which would require 32-bit
text), it's fairly likely that they'll use several.

And if you look at all the strings in a running Python interpreter,
the vast majority of them will be ASCII-only, getting optimized all
the way down to a single byte per character. Remember, every
module-level variable is stored in that module's dictionary, keyed by
its name - and *most* variable names in Python are ASCII.

So while it's true that, in theory, a single wide character can cost
you a lot of memory, in practice this is still far more compact,
overall, than storing all strings in UCS-2.
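If you want to watch both halves of this at the REPL, here's a quick
demonstration. (The absolute numbers sys.getsizeof reports vary
between CPython versions and builds, so this compares sizes rather
than quoting them; the deltas show the per-character cost directly.)

>>> s = 'Stra\xdfe'            # six characters, one of them non-ASCII
>>> len(s)
6
>>> s.encode('utf-8')          # encoding gives bytes, not str
b'Stra\xc3\x9fe'
>>> len(s.encode('utf-8'))     # seven bytes: the sharp s costs two in UTF-8
7
>>> import sys
>>> sys.getsizeof('x' * 101) - sys.getsizeof('x')           # ASCII: 1 byte/char
100
>>> sys.getsizeof('\u2026' * 101) - sys.getsizeof('\u2026') # BMP: 2 bytes/char
200
>>> sys.getsizeof('\U0001F921' * 101) - sys.getsizeof('\U0001F921')  # 4 bytes/char
400

ChrisA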