On 25/05/2021 23:23, Terry Reedy wrote:
> In CPython's Flexible String Representation all characters in a string
> are stored with the same number of bytes, depending on the largest
> codepoint.
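
One way to check the quoted claim is to measure how much a string grows per added character. This is only a rough sketch: it assumes CPython 3.3 or later (where the Flexible String Representation applies), and the helper name bytes_per_char is made up for illustration, not taken from the thread or from any library.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character (hypothetical helper).

    Subtracting the sizes of two strings built from the same character
    cancels the fixed object header, leaving only the per-character cost.
    Assumes CPython 3.3+; other implementations may store strings differently.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

# One sample character from each width class.
for label, ch in [("ASCII", "a"), ("Latin-1", "\xe9"),
                  ("BMP", "\u20ac"), ("astral", "\U00011111")]:
    print(label, bytes_per_char(ch))

# Cost of appending a single astral character to an otherwise ASCII string.
ascii_only = "a" * 1000
mixed = ascii_only + "\U00011111"
print(sys.getsizeof(ascii_only), sys.getsizeof(mixed))
```

On CPython I would expect the loop to report 1, 1, 2 and 4 bytes per character, and the mixed string to come out roughly four times the size of the ASCII-only one, but the absolute getsizeof numbers vary between versions and builds.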
I'm learning lots of new things in this thread!

Does that mean that if I give Python a UTF-8 string that is mostly
single-byte characters but contains one 4-byte character, Python
will store the whole string as 4-byte characters? If so, doesn't that
introduce a pretty big storage overhead for large strings?

>
> >>> sys.getsizeof('\U00011111')
> 80
> >>> sys.getsizeof('\U00011111'*2)
> 84
> >>> sys.getsizeof('a\U00011111')
> 84

Which is what this seems to be saying.

I confess I had just assumed that Unicode strings were stored in
native UTF-8 format.

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos

--
https://mail.python.org/mailman/listinfo/python-list