On Wed, May 26, 2021 at 8:27 AM Grant Edwards <grant.b.edwa...@gmail.com> wrote:
>
> On 2021-05-25, MRAB <pyt...@mrabarnett.plus.com> wrote:
> > On 2021-05-25 16:41, Dennis Lee Bieber wrote:
>
> >> In Python 3, strings are UNICODE, using 1, 2, or 4 bytes PER
> >> CHARACTER (I don't recall if there is a 3-byte version). If your
> >> input bytes are all 7-bit ASCII, then they map directly to a 1-byte
> >> per character string. If they contain any 8-bit upper half
> >> character they may map into a 2-byte per character string.
> >>
> > In CPython 3.3+:
> >
> > U+0000..U+00FF are stored in 1 byte.
> > U+0100..U+FFFF are stored in 2 bytes.
> > U+010000..U+10FFFF are stored in 4 bytes.
>
> Are all characters in a string stored with the same "width"? IOW, does
> the presence of one Unicode character in the range U+010000..U+10FFFF
> in a string that is otherwise all 7-bit ASCII values result in the
> entire string being stored 4-bytes per character? Or is the storage
> width variable within a single string?
>

Yes, any given string has a single width, which makes indexing fast.
The memory cost you're describing can happen, but apart from a BOM
widening an otherwise-ASCII string to 16-bit, there aren't many cases
where you'll get a single wide character in a narrow string. Usually,
if there are any wide characters, there'll be a good number of them
(for instance, text in any particular language will often have a lot
of characters from a block of characters allocated to it).

As an added benefit, keeping all characters the same width simplifies
string searching algorithms, if I'm reading the code correctly. Checks
like >>"foo" in some_string<< can widen the string "foo" to the width
of the target string and then search efficiently.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
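
PS: The single-width behaviour can be observed from the interpreter. This is only a rough sketch: storage_width is a hypothetical helper name, and it relies on sys.getsizeof, so the numbers are specific to CPython 3.3+ and not guaranteed by the language. It compares two strings that differ only in length, so the fixed object header cancels out and what remains is the per-character cost.

```python
import sys


def storage_width(marker: str, fill: str = "a", n: int = 1000) -> int:
    """Estimate bytes per character for an ASCII string containing `marker`.

    Hypothetical helper; CPython-specific because it relies on sys.getsizeof.
    """
    short = marker + fill * n
    longer = marker + fill * (2 * n)
    # Both strings share the same header size, so the difference is the
    # storage used by the n extra fill characters.
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // n


if __name__ == "__main__":
    # Expected on CPython 3.3+: 1, 1, 2, 4
    print(storage_width("a"))           # pure ASCII
    print(storage_width("\xe9"))        # U+00E9, still within U+0000..U+00FF
    print(storage_width("\ufeff"))      # a BOM widens the whole string
    print(storage_width("\U0001F600"))  # one astral character widens it further
```

Note that the fill characters are plain ASCII in every case; only the single marker character changes, and it alone determines the width of the whole string.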