On Sat, 08 Mar 2014 18:08:38 -0800, Dan Stromberg wrote:

> OK, I know that Unicode data is stored in an encoding on disk.
>
> But how is it stored in RAM?
There are various common ways to store Unicode strings in RAM. The
first, UTF-16, stores every character [aside: technically, every code
point] as two bytes rather than one. So the letter "A" is stored as
the two bytes 0x0041 (or 0x4100, depending on your platform's byte
order). Using two bytes allows for a maximum of 65536 different
characters, *way* too few for the whole Unicode character set, so
UTF-16 has an escaping mechanism whereby characters beyond ordinal
0xFFFF are stored as *two* "characters" (again, actually code points)
called a surrogate pair. That means that a sequence of (say) four
human-readable characters may, depending on those characters, take up
anything from eight to sixteen bytes, and you cannot tell which until
you walk through the sequence inspecting each pair of bytes, along
these lines (real Python 3 this time, with the surrogate arithmetic
spelled out):

    def walk_utf16(units):
        # `units` is an iterable of 16-bit code units.
        units = iter(units)
        for c in units:
            if 0xDC00 <= c <= 0xDFFF:    # a low surrogate on its own
                raise ValueError("lone low surrogate")
            elif 0xD800 <= c <= 0xDBFF:  # a high surrogate: must be paired
                d = next(units, None)
                if d is None or not 0xDC00 <= d <= 0xDFFF:
                    raise ValueError("high surrogate not followed by low")
                print(chr(0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00)))
            else:
                print(chr(c))

So UTF-16 is a *variable width* (could be 1 unit, could be 2 units)
*double byte* encoding (each unit is two bytes). Prior to Python 3.3,
UTF-16 was an option when compiling the Python interpreter. Such
versions of the interpreter are called "narrow builds".

Another option is UTF-32. UTF-32 uses four bytes for every character.
That's enough to store every Unicode character, and then some, so no
surrogate pairs are needed. But every character takes up four bytes:
"A" would be stored as 0x00000041 or 0x41000000. Although UTF-32 is
faster than UTF-16, because you don't have to walk the string checking
each individual pair of bytes to see whether it is part of a surrogate
pair, strings use up to twice as much memory as they would under
UTF-16, whether they need it or not. (And four times as much memory as
ASCII strings.) Prior to Python 3.3, UTF-32 was a build option too.
Such versions of the interpreter are called "wide builds".

Another option is to use UTF-8 internally. With UTF-8, every character
uses between one and four bytes. By design, ASCII characters are
stored using a single byte, the same byte they would have in
old-fashioned single-byte ASCII: the letter "A" is stored as 0x41.
(The algorithm used by UTF-8 could continue up to six bytes, but there
is no need, since there aren't that many Unicode characters.) Because
it is variable-width, you have the same issues as with UTF-16, only
more so, but because the most common characters (at least for English
speakers) use only one or two bytes, it is much more compact than
either. No version of Python has, to my knowledge, used UTF-8
internally. Some other languages, such as Go and Haskell, do, and
consequently indexing into a string is slow for them: finding the nth
character means walking the string from the start.

In Python 3.3, CPython introduced an internal scheme (PEP 393, the
"flexible string representation") that gives the best of all worlds.
When a string is created, Python chooses an implementation depending
on the characters in the string:

* If all the characters are ASCII or Latin-1, then the string uses a
  single byte per character.

* If all the characters have ordinal values no greater than 0xFFFF,
  then two bytes per character are used, as in UTF-16. Because no
  character exceeds 0xFFFF, no surrogate pairs are required.

* Only if there is at least one character with ord() greater than
  0xFFFF does Python use four bytes per character, as in UTF-32, for
  that string.

The end result is that creating strings is slightly slower, as Python
may have to inspect each character at most twice to decide which
representation to use. A few quick demonstrations of all this follow
below.
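First, a sanity check of the surrogate logic. The walk_utf16 function
is my sketch from above, not anything in the standard library; I build
its input by hand from the UTF-16 bytes of a string containing one
astral character:

    data = "a\U0001d400".encode("utf-16-be")   # "a" plus U+1D400
    units = [int.from_bytes(data[i:i+2], "big")
             for i in range(0, len(data), 2)]
    print(units)        # [97, 55349, 56320], i.e. 0x61, 0xD835, 0xDC00
    walk_utf16(units)   # prints "a", then the U+1D400 character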
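Second, the size trade-offs between the three encodings are easy to
see with the standard codecs. The sample characters are arbitrary ones
I picked to cover each width; I use the -be variants so the two- and
four-byte codecs don't prepend a byte order mark:

    for s in ("A", "\xe9", "\u20ac", "\U0001d400"):
        print(s,
              len(s.encode("utf-8")),      # 1, 2, 3 and 4 bytes
              len(s.encode("utf-16-be")),  # 2, 2, 2 and 4 (surrogate pair)
              len(s.encode("utf-32-be")))  # always 4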
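Third, you can watch the 3.3 scheme choose a width with sys.getsizeof.
The absolute sizes are CPython implementation details that vary by
version and platform, but the *marginal* cost of one more character
shows the one-, two- and four-byte representations:

    import sys

    for ch in ("a", "\xe9", "\u20ac", "\U0001d400"):
        # Size of a two-character string minus a one-character string:
        # the marginal cost per character is 1, 1, 2 or 4 bytes.
        print(hex(ord(ch)), sys.getsizeof(ch * 2) - sys.getsizeof(ch))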
The flip side is that memory use is much improved: Python has *many*
strings (every function, method and class uses many strings in its
implementation), and the memory savings can be considerable. Depending
on your application and what you do with those strings, that may even
lead to time savings as well as memory savings.

--
Steven D'Aprano
http://import-that.dreamwidth.org/