On 3/8/14 9:08 PM, Dan Stromberg wrote:
> OK, I know that Unicode data is stored in an encoding on disk.
> But how is it stored in RAM?
> I realize I shouldn't write code that depends on any relevant
> implementation details, but knowing some of the more common
> implementation options would probably help build an intuition for
> what's going on internally.
> I've heard that characters are no longer all c bytes wide internally,
> so is it sometimes utf-8?
> Thanks.
In abstract terms, a Unicode string is a sequence of integers (code
points). There are lots of ways to store a sequence of integers.
In Python 2.x (and 3.0 through 3.2), it's either a vector of 16-bit ints
or a vector of 32-bit ints. These correspond to the Unicode encodings
UTF-16 and UTF-32, respectively, and which one you have depends on
whether you have a "narrow" or "wide" build of Python. You can tell the
difference by examining sys.maxunicode, which is 65535 (narrow) or
1114111 (wide).
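For illustration, the sys.maxunicode check can be sketched as below. The
helper name describe_build is made up for this example and is not part of
Python. On 3.3 and later, sys.maxunicode is always 1114111, so the
"narrow" branch only matters on older interpreters.

```python
import sys

def describe_build(maxunicode=sys.maxunicode):
    """Classify an interpreter as 'narrow' or 'wide' from sys.maxunicode.

    describe_build is a hypothetical helper for this example.
    """
    if maxunicode == 0xFFFF:      # 65535: 16-bit code units
        return "narrow"
    if maxunicode == 0x10FFFF:    # 1114111: full Unicode range
        return "wide"
    return "unknown"

print(describe_build())
```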
In Python 3.3, the representation was changed from narrow/wide to the
so-called Flexible String Representation, which others here have
described. It uses 1, 2, or 4 bytes per code point, depending on the
largest code point in the string. It's specified in PEP 393:
http://legacy.python.org/dev/peps/pep-0393/
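One rough way to see this on CPython 3.3 or later is to compare
sys.getsizeof for two strings of different lengths built from the same
character. This is only a sketch: bytes_per_code_point is a made-up
helper, and it relies on CPython's sys.getsizeof reporting the string's
storage, which is an implementation detail and not something other
implementations promise.

```python
import sys

def bytes_per_code_point(ch, n=100):
    """Estimate storage per code point for strings made only of ch.

    Hypothetical helper; assumes CPython 3.3+ and a one-character ch.
    Taking the difference of two sizes cancels the fixed object header.
    """
    short = sys.getsizeof(ch * n)
    longer = sys.getsizeof(ch * (2 * n))
    return (longer - short) // n

print(bytes_per_code_point("a"))           # ASCII
print(bytes_per_code_point("\xe9"))        # Latin-1 range (below U+0100)
print(bytes_per_code_point("\u20ac"))      # rest of the Basic Multilingual Plane
print(bytes_per_code_point("\U0001F600"))  # beyond U+FFFF
```

Under those assumptions the four calls should report 1, 1, 2, and 4: the
width is chosen per string, from the largest code point it contains.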
--
Ned Batchelder, http://nedbatchelder.com