Re: How is unicode implemented behind the scenes?

Ned Batchelder Sat, 08 Mar 2014 19:51:12 -0800

On 3/8/14 9:08 PM, Dan Stromberg wrote:

OK, I know that Unicode data is stored in an encoding on disk.


But how is it stored in RAM?

I realize I shouldn't write code that depends on any relevant
implementation details, but knowing some of the more common
implementation options would probably help build an intuition for
what's going on internally.

I've heard that characters are no longer all c bytes wide internally,
so is it sometimes utf-8?

Thanks.

In abstract terms, a Unicode string is a sequence of integers (codepoints). There are lots of ways to store a sequence of integers.
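To make the "sequence of integers" idea concrete, here is a minimal sketch that takes a string apart into its code points and puts it back together. The sample string and the variable names are only illustrative; ord() and chr() are the standard built-ins (Python 3 spelling).

```python
# A sample string written with escapes so the code points are unambiguous:
# U+00EF is "i with diaeresis", U+20AC is the euro sign.
s = "na\u00efve \u20ac"

# ord() maps each character to its integer code point.
code_points = [ord(ch) for ch in s]
print(code_points)

# chr() goes the other way, so the round trip restores the original string.
rebuilt = "".join(chr(cp) for cp in code_points)
print(rebuilt == s)
```

Nothing here says how those integers are laid out in memory; that part is up to the implementation.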

In Python 2.x, it's either a vector of 16-bit ints, or 32-bit ints. These are the Unicode representations known as UTF-16 and UTF-32, respectively, and which you have depends on whether you have a "narrow" or "wide" build of Python. You can tell the difference by examining sys.maxunicode, which is 65535 (narrow) or 1114111 (wide).
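A quick way to run that check, as a rough sketch: build_kind is a made-up helper name, not anything in the standard library; only sys.maxunicode is real. Note that on Python 3.3 and later sys.maxunicode is always 1114111, so the narrow/wide distinction only shows up on older interpreters.

```python
import sys

def build_kind(maxunicode=sys.maxunicode):
    """Classify a build from its sys.maxunicode value (hypothetical helper)."""
    if maxunicode == 0xFFFF:       # 65535: 16-bit code units, "narrow"
        return "narrow"
    if maxunicode == 0x10FFFF:     # 1114111: full code point range, "wide"
        return "wide"
    return "unknown"

print(sys.maxunicode, build_kind())
```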

In Python 3.3, the representation was changed from narrow/wide to the so-called Flexible String Representation which others here have described. It uses either 1, 2, or 4 bytes per code point, depending on the largest code point in the string. It's specified in PEP 393: http://legacy.python.org/dev/peps/pep-0393/
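If you want to see the 1/2/4-byte behaviour for yourself, something like the following should work on CPython 3.3 or later. It is only a sketch: bytes_per_code_point is a hypothetical helper, and it leans on sys.getsizeof, whose numbers are a CPython implementation detail and may look different on other interpreters. Measuring the difference between two string lengths cancels out the fixed per-object overhead.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point for a string made only of ch.

    Hypothetical helper: compares the sizes of two strings of different
    lengths so the fixed object header drops out of the result.
    """
    short = ch * n
    long_ = ch * (2 * n)
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // n

# One sample character from each range that PEP 393 distinguishes.
for label, ch in [("ASCII", "a"),
                  ("Latin-1", "\u00e9"),
                  ("BMP", "\u20ac"),
                  ("astral", "\U0001F600")]:
    print(label, bytes_per_code_point(ch))
```

On CPython 3.3+ I would expect the ASCII and Latin-1 strings to report 1, the BMP string 2, and the astral string 4. As the original question already notes, none of this is something code should depend on.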


--
Ned Batchelder, http://nedbatchelder.com

