On Sun, Mar 9, 2014 at 1:08 PM, Dan Stromberg <drsali...@gmail.com> wrote:
> OK, I know that Unicode data is stored in an encoding on disk.
>
> But how is it stored in RAM?
>
> I realize I shouldn't write code that depends on any relevant
> implementation details, but knowing some of the more common
> implementation options would probably help build an intuition for
> what's going on internally.
>
> I've heard that characters are no longer all c bytes wide internally,
> so is it sometimes utf-8?

As of Python 3.3, it's as MRAB described. If you like, Python chooses
one of three (or four) encodings, based on what can handle the string:

1) ASCII (there are some minor differences with 7-bit strings, e.g. it
   knows the conversion to UTF-8 is the identity function)
2) Latin-1
3) UCS-2
4) UCS-4

This means that finding the Nth codepoint in a string is simply a
matter of shifting N by either 0, 0, 1, or 2, and picking the right
number of bytes from that position. You can read the gory details in
PEP 393:

http://www.python.org/dev/peps/pep-0393/

but the important bit here is the "kind", which is 01 for Latin-1, 10
for UCS-2, 11 for UCS-4. (The "ascii-only" flag is stored elsewhere.)
There's a functionally-identical field in Pike's strings, called
size_shift - 0 for ASCII or Latin-1, 1 for UCS-2, 2 for UCS-4.

Whichever it is, it's really efficient - and as an added bonus, all
those ASCII-only strings that scripts are full of (you know, words like
"print" and "len" and "int") are stored compactly, so it's much tighter
than the 3.2 builds, even narrow ones. It's pretty awesome!

ChrisA
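
P.S. If you want to poke at this from the interpreter, here is a rough sketch. It assumes CPython 3.3 or later; other implementations may store strings differently. bytes_per_code_point() measures how much sys.getsizeof() grows when a string gains one character. pick_shift() and nth_code_point() are hypothetical names of mine, not anything in CPython: they only model the "shift N and grab that many bytes" idea in pure Python, using latin-1, utf-16-le and utf-32-le as stand-ins for the three fixed-width forms. The model doesn't handle lone surrogates.

```python
import sys


def bytes_per_code_point(ch):
    """Estimate storage per code point by growing a string by one character."""
    return sys.getsizeof(ch * 11) - sys.getsizeof(ch * 10)


def pick_shift(s):
    """Hypothetical model: the narrowest fixed width that fits every code point."""
    widest = max(map(ord, s), default=0)
    if widest < 0x100:
        return 0  # 1 byte each (ASCII or Latin-1)
    if widest < 0x10000:
        return 1  # 2 bytes each (UCS-2)
    return 2      # 4 bytes each (UCS-4)


# Stand-in codecs that produce fixed-width units for each shift value.
CODECS = {0: "latin-1", 1: "utf-16-le", 2: "utf-32-le"}


def nth_code_point(s, n):
    """Fetch code point n by shifting n into a byte offset, as in the model above."""
    shift = pick_shift(s)
    buf = s.encode(CODECS[shift])
    start = n << shift   # byte offset of the nth code point
    width = 1 << shift   # bytes per code point
    return chr(int.from_bytes(buf[start:start + width], "little"))


if __name__ == "__main__":
    for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
        print("U+%04X" % ord(ch), bytes_per_code_point(ch), "byte(s) per code point")
    print(nth_code_point("h\u00e9llo \u20ac", 6))
```

On a CPython 3.3+ build I'd expect the loop to report 1, 1, 2 and 4 bytes, but treat the absolute getsizeof() numbers as an implementation detail: the object header size varies by version and platform, and only the per-character difference is meaningful here.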