Re: How is unicode implemented behind the scenes?

MRAB Sat, 08 Mar 2014 18:47:25 -0800

On 2014-03-09 02:40, MRAB wrote:

On 2014-03-09 02:08, Dan Stromberg wrote:

OK, I know that Unicode data is stored in an encoding on disk.


But how is it stored in RAM?

I realize I shouldn't write code that depends on any relevant
implementation details, but knowing some of the more common
implementation options would probably help build an intuition for
what's going on internally.

I've heard that characters are no longer all c bytes wide internally,
so is it sometimes utf-8?

No.

From Python 3.3 (PEP 393), a string is stored as an array with a fixed
width of 1, 2 or 4 bytes per codepoint, chosen separately for each string.

In Python terms:

if all(c <= '\xFF' for c in string):
      use 1 byte per codepoint
elif all(c <= '\xFFFF' for c in string):
      use 2 bytes per codepoint
else:
      use 4 bytes per codepoint

Oops! That should, of course, be:

if all(c <= '\xFF' for c in string):
    use 1 byte per codepoint
elif all(c <= '\uFFFF' for c in string):
    use 2 bytes per codepoint
else:
    use 4 bytes per codepoint
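
For anyone who wants to check this, here is a minimal sketch, assuming CPython 3.3 or later. The helper names predicted_width and measured_width are made up for illustration. It applies the rule above to the largest codepoint in a string, then compares that with how much sys.getsizeof grows when one more copy of the character is appended. sys.getsizeof reports implementation details, so other Python implementations may behave differently or not support it at all.

```python
import sys


def predicted_width(s):
    """Bytes per codepoint predicted by the rule above (1, 2 or 4)."""
    largest = max(map(ord, s), default=0)  # an empty string counts as narrow
    if largest <= 0xFF:
        return 1
    elif largest <= 0xFFFF:
        return 2
    return 4


def measured_width(ch, n=100):
    """Growth in bytes when one more copy of ch is appended (CPython only)."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


# One sample character from each range: ASCII, Latin-1, BMP, astral.
for ch in ('a', '\xe9', '\u20ac', '\U0001F600'):
    print('U+%04X' % ord(ch), predicted_width(ch), measured_width(ch))
```

Note that the width applies to the whole string: a single codepoint above U+FFFF makes every codepoint in that string take 4 bytes.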

--
https://mail.python.org/mailman/listinfo/python-list
