On Sat, Oct 26, 2019 at 11:34:34PM -0400, David Mertz wrote:
> What does actual CPython do currently to find that s[1_000_000], assuming
> utf-8 internal representation?

CPython doesn't use a UTF-8 internal representation.

MicroPython *may*, but I don't know if they do anything fancy to avoid
O(N) indexing.

IronPython and Jython use whatever .Net and Java use.

CPython uses a custom implementation, the Flexible String
Representation, which picks the smallest code unit size required to
store all the characters in the string.

    # Pseudo-code
    c = max(string)  # highest code point in the string
    if c <= '\xFF':
        # effectively ASCII or Latin-1
        use one byte per code point
    elif c <= '\uFFFF':
        # effectively UCS-2, or UTF-16 without the surrogate pairs
        use two bytes per code point
    else:
        # U+10FFFF is the highest valid Unicode code point
        assert c <= '\U0010FFFF'
        # effectively UCS-4, or UTF-32
        use four bytes per code point
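
To tie this back to the s[1_000_000] question, here is a rough runnable sketch of the same idea in pure Python. It is not how CPython implements it internally (that is done in C); the names code_unit_width, pack and char_at are made up for illustration, and the standard fixed-width codecs stand in for the internal buffers. The point is only that once every code point takes the same number of bytes, indexing is a multiplication plus a slice rather than a scan from the start of the string.

```python
def code_unit_width(s: str) -> int:
    """Bytes per code point under the scheme sketched above (hypothetical helper)."""
    c = max(s, default="\0")  # highest code point; an empty string gets width 1
    if c <= "\xff":
        return 1  # effectively Latin-1
    if c <= "\uffff":
        return 2  # effectively UCS-2
    return 4      # effectively UCS-4 / UTF-32


# Fixed-width codecs standing in for the internal buffers (an assumption of
# this sketch). Lone surrogates are not handled here.
_CODECS = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}


def pack(s: str) -> tuple[bytes, int]:
    """Encode s with the smallest fixed width that fits every character."""
    width = code_unit_width(s)
    return s.encode(_CODECS[width]), width


def char_at(buf: bytes, width: int, i: int) -> str:
    """Constant-time lookup: the i-th code point starts at byte i * width."""
    start = i * width
    return buf[start:start + width].decode(_CODECS[width])


if __name__ == "__main__":
    s = "a" * 1_000_000 + "\u20ac" + "z"  # one euro sign forces two-byte units
    buf, width = pack(s)
    print(width, len(buf), char_at(buf, width, 1_000_000))
```

The trade-off is visible in the example: a single non-Latin-1 character doubles the storage for the whole string, but no position in it ever needs a linear search.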
--
Steven