On Sun, 02 Sep 2012 23:38:49 +0300, Serhiy Storchaka wrote:

> On 30.08.12 09:55, Steven D'Aprano wrote:
>> And Python's solution uses those: UCS-2, UCS-4, and UTF-8.
>
> I see that this misconception widely spread.

I am not familiar enough with the C implementation to tell what Python 3.3
actually does, and the PEP assumes a fair amount of familiarity with the
CPython source. So I welcome corrections.

> In fact Python 3.3 uses four kinds of ready strings.
>
> * ASCII. All codes <= U+007F.
> * UCS1. All codes <= U+00FF, at least one code > U+007F.
> * UCS2. All codes <= U+FFFF, at least one code > U+00FF.
> * UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.

Where UCS1 is equivalent to Latin-1, correct?

UCS2 is what Python 3.2 narrow builds use for all strings, with codes
> U+FFFF stored as surrogate pairs. UCS4 is what Python 3.2 wide builds
use for all strings.

This means that Python 3.3 will no longer have surrogate pairs. Am I right?

> Indexing is O(0) for any string.

I think you mean O(1) for constant-time lookups.

> Also the string can optionally cache UTF-8 and wchar_t* representation.

Right, that's the bit that wasn't clear -- the UTF-8 data is a cache, not
the canonical representation.

-- 
Steven
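
To make the four-way split quoted above concrete, here is a minimal sketch
in pure Python. The function name pep393_kind is made up for illustration;
it is not a CPython API. It only applies the code point thresholds from
Serhiy's list to the largest code point in the string, and does not inspect
the interpreter's real internal representation. Treating the empty string
as ASCII is an assumption of this sketch.

```python
def pep393_kind(s):
    """Name the narrowest kind from the quoted list that could hold s.

    Hypothetical helper: it mirrors the thresholds in the quoted list,
    it does not look at how the interpreter actually stores the string.
    """
    # Largest code point in the string; 0 for the empty string (assumption).
    max_cp = max(map(ord, s), default=0)
    if max_cp <= 0x7F:
        return "ASCII"
    if max_cp <= 0xFF:
        return "UCS1"
    if max_cp <= 0xFFFF:
        return "UCS2"
    return "UCS4"


if __name__ == "__main__":
    samples = ["abc", "caf\u00e9", "\u20ac", "\U0001F600"]
    for text in samples:
        # ascii() keeps the output readable on any terminal encoding.
        print(ascii(text), pep393_kind(text))
```

A single character above U+00FF is enough to move the whole string to a
wider kind, since every code point in one string is stored at the same
width. That uniform width is what makes O(1) indexing possible.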