Re: Flexible string representation, unicode, typography, ...

Terry Reedy Sun, 02 Sep 2012 19:37:26 -0700

On 9/2/2012 9:54 PM, Steven D'Aprano wrote:

On Sun, 02 Sep 2012 23:38:49 +0300, Serhiy Storchaka wrote:

On 30.08.12 09:55, Steven D'Aprano wrote:

And Python's solution uses those: UCS-2, UCS-4, and UTF-8.


I see that this misconception is widespread.


I am not familiar enough with the C implementation to tell what Python
3.3 actually does, and the PEP assumes a fair amount of familiarity with
the CPython source. So I welcome corrections.

In fact Python 3.3 uses four kinds of ready strings.

* ASCII. All codes <= U+007F.
* UCS1. All codes <= U+00FF, at least one code > U+007F.
* UCS2. All codes <= U+FFFF, at least one code > U+00FF.
* UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.
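
As a rough illustration of the four ranges above, here is a sketch in pure Python. The function name pep393_kind is made up for this example. It only mirrors the classification by highest code point and does not inspect CPython's internal state. Treating the empty string as ASCII is an assumption of the sketch.

```python
def pep393_kind(s):
    """Name the kind from the list above that would hold s.

    Hypothetical helper: it classifies by the highest code point only.
    """
    # Assumption: an empty string is reported as ASCII.
    max_cp = max(map(ord, s), default=0)
    if max_cp <= 0x7F:
        return "ASCII"
    if max_cp <= 0xFF:
        return "UCS1"
    if max_cp <= 0xFFFF:
        return "UCS2"
    return "UCS4"


print(pep393_kind("abc"))         # every code point <= U+007F
print(pep393_kind("caf\xe9"))     # U+00E9 is above U+007F
print(pep393_kind("\u20ac"))      # U+20AC is above U+00FF
print(pep393_kind("\U0001F600"))  # U+1F600 is above U+FFFF
```

A single high code point is enough to move the whole string into the wider kind, which is why the list says "at least one code" for each of the last three entries.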


Where UCS1 is equivalent to Latin-1, correct?

UCS2 is what Python 3.2 narrow builds use for all strings, including
codes > U+FFFF using surrogate pairs.

UCS4 is what Python 3.2 wide builds use for all strings.

This means that Python 3.3 will no longer have surrogate pairs.

Basically, yes. I believe CPython will only use surrogate code points if
one requests errors='surrogateescape' on decoding or explicitly puts them
in a literal (\unnnn or \Ummmmmmmm). The consequences fall under the
'consenting adults' policy.
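
A small sketch of both routes, assuming Python 3.3 or later. The variable names are only for illustration.

```python
# Route 1: decoding with the surrogateescape error handler.
# The byte 0xFF is not valid UTF-8, so it is smuggled through
# as the lone surrogate U+DCFF instead of raising an error.
raw = b"abc\xff"
text = raw.decode("utf-8", errors="surrogateescape")
roundtrip = text.encode("utf-8", errors="surrogateescape")

# Route 2: writing a surrogate explicitly in a literal.
lone = "\ud800"

# A strict encoder refuses lone surrogates.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With no narrow builds, a non-BMP character is one code point,
# not a surrogate pair.
astral = "\U0001F600"

print(ascii(text), roundtrip == raw, strict_ok, len(astral))
```

The round trip through surrogateescape gives back the original bytes, while the strict encode of the literal surrogate fails. That failure is the kind of consequence the 'consenting adults' remark refers to.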


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
