Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
My apologies for hammering on this, but I think it is quite important and
currently Python 3.0 seems confused about UCS-2 versus UTF-16.

-On [20080702 20:47], Guido van Rossum ([EMAIL PROTECTED]) wrote:
>No, Python already is aware of surrogates. I meant applications
>processing non-BMP text should beware of them.

Just to make sure people are fully aware of the distinctions:

UCS-2 uses 16 bits to encode Unicode data, does NOT support surrogate pairs
and therefore CANNOT represent data beyond U+FFFF (thus only supporting the
Basic Multilingual Plane, BMP). It is a fixed-length character encoding.

UTF-16 also uses 16 bits to encode Unicode data, but DOES support surrogate
pairs and therefore CAN represent data beyond U+FFFF by using said surrogate
pairs (thus supporting all planes). It is a variable-length character
encoding.

So a string representation in UCS-2 means every character occupies 16 bits.
A string representation in UTF-16 means characters can occupy 16 bits or
32-bits.

If one stays within the BMP then all is well, but when you move beyond the
BMP (U+10000 - U+10FFFF) then Python needs to correctly check the string
for surrogate pairs and deal with them internally.
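
The surrogate arithmetic behind that distinction can be sketched as follows.
This is only an illustration: to_surrogate_pair is a made-up helper name, not
a Python API. It applies the UTF-16 rule of subtracting 0x10000 and splitting
the remaining 20 bits into two 10-bit halves.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogate pair.

    Hypothetical helper for illustration; not part of the standard library.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x20045)
print(hex(high), hex(low))                 # 0xd840 0xdc45
```

A UCS-2 consumer sees those two values as two unrelated 16-bit items, while a
UTF-16 consumer pairs them back into U+20045.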

>If you find places where the Python core or standard library is doing
>Unicode processing that would break when surrogates are present you
>should file a bug. However this does not mean that every bit of code
>that slices a string at an arbitrary point (and hence risks slicing in
>the middle of a surrogate) is incorrect -- it all depends on what is
>done next with the slice.

Basically everything but string forming or string printing seems to be
broken for surrogate pairs, from what I can tell.
Also, I think you are confused about slicing in the middle of a surrogate
pair: from a UTF-16 perspective this is 1 codepoint! And as such Python
needs to treat it as one character/codepoint in a string, dealing with
slicing as appropriate. The way you currently describe it is that UTF-16
strings will be treated as UCS-2 when it comes to slicing and the likes.
From a UTF-16 point of view such slicing can NEVER occur unless you are bit
or byte slicing instead of character/codepoint slicing.

The documentation for len() says:
Return the length (the number of items) of an object.

I think it can be fairly said that an item in a string is a character or
codepoint. Take for example the following string:

a = '\U00020045\u942a' # Two hanzi/kanji/hanja

From a Unicode perspective we are looking at two characters/codepoints.
When we use a 4-byte Python 3.0 binary we get (as expected):
>>> len(a)
2

When we use a 2-byte Python 3.0 binary (the default) we get (not as
expected):
>>> len(a)
3

From a UTF-16 perspective a surrogate pair is one character/codepoint and
as such len() should have reported 2 as well. That the sequence is stored
internally as 0xd840 0xdc45 0x942a and occupies three 16-bit code units (6
bytes) is not interesting.
But it seems as if len() is treating the string as being in UCS-2
(fixed-length), which is the only logical explanation for the number 3,
instead of treating it as UTF-16 (variable-length) and reporting the number
2.
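
To see both numbers side by side, one can count the UTF-16 code units
explicitly. The sketch below assumes an interpreter where str is indexed by
code point (a 4-byte build, or CPython 3.3 and later); utf16_code_units is a
made-up helper name.

```python
def utf16_code_units(text):
    """Return the number of 16-bit code units needed to store text as UTF-16.

    Hypothetical helper; assumes text itself is indexed by code point.
    """
    # 'utf-16-le' writes no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


a = "\U00020045\u942a"        # one non-BMP character, one BMP character
print(len(a))                 # code points: 2
print(utf16_code_units(a))    # code units: 3, the figure a 2-byte build reports
```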

Subsequently doing a: print(a[1]) to get the 0x942a (鐪) actually requires
a[2] on the 2-byte Python 3.0. As such the code you write for 2-byte and
4-byte Python 3.0 is *different* when you have to deal with the same Unicode
strings! This cannot be the desired situation, can it?

Two more examples:

>>> a.find('鐪') # 4-byte
1
>>> a.find('鐪') # 2-byte
2

>>> import re # 4-byte
>>> m = re.search('鐪', a)
>>> m.start()
1
>>> import re # 2-byte
>>> m = re.search('鐪', a)
>>> m.start()
2
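
The relationship between the two sets of indices can be sketched with a small
translation helper. It is hypothetical (code_unit_index is not a Python API)
and assumes it runs where str is indexed by code point.

```python
def code_unit_index(text, code_point_index):
    """Translate a code point index into the matching UTF-16 code unit index.

    Hypothetical helper; assumes text is indexed by code point.
    """
    # Each character beyond U+FFFF takes two code units, every other takes one.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:code_point_index])


a = "\U00020045\u942a"
wide_index = a.find("\u942a")                    # what a 4-byte build returns
narrow_index = code_unit_index(a, wide_index)    # what a 2-byte build returns
print(wide_index, narrow_index)                  # 1 2
```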

This, in my opinion, has nothing to do with the application writers, but
more with Python's internals being confused about UCS-2 and UTF-16. We
accept full 32-bit codepoints with the \U escape in strings, and we may even
store it as UTF-16 internally, but we clearly do not deal with it properly
as UTF-16, but rather as UCS-2, when it comes to using said strings with
core functions and modules.

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
For wouldst thou not carve at my Soul with thine sword of Supreme Truth?
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Antoine Pitrou

Hi,

> Subsequently doing a: print(a[1]) to get the 0x942a (鐪) actually requires
> a[2] on the 2-byte Python 3.0.

How is it annoying *in practice*? In actual code the index, instead of
being a constant, will be retrieved through various means such as .find()
or re.search().start()... as you show yourself later in your message.

What is essential is that Python shows a consistent behaviour, and it
does, since indices returned by .find() et al. have the same meaning as
indices you can use with the [] operator. AFAIK that's why Guido asked
for real-world rather than theoretical examples.

Regards

Antoine.




Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Nick Coghlan

Jeroen Ruigrok van der Werven wrote:
> The documentation for len() says:
> Return the length (the number of items) of an object.


So what this tells us is that in a UCS-2 build of Python, the "items" in 
a unicode string are not, strictly speaking, Unicode code points or 
characters. Instead, they are successive 16-bit fragments of a UTF-16 
encoded string (which correspond to characters only if there are no 
surrogate pairs present in the string).


Let's look at the options here:

1. System is NOT memory limited (i.e. most desktops): use a UCS-4 Python 
build, which is what most Linux distributions do (I'm not sure about the 
pydotorg provided Windows or Mac OS X builds).


2. System is memory limited, only BMP Unicode code points are used: use 
a UCS-2 Python build, limit yourself to characters on the BMP (possibly 
enforced by use of an appropriate codec to decode input text).


3. System is memory limited, but needs to support characters beyond the 
BMP: use a UCS-2 Python build, handling any codepoints outside the BMP 
in application code.


The current Python approach handles all three cases relatively 
gracefully and with minimal overhead. Dealing natively with surrogate 
pair issues could easily result in pointless complexity for cases 1 and 
2, while completely disallowing codepoints beyond the BMP in a UCS-2 
build would needlessly rule out option 3.
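
For option 2, the "appropriate codec" idea could look roughly like the sketch
below. decode_bmp_only is a hypothetical wrapper, not an existing codec; it
decodes normally and then rejects anything that cannot be a single BMP
character.

```python
def decode_bmp_only(data, encoding="utf-8"):
    """Decode bytes, rejecting any character outside the BMP.

    Hypothetical helper standing in for 'an appropriate codec'.
    """
    text = data.decode(encoding)
    for position, ch in enumerate(text):
        # Where str is indexed by code point, non-BMP shows up as ord > 0xFFFF;
        # on a 2-byte build it shows up as surrogate code units instead.
        if ord(ch) > 0xFFFF or 0xD800 <= ord(ch) <= 0xDFFF:
            raise ValueError("non-BMP character at index %d" % position)
    return text


print(decode_bmp_only("\u942a".encode("utf-8")) == "\u942a")   # True
```

Applied at the input boundary, a check like this turns a silent difference
between builds into an explicit error.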


So here's the challenge:

1. If you are advocating disallowing the use of characters outside the 
BMP in a UCS-2 build, enumerate the advantages of doing so (paying 
particular attention to any advantages which cannot be obtained simply 
by using an appropriate codec that disallows non-BMP characters).


2. If you are advocating making the "items" in a Unicode string code 
points even in a UCS-2 build, enumerate all of the string behaviours 
that would have to change, as well as indicating how to avoid causing a 
reduction in speed for cases 1 and 2 above.


Sure, option 2 might be nice to have, but the purity argument isn't 
going to be anywhere near enough motivation to justify the additional 
code complexity - there need to be practical benefits that aren't better 
met just by sacrificing a bit of memory efficiency and switching to a 
UCS-4 build.


Cheers,
Nick.

--
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
http://www.boredomandlaziness.org


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread M.-A. Lemburg

I think the discussion is going in the wrong direction:

The choice between UCS2 and UCS4 builds is really only meant
to enhance the possibility to interface to native OS or
application APIs, e.g. Windows LIBC and Java use UTF-16, glibc
on Unix uses UCS4.

The problem of slicing Unicode objects is far more complicated
than just breaking a surrogate pair. Unicode is full of combining
code points - if you break such a sequence, the output will be
just as wrong; regardless of UCS2 vs. UCS4.

A long time ago we had a discussion about these problems. I had
suggested a new module (unicodeindex IIRC) which takes care of indexing
Unicode strings based on code points (with support for surrogates),
glyphs (taking combining code points into account) and words (with
support for various breaking/non-breaking separation code points).

Trying to solve such issues at the storage level is the wrong
approach, since the problem is application specific and thus requires
a higher-level set of possible solutions.
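
The combining-character point can be illustrated independently of the build
type. The clusters function below is a rough, made-up approximation that only
attaches combining marks to the preceding character; real segmentation (as
described in Unicode Standard Annex #29) has more rules.

```python
import unicodedata

decomposed = "e\u0301"                    # 'e' + COMBINING ACUTE ACCENT
print(len(decomposed))                    # 2 on any build
print(decomposed[:1] == "e")              # True: the slice lost the accent
print(unicodedata.combining("\u0301"))    # non-zero for a combining mark


def clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper; an approximation, not full grapheme segmentation.
    """
    result = []
    for ch in text:
        if result and unicodedata.combining(ch):
            result[-1] += ch              # attach the mark to its base
        else:
            result.append(ch)
    return result


print(len(clusters("e\u0301a")))          # 2 user-visible units
```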

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jul 03 2008)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2008-07-07: EuroPython 2008, Vilnius, Lithuania 3 days to go

 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Daniel Arbuckle
On Thu, Jul 3, 2008 at 5:39 AM, Nick Coghlan <[EMAIL PROTECTED]> wrote:
> 1. If you are advocating disallowing the use of characters outside the BMP
> in a UCS-2 build, enumerate the advantages of doing so (paying particular
> attention to any advantages which cannot be obtained simply by using an
> appropriate codec that disallows non-BMP characters).

Right now, the same python code has different meaning, depending on a
compile-time option that most users didn't even set for themselves.
Moreover, the errors caused by this semantic difference are not
reported. There's just no way to justify that.

You can't solve this problem by saying 'programmers should choose a
codec that limits them to the BMP when they target 2-byte python,'
because the problem specifically arises when code that works correctly
in a 4-byte python is placed into a 2-byte python, an operation
performed by the users rather than by programmers.

Since 2-byte python is apparently a holdover for memory-limited (and
presumably CPU-limited as well) systems, it doesn't make sense to
impose on it the requirement of correctly dealing with surrogate
pairs. Given that, it seems to me that the best solution would be to
make 4-byte python the default, and also to make 2-byte python raise
an exception when it encounters characters outside the BMP. This way,
a mysterious and unreported semantic error becomes an explicit
syntactic error.

For programmers who want to target a 2-byte format (for win32
compatibility, for example), the correct choice of codec is a superior
solution to forcing a 2-byte internal representation on python.


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote:
>Unicode is full of combining code points - if you break such a sequence,
>the output will be just as wrong; regardless of UCS2 vs. UCS4.

In my opinion you are confusing two related, but very separate things here.
Combining characters have nothing to do with breaking up the encoding of a
single codepoint. Sure enough, if you arbitrarily slice up codepoints that
consist of combining characters then your result is indeed odd looking.

I never said that nor is that the point I am making.

Guido points out that Python supports surrogate pairs and says that if
Python is dealing wrongly with this in the core then it needs to be fixed.
I am pointing out that given the fact we allow surrogate pairs we deal
rather simplistically with them in the core. In fact, we do not consider them at
all. In essence: though we may accept full 21-bit codepoints in the form of
\U escape sequences and store them internally as UTF-16 (which I
still need to verify) we subsequently deal with them programmatically as
UCS-2, which is plain silly.

You either commit yourself fully to UTF-16 and surrogate pairs or not. Not
some form in-between, because that will ultimately lead to more confusion
due to the difference in results when dealing with Unicode.

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Believe in Angels...


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Mark Hammond
> For programmers who want to target a 2-byte format (for win32
> compatibility, for example)

As MAL said, this is taking the discussion in the wrong direction.

For people on Windows, win32 isn't a "compatibility" consideration.  I
suspect most users of the other platforms MAL mentioned and all others with
their own native unicode implementations would agree.

Cheers,

Mark



Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread M.-A. Lemburg

On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
> -On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote:
>> Unicode is full of combining code points - if you break such a sequence,
>> the output will be just as wrong; regardless of UCS2 vs. UCS4.
>
> In my opinion you are confusing two related, but very separate things here.
> Combining characters have nothing to do with breaking up the encoding of a
> single codepoint. Sure enough, if you arbitrarily slice up codepoints that
> consist of combining characters then your result is indeed odd looking.
>
> I never said that nor is that the point I am making.


Please remember that lone surrogate code points are perfectly
valid Unicode code points, nevertheless. Just as a lone combining
code point is valid on its own.


> Guido points out that Python supports surrogate pairs and says that if
> Python is dealing wrongly with this in the core then it needs to be fixed.
> I am pointing out that given the fact we allow surrogate pairs we deal
> rather simplistically with them in the core. In fact, we do not consider them at
> all. In essence: though we may accept full 21-bit codepoints in the form of
> \U escape sequences and store them internally as UTF-16 (which I
> still need to verify) we subsequently deal with them programmatically as
> UCS-2, which is plain silly.


Python applies conversion from non-BMP code points to surrogates
for UCS2 builds in a few places and I agree that we should probably
do that at a few more places.

However, these are mainly conversion issues of encoded Unicode
representations vs. the internal Unicode storage where you want
to avoid exceptions in favor of finding a solution that preserves
data.

To make it clear: UCS2 builds of Python do not support non-BMP
code points out of the box.

A programmer will always have to use a codec to map the internal storage
on these builds to the full Unicode code point range. The following
codecs support surrogates on UCS2 builds:

 * UTF-8
 * UTF-16
 * UTF-32
 * unicode-escape
 * raw-unicode-escape
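
As a quick check of that list, the sketch below round-trips one non-BMP
character through each codec named above. round_trips is a made-up helper
name; the codec names are the standard Python ones.

```python
CODECS = ("utf-8", "utf-16", "utf-32", "unicode-escape", "raw-unicode-escape")


def round_trips(text, codec):
    """Return True if encoding and then decoding with codec gives text back."""
    return text.encode(codec).decode(codec) == text


ch = "\U0002F81A"   # a CJK compatibility ideograph outside the BMP
for codec in CODECS:
    print(codec, ch.encode(codec), round_trips(ch, codec))
```

The encoded forms differ (four bytes in UTF-8, a surrogate pair in UTF-16, an
escape sequence in the two escape codecs), but each decodes back to the same
character.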


> You either commit yourself fully to UTF-16 and surrogate pairs or not. Not
> some form in-between, because that will ultimately lead to more confusion
> due to the difference in results when dealing with Unicode.


Programmers will have to be aware of the fact that on UCS2
builds of Python non-BMP code points will have to be treated
differently than on UCS4 builds.

I don't see that as a problem. It is in a way similar to
32-bit vs. 64-bit builds of Python or the fact that floating point
numbers work differently depending on the Python platform or
compiler being used.

BTW: Have you ever run into any problems with UCS2 vs. UCS4
in practice that were not easy to solve ?

--
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 3:48 AM, Jeroen Ruigrok van der Werven
<[EMAIL PROTECTED]> wrote:
> My apologies for hammering on this, but I think it is quite important and
> currently Python 3.0 seems confused about UCS-2 versus UTF-16.
[...]

You seem to be suggesting that len(u"\U00012345") should return 1 on
a system that internally uses UTF-16 and hence represents this string
as a surrogate pair.

This is not going to happen. You may as well complain to the authors
of the Java standard about the corresponding problem there.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Daniel Arbuckle
On Thu, Jul 3, 2008 at 6:42 AM, Mark Hammond <[EMAIL PROTECTED]> wrote:
> For people on Windows, win32 isn't a "compatibility" consideration.  I
> suspect most users of the other platforms MAL mentioned and all others with
> their own native unicode implementations would agree.

I'm sorry, but you're wrong. Interfacing python to interoperate with
the underlying system is compatibility. Surely your own win32
extensions already address this necessity.

Regardless, as I said before, nothing justifies silently changing the
meaning of a program based on an option that most users don't set for
themselves and are not aware of. When such a change would take place,
it should be reported explicitly as an error.


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 15:58], Guido van Rossum ([EMAIL PROTECTED]) wrote:
>You seem to be suggesting that len(u"\U00012345") should return 1 on
>a system that internally uses UTF-16 and hence represents this string
>as a surrogate pair.

From a Unicode and UTF-16 point of view that makes the most sense. So yes, I
am suggesting that.

>This is not going to happen. You may as well complain to the authors
>of the Java standard about the corresponding problem there.

Why would I need to complain to them? They already fixed it since 1.5.0.

Java 1.5.0's release notes
(http://java.sun.com/developer/technicalArticles/releases/j2se15/):

Supplementary Character Support

32-bit supplementary character support has been carefully added to the
platform as part of the transition to Unicode 4.0 support. Supplementary
characters are encoded as a special pair of UTF16 values to generate a
different character, or codepoint. A surrogate pair is a combination of a
high UTF16 value and a following low UTF16 value. The high and low values
are from a special range of UTF16 values.

In general, when using a String or sequence of characters, the core API
libraries will transparently handle the new supplementary characters for
you.

See also http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html

The methods that accept an int value support all Unicode characters,
including supplementary characters. For example, Character.isLetter(0x2F81A)
returns true because the code point value represents a letter (a CJK
ideograph).

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Life can only be understood backwards, but it must be lived forwards...


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 7:46 AM, Jeroen Ruigrok van der Werven
<[EMAIL PROTECTED]> wrote:
> -On [20080703 15:58], Guido van Rossum ([EMAIL PROTECTED]) wrote:
>>You seem to be suggesting that len(u"\U00012345") should return 1 on
>>a system that internally uses UTF-16 and hence represents this string
>>as a surrogate pair.
>
> From a Unicode and UTF-16 point of view that makes the most sense. So yes, I
> am suggesting that.
>
>>This is not going to happen. You may as well complain to the authors
>>of the Java standard about the corresponding problem there.
>
> Why would I need to complain to them? They already fixed it since 1.5.0.
>
> Java 1.5.0's release notes
> (http://java.sun.com/developer/technicalArticles/releases/j2se15/):
>
> Supplementary Character Support
>
> 32-bit supplementary character support has been carefully added to the
> platform as part of the transition to Unicode 4.0 support. Supplementary
> characters are encoded as a special pair of UTF16 values to generate a
> different character, or codepoint. A surrogate pair is a combination of a
> high UTF16 value and a following low UTF16 value. The high and low values
> are from a special range of UTF16 values.
>
> In general, when using a String or sequence of characters, the core API
> libraries will transparently handle the new supplementary characters for
> you.
>
> See also http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html
>
> The methods that accept an int value support all Unicode characters,
> including supplementary characters. For example, Character.isLetter(0x2F81A)
> returns true because the code point value represents a letter (a CJK
> ideograph).

I don't see an answer there to the question of whether the length()
method of a Java String object containing a single surrogate pair
returns 1 or 2; I suspect it returns 2. Python 3 supports things like
chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
unichr and unicode literals.)

The one thing that may be missing from Python is things like
interpretation of surrogates by functions like isalpha() and I'm okay
with adding that (since those have to loop over the entire string
anyway).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Amaury Forgeot d'Arc
Hello,

2008/7/3 Guido van Rossum <[EMAIL PROTECTED]>:
> I don't see an answer there to the question of whether the length()
> method of a Java String object containing a single surrogate pair
> returns 1 or 2; I suspect it returns 2. Python 3 supports things like
> chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
> unichr and unicode literals.)

python2.6 support for supplementary characters is not ideal:
>>> unichr(0x2f81a)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>>> ord(u'\U0002F81A')
TypeError: ord() expected a character, but string of length 2 found.

\U seems the only way to enter these characters.
3.0 is much better and passes the two tests above.

The unicodedata module gives good results in both versions:
>>> unicodedata.name(u'\U0002F81A')
'CJK COMPATIBILITY IDEOGRAPH-2F81A'
[34311 refs]
>>> unicodedata.category(u'\U0002F81A')
'Lo'

With python 3.0, I found only two places that refuse large code points
on narrow builds:
the "%c" format, and Py_BuildValue('C'). They should be fixed.
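
For comparison, the same checks written for Python 3 look like the sketch
below. The results in the comments assume an interpreter where str is indexed
by code point (a wide build, or CPython 3.3 and later).

```python
import unicodedata

ch = chr(0x2F81A)                      # Python 3 spelling of unichr()
print(ord(ch) == 0x2F81A)              # True
print(unicodedata.name(ch))            # CJK COMPATIBILITY IDEOGRAPH-2F81A
print(unicodedata.category(ch))        # Lo
print(ch.isalpha())                    # True: category Lo counts as a letter
```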

> The one thing that may be missing from Python is things like
> interpretation of surrogates by functions like isalpha() and I'm okay
> with adding that (since those have to loop over the entire string
> anyway).

In this case, a new .isascii() method would be needed for some uses.

-- 
Amaury Forgeot d'Arc


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Paul Moore
On 03/07/2008, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> I don't see an answer there to the question of whether the length()
> method of a Java String object containing a single surrogate pair
> returns 1 or 2; I suspect it returns 2.

It appears you're right:

>type testucs.java
class testucs {
    public static void main(String[] args) {
        StringBuilder s = new StringBuilder("Hello, ");
        s.appendCodePoint(0x2F81A);
        System.out.println(s); // Display the string.
        System.out.println(s.length());
    }
}

>java testucs
Hello, ?
9

>java -version
java version "1.6.0_05"
Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)

> Python 3 supports things like
> chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
> unichr and unicode literals.)

And Java doesn't appear to - that appendCodePoint() method was
wonderfully hard to find :-)

Paul.


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Armin Ronacher
Guido van Rossum <[EMAIL PROTECTED]> writes:

> The one thing that may be missing from Python is things like
> interpretation of surrogates by functions like isalpha() and I'm okay
> with adding that (since those have to loop over the entire string
> anyway).
That and methods to safely iterate and slice strings by codepoint.  Java
supports that via String.codePointCount / String.codePointAt /
String.codePointBefore / String.offsetByCodePoints.  Maybe not on the
unicode/str object itself but as part of unicodedata that would make sense
for applications that have to deal with unicode on that level.
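
A minimal sketch of what such a code point iterator might do, working on raw
16-bit code units. iter_code_points is a hypothetical name, and yielding lone
surrogates unchanged is an assumption of this sketch, not something settled in
this thread.

```python
def iter_code_points(units):
    """Yield code points from an iterable of 16-bit UTF-16 code units.

    Hypothetical helper. Lone surrogates are yielded unchanged, which is
    just one possible policy.
    """
    pending = None                      # a high surrogate awaiting its partner
    for unit in units:
        if pending is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # Valid pair: recombine the two 10-bit halves.
                yield 0x10000 + ((pending - 0xD800) << 10) + (unit - 0xDC00)
                pending = None
                continue
            yield pending               # unpaired high surrogate
            pending = None
        if 0xD800 <= unit <= 0xDBFF:
            pending = unit
        else:
            yield unit
    if pending is not None:
        yield pending                   # trailing unpaired high surrogate


print([hex(cp) for cp in iter_code_points([0xD840, 0xDC45, 0x942A])])
# ['0x20045', '0x942a']
```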

Regards,
Armin



Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Steve Holden

Paul Moore wrote:
> On 03/07/2008, Guido van Rossum <[EMAIL PROTECTED]> wrote:
>> I don't see an answer there to the question of whether the length()
>> method of a Java String object containing a single surrogate pair
>> returns 1 or 2; I suspect it returns 2.
>
> It appears you're right:
>
> >type testucs.java
> class testucs {
>     public static void main(String[] args) {
>         StringBuilder s = new StringBuilder("Hello, ");
>         s.appendCodePoint(0x2F81A);
>         System.out.println(s); // Display the string.
>         System.out.println(s.length());
>     }
> }
>
> >java testucs
> Hello, ?
> 9
>
> >java -version
> java version "1.6.0_05"
> Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
> Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)
>
>> Python 3 supports things like
>> chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
>> unichr and unicode literals.)
>
> And Java doesn't appear to - that appendCodePoint() method was
> wonderfully hard to find :-)

There's also the issue of indexing the Unicode strings. If we are going 
to insist that len(u) counts surrogate pairs as one character then 
random access to the characters of a string is going to be an extremely 
inefficient operation.


Surely it's desirable under all circumstances that

  len(u) == sum(1 for c in u)

and that

  [c for c in u] == [u[i] for i in range(len(u))]

How would that play under Jeroen's proposed change?
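
The two identities can be checked directly, with the second one written as
u[i] for i in range(len(u)). check_sequence_invariants is a made-up name for
this sketch.

```python
def check_sequence_invariants(u):
    """Return True if len(), iteration and indexing agree on the items of u."""
    same_length = len(u) == sum(1 for c in u)
    same_items = [c for c in u] == [u[i] for i in range(len(u))]
    return same_length and same_items


print(check_sequence_invariants("\U00020045\u942a"))   # True
```

Both hold whenever len(), iteration and indexing agree on what an item is, be
it a 16-bit code unit or a code point. They would break only if len() counted
code points while indexing still addressed code units.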

regards
 Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/



Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 17:32], Paul Moore ([EMAIL PROTECTED]) wrote:
>System.out.println(s.length());

I think you want to use codePointCount() to count the Unicode code points.
length() returns Unicode code units.

As http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html explains:

In the J2SE API documentation, Unicode code point is used for character
values in the range between U+0000 and U+10FFFF, and Unicode code unit is
used for 16-bit char values that are code units of the UTF-16 encoding.
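
The same code unit versus code point distinction, sketched in Python for the
string from the Java test. It assumes an interpreter where str is indexed by
code point (a wide build, or CPython 3.3 and later); the comparison to the
Java methods is only an analogy.

```python
s = "Hello, " + chr(0x2F81A)

code_points = len(s)                           # analogous to codePointCount()
code_units = len(s.encode("utf-16-le")) // 2   # analogous to length()
print(code_points, code_units)                 # 8 9
```

The 9 matches what the Java program printed; the 8 is the number of code
points.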

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Man is the measure of all things...


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 9:35 AM, Steve Holden <[EMAIL PROTECTED]> wrote:
> Paul Moore wrote:
>>
>> On 03/07/2008, Guido van Rossum <[EMAIL PROTECTED]> wrote:
>>>
>>> I don't see an answer there to the question of whether the length()
>>> method of a Java String object containing a single surrogate pair
>>> returns 1 or 2; I suspect it returns 2.
>>
>> It appears you're right:
>>
>>> type testucs.java
>>
>> class testucs {
>>     public static void main(String[] args) {
>>         StringBuilder s = new StringBuilder("Hello, ");
>>         s.appendCodePoint(0x2F81A);
>>         System.out.println(s); // Display the string.
>>         System.out.println(s.length());
>>     }
>> }
>>
>>> java testucs
>>
>> Hello, ?
>> 9
>>
>>> java -version
>>
>> java version "1.6.0_05"
>> Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
>> Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)
>>
>>> Python 3 supports things like
>>> chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
>>> unichr and unicode literals.)
>>
>> And Java doesn't appear to - that appendCodePoint() method was
>> wonderfully hard to find :-)
>>
> There's also the issue of indexing the Unicode strings. If we are going to
> insist that len(u) counts surrogate pairs as one character then random
> access to the characters of a string is going to be an extremely inefficient
> operation.

But my whole point is that len(u) should count surrogate pairs as TWO!

> Surely it's desirable under all circumstances that
>
>  len(u) == sum(1 for c in u)
>
> and that
>
>  [c for c in u] == [u[i] for i in range(len(u))]
>
> How would that play under Jeroen's proposed change?

I am not considering such a change. At best there will be some helper
function in unicodedata, or perhaps a helper method on the 3.0 str
type to iterate over characters instead of 16-bit values. Whether that
iterator should yield 21-bit integer values or strings containing one
character (i.e. perhaps a surrogate pair) and what it would do with
lone surrogate halves is up to the committee to design this API.
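A minimal sketch of the kind of helper described above, assuming the string is available as a sequence of 16-bit code units. The names utf16_units and iter_code_points are hypothetical, not an existing or proposed API, and passing lone surrogate halves through unchanged is only one of the possible policies. It is written for Python 3.4 or later, where the surrogatepass error handler works with the UTF-16 codecs.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def iter_code_points(units):
    """Yield code points (ints) from a sequence of 16-bit code units.

    A high surrogate followed by a low surrogate is combined into one
    code point; a lone surrogate half is yielded as-is (assumed policy).
    """
    units = list(units)
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


units = utf16_units("a\U00012345")
print([hex(u) for u in units])
print([hex(cp) for cp in iter_code_points(units)])
```

Yielding integers rather than one-character strings sidesteps the question of how a lone surrogate half should be represented as a string.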

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 17:03], Guido van Rossum ([EMAIL PROTECTED]) wrote:
>I don't see an answer there to the question of whether the length()
>method of a Java String object containing a single surrogate pair
>returns 1 or 2; I suspect it returns 2.

As
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/CharSequence.html#length()
states:

int length()

Returns the length of this character sequence. The length is the number of
16-bit chars in the sequence. 

But since Java switched to full UTF-16 support in 1.5.0, they extended their
API, as the existing methods had probably become too ingrained to change.

E.g. codePointCount()
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#codePointCount(char[],%20int,%20int)
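For comparison, the two counts can be reproduced in Python. This is a sketch that assumes Python 3.3 or later, where a str is a sequence of code points regardless of build; the function names utf16_length and code_point_count are made up for illustration and only mirror the Java method names.

```python
def utf16_length(text):
    """Number of UTF-16 code units, comparable to Java's length()."""
    return len(text.encode("utf-16-le", "surrogatepass")) // 2


def code_point_count(text):
    """Number of code points, comparable to Java's codePointCount()."""
    return len(text)


s = "Hello, " + chr(0x2F81A)
print(utf16_length(s))      # 9, matching the Java output quoted in this thread
print(code_point_count(s))  # 8
```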

>The one thing that may be missing from Python is things like
>interpretation of surrogates by functions like isalpha() and I'm okay
>with adding that (since those have to loop over the entire string
>anyway).

Those would be welcome already, yes. I'll see if I can help out.

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Fallen into ever-mourn, with these wings so torn, after your day my dawn...


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 18:45], James Y Knight ([EMAIL PROTECTED]) wrote:
>I think this is misguided.

Only trying to at least correct the current situation, which I consider a
bit of a mess, personally. (Although it seems others share my view.)

>I'd like to have 3 levels of access available:
>1) "byte"-level. In a new implementation I'd probably choose to make  
>all my strings stored in UTF-8, but UTF-16 is fine too.
>2) codepoint-level.
>3) grapheme-level.

Sounds interesting as well and I can very much see the advantages of such
levels and their methods. Especially in the i18n/l10n work I do.

>You should be able to iterate over the string at any of the levels,  
>ask for the nearest codepoint/grapheme boundary to the left or right  
>of an index at a different level, etc.

[snip]

Actually it seems Java already has a lot of similar methods.

>There are a few more desirable operations, to manipulate strings at  
>the grapheme level (because unlike for UTF-8/UTF-16 codepoints,  
>graphemes don't have the nice property of not containing prefixes  
>which are themselves valid graphemes). So, you want a find (and  
>everything else that implicitly does a find operation, like split,  
>replace, strip, etc) which requires that both endpoints of its match  
>are on a grapheme-boundary. [[Probably the easiest way to implement  
>this would be in the regexp engine.]]

Well, your ideas and seeing Java's stuff actually got me excited to work on
these kinds of ideas, next to my datetime revamp.

What would the chances for inclusion in Python be if such a PEP + code were
presented, Guido?

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Beware of the fury of the patient man...


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread James Y Knight

On Jul 3, 2008, at 10:46 AM, Jeroen Ruigrok van der Werven wrote:
> -On [20080703 15:58], Guido van Rossum ([EMAIL PROTECTED]) wrote:
>> You seem to be suggesting that len(u"\U00012345") should return 1 on
>> a system that internally uses UTF-16 and hence represents this string
>> as a surrogate pair.
>
> From a Unicode and UTF-16 point of view that makes the most sense.
> So yes, I am suggesting that.



I think this is misguided.

IMO, basically every programming language gets string handling wrong.  
(maybe with the exception of the unreleased perl6? it had some  
interesting moves in this area, but I haven't really been paying  
attention.) Everyone treats strings as arrays, but they are used quite  
differently. For a string, there is hardly ever a time when a  
programmer needs to index it with an arbitrary offset in number of  
codepoints, and the length-in-codepoints is pretty non-useful as well.  
Constant-time access to arbitrary codepoints in a string is pretty  
much unimportant. What *is* of utmost importance is constant-time
access to previously-returned points in the string.


I'd like to have 3 levels of access available:
1) "byte"-level. In a new implementation I'd probably choose to make  
all my strings stored in UTF-8, but UTF-16 is fine too.

2) codepoint-level.
3) grapheme-level.

You should be able to iterate over the string at any of the levels,  
ask for the nearest codepoint/grapheme boundary to the left or right  
of an index at a different level, etc.


Python could probably still be made to work kinda like this. I think a  
language designed as such in the first place could be nicer, with  
opaque index objects into the string rather than integers, and such,  
but...whatever.


Let's assume python is changed to always store strings in UTF-16.

All it would take is adding a few more functions to the str object to  
operate on the higher levels. Wherever I say "pos" I mean an integer  
index into the string, at the UTF-16 level. That may sometimes be  
unaligned with the boundary of the representation you're asking about,  
and behavior in that case needs to be specified as well.


.nextcodepoint(curpos, how_many=1) -> returns an index into the string  
how_many codepoints to the right (or left if negative) of the index  
curpos.


.nextgrapheme(curpos, how_many=1) -> returns an index into the string  
how_many graphemes to the right (or left if negative) of the index  
curpos.


.codepoints(from_pos=0, to_pos=None) -> return an iterator of  
codepoints from 'from_pos' to 'to_pos'. I think codepoints could be  
represented as strings themselves (so usually one character, sometimes  
two character strings).


.graphemes(from_pos=0, to_pos=None) -> return an iterator of graphemes  
from 'from_pos' to 'to_pos'. Also could be represented by strings. The  
returned graphemes should probably be normalized.
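The index-stepping idea behind .nextcodepoint() can be sketched on a plain list of UTF-16 code units. Everything here is an assumption made for illustration: next_code_point is a free function rather than a str method, positions are clamped at both ends of the sequence, and an unaligned position (one that points at the low half of a pair) simply steps over that half as if it were a code point of its own.

```python
def _is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def _is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def next_code_point(units, curpos, how_many=1):
    """Return the code-unit index how_many code points right (or left) of curpos."""
    pos = curpos
    for _ in range(abs(how_many)):
        if how_many > 0:
            if pos >= len(units):
                break  # clamp at the end
            paired = (_is_high(units[pos]) and pos + 1 < len(units)
                      and _is_low(units[pos + 1]))
            pos += 2 if paired else 1
        else:
            if pos <= 0:
                break  # clamp at the start
            paired = (pos >= 2 and _is_low(units[pos - 1])
                      and _is_high(units[pos - 2]))
            pos -= 2 if paired else 1
    return pos


# 'a', U+12345 as a surrogate pair, 'b'
units = [0x61, 0xD808, 0xDF45, 0x62]
print(next_code_point(units, 1))      # 3: steps over the whole pair
print(next_code_point(units, 3, -1))  # 1: steps back over the whole pair
```

Each step looks at no more than two code units, so moving from a previously returned position stays cheap even though random access by code point index does not.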


There are a few more desirable operations, to manipulate strings at  
the grapheme level (because unlike for UTF-8/UTF-16 codepoints,  
graphemes don't have the nice property of not containing prefixes  
which are themselves valid graphemes). So, you want a find (and  
everything else that implicitly does a find operation, like split,  
replace, strip, etc) which requires that both endpoints of its match  
are on a grapheme-boundary. [[Probably the easiest way to implement  
this would be in the regexp engine.]]



A concrete example of that:
u'A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}'.find(u'A\N{COMBINING TILDE}')
returns 0. But you want a way to ask for only an *actual* "A with tilde", not an
"A with tilde and macron".
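A rough sketch of such a boundary-aware find, under a loudly simplified assumption: a position counts as a grapheme boundary when the character at that position is not a combining mark (unicodedata.combining() returns 0). Real grapheme cluster rules are more involved than this, and find_on_boundary is a hypothetical name, not an existing str method.

```python
import unicodedata


def find_on_boundary(haystack, needle, start=0):
    """Like str.find, but both ends of the match must not split a base
    character from its combining marks (simplified boundary rule)."""
    if not needle:
        raise ValueError("needle must be non-empty")
    pos = haystack.find(needle, start)
    while pos != -1:
        end = pos + len(needle)
        starts_ok = pos == 0 or not unicodedata.combining(haystack[pos])
        ends_ok = end >= len(haystack) or not unicodedata.combining(haystack[end])
        if starts_ok and ends_ok:
            return pos
        pos = haystack.find(needle, pos + 1)
    return -1


text = "A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}"
print(text.find("A\N{COMBINING TILDE}"))              # 0: plain find matches a prefix
print(find_on_boundary(text, "A\N{COMBINING TILDE}"))  # -1: match would split the grapheme
```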




Anyhow, I'm not going to tackle this issue or try to push it further,  
but if someone does tackle it, python could grow to have the best  
unicode available. :)


James



Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 10:01 AM, Jeroen Ruigrok van der Werven
<[EMAIL PROTECTED]> wrote:
> What would the chances for inclusion in Python be if such a PEP + code were
> presented, Guido?

As long as it is clear that the len() function and the basic slicing
and indexing operations on strings continue to work in code units
(i.e. 16-bit quantities) and the APIs for dealing with code points
(i.e. treating surrogate pairs as a single character) are a separate
API, there is a chance. Existing code using the existing APIs should
not change its behavior (even if you consider the existing behavior
broken), with the exception of isalpha() and similar APIs, which can
IMO safely be extended to consider surrogate pairs.
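What "extending isalpha() to consider surrogate pairs" could mean is sketched below on a list of 16-bit code units. The function name isalpha_utf16 is hypothetical, and treating an ill-formed sequence (a lone surrogate half) as non-alphabetic is an assumption of this sketch, not something decided in the thread.

```python
import struct


def isalpha_utf16(units):
    """Return True if the code units decode to a non-empty, all-alphabetic string.

    Surrogate pairs are combined before classification; a lone surrogate
    makes the whole sequence non-alphabetic (assumed policy).
    """
    data = struct.pack("<%dH" % len(units), *units)
    try:
        text = data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return text.isalpha()


# U+10400 DESERET CAPITAL LETTER LONG I is the surrogate pair D801 DC00.
print(isalpha_utf16([0xD801, 0xDC00]))  # True: the pair is classified as one letter
print(chr(0xD801).isalpha())            # False: a lone half is not a letter
```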

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


[Python-Dev] us.pycon.org down?

2008-07-03 Thread Facundo Batista
(sorry for the crossposting)

Do you know what happened with "http://us.pycon.org/"?

Thank you!

-- 
. Facundo

Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Adam Olsen
On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
>>
>> -On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote:
>>>
>>> Unicode is full of combining code points - if you break such a sequence,
>>> the output will be just as wrong; regardless of UCS2 vs. UCS4.
>>
>> In my opinion you are confusing two related, but very separate things
>> here.
>> Combining characters have nothing to do with breaking up the encoding of a
>> single codepoint. Sure enough, if you arbitrarily slice up codepoints that
>> consist of combining characters then your result is indeed odd looking.
>>
>> I never said that nor is that the point I am making.
>
> Please remember that lone surrogate pair code points are perfectly
> valid Unicode code points, nevertheless. Just as a lone combining
> code point is valid on its own.

That is a big part of these problems.  For all practical purposes, a
surrogate is like a UTF-8 code unit, and must be handled the same way,
so why the heck do they confuse everybody by saying "oh, it's a code
point too!"?


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Martin v. Löwis
> Basically everything but string forming or string printing seems to be
> broken for surrogate pairs, from what I can tell.

We probably disagree what "it works correctly" means. I think everything
works correctly.

> Also, I think you are confused about slicing in the middle of a surrogate
> pair, from a UTF-16 perspective this is 1 codepoint!

Yes, but it is two code units. Python's UTF-16 implementation operates
on code units, not code points.

> And as such Python
> needs to treat it as one character/codepoint in a string, dealing with
> slicing as appropriate.

It does. However, functions such as len, and all indexing, operate in
code units, not code points.

> The way you currently describe it is that UTF-16
> strings will be treated as UCS-2 when it comes to slicing and the likes.

No. In UCS-2, the surrogate range is reserved (for UTF-16). In Python,
it's not reserved, but interpreted as UTF-16.

> From a UTF-16 point of view such slicing can NEVER occur unless you are bit
> or byte slicing instead of character/codepoint slicing.

It most certainly can. UTF-16 is not a character set, but a character
encoding form (unlike UCS-2, which is a coded character set). Slicing
*can* occur at the code unit level. When UTF-16 is understood as a
character encoding scheme (by means of the BOM), slicing can
occur even on the byte level.

> I think it can be fairly said that an item in a string is a character or
> codepoint.

Not in Python - it's a code unit.
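The difference between slicing by code unit and slicing by code point can be made concrete with a short sketch. It assumes Python 3.4 or later and models a UTF-16 string as an explicit list of code units; to_units, from_units and is_well_formed are illustrative helper names, not library functions.

```python
import struct


def to_units(text):
    """Split text into a list of UTF-16 code units."""
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def from_units(units):
    """Decode a list of code units strictly; raises on lone surrogates."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")


def is_well_formed(units):
    try:
        from_units(units)
    except UnicodeDecodeError:
        return False
    return True


units = to_units("a\U00012345b")   # 4 code units for 3 code points
print(len(units))
print(is_well_formed(units[:2]))   # False: the slice ends on a high surrogate
print(is_well_formed(units[:3]))   # True: the pair is kept together
```

A slice at the code unit level is always possible; whether the result is still well-formed UTF-16 depends on where the cut falls.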

Regards,
Martin


Re: [Python-Dev] [PyCon-Organizers] us.pycon.org down?

2008-07-03 Thread David Goodger
On Thu, Jul 3, 2008 at 13:12, Facundo Batista <[EMAIL PROTECTED]> wrote:
> (sorry for the crossposting)
>
> Do you know what happened with "http://us.pycon.org/"?

Not sure. The machine is still up (it serves www.pycon.org as well).
Either something is misconfigured, or a process can't start, or
something...

I'll ask Jeff Rush (whose machine it's on) and Doug Napoleone (who
knows more about the server than I, and has admin access) to look into
it.

-- 
David Goodger 


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Martin v. Löwis
> 1. System is NOT memory limited (i.e. most desktops): use a UCS-4 Python
> build, which is what most Linux distributions do (I'm not sure about the
> pydotorg provided Windows or Mac OS X builds).

The Windows builds must continue to use a two-byte representation, as
otherwise PythonWin will break (and anything else that tries to pass
Unicode strings directly to a Win32 *W function).

Regards,
Martin


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 19:21], Adam Olsen ([EMAIL PROTECTED]) wrote:
>On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>> Please remember that lone surrogate pair code points are perfectly
>> valid Unicode code points, nevertheless. Just as a lone combining
>> code point is valid on its own.
>
>That is a big part of these problems.  For all practical purposes, a
>surrogate is like a UTF-8 code unit, and must be handled the same way,
>so why the heck do they confuse everybody by saying "oh, it's a code
>point too!"?

Because surrogate code points are not Unicode scalar values, isolated UTF-16
code units in the range 0xd800-0xdfff are ill-formed. (D91 from Unicode
5.0/5.1, section 3.9)

So, no, it is not a code point too.

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Als men blijft geloven kan de zwaarste steen niet zinken...


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Martin v. Löwis
> Please remember that lone surrogate pair code points are perfectly
> valid Unicode code points, nevertheless. Just as a lone combining
> code point is valid on its own.

Actually, I think they aren't (not any more than an invalid codepoint,
or an unassigned codepoint). They are reserved for UTF-16 only.

I would have to look up the exact Unicode terminology, but "valid"
is probably not a predicate that they would use.

Regards,
Martin



Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Martin v. Löwis
> I think you want to use codePointCount() to count the Unicode code points.
> length() returns Unicode code units.
> 
> As http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html explains:
> 
> In the J2SE API documentation, Unicode code point is used for character
> values in the range between U+0000 and U+10FFFF, and Unicode code unit is
> used for 16-bit char values that are code units of the UTF-16 encoding.

So you would like to contribute a function codePointCount to Python's
standard library? Go ahead.

Regards,
Martin



Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Bill Janssen
> Surely it's desirable under all circumstances that
> 
>  len(u) == sum(1 for c in u)
> 
> and that
> 
>  [c for c in u] == [u[i] for i in range(len(u))]
> 
> How would that play under Jeroen's proposed change?

Yes, but I think the argument is about what "c" is -- a character or a
codepoint.  Your point about efficiency is well-taken; I doubt that
random access to a particular character in a string has to be
efficient -- kind of a dying technique these days -- but slices and
regexp performance need efficiency guarantees.

Bill


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Terry Reedy



Daniel Arbuckle wrote:
> Regardless, as I said before, nothing justifies silently changing the
> meaning of a program based on an option that most users don't set for
> themselves and are not aware of.


The premise of this thread seems to be that the majority should suffer 
for the benefit of a few.  That is not Python's philosophy.


Python hides many system differences.  It is gradually hiding more.  For 
instance, float('nan') works uniformly in 2.6 (with little performance 
hit), whereas it was system specific in 2.5.  But Python does not promise 
to hide all system differences.


If the possible effects of (unicode) string build choice are not 
properly documented, then I agree that they should be, just as they are 
(or, in some cases, I presume are) the effects of underlying OS, 
processor integer and pointer size, float scheme, garbage collection 
scheme, and perhaps something I forgot.


Suggested documentation changes can be submitted to the tracker as 
specific ascii text targeted at a specific location.  If accepted, the 
doc maintainers will adapt submitted text to 'doc style' and add the 
needed markup.  Current response time is usually under a week, perhaps 
even a day.


Documented effects are not 'silent'.  But I am sure they could be made a 
bit louder.  Perhaps someday someone will volunteer to contribute a 
chapter to Using Python on Possible Semantic Variations that would run 
through the issues listed above so they are gathered together in one 
place as well as scattered throughout the Language and Library Reference 
manuals.


> When such a change would take place,
> it should be reported explicitly as an error.


No, possible changes should be documented so that they are not silent. 
(But I am curious, by 'would' do you mean 'would with the current data' 
or 'theoretically could with chosen data'?)


Terry Jan Reedy



Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 10:44 AM, Terry Reedy <[EMAIL PROTECTED]> wrote:
> The premise of this thread seems to be that the majority should suffer for
> the benefit of a few.  That is not Python's philosophy.

Who are the many here? Who are the few? I'd venture that (at least for
the foreseeable future, say, until China will finally have taken over
the role of the US as the de-facto dominant super power :-) the many
are people whose app will never see a Unicode character outside the
BMP, or who do such minimal string processing that their code doesn't
care whether it's handling UTF-16-encoded data.

Python's philosophy is also Practicality Beats Purity.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [PyCon-Organizers] us.pycon.org down?

2008-07-03 Thread doug . napoleone
In Montana visiting. Will be back at the hotel in about 4 hours. Looks
like base site include is missing or has wrong permissions.

On 7/3/08, David Goodger <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 3, 2008 at 13:12, Facundo Batista <[EMAIL PROTECTED]>
> wrote:
>> (sorry for the crossposting)
>>
>> Do you know what happened with "http://us.pycon.org/"?
>
> Not sure. The machine is still up (it serves www.pycon.org as well).
> Either something is misconfigured, or a process can't start, or
> something...
>
> I'll ask Jeff Rush (whose machine it's on) and Doug Napoleone (who
> knows more about the server than I, and has admin access) to look into
> it.
>
> --
> David Goodger 
> ___
> PyCon-organizers mailing list
> [EMAIL PROTECTED]
> http://mail.python.org/mailman/listinfo/pycon-organizers
>


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 19:31], "Martin v. Löwis" ([EMAIL PROTECTED]) wrote:
>Yes, but it is two code units. Python's UTF-16 implementation operates
>on code units, not code points.

Thank you, that is the single most important piece of information I got
about this entire thing because it does change the entire approach.

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Knowledge comes, but Wisdom lingers...


Re: [Python-Dev] [PyCon-Organizers] us.pycon.org down?

2008-07-03 Thread David Goodger
On Thu, Jul 3, 2008 at 13:32, David Goodger <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 3, 2008 at 13:12, Facundo Batista <[EMAIL PROTECTED]> wrote:
>> (sorry for the crossposting)
>>
>> Do you know what happened with "http://us.pycon.org/"?
>
> Not sure. The machine is still up (it serves www.pycon.org as well).
> Either something is misconfigured, or a process can't start, or
> something...
>
> I'll ask Jeff Rush (whose machine it's on) and Doug Napoleone (who
> knows more about the server than I, and has admin access) to look into
> it.

Jeff fixed it. URL rewriting was off by mistake.

-- 
David Goodger 


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Adam Olsen
On Thu, Jul 3, 2008 at 11:35 AM, Jeroen Ruigrok van der Werven
<[EMAIL PROTECTED]> wrote:
> -On [20080703 19:21], Adam Olsen ([EMAIL PROTECTED]) wrote:
>>On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>>> Please remember that lone surrogate pair code points are perfectly
>>> valid Unicode code points, nevertheless. Just as a lone combining
>>> code point is valid on its own.
>>
>>That is a big part of these problems.  For all practical purposes, a
>>surrogate is like a UTF-8 code unit, and must be handled the same way,
>>so why the heck do they confuse everybody by saying "oh, it's a code
>>point too!"?
>
> Because surrogate code points are not Unicode scalar values, isolated UTF-16
> code units in the range 0xd800-0xdfff are ill-formed. (D91 from Unicode
> 5.0/5.1, section 3.9)
>
> So, no, it is not a code point too.


UTF-16
D91 UTF-16 encoding form: The Unicode encoding form that assigns each
 Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a
 single unsigned 16-bit code unit with the same numeric value as the
 Unicode scalar value, and that assigns each Unicode scalar value in the
 range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
   • In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
     represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02>
     corresponds to U+10302.
   • Because surrogate code points are not Unicode scalar values, isolated
     UTF-16 code units in the range D800₁₆..DFFF₁₆ are ill-formed.

In the context of UTF-8 or UTF-32, a Unicode scalar value is a single
code point of a valid character (more or less) and a code unit is the
base unit (1 and 4 bytes respectively) of which 1 or more combine to
form a code point.  In UTF-16, code point becomes synonymous with code
unit and Unicode scalar value becomes one or more code points.  WTF?
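The arithmetic behind the D91 example can be checked with a few lines of Python. This is a sketch of the standard surrogate-pair computation; encode_utf16 is an illustrative name, not a library function, and it works on integers so that the result does not depend on how the interpreter stores strings.

```python
def encode_utf16(code_points):
    """Map Unicode scalar values (ints) to a list of UTF-16 code units."""
    units = []
    for cp in code_points:
        if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %#x" % cp)
        if cp < 0x10000:
            units.append(cp)                     # BMP: one code unit, same value
        else:
            offset = cp - 0x10000                # 20 bits to distribute
            units.append(0xD800 + (offset >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low (trail) surrogate
    return units


units = encode_utf16([0x004D, 0x0430, 0x4E8C, 0x10302])
print(" ".join("%04X" % u for u in units))  # 004D 0430 4E8C D800 DF02
```

For U+10302 the offset is 0x302, so the high surrogate is 0xD800 + 0 and the low surrogate is 0xDC00 + 0x302 = 0xDF02, which agrees with the sequence given in the standard.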

-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread M.-A. Lemburg

On 2008-07-03 19:21, Adam Olsen wrote:
> On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>> On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
>>> -On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote:
>>>> Unicode is full of combining code points - if you break such a sequence,
>>>> the output will be just as wrong; regardless of UCS2 vs. UCS4.
>>>
>>> In my opinion you are confusing two related, but very separate things
>>> here.
>>> Combining characters have nothing to do with breaking up the encoding of a
>>> single codepoint. Sure enough, if you arbitrarily slice up codepoints that
>>> consist of combining characters then your result is indeed odd looking.
>>>
>>> I never said that nor is that the point I am making.
>>
>> Please remember that lone surrogate pair code points are perfectly
>> valid Unicode code points, nevertheless. Just as a lone combining
>> code point is valid on its own.
>
> That is a big part of these problems.  For all practical purposes, a
> surrogate is like a UTF-8 code unit, and must be handled the same way,
> so why the heck do they confuse everybody by saying "oh, it's a code
> point too!"?


You have to take that up with the Unicode consortium :-)

It would have been better not to add surrogates to the standard
at all. To be fair, I don't think that anybody seriously assumed
at the time that more than 16 bits would be needed.

In practice, you do need to be able to build Unicode strings
that contain half a surrogate (ie. a single code point) or
a combining code point without its anchor code point, so trying
to be smart about detecting surrogates is going to create more
confusion than do good, e.g.

>>> x1 = u'\udbc0'
>>> x2 = u'\udc00'
>>> x1
u'\udbc0'
>>> x2
u'\udc00'
>>> len(x1)
1
>>> len(x2)
1

Having len(x1+x2) == 1 wouldn't be right and break all sorts
of assumptions you normally make about string concatenation.
Which is why len(x1+x2) gives 2 in both UCS2 and UCS4 builds.

The fact that u'\U00100000' can map to a length 1 Unicode string
in UCS4 builds and a length 2 string in UCS2 builds is merely
due to the fact that the unicode-escape codec (which converts
the escaped string literal to a Unicode object) does know about
surrogates and uses them to avoid exceptions.

Programmers need to be aware of this fact, that's all...
just like they need to be aware of differences between
integer and float division, different behavior of classic
and new-style classes, etc. etc.
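The same experiment can be repeated on a current interpreter. The sketch below assumes Python 3.4 or later, where str always holds code points and the surrogatepass error handler is accepted by the UTF-16 codecs; the round trip through UTF-16 is added here only to show at which point the two halves become one character.

```python
x1 = "\udbc0"   # lone high surrogate
x2 = "\udc00"   # lone low surrogate

combined = x1 + x2
print(len(combined))  # 2: concatenation does not merge the halves

# Only an explicit trip through the UTF-16 encoding form pairs them up.
data = combined.encode("utf-16-le", "surrogatepass")
recoded = data.decode("utf-16-le")
print(len(recoded), hex(ord(recoded)))  # 1 0x100000
```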

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jul 03 2008)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2008-07-07: EuroPython 2008, Vilnius, Lithuania 3 days to go

 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread M.-A. Lemburg

On 2008-07-03 19:35, Jeroen Ruigrok van der Werven wrote:
> -On [20080703 19:21], Adam Olsen ([EMAIL PROTECTED]) wrote:
>> On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>>> Please remember that lone surrogate pair code points are perfectly
>>> valid Unicode code points, nevertheless. Just as a lone combining
>>> code point is valid on its own.
>>
>> That is a big part of these problems.  For all practical purposes, a
>> surrogate is like a UTF-8 code unit, and must be handled the same way,
>> so why the heck do they confuse everybody by saying "oh, it's a code
>> point too!"?
>
> Because surrogate code points are not Unicode scalar values, isolated UTF-16
> code units in the range 0xd800-0xdfff are ill-formed. (D91 from Unicode
> 5.0/5.1, section 3.9)


True. They are not valid UTF-16 code units, but a code unit is
just a storage byte representation of a Unicode transformation...

"""
Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The 
Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 
32-bit code units in the UTF-32 encoding form. (See definition D77 in  Section 3.9, Unicode Encoding Forms.)

"""

That's not the same thing as a code point which is an assignment
of a slot in the Unicode character set...

"""
Code Point. Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 
in Section 3.4, Characters and Encoding.)

"""

Reference: http://www.unicode.org/glossary/

Also see Chapter 3.4 
(http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G2212):

"""
Surrogate code points and noncharacters are considered assigned code points,
but not assigned characters.
"""
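The distinction drawn by these definitions (a surrogate is a code point, but not a Unicode scalar value) can be observed from Python. This is a small sketch assuming Python 3.3 or later; the three predicate functions and strict_encodable are illustrative names written for this example, while unicodedata.category() and the strict codecs are standard library behaviour.

```python
import unicodedata


def is_code_point(value):
    return 0 <= value <= 0x10FFFF


def is_surrogate(value):
    return 0xD800 <= value <= 0xDFFF


def is_scalar_value(value):
    return is_code_point(value) and not is_surrogate(value)


def strict_encodable(ch):
    """True if the strict UTF-8 codec accepts the character."""
    try:
        ch.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


lone = chr(0xD800)                    # a str may hold a surrogate code point
print(unicodedata.category(lone))     # Cs (Other, surrogate)
print(is_code_point(0xD800), is_scalar_value(0xD800))
print(strict_encodable(lone))         # False: not encodable on its own
```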

--
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] [PyCon-Organizers] us.pycon.org down?

2008-07-03 Thread Facundo Batista
2008/7/3 David Goodger <[EMAIL PROTECTED]>:

> Jeff fixed it. URL rewriting was off by mistake.

Thanks! :)

-- 
. Facundo

Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread M.-A. Lemburg

On 2008-07-03 19:44, Terry Reedy wrote:
> The premise of this thread seems to be that the majority should suffer
> for the benefit of a few.  That is not Python's philosophy.


In reality, most Unixes ship with UCS4 builds of Python. Windows
and Mac OS X ship with UCS2 builds. Still, anyone is free to build
their own favorite version - that's freedom of choice, which is good.

Programmers just need to be made aware of the differences in UCS2
and UCS4 builds and deal with it.
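
A minimal sketch of how a program can tell the two kinds of build apart,
using sys.maxunicode. The function name is_narrow_build is an illustrative
assumption, not an existing API. Note that interpreters from Python 3.3 onward
no longer have this build-time distinction and always report 0x10FFFF:

```python
import sys


def is_narrow_build():
    # sys.maxunicode is 0xFFFF on a UCS2 ("narrow") build and
    # 0x10FFFF on a UCS4 ("wide") build.
    return sys.maxunicode == 0xFFFF


if is_narrow_build():
    print("UCS2 build: non-BMP characters are stored as surrogate pairs")
else:
    print("UCS4 build: every code point occupies one string element")
```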

Here's a talk I've given many, many times over the years which explains
some of the details that a Python programmer needs to know when dealing
with Unicode:

http://www.egenix.com/files/python/PyConUK2007-Developing-Unicode-aware-applications-in-Python.pdf

Perhaps I should add a section on UCS2 vs. UCS4 the next time around ;-)

--
Marc-Andre Lemburg
eGenix.com



[Python-Dev] problems compiling ctypes

2008-07-03 Thread Jeremy Link
I've grabbed the latest libffi that contains support for the ARM processor.
I then enabled FFI_CLOSURES in the arm/ffi.c file.

When I do this, I get compilation errors saying that ffi_prep_closure is
missing.

Is ffi.c up to date for supporting the ARM platform?

I'm not sure whether there is a simple configuration change in one of the
files that will fix *everything*, or whether ffi.c just doesn't support ARM
yet and so needs to be developed/revamped.

Thanks for any help.




Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Steve Holden

M.-A. Lemburg wrote:
> On 2008-07-03 19:44, Terry Reedy wrote:
>> The premise of this thread seems to be that the majority should suffer
>> for the benefit of a few.  That is not Python's philosophy.
>
> In reality, most Unixes ship with UCS4 builds of Python. Windows
> and Mac OS X ship with UCS2 builds. Still, anyone is free to build
> their own favorite version - that's freedom of choice, which is good.
>
> Programmers just need to be made aware of the differences in UCS2
> and UCS4 builds and deal with it.
>
> Here's a talk I've given many, many times over the years which explains
> some of the details that a Python programmer needs to know when dealing
> with Unicode:
>
> http://www.egenix.com/files/python/PyConUK2007-Developing-Unicode-aware-applications-in-Python.pdf
>
> Perhaps I should add a section on UCS2 vs. UCS4 the next time around ;-)


The indications are that would be helpful to many people (including myself).

regards
 Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/



Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Terry Reedy



Guido van Rossum wrote:
> On Thu, Jul 3, 2008 at 10:44 AM, Terry Reedy <[EMAIL PROTECTED]> wrote:
>> The premise of this thread seems to be that the majority should suffer for
>> the benefit of a few.  That is not Python's philosophy.

The premise is the OP's idea that Python should switch to all UCS4 to
create a more pure ('ideal') situation or the idea that len(s) should
count codepoints (correct term?) for all builds as a matter of purity
even though it would be time-costly on 16-bit builds as a matter of
practicality.

> Who are the many here?

Those who are happy with 3.0 strings as they are for their systems and
who would not benefit from the proposed change.  In other words, what
you say below.

> Who are the few?

Those who are stuck with 16-bit builds and who would benefit from
32-bit builds because they need to use non-basic-plane chars and need
to use the operations for which a change would make a positive difference.

In my opinion, such people with Windows should at least install Linux +
UCS4 Python as an alternate install.

> I'd venture that (at least for
> the foreseeable future, say, until China will finally have taken over
> the role of the US as the de-facto dominant super power :-) the many
> are people whose app will never see a Unicode character outside the
> BMP, or who do such minimal string processing that their code doesn't
> care whether it's handling UTF-16-encoded data.

Just what I meant.

> Python's philosophy is also Practicality Beats Purity.

Just what I meant, in the form 'Purity does not beat Practicality'.

Having summarized, perhaps too briefly, why Python's basic unicode 
implementation would not change in the near future, I went on to my main 
point, which is that better docs might be an alternative solution to the 
problems raised.


tjr





Re: [Python-Dev] problems compiling ctypes

2008-07-03 Thread Martin v. Löwis
> Thanks for any help.

This list (python-dev) is not for getting help, but for providing it.
So if you have patches that you would like to discuss, please go
ahead. As you are seeking help, please use [EMAIL PROTECTED]
(aka news:comp.lang.python) instead.

Regards,
Martin


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Adam Olsen
On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
>
> The premise is the OP's idea that Python should switch to all UCS4 to create
> a more pure ('ideal') situation or the idea that len(s) should count
> codepoints (correct term?) for all builds as a matter of purity even though
> it would be time-costly on 16-bit builds as a matter of practicality.

Wrong term - code units and code points are equivalent in UTF-16 and
UTF-32.  What you're looking for is unicode scalar values.


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 3:00 PM, Adam Olsen <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
>>
>> The premise is the OP's idea that Python should switch to all UCS4 to create
>> a more pure ('ideal') situation or the idea that len(s) should count
>> codepoints (correct term?) for all builds as a matter of purity even though
>> it would be time-costly on 16-bit builds as a matter of practicality.
>
> Wrong term - code units and code points are equivalent in UTF-16 and
> UTF-32.  What you're looking for is unicode scalar values.

I don't think so. I have in my lap the Unicode 5.0 standard, which on
page 102, under UTF-16, states (amongst others):

"""
* In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02>
corresponds to U+10302.

* Because surrogate code points are not Unicode scalar values,
isolated UTF-16 code units in the range D800[16]..DFFF[16] are
ill-formed.
"""

From this I understand they distinguish carefully between code points
and code units -- D800 is a code unit but not a code point, 10302 is a
code point but not a (UTF-16) code unit.
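
The example from the standard can be reproduced with a short sketch of the
UTF-16 encoding rule. The function name utf16_code_units is an assumption made
for illustration; it is not how CPython implements this internally:

```python
def utf16_code_units(cp):
    """Return the list of 16-bit code units that encode code point `cp`."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate code points are not scalar values and cannot be encoded.
        raise ValueError("surrogate code point")
    if cp < 0x10000:
        # BMP code points are represented by a single code unit.
        return [cp]
    # Supplementary code points are split into a high and a low surrogate,
    # each carrying 10 bits of (cp - 0x10000).
    offset = cp - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


units = []
for cp in (0x004D, 0x0430, 0x4E8C, 0x10302):
    units.extend(utf16_code_units(cp))
print(" ".join("%04X" % u for u in units))  # prints: 004D 0430 4E8C D800 DF02
```

Four code points go in and five code units come out; the last two code units
together stand for the single code point U+10302.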

OTOH outside the context of UTF-8, the surrogates are also referred to
as "reserved code points" (e.g. in Table 2-3 on page 27, "Types of
Code Points").

I think the best thing we can do is to use "code points" to refer to
characters and "code units" to the individual 16-bit values in the
UTF-16 encoding; this seems compatible with usage elsewhere in this
thread by most folks.

Also see http://unicode.org/glossary/:

"""
Code Point. Any value in the Unicode codespace; that is, the range of
integers from 0 to 10FFFF[16]. (See definition D10 in Section 3.4,
Characters and Encoding.)
.
.
.
Code Unit. The minimal bit combination that can represent a unit of
encoded text for processing or interchange. The Unicode Standard uses
8-bit code units in the UTF-8 encoding form, 16-bit code units in the
UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding
form. (See definition D77 in  Section 3.9, Unicode Encoding Forms.)
"""

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Martin v. Löwis
> Wrong term - code units and code points are equivalent in UTF-16 and
> UTF-32.  What you're looking for is unicode scalar values.

How so? Section 2.5, UTF-16 says

"code points in the supplementary planes, in the range
U+10000..U+10FFFF, are represented as pairs of 16-bit code units."

So clearly, code points in Unicode range from U+0000..U+10FFFF,
independent of encoding form.

In UTF-16, code units range from 0..65535.

OTOH, "unicode scalar value" is nearly synonymous to "code point":

D76 Unicode Scalar Value. Any Unicode code point except high-surrogate
and low-surrogate code points.

So codepoint in Terry's message was the right term.
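
The three terms can be written down as range checks. This is only a sketch of
the definitions quoted above; the function names are made up for illustration:

```python
def is_code_point(n):
    # Any value in the Unicode codespace, U+0000..U+10FFFF.
    return 0 <= n <= 0x10FFFF


def is_surrogate_code_point(n):
    # High surrogates D800..DBFF and low surrogates DC00..DFFF.
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n):
    # D76: any code point except the surrogate code points.
    return is_code_point(n) and not is_surrogate_code_point(n)


def is_utf16_code_unit(n):
    # A UTF-16 code unit is any 16-bit value.
    return 0 <= n <= 0xFFFF


for n in (0x0041, 0xD800, 0x10302):
    print("%X" % n, is_code_point(n), is_scalar_value(n), is_utf16_code_unit(n))
```

Every scalar value is a code point; the only code points that are not scalar
values are the 2048 surrogates.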

Regards,
Martin


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Adam Olsen
On Thu, Jul 3, 2008 at 4:21 PM, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 3, 2008 at 3:00 PM, Adam Olsen <[EMAIL PROTECTED]> wrote:
>> On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
>>>
>>> The premise is the OP's idea that Python should switch to all UCS4 to create
>>> a more pure ('ideal') situation or the idea that len(s) should count
>>> codepoints (correct term?) for all builds as a matter of purity even though
>>> it would be time-costly on 16-bit builds as a matter of practicality.
>>
>> Wrong term - code units and code points are equivalent in UTF-16 and
>> UTF-32.  What you're looking for is unicode scalar values.
>
> I don't think so. I have in my lap the Unicode 5.0 standard, which on
> page 102, under UTF-16, states (amongst others):
>
> """
> * In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
> represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02>
> corresponds to U+10302.

The literal interpretation is that the U+10302 code point should get
expanded into <D800 DF02>.  It doesn't say if <D800 DF02> is a pair of
code units or a pair of code points.


> * Because surrogate code points are not Unicode scalar values,
> isolated UTF-16 code units in the range D800[16]..DFFF[16] are
> ill-formed.
> """

So a lone surrogate code unit is not a valid scalar.  It also implies
surrogate code points exist, rather than ruling them out.


> From this I understand they distinguish carefully between code points
> and code units -- D800 is a code unit but not a code point, 10302 is a
> code point but not a (UTF-16) code unit.

I disagree.  They switch between code point and code unit arbitrarily,
never saying that surrogate code points don't exist.


> OTOH outside the context of UTF-8, the surrogates are also referred to
> as "reserved code points" (e.g. in Table 2-3 on page 27, "Types of
> Code Points").

You mean outside the context of UTF-16?  Regarding them as reserved
and lone surrogates as ill-formed code units would have been simpler,
but alas, is not the case.

Regarding changes in 5.1
(http://www.unicode.org/versions/Unicode5.1.0/), I can find this bit
to give some context:

Rendering Default Ignorable Code Points

Update the last paragraph on p. 192 of The Unicode Standard,
Version 5.0, in Section 5.20, Default Ignorable Code Points, to read
as follows:

Replacement Text
An implementation should ignore all default ignorable code
points in rendering whenever it does not support those code points,
whether they are assigned or not.

In previous versions of the Unicode Standard, surrogate code
points, private use code points, and some control characters were also
default ignorable code points. However, to avoid security problems,
such characters always should be displayed with a missing glyph, so
that there is a visible indication of their presence in the text. In
Unicode 5.1 these code points are no longer default ignorable code
points. For more information, see UTR #36, "Unicode Security
Considerations."

Clearly they act as if surrogate code points exist.

Finally, we find this in the glossary:

Unicode Scalar Value. Any Unicode code point except
high-surrogate and low-surrogate code points. In other words, the
ranges of integers 0 to D7FF[16] and E000[16] to 10FFFF[16] inclusive. (See
definition D76 in Section 3.9, Unicode Encoding Forms.)

Clearly, each surrogate is a valid code point, regardless of encoding.
 A surrogate pair simultaneously represents both one code point (the
scalar value) and two code points (the surrogate code points).  To be
unambiguous you must instead use either code units (always 2 for
UTF-16) or scalar values (always 1 in any encoding).

The OP wanted it to always be 1, so the correct unambiguous term is
scalar value.


> I think the best thing we can do is to use "code points" to refer to
> characters and "code units" to the individual 16-bit values in the
> UTF-16 encoding; this seems compatible with usage elsewhere in this
> thread by most folks.
>
> Also see http://unicode.org/glossary/:
>
> """
> Code Point. Any value in the Unicode codespace; that is, the range of
> integers from 0 to 10FFFF[16]. (See definition D10 in Section 3.4,
> Characters and Encoding.)
> .
> .
> .
> Code Unit. The minimal bit combination that can represent a unit of
> encoded text for processing or interchange. The Unicode Standard uses
> 8-bit code units in the UTF-8 encoding form, 16-bit code units in the
> UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding
> form. (See definition D77 in  Section 3.9, Unicode Encoding Forms.)
> """
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>



-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 4:50 PM, Adam Olsen <[EMAIL PROTECTED]> wrote:
> Clearly, each surrogate is a valid code point, regardless of encoding.
>  A surrogate pair simultaneously represents both one code point (the
> scalar value) and two code points (the surrogate code points).  To be
> unambiguous you must instead use either code units (always 2 for
> UTF-16) or scalar values (always 1 in any encoding).
>
> The OP wanted it to always be 1, so the correct unambiguous term is
> scalar value.

Fine, if you want to be completely unambiguous, apparently you
can't use the term code point; you have to use either scalar values
(always Unicode characters) or code units (always part of an encoding,
and 8, 16 or 32 bits).

Regardless of what the OP might want, len() of a surrogate pair will
return 2 (since it counts code units), and we'll have to provide
another API to count scalar values / characters that sees a surrogate
pair as one.
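
A minimal sketch of what such a counting function could look like, operating
on a sequence of 16-bit code units. The name count_code_points is hypothetical
and no such API exists in the standard library. Lone surrogates are counted as
one item each, which is an assumption of this sketch rather than a settled
design:

```python
def count_code_points(units):
    """Count code points in a sequence of UTF-16 code units.

    A well-formed surrogate pair counts as one; anything else, including
    a lone surrogate, counts as one per code unit.
    """
    count = 0
    i = 0
    n = len(units)
    while i < n:
        is_high = 0xD800 <= units[i] <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            i += 2  # skip both halves of the surrogate pair
        else:
            i += 1
        count += 1
    return count


units = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]
print(len(units))                # 5 code units
print(count_code_points(units))  # 4 code points
```

Unlike len(), this has to scan the whole sequence, so it is O(n) rather than
O(1); that is the practicality cost mentioned earlier in the thread.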

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)