Re: Python's handling of unicode surrogates

2007-04-24 Thread Pete Forman
Ross Ridge <[EMAIL PROTECTED]> writes:
> The Unicode standard doesn't require that you support surrogates,
> or any other kind of character, so no you wouldn't be lying.
+1 on Ross Ridge's contributions to this thread. If Unicode is processed using UTF-8 or UTF-32 encoding forms then there are
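The sentence above is cut off in this listing, so the following does not try to complete it. As a rough illustration of the three encoding forms it names, here is a minimal Python 3 sketch that counts how many code units one character occupies in each form. The helper name code_unit_counts is made up for this example and is not from the thread.

```python
def code_unit_counts(ch):
    """Count the code units needed for ch in each Unicode encoding form.

    Hypothetical helper for illustration only. The little-endian codec
    variants are used so that no byte order mark is included in the count.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


bmp = code_unit_counts("A")               # a BMP character
non_bmp = code_unit_counts("\U00010000")  # first code point beyond the BMP
```

Only the UTF-16 form represents the non-BMP character as a surrogate pair (two 16-bit units); UTF-8 uses four bytes and UTF-32 a single unit.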

Re: Python's handling of unicode surrogates

2007-04-23 Thread Ross Ridge
Ross Ridge writes:
> The Unicode standard doesn't require that you support surrogates, or
> any other kind of character, so no you wouldn't be lying.
<[EMAIL PROTECTED]> wrote:
> There is the notion of Unicode implementation levels, and each of them
> does include a set of characters to support.

Re: Python's handling of unicode surrogates

2007-04-22 Thread Martin v. Löwis
> The Unicode standard doesn't require that you support surrogates, or
> any other kind of character, so no you wouldn't be lying.
There is the notion of Unicode implementation levels, and each of them does include a set of characters to support. In level 1, combining characters need not be sup

Re: Python's handling of unicode surrogates

2007-04-22 Thread Martin v. Löwis
> IMHO what is really needed is a bunch of high level methods like
> .graphemes() - iterate over graphemes
> .codepoints() - iterate over codepoints
> .isword() - check if the string represents one word
> etc...
This doesn't need to come as methods, though. If anybody wants to provide a library wi
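As a rough idea of what a library-level codepoints() could look like as a plain function rather than a method, here is a minimal Python 3 sketch. It works on a list of UTF-16 code units, since that is where surrogate pairs show up. The names utf16_units and codepoints are assumptions made for this example, not an existing API, and nothing here is taken from the thread's own code.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Hypothetical helper: encodes without a byte order mark and unpacks
    the bytes as unsigned 16-bit little-endian values.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def codepoints(units):
    """Yield code points from UTF-16 code units, combining surrogate pairs.

    A lead surrogate (0xD800-0xDBFF) followed by a trail surrogate
    (0xDC00-0xDFFF) becomes one code point; any other unit, including
    an unpaired surrogate, is yielded unchanged.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


units = utf16_units("a\U00010000b")
points = list(codepoints(units))
```

A graphemes() or isword() function would need data from the Unicode character database and is left out of this sketch.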

Re: Python's handling of unicode surrogates

2007-04-22 Thread Ross Ridge
Rhamphoryncus <[EMAIL PROTECTED]> wrote:
>I wish to write software that supports Unicode. Like it or not,
>Unicode goes beyond the BMP, so I'd be lying if I said I supported
>Unicode if I only handled the BMP.
The Unicode standard doesn't require that you support surrogates, or any other kind of

Re: Python's handling of unicode surrogates

2007-04-22 Thread Leo Kislov
On Apr 20, 7:34 pm, Rhamphoryncus <[EMAIL PROTECTED]> wrote:
> On Apr 20, 6:21 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > If you absolutely think support for non-BMP characters is necessary
> > in every program, suggesting that Python use UCS-4 by default on
> > all systems has a higher c

Re: Python's handling of unicode surrogates

2007-04-21 Thread Josiah Carlson
On Apr 20, 7:34 pm, Rhamphoryncus <[EMAIL PROTECTED]> wrote:
> On Apr 20, 6:21 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> > > I don't believe this specific variant has been discussed.
> > Now that you clarify it: no, it hasn't been discussed. I find that
> > not surprising - this proposal

Re: Python's handling of unicode surrogates

2007-04-20 Thread Neil Hodgson
Paul Boddie:
> Do we have a volunteer? ;-)
I won't volunteer to do a real implementation - the Unicode type in Python is currently around 7000 lines long and there is other code to change in, for example, regular expressions. Here's a demonstration C++ implementation that stores an array o

Re: Python's handling of unicode surrogates

2007-04-20 Thread Rhamphoryncus
On Apr 20, 6:21 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > I don't believe this specific variant has been discussed.
>
> Now that you clarify it: no, it hasn't been discussed. I find that
> not surprising - this proposal is so strange and unnatural that
> probably nobody dared to suggest

Re: Python's handling of unicode surrogates

2007-04-20 Thread Rhamphoryncus
On Apr 20, 5:49 pm, Ross Ridge <[EMAIL PROTECTED]> wrote:
> Rhamphoryncus <[EMAIL PROTECTED]> wrote:
> >The only code that will be changed is that which doesn't handle
> >surrogates properly. Some will start working properly. Some (ie
> >random.choice(u'\U0010\u')) will fail explicitly (

Re: Python's handling of unicode surrogates

2007-04-20 Thread Martin v. Löwis
> I don't believe this specific variant has been discussed.
Now that you clarify it: no, it hasn't been discussed. I find that not surprising - this proposal is so strange and unnatural that probably nobody dared to suggest it.
> s[5] does not exist. You would get an IndexError indicating that i

Re: Python's handling of unicode surrogates

2007-04-20 Thread Ross Ridge
Rhamphoryncus <[EMAIL PROTECTED]> wrote:
>The only code that will be changed is that which doesn't handle
>surrogates properly. Some will start working properly. Some (ie
>random.choice(u'\U0010\u')) will fail explicitly (rather than
>silently).
You're falsely assuming that any code tha

Re: Python's handling of unicode surrogates

2007-04-20 Thread Paul Boddie
On 20 Apr, 07:02, Neil Hodgson <[EMAIL PROTECTED]> wrote:
> Adam Olsen:
>
> > To solve this I propose Python's unicode type using UTF-16 should have
> > gaps in its index, allowing it to only expose complete unicode scalar
> > values. Iteration would produce surrogate pairs rather than
> > individ

Re: Python's handling of unicode surrogates

2007-04-19 Thread Rhamphoryncus
(Sorry for the dupe, Martin. Gmail made it look like your reply was in private.)
On 4/19/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > Thoughts, from all you readers out there? For/against?
>
> See PEP 261. These things have all been discussed at that time,
> and an explicit decision again

Re: Python's handling of unicode surrogates

2007-04-19 Thread Rhamphoryncus
On Apr 19, 11:02 pm, Neil Hodgson <[EMAIL PROTECTED]> wrote:
> Adam Olsen:
>
> > To solve this I propose Python's unicode type using UTF-16 should have
> > gaps in its index, allowing it to only expose complete unicode scalar
> > values. Iteration would produce surrogate pairs rather than
> > indi

Re: Python's handling of unicode surrogates

2007-04-19 Thread Martin v. Löwis
> Thoughts, from all you readers out there? For/against?
See PEP 261. These things have all been discussed at that time, and an explicit decision against what I think (*) your proposal is was taken. If you want to, you can try to revert that decision, but you would need to write a PEP. Regards,

Re: Python's handling of unicode surrogates

2007-04-19 Thread Neil Hodgson
Adam Olsen:
> To solve this I propose Python's unicode type using UTF-16 should have
> gaps in its index, allowing it to only expose complete unicode scalar
> values. Iteration would produce surrogate pairs rather than
> individual surrogates, indexing to the first half of a surrogate pair
> woul
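To make the "gaps in its index" idea a little more concrete, here is a minimal Python 3 sketch of one possible reading of it. The quoted proposal is cut off in this listing, so the behaviour below is an assumption: indexing the first half of a surrogate pair returns the whole pair, and indexing the second half raises IndexError (the latter is suggested by the "s[5] does not exist. You would get an IndexError" fragment quoted elsewhere in the thread). The class name GappedUTF16 is made up for this example; it is not CPython's unicode type and only handles well-formed text.

```python
import struct


class GappedUTF16:
    """Hypothetical UTF-16 backed string whose index has gaps.

    Indices count 16-bit code units, but the position of a trail
    surrogate inside a pair is not addressable.
    """

    def __init__(self, text):
        data = text.encode("utf-16-le")
        self._units = struct.unpack("<%dH" % (len(data) // 2), data)

    def __len__(self):
        # Length is still counted in code units, gaps included.
        return len(self._units)

    def _is_lead(self, i):
        return 0xD800 <= self._units[i] <= 0xDBFF

    def _is_trail(self, i):
        return 0xDC00 <= self._units[i] <= 0xDFFF

    def __getitem__(self, i):
        if not 0 <= i < len(self._units):
            raise IndexError("index out of range")
        if self._is_trail(i) and i > 0 and self._is_lead(i - 1):
            # Assumed behaviour: the second half of a pair is a gap.
            raise IndexError("index %d falls inside a surrogate pair" % i)
        end = i + 1
        if self._is_lead(i) and end < len(self._units) and self._is_trail(end):
            end += 1  # return the complete pair as one item
        raw = struct.pack("<%dH" % (end - i), *self._units[i:end])
        return raw.decode("utf-16-le")

    def __iter__(self):
        # Iteration yields complete scalar values and skips the gaps.
        i = 0
        while i < len(self._units):
            item = self[i]
            yield item
            i += len(item.encode("utf-16-le")) // 2


s = GappedUTF16("ab\U00010000c")
items = list(s)
```

In this sketch s[2] is the complete non-BMP character, s[3] is a gap, and s[4] is "c", which shows why code that assumes every index below len(s) is valid would start failing explicitly.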

Python's handling of unicode surrogates

2007-04-19 Thread Adam Olsen
As was seen in another thread[1], there's a great deal of confusion with regard to surrogates. Most programmers assume Python's unicode type exposes only complete characters. Even CPython's own functions do this on occasion. This leads to different behaviour across platforms and makes it unneces