Re: Python's handling of unicode surrogates

2007-04-24 Thread Pete Forman
Ross Ridge <[EMAIL PROTECTED]> writes: > The Unicode standard doesn't require that you support surrogates, > or any other kind of character, so no you wouldn't be lying. +1 on Ross Ridge's contributions to this thread. If Unicode is processed using UTF-8 or UTF-32 encoding forms then there are

Re: Python's handling of unicode surrogates

2007-04-23 Thread Ross Ridge
Ross Ridge writes: > The Unicode standard doesn't require that you support surrogates, or > any other kind of character, so no you wouldn't be lying. Martin v. Löwis <[EMAIL PROTECTED]> wrote: > There is the notion of Unicode implementation levels, and each of them > does include a set of characters to support.

Re: Python's handling of unicode surrogates

2007-04-22 Thread Martin v. Löwis
> The Unicode standard doesn't require that you support surrogates, or > any other kind of character, so no you wouldn't be lying. There is the notion of Unicode implementation levels, and each of them does include a set of characters to support. In level 1, combining characters need not to be sup

Re: Python's handling of unicode surrogates

2007-04-22 Thread Martin v. Löwis
> IMHO what is really needed is a bunch of high level methods like > .graphemes() - iterate over graphemes > .codepoints() - iterate over codepoints > .isword() - check if the string represents one word > etc... This doesn't need to come as methods, though. If anybody wants to provide a library wi
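
As a rough illustration of the library route suggested above, here is a minimal sketch of a standalone `codepoints()` helper. It is written for Python 3 and works on raw UTF-16 code units, so it can mimic what a narrow build sees. The names `utf16_units` and `codepoints` are hypothetical and not taken from any existing library.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (helper for the demo)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def codepoints(units):
    """Yield code points from a sequence of UTF-16 code units.

    A high surrogate followed by a low surrogate is combined into one
    supplementary code point. Anything else, including a lone surrogate,
    is passed through unchanged.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1
```

For example, `list(codepoints(utf16_units("a\U00010000b")))` walks four code units and produces three code points.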

Re: Python's handling of unicode surrogates

2007-04-22 Thread Ross Ridge
Rhamphoryncus <[EMAIL PROTECTED]> wrote: >I wish to write software that supports Unicode. Like it or not, >Unicode goes beyond the BMP, so I'd be lying if I said I supported >Unicode if I only handled the BMP. The Unicode standard doesn't require that you support surrogates, or any other kind of

Re: Python's handling of unicode surrogates

2007-04-22 Thread Leo Kislov
On Apr 20, 7:34 pm, Rhamphoryncus <[EMAIL PROTECTED]> wrote: > On Apr 20, 6:21 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > If you absolutely think support for non-BMP characters is necessary > > in every program, suggesting that Python use UCS-4 by default on > > all systems has a higher c

Re: Python's handling of unicode surrogates

2007-04-21 Thread Josiah Carlson
On Apr 20, 7:34 pm, Rhamphoryncus <[EMAIL PROTECTED]> wrote: > On Apr 20, 6:21 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > > > I don't believe this specific variant has been discussed. > > Now that you clarify it: no, it hasn't been discussed. I find that > > not surprising - this proposal

Re: Python's handling of unicode surrogates

2007-04-20 Thread Neil Hodgson
Paul Boddie: > Do we have a volunteer? ;-) I won't volunteer to do a real implementation - the Unicode type in Python is currently around 7000 lines long and there is other code to change in, for example, regular expressions. Here's a demonstration C++ implementation that stores an array o

Re: Python's handling of unicode surrogates

2007-04-20 Thread Rhamphoryncus
On Apr 20, 6:21 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > I don't believe this specific variant has been discussed. > > Now that you clarify it: no, it hasn't been discussed. I find that > not surprising - this proposal is so strange and unnatural that > probably nobody dared to suggest

Re: Python's handling of unicode surrogates

2007-04-20 Thread Rhamphoryncus
On Apr 20, 5:49 pm, Ross Ridge <[EMAIL PROTECTED]> wrote: > Rhamphoryncus <[EMAIL PROTECTED]> wrote: > >The only code that will be changed is that which doesn't handle > >surrogates properly. Some will start working properly. Some (ie > >random.choice(u'\U0010\u')) will fail explicitly (

Re: Python's handling of unicode surrogates

2007-04-20 Thread Martin v. Löwis
> I don't believe this specific variant has been discussed. Now that you clarify it: no, it hasn't been discussed. I find that not surprising - this proposal is so strange and unnatural that probably nobody dared to suggest it. > s[5] does not exist. You would get an IndexError indicating that i

Re: Python's handling of unicode surrogates

2007-04-20 Thread Ross Ridge
Rhamphoryncus <[EMAIL PROTECTED]> wrote: >The only code that will be changed is that which doesn't handle >surrogates properly. Some will start working properly. Some (ie >random.choice(u'\U0010\u')) will fail explicitly (rather than >silently). You're falsely assuming that any code tha

Re: Python's handling of unicode surrogates

2007-04-20 Thread Paul Boddie
On 20 Apr, 07:02, Neil Hodgson <[EMAIL PROTECTED]> wrote: > Adam Olsen: > > > To solve this I propose Python's unicode type using UTF-16 should have > > gaps in its index, allowing it to only expose complete unicode scalar > > values. Iteration would produce surrogate pairs rather than > > individ

Re: Python's handling of unicode surrogates

2007-04-19 Thread Rhamphoryncus
(Sorry for the dupe, Martin. Gmail made it look like your reply was in private.) On 4/19/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > Thoughts, from all you readers out there? For/against? > > See PEP 261. These things have all been discussed at that time, > and an explicit decision again

Re: Python's handling of unicode surrogates

2007-04-19 Thread Rhamphoryncus
On Apr 19, 11:02 pm, Neil Hodgson <[EMAIL PROTECTED]> wrote: > Adam Olsen: > > > To solve this I propose Python's unicode type using UTF-16 should have > > gaps in its index, allowing it to only expose complete unicode scalar > > values. Iteration would produce surrogate pairs rather than > > indi

Re: Python's handling of unicode surrogates

2007-04-19 Thread Martin v. Löwis
> Thoughts, from all you readers out there? For/against? See PEP 261. These things have all been discussed at that time, and an explicit decision against what I think (*) your proposal is was taken. If you want to, you can try to revert that decision, but you would need to write a PEP. Regards,

Re: Python's handling of unicode surrogates

2007-04-19 Thread Neil Hodgson
Adam Olsen: > To solve this I propose Python's unicode type using UTF-16 should have > gaps in its index, allowing it to only expose complete unicode scalar > values. Iteration would produce surrogate pairs rather than > individual surrogates, indexing to the first half of a surrogate pair > woul
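
The proposal quoted above can be sketched roughly as follows. This is only an illustration in Python 3, not the proposed implementation. The class name `GappedUTF16` is hypothetical, and because the quoted post is cut off, two behaviors are assumptions: indexing the first half of a surrogate pair returns the whole character, and indexing the second half raises IndexError.

```python
class GappedUTF16:
    """Toy string stored as UTF-16 code units, with gaps in its index."""

    def __init__(self, text):
        data = text.encode("utf-16-le")
        self._units = [
            int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)
        ]

    def __len__(self):
        # Length still counts code units, so valid indexes have gaps.
        return len(self._units)

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("only integer indexes are supported in this sketch")
        units = self._units
        if index < 0:
            index += len(units)
        if not 0 <= index < len(units):
            raise IndexError("index out of range")
        unit = units[index]
        # Assumption: the second half of a pair is a gap in the index.
        if (0xDC00 <= unit <= 0xDFFF and index > 0
                and 0xD800 <= units[index - 1] <= 0xDBFF):
            raise IndexError("index %d falls inside a surrogate pair" % index)
        # Assumption: the first half of a pair yields the complete character.
        if (0xD800 <= unit <= 0xDBFF and index + 1 < len(units)
                and 0xDC00 <= units[index + 1] <= 0xDFFF):
            low = units[index + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)

    def __iter__(self):
        # Iterate over complete scalar values, stepping over the gaps.
        i = 0
        while i < len(self._units):
            ch = self[i]
            yield ch
            i += 2 if ord(ch) > 0xFFFF else 1
```

With `s = GappedUTF16("a\U00010000b")`, `len(s)` is 4, `s[1]` is the full supplementary character, `s[2]` raises IndexError, and iterating over `s` gives three items.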