On Apr 19, 11:02 pm, Neil Hodgson <[EMAIL PROTECTED]> wrote: > Adam Olsen: > > > To solve this I propose Python's unicode type using UTF-16 should have > > gaps in its index, allowing it to only expose complete unicode scalar > > values. Iteration would produce surrogate pairs rather than > > individual surrogates, indexing to the first half of a surrogate pair > > would produce the entire pair (indexing to the second half would raise > > IndexError), and slicing would be required to not separate a surrogate > > pair (IndexError otherwise). > > I expect having sequences with inaccessible indices will prove > overly surprising. They will behave quite similar to existing Python > sequences except when code that works perfectly well against other > sequences throws exceptions very rarely.
"Errors should never pass silently." The only way I can think of to make surrogates unsurprising would be to use UTF-8, thereby bombarding programmers with variable-length characters. > > Reasons to treat surrogates as undivisible: > > * \U escapes and repr() already do this > > * unichr(0x10000) would work on all unicode scalar values > > unichr could return a 2 code unit string without forcing surrogate > indivisibility. Indeed. I was actually surprised that it didn't, originally I had it listed with \U and repr(). > > * "There is no separate character type; a character is represented by > > a string of one item." > > Could amend this to "a string of one or two items". > > > * iteration would be identical on all platforms > > There could be a secondary iterator that iterates over characters > rather than code units. But since you should use that iterator 90%+ of the time, why not make it the default? > > * sorting would be identical on all platforms > > This should be fixable in the current scheme. True. > > * UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated > > surrogates, are ill-formed[2]. > > It would be interesting to see how far specifying (and enforcing) > UTF-16 over the current implementation would take us. That is for the 16 > bit Unicode implementation raising an exception if an operation would > produce an unpaired surrogate or other error. Single element indexing is > a problem although it could yield a non-string type. Err, what would be the point in having a non-string type when you could just as easily produce a string containing the surrogate pair? That's half my proposal. > > Reasons against such a change: > > * Breaks code which does range(len(s)) or enumerate(s). This can be > > worked around by using s = list(s) first. > > The code will work happily for the implementor and then break when > exposed to a surrogate. The code may well break already. I just make it explicit. > > * "Nobody is forcing you to use characters above 0xFFFF". This is a > > strawman. Unicode goes beyond 0xFFFF because real languages need it. > > Software should not break just because the user speaks a different > > language than the programmer. > > Characters over 0xFFFF are *very* rare. Most of the Supplementary > Multilingual Plane is for historical languages and I don't think there > are any surviving Phoenician speakers. Maybe the extra mathematical > signs or musical symbols will prove useful one software and fonts are > implemented for these ranges. The Supplementary Ideographic Plane is > historic Chinese and may have more users. Yet Unicode has deemed them worth including anyway. I see no reason to make them more painful then they have to be. A program written to use them today would most likely a) avoid iteration, and b) replace indexes with slices (s[i] -> s[i:i +len(sub)]. If they need iteration they'll have to reimplement it, providing the exact behaviour I propose. Or they can recompile Python to use UTF-32, but why shouldn't such features be available by default? > I think that effort would be better spent on an implementation that > appears to be UTF-32 but uses UTF-16 internally. The vast majority of > the time, no surrogates will be present, so operations can be simple and > fast. When a string contains a surrogate, a flag is flipped and all > operations go through more complex and slower code paths. This way, > consumers of the type see a simple, consistent interface which will not > report strange errors when used. Your solution would require code duplication and would be slower. My solution would have no duplication and would not be slower. I like mine. ;) > BTW, I just implemented support for supplemental planes (surrogates, > 4 byte UTF-8 sequences) for Scintilla, a text editing component. I dream of a day when complete unicode support is universal. With enough effort we may get there some day. :) -- Adam Olsen, aka Rhamphoryncus -- http://mail.python.org/mailman/listinfo/python-list