Sorry to reply to this, but I feel that this is a request for clarifications, not for a change. :^)

Dan Sugalski wrote:
> Synthesized code points
> =======================
> ...
> becomes two integers, 0x00000041 and 0x000082A9. (Though it could
> represent them as 16-bit integers, since no character takes three or
> more bytes)

It strikes me that this scheme is not always null-safe (e.g. the two-byte character 00 11 would be indistinguishable from a bare byte 11, since both become the integer 0x11). Are there any encodings where this could cause a problem?
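
To make the concern concrete, here is a minimal Python sketch (Python rather than Parrot code). It assumes a synthesized code point is formed by packing a character's bytes big-endian into one integer, which matches the 0x000082A9 example but is only a guess at the real scheme; `synthesize` is a hypothetical helper name.

```python
def synthesize(char_bytes: bytes) -> int:
    """Pack the bytes of one encoded character into a single integer.

    Assumption: bytes are packed big-endian, as the 0x000082A9 example
    suggests. This is an illustration, not Parrot's actual scheme.
    """
    return int.from_bytes(char_bytes, "big")


two_byte = synthesize(b"\x00\x11")  # hypothetical two-byte character 00 11
one_byte = synthesize(b"\x11")      # a bare single byte 11

# Both collapse to the same integer, so the original byte width is lost.
collision = two_byte == one_byte
```

Under that assumption, the ambiguity only arises in an encoding that allows a multi-byte character to begin with a 00 byte.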


> getbyte             Ix, Sy, Iz
> (u)getcodepoint     Ix, Sy, Iz
> (u)getgrapheme      Sx, Sy, Iz
>
> Get the byte, codepoint, or grapheme requested. Destination is either
> an integer (representing the byte or codepoint) or a string. Sy is the
> source string, Iz is the offset in bytes, code points, or graphemes
> from the beginning of the string.

Since we're going to be shifting around the encoding essentially at will, does 'getbyte' make sense on non-binary strings? (And when we have a binary string, is there any difference between 'getbyte', 'getcodepoint', and 'getgrapheme' at all?)


If so, will 16- and 32-bit encodings have to implement this with a forward scan from the start of the string (the way getcodepoint would have to be implemented with a variable-width encoding), or do you have another trick up your sleeve?
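
The forward scan mentioned in the parenthetical can be sketched as follows, with UTF-8 as the variable-width case and UTF-32 (big-endian) as the fixed-width contrast. This is Python, not Parrot; the function names are hypothetical and the input is assumed to be well-formed with an in-range index.

```python
def utf8_width(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged from its lead byte.

    Assumes well-formed input, so the lead byte is never a continuation byte.
    """
    if lead < 0x80:
        return 1
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    return 2


def codepoint_at_utf8(data: bytes, index: int) -> int:
    """Variable width: walk from the start, skipping `index` characters."""
    pos = 0
    for _ in range(index):
        pos += utf8_width(data[pos])
    width = utf8_width(data[pos])
    return ord(data[pos:pos + width].decode("utf-8"))


def codepoint_at_utf32(data: bytes, index: int) -> int:
    """Fixed width: jump straight to the 4-byte unit, no scan needed."""
    start = index * 4
    return int.from_bytes(data[start:start + 4], "big")
```

The first lookup costs time proportional to the offset, while the second is constant time, which is why the answer for fixed-width encodings matters.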

> setbyte             Sx, Iy, Iz
> (u)setcodepoint     Sx, Iy, Iz
> (u)setgrapheme      Sx, Sy, Iz

Likewise.

--
Brent "Dax" Royal-Gordon <[EMAIL PROTECTED]>
Perl and Parrot hacker

Oceania has always been at war with Eastasia.
