Dan Sugalski wrote:
> ...
> Synthesized code points
> =======================
> becomes two integers, 0x00000041 and 0x000082A9. (Though it could represent them as 16-bit integers, since no character takes three or more bytes)
It strikes me that this scheme is not always null-safe (e.g. the character 00 11 would be indistinguishable from a bare 11). Are there any encodings this could cause a problem with?
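To make the worry concrete, here is a minimal Python sketch. It assumes the scheme simply folds each character's bytes, big-endian, into one integer, which is my reading of the example above rather than anything confirmed; `widen` is a hypothetical helper name, not a Parrot function.

```python
def widen(char_bytes: bytes) -> int:
    """Fold one character's bytes, big-endian, into a single integer.

    Hypothetical helper: this is an assumed reading of the scheme,
    not Parrot's actual implementation.
    """
    return int.from_bytes(char_bytes, "big")


# The two characters from the quoted example.
single = widen(b"\x41")        # 0x00000041
double = widen(b"\x82\xa9")    # 0x000082A9

# The problem case: a double-byte character whose first byte is 00.
two_byte = widen(b"\x00\x11")  # character 00 11
one_byte = widen(b"\x11")      # bare 11

# Both widen to 0x11, so the original byte length cannot be recovered.
collision = two_byte == one_byte
```

If an encoding ever uses 00 as a lead byte, the integer alone is not enough to round-trip back to bytes; something would have to carry the original width alongside it.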
> getbyte Ix, Sy, Iz
> (u)getcodepoint Ix, Sy, Iz
> (u)getgrapheme Sx, Sy, Iz
>
> Get the byte, codepoint, or grapheme requested. Destination is either an integer (representing the byte or codepoint) or a string. Sy is the source string, Iz is the offset in bytes, code points, or graphemes from the beginning of the string.
Since we're going to be shifting around the encoding essentially at will, does 'getbyte' make sense on non-binary strings? (And when we have a binary string, is there any difference between 'getbyte', 'getcodepoint', and 'getgrapheme' at all?)
If so, will 16- and 32-bit encodings have to implement this with a forward scan from the start of the string (the way getcodepoint would have to be implemented with a variable-width encoding), or do you have another trick up your sleeve?
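For illustration, here is a rough Python sketch of the two costs I have in mind. The function names are hypothetical and only loosely mirror the ops above; I am using UTF-8 as the variable-width case and big-endian UTF-32 as the fixed-width case, which is my choice of examples and not something the proposal specifies.

```python
def utf8_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def getcodepoint_utf8(buf: bytes, index: int) -> int:
    """Variable width: walk forward from the start, one character at a time."""
    pos = 0
    for _ in range(index):
        pos += utf8_len(buf[pos])
    width = utf8_len(buf[pos])
    return ord(buf[pos:pos + width].decode("utf-8"))


def getcodepoint_utf32(buf: bytes, index: int) -> int:
    """Fixed width: the byte offset is just index * 4, no scan needed."""
    return int.from_bytes(buf[index * 4:index * 4 + 4], "big")


def getbyte(buf: bytes, offset: int) -> int:
    """Raw byte at a byte offset, regardless of what the encoding means."""
    return buf[offset]


text = "a\u00e9\u6f22b"  # 'a', e-acute, a CJK character, 'b'
as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32-be")

cp_scan = getcodepoint_utf8(as_utf8, 2)      # needs a scan over 2 characters
cp_direct = getcodepoint_utf32(as_utf32, 2)  # direct index
mid_char = getbyte(as_utf8, 2)               # second byte of e-acute
```

Note that `getbyte` at offset 2 of the UTF-8 buffer lands in the middle of a character, and the same offset in the UTF-32 buffer would give a different byte entirely, which is why I am unsure what the op should mean once the encoding can change underneath it.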
> setbyte Sx, Iy, Iz
> (u)setcodepoint Sx, Iy, Iz
> (u)setgrapheme Sx, Sy, Iz
Likewise.
--
Brent "Dax" Royal-Gordon <[EMAIL PROTECTED]>
Perl and Parrot hacker
Oceania has always been at war with Eastasia.