Sorry to reply to this, but I feel that this is a request for clarifications, not for a change. :^)
Dan Sugalski wrote:
...
Synthesized code points
=======================
becomes two integers, 0x00000041 and 0x000082A9. (Though it could represent them as 16-bit integers, since no character takes three or more bytes)
It strikes me that this scheme is not always null-safe (e.g. the character 00 11 would be indistinguishable from a bare 11). Are there any encodings this could cause a problem with?
It's null-safe, for several reasons. First, of course, we keep the length, so we're fine there. Second, we know the encoding scheme--we don't have to inspect the data to see if it's 16- or 32-bit. That's part of the encoding, attached to the string data.
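To make the two reasons concrete, here is a minimal sketch in Python, not Parrot's actual string structure. The `TaggedString` class and its `unit_bytes` field are hypothetical names invented for illustration. Because the record carries both its length and its unit width, the bytes 00 11 read as one 16-bit unit are never confused with a bare 11.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedString:
    """Hypothetical string record: raw bytes plus an explicit unit width."""
    data: bytes
    unit_bytes: int  # 1, 2, or 4 bytes per unit; stands in for the encoding tag

    def __len__(self) -> int:
        # Length in units comes from the stored byte length, not from a terminator.
        return len(self.data) // self.unit_bytes

    def units(self) -> list:
        # Decode fixed-width big-endian units; no inspection of the data is needed
        # to decide the width, because the width travels with the string.
        w = self.unit_bytes
        return [int.from_bytes(self.data[i:i + w], "big")
                for i in range(0, len(self.data), w)]


wide = TaggedString(b"\x00\x11", unit_bytes=2)      # one 16-bit unit, 0x0011
narrow = TaggedString(b"\x11", unit_bytes=1)        # one 8-bit unit, 0x11
with_nul = TaggedString(b"\x00\x11", unit_bytes=1)  # two 8-bit units, 0x00 then 0x11
```

Here `wide` and `with_nul` hold identical bytes but decode differently, and the embedded zero byte in `with_nul` does not end the string.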
getbyte Ix, Sy, Iz
(u)getcodepoint Ix, Sy, Iz
(u)getgrapheme Sx, Sy, Iz

Get the byte, codepoint, or grapheme requested. Destination is either an integer (representing the byte or codepoint) or a string. Sy is the source string, Iz is the offset in bytes, code points, or graphemes from the beginning of the string.
Since we're going to be shifting around the encoding essentially at will, does 'getbyte' make sense on non-binary strings? (And when we have a binary string, is there any difference between 'getbyte', 'getcodepoint', and 'getgrapheme' at all?)
Makes sense? Well.... no. (Or rarely, at least. I can see it potentially being useful when writing some compression or checksumming code where you want to operate on the raw data of a buffer with a higher-level structure.) Will people do it? Yes. Yes, they will, and if we don't make it possible they'll do it anyway. This is a concession to reality, so we've at least some chance to do this with some consistency.
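As an illustration of the checksumming case, here is a rough Python sketch. The `get_byte` and `checksum` helpers are hypothetical stand-ins, not the Parrot op itself. It runs a byte-wise checksum over the same two code points used earlier (0x41 and 0x82A9) under two encodings, which shows why byte access only means something once the encoding is pinned down.

```python
def get_byte(buf: bytes, offset: int) -> int:
    """Hypothetical stand-in for a getbyte-style op: raw byte at a byte offset."""
    return buf[offset]


def checksum(buf: bytes) -> int:
    """Toy checksum: sum of all raw bytes, modulo 256."""
    total = 0
    for i in range(len(buf)):
        total = (total + get_byte(buf, i)) & 0xFF
    return total


text = "A\u82a9"                    # code points 0x41 and 0x82A9
utf8 = text.encode("utf-8")         # bytes 41 E8 8A A9
utf16 = text.encode("utf-16-be")    # bytes 00 41 82 A9

utf8_sum = checksum(utf8)
utf16_sum = checksum(utf16)
```

The two sums differ even though the text is the same, so code that works at this level has to treat the buffer as raw data in one known encoding.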
If so, will 16- and 32-bit encodings have to implement this with a forward scan from the start of the string (the way getcodepoint would have to be implemented with a variable-width encoding), or do you have another trick up your sleeve?
There's no difference between the three for binary data, and no difference between the codepoint and grapheme versions for character sets with no combining characters.
The getgrapheme will require a scan from the start on strings with combining characters, but there are caching tricks and such that can be done. Abstracting it out this way gives us a place to hide those tricks when we get time to add them in.
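One possible shape for such a caching trick, sketched in Python under loud assumptions: the `GraphemeIndex` class is hypothetical, and a grapheme is approximated here as a base code point plus any following code points with a non-zero combining class, which is simpler than full Unicode grapheme clustering. The forward scan still happens, but each call resumes where the previous one stopped and remembers the grapheme boundaries it has found.

```python
import unicodedata


class GraphemeIndex:
    """Hypothetical helper: lazily caches where each grapheme starts."""

    def __init__(self, text: str):
        self.text = text
        self.starts = []   # code point offsets of the graphemes found so far
        self.scanned = 0   # number of code points already examined

    def _scan_until(self, n: int) -> None:
        # Resume the forward scan until the start of grapheme n + 1 is known
        # (that start marks the end of grapheme n) or the text runs out.
        while len(self.starts) <= n + 1 and self.scanned < len(self.text):
            ch = self.text[self.scanned]
            # A leading combining mark is treated as starting its own grapheme.
            if self.scanned == 0 or unicodedata.combining(ch) == 0:
                self.starts.append(self.scanned)
            self.scanned += 1

    def get_grapheme(self, n: int) -> str:
        self._scan_until(n)
        if n >= len(self.starts):
            raise IndexError(n)
        begin = self.starts[n]
        end = self.starts[n + 1] if n + 1 < len(self.starts) else len(self.text)
        return self.text[begin:end]


idx = GraphemeIndex("e\u0301a\u0308\u0323b")
first = idx.get_grapheme(0)    # scans only as far as the next base character
scanned_after_first = idx.scanned
last = idx.get_grapheme(2)     # continues from the cached position
second = idx.get_grapheme(1)   # answered from the cache, no further scanning
```

Hiding this behind the op means callers never see whether an access was a full scan or a cache hit.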
setbyte Sx, Iy, Iz
(u)setcodepoint Sx, Iy, Iz
(u)setgrapheme Sx, Sy, Iz
Likewise.
Yup, the same applies here. With the sets we can do consistency guarantees and whatnot.
--
Dan
--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk