Re: Strings. Finally.

Dan Sugalski Tue, 15 Jun 2004 07:52:00 -0700

At 8:41 PM -0700 6/14/04, Brent 'Dax' Royal-Gordon wrote:

Sorry to reply to this, but I feel that this is a request for clarifications, not for a change. :^)

Dan Sugalski wrote:
Synthesized code points
=======================
...
becomes two integers, 0x00000041 and 0x000082A9. (Though it could
represent them as 16-bit integers, since no character takes three or
more bytes)
It strikes me that this scheme is not always null-safe (e.g. the character 00 11 would be indistinguishable from a bare 11). Are there any encodings this could cause a problem with?

It's null-safe, for several reasons. First, of course, we keep the length, so we're fine there. Second, we know the encoding scheme--we don't have to inspect the data to see whether it's 16- or 32-bit. That's part of the encoding, attached to the string data.
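
As an illustration of the two reasons above, here is a minimal Python sketch. The `EncodedString` class and its field names are hypothetical and are not Parrot's actual string header; the sketch only shows that a zero byte inside a fixed-width code unit is harmless when the length and the code-unit width travel with the data.

```python
import struct
from dataclasses import dataclass


@dataclass
class EncodedString:
    """Hypothetical string header (an assumption, not Parrot's real layout)."""
    data: bytes       # raw storage; may legitimately contain zero bytes
    unit_bytes: int   # width of one code unit (2 or 4), known from the encoding

    def __len__(self) -> int:
        # The length comes from the stored buffer size, never from a NUL scan.
        return len(self.data) // self.unit_bytes

    def get_codepoint(self, index: int) -> int:
        # The unit width is metadata, so the data itself is never inspected
        # to guess whether it holds 16-bit or 32-bit values.
        fmt = {2: ">H", 4: ">I"}[self.unit_bytes]
        return struct.unpack_from(fmt, self.data, index * self.unit_bytes)[0]


# The two values from the quoted example, followed by 0x0011,
# whose high byte is zero ("00 11").
values = (0x0041, 0x82A9, 0x0011)
s16 = EncodedString(struct.pack(">3H", *values), unit_bytes=2)
s32 = EncodedString(struct.pack(">3I", *values), unit_bytes=4)
```

Both `s16` and `s32` report three code units and return `0x0011` for the last one, even though the buffers contain zero bytes.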

getbyte          Ix, Sy, Iz
(u)getcodepoint  Ix, Sy, Iz
(u)getgrapheme   Sx, Sy, Iz
Get the byte, codepoint, or grapheme requested. Destination is either
an integer (representing the byte or codepoint) or a string. Sy is the
source string, Iz is the offset in bytes, code points, or graphemes
from the beginning of the string.
Since we're going to be shifting around the encoding essentially at will, does 'getbyte' make sense on non-binary strings? (And when we have a binary string, is there any difference between 'getbyte', 'getcodepoint', and 'getgrapheme' at all?)

Makes sense? Well.... no. (Or rarely, at least. I can see it potentially being useful when writing some compression or checksumming code where you want to operate on the raw data of a buffer with a higher-level structure.) Will people do it? Yes. Yes, they will, and if we don't make it possible they'll do it anyway. This is a concession to reality, so that we at least have some chance of doing this with some consistency.

If so, will 16- and 32-bit encodings have to implement this with a forward scan from the start of the string (the way getcodepoint would have to be implemented with a variable-width encoding), or do you have another trick up your sleeve?

There's no difference between the three for binary data, and no difference between the codepoint/grapheme version for character sets with no combining characters.
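
A rough Python sketch of where the three views agree and where they diverge. The function names mirror the ops but are hypothetical helpers, not Parrot code, and "grapheme" is simplified here to one base character plus any following combining marks (as reported by `unicodedata.combining`), which is not the full Unicode segmentation algorithm.

```python
import unicodedata


def get_byte(raw: bytes, offset: int) -> int:
    """Return the byte at the given byte offset."""
    return raw[offset]


def get_codepoint(text: str, offset: int) -> int:
    """Return the code point at the given code point offset."""
    return ord(text[offset])


def get_grapheme(text: str, offset: int) -> str:
    """Return the offset-th grapheme by scanning from the start of the string.

    Simplifying assumption: a grapheme is a base character followed by
    zero or more combining marks.
    """
    index = -1
    start = None
    for pos, ch in enumerate(text):
        # A non-combining character (or the very first one) starts a grapheme.
        if unicodedata.combining(ch) == 0 or index == -1:
            index += 1
            if index == offset:
                start = pos
            elif index == offset + 1:
                return text[start:pos]
    if start is None:
        raise IndexError("grapheme offset out of range")
    return text[start:]


plain = "abc"             # no combining characters
accented = "e\u0301x"     # 'e' + COMBINING ACUTE ACCENT, then 'x'
```

For `plain`, offset 1 yields `b` in all three views. For `accented`, offset 1 is the byte `0xCC` in UTF-8, the code point U+0301, and the grapheme `x`.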

getgrapheme will require a scan from the start on strings with combining characters, but there are caching tricks and such that can be done. Abstracting it out this way gives us a place to hide those tricks when we get time to add them in.
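
One possible caching trick, sketched in Python under stated assumptions: the `GraphemeIndex` class is hypothetical, and it again treats a grapheme as a base character plus trailing combining marks. The first lookup pays for a full scan and records where each grapheme starts; later lookups reuse that table. This is only one way the abstraction could hide such a cache, not a description of what Parrot does.

```python
import unicodedata


class GraphemeIndex:
    """Hypothetical wrapper that caches grapheme start offsets on first use."""

    def __init__(self, text: str):
        self.text = text
        self._starts = None   # filled in lazily by the first lookup
        self.scans = 0        # number of full scans performed, for illustration

    def _build(self) -> None:
        # One pass from the start: record each position that begins a grapheme.
        self._starts = [
            pos for pos, ch in enumerate(self.text)
            if pos == 0 or unicodedata.combining(ch) == 0
        ]
        self.scans += 1

    def get_grapheme(self, offset: int) -> str:
        if self._starts is None:
            self._build()
        if offset < 0 or offset >= len(self._starts):
            raise IndexError("grapheme offset out of range")
        start = self._starts[offset]
        if offset + 1 < len(self._starts):
            end = self._starts[offset + 1]
        else:
            end = len(self.text)
        return self.text[start:end]


# 'e' + acute, 'a' + diaeresis + dot below, then plain 'z'
indexed = GraphemeIndex("e\u0301a\u0308\u0323z")
middle = indexed.get_grapheme(1)
last = indexed.get_grapheme(2)
```

After both lookups, `indexed.scans` is still 1: the second call is answered from the cached table. A real implementation would also have to invalidate or patch the table whenever the string is modified.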

setbyte          Sx, Iy, Iz
(u)setcodepoint  Sx, Iy, Iz
(u)setgrapheme   Sx, Sy, Iz

Likewise.

Yup, the same applies here. With the sets we can do consistency guarantees and whatnot.

-- Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
