Hi Dan & Michael,
As a guy who speaks a strange language (multi-byte chars, multi-glyph
chars, caseless text and half vowels), I think you have made this more
complicated than it should be.
> charset end of things, offsets will be in graphemes (or Freds. I
> don't remember what we finally decided to name the things))
Writing Unicode composition and rendering is VERY VERY HARD ...
Find a way to leech that ... (I've tried a Pango module for Malayalam and
it's really, really hard to do).
> When dealing with variable-length encodings, removal of codepoints in
> the middle may make the string shrink, and adding them may make it
> grow. The encoding layer is responsible for managing the underlying
> byte buffer to maintain consistency.
It was so easy with immutable strings ... I think that is why Java could
implement Unicode properly :)
> >> void to_encoding(STRING *);
> >>
> >> Make the string the new encoding, in place
A STRING should always be Unicode, IMHO. It should be converted to a
byte buffer when encoding, and back from a byte buffer when decoding.
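To make that concrete, here is a rough sketch of the only two places that
would ever need to see bytes, assuming the in-memory form is an array of
UINTVAL codepoints and the external form is UTF-8. The names utf8_encode and
utf8_decode are made up for illustration, UINTVAL is assumed to be a 32-bit
unsigned type, and the decoder is deliberately simplified (it does not reject
overlong forms or surrogates).

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t UINTVAL;   /* assumption: wide enough for any codepoint */

/* Encode `count` codepoints as UTF-8 into `out` (capacity `cap` bytes).
 * Returns the number of bytes written, or (size_t)-1 if a codepoint is
 * invalid or the buffer is too small. */
size_t utf8_encode(const UINTVAL *cp, size_t count,
                   unsigned char *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        UINTVAL c = cp[i];
        size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF) || cap - n < need)
            return (size_t)-1;
        if (need == 1) {
            out[n++] = (unsigned char)c;
        } else if (need == 2) {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (need == 3) {
            out[n++] = (unsigned char)(0xE0 | (c >> 12));
            out[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[n++] = (unsigned char)(0xF0 | (c >> 18));
            out[n++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return n;
}

/* Decode `len` UTF-8 bytes into at most `cap` codepoints.
 * Returns the number of codepoints written, or (size_t)-1 on a bad lead
 * byte, a bad or missing continuation byte, or lack of space. */
size_t utf8_decode(const unsigned char *in, size_t len,
                   UINTVAL *out, size_t cap)
{
    size_t i = 0, n = 0;
    while (i < len) {
        unsigned char b = in[i];
        size_t extra;   /* continuation bytes that must follow */
        UINTVAL c;
        if (b < 0x80)                { c = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { c = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { c = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { c = b & 0x07; extra = 3; }
        else return (size_t)-1;
        if (len - i - 1 < extra || n == cap)
            return (size_t)-1;
        for (size_t k = 1; k <= extra; k++) {
            if ((in[i + k] & 0xC0) != 0x80)
                return (size_t)-1;
            c = (c << 6) | (UINTVAL)(in[i + k] & 0x3F);
        }
        out[n++] = c;
        i += extra + 1;
    }
    return n;
}
```

Everything between a decode on the way in and an encode on the way out would
then work on codepoints only.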
> >> UINTVAL get_codepoint(STRING *, offset);
> >> void set_codepoint(STRING, offset, UINTVAL codepoint);
*If* a STRING always contains (length, UINTVAL[]), doesn't that make
life easier?
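Something like the following is what I have in mind. It is only a sketch:
the struct layout, the string_new helper, the use of malloc/calloc instead of
the GC, and UINTVAL being a 32-bit unsigned type are all my assumptions, not
anything from the proposal. With one array slot per codepoint, get_codepoint
and set_codepoint become plain array accesses and the string never has to
shrink or grow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t UINTVAL;   /* assumption: wide enough for any codepoint */

typedef struct {
    size_t   length;        /* number of codepoints, not bytes */
    UINTVAL *data;          /* one array slot per codepoint */
} STRING;

/* Hypothetical constructor: a zero-filled string of `length` codepoints.
 * Returns NULL if allocation fails. */
STRING *string_new(size_t length)
{
    STRING *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->length = length;
    s->data = calloc(length ? length : 1, sizeof *s->data);
    if (s->data == NULL) {
        free(s);
        return NULL;
    }
    return s;
}

/* Constant-time read: the offset is simply an array index. */
UINTVAL get_codepoint(const STRING *s, size_t offset)
{
    assert(s != NULL && offset < s->length);
    return s->data[offset];
}

/* Constant-time write: no neighbouring data ever has to move. */
void set_codepoint(STRING *s, size_t offset, UINTVAL codepoint)
{
    assert(s != NULL && offset < s->length);
    s->data[offset] = codepoint;
}
```

The price is four bytes per codepoint, which I think is a fair trade for
never having to think about variable-length data inside the string.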
> >> UINTVAL get_byte(STRING *, offset)
...
> Byte offset. Needs more clarity.
...
> >> void set_byte(STRING *, offset, UINTVAL byte);
...
My advice would be to never let the layer above the encoding know that
we're storing it in bytes :)
> >> STRING *get_codepoints(STRING, offset, count);
Immutability of the returned string (and the original) would save
memory, especially if the UINTVAL array were GC allocated :) Of course,
what you have here is the substring operation under a new and
obfuscated name :)
// Some pseudocode, as I see it.
substring(string, offset, count)
{
    // validate params or catch fire and exit
    string2 = gc_alloc(string_header);
    string2->length = count;
    string2->data = &(string->data[offset]); // hopefully data is also gc_alloc'd
    return string2;
}
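The pseudocode above could be made compilable roughly as follows. Again this
is a sketch under my own assumptions: the STRING layout is invented, UINTVAL
is taken to be a 32-bit unsigned type, and malloc stands in for gc_alloc, so
without a real GC the caller has to keep the parent's data alive for as long
as the substring is in use.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t UINTVAL;   /* assumption: wide enough for any codepoint */

typedef struct {
    size_t         length;  /* number of codepoints */
    const UINTVAL *data;    /* immutable, possibly shared with other strings */
} STRING;

/* Return a new header that points into the parent's codepoint array.
 * No codepoints are copied. Returns NULL on bad parameters or if the
 * header cannot be allocated. */
STRING *substring(const STRING *string, size_t offset, size_t count)
{
    /* Validate params; written this way to avoid overflow in offset + count. */
    if (string == NULL || offset > string->length ||
        count > string->length - offset)
        return NULL;

    STRING *string2 = malloc(sizeof *string2);  /* gc_alloc in a real VM */
    if (string2 == NULL)
        return NULL;

    string2->length = count;
    string2->data = string->data + offset;      /* share, do not copy */
    return string2;
}
```

Because neither string can be modified, sharing the array is safe, and a
substring costs one small header no matter how long it is.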
I'm afraid your design is waaay too complicated, at least for an
average guy like me.
I'd like to suggest that every STRING be Unicode, converted to byte
buffers and back for all other purposes. But that's just a suggestion :)
Gopal