Jarkko Hietaniemi wrote:
> > Umm, one way or another I suspect UTF-8 will be in there.
>
> I suspect so too but very grudgingly. As Dan said dealing with
> variable length data is a major pain. UTF-8 is certainly a much
> better designed VLD than most but it's still a pain.
>
I guess that's why strings should be abstracted: everything outside the
string-handling functions should access them only through an API.
The string API should be smart enough to convert data from one encoding to
another whenever that is more convenient. For example, if the compiler sees
a sub with several calls to "substr" inside a loop, all acting on the same
string, it would probably set things up so that the sub tells the string to
morph into a fixed-width form that is cheap to index. If there's only one
"substr" outside of a loop, it probably wouldn't bother, since the
conversion would cost more than counting characters once through a
variable-width string.
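Here is a rough sketch of the idea, written in Python purely for
illustration. The class MorphString and its methods are names I made up;
this is not a proposed Perl API, just one way such a string could hide its
encoding behind substr.

```python
class MorphString:
    """Hypothetical string that hides its internal encoding behind an API."""

    def __init__(self, text):
        self._utf8 = text.encode("utf-8")  # compact, variable-width form
        self._fixed = None                 # fixed-width form, built on demand

    def morph_to_fixed_width(self):
        # One O(n) pass; afterwards every character is exactly 4 bytes.
        if self._fixed is None:
            self._fixed = self._utf8.decode("utf-8").encode("utf-32-le")

    def substr(self, start, length):
        if self._fixed is not None:
            # Fixed width: the byte offset is simply 4 * character index.
            chunk = self._fixed[4 * start:4 * (start + length)]
            return chunk.decode("utf-32-le")
        # Variable width: walk the bytes, counting character starts.
        begin = self._byte_offset(start, 0, 0)
        end = self._byte_offset(start + length, begin, start)
        return self._utf8[begin:end].decode("utf-8")

    def _byte_offset(self, char_index, pos, seen):
        # Advance from byte `pos` (which is character `seen`) to `char_index`.
        data = self._utf8
        while seen < char_index and pos < len(data):
            pos += 1
            # Skip UTF-8 continuation bytes (bit pattern 10xxxxxx).
            while pos < len(data) and (data[pos] & 0xC0) == 0x80:
                pos += 1
            seen += 1
        return pos


s = MorphString("naïve café")
first = s.substr(2, 3)       # scans the UTF-8 bytes
s.morph_to_fixed_width()     # what the compiler would hoist out of a loop
second = s.substr(2, 3)      # direct indexing
```

Callers get the same answer either way; only the cost changes, which is
the whole point of keeping the encoding behind the API.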
On the other hand, for a string that is matched against regexps, it doesn't
matter much whether the characters have variable length, since regexps
normally read the whole string anyway, and indexing characters isn't much of
a concern.
It would be nice if the user had some control over this, for example by
saying "I don't care that this string will be used by substr, leave it in
UTF-8 since it's too big and I don't want to waste memory!", or "This string
isn't too big, so convert it to bloated UTF-32 at once!", or even "use less
'memory';".
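A tiny sketch of how such a hint could override the default choice, again
in Python for illustration only. The function name, the hint values
"less-memory" and "fast-index", and the "more than one substr" threshold
are all assumptions of mine, not anything that exists.

```python
def choose_representation(substr_calls, hint=None):
    """Pick an internal encoding; `hint` is a hypothetical user override."""
    if hint == "less-memory":
        return "utf-8"    # user prefers compact storage, whatever the cost
    if hint == "fast-index":
        return "utf-32"   # user prefers cheap indexing, whatever the size
    # No hint: convert only if the conversion is likely to pay for itself.
    return "utf-32" if substr_calls > 1 else "utf-8"
```

So a sub doing a hundred substr calls would normally get UTF-32, but
choose_representation(100, hint="less-memory") keeps the string in UTF-8.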
And I believe 8-bit ASCII will always be an option, for those who don't care
about extended characters and want the best of both worlds on speed and
memory usage.
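To put rough numbers on the memory side, here is a small Python check of
how many bytes the same text takes in each form. Python's "latin-1" codec
stands in for a one-byte-per-character encoding; that choice, and the
helper name encoded_sizes, are my assumptions.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in three encodings."""
    encodings = ("latin-1", "utf-8", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


plain = encoded_sizes("hello world")   # ASCII-only text
accented = encoded_sizes("café")       # one non-ASCII character
```

For ASCII-only text the one-byte encoding and UTF-8 are the same size and
UTF-32 is four times larger; the accented character costs UTF-8 one extra
byte.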
- Branden