> Absolutely. A few other issues that I remembered last night are:
>
>  - The current code assumes that the string data will be two
>    byte aligned for UTF-16 and four byte aligned for UTF-32 which
>    is probably reasonable but maybe not.

Yeah, I think we can handle that for constants in the constant section.  As
for run-time generated strings, we can align them to whatever we want
(assuming the underlying architecture allows it).
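
If we ever do get handed a buffer we can't trust, something like the
following might do as a fallback.  This is only a sketch: is_aligned and
read_u16_unaligned are names I made up, and it assumes a C99 <stdint.h>
that provides uint16_t and uintptr_t, which is not a given everywhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if p is aligned to a multiple of size bytes. */
static int is_aligned(const void *p, size_t size)
{
    return ((uintptr_t)p % size) == 0;
}

/* Hypothetical helper: read the i-th 16-bit unit, in host byte order,
 * from a buffer that may not be two-byte aligned.  memcpy avoids the
 * misaligned load that a cast to (uint16_t *) could trigger. */
static uint16_t read_u16_unaligned(const unsigned char *buf, size_t i)
{
    uint16_t unit;
    memcpy(&unit, buf + 2 * i, sizeof unit);
    return unit;
}
```

The fast path could check is_aligned once per string and only fall back to
the memcpy reader when the check fails.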

>  - The utf8_t, utf16_t and utf32_t types will need to be determined
>    by configure as they will currently break on some machines. Plus
>    machines without native 8, 16 and 32 bit types will be a problem.

Almost all hardware should have char as an 8 bit type so that shouldn't be a
problem.  However, finding a 16 bit or 32 bit type might be a problem on
some hardware.  We might want to think about using arrays of 8 bit types or
using bit fields.
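
To make the array idea concrete, here is a rough sketch of one UTF-32 unit
kept in four unsigned chars.  The names utf32_store and utf32_load are
invented for illustration, and storing the most significant octet first is
an arbitrary choice on my part.  It relies only on unsigned long being at
least 32 bits, which C guarantees.

```c
/* Hypothetical helper: store one UTF-32 unit as four octets, most
 * significant first.  The 0xFF masks keep each element to 8 bits even
 * on hardware where char is wider. */
static void utf32_store(unsigned char out[4], unsigned long cp)
{
    out[0] = (unsigned char)((cp >> 24) & 0xFFUL);
    out[1] = (unsigned char)((cp >> 16) & 0xFFUL);
    out[2] = (unsigned char)((cp >> 8) & 0xFFUL);
    out[3] = (unsigned char)(cp & 0xFFUL);
}

/* Hypothetical helper: rebuild the unit from its four octets. */
static unsigned long utf32_load(const unsigned char in[4])
{
    return ((unsigned long)in[0] << 24)
         | ((unsigned long)in[1] << 16)
         | ((unsigned long)in[2] << 8)
         |  (unsigned long)in[3];
}
```

The cost is a shift-and-or per access instead of a single load, but the
layout no longer depends on what integer types configure finds.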

>  - There are byte ordering issues for UTF-16 and UTF-32 strings. The
>    current code assumes host byte ordering but should we be spotting
>    byte order markers in the strings and adjusting to cope?

There are byte ordering issues for all of parrot.  I assume we'll fix these
when we fix the rest of parrot.
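
For when we get there, spotting a byte order marker in UTF-16 data is
cheap.  A minimal sketch follows; the enum and function names are made up,
and treating "no marker" as big-endian is my assumption for the sketch (the
current code's host-order default could be substituted).

```c
#include <stddef.h>

typedef enum { BOM_NONE, BOM_BIG_ENDIAN, BOM_LITTLE_ENDIAN } bom_order;

/* Hypothetical helper: inspect the first two octets for U+FEFF. */
static bom_order utf16_detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return BOM_BIG_ENDIAN;
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return BOM_LITTLE_ENDIAN;
    }
    return BOM_NONE;
}

/* Hypothetical helper: read one 16-bit unit from two octets in the
 * given order.  BOM_NONE falls back to big-endian in this sketch. */
static unsigned int utf16_read_unit(const unsigned char *p, bom_order order)
{
    if (order == BOM_LITTLE_ENDIAN)
        return (unsigned int)p[0] | ((unsigned int)p[1] << 8);
    return ((unsigned int)p[0] << 8) | (unsigned int)p[1];
}
```

A caller would detect the marker once, skip those two octets, and pass the
resulting order to every subsequent read.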

>A fundamental question (which I think Simon was hinting at with his
>cryptic comment) is whether the native encoding is fixed when parrot
>is built or can change on the fly as the user changes their locale
>settings. If it's the latter then conversion to and from native will
>have to work by loading an appropriate conversion table at run time.

This could get interesting.  I don't think it should be fixed when parrot is
built, but should be gathered at initialization time.  If the user wants to
change his locale right in the middle of a parrot run, that is even more
interesting and we would probably have to have an opcode that would go
through each native string and change it to fit the new locale.
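
On POSIX systems, gathering the encoding at initialization could look
roughly like this.  It is a sketch under assumptions: it needs
<langinfo.h>, which not every platform has, and native_encoding and
init_native_encoding are names I invented, not anything in the tree.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cache for the name of the native encoding. */
static char native_encoding[64] = "";

/* Hypothetical init hook: adopt the user's locale for character
 * classification and remember the codeset name it reports. */
static const char *init_native_encoding(void)
{
    const char *codeset;

    /* If this fails, the locale stays at the default "C" locale. */
    (void)setlocale(LC_CTYPE, "");

    /* nl_langinfo may reuse its buffer, so copy the name out. */
    codeset = nl_langinfo(CODESET);
    strncpy(native_encoding, codeset, sizeof native_encoding - 1);
    native_encoding[sizeof native_encoding - 1] = '\0';
    return native_encoding;
}
```

Calling it again after a locale change would refresh the cached name;
re-encoding the native strings that already exist is the hard part and is
not shown here.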

Tanton
