Re: Unicode step by step

Jeff Clites Sat, 10 Apr 2004 13:13:13 -0700

On Apr 10, 2004, at 6:13 AM, Leopold Toetsch wrote:

2) String PBC layout. The internal string type has changed. This currently breaks native_pbc tests (that have strings) as well as some "parrot xx.pbc" tests related to strings.

These are working for me (which tests are failing for you?). I did update the PF_* API to match the changes to string internals. Of course, since the internals changed, the pbc layout changed too, so the native_pbc test files need to be regenerated on the various platforms, but the ppc one I submitted (see other post, or original patch submission) should work. If that one fails for you, it's probably because of byte order, and I need to find where we do the endianness correction for integers in pbc files and hook in to do something similar for certain string cases. If someone can send me a number_X.pbc file generated on an i386 platform, that will help me test.

It's correct, though, that there's no backward-compatibility code in place to allow reading old pbc files. Do we want to have that sort of thing at this stage? (Certainly, I'd think that after 1.0 we'd want backward compatibility with any format changes, but do we need it now?)

But let me know which "parrot xx.pbc" tests are failing for you.

The layout seems to depend somehow on the supported Unicode levels (or not). So before fixing the PBC issues, I'd just have a statement: parrot_string_t looks such and such or of course as is now.

Could you rephrase? I'm not understanding what you are saying.

The only real change in the pbc format (if I'm recalling correctly; I'll have to go back and look) is that rather than serializing the encoding/chartype/language triple, we are writing out s->representation (still followed by s->bufused and then the contents of the buffer). The only other wrinkle is that for cases where s->representation is 2 or 4, we need to correct the endianness of the buffer contents when we load the bytecode.
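To make the endianness wrinkle concrete, here is a minimal sketch in C of what such a correction could look like. The function names (host_is_big_endian, swap_code_units) are hypothetical and are not part of the PF_* API, and the sketch assumes that s->representation is the width in bytes of each code unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 on a big-endian host, 0 otherwise. */
int host_is_big_endian(void)
{
    const uint16_t probe = 0x0102;
    return *(const unsigned char *)&probe == 0x01;
}

/* Hypothetical helper: reverse the bytes of every code unit in place.
 * `width` stands in for s->representation (assumed: bytes per code unit),
 * `bufused` for s->bufused (bytes in use).
 * Returns 0 on success, -1 if the width is unsupported or bufused is
 * not a whole number of code units. */
int swap_code_units(unsigned char *buf, size_t bufused, unsigned width)
{
    size_t i, j;

    if (width == 1)
        return 0;                       /* single bytes need no correction */
    if ((width != 2 && width != 4) || bufused % width != 0)
        return -1;

    for (i = 0; i < bufused; i += width) {
        for (j = 0; j < width / 2; j++) {
            unsigned char tmp = buf[i + j];
            buf[i + j] = buf[i + width - 1 - j];
            buf[i + width - 1 - j] = tmp;
        }
    }
    return 0;
}
```

A loader would call something like swap_code_units only when the byte order recorded in the pbc file differs from what host_is_big_endian reports.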

This is probably a separate discussion, but we _could_ decide instead to represent strings in pbc files always in UTF-8. Advantage: Simpler, no endianness correction needed, probably durable to further changes in string internals, could isolate s->representation awareness to string.c and string_primitives.c. Disadvantages: De-serializing a string from a pbc file will always involve a copy, and could result in larger files in some cases. I could argue it either way--one's cleaner, the other is probably faster.
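To illustrate the size trade-off of the UTF-8 option, here is a minimal sketch in C of encoding a single code point. The names utf8_length and utf8_encode are hypothetical, not existing Parrot or ICU functions; the encoding rules themselves are standard UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes UTF-8 needs for one code point; 0 if it cannot be encoded
 * (a surrogate, or a value above U+10FFFF). */
size_t utf8_length(uint32_t cp)
{
    if (cp < 0x80)                    return 1;
    if (cp < 0x800)                   return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000)                 return 3;
    if (cp <= 0x10FFFF)               return 4;
    return 0;
}

/* Encode one code point into `out` (room for at least 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    size_t n = utf8_length(cp);

    switch (n) {
    case 1:
        out[0] = (unsigned char)cp;
        break;
    case 2:
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 3:
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 4:
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    default:
        break;                          /* invalid: nothing written */
    }
    return n;
}
```

A character such as U+4E2D fits in one 2-byte code unit but takes 3 bytes in UTF-8, which is the "larger files in some cases" situation; on the other hand the output is a plain byte sequence, so no endianness correction is needed when reading it back.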

There is of course still the question: should we really have ICU in the tree? This needs tracking updates and patching (again) to make it build and so on.

One consideration is that I may need to patch ICU in a few places. There's at least one API which they only expose in C++, so I need to wrap it in C, and it's cleaner to do that as a patch to ICU than to have C++ code in the core of parrot. Other than that, I think it boils down to convenience, and (possibly) consistency in being able to say that parrot version foo corresponds to ICU version bar (but maybe we don't need to be able to say that).

JEff
