On Wed, Aug 03, 2005 at 10:44:47PM +0200, Leopold Toetsch wrote:
>
> On Aug 3, 2005, at 20:58, Will Coleda (via RT) wrote:
>
> >causes a segfault in the substr opcode (from tcl's lib/tclconst.pir),
> >and forces a few tcl-unicode escape tests into TODOs.
> >
> >A short PIR test that is equivalent:
> >
> >.sub main @MAIN
> >  $S0 = "\\u666"
> >  $I0 = 0x666
> >  $S1 = chr $I0 # works, but substr doesn't like this string.
> >  substr $S0, 0, 5, $S1
> >.end
> >
> >#1  0x0002d04c in string_replace (interpreter=0xd00180, src=0xe5b1c0,
> >offset=0, length=5, rep=0xe5a630, d=0x0) at src/string.c:1238
>
> string_replace has still the old code relying on fixed-width encodings
> with 1, 2, or 4 bytes per char, which is of course not true for utf8.
> This needs fixing.
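
To make the quoted point concrete: with a fixed-width encoding, character index N lives at byte N * width, but in utf8 U+0666 takes two bytes (0xD9 0xA6), so that arithmetic lands in the wrong place. The following is only a rough sketch of the scan a variable-width encoding needs; utf8_byte_offset is a hypothetical helper, not anything from Parrot's src/string.c, and it assumes well-formed, NUL-terminated input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not Parrot's API): return the byte offset of the
 * char_index-th code point in a NUL-terminated, well-formed UTF-8 string.
 * If the string is shorter than char_index, the offset of the terminating
 * NUL is returned. */
size_t utf8_byte_offset(const char *s, size_t char_index)
{
    assert(s != NULL);
    size_t byte = 0;
    while (char_index > 0 && s[byte] != '\0') {
        /* Step over the lead byte of the current code point... */
        byte++;
        /* ...and over any continuation bytes (bit pattern 10xxxxxx). */
        while (((unsigned char)s[byte] & 0xC0) == 0x80)
            byte++;
        char_index--;
    }
    return byte;
}
```

For the string "a", U+0666, "b", character 2 starts at byte 3, not byte 2, and every offset costs a linear scan rather than a multiplication.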
I thought that one thing Jarkko learned from perl 5's Unicode model was that
the code and pain needed to support a variable length encoding outweigh the
space saving that the encoding gives. In turn Dan had decided that Parrot
should internally unpack to some form of fixed width encoding. So all Unicode
would be stored internally in the shortest of ISO-8859-1, UCS-2 and UCS-4
that encompassed all the code points used.

1: My memory may be wrong on this
2: It may not have been explicit
3: I may have missed an explicit change

But having dealt with the fun of variable length encodings, my gut feeling is
with Jarkko, that it's probably better to stay fixed width internally.

Nicholas Clark
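
As a rough illustration of the "shortest fixed width that encompasses all the code points" idea, here is a minimal sketch. It assumes the string has already been decoded into an array of code points. The name narrowest_fixed_width is made up for this example and is not Parrot's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Parrot's API): return the smallest number of
 * bytes per character (1, 2 or 4) that can hold every code point in cps
 * as a fixed-width unit. 1 corresponds to ISO-8859-1 (up to U+00FF),
 * 2 to a 16-bit fixed-width form (up to U+FFFF), 4 to a 32-bit form. */
unsigned narrowest_fixed_width(const uint32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    unsigned width = 1;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] > 0xFFFF)
            return 4;       /* nothing wider is possible, stop early */
        if (cps[i] > 0xFF)
            width = 2;      /* keep scanning in case 4 is needed */
    }
    return width;
}
```

With such a width in hand, the byte offset of character N is simply N * width, which is the assumption the old string_replace code makes. The cost is that a single wide character widens the whole string.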