Tonight on #parrot:
03:15 <@mdiep> meaning that imcc doesn't know it's being fed utf8
instead of ascii
03:16 <@Coke> mdiep: B***it. it knows the encoding of the string.
*) Parrot's compilers take plain old C-strings and don't know anything
about the charset/encoding of the string - but read on
*) I was thinking about changing the interface to pass in a STRING*
and convert *somehow*.
But given this snippet:
.sub main :main
.local string code, code2
.local pmc compiler, prog
code = <<'EOC'
set S0, iso-8859-1:"Tötsch"
print S0
EOC
code2 = downcase unicode:"END\n"
code .= code2 # code is now utf16/ucs2
compiler = compreg "PASM"
$I0 = find_encoding "utf8" ## case 1)
trans_encoding code, $I0
# $I0 = find_charset "ascii" ## case 2)
# trans_charset code, $I0
prog = compiler(code)
prog()
.end
Case 1 (transcode utf16 to utf8) results in this program being compiled:
set S0, iso-8859-1:"Tötsch"
print S0
end
( ./parrot -D20 foo.pir; cat EVAL_1 )
This is plain wrong, because the string now contains a utf8-encoded
sequence instead of the latin1 char. Sure, the program even prints
something, but the wrong thing.
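To make the mismatch in case 1 concrete, here is a minimal sketch in Python (not Parrot code; the variable names are only for illustration). It assumes nothing beyond what the literal's prefix promises: the bytes between the quotes are supposed to be latin1, but after transcoding they are utf8.

```python
# The text of the string literal as typed in the source snippet.
literal = "Tötsch"

# What the iso-8859-1: prefix promises: a single byte 0xf6 for the o-umlaut.
latin1_bytes = literal.encode("iso-8859-1")

# What the compiler receives after transcoding to utf8: the pair 0xc3 0xb6.
utf8_bytes = literal.encode("utf-8")

# Taking the utf8 bytes at face value as latin1 gives a different,
# longer string, which is the "wrong thing" the program ends up printing.
misread = utf8_bytes.decode("iso-8859-1")

print(latin1_bytes.hex(), utf8_bytes.hex(), len(literal), len(misread))
```

The literal has six characters, but the misread version has seven, because the two utf8 bytes are each treated as one latin1 character.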
Case 2 (convert to ascii) just fails, because the text isn't ascii:
"can't convert unicode to ascii"
So now what should we do? I have expressed my opinion several times on
IRC, with not much success so far: Parrot can't do anything about this,
because the code to compile is invalid or wrong or bogus.
It's up to the compiler writer to produce valid code. Valid input for
PASM/PIR is currently just ascii.
The above program should start with:
code = <<'EOC'
set S0, iso-8859-1:"T\xf6tsch"
That is: everything non-ascii in the source code has to be escaped
properly, *then* a conversion to ascii succeeds and everything is fine.
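A compiler writer could do this escaping mechanically before handing the code to Parrot. The following is a minimal sketch in Python, not anything Parrot provides: `escape_non_ascii` is a hypothetical helper, and it assumes a single-byte charset such as iso-8859-1 and that non-ascii characters occur only inside string literals.

```python
def escape_non_ascii(text, charset="iso-8859-1"):
    """Rewrite every non-ascii byte of text as a \\xNN escape.

    Hypothetical helper: only correct for single-byte charsets, where
    one byte corresponds to one character of the string literal.
    """
    out = []
    for byte in text.encode(charset):
        if byte < 0x80:
            out.append(chr(byte))          # plain ascii passes through
        else:
            out.append("\\x%02x" % byte)   # e.g. 0xf6 becomes \xf6
    return "".join(out)


raw = 'set S0, iso-8859-1:"Tötsch"'

# The unescaped line cannot be converted to ascii (this is case 2).
try:
    raw.encode("ascii")
    raw_is_ascii = True
except UnicodeEncodeError:
    raw_is_ascii = False

# After escaping, the same conversion succeeds.
escaped = escape_non_ascii(raw)
escaped_bytes = escaped.encode("ascii")
print(raw_is_ascii, escaped)
```

The escaped line matches the corrected first line of the program shown above, and it survives a conversion to ascii without loss.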
IMHO the only thing left to discuss is whether Parrot should transcode
the program code to ascii, or whether this is up to the compiler writer.
leo