Tonight on #parrot:

03:15 <@mdiep> meaning that imcc doesn't know it's being feed utf8 instead of ascii
03:16 <@Coke> mdiep: B***it. it knows the encoding of the string.

*) Parrot's compilers take plain old C-strings and don't know anything about the charset/encoding of the string - but read on

*) I was thinking about changing the interface to pass in a STRING*
   and convert *somehow*.

But given this snippet:

.sub main :main
    .local string code, code2
    .local pmc compiler, prog
    code = <<'EOC'
        set S0, iso-8859-1:"Tötsch"
        print S0
EOC
    code2 = downcase unicode:"END\n"
    code .= code2                       # code is now utf16/ucs2
    compiler = compreg "PASM"
    $I0 = find_encoding "utf8"          ## case 1)
    trans_encoding code, $I0
    # $I0 = find_charset "ascii"        ## case 2)
    # trans_charset code, $I0
    prog = compiler(code)
    prog()
.end

Case 1 (transcode utf16 to utf8) results in this program being compiled:

        set S0, iso-8859-1:"Tötsch"
        print S0
end

( ./parrot -D20 foo.pir; cat EVAL_1 )

This is plain wrong, because the string now contains a utf8-encoded sequence instead of the latin1 char. Sure, the program even prints something, but it prints the wrong thing.
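To make the byte-level problem concrete, here is a small sketch in Python (not Parrot code; the variable names are only for illustration). It assumes that the compiler treats the bytes between the quotes as iso-8859-1, as the literal's prefix claims, while the transcoding step has actually written utf8 bytes there.

```python
# Illustration only: Python stands in for what happens to the bytes.
text = "T\u00f6tsch"                        # "Tötsch", with o-umlaut U+00F6

# trans_encoding to utf8 turns the single latin1 char into two bytes.
utf8_bytes = text.encode("utf-8")

# The literal still claims iso-8859-1, so each byte is read as one char.
misread = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes)      # the o-umlaut is now the byte pair C3 B6
print(len(text), len(misread))
```

The misread string is one character longer than the original, because the two utf8 bytes C3 and B6 are each taken as a separate latin1 character.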

Case 2 (convert to ascii) just fails, because the text isn't ascii:

"can't convert unicode to ascii"
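The same failure can be reproduced outside Parrot. The following Python sketch (an analogy, not Parrot's implementation) tries to convert the offending source line to ascii and records whether that is rejected.

```python
# Illustration only: a strict ascii conversion of the unescaped source line.
line = 'set S0, iso-8859-1:"T\u00f6tsch"'   # contains a raw o-umlaut

try:
    line.encode("ascii")
    conversion_failed = False
except UnicodeEncodeError:
    # The raw non-ascii char has no ascii representation.
    conversion_failed = True

print(conversion_failed)
```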

So now what should we do? I have expressed my opinion several times on IRC, without much success so far: Parrot can't do anything about this, because the code to compile is invalid or wrong or bogus.

It's up to the compiler writer to produce valid code. Valid for PASM/PIR currently means just ascii.

The program above should start with:

    code = <<'EOC'
        set S0, iso-8859-1:"T\xf6tsch"

That is: everything non-ascii in the source code has to be escaped properly; *then* a conversion to ascii succeeds and everything is fine.
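As a rough sketch of what a compiler writer could do before handing code to Parrot, here is a Python helper that rewrites every non-ascii char as a \xNN escape. The name escape_non_ascii is hypothetical, and the helper assumes the literal is meant to be iso-8859-1, so it only handles code points up to 0xFF and rejects anything above that.

```python
def escape_non_ascii(source: str) -> str:
    """Replace each non-ascii char with a \\xNN escape (hypothetical helper)."""
    out = []
    for ch in source:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                  # plain ascii passes through
        elif cp <= 0xFF:
            out.append("\\x%02x" % cp)      # latin1 range: two hex digits
        else:
            # Assumption: other escape forms are out of scope for this sketch.
            raise ValueError("code point above 0xFF: %r" % ch)
    return "".join(out)

line = 'set S0, iso-8859-1:"T\u00f6tsch"'
escaped = escape_non_ascii(line)

# After escaping, the strict ascii conversion no longer fails.
ascii_bytes = escaped.encode("ascii")
print(escaped)
```

With the escape in place the source is pure ascii, and the latin1 char is restored only when the string literal itself is interpreted.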

IMHO the only thing left to discuss is whether Parrot should transcode program code to ascii, or whether this is up to the compiler writer.

leo
