On Tue Jun 16 03:35:58 2015, zef...@fysh.org wrote:
> I remember when Unix programs used to be 8-bit clean.
>
> $ env - ACME=$'L\xe9on' ./perl6-m -e 'say "hi"'
> Unhandled exception: Malformed UTF-8 at line 1 col 7
>    at gen/moar/stage2/QRegex.nqp:183
>      (/home/zefram/usr/rakudo/rakudo/install/share/nqp/lib/QRegex.moarvm::175)
>  from gen/moar/stage2/QRegex.nqp:11
>      (/home/zefram/usr/rakudo/rakudo/install/share/nqp/lib/QRegex.moarvm:<mainline>:46)
>  from <unknown>:1
>      (/home/zefram/usr/rakudo/rakudo/install/share/nqp/lib/QRegex.moarvm:<load>:6)
>  from src/vm/moar/ModuleLoader.nqp:51
>      (/home/zefram/usr/rakudo/rakudo/install/share/nqp/lib/ModuleLoader.moarvm::87)
>  from src/vm/moar/ModuleLoader.nqp:41
>      (/home/zefram/usr/rakudo/rakudo/install/share/nqp/lib/ModuleLoader.moarvm:load_module:85)
>  from <unknown>:1
>      (/home/zefram/usr/rakudo/rakudo/install/share/nqp/lib/NQPP6QRegex.moarvm:<dependencies+deserialize>:28)
>  from src/vm/moar/ModuleLoader.nqp:51
>      (/home/zefram/usr/rakudo/rakudo/install/share/nqp/lib/ModuleLoader.moarvm::87)
>  from src/vm/moar/ModuleLoader.nqp:41
>      (/home/zefram/usr/rakudo/rakudo/install/share/nqp/lib/ModuleLoader.moarvm:load_module:85)
>  from <unknown>:1
>      (/home/zefram/usr/rakudo/rakudo/perl6.moarvm:<dependencies+deserialize>:28)
> $ env - ./perl6-m -e 'say "hi"' $'L\xe9on'
> Unhandled exception: Malformed UTF-8 at line 1 col 2
>    at <unknown>:1
>      (/home/zefram/usr/rakudo/rakudo/perl6.moarvm:<entry>:4)
>
> OK, interpreting arguments as UTF-8 is a convenience for some things, but
> the values passed between processes here are general octet strings, with
> only nul excluded. There has to be a way to get at the octets unmolested,
> as there is when reading from an input handle (.read vs .get). The above
> exceptions are happening too early for the program to even declare an
> interest in the command line arguments or environment. The mandatory
> UTF-8 decoding means that it is impossible to implement such basic Unix
> tools as echo(1), cat(1) (for the filenames), and env(1) in Perl 6.
>
> Not sure which of Rakudo, NQP, and MoarVM to blame for the failures.
> The one with command line arguments looks like it's happening so early
> that MoarVM must be the one in control, but the environment one is at
> a higher level. MoarVM also exhibits failure if the `input.moarvm'
> filename is non-UTF-8, whether it's an extant file or not, but the error
> message differs slightly between those two cases.

This works now, with tests in S32-str/utf8-c8.t.

In summary: things coming from or going to the OS are now decoded/encoded using UTF-8 C-8 (utf8-c8), which uses synthetic codepoints (the same mechanism as used in NFG) to preserve any octets that are not valid UTF-8. Everything coming from the OS is decoded this way, so encoding the resulting Str back to utf8-c8 yields the original octet stream. The encoding is available at the Perl 6 level, just like any other, so the original octets can always be retrieved.

/jnthn
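
For illustration, here is a minimal sketch of the round trip described above, using the same octets as the report (`L\xe9on`). It is not taken from the test file; the variable names are made up for the example, and it assumes a Rakudo build that includes the utf8-c8 encoding.

```perl6
# "L\xe9on" as raw octets: Latin-1 for "Léon", which is not valid UTF-8.
my $octets = Buf.new(0x4C, 0xE9, 0x6F, 0x6E);

# Strict UTF-8 decoding rejects the stray 0xE9 octet, as in the report.
my $strict = try $octets.decode('utf8');
say $strict.defined ?? "strict utf8 decode succeeded"
                    !! "strict utf8 decode failed";

# utf8-c8 decodes the same octets into a Str without throwing...
my $str = $octets.decode('utf8-c8');

# ...and encoding that Str with utf8-c8 should give back the same octets.
my $back = $str.encode('utf8-c8');
say $back.list;
say $back.list eq $octets.list;
```

A program that needs the raw octets of an argument or environment value could presumably apply the same `.encode('utf8-c8')` step to the Str it was handed.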