On Wed, Jul 25, 2018 at 02:20:30PM +0200, Damien Pollet wrote:
> On Wed, 25 Jul 2018 at 13:48, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>
> > > On 25 Jul 2018, at 13:39, Damien Pollet <damien.pollet+ph...@gmail.com> wrote:
> > >
> > > Related issue: command line arguments come from VM system attributes as
> > > ByteStrings, and thus interpreted as iso-8859-1, which is incorrect in
> > > most cases nowadays, even though it seems to work as long as you only
> > > use ASCII. Decoding them is easy enough, but it requires two copies
> > > (asByteString utf8Decoded)
> >
> > Yes this is a really big issue. Anything coming in as command line arg or
> > environment variable (or clipboard) is in a basically unknown OS determined
> > encoding. I would assume/hope the UTF-8 is the sensible default today, but
> > apparently not. And it is hard to find a cross platform solution.
>
> My point here was that it would make more sense for those to be passed into
> the image as ByteArrays, revealing the fact that their encoding is unknown.
> Currently the bytes are correct, but since they've been shoved into
> ByteStrings by the VM, the characters will be wrong unless your system
> happens to be using Latin 1.

That sounds right to me. Having said that, there should be no need to change
the VM interface to do this. A ByteString is by definition an array of 8 bit
wide characters, and conversion between ByteString and ByteArray is trivial.
Any necessary changes can be done without touching the VM.

Dave

> I suppose we can either have a setting for decoding (since it's pretty much
> arbitrary), or heuristics like checking LC_CTYPE or whatever. Pablo
> mentioned the Locale class, but it doesn't seem to detect anything correct
> from the environment.
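
To make the two points in this thread concrete, here is a minimal sketch. It is written in Python rather than Pharo, purely as a language-neutral illustration, and every function name in it is hypothetical. The first two functions show why reading UTF-8 bytes one byte per character gives wrong characters but loses no information, so the fix can happen image-side. The third is one possible LC_CTYPE-style heuristic, assuming the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) and assuming UTF-8 as the fallback when no codeset is given.

```python
def as_seen_by_image(raw: bytes) -> str:
    """Mimic bytes stored one per character, i.e. read as Latin-1."""
    return raw.decode("latin-1")


def recover(text: str, encoding: str = "utf-8") -> str:
    """Undo the Latin-1 reading, then decode with the intended encoding."""
    # Latin-1 maps code points 0-255 to the same byte values, so this
    # round trip restores the original bytes exactly.
    return text.encode("latin-1").decode(encoding)


def guess_encoding(env: dict, default: str = "utf-8") -> str:
    """Guess the codeset from locale variables (a heuristic, not a guarantee)."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if not value:
            continue  # unset or empty: fall through to the next variable
        if "." in value:
            # e.g. "fr_FR.ISO-8859-1@euro" -> "ISO-8859-1"
            return value.split(".", 1)[1].split("@", 1)[0]
        return default  # set, but without a codeset (e.g. "C" or "fr_FR")
    return default


if __name__ == "__main__":
    raw = "héllo".encode("utf-8")      # what the OS actually passed
    seen = as_seen_by_image(raw)       # six characters, visibly wrong
    print(repr(seen), "->", repr(recover(seen)))
    print(guess_encoding({"LANG": "en_US.UTF-8"}))
```

The Pharo counterpart of `recover` would be the `asByteArray utf8Decoded` style of conversion mentioned earlier in the thread. The heuristic applies only on POSIX-like systems and says nothing about Windows.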