On Wed, Jul 25, 2018 at 02:20:30PM +0200, Damien Pollet wrote:
> On Wed, 25 Jul 2018 at 13:48, Sven Van Caekenberghe <s...@stfx.eu> wrote:
> 
> > > On 25 Jul 2018, at 13:39, Damien Pollet <damien.pollet+ph...@gmail.com> wrote:
> > > Related issue: command line arguments come from VM system attributes as
> > > ByteStrings, and thus interpreted as iso-8859-1, which is incorrect in most
> > > cases nowadays, even though it seems to work as long as you only use ASCII.
> > > Decoding them is easy enough, but it requires two copies (asByteString
> > > utf8Decoded)
> >
> > Yes this is a really big issue. Anything coming in as command line arg or
> > environment variable (or clipboard) is in a basically unknown OS determined
> > encoding. I would assume/hope that UTF-8 is the sensible default today, but
> > apparently not. And it is hard to find a cross platform solution.
> >
> 
> My point here was that it would make more sense for those to be passed into
> the image as ByteArrays, revealing the fact that their encoding is unknown.
> Currently the bytes are correct, but since they've been shoved into
> ByteStrings by the VM, the characters will be wrong unless your system
> happens to be using Latin 1.

That sounds right to me.

Having said that, there should be no need to change the VM interface to do
this. A ByteString is by definition an array of 8 bit wide characters, and
conversion between ByteString and ByteArray is trivial. Any necessary changes
can be done without touching the VM.
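
To illustrate why the round trip is lossless, here is a minimal sketch. It is
written in Python rather than Smalltalk only so that it is easy to run
anywhere; the name redecode and the sample word are made up for the example,
and UTF-8 as the real encoding is an assumption, not something the VM reports.

```python
# Bytes as the OS might hand them over for the argument "café",
# assuming the shell runs in a UTF-8 locale.
raw = "café".encode("utf-8")             # 4 characters, 5 bytes

# What a one-character-per-byte string looks like: each byte is read
# as a Latin-1 character, so the 2-byte "é" becomes two wrong characters.
as_byte_string = raw.decode("latin-1")


def redecode(byte_string: str, encoding: str = "utf-8") -> str:
    """Recover the intended text from a one-character-per-byte string.

    Latin-1 maps code points 0..255 to the identical byte values, so
    encoding back to Latin-1 returns the original bytes unchanged; those
    bytes can then be decoded with whatever encoding is believed correct.
    """
    return byte_string.encode("latin-1").decode(encoding)


print(repr(as_byte_string))      # the mis-decoded form
print(redecode(as_byte_string))  # the intended text
```

The bytes never change here; only the interpretation does, which is why the
fix can live entirely on the image side.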

Dave

> 
> I suppose we can either have a setting for decoding (since it's pretty much
> arbitrary), or heuristics like checking LC_CTYPE or whatever. Pablo
> mentioned the Locale class, but it doesn't seem to detect anything correct
> from the environment.
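
For what it is worth, the LC_CTYPE heuristic mentioned above could be sketched
roughly as follows (in Python, purely as an illustration). The function name
guess_encoding and the UTF-8 fallback are assumptions made for this example;
the lookup order LC_ALL, LC_CTYPE, LANG follows the usual POSIX precedence,
and this says nothing about what the Locale class does or about Windows.

```python
import os
from typing import Mapping, Optional


def guess_encoding(env: Optional[Mapping[str, str]] = None,
                   default: str = "utf-8") -> str:
    """Guess the encoding of OS-supplied strings from locale variables.

    The first non-empty variable among LC_ALL, LC_CTYPE, LANG wins.
    Values look like "en_US.UTF-8" or "de_DE.ISO-8859-1@euro"; the part
    between "." and "@" is the codeset. If the winning value carries no
    codeset (for example "C"), fall back to `default`, which is an
    arbitrary choice made for this sketch.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if not value:
            continue
        codeset = value.partition(".")[2].partition("@")[0]
        return codeset.lower() if codeset else default
    return default


if __name__ == "__main__":
    print(guess_encoding())
```

A user-visible setting would still be needed as an override, since the
environment can be wrong or missing.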
